diff mbox series

[v5,1/4] block: support compressed write at generic layer

Message ID 1571603828-185910-2-git-send-email-andrey.shinkevich@virtuozzo.com
State New
Headers show
Series qcow2: advanced compression options | expand

Commit Message

Andrey Shinkevich Oct. 20, 2019, 8:37 p.m. UTC
To inform the block layer about writing all the data compressed, we
introduce the 'compress' command line option. Based on that option, the
written data will be aligned by the cluster size at the generic layer.

Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---
 block.c                   | 20 +++++++++++++++++++-
 block/io.c                | 13 +++++++++----
 block/qcow2.c             |  4 ++++
 blockdev.c                |  9 ++++++++-
 include/block/block.h     |  1 +
 include/block/block_int.h |  2 ++
 qapi/block-core.json      |  5 ++++-
 qemu-options.hx           |  6 ++++--
 8 files changed, 51 insertions(+), 9 deletions(-)

Comments

Max Reitz Oct. 22, 2019, 9:28 a.m. UTC | #1
On 20.10.19 22:37, Andrey Shinkevich wrote:
> To inform the block layer about writing all the data compressed, we
> introduce the 'compress' command line option. Based on that option, the
> written data will be aligned by the cluster size at the generic layer.
> 
> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>  block.c                   | 20 +++++++++++++++++++-
>  block/io.c                | 13 +++++++++----
>  block/qcow2.c             |  4 ++++
>  blockdev.c                |  9 ++++++++-
>  include/block/block.h     |  1 +
>  include/block/block_int.h |  2 ++
>  qapi/block-core.json      |  5 ++++-
>  qemu-options.hx           |  6 ++++--
>  8 files changed, 51 insertions(+), 9 deletions(-)

The problem with compression is that there are such tight constraints on
it that it can really only work for very defined use cases.  Those
constraints are:

- Only write whole clusters,
- Clusters can be written to only once.

The first point is addressed in this patch by setting request_alignment.
 But I don’t see how the second one can be addressed.  Well, maybe by
allowing it in all drivers that support compression.  But if I just look
at qcow2, that isn’t going to be trivial: You need to allocate a
completely new cluster where you write the data (in case it grows), and
thus you leave behind a hole, which kind of defeats the purpose of
compression.

(For demonstration:

$ ./qemu-img create -f qcow2 test.qcow2 64M
Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
lazy_refcounts=off refcount_bits=16
$ x86_64-softmmu/qemu-system-x86_64 \
    -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
                'compress': true,
                'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
    -monitor stdio
QEMU 4.1.50 monitor - type 'help' for more information
(qemu) qemu-io drv0 "write -P 42 0 64k"
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
(qemu) qemu-io drv0 "write -P 23 0 64k"
write failed: Input/output error

)

Compression really only works when you fully write all of an image
exactly once; i.e. as the qemu-img convert or as a backup target.  For
both cases we already have a compression option.  So I’m wondering where
this new option is really useful.

(You do add a test for stream, but I don’t know whether that’s really a
good example, see my response there.)

Max
Andrey Shinkevich Oct. 22, 2019, 10:21 a.m. UTC | #2
On 22/10/2019 12:28, Max Reitz wrote:
> On 20.10.19 22:37, Andrey Shinkevich wrote:
>> To inform the block layer about writing all the data compressed, we
>> introduce the 'compress' command line option. Based on that option, the
>> written data will be aligned by the cluster size at the generic layer.
>>
>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>   block.c                   | 20 +++++++++++++++++++-
>>   block/io.c                | 13 +++++++++----
>>   block/qcow2.c             |  4 ++++
>>   blockdev.c                |  9 ++++++++-
>>   include/block/block.h     |  1 +
>>   include/block/block_int.h |  2 ++
>>   qapi/block-core.json      |  5 ++++-
>>   qemu-options.hx           |  6 ++++--
>>   8 files changed, 51 insertions(+), 9 deletions(-)
> 
> The problem with compression is that there are such tight constraints on
> it that it can really only work for very defined use cases.  Those
> constraints are:
> 
> - Only write whole clusters,
> - Clusters can be written to only once.
> 
> The first point is addressed in this patch by setting request_alignment.
>   But I don’t see how the second one can be addressed.  Well, maybe by
> allowing it in all drivers that support compression.  But if I just look
> at qcow2, that isn’t going to be trivial: You need to allocate a
> completely new cluster where you write the data (in case it grows), and
> thus you leave behind a hole, which kind of defeats the purpose of
> compression.
> 
> (For demonstration:
> 
> $ ./qemu-img create -f qcow2 test.qcow2 64M
> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
> lazy_refcounts=off refcount_bits=16
> $ x86_64-softmmu/qemu-system-x86_64 \
>      -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>                  'compress': true,
>                  'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>      -monitor stdio
> QEMU 4.1.50 monitor - type 'help' for more information
> (qemu) qemu-io drv0 "write -P 42 0 64k"
> wrote 65536/65536 bytes at offset 0
> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
> (qemu) qemu-io drv0 "write -P 23 0 64k"
> write failed: Input/output error
> 
> )
> 
> Compression really only works when you fully write all of an image
> exactly once; i.e. as the qemu-img convert or as a backup target.  For
> both cases we already have a compression option.  So I’m wondering where
> this new option is really useful.
> 
> (You do add a test for stream, but I don’t know whether that’s really a
> good example, see my response there.)
> 
> Max
> 

Thank you very much Max for your detailed response.

1) You are right that compression is used with the backup mostly. The 
option for the compression with backup would be replaced by usage at the 
block layer, with no duplication. Also, it can be useful for NBD for 
instance,

$ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
$ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
$ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
$ sudo ./qemu-nbd -d /dev/nbd0
$ du -sh ./image.qcow2
101M    ./image.qcow2

and with the compression

$ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
$ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
$ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
10+0 records in
10+0 records out
104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
$ sudo ./qemu-nbd -d /dev/nbd0
$ du -sh ./image.qcow2
5,3M    ./image.qcow2

The idea behind introducing the new 'compress' option is to use that 
only one instead of many other ones of such a kind.

2) You are right also that "Compression can't overwrite anything..."
It can be seen in the commit message 
b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2

I am not sure if data should be written compressed to the active layer.
I made the tests with the idea of bringing examples of usage the 
'compress' option because passing an option is a tricky thing in QEMU.
But the issue takes place anyway if we want to rewrite to allocated 
clusters.
I would like to investigate the matter and make a patch that resolves 
that issue.
Do you agree with that?

Andrey
Vladimir Sementsov-Ogievskiy Oct. 22, 2019, 10:46 a.m. UTC | #3
22.10.2019 13:21, Andrey Shinkevich wrote:
> 
> On 22/10/2019 12:28, Max Reitz wrote:
>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>> To inform the block layer about writing all the data compressed, we
>>> introduce the 'compress' command line option. Based on that option, the
>>> written data will be aligned by the cluster size at the generic layer.
>>>
>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>> ---
>>>    block.c                   | 20 +++++++++++++++++++-
>>>    block/io.c                | 13 +++++++++----
>>>    block/qcow2.c             |  4 ++++
>>>    blockdev.c                |  9 ++++++++-
>>>    include/block/block.h     |  1 +
>>>    include/block/block_int.h |  2 ++
>>>    qapi/block-core.json      |  5 ++++-
>>>    qemu-options.hx           |  6 ++++--
>>>    8 files changed, 51 insertions(+), 9 deletions(-)
>>
>> The problem with compression is that there are such tight constraints on
>> it that it can really only work for very defined use cases.  Those
>> constraints are:
>>
>> - Only write whole clusters,
>> - Clusters can be written to only once.
>>
>> The first point is addressed in this patch by setting request_alignment.
>>    But I don’t see how the second one can be addressed.  Well, maybe by
>> allowing it in all drivers that support compression.  But if I just look
>> at qcow2, that isn’t going to be trivial: You need to allocate a
>> completely new cluster where you write the data (in case it grows), and
>> thus you leave behind a hole, which kind of defeats the purpose of
>> compression.
>>
>> (For demonstration:
>>
>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>> lazy_refcounts=off refcount_bits=16
>> $ x86_64-softmmu/qemu-system-x86_64 \
>>       -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>                   'compress': true,
>>                   'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>       -monitor stdio
>> QEMU 4.1.50 monitor - type 'help' for more information
>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>> wrote 65536/65536 bytes at offset 0
>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>> write failed: Input/output error
>>
>> )
>>
>> Compression really only works when you fully write all of an image
>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>> both cases we already have a compression option.  So I’m wondering where
>> this new option is really useful.
>>
>> (You do add a test for stream, but I don’t know whether that’s really a
>> good example, see my response there.)
>>
>> Max
>>
> 
> Thank you very much Max for your detailed response.
> 
> 1) You are right that compression is used with the backup mostly. The
> option for the compression with backup would be replaced by usage at the
> block layer, with no duplication. Also, it can be useful for NBD for
> instance,
> 
> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
> $ sudo ./qemu-nbd -d /dev/nbd0
> $ du -sh ./image.qcow2
> 101M    ./image.qcow2
> 
> and with the compression
> 
> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
> 10+0 records in
> 10+0 records out
> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
> $ sudo ./qemu-nbd -d /dev/nbd0
> $ du -sh ./image.qcow2
> 5,3M    ./image.qcow2
> 
> The idea behind introducing the new 'compress' option is to use that
> only one instead of many other ones of such a kind.
> 
> 2) You are right also that "Compression can't overwrite anything..."
> It can be seen in the commit message
> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
> 
> I am not sure if data should be written compressed to the active layer.
> I made the tests with the idea of bringing examples of usage the
> 'compress' option because passing an option is a tricky thing in QEMU.
> But the issue takes place anyway if we want to rewrite to allocated
> clusters.
> I would like to investigate the matter and make a patch that resolves
> that issue.
> Do you agree with that?
> 
> Andrey
> 


Yes, we want this option not to allow compressed writes for guests, but to
allow
- stream with compression (used to remove compressed incremental backup, we
need to merge it to the next incremental)
- backup with compression (we have an optional already, so it works)
- backup to nbd server with compression: enable compression on nbd server

So instead of adding two options (for stream and for nbd), it seems better to
add only one for generic layer.

Then, it becomes possible to run guest on image with compress=on. It's a side
effect, but still it should work correctly.

I think the simplest thing is to just run normal write, if compressed write
failed because of reallocation. We should check that on that failure-path
ENOTSUP is returned and handle it for compress=on option by fallback to
normal write in generic block/io.c

(Note, that in case with stream we rewrite only unallocated clusters)
Max Reitz Oct. 22, 2019, 11:31 a.m. UTC | #4
On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>
>> On 22/10/2019 12:28, Max Reitz wrote:
>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>> To inform the block layer about writing all the data compressed, we
>>>> introduce the 'compress' command line option. Based on that option, the
>>>> written data will be aligned by the cluster size at the generic layer.
>>>>
>>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>> ---
>>>>    block.c                   | 20 +++++++++++++++++++-
>>>>    block/io.c                | 13 +++++++++----
>>>>    block/qcow2.c             |  4 ++++
>>>>    blockdev.c                |  9 ++++++++-
>>>>    include/block/block.h     |  1 +
>>>>    include/block/block_int.h |  2 ++
>>>>    qapi/block-core.json      |  5 ++++-
>>>>    qemu-options.hx           |  6 ++++--
>>>>    8 files changed, 51 insertions(+), 9 deletions(-)
>>>
>>> The problem with compression is that there are such tight constraints on
>>> it that it can really only work for very defined use cases.  Those
>>> constraints are:
>>>
>>> - Only write whole clusters,
>>> - Clusters can be written to only once.
>>>
>>> The first point is addressed in this patch by setting request_alignment.
>>>    But I don’t see how the second one can be addressed.  Well, maybe by
>>> allowing it in all drivers that support compression.  But if I just look
>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>> completely new cluster where you write the data (in case it grows), and
>>> thus you leave behind a hole, which kind of defeats the purpose of
>>> compression.
>>>
>>> (For demonstration:
>>>
>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>> lazy_refcounts=off refcount_bits=16
>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>       -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>                   'compress': true,
>>>                   'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>       -monitor stdio
>>> QEMU 4.1.50 monitor - type 'help' for more information
>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>> wrote 65536/65536 bytes at offset 0
>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>> write failed: Input/output error
>>>
>>> )
>>>
>>> Compression really only works when you fully write all of an image
>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>> both cases we already have a compression option.  So I’m wondering where
>>> this new option is really useful.
>>>
>>> (You do add a test for stream, but I don’t know whether that’s really a
>>> good example, see my response there.)
>>>
>>> Max
>>>
>>
>> Thank you very much Max for your detailed response.
>>
>> 1) You are right that compression is used with the backup mostly. The
>> option for the compression with backup would be replaced by usage at the
>> block layer, with no duplication. Also, it can be useful for NBD for
>> instance,
>>
>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>> $ sudo ./qemu-nbd -d /dev/nbd0
>> $ du -sh ./image.qcow2
>> 101M    ./image.qcow2
>>
>> and with the compression
>>
>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>> 10+0 records in
>> 10+0 records out
>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>> $ sudo ./qemu-nbd -d /dev/nbd0
>> $ du -sh ./image.qcow2
>> 5,3M    ./image.qcow2

That seems wrong to me.  Why not use qemu-img convert for this case?

Attaching an NBD server to a compressed disk has exactly the same
problem as attaching a compressed disk to a VM.  It won’t work unless
the client/guest is aware of the limitations.

>> The idea behind introducing the new 'compress' option is to use that
>> only one instead of many other ones of such a kind.
>>
>> 2) You are right also that "Compression can't overwrite anything..."
>> It can be seen in the commit message
>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>
>> I am not sure if data should be written compressed to the active layer.
>> I made the tests with the idea of bringing examples of usage the
>> 'compress' option because passing an option is a tricky thing in QEMU.
>> But the issue takes place anyway if we want to rewrite to allocated
>> clusters.
>> I would like to investigate the matter and make a patch that resolves
>> that issue.
>> Do you agree with that?

What seems wrong to me is that this series just adds a generic compress
option without ensuring that it works generically.

Either (1) it only works in well-defined cases, then either (1A) we have
to ensure that we only allow it then (as we do now, because only
qemu-img convert and the backup job have such an option; and those two
are safe), or (1B) we have to clearly give a big warning for the new
option that it doesn’t work correctly.  I don’t know whether such a
warning is even possible with just a generic node-level option.

Or (2) we make it work in generic cases.  Well, that might be possible
for qcow2, but who’s going to make it work for VMDK’s streamOptimized
subformat?

More on all of that below.

> Yes, we want this option not to allow compressed writes for guests, but to
> allow
> - stream with compression (used to remove compressed incremental backup, we
> need to merge it to the next incremental)

Based on the assumption that one shouldn’t attach a compressed disk to a
VM, I don’t see how streaming makes sense then.  Well, I suppose
intermediate streaming would work.

> - backup with compression (we have an optional already, so it works)
> - backup to nbd server with compression: enable compression on nbd server

The problem is clearly that if a generic client were to connect to the
NBD server, it wouldn’t work.  In this case, compression will work only
if the clients understands the limitation.

(The safe way would be to make the NBD server provide an option that
clients can see so they know this limitation and agree they’ll adhere to
it.  It’s also a stupid way.)

> So instead of adding two options (for stream and for nbd), it seems better to
> add only one for generic layer.

I don’t know.  It doesn’t work generically, so I don’t know whether it
should be a generic option.

> Then, it becomes possible to run guest on image with compress=on. It's a side
> effect, but still it should work correctly.

How so?  compress=on only works if every party involved only writes to
any cluster of the image exactly once.  That is just not the case for
guests unless they know this limitation, and even I don’t see a use case.

> I think the simplest thing is to just run normal write, if compressed write
> failed because of reallocation. We should check that on that failure-path
> ENOTSUP is returned and handle it for compress=on option by fallback to
> normal write in generic block/io.c

It seems wrong to not compress data with compress=on.

> (Note, that in case with stream we rewrite only unallocated clusters)

Yes, but if the stream job writes the cluster before the guest, the
guest gets an I/O error.


By the way, the other thing I wondered was whether this should be a
filter driver instead (if it makes sense at all).  Such a filter driver
would at least be sufficiently cumbersome to use that probably only
users who understand the implications would make use of it (1B above).


I’m not against adding a generic compress=on option if we ensure that it
actually works generically (2 above).  It doesn’t right now, so I don’t
think this is right.

I’m already not sure whether it’s really possible to support generic
compressed writes in qcow2.  I suppose it’d be at least awkward.  In
theory it should work, it’s just that we can’t keep track of subcluster
allocations, which in the worst case means that some compressed clusters
take up a whole cluster of space.

For VMDK...  I suppose we could consider a new flag for block drivers
that flags whether a driver supports arbitrary compressed writes or has
the old limitations.  compress=on could then refuse to work on any block
driver but the ones that support arbitrary compressed writes.


Our current model (1A) is simply to ensure that all compressed writes
can adhere to the limitations.  As I’ve said above, extending this to
NBD would mean adding some NBD negotiation so both client and server
agree on this limitation.  That seems kind of reasonable from the qemu
side, but probably very unreasonable from an NBD side.  Why would NBD
bother to reproduce qemu’s limitations?


So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
a generic option that unsuspecting users are not unlikely to use but
that simply doesn’t work generically doesn’t seem right to me.

Max
Vladimir Sementsov-Ogievskiy Oct. 22, 2019, 12:23 p.m. UTC | #5
22.10.2019 14:31, Max Reitz wrote:
> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>
>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>> To inform the block layer about writing all the data compressed, we
>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>> written data will be aligned by the cluster size at the generic layer.
>>>>>
>>>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>> ---
>>>>>     block.c                   | 20 +++++++++++++++++++-
>>>>>     block/io.c                | 13 +++++++++----
>>>>>     block/qcow2.c             |  4 ++++
>>>>>     blockdev.c                |  9 ++++++++-
>>>>>     include/block/block.h     |  1 +
>>>>>     include/block/block_int.h |  2 ++
>>>>>     qapi/block-core.json      |  5 ++++-
>>>>>     qemu-options.hx           |  6 ++++--
>>>>>     8 files changed, 51 insertions(+), 9 deletions(-)
>>>>
>>>> The problem with compression is that there are such tight constraints on
>>>> it that it can really only work for very defined use cases.  Those
>>>> constraints are:
>>>>
>>>> - Only write whole clusters,
>>>> - Clusters can be written to only once.
>>>>
>>>> The first point is addressed in this patch by setting request_alignment.
>>>>     But I don’t see how the second one can be addressed.  Well, maybe by
>>>> allowing it in all drivers that support compression.  But if I just look
>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>> completely new cluster where you write the data (in case it grows), and
>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>> compression.
>>>>
>>>> (For demonstration:
>>>>
>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>> lazy_refcounts=off refcount_bits=16
>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>        -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>                    'compress': true,
>>>>                    'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>        -monitor stdio
>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>> wrote 65536/65536 bytes at offset 0
>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>> write failed: Input/output error
>>>>
>>>> )
>>>>
>>>> Compression really only works when you fully write all of an image
>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>> both cases we already have a compression option.  So I’m wondering where
>>>> this new option is really useful.
>>>>
>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>> good example, see my response there.)
>>>>
>>>> Max
>>>>
>>>
>>> Thank you very much Max for your detailed response.
>>>
>>> 1) You are right that compression is used with the backup mostly. The
>>> option for the compression with backup would be replaced by usage at the
>>> block layer, with no duplication. Also, it can be useful for NBD for
>>> instance,
>>>
>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>> 10+0 records in
>>> 10+0 records out
>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>> $ du -sh ./image.qcow2
>>> 101M    ./image.qcow2
>>>
>>> and with the compression
>>>
>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>> 10+0 records in
>>> 10+0 records out
>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>> $ du -sh ./image.qcow2
>>> 5,3M    ./image.qcow2
> 
> That seems wrong to me.  Why not use qemu-img convert for this case?
> 
> Attaching an NBD server to a compressed disk has exactly the same
> problem as attaching a compressed disk to a VM.  It won’t work unless
> the client/guest is aware of the limitations.
> 
>>> The idea behind introducing the new 'compress' option is to use that
>>> only one instead of many other ones of such a kind.
>>>
>>> 2) You are right also that "Compression can't overwrite anything..."
>>> It can be seen in the commit message
>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>
>>> I am not sure if data should be written compressed to the active layer.
>>> I made the tests with the idea of bringing examples of usage the
>>> 'compress' option because passing an option is a tricky thing in QEMU.
>>> But the issue takes place anyway if we want to rewrite to allocated
>>> clusters.
>>> I would like to investigate the matter and make a patch that resolves
>>> that issue.
>>> Do you agree with that?
> 
> What seems wrong to me is that this series just adds a generic compress
> option without ensuring that it works generically.
> 
> Either (1) it only works in well-defined cases, then either (1A) we have
> to ensure that we only allow it then (as we do now, because only
> qemu-img convert and the backup job have such an option; and those two
> are safe), or (1B) we have to clearly give a big warning for the new
> option that it doesn’t work correctly.  I don’t know whether such a
> warning is even possible with just a generic node-level option.
> 
> Or (2) we make it work in generic cases.  Well, that might be possible
> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
> subformat?
> 
> More on all of that below.
> 
>> Yes, we want this option not to allow compressed writes for guests, but to
>> allow
>> - stream with compression (used to remove compressed incremental backup, we
>> need to merge it to the next incremental)
> 
> Based on the assumption that one shouldn’t attach a compressed disk to a
> VM, I don’t see how streaming makes sense then.  Well, I suppose
> intermediate streaming would work.
> 
>> - backup with compression (we have an optional already, so it works)
>> - backup to nbd server with compression: enable compression on nbd server
> 
> The problem is clearly that if a generic client were to connect to the
> NBD server, it wouldn’t work.  In this case, compression will work only
> if the clients understands the limitation.
> 
> (The safe way would be to make the NBD server provide an option that
> clients can see so they know this limitation and agree they’ll adhere to
> it.  It’s also a stupid way.)
> 
>> So instead of adding two options (for stream and for nbd), it seems better to
>> add only one for generic layer.
> 
> I don’t know.  It doesn’t work generically, so I don’t know whether it
> should be a generic option.
> 
>> Then, it becomes possible to run guest on image with compress=on. It's a side
>> effect, but still it should work correctly.
> 
> How so?  compress=on only works if every party involved only writes to
> any cluster of the image exactly once.  That is just not the case for
> guests unless they know this limitation, and even I don’t see a use case.
> 
>> I think the simplest thing is to just run normal write, if compressed write
>> failed because of reallocation. We should check that on that failure-path
>> ENOTSUP is returned and handle it for compress=on option by fallback to
>> normal write in generic block/io.c
> 
> It seems wrong to not compress data with compress=on.

We already fallback to normal write if can't compress in qcow2.

> 
>> (Note, that in case with stream we rewrite only unallocated clusters)
> 
> Yes, but if the stream job writes the cluster before the guest, the
> guest gets an I/O error.

I don't think that using compressed writes by guest is really usable thing.
In our case with stream there is no guest (just use qemu binary to operate
on block layer)

> 
> 
> By the way, the other thing I wondered was whether this should be a
> filter driver instead (if it makes sense at all).  Such a filter driver
> would at least be sufficiently cumbersome to use that probably only
> users who understand the implications would make use of it (1B above).
> 
> 
> I’m not against adding a generic compress=on option if we ensure that it
> actually works generically (2 above).  It doesn’t right now, so I don’t
> think this is right.
> 
> I’m already not sure whether it’s really possible to support generic
> compressed writes in qcow2.  I suppose it’d be at least awkward.  In
> theory it should work, it’s just that we can’t keep track of subcluster
> allocations, which in the worst case means that some compressed clusters
> take up a whole cluster of space.
> 
> For VMDK...  I suppose we could consider a new flag for block drivers
> that flags whether a driver supports arbitrary compressed writes or has
> the old limitations.  compress=on could then refuse to work on any block
> driver but the ones that support arbitrary compressed writes.
> 
> 
> Our current model (1A) is simply to ensure that all compressed writes
> can adhere to the limitations.  As I’ve said above, extending this to
> NBD would mean adding some NBD negotiation so both client and server
> agree on this limitation.

In this case, I'd prefere simple cmdline option fro qemu-nbd.

> That seems kind of reasonable from the qemu
> side, but probably very unreasonable from an NBD side.  Why would NBD
> bother to reproduce qemu’s limitations?
> 
> 
> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
> a generic option that unsuspecting users are not unlikely to use but
> that simply doesn’t work generically doesn’t seem right to me.
> 

Firstly, 1A already doesn't work correctly: we have compress option for backup.
So, it will not work if backup target is not empty.
So, 1A is impossible, as it's already broken, adding separate options for stream
and qemu-nbd is not better than just add one generic option.

I don't like (2) as it means a lot of effort to support actually not needed case.

I don't really like filter solution (as at seems too much to add a filter for simple
boolean option), but I think, we can live with it.

1B is OK for me, that is, just document what the option does in fact.

May be, name the option "compress-new-writes" instead of just "compress"?
And document, that writes to clusters, which was not written before will be
compressed?

Or make the option to be compress=<x>, where x may be 'no' or 'new-writes',
reserving 'all-writes' for future?
Max Reitz Oct. 22, 2019, 12:56 p.m. UTC | #6
On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
> 22.10.2019 14:31, Max Reitz wrote:
>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>
>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>> To inform the block layer about writing all the data compressed, we
>>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>>> written data will be aligned by the cluster size at the generic layer.
>>>>>>
>>>>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>> ---
>>>>>>     block.c                   | 20 +++++++++++++++++++-
>>>>>>     block/io.c                | 13 +++++++++----
>>>>>>     block/qcow2.c             |  4 ++++
>>>>>>     blockdev.c                |  9 ++++++++-
>>>>>>     include/block/block.h     |  1 +
>>>>>>     include/block/block_int.h |  2 ++
>>>>>>     qapi/block-core.json      |  5 ++++-
>>>>>>     qemu-options.hx           |  6 ++++--
>>>>>>     8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>
>>>>> The problem with compression is that there are such tight constraints on
>>>>> it that it can really only work for very defined use cases.  Those
>>>>> constraints are:
>>>>>
>>>>> - Only write whole clusters,
>>>>> - Clusters can be written to only once.
>>>>>
>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>>     But I don’t see how the second one can be addressed.  Well, maybe by
>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>> completely new cluster where you write the data (in case it grows), and
>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>> compression.
>>>>>
>>>>> (For demonstration:
>>>>>
>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>> lazy_refcounts=off refcount_bits=16
>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>        -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>                    'compress': true,
>>>>>                    'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>        -monitor stdio
>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>> wrote 65536/65536 bytes at offset 0
>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>> write failed: Input/output error
>>>>>
>>>>> )
>>>>>
>>>>> Compression really only works when you fully write all of an image
>>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>> this new option is really useful.
>>>>>
>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>> good example, see my response there.)
>>>>>
>>>>> Max
>>>>>
>>>>
>>>> Thank you very much Max for your detailed response.
>>>>
>>>> 1) You are right that compression is used with the backup mostly. The
>>>> option for the compression with backup would be replaced by usage at the
>>>> block layer, with no duplication. Also, it can be useful for NBD for
>>>> instance,
>>>>
>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>> 10+0 records in
>>>> 10+0 records out
>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>> $ du -sh ./image.qcow2
>>>> 101M    ./image.qcow2
>>>>
>>>> and with the compression
>>>>
>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>> 10+0 records in
>>>> 10+0 records out
>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>> $ du -sh ./image.qcow2
>>>> 5,3M    ./image.qcow2
>>
>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>
>> Attaching an NBD server to a compressed disk has exactly the same
>> problem as attaching a compressed disk to a VM.  It won’t work unless
>> the client/guest is aware of the limitations.
>>
>>>> The idea behind introducing the new 'compress' option is to use that
>>>> only one instead of many other ones of such a kind.
>>>>
>>>> 2) You are right also that "Compression can't overwrite anything..."
>>>> It can be seen in the commit message
>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>>
>>>> I am not sure if data should be written compressed to the active layer.
>>>> I made the tests with the idea of bringing examples of usage the
>>>> 'compress' option because passing an option is a tricky thing in QEMU.
>>>> But the issue takes place anyway if we want to rewrite to allocated
>>>> clusters.
>>>> I would like to investigate the matter and make a patch that resolves
>>>> that issue.
>>>> Do you agree with that?
>>
>> What seems wrong to me is that this series just adds a generic compress
>> option without ensuring that it works generically.
>>
>> Either (1) it only works in well-defined cases, then either (1A) we have
>> to ensure that we only allow it then (as we do now, because only
>> qemu-img convert and the backup job have such an option; and those two
>> are safe), or (1B) we have to clearly give a big warning for the new
>> option that it doesn’t work correctly.  I don’t know whether such a
>> warning is even possible with just a generic node-level option.
>>
>> Or (2) we make it work in generic cases.  Well, that might be possible
>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>> subformat?
>>
>> More on all of that below.
>>
>>> Yes, we want this option not to allow compressed writes for guests, but to
>>> allow
>>> - stream with compression (used to remove compressed incremental backup, we
>>> need to merge it to the next incremental)
>>
>> Based on the assumption that one shouldn’t attach a compressed disk to a
>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>> intermediate streaming would work.
>>
>>> - backup with compression (we have an optional already, so it works)
>>> - backup to nbd server with compression: enable compression on nbd server
>>
>> The problem is clearly that if a generic client were to connect to the
>> NBD server, it wouldn’t work.  In this case, compression will work only
>> if the clients understands the limitation.
>>
>> (The safe way would be to make the NBD server provide an option that
>> clients can see so they know this limitation and agree they’ll adhere to
>> it.  It’s also a stupid way.)
>>
>>> So instead of adding two options (for stream and for nbd), it seems better to
>>> add only one for generic layer.
>>
>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>> should be a generic option.
>>
>>> Then, it becomes possible to run guest on image with compress=on. It's a side
>>> effect, but still it should work correctly.
>>
>> How so?  compress=on only works if every party involved only writes to
>> any cluster of the image exactly once.  That is just not the case for
>> guests unless they know this limitation, and even I don’t see a use case.
>>
>>> I think the simplest thing is to just run normal write, if compressed write
>>> failed because of reallocation. We should check that on that failure-path
>>> ENOTSUP is returned and handle it for compress=on option by fallback to
>>> normal write in generic block/io.c
>>
>> It seems wrong to not compress data with compress=on.
> 
> We already fallback to normal write if can't compress in qcow2.

In a very specific case, namely where the compressed size is larger than
the uncompressed size.  It would be a whole different story to only
compress the first write to a given cluster but not any following one.

>>> (Note, that in case with stream we rewrite only unallocated clusters)
>>
>> Yes, but if the stream job writes the cluster before the guest, the
>> guest gets an I/O error.
> 
> I don't think that using compressed writes by guest is really usable thing.
> In our case with stream there is no guest (just use qemu binary to operate
> on block layer)
> 
>>
>>
>> By the way, the other thing I wondered was whether this should be a
>> filter driver instead (if it makes sense at all).  Such a filter driver
>> would at least be sufficiently cumbersome to use that probably only
>> users who understand the implications would make use of it (1B above).
>>
>>
>> I’m not against adding a generic compress=on option if we ensure that it
>> actually works generically (2 above).  It doesn’t right now, so I don’t
>> think this is right.
>>
>> I’m already not sure whether it’s really possible to support generic
>> compressed writes in qcow2.  I suppose it’d be at least awkward.  In
>> theory it should work, it’s just that we can’t keep track of subcluster
>> allocations, which in the worst case means that some compressed clusters
>> take up a whole cluster of space.
>>
>> For VMDK...  I suppose we could consider a new flag for block drivers
>> that flags whether a driver supports arbitrary compressed writes or has
>> the old limitations.  compress=on could then refuse to work on any block
>> driver but the ones that support arbitrary compressed writes.
>>
>>
>> Our current model (1A) is simply to ensure that all compressed writes
>> can adhere to the limitations.  As I’ve said above, extending this to
>> NBD would mean adding some NBD negotiation so both client and server
>> agree on this limitation.
> 
> In this case, I'd prefere simple cmdline option fro qemu-nbd.
> 
>> That seems kind of reasonable from the qemu
>> side, but probably very unreasonable from an NBD side.  Why would NBD
>> bother to reproduce qemu’s limitations?
>>
>>
>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>> a generic option that unsuspecting users are not unlikely to use but
>> that simply doesn’t work generically doesn’t seem right to me.
>>
> 
> Firstly, 1A already doesn't work correctly: we have compress option for backup.
> So, it will not work if backup target is not empty.

I’m not sure whether that qualifies because the user is simply
responsible to ensure that the target is empty.

Otherwise, you could also argue that backup doesn’t necessarily create a
real backup because with sync=top it won’t copy unallocated clusters, so
if the target already has data in such a place, it won’t be overwritten.

Furthermore I’d argue this would hardly be an accident.  Both
drive-backup with mode=existing and qemu-img convert -n are most likely
not used blindly but only when the situation requires it.

My point is that using compress=on can be an accident because people see
that option and think “That sounds useful!” without reading the warning.

> So, 1A is impossible, as it's already broken, adding separate options for stream
> and qemu-nbd is not better than just add one generic option.

I disagree that 1A is broken, and I disagree a generic option would not
be worse.  I still believe a simple generic option is prone to
accidental misuse.

> I don't like (2) as it means a lot of effort to support actually not needed case.

Well, you say it isn’t needed.  I suppose if it were supported, people
would indeed use it.

> I don't really like filter solution (as at seems too much to add a filter for simple
> boolean option), but I think, we can live with it.

First, it is not a simple boolean option.  This patch brings
implementation with it that converts writes to compressed writes and
sets a request alignment.  I actually think it’s better to do things
like that in a dedicated block driver than in the generic block layer code.

Second, we already this basically this for copy-on-read (which is also a
boolean option for -drive).

(FWIW, I’d prefer if detect-zeroes were a block driver instead of
generic block layer code.)

> 1B is OK for me, that is, just document what the option does in fact.
> 
> May be, name the option "compress-new-writes" instead of just "compress"?
> And document, that writes to clusters, which was not written before will be
> compressed?
> 
> Or make the option to be compress=<x>, where x may be 'no' or 'new-writes',
> reserving 'all-writes' for future?

I don’t know whether the limitations can be captured in three words, but
it would at least make users pay closer attention to what the option is
really doing.

OTOH, the only downsides to a dedicated block driver to me appear that
it’s more cumbersome to use (which I actually consider a benefit) and
that it requires the usual filter driver boilerplate – but that can be
copied from other drivers[1].

(The benefits are that we don’t need to modify generic code and that
people need to read into the filter and its caveats before they can use it.)

Max


[1] Actually, we could provide default implementations somewhere and
make block filters use them.  That should reduce the boilerplate size.
Andrey Shinkevich Oct. 22, 2019, 1:53 p.m. UTC | #7
On 22/10/2019 15:56, Max Reitz wrote:
> On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 14:31, Max Reitz wrote:
>>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>>
>>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>>> To inform the block layer about writing all the data compressed, we
>>>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>>>> written data will be aligned by the cluster size at the generic layer.
>>>>>>>
>>>>>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> ---
>>>>>>>      block.c                   | 20 +++++++++++++++++++-
>>>>>>>      block/io.c                | 13 +++++++++----
>>>>>>>      block/qcow2.c             |  4 ++++
>>>>>>>      blockdev.c                |  9 ++++++++-
>>>>>>>      include/block/block.h     |  1 +
>>>>>>>      include/block/block_int.h |  2 ++
>>>>>>>      qapi/block-core.json      |  5 ++++-
>>>>>>>      qemu-options.hx           |  6 ++++--
>>>>>>>      8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> The problem with compression is that there are such tight constraints on
>>>>>> it that it can really only work for very defined use cases.  Those
>>>>>> constraints are:
>>>>>>
>>>>>> - Only write whole clusters,
>>>>>> - Clusters can be written to only once.
>>>>>>
>>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>>>      But I don’t see how the second one can be addressed.  Well, maybe by
>>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>>> completely new cluster where you write the data (in case it grows), and
>>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>>> compression.
>>>>>>
>>>>>> (For demonstration:
>>>>>>
>>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>>> lazy_refcounts=off refcount_bits=16
>>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>>         -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>>                     'compress': true,
>>>>>>                     'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>>         -monitor stdio
>>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>>> wrote 65536/65536 bytes at offset 0
>>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>>> write failed: Input/output error
>>>>>>
>>>>>> )
>>>>>>
>>>>>> Compression really only works when you fully write all of an image
>>>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>>> this new option is really useful.
>>>>>>
>>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>>> good example, see my response there.)
>>>>>>
>>>>>> Max
>>>>>>
>>>>>
>>>>> Thank you very much Max for your detailed response.
>>>>>
>>>>> 1) You are right that compression is used with the backup mostly. The
>>>>> option for the compression with backup would be replaced by usage at the
>>>>> block layer, with no duplication. Also, it can be useful for NBD for
>>>>> instance,
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 101M    ./image.qcow2
>>>>>
>>>>> and with the compression
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 5,3M    ./image.qcow2
>>>
>>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>>
>>> Attaching an NBD server to a compressed disk has exactly the same
>>> problem as attaching a compressed disk to a VM.  It won’t work unless
>>> the client/guest is aware of the limitations.
>>>
>>>>> The idea behind introducing the new 'compress' option is to use that
>>>>> only one instead of many other ones of such a kind.
>>>>>
>>>>> 2) You are right also that "Compression can't overwrite anything..."
>>>>> It can be seen in the commit message
>>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>>>
>>>>> I am not sure if data should be written compressed to the active layer.
>>>>> I made the tests with the idea of bringing examples of usage the
>>>>> 'compress' option because passing an option is a tricky thing in QEMU.
>>>>> But the issue takes place anyway if we want to rewrite to allocated
>>>>> clusters.
>>>>> I would like to investigate the matter and make a patch that resolves
>>>>> that issue.
>>>>> Do you agree with that?
>>>
>>> What seems wrong to me is that this series just adds a generic compress
>>> option without ensuring that it works generically.
>>>
>>> Either (1) it only works in well-defined cases, then either (1A) we have
>>> to ensure that we only allow it then (as we do now, because only
>>> qemu-img convert and the backup job have such an option; and those two
>>> are safe), or (1B) we have to clearly give a big warning for the new
>>> option that it doesn’t work correctly.  I don’t know whether such a
>>> warning is even possible with just a generic node-level option.
>>>
>>> Or (2) we make it work in generic cases.  Well, that might be possible
>>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>>> subformat?
>>>
>>> More on all of that below.
>>>
>>>> Yes, we want this option not to allow compressed writes for guests, but to
>>>> allow
>>>> - stream with compression (used to remove compressed incremental backup, we
>>>> need to merge it to the next incremental)
>>>
>>> Based on the assumption that one shouldn’t attach a compressed disk to a
>>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>>> intermediate streaming would work.
>>>
>>>> - backup with compression (we have an optional already, so it works)
>>>> - backup to nbd server with compression: enable compression on nbd server
>>>
>>> The problem is clearly that if a generic client were to connect to the
>>> NBD server, it wouldn’t work.  In this case, compression will work only
>>> if the clients understands the limitation.
>>>
>>> (The safe way would be to make the NBD server provide an option that
>>> clients can see so they know this limitation and agree they’ll adhere to
>>> it.  It’s also a stupid way.)
>>>
>>>> So instead of adding two options (for stream and for nbd), it seems better to
>>>> add only one for generic layer.
>>>
>>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>>> should be a generic option.
>>>
>>>> Then, it becomes possible to run guest on image with compress=on. It's a side
>>>> effect, but still it should work correctly.
>>>
>>> How so?  compress=on only works if every party involved only writes to
>>> any cluster of the image exactly once.  That is just not the case for
>>> guests unless they know this limitation, and even I don’t see a use case.
>>>
>>>> I think the simplest thing is to just run normal write, if compressed write
>>>> failed because of reallocation. We should check that on that failure-path
>>>> ENOTSUP is returned and handle it for compress=on option by fallback to
>>>> normal write in generic block/io.c
>>>
>>> It seems wrong to not compress data with compress=on.
>>
>> We already fallback to normal write if can't compress in qcow2.
> 
> In a very specific case, namely where the compressed size is larger than
> the uncompressed size.  It would be a whole different story to only
> compress the first write to a given cluster but not any following one.
> 
>>>> (Note, that in case with stream we rewrite only unallocated clusters)
>>>
>>> Yes, but if the stream job writes the cluster before the guest, the
>>> guest gets an I/O error.
>>
>> I don't think that using compressed writes by guest is really usable thing.
>> In our case with stream there is no guest (just use qemu binary to operate
>> on block layer)
>>
>>>
>>>
>>> By the way, the other thing I wondered was whether this should be a
>>> filter driver instead (if it makes sense at all).  Such a filter driver
>>> would at least be sufficiently cumbersome to use that probably only
>>> users who understand the implications would make use of it (1B above).
>>>
>>>
>>> I’m not against adding a generic compress=on option if we ensure that it
>>> actually works generically (2 above).  It doesn’t right now, so I don’t
>>> think this is right.
>>>
>>> I’m already not sure whether it’s really possible to support generic
>>> compressed writes in qcow2.  I suppose it’d be at least awkward.  In
>>> theory it should work, it’s just that we can’t keep track of subcluster
>>> allocations, which in the worst case means that some compressed clusters
>>> take up a whole cluster of space.
>>>
>>> For VMDK...  I suppose we could consider a new flag for block drivers
>>> that flags whether a driver supports arbitrary compressed writes or has
>>> the old limitations.  compress=on could then refuse to work on any block
>>> driver but the ones that support arbitrary compressed writes.
>>>
>>>
>>> Our current model (1A) is simply to ensure that all compressed writes
>>> can adhere to the limitations.  As I’ve said above, extending this to
>>> NBD would mean adding some NBD negotiation so both client and server
>>> agree on this limitation.
>>
>> In this case, I'd prefere simple cmdline option fro qemu-nbd.
>>
>>> That seems kind of reasonable from the qemu
>>> side, but probably very unreasonable from an NBD side.  Why would NBD
>>> bother to reproduce qemu’s limitations?
>>>
>>>
>>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>>> a generic option that unsuspecting users are not unlikely to use but
>>> that simply doesn’t work generically doesn’t seem right to me.
>>>
>>
>> Firstly, 1A already doesn't work correctly: we have compress option for backup.
>> So, it will not work if backup target is not empty.
> 
> I’m not sure whether that qualifies because the user is simply
> responsible to ensure that the target is empty.
> 
> Otherwise, you could also argue that backup doesn’t necessarily create a
> real backup because with sync=top it won’t copy unallocated clusters, so
> if the target already has data in such a place, it won’t be overwritten.
> 
> Furthermore I’d argue this would hardly be an accident.  Both
> drive-backup with mode=existing and qemu-img convert -n are most likely
> not used blindly but only when the situation requires it.
> 
> My point is that using compress=on can be an accident because people see
> that option and think “That sounds useful!” without reading the warning.
> 
>> So, 1A is impossible, as it's already broken, adding separate options for stream
>> and qemu-nbd is not better than just add one generic option.
> 
> I disagree that 1A is broken, and I disagree a generic option would not
> be worse.  I still believe a simple generic option is prone to
> accidental misuse.
> 
>> I don't like (2) as it means a lot of effort to support actually not needed case.
> 
> Well, you say it isn’t needed.  I suppose if it were supported, people
> would indeed use it.
> 
>> I don't really like filter solution (as at seems too much to add a filter for simple
>> boolean option), but I think, we can live with it.
> 
> First, it is not a simple boolean option.  This patch brings
> implementation with it that converts writes to compressed writes and
> sets a request alignment.  I actually think it’s better to do things
> like that in a dedicated block driver than in the generic block layer code.
> 
> Second, we already this basically this for copy-on-read (which is also a
> boolean option for -drive).
> 
> (FWIW, I’d prefer if detect-zeroes were a block driver instead of
> generic block layer code.)
> 
>> 1B is OK for me, that is, just document what the option does in fact.
>>
>> May be, name the option "compress-new-writes" instead of just "compress"?
>> And document, that writes to clusters, which was not written before will be
>> compressed?
>>
>> Or make the option to be compress=<x>, where x may be 'no' or 'new-writes',
>> reserving 'all-writes' for future?
> 
> I don’t know whether the limitations can be captured in three words, but
> it would at least make users pay closer attention to what the option is
> really doing.
> 
> OTOH, the only downsides to a dedicated block driver to me appear that
> it’s more cumbersome to use (which I actually consider a benefit) and
> that it requires the usual filter driver boilerplate – but that can be
> copied from other drivers[1].
> 
> (The benefits are that we don’t need to modify generic code and that
> people need to read into the filter and its caveats before they can use it.)
> 
> Max
> 
> 
> [1] Actually, we could provide default implementations somewhere and
> make block filters use them.  That should reduce the boilerplate size.
> 

If the support of COW for compressed writes is found feasible, will it 
make a sense to implement? Then this series will follow.
A comment for the option would warn a user that guest writing will work 
slower for if compress selected.

Andrey
Vladimir Sementsov-Ogievskiy Oct. 22, 2019, 2:28 p.m. UTC | #8
22.10.2019 15:56, Max Reitz wrote:
> On 22.10.19 14:23, Vladimir Sementsov-Ogievskiy wrote:
>> 22.10.2019 14:31, Max Reitz wrote:
>>> On 22.10.19 12:46, Vladimir Sementsov-Ogievskiy wrote:
>>>> 22.10.2019 13:21, Andrey Shinkevich wrote:
>>>>>
>>>>> On 22/10/2019 12:28, Max Reitz wrote:
>>>>>> On 20.10.19 22:37, Andrey Shinkevich wrote:
>>>>>>> To inform the block layer about writing all the data compressed, we
>>>>>>> introduce the 'compress' command line option. Based on that option, the
>>>>>>> written data will be aligned by the cluster size at the generic layer.
>>>>>>>
>>>>>>> Suggested-by: Roman Kagan <rkagan@virtuozzo.com>
>>>>>>> Suggested-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> Signed-off-by: Andrey Shinkevich <andrey.shinkevich@virtuozzo.com>
>>>>>>> Reviewed-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>>>>>>> ---
>>>>>>>      block.c                   | 20 +++++++++++++++++++-
>>>>>>>      block/io.c                | 13 +++++++++----
>>>>>>>      block/qcow2.c             |  4 ++++
>>>>>>>      blockdev.c                |  9 ++++++++-
>>>>>>>      include/block/block.h     |  1 +
>>>>>>>      include/block/block_int.h |  2 ++
>>>>>>>      qapi/block-core.json      |  5 ++++-
>>>>>>>      qemu-options.hx           |  6 ++++--
>>>>>>>      8 files changed, 51 insertions(+), 9 deletions(-)
>>>>>>
>>>>>> The problem with compression is that there are such tight constraints on
>>>>>> it that it can really only work for very defined use cases.  Those
>>>>>> constraints are:
>>>>>>
>>>>>> - Only write whole clusters,
>>>>>> - Clusters can be written to only once.
>>>>>>
>>>>>> The first point is addressed in this patch by setting request_alignment.
>>>>>>      But I don’t see how the second one can be addressed.  Well, maybe by
>>>>>> allowing it in all drivers that support compression.  But if I just look
>>>>>> at qcow2, that isn’t going to be trivial: You need to allocate a
>>>>>> completely new cluster where you write the data (in case it grows), and
>>>>>> thus you leave behind a hole, which kind of defeats the purpose of
>>>>>> compression.
>>>>>>
>>>>>> (For demonstration:
>>>>>>
>>>>>> $ ./qemu-img create -f qcow2 test.qcow2 64M
>>>>>> Formatting 'test.qcow2', fmt=qcow2 size=67108864 cluster_size=65536
>>>>>> lazy_refcounts=off refcount_bits=16
>>>>>> $ x86_64-softmmu/qemu-system-x86_64 \
>>>>>>         -blockdev "{'node-name': 'drv0', 'driver': 'qcow2',
>>>>>>                     'compress': true,
>>>>>>                     'file': {'driver': 'file', 'filename': 'test.qcow2'}}" \
>>>>>>         -monitor stdio
>>>>>> QEMU 4.1.50 monitor - type 'help' for more information
>>>>>> (qemu) qemu-io drv0 "write -P 42 0 64k"
>>>>>> wrote 65536/65536 bytes at offset 0
>>>>>> 64 KiB, 1 ops; 00.02 sec (4.055 MiB/sec and 64.8793 ops/sec)
>>>>>> (qemu) qemu-io drv0 "write -P 23 0 64k"
>>>>>> write failed: Input/output error
>>>>>>
>>>>>> )
>>>>>>
>>>>>> Compression really only works when you fully write all of an image
>>>>>> exactly once; i.e. as the qemu-img convert or as a backup target.  For
>>>>>> both cases we already have a compression option.  So I’m wondering where
>>>>>> this new option is really useful.
>>>>>>
>>>>>> (You do add a test for stream, but I don’t know whether that’s really a
>>>>>> good example, see my response there.)
>>>>>>
>>>>>> Max
>>>>>>
>>>>>
>>>>> Thank you very much Max for your detailed response.
>>>>>
>>>>> 1) You are right that compression is used with the backup mostly. The
>>>>> option for the compression with backup would be replaced by usage at the
>>>>> block layer, with no duplication. Also, it can be useful for NBD for
>>>>> instance,
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,0890581 s, 1,2 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 101M    ./image.qcow2
>>>>>
>>>>> and with the compression
>>>>>
>>>>> $ ./qemu-img create -f qcow2 -o size=10G ./image.qcow2
>>>>> $ sudo ./qemu-nbd -C -c /dev/nbd0 ./image.qcow2
>>>>> $ sudo dd if=/dev/sda1 of=/dev/nbd0 bs=10M count=10
>>>>> 10+0 records in
>>>>> 10+0 records out
>>>>> 104857600 bytes (105 MB) copied, 0,076046 s, 1,4 GB/s
>>>>> $ sudo ./qemu-nbd -d /dev/nbd0
>>>>> $ du -sh ./image.qcow2
>>>>> 5,3M    ./image.qcow2
>>>
>>> That seems wrong to me.  Why not use qemu-img convert for this case?
>>>
>>> Attaching an NBD server to a compressed disk has exactly the same
>>> problem as attaching a compressed disk to a VM.  It won’t work unless
>>> the client/guest is aware of the limitations.
>>>
>>>>> The idea behind introducing the new 'compress' option is to use that
>>>>> only one instead of many other ones of such a kind.
>>>>>
>>>>> 2) You are right also that "Compression can't overwrite anything..."
>>>>> It can be seen in the commit message
>>>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>>>
>>>>> I am not sure if data should be written compressed to the active layer.
>>>>> I made the tests with the idea of bringing examples of usage the
>>>>> 'compress' option because passing an option is a tricky thing in QEMU.
>>>>> But the issue takes place anyway if we want to rewrite to allocated
>>>>> clusters.
>>>>> I would like to investigate the matter and make a patch that resolves
>>>>> that issue.
>>>>> Do you agree with that?
>>>
>>> What seems wrong to me is that this series just adds a generic compress
>>> option without ensuring that it works generically.
>>>
>>> Either (1) it only works in well-defined cases, then either (1A) we have
>>> to ensure that we only allow it then (as we do now, because only
>>> qemu-img convert and the backup job have such an option; and those two
>>> are safe), or (1B) we have to clearly give a big warning for the new
>>> option that it doesn’t work correctly.  I don’t know whether such a
>>> warning is even possible with just a generic node-level option.
>>>
>>> Or (2) we make it work in generic cases.  Well, that might be possible
>>> for qcow2, but who’s going to make it work for VMDK’s streamOptimized
>>> subformat?
>>>
>>> More on all of that below.
>>>
>>>> Yes, we want this option not to allow compressed writes for guests, but to
>>>> allow
>>>> - stream with compression (used to remove compressed incremental backup, we
>>>> need to merge it to the next incremental)
>>>
>>> Based on the assumption that one shouldn’t attach a compressed disk to a
>>> VM, I don’t see how streaming makes sense then.  Well, I suppose
>>> intermediate streaming would work.
>>>
>>>> - backup with compression (we have an optional already, so it works)
>>>> - backup to nbd server with compression: enable compression on nbd server
>>>
>>> The problem is clearly that if a generic client were to connect to the
>>> NBD server, it wouldn’t work.  In this case, compression will work only
>>> if the clients understands the limitation.
>>>
>>> (The safe way would be to make the NBD server provide an option that
>>> clients can see so they know this limitation and agree they’ll adhere to
>>> it.  It’s also a stupid way.)
>>>
>>>> So instead of adding two options (for stream and for nbd), it seems better to
>>>> add only one for generic layer.
>>>
>>> I don’t know.  It doesn’t work generically, so I don’t know whether it
>>> should be a generic option.
>>>
>>>> Then, it becomes possible to run guest on image with compress=on. It's a side
>>>> effect, but still it should work correctly.
>>>
>>> How so?  compress=on only works if every party involved only writes to
>>> any cluster of the image exactly once.  That is just not the case for
>>> guests unless they know this limitation, and even I don’t see a use case.
>>>
>>>> I think the simplest thing is to just run normal write, if compressed write
>>>> failed because of reallocation. We should check that on that failure-path
>>>> ENOTSUP is returned and handle it for compress=on option by fallback to
>>>> normal write in generic block/io.c
>>>
>>> It seems wrong to not compress data with compress=on.
>>
>> We already fallback to normal write if can't compress in qcow2.
> 
> In a very specific case, namely where the compressed size is larger than
> the uncompressed size.  It would be a whole different story to only
> compress the first write to a given cluster but not any following one.
> 
>>>> (Note, that in case with stream we rewrite only unallocated clusters)
>>>
>>> Yes, but if the stream job writes the cluster before the guest, the
>>> guest gets an I/O error.
>>
>> I don't think that using compressed writes by guest is really usable thing.
>> In our case with stream there is no guest (just use qemu binary to operate
>> on block layer)
>>
>>>
>>>
>>> By the way, the other thing I wondered was whether this should be a
>>> filter driver instead (if it makes sense at all).  Such a filter driver
>>> would at least be sufficiently cumbersome to use that probably only
>>> users who understand the implications would make use of it (1B above).
>>>
>>>
>>> I’m not against adding a generic compress=on option if we ensure that it
>>> actually works generically (2 above).  It doesn’t right now, so I don’t
>>> think this is right.
>>>
>>> I’m already not sure whether it’s really possible to support generic
>>> compressed writes in qcow2.  I suppose it’d be at least awkward.  In
>>> theory it should work, it’s just that we can’t keep track of subcluster
>>> allocations, which in the worst case means that some compressed clusters
>>> take up a whole cluster of space.
>>>
>>> For VMDK...  I suppose we could consider a new flag for block drivers
>>> that flags whether a driver supports arbitrary compressed writes or has
>>> the old limitations.  compress=on could then refuse to work on any block
>>> driver but the ones that support arbitrary compressed writes.
>>>
>>>
>>> Our current model (1A) is simply to ensure that all compressed writes
>>> can adhere to the limitations.  As I’ve said above, extending this to
>>> NBD would mean adding some NBD negotiation so both client and server
>>> agree on this limitation.
>>
>> In this case, I'd prefere simple cmdline option fro qemu-nbd.
>>
>>> That seems kind of reasonable from the qemu
>>> side, but probably very unreasonable from an NBD side.  Why would NBD
>>> bother to reproduce qemu’s limitations?
>>>
>>>
>>> So these are the three ways (1A, 1B, 2) I can imagine.  But just adding
>>> a generic option that unsuspecting users are not unlikely to use but
>>> that simply doesn’t work generically doesn’t seem right to me.
>>>
>>
>> Firstly, 1A already doesn't work correctly: we have compress option for backup.
>> So, it will not work if backup target is not empty.
> 
> I’m not sure whether that qualifies because the user is simply
> responsible to ensure that the target is empty.
> 
> Otherwise, you could also argue that backup doesn’t necessarily create a
> real backup because with sync=top it won’t copy unallocated clusters, so
> if the target already has data in such a place, it won’t be overwritten.
> 
> Furthermore I’d argue this would hardly be an accident.  Both
> drive-backup with mode=existing and qemu-img convert -n are most likely
> not used blindly but only when the situation requires it.
> 
> My point is that using compress=on can be an accident because people see
> that option and think “That sounds useful!” without reading the warning.
> 
>> So, 1A is impossible, as it's already broken, adding separate options for stream
>> and qemu-nbd is not better than just add one generic option.
> 
> I disagree that 1A is broken, and I disagree a generic option would not
> be worse.  I still believe a simple generic option is prone to
> accidental misuse.
> 
>> I don't like (2) as it means a lot of effort to support actually not needed case.
> 
> Well, you say it isn’t needed.  I suppose if it were supported, people
> would indeed use it.
> 
>> I don't really like filter solution (as at seems too much to add a filter for simple
>> boolean option), but I think, we can live with it.
> 
> First, it is not a simple boolean option.  This patch brings
> implementation with it that converts writes to compressed writes and
> sets a request alignment.  I actually think it’s better to do things
> like that in a dedicated block driver than in the generic block layer code.
> 
> Second, we already this basically this for copy-on-read (which is also a
> boolean option for -drive).
> 
> (FWIW, I’d prefer if detect-zeroes were a block driver instead of
> generic block layer code.)

So, if we want all three options, it should be three filter drivers? Hmm..

> 
>> 1B is OK for me, that is, just document what the option does in fact.
>>
>> May be, name the option "compress-new-writes" instead of just "compress"?
>> And document, that writes to clusters, which was not written before will be
>> compressed?
>>
>> Or make the option to be compress=<x>, where x may be 'no' or 'new-writes',
>> reserving 'all-writes' for future?
> 
> I don’t know whether the limitations can be captured in three words, but
> it would at least make users pay closer attention to what the option is
> really doing.
> 
> OTOH, the only downsides to a dedicated block driver to me appear that
> it’s more cumbersome to use (which I actually consider a benefit) and
> that it requires the usual filter driver boilerplate – but that can be
> copied from other drivers[1].
> 
> (The benefits are that we don’t need to modify generic code and that
> people need to read into the filter and its caveats before they can use it.)
> 
> Max
> 
> 
> [1] Actually, we could provide default implementations somewhere and
> make block filters use them.  That should reduce the boilerplate size.
> 

OK, let's continue with a filter driver.
Max Reitz Oct. 24, 2019, 9:34 a.m. UTC | #9
On 22.10.19 15:53, Andrey Shinkevich wrote:

[...]

> If the support of COW for compressed writes is found feasible, will it 
> make a sense to implement? Then this series will follow.

Hm, what exactly do you mean by support of COW for compressed writes?

> A comment for the option would warn a user that guest writing will work 
> slower for if compress selected.

I’m not sure whether that’s necessary.  As a user, I would assume that
enabling compression will always make something slower.

Max
Andrey Shinkevich Oct. 24, 2019, 12:56 p.m. UTC | #10
On 24/10/2019 12:34, Max Reitz wrote:
> On 22.10.19 15:53, Andrey Shinkevich wrote:
> 
> [...]
> 
>> If the support of COW for compressed writes is found feasible, will it
>> make a sense to implement? Then this series will follow.
> 
> Hm, what exactly do you mean by support of COW for compressed writes?
> 

I spoke in terms of the commit message with the following ID:

b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2

"qcow2: Fail write_compressed when overwriting data"

"...qcow2_write_compressed() doesn't perform COW as it would have to do..."

So, I suggest that we implement writing compressed data to the allocated 
clusters rather than qcow2_alloc_compressed_cluster_offset() returns the 
error. Particularly, when it comes to NBD server connection failure for 
writhing a compressed cluster, it may not be rewritten after the 
connection is restored.
Are there any issues with that implementation idea?

Andrey

>> A comment for the option would warn a user that guest writing will work
>> slower for if compress selected.
> 
> I’m not sure whether that’s necessary.  As a user, I would assume that
> enabling compression will always make something slower.
> 
> Max
>
Max Reitz Oct. 24, 2019, 1:48 p.m. UTC | #11
On 24.10.19 14:56, Andrey Shinkevich wrote:
> 
> 
> On 24/10/2019 12:34, Max Reitz wrote:
>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>
>> [...]
>>
>>> If the support of COW for compressed writes is found feasible, will it
>>> make a sense to implement? Then this series will follow.
>>
>> Hm, what exactly do you mean by support of COW for compressed writes?
>>
> 
> I spoke in terms of the commit message with the following ID:
> 
> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
> 
> "qcow2: Fail write_compressed when overwriting data"
> 
> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
> 
> So, I suggest that we implement writing compressed data to the allocated 
> clusters rather than qcow2_alloc_compressed_cluster_offset() returns the 
> error. Particularly, when it comes to NBD server connection failure for 
> writhing a compressed cluster, it may not be rewritten after the 
> connection is restored.
> Are there any issues with that implementation idea?

Well, the COW in that commit is meant differently, because it refers to
the COW that’s required when writing to a cluster shared by an internal
snapshot.

OTOH, you could say that all compressed writes to a cluster that is
already allocated would need to do COW because we’d always have to fully
rewrite that cluster in an RMW cycle.

I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
existing cluster would solve the problem, though.  You’d generally need
to allocate a new cluster; or attempt to reuse the existing space in a
compressed cluster.

Max
Andrey Shinkevich Oct. 24, 2019, 2:07 p.m. UTC | #12
On 24/10/2019 16:48, Max Reitz wrote:
> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>
>>
>> On 24/10/2019 12:34, Max Reitz wrote:
>>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>>
>>> [...]
>>>
>>>> If the support of COW for compressed writes is found feasible, will it
>>>> make a sense to implement? Then this series will follow.
>>>
>>> Hm, what exactly do you mean by support of COW for compressed writes?
>>>
>>
>> I spoke in terms of the commit message with the following ID:
>>
>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>
>> "qcow2: Fail write_compressed when overwriting data"
>>
>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>
>> So, I suggest that we implement writing compressed data to the allocated
>> clusters rather than qcow2_alloc_compressed_cluster_offset() returns the
>> error. Particularly, when it comes to NBD server connection failure for
>> writhing a compressed cluster, it may not be rewritten after the
>> connection is restored.
>> Are there any issues with that implementation idea?
> 
> Well, the COW in that commit is meant differently, because it refers to
> the COW that’s required when writing to a cluster shared by an internal
> snapshot.
> 
> OTOH, you could say that all compressed writes to a cluster that is
> already allocated would need to do COW because we’d always have to fully
> rewrite that cluster in an RMW cycle.
> 
> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
> existing cluster would solve the problem, though.  You’d generally need
> to allocate a new cluster; or attempt to reuse the existing space in a
> compressed cluster.
> 
> Max
> 

Yes, new clusters would be allocated for the compressed RMW and the 
existing ones would be reused if possible. It seams to be ineffective 
but users are supposed to know what they do.
So, does it worth to check a feasibility of the implementation?

Andrey
Andrey Shinkevich Oct. 24, 2019, 2:27 p.m. UTC | #13
On 24/10/2019 16:48, Max Reitz wrote:
> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>
>>
>> On 24/10/2019 12:34, Max Reitz wrote:
>>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>>
>>> [...]
>>>
>>>> If the support of COW for compressed writes is found feasible, will it
>>>> make a sense to implement? Then this series will follow.
>>>
>>> Hm, what exactly do you mean by support of COW for compressed writes?
>>>
>>
>> I spoke in terms of the commit message with the following ID:
>>
>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>
>> "qcow2: Fail write_compressed when overwriting data"
>>
>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>
>> So, I suggest that we implement writing compressed data to the allocated
>> clusters rather than qcow2_alloc_compressed_cluster_offset() returns the
>> error. Particularly, when it comes to NBD server connection failure for
>> writhing a compressed cluster, it may not be rewritten after the
>> connection is restored.
>> Are there any issues with that implementation idea?
> 
> Well, the COW in that commit is meant differently, because it refers to
> the COW that’s required when writing to a cluster shared by an internal
> snapshot.
> 
> OTOH, you could say that all compressed writes to a cluster that is
> already allocated would need to do COW because we’d always have to fully
> rewrite that cluster in an RMW cycle.
> 
> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
> existing cluster would solve the problem, though.  You’d generally need
> to allocate a new cluster; or attempt to reuse the existing space in a
> compressed cluster.
> 
> Max
> 

The idea was to avoid the following error:

$ ./qemu-img create -f qcow2 -o size=10M ./image.qcow2
Formatting './image.qcow2', fmt=qcow2 size=10485760 cluster_size=65536 
lazy_refcounts=off refcount_bits=16
[andrey@dhcp-172-16-25-136 qemu]$ git diff --name-only
nbd/server.c
$ ./qemu-io -c "write -P 0x12 0 64k" ./image.qcow2
wrote 65536/65536 bytes at offset 0
64 KiB, 1 ops; 0.0152 sec (4.101 MiB/sec and 65.6082 ops/sec)
$ ./qemu-io -c "write -c -P 0x12 0 64k" ./image.qcow2
write failed: Input/output error

If it incurs a high cost, we can choose the filter option as discussed 
above.

Andrey
Max Reitz Oct. 24, 2019, 3:12 p.m. UTC | #14
On 24.10.19 16:07, Andrey Shinkevich wrote:
> 
> 
> On 24/10/2019 16:48, Max Reitz wrote:
>> On 24.10.19 14:56, Andrey Shinkevich wrote:
>>>
>>>
>>> On 24/10/2019 12:34, Max Reitz wrote:
>>>> On 22.10.19 15:53, Andrey Shinkevich wrote:
>>>>
>>>> [...]
>>>>
>>>>> If the support of COW for compressed writes is found feasible, will it
>>>>> make a sense to implement? Then this series will follow.
>>>>
>>>> Hm, what exactly do you mean by support of COW for compressed writes?
>>>>
>>>
>>> I spoke in terms of the commit message with the following ID:
>>>
>>> b0b6862e5e1a1394e0ab3d5da94ba8b0da8664e2
>>>
>>> "qcow2: Fail write_compressed when overwriting data"
>>>
>>> "...qcow2_write_compressed() doesn't perform COW as it would have to do..."
>>>
>>> So, I suggest that we implement writing compressed data to the allocated
>>> clusters rather than qcow2_alloc_compressed_cluster_offset() returns the
>>> error. Particularly, when it comes to NBD server connection failure for
>>> writhing a compressed cluster, it may not be rewritten after the
>>> connection is restored.
>>> Are there any issues with that implementation idea?
>>
>> Well, the COW in that commit is meant differently, because it refers to
>> the COW that’s required when writing to a cluster shared by an internal
>> snapshot.
>>
>> OTOH, you could say that all compressed writes to a cluster that is
>> already allocated would need to do COW because we’d always have to fully
>> rewrite that cluster in an RMW cycle.
>>
>> I don’t see how letting qcow2_alloc_compressed_cluster_offset() use the
>> existing cluster would solve the problem, though.  You’d generally need
>> to allocate a new cluster; or attempt to reuse the existing space in a
>> compressed cluster.
>>
>> Max
>>
> 
> Yes, new clusters would be allocated for the compressed RMW and the 
> existing ones would be reused if possible. It seams to be ineffective 
> but users are supposed to know what they do.
> So, does it worth to check a feasibility of the implementation?

I don’t know, Vladimir said that use case wouldn’t be needed.

I think if the option was there, people would actually use it.  But I
doubt that anyone misses it sufficiently to warrant the effort.

In addition, there’s still VMDK to consider, too.

Max
diff mbox series

Patch

diff --git a/block.c b/block.c
index 1946fc6..a674920 100644
--- a/block.c
+++ b/block.c
@@ -1418,6 +1418,11 @@  QemuOptsList bdrv_runtime_opts = {
             .type = QEMU_OPT_BOOL,
             .help = "always accept other writers (default: off)",
         },
+        {
+            .name = BDRV_OPT_COMPRESS,
+            .type = QEMU_OPT_BOOL,
+            .help = "compress all writes to the image (default: off)",
+        },
         { /* end of list */ }
     },
 };
@@ -1545,6 +1550,14 @@  static int bdrv_open_common(BlockDriverState *bs, BlockBackend *file,
     }
     pstrcpy(bs->exact_filename, sizeof(bs->exact_filename), bs->filename);
 
+    if (bs->all_write_compressed && !drv->bdrv_co_pwritev_compressed_part) {
+        error_setg(errp, "Compression is not supported for the driver '%s'",
+                   drv->format_name);
+        bs->all_write_compressed = false;
+        ret = -ENOTSUP;
+        goto fail_opts;
+    }
+
     /* Open the image, either directly or using a protocol */
     open_flags = bdrv_open_flags(bs, bs->open_flags);
     node_name = qemu_opt_get(opts, "node-name");
@@ -2983,6 +2996,11 @@  static BlockDriverState *bdrv_open_inherit(const char *filename,
         flags &= ~BDRV_O_RDWR;
     }
 
+    if (!g_strcmp0(qdict_get_try_str(options, BDRV_OPT_COMPRESS), "on") ||
+        qdict_get_try_bool(options, BDRV_OPT_COMPRESS, false)) {
+        bs->all_write_compressed = true;
+    }
+
     if (flags & BDRV_O_SNAPSHOT) {
         snapshot_options = qdict_new();
         bdrv_temp_snapshot_options(&snapshot_flags, snapshot_options,
@@ -3208,7 +3226,7 @@  static int bdrv_reset_options_allowed(BlockDriverState *bs,
      * in bdrv_reopen_prepare() so they can be left out of @new_opts */
     const char *const common_options[] = {
         "node-name", "discard", "cache.direct", "cache.no-flush",
-        "read-only", "auto-read-only", "detect-zeroes", NULL
+        "read-only", "auto-read-only", "detect-zeroes", "compress", NULL
     };
 
     for (e = qdict_first(bs->options); e; e = qdict_next(bs->options, e)) {
diff --git a/block/io.c b/block/io.c
index f0b86c1..eb2ed36 100644
--- a/block/io.c
+++ b/block/io.c
@@ -1360,9 +1360,14 @@  static int coroutine_fn bdrv_co_do_copy_on_readv(BdrvChild *child,
                 /* This does not change the data on the disk, it is not
                  * necessary to flush even in cache=writethrough mode.
                  */
-                ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
-                                          &local_qiov, 0,
-                                          BDRV_REQ_WRITE_UNCHANGED);
+                if (bs->all_write_compressed) {
+                    ret = bdrv_driver_pwritev_compressed(bs, cluster_offset,
+                                                         pnum, &local_qiov, 0);
+                } else {
+                    ret = bdrv_driver_pwritev(bs, cluster_offset, pnum,
+                                              &local_qiov, 0,
+                                              BDRV_REQ_WRITE_UNCHANGED);
+                }
             }
 
             if (ret < 0) {
@@ -1954,7 +1959,7 @@  static int coroutine_fn bdrv_aligned_pwritev(BdrvChild *child,
     } else if (flags & BDRV_REQ_ZERO_WRITE) {
         bdrv_debug_event(bs, BLKDBG_PWRITEV_ZERO);
         ret = bdrv_co_do_pwrite_zeroes(bs, offset, bytes, flags);
-    } else if (flags & BDRV_REQ_WRITE_COMPRESSED) {
+    } else if (flags & BDRV_REQ_WRITE_COMPRESSED || bs->all_write_compressed) {
         ret = bdrv_driver_pwritev_compressed(bs, offset, bytes,
                                              qiov, qiov_offset);
     } else if (bytes <= max_transfer) {
diff --git a/block/qcow2.c b/block/qcow2.c
index 7961c05..6b29e16 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1787,6 +1787,10 @@  static void qcow2_refresh_limits(BlockDriverState *bs, Error **errp)
         /* Encryption works on a sector granularity */
         bs->bl.request_alignment = qcrypto_block_get_sector_size(s->crypto);
     }
+    if (bs->all_write_compressed) {
+        bs->bl.request_alignment = MAX(bs->bl.request_alignment,
+                                       s->cluster_size);
+    }
     bs->bl.pwrite_zeroes_alignment = s->cluster_size;
     bs->bl.pdiscard_alignment = s->cluster_size;
 }
diff --git a/blockdev.c b/blockdev.c
index f89e48f..0c0b398 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -471,7 +471,7 @@  static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
     int bdrv_flags = 0;
     int on_read_error, on_write_error;
     bool account_invalid, account_failed;
-    bool writethrough, read_only;
+    bool writethrough, read_only, compress;
     BlockBackend *blk;
     BlockDriverState *bs;
     ThrottleConfig cfg;
@@ -570,6 +570,7 @@  static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
     }
 
     read_only = qemu_opt_get_bool(opts, BDRV_OPT_READ_ONLY, false);
+    compress = qemu_opt_get_bool(opts, BDRV_OPT_COMPRESS, false);
 
     /* init */
     if ((!file || !*file) && !qdict_size(bs_opts)) {
@@ -595,6 +596,8 @@  static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
         qdict_set_default_str(bs_opts, BDRV_OPT_READ_ONLY,
                               read_only ? "on" : "off");
         qdict_set_default_str(bs_opts, BDRV_OPT_AUTO_READ_ONLY, "on");
+        qdict_set_default_str(bs_opts, BDRV_OPT_COMPRESS,
+                              compress ? "on" : "off");
         assert((bdrv_flags & BDRV_O_CACHE_MASK) == 0);
 
         if (runstate_check(RUN_STATE_INMIGRATE)) {
@@ -4683,6 +4686,10 @@  QemuOptsList qemu_common_drive_opts = {
             .name = BDRV_OPT_READ_ONLY,
             .type = QEMU_OPT_BOOL,
             .help = "open drive file as read-only",
+        },{
+            .name = BDRV_OPT_COMPRESS,
+            .type = QEMU_OPT_BOOL,
+            .help = "compress all writes to image",
         },
 
         THROTTLE_OPTS,
diff --git a/include/block/block.h b/include/block/block.h
index 792bb82..7e0a927 100644
--- a/include/block/block.h
+++ b/include/block/block.h
@@ -139,6 +139,7 @@  typedef struct HDGeometry {
 #define BDRV_OPT_AUTO_READ_ONLY "auto-read-only"
 #define BDRV_OPT_DISCARD        "discard"
 #define BDRV_OPT_FORCE_SHARE    "force-share"
+#define BDRV_OPT_COMPRESS       "compress"
 
 
 #define BDRV_SECTOR_BITS   9
diff --git a/include/block/block_int.h b/include/block/block_int.h
index 05056b3..3fe8923 100644
--- a/include/block/block_int.h
+++ b/include/block/block_int.h
@@ -923,6 +923,8 @@  struct BlockDriverState {
 
     /* BdrvChild links to this node may never be frozen */
     bool never_freeze;
+    /* Compress all writes to the image */
+    bool all_write_compressed;
 };
 
 struct BlockBackendRootState {
diff --git a/qapi/block-core.json b/qapi/block-core.json
index f66553a..d57064a 100644
--- a/qapi/block-core.json
+++ b/qapi/block-core.json
@@ -4013,6 +4013,8 @@ 
 #                 (default: off)
 # @force-share:   force share all permission on added nodes.
 #                 Requires read-only=true. (Since 2.10)
+# @compress:      compress all writes to the image (Since 4.2)
+#                 (default: false)
 #
 # Remaining options are determined by the block driver.
 #
@@ -4026,7 +4028,8 @@ 
             '*read-only': 'bool',
             '*auto-read-only': 'bool',
             '*force-share': 'bool',
-            '*detect-zeroes': 'BlockdevDetectZeroesOptions' },
+            '*detect-zeroes': 'BlockdevDetectZeroesOptions',
+            '*compress': 'bool' },
   'discriminator': 'driver',
   'data': {
       'blkdebug':   'BlockdevOptionsBlkdebug',
diff --git a/qemu-options.hx b/qemu-options.hx
index 793d70f..1bfbf1a 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -850,7 +850,7 @@  DEF("blockdev", HAS_ARG, QEMU_OPTION_blockdev,
     "-blockdev [driver=]driver[,node-name=N][,discard=ignore|unmap]\n"
     "          [,cache.direct=on|off][,cache.no-flush=on|off]\n"
     "          [,read-only=on|off][,detect-zeroes=on|off|unmap]\n"
-    "          [,driver specific parameters...]\n"
+    "          [,compress=on|off][,driver specific parameters...]\n"
     "                configure a block backend\n", QEMU_ARCH_ALL)
 STEXI
 @item -blockdev @var{option}[,@var{option}[,@var{option}[,...]]]
@@ -905,6 +905,8 @@  discard requests.
 conversion of plain zero writes by the OS to driver specific optimized
 zero write commands. You may even choose "unmap" if @var{discard} is set
 to "unmap" to allow a zero write to be converted to an @code{unmap} operation.
+@item compress
+Compress all writes to the image.
 @end table
 
 @item Driver-specific options for @code{file}
@@ -1026,7 +1028,7 @@  DEF("drive", HAS_ARG, QEMU_OPTION_drive,
     "       [,cache=writethrough|writeback|none|directsync|unsafe][,format=f]\n"
     "       [,snapshot=on|off][,rerror=ignore|stop|report]\n"
     "       [,werror=ignore|stop|report|enospc][,id=name][,aio=threads|native]\n"
-    "       [,readonly=on|off][,copy-on-read=on|off]\n"
+    "       [,readonly=on|off][,copy-on-read=on|off][,compress=on|off]\n"
     "       [,discard=ignore|unmap][,detect-zeroes=on|off|unmap]\n"
     "       [[,bps=b]|[[,bps_rd=r][,bps_wr=w]]]\n"
     "       [[,iops=i]|[[,iops_rd=r][,iops_wr=w]]]\n"