[2/2] qcow2: Avoid memory over-allocation on compressed images

Message ID 20180220222459.8461-3-eblake@redhat.com
State New
Series qcow2: minor compression improvements

Commit Message

Eric Blake Feb. 20, 2018, 10:24 p.m. UTC
When reading a compressed image, we were allocating s->cluster_data
to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
with 2M clusters).  Let's check out the history:

Back when qcow2 was first written, we used s->cluster_data for
everything, including copy_sectors() and encryption, where we want
to operate on more than one cluster at once.  Obviously, at that
point, the buffer had to be aligned for other users, even though
compression itself doesn't require any alignment.

But commit 1b9f1491 (v1.1!) changed things to allocate parallel
buffers on demand rather than sharing a single buffer, for encryption
and COW, leaving compression as the final client of s->cluster_data.
That use was still preserved, because if a single compressed cluster
is read more than once, we reuse the cache instead of decompressing
it a second time (I'm not sure how often this optimization actually
fires, or if it penalizes us from being able to decompress multiple
clusters in parallel even though we can now decrypt clusters in
parallel; the XXX comment in qcow2_co_preadv for
QCOW2_CLUSTER_COMPRESSED is telling).

Much later, in commit de82815d (v2.2), we noticed that a 64M
allocation is prone to failure, so we switched over to a graceful
memory allocation error message.  But note that elsewhere in the
code, we do g_malloc(2 * cluster_size) without ever checking for
failure.

Then even later, in 3e4c7052 (2.11), we realized that allocating
a large buffer up front for every qcow2 image is expensive, and
switched to lazy allocation only for images that actually had
compressed clusters.  But in the process, we never even bothered
to check whether what we were allocating still made sense in its
new context!

So, it's time to cut back on the waste.  A compressed cluster
will NEVER occupy more than an uncompressed cluster (okay, gzip
DOES document that because the compression stream adds metadata,
and because of the pigeonhole principle, there are worst case
scenarios where attempts to compress will actually inflate an
image - but in those cases, we would just write the cluster
uncompressed instead of inflating it).  And as that is a smaller
amount of memory, we can get by with the simpler g_malloc.

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 block/qcow2-cluster.c | 12 +++---------
 block/qcow2.c         |  2 +-
 2 files changed, 4 insertions(+), 10 deletions(-)

Comments

Alberto Garcia Feb. 21, 2018, 10:04 a.m. UTC | #1
On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:

I was also preparing a patch to change this, but you arrived first :-)

> So, it's time to cut back on the waste.  A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it).  And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.

> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            assert(!s->cluster_cache);
> +            s->cluster_data = g_try_malloc(s->cluster_size);
> +            s->cluster_cache = g_try_malloc(s->cluster_size);

There are a few things here:

- QEMU won't write compressed data if the size is >= s->cluster_size
  (there's an explicit check for that in qcow2_co_pwritev_compressed())

- The size field of the compressed cluster descriptor *does* allow
  larger sizes, so you can't simply read csize bytes into
  s->cluster_data because you could cause a buffer overflow.

- Solution a: check that csize < s->cluster_size and return an error if
  it's not. However! although QEMU won't produce an image with a
  compressed cluster that is larger than the uncompressed one, the qcow2
  on-disk format in principle allows for that, so arguably we should
  accept it.

- Solution b: the width of the 'compressed cluster size' field is
  (cluster_bits - 8), that's (cluster_size / 256) sectors. Since the
  size of each sector is 512 bytes, the maximum possible size that the
  field can store is (cluster_size * 2) bytes. So allocate that amount
  of space for s->cluster_data, read the data as it is on disk and let
  the decompression code return an error if the data is indeed
  corrupted or it doesn't fit in the output buffer.
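
As a rough illustration of that arithmetic, a standalone sketch (not
QEMU code; names are made up):

    #include <stdint.h>

    /* Largest byte count the compressed-size field can describe: the
     * field is (cluster_bits - 8) bits wide, counts 512-byte sectors,
     * and the reader adds 1, so the maximum is (cluster_size / 256)
     * sectors = 2 * cluster_size bytes. */
    static uint64_t max_compressed_read(unsigned cluster_bits)
    {
        uint64_t csize_mask = (1u << (cluster_bits - 8)) - 1;
        uint64_t max_sectors = csize_mask + 1;  /* nb_csectors = field + 1 */

        return max_sectors * 512;               /* e.g. 128k for 64k clusters */
    }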

Berto
Eric Blake Feb. 21, 2018, 3 p.m. UTC | #2
On 02/21/2018 04:04 AM, Alberto Garcia wrote:
> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
> 
> I was also preparing a patch to change this, but you arrived first :-)
> 
>> So, it's time to cut back on the waste.  A compressed cluster
>> will NEVER occupy more than an uncompressed cluster (okay, gzip
>> DOES document that because the compression stream adds metadata,
>> and because of the pigeonhole principle, there are worst case
>> scenarios where attempts to compress will actually inflate an
>> image - but in those cases, we would just write the cluster
>> uncompressed instead of inflating it).  And as that is a smaller
>> amount of memory, we can get by with the simpler g_malloc.
> 
>> -        if (!s->cluster_cache) {
>> -            s->cluster_cache = g_malloc(s->cluster_size);
>> +            assert(!s->cluster_cache);
>> +            s->cluster_data = g_try_malloc(s->cluster_size);
>> +            s->cluster_cache = g_try_malloc(s->cluster_size);

Shoot - I made edits that I forgot to commit before sending; I meant for 
these to be g_malloc() rather than g_try_malloc().

> 
> There are a few things here:
> 
> - QEMU won't write compressed data if the size is >= s->cluster_size
>    (there's an explicit check for that in qcow2_co_pwritev_compressed())

Correct, we never cause inflation (and even if we wanted to, we can't, 
because the qcow2 format doesn't have enough bits for us to record that 
many sectors for a compressed stream that occupies more space than the 
original cluster).

> 
> - The size field of the compressed cluster descriptor *does* allow
>    larger sizes, so you can't simply read csize bytes into
>    s->cluster_data because you could cause a buffer overflow.

Let's step through this:

   nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;

Since s->csize_mask is determined by s->cluster_size / 512, then after 
this assignment, the most nb_csectors can be is exactly s->cluster_size 
/ 512.

   sector_offset = coffset & 511;
   csize = nb_csectors * 512 - sector_offset;

And here, csize can only get smaller than the length picked by 
nb_csectors.  Therefore, csize is GUARANTEED to be <= s->cluster_size.

> 
> - Solution a: check that csize < s->cluster_size and return an error if
>    it's not. However! although QEMU won't produce an image with a
>    compressed cluster that is larger than the uncompressed one, the qcow2
>    on-disk format in principle allows for that, so arguably we should
>    accept it.

No, the qcow2 on-disk format does not have enough bits reserved for 
that.  It is impossible to store an inflated cluster, because you don't 
have enough bits to record it.

That said, we MAY have a bug, more likely to be visible in 512-byte 
clusters but possible on any size.  While the 'number sectors' field IS 
sufficient for any compressed cluster starting at offset 0 in the 
cluster, we run into issues if the starting offset is later in the 
cluster.  That is, even though the compressed length of a 512-byte 
cluster is <= 512 (or we wouldn't compress it), if we start at offset 
510, we NEED to read the next cluster to get the rest of the compressed 
stream - but at 512-byte clusters, there are 0 bits reserved for 'number 
sectors'.  Per the above calculations with the example offset of 510, 
nb_csectors would be 1 (it can't be anything else for a 512-byte cluster 
image), and csize would then be 2 bytes, which is insufficient for 
reading back enough data to reconstruct the cluster.

We probably need to clarify in the spec that if the starting offset of a 
compressed cluster falls mid-sector, then the compressed size has to be 
smaller than cluster size - (offset % 512) (this requirement is already 
there implicitly due to the field widths, but being explicit can't 
hurt).  We probably also need to fix our compression code to actually do 
the right thing, particularly for 512-byte clusters, where we are most 
likely to run into a compressed size that overflows the space available 
for nb_csectors.

> 
> - Solution b: the width of the 'compressed cluster size' field is
>    (cluster_bits - 8), that's (cluster_size / 256) sectors.

Not true.  It is (cluster_bits - 9) or (cluster_size / 512).  Remember, 
x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61.  The 
'number sectors' field is then bits x+1 - 61 (but you can't have a 
bitfield occupying bit 62 up to 61; especially since bit 62 is the bit 
for compressed cluster).

> Since the
>    size of each sector is 512 bytes, the maximum possible size that the
>    field can store is (cluster_size * 2) bytes. So allocate that amount
>    of space for s->cluster_data, read the data as it is on disk and let
>    the decompression code return an error if the data is indeed
>    corrupted or it doesn't fit in the output buffer.

Again, I argue that the qcow2 spec says that the maximum size for a 
compressed cluster is cluster_size, even if it spills over a host 
cluster boundary.  But if, in practice, we HAVE allowed a spillover 
beyond the 'number sectors' field's maximum size because of a non-zero 
offset, where we weren't careful about what got recorded into the qcow2 
image, then we could instead state that reading s->cluster_size + 512 
bytes is guaranteed to find the end of the compressed stream, even when 
'number sectors' is 0 because of overflow.  At that point, sizing the 
buffer to be one sector larger than a cluster means we can cope even 
with 512-byte clusters that were compressed incorrectly.
Alberto Garcia Feb. 21, 2018, 3:22 p.m. UTC | #3
On Wed 21 Feb 2018 04:00:54 PM CET, Eric Blake wrote:

>> - Solution b: the width of the 'compressed cluster size' field is
>>    (cluster_bits - 8), that's (cluster_size / 256) sectors.
>
> Not true.  It is (cluster_bits - 9) or (cluster_size / 512).

It's not, it's (cluster_bits - 8), the documentation is wrong, see the
patch that I sent earlier.

Here's qcow2_do_open():

    s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;

For a 64k cluster (16 bits) the width is 16 - 8 = 8 bits.

That's 2^8 = 256 sectors, and 256 * 512 = 128k, exactly two clusters.

Coming back to your patch, since we know where the compressed data
starts and we know that it's not going to be larger than cluster_size,
we can read [coffset, coffset + cluster_size) and skip the final bytes
of the last sector because we know that we don't need that data.

Or we can read as much as the size field indicates (up to twice the
cluster size) and let the decompression code handle the rest.
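
For reference, a sketch of what that second option might look like on
top of the bdrv_pread() conversion from patch 1 (untested, error paths
trimmed):

    /* Read exactly what the size field describes (up to 2 * cluster_size);
     * the decompressor ignores anything past the real compressed stream. */
    coffset = cluster_offset & s->cluster_offset_mask;
    nb_csectors = ((cluster_offset >> s->csize_shift) & s->csize_mask) + 1;
    csize = nb_csectors * 512 - (coffset & 511);

    ret = bdrv_pread(bs->file, coffset, s->cluster_data, csize);
    if (ret < 0) {
        return ret;
    }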

Berto
Eric Blake Feb. 21, 2018, 3:59 p.m. UTC | #4
On 02/21/2018 09:00 AM, Eric Blake wrote:
> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>
>> I was also preparing a patch to change this, but you arrived first :-)
>>
>>> So, it's time to cut back on the waste.  A compressed cluster
>>> will NEVER occupy more than an uncompressed cluster


> And here, csize can only get smaller than the length picked by 
> nb_csectors.  Therefore, csize is GUARANTEED to be <= s->cluster_size.
> 
>>
>> - Solution a: check that csize < s->cluster_size and return an error if
>>    it's not. However! although QEMU won't produce an image with a
>>    compressed cluster that is larger than the uncompressed one, the qcow2
>>    on-disk format in principle allows for that, so arguably we should
>>    accept it.
> 
> No, the qcow2 on-disk format does not have enough bits reserved for 
> that.  It is impossible to store an inflated cluster, because you don't 
> have enough bits to record it.

Okay, the spec is WRONG, compared to our code base.

> 
> That said, we MAY have a bug, more likely to be visible in 512-byte 
> clusters but possible on any size.  While the 'number sectors' field IS 
> sufficient for any compressed cluster starting at offset 0 in the 
> cluster, we run into issues if the starting offset is later in the 
> cluster.  That is, even though the compressed length of a 512-byte 
> cluster is <= 512 (or we wouldn't compress it), if we start at offset 
> 510, we NEED to read the next cluster to get the rest of the compressed 
> stream - but at 512-byte clusters, there are 0 bits reserved for 'number 
> sectors'.  Per the above calculations with the example offset of 510, 
> nb_csectors would be 1 (it can't be anything else for a 512-byte cluster 
> image), and csize would then be 2 bytes, which is insufficient for 
> reading back enough data to reconstruct the cluster.

In fact, here's a demonstration of a discrepancy, where qemu-img and 
John's qcheck tool [1] disagree about the validity of an image created 
by qemu (although it may just be that qcheck hasn't yet learned about 
compressed clusters):

[1]https://github.com/jnsnow/qcheck

$ f=12345678
$ f=$f$f$f$f # 32
$ f=$f$f$f$f # 128
$ f=$f$f$f$f # 512
$ f=$f$f$f$f # 2k
$ f=$f$f$f$f # 8k
$ f=$f$f$f$f # 32k
$ f=$f$f$f$f # 128k
$ printf "$f" > data
$ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
   data data.qcow2
$ qemu-img check data.qcow2
No errors were found on the image.
256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed clusters
Image end offset: 18944
$ ./qcheck data.qcow2
...
== L2 Tables ==

== L2 cluster l1[0] : 0x0000000000000800 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[1] : 0x0000000000000e00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[2] : 0x0000000000001400 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
== L2 cluster l1[3] : 0x0000000000001a00 ==
Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
L2 tables: Could not complete analysis, 257 problems found


== Reference Count Analysis ==

Refcount analysis: 00 vacant clusters
Refcount analysis: 04 leaked clusters
Refcount analysis: 00 ghost clusters
Refcount analysis: 04 miscounted clusters
Refcount analysis: 8 problems found


== Cluster Counts ==

Metadata: 0x1000
Data: 0x800
Leaked: 0x800
Vacant: 0x0
total: 0x2000
qcheck: 73 problems found


> 
> Not true.  It is (cluster_bits - 9) or (cluster_size / 512).  Remember, 
> x = 62 - (cluster_bits - 8); for a 512-byte cluster, x = 61.  The 
> 'number sectors' field is then bits x+1 - 61 (but you can't have a 
> bitfield occupying bit 62 up to 61; especially since bit 62 is the bit 
> for compressed cluster).

So instead of blindly reading the spec, I'm now going to single-step 
through the 'qemu-img convert' command line above, to see what REALLY 
happens:

Line numbers from commit a6e0344fa0
$ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 
data data.qcow2
...
(gdb) b qcow2_alloc_bytes
Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
(gdb) r
Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
     bs=bs@entry=0x555555d87f50, size=size@entry=15)
     at block/qcow2-refcount.c:1052
1052	{
(gdb)

So we are compressing 512 bytes down to 15 every time, which means that 
after 34 clusters compressed, we should be at offset 510.  Let's resume 
debugging:

(gdb) c 34
Will ignore next 33 crossings of breakpoint 1.  Continuing.
[Thread 0x7fffe3cfe700 (LWP 32229) exited]
[New Thread 0x7fffe3cfe700 (LWP 32300)]
[New Thread 0x7fffe25ed700 (LWP 32301)]

Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
     bs=bs@entry=0x555555d87f50, size=size@entry=15)
     at block/qcow2-refcount.c:1052
1052	{
(gdb) n
1053	    BDRVQcow2State *s = bs->opaque;
(gdb)
1058	    BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
(gdb)
1059	    assert(size > 0 && size <= s->cluster_size);
(gdb) p s->free_byte_offset
$2 = 3070
(gdb) p 3070%512
$3 = 510
...
(gdb)
1076	    free_in_cluster = s->cluster_size - offset_into_cluster(s, offset);
(gdb)
1078	        if (!offset || free_in_cluster < size) {
(gdb) p free_in_cluster
$4 = 2
1079	            int64_t new_cluster = alloc_clusters_noref(bs, 
s->cluster_size);
(gdb)
1080	            if (new_cluster < 0) {
(gdb)
1084	            if (new_cluster == 0) {
(gdb)
1091	            if (!offset || ROUND_UP(offset, s->cluster_size) != 
new_cluster) {
(gdb)
1095	                free_in_cluster += s->cluster_size;
(gdb)
1099	        assert(offset);

so we got a contiguous cluster, and our goal is to let the caller bleed 
the compressed cluster into the tail of the current sector and into 
the head of the next cluster.  Continuing:

(gdb) fin
Run till exit from #0  qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
     size=size@entry=15) at block/qcow2-refcount.c:1118
[Thread 0x7fffe25ed700 (LWP 32301) exited]
[Thread 0x7fffe3cfe700 (LWP 32300) exited]
qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
     offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
     at block/qcow2-cluster.c:768
768	    if (cluster_offset < 0) {
Value returned is $5 = 3070

(gdb) n
773	    nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
774	                  (cluster_offset >> 9);
(gdb)
773	    nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
(gdb)
777	                      ((uint64_t)nb_csectors << s->csize_shift);
(gdb) l
772	
773	    nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
774	                  (cluster_offset >> 9);
775	
776	    cluster_offset |= QCOW_OFLAG_COMPRESSED |
777	                      ((uint64_t)nb_csectors << s->csize_shift);
778	
779	    /* update L2 table */
780	
781	    /* compressed clusters never have the copied flag */
(gdb) p nb_csectors
$6 = 1
(gdb) p s->csize_shift
$7 = 61
(gdb) p/x cluster_offset
$8 = 0xbfe
(gdb) n
776	    cluster_offset |= QCOW_OFLAG_COMPRESSED |
(gdb)
783	    BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
(gdb) p/x cluster_offset
$9 = 0x6000000000000bfe

Where is s->csize_shift initialized?  In qcow2_do_open():

     s->csize_shift = (62 - (s->cluster_bits - 8));
     s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
     s->cluster_offset_mask = (1LL << s->csize_shift) - 1;

Revisiting the wording in the spec:

Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):

     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
                     cluster boundary!

        x+1 - 61:    Compressed size of the images in sectors of 512 bytes

which says bits 0-61 are the host cluster offset, and 62-61 is the 
number of sectors.  But our code sets s->csize_shift to divide this 
differently, at 0-60 and 61-61.  Which means your earlier claim that 
there are enough 'number sector' bits to allow for up to 2*cluster_size 
as the size of the compressed stream (rather than my claim of exactly 
cluster_size) is right, and other implementations CAN inflate a cluster 
(if we don't document otherwise), and that even if they DON'T inflate, 
they can STILL cause a read larger than a cluster size when the offset 
is near the tail of one sector (most likely at 512-byte clusters, but 
remotely possible at other cluster sizes as well).
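
Plugging the values from the trace above into the reader-side math
confirms the over-cluster read; a standalone sketch (not QEMU code):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        /* 512-byte clusters: cluster_bits = 9 */
        unsigned csize_shift = 62 - (9 - 8);            /* 61 */
        uint64_t csize_mask = (1 << (9 - 8)) - 1;       /* 1 */
        uint64_t l2_entry = 0x6000000000000bfeULL;      /* from the gdb session */

        uint64_t coffset = l2_entry & ((1ULL << csize_shift) - 1);
        uint64_t nb_csectors = ((l2_entry >> csize_shift) & csize_mask) + 1;
        uint64_t csize = nb_csectors * 512 - (coffset & 511);

        /* coffset = 3070, nb_csectors = 2, csize = 1024 - 510 = 514 */
        assert(csize == 514);   /* > cluster_size (512): the read spills over */
        return 0;
    }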
Kevin Wolf Feb. 21, 2018, 4:51 p.m. UTC | #5
Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> When reading a compressed image, we were allocating s->cluster_data
> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> with 2M clusters).  Let's check out the history:
> 
> Back when qcow2 was first written, we used s->cluster_data for
> everything, including copy_sectors() and encryption, where we want
> to operate on more than one cluster at once.  Obviously, at that
> point, the buffer had to be aligned for other users, even though
> compression itself doesn't require any alignment.
> 
> But commit 1b9f1491 (v1.1!) changed things to allocate parallel
> buffers on demand rather than sharing a single buffer, for encryption
> and COW, leaving compression as the final client of s->cluster_data.
> That use was still preserved, because if a single compressed cluster
> is read more than once, we reuse the cache instead of decompressing
> it a second time (I'm not sure how often this optimization actually
> fires, or if it penalizes us from being able to decompress multiple
> clusters in parallel even though we can now decrypt clusters in
> parallel; the XXX comment in qcow2_co_preadv for
> QCOW2_CLUSTER_COMPRESSED is telling).
> 
> Much later, in commit de82815d (v2.2), we noticed that a 64M
> allocation is prone to failure, so we switched over to a graceful
> memory allocation error message.  But note that elsewhere in the
> code, we do g_malloc(2 * cluster_size) without ever checking for
> failure.
> 
> Then even later, in 3e4c7052 (2.11), we realized that allocating
> a large buffer up front for every qcow2 image is expensive, and
> switched to lazy allocation only for images that actually had
> compressed clusters.  But in the process, we never even bothered
> to check whether what we were allocating still made sense in its
> new context!
> 
> So, it's time to cut back on the waste.  A compressed cluster
> will NEVER occupy more than an uncompressed cluster (okay, gzip
> DOES document that because the compression stream adds metadata,
> and because of the pigeonhole principle, there are worst case
> scenarios where attempts to compress will actually inflate an
> image - but in those cases, we would just write the cluster
> uncompressed instead of inflating it).  And as that is a smaller
> amount of memory, we can get by with the simpler g_malloc.
> 
> Signed-off-by: Eric Blake <eblake@redhat.com>

My comments feel a bit trivial compared to Berto's, but anyway:

> diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
> index 85be7d5e340..8c4b26ceaf2 100644
> --- a/block/qcow2-cluster.c
> +++ b/block/qcow2-cluster.c
> @@ -1603,15 +1603,9 @@ int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
>           * are freed in .bdrv_close().
>           */
>          if (!s->cluster_data) {
> -            /* one more sector for decompressed data alignment */
> -            s->cluster_data = qemu_try_blockalign(bs->file->bs,
> -                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
> -            if (!s->cluster_data) {
> -                return -ENOMEM;
> -            }
> -        }
> -        if (!s->cluster_cache) {
> -            s->cluster_cache = g_malloc(s->cluster_size);
> +            assert(!s->cluster_cache);

Wouldn't it be better to assert (!!s->cluster_cache ==
!!s->cluster_data) unconditionally?

> +            s->cluster_data = g_try_malloc(s->cluster_size);

Why are you going from qemu_try_blockalign() to simple malloc here? This
buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
should avoid unnecessary use of a bounce buffer.

> +            s->cluster_cache = g_try_malloc(s->cluster_size);

As you already said, either g_malloc() or check the result. I actually
think that g_try_malloc() and checking the result is nicer, we still
allocate up to 2 MB here.

Kevin
Eric Blake Feb. 21, 2018, 4:59 p.m. UTC | #6
On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
>> When reading a compressed image, we were allocating s->cluster_data
>> to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
>> with 2M clusters).  Let's check out the history:
>>

>> Much later, in commit de82815d (v2.2), we noticed that a 64M
>> allocation is prone to failure, so we switched over to a graceful
>> memory allocation error message.  But note that elsewhere in the
>> code, we do g_malloc(2 * cluster_size) without ever checking for
>> failure.
>>

>> -        }
>> -        if (!s->cluster_cache) {
>> -            s->cluster_cache = g_malloc(s->cluster_size);
>> +            assert(!s->cluster_cache);
> 
> Wouldn't it be better to assert (!!s->cluster_cache ==
> !!s->cluster_data) unconditionally?
> 

Sure.

>> +            s->cluster_data = g_try_malloc(s->cluster_size);
> 
> Why are you going from qemu_try_blockalign() to simple malloc here? This
> buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
> should avoid unnecessary use of a bounce buffer.

But does bdrv_pread() actually need to use a bounce buffer if we don't 
have an aligned buffer to read into?  Either the underlying protocol 
already supports byte-aligned reads (direct into our buffer, regardless 
of alignment, no bouncing required), or it already has to do a full 
sector read into a bounce buffer anyways (and it doesn't matter whether 
we aligned our buffer).  blockalign() made sense when we had multiple 
clients for the buffer, but ever since v1.1, when we have only a single 
client, and that single client is most likely not going to read 
sector-aligned data in the first place, aligning the buffer doesn't buy 
us anything.

> 
>> +            s->cluster_cache = g_try_malloc(s->cluster_size);
> 
> As you already said, either g_malloc() or check the result. I actually
> think that g_try_malloc() and checking the result is nicer, we still
> allocate up to 2 MB here.

See my commit message comment - we have other spots in the code base 
that blindly g_malloc(2 * s->cluster_size).  And I intended (but sent 
the email without amending my commit) to use g_malloc().  But as Berto 
has convinced me that an externally produced image can convince us to 
read up to 4M (even though we don't need that much to decompress), I 
suppose that the try_ variant plus checking is reasonable (and care in 
NULL'ing out if one but not both allocations succeed).
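
For concreteness, a minimal sketch of the allocation pattern being
converged on here (buffer sizes are still under discussion, so the
s->cluster_size values are placeholders, not the final patch):

    if (!s->cluster_data) {
        assert(!s->cluster_cache);
        s->cluster_data = g_try_malloc(s->cluster_size);
        s->cluster_cache = g_try_malloc(s->cluster_size);
        if (!s->cluster_data || !s->cluster_cache) {
            g_free(s->cluster_data);      /* g_free(NULL) is a no-op */
            g_free(s->cluster_cache);
            s->cluster_data = NULL;
            s->cluster_cache = NULL;
            return -ENOMEM;
        }
    }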
Kevin Wolf Feb. 21, 2018, 5:39 p.m. UTC | #7
Am 21.02.2018 um 17:59 hat Eric Blake geschrieben:
> On 02/21/2018 10:51 AM, Kevin Wolf wrote:
> > Am 20.02.2018 um 23:24 hat Eric Blake geschrieben:
> > > When reading a compressed image, we were allocating s->cluster_data
> > > to 32*cluster_size + 512 (possibly over 64 megabytes, for an image
> > > with 2M clusters).  Let's check out the history:
> > > 
> 
> > > Much later, in commit de82815d (v2.2), we noticed that a 64M
> > > allocation is prone to failure, so we switched over to a graceful
> > > memory allocation error message.  But note that elsewhere in the
> > > code, we do g_malloc(2 * cluster_size) without ever checking for
> > > failure.
> > > 
> 
> > > -        }
> > > -        if (!s->cluster_cache) {
> > > -            s->cluster_cache = g_malloc(s->cluster_size);
> > > +            assert(!s->cluster_cache);
> > 
> > Wouldn't it be better to assert (!!s->cluster_cache ==
> > !!s->cluster_data) unconditionally?
> > 
> 
> Sure.
> 
> > > +            s->cluster_data = g_try_malloc(s->cluster_size);
> > 
> > Why are you going from qemu_try_blockalign() to simple malloc here? This
> > buffer is used with bdrv_read() (or bdrv_pread() after patch 1), so we
> > should avoid unnecessary use of a bounce buffer.
> 
> But does bdrv_pread() actually need to use a bounce buffer if we don't have
> an aligned buffer to read into?  Either the underlying protocol already
> supports byte-aligned reads (direct into our buffer, regardless of
> alignment, no bouncing required), or it already has to do a full sector read
> into a bounce buffer anyways (and it doesn't matter whether we aligned our
> buffer).  blockalign() made sense when we had multiple clients for the
> buffer, but ever since v1.1, when we have only a single client, and that
> single client is most likely not going to read sector-aligned data in the
> first place, aligning the buffer doesn't buy us anything.

Good point.

To be honest, I don't even analyse each caller, but just consistently use
qemu_try_blockalign() whenever a buffer is used for I/O. It's a simple
rule of thumb that generally makes sense.

So as you say, in this case it's unlikely, but possible that we can
benefit from an aligned buffer. I guess my point is more about
consistency than actual functionality then. But it's okay either way.

> > 
> > > +            s->cluster_cache = g_try_malloc(s->cluster_size);
> > 
> > As you already said, either g_malloc() or check the result. I actually
> > think that g_try_malloc() and checking the result is nicer, we still
> > allocate up to 2 MB here.
> 
> See my commit message comment - we have other spots in the code base that
> blindly g_malloc(2 * s->cluster_size).

Though is that a reason to do the same in new code or to phase out such
allocations whenever you touch them?

> And I intended (but sent the email without amending my commit) to use
> g_malloc().  But as Berto has convinced me that an externally produced
> image can convince us to read up to 4M (even though we don't need that
> much to decompress), I suppose that the try_ variant plus checking is
> reasonable (and care in NULL'ing out if one but not both allocations
> succeed).

Sounds good.

Another thought I had is whether we should do per-request allocation for
compressed clusters, too, instead of having per-BDS buffers.

Kevin
Eric Blake Feb. 21, 2018, 6:32 p.m. UTC | #8
On 02/21/2018 11:39 AM, Kevin Wolf wrote:
>> See my commit message comment - we have other spots in the code base that
>> blindly g_malloc(2 * s->cluster_size).
> 
> Though is that a reason to do the same in new code or to phase out such
> allocations whenever you touch them?

Touché.

> 
>> And I intended (but sent the email without amending my commit) to use
>> g_malloc().  But as Berto has convinced me that an externally produced
>> image can convince us to read up to 4M (even though we don't need that
>> much to decompress), I suppose that the try_ variant plus checking is
>> reasonable (and care in NULL'ing out if one but not both allocations
>> succeed).
> 
> Sounds good.
> 
> Another thought I had is whether we should do per-request allocation for
> compressed clusters, too, instead of having per-BDS buffers.

The only benefit of a per-BDS buffer is that we cache things - multiple 
sub-cluster reads in a row all from the same compressed cluster benefit 
from decompressing only once.  The drawbacks of a per-BDS buffer: we 
can't do things in parallel (everything else in qcow2 drops the lock 
around bdrv_co_pread[v]), so the initial read prevents anything else in 
the qcow2 layer from progressing.

I also wonder - since we ARE allowing multiple parallel readers in other 
parts of qcow2 (without a patch, decompression is not in this boat, but 
decryption and even bounce buffers due to lower-layer alignment 
constraints are), what sort of mechanisms do we have for using a pool of 
reusable buffers, rather than having each cluster access that requires a 
buffer malloc and free the buffer on a per-access basis?  I don't know 
how much time the malloc/free per-transaction overhead adds, or if it is 
already much smaller than the actual I/O time.

But note that while reusable buffers from a pool would cut down on the 
per-I/O malloc/free overhead if we switch decompression away from 
per-BDS buffer, it would still not solve the fact that we only get the 
caching ability where multiple sub-cluster requests from the same 
compressed cluster require only one decompression, since that's only 
possible on a per-BDS caching level.
John Snow Feb. 21, 2018, 6:32 p.m. UTC | #9
On 02/21/2018 10:59 AM, Eric Blake wrote:
> On 02/21/2018 09:00 AM, Eric Blake wrote:
>> On 02/21/2018 04:04 AM, Alberto Garcia wrote:
>>> On Tue 20 Feb 2018 11:24:59 PM CET, Eric Blake wrote:
>>>
>>> I was also preparing a patch to change this, but you arrived first :-)
>>>
>>>> So, it's time to cut back on the waste.  A compressed cluster
>>>> will NEVER occupy more than an uncompressed cluster
> 
> 
>> And here, csize can only get smaller than the length picked by
>> nb_csectors.  Therefore, csize is GUARANTEED to be <= s->cluster_size.
>>
>>>
>>> - Solution a: check that csize < s->cluster_size and return an error if
>>>    it's not. However! although QEMU won't produce an image with a
>>>    compressed cluster that is larger than the uncompressed one, the
>>> qcow2
>>>    on-disk format in principle allows for that, so arguably we should
>>>    accept it.
>>
>> No, the qcow2 on-disk format does not have enough bits reserved for
>> that.  It is impossible to store an inflated cluster, because you
>> don't have enough bits to record it.
> 
> Okay, the spec is WRONG, compared to our code base.
> 
>>
>> That said, we MAY have a bug, more likely to be visible in 512-byte
>> clusters but possible on any size.  While the 'number sectors' field
>> IS sufficient for any compressed cluster starting at offset 0 in the
>> cluster, we run into issues if the starting offset is later in the
>> cluster.  That is, even though the compressed length of a 512-byte
>> cluster is <= 512 (or we wouldn't compress it), if we start at offset
>> 510, we NEED to read the next cluster to get the rest of the
>> compressed stream - but at 512-byte clusters, there are 0 bits
>> reserved for 'number sectors'.  Per the above calculations with the
>> example offset of 510, nb_csectors would be 1 (it can't be anything
>> else for a 512-byte cluster image), and csize would then be 2 bytes,
>> which is insufficient for reading back enough data to reconstruct the
>> cluster.
> 
> In fact, here's a demonstration of a discrepancy, where qemu-img and
> John's qcheck tool [1] disagree about the validity of an image created
> by qemu (although it may just be that qcheck hasn't yet learned about
> compressed clusters):
> 
> [1]https://github.com/jnsnow/qcheck
> 

I wouldn't trust my tool's ability to understand compressed clusters :)

I didn't get very far, though I did run across the fact that there
appeared to be a discrepancy between the spec and the actual
implementation, IIRC.

It looked like you came to the same conclusion when you stepped through
it manually.

> $ f=12345678
> $ f=$f$f$f$f # 32
> $ f=$f$f$f$f # 128
> $ f=$f$f$f$f # 512
> $ f=$f$f$f$f # 2k
> $ f=$f$f$f$f # 8k
> $ f=$f$f$f$f # 32k
> $ f=$f$f$f$f # 128k
> $ printf "$f" > data
> $ qemu-img convert -c -f raw -O qcow2 -o cluster_size=512 \
>   data data.qcow2
> $ qemu-img check data.qcow2
> No errors were found on the image.
> 256/256 = 100.00% allocated, 100.00% fragmented, 100.00% compressed
> clusters
> Image end offset: 18944
> $ ./qcheck data.qcow2
> ...
> == L2 Tables ==
> 
> == L2 cluster l1[0] : 0x0000000000000800 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[1] : 0x0000000000000e00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[2] : 0x0000000000001400 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> == L2 cluster l1[3] : 0x0000000000001a00 ==
> Corrupt entries: 63; Non-zero entries: 64; Corrupt:Non-zero ratio: 0.984375
> L2 tables: Could not complete analysis, 257 problems found
> 
> 
> == Reference Count Analysis ==
> 
> Refcount analysis: 00 vacant clusters
> Refcount analysis: 04 leaked clusters
> Refcount analysis: 00 ghost clusters
> Refcount analysis: 04 miscounted clusters
> Refcount analysis: 8 problems found
> 
> 
> == Cluster Counts ==
> 
> Metadata: 0x1000
> Data: 0x800
> Leaked: 0x800
> Vacant: 0x0
> total: 0x2000
> qcheck: 73 problems found
> 
> 
>>
>> Not true.  It is (cluster_bits - 9) or (cluster_size / 512). 
>> Remember, x = 62 - (cluster_bits - 8); for a 512-byte cluster, x =
>> 61.  The 'number sectors' field is then bits x+1 - 61 (but you can't
>> have a bitfield occupying bit 62 up to 61; especially since bit 62 is
>> the bit for compressed cluster).
> 
> So instead of blindly reading the spec, I'm now going to single-step
> through the 'qemu-img convert' command line above, to see what REALLY
> happens:
> 
> Line numbers from commit a6e0344fa0
> $ gdb --args ./qemu-img convert -c -f raw -O qcow2 -o cluster_size=512
> data data.qcow2
> ...
> (gdb) b qcow2_alloc_bytes
> Breakpoint 1 at 0x57610: file block/qcow2-refcount.c, line 1052.
> (gdb) r
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb)
> 
> So we are compressing 512 bytes down to 15 every time, which means that
> after 34 clusters compressed, we should be at offset 510.  Let's resume
> debugging:
> 
> (gdb) c 34
> Will ignore next 33 crossings of breakpoint 1.  Continuing.
> [Thread 0x7fffe3cfe700 (LWP 32229) exited]
> [New Thread 0x7fffe3cfe700 (LWP 32300)]
> [New Thread 0x7fffe25ed700 (LWP 32301)]
> 
> Thread 1 "qemu-img" hit Breakpoint 1, qcow2_alloc_bytes (
>     bs=bs@entry=0x555555d87f50, size=size@entry=15)
>     at block/qcow2-refcount.c:1052
> 1052    {
> (gdb) n
> 1053        BDRVQcow2State *s = bs->opaque;
> (gdb)
> 1058        BLKDBG_EVENT(bs->file, BLKDBG_CLUSTER_ALLOC_BYTES);
> (gdb)
> 1059        assert(size > 0 && size <= s->cluster_size);
> (gdb) p s->free_byte_offset
> $2 = 3070
> (gdb) p 3070%512
> $3 = 510
> ...
> (gdb)
> 1076        free_in_cluster = s->cluster_size - offset_into_cluster(s,
> offset);
> (gdb)
> 1078            if (!offset || free_in_cluster < size) {
> (gdb) p free_in_cluster
> $4 = 2
> 1079                int64_t new_cluster = alloc_clusters_noref(bs,
> s->cluster_size);
> (gdb)
> 1080                if (new_cluster < 0) {
> (gdb)
> 1084                if (new_cluster == 0) {
> (gdb)
> 1091                if (!offset || ROUND_UP(offset, s->cluster_size) !=
> new_cluster) {
> (gdb)
> 1095                    free_in_cluster += s->cluster_size;
> (gdb)
> 1099            assert(offset);
> 
> so we got a contiguous cluster, and our goal is to let the caller bleed
> the compressed cluster into the tail of the current sector and into
> the head of the next cluster.  Continuing:
> 
> (gdb) fin
> Run till exit from #0  qcow2_alloc_bytes (bs=bs@entry=0x555555d87f50,
>     size=size@entry=15) at block/qcow2-refcount.c:1118
> [Thread 0x7fffe25ed700 (LWP 32301) exited]
> [Thread 0x7fffe3cfe700 (LWP 32300) exited]
> qcow2_alloc_compressed_cluster_offset (bs=bs@entry=0x555555d87f50,
>     offset=offset@entry=17408, compressed_size=compressed_size@entry=15)
>     at block/qcow2-cluster.c:768
> 768        if (cluster_offset < 0) {
> Value returned is $5 = 3070
> 
> (gdb) n
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 774                      (cluster_offset >> 9);
> (gdb)
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> (gdb)
> 777                          ((uint64_t)nb_csectors << s->csize_shift);
> (gdb) l
> 772   
> 773        nb_csectors = ((cluster_offset + compressed_size - 1) >> 9) -
> 774                      (cluster_offset >> 9);
> 775   
> 776        cluster_offset |= QCOW_OFLAG_COMPRESSED |
> 777                          ((uint64_t)nb_csectors << s->csize_shift);
> 778   
> 779        /* update L2 table */
> 780   
> 781        /* compressed clusters never have the copied flag */
> (gdb) p nb_csectors
> $6 = 1
> (gdb) p s->csize_shift
> $7 = 61
> (gdb) p/x cluster_offset
> $8 = 0xbfe
> (gdb) n
> 776        cluster_offset |= QCOW_OFLAG_COMPRESSED |
> (gdb)
> 783        BLKDBG_EVENT(bs->file, BLKDBG_L2_UPDATE_COMPRESSED);
> (gdb) p/x cluster_offset
> $9 = 0x6000000000000bfe
> 
> Where is s->csize_shift initialized?  In qcow2_do_open():
> 
>     s->csize_shift = (62 - (s->cluster_bits - 8));
>     s->csize_mask = (1 << (s->cluster_bits - 8)) - 1;
>     s->cluster_offset_mask = (1LL << s->csize_shift) - 1;
> 
> Revisiting the wording in the spec:
> 
> Compressed Clusters Descriptor (x = 62 - (cluster_bits - 8)):
> 
>     Bit  0 -  x:    Host cluster offset. This is usually _not_ aligned to a
>                     cluster boundary!
> 
>        x+1 - 61:    Compressed size of the images in sectors of 512 bytes
> 
> which says bits 0-61 are the host cluster offset, and 62-61 is the
> number of sectors.  But our code sets s->csize_shift to divide this
> differently, at 0-60 and 61-61.  Which means your earlier claim that
> there are enough 'number sector' bits to allow for up to 2*cluster_size
> as the size of the compressed stream (rather than my claim of exactly
> cluster_size) is right, and other implementations CAN inflate a cluster
> (if we don't document otherwise), and that even if they DON'T inflate,
> they can STILL cause a read larger than a cluster size when the offset
> is near the tail of one sector (most likely at 512-byte clusters, but
> remotely possible at other cluster sizes as well).
>
Kevin Wolf Feb. 21, 2018, 6:48 p.m. UTC | #10
Am 21.02.2018 um 19:32 hat Eric Blake geschrieben:
> On 02/21/2018 11:39 AM, Kevin Wolf wrote:
> > > See my commit message comment - we have other spots in the code base that
> > > blindly g_malloc(2 * s->cluster_size).
> > 
> > Though is that a reason to do the same in new code or to phase out such
> > allocations whenever you touch them?
> 
> Touché.
> 
> > 
> > > And I intended (but sent the email without amending my commit) to use
> > > g_malloc().  But as Berto has convinced me that an externally produced
> > > image can convince us to read up to 4M (even though we don't need that
> > > much to decompress), I suppose that the try_ variant plus checking is
> > > reasonable (and care in NULL'ing out if one but not both allocations
> > > succeed).
> > 
> > Sounds good.
> > 
> > Another thought I had is whether we should do per-request allocation for
> > compressed clusters, too, instead of having per-BDS buffers.
> 
> The only benefit of a per-BDS buffer is that we cache things - multiple
> sub-cluster reads in a row all from the same compressed cluster benefit from
> decompressing only once.

Oh, you're right. I missed that this is actually used as a cache.

I guess we want to leave it for now then. Maybe at some point we can
actually implement the data cache that I proposed a few years ago (using
Qcow2Cache for data clusters under some circumstances), then we could
probably make that hold the data instead of having a separate cache.

> The drawbacks of a per-BDS buffer: we can't do things in parallel
> (everything else in qcow2 drops the lock around bdrv_co_pread[v]), so
> the initial read prevents anything else in the qcow2 layer from
> progressing.

Yes, though there are probably other optimisations that could be made
for compression before this becomes relevant, like reading more than one
cluster at a time.

> I also wonder - since we ARE allowing multiple parallel readers in other
> parts of qcow2 (without a patch, decompression is not in this boat, but
> decryption and even bounce buffers due to lower-layer alignment constraints
> are), what sort of mechanisms do we have for using a pool of reusable
> buffers, rather than having each cluster access that requires a buffer
> malloc and free the buffer on a per-access basis?  I don't know how much
> time the malloc/free per-transaction overhead adds, or if it is already much
> smaller than the actual I/O time.

I don't either. A while ago, we used g_slice_alloc() in some places (I
remember qemu_aio_get), but it was actually slower than just using
malloc/free each time.

So if we do want to pool buffers, we probably need to implement that
manually. I don't think we have a generic memory pool in qemu yet.
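
For illustration, a hand-rolled pool along those lines could be as
small as the following sketch (purely hypothetical; the BufPool type
and helpers do not exist in QEMU):

    /* Assumes "qemu/thread.h" (QemuMutex) and glib (GSList). */
    typedef struct BufPool {
        QemuMutex lock;
        size_t buf_size;
        GSList *free_bufs;              /* idle buffers */
    } BufPool;

    static void *buf_pool_get(BufPool *p)
    {
        void *buf = NULL;

        qemu_mutex_lock(&p->lock);
        if (p->free_bufs) {
            buf = p->free_bufs->data;
            p->free_bufs = g_slist_delete_link(p->free_bufs, p->free_bufs);
        }
        qemu_mutex_unlock(&p->lock);

        return buf ? buf : g_malloc(p->buf_size);
    }

    static void buf_pool_put(BufPool *p, void *buf)
    {
        qemu_mutex_lock(&p->lock);
        p->free_bufs = g_slist_prepend(p->free_bufs, buf);
        qemu_mutex_unlock(&p->lock);
    }

Whether such a pool actually beats per-request malloc/free is exactly
the measurement question raised above.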

> But note that while reusable buffers from a pool would cut down on the
> per-I/O malloc/free overhead if we switch decompression away from per-BDS
> buffer, it would still not solve the fact that we only get the caching
> ability where multiple sub-cluster requests from the same compressed cluster
> require only one decompression, since that's only possible on a per-BDS
> caching level.

Yes, as I said above, I didn't notice that it's a real cache. Without
the possibility to use Qcow2Cache instead, we'll want to keep it.

Kevin
Alberto Garcia Feb. 22, 2018, 1:57 p.m. UTC | #11
On Wed 21 Feb 2018 05:59:58 PM CET, Eric Blake wrote:
> But as Berto has convinced me that an externally produced image can
> convince us to read up to 4M (even though we don't need that much to
> decompress),

A (harmless but funny) consequence of the way this works is that for any
valid compressed cluster you should be able to increase the value of the
size field as much as you want without causing any user-visible effect.

So if you're working with 2MB clusters but for a particular compressed
cluster the size field is 0x0006 (7 sectors) you can still increase it
to the maximum (0x1fff, or 8192 sectors) and it should work just the
same. QEMU will read 4MB instead of ~4KB but since decompression stops
once the original cluster has been restored there's no harm.
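
That behaviour falls out of how the decompression helper works; here is
a simplified sketch (modeled loosely on qcow2's decompress_buffer(),
raw deflate via zlib):

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>

    /* Inflate stops as soon as out_buf_size bytes have been produced, so
     * extra input implied by an oversized size field is simply ignored. */
    static int decompress_sketch(uint8_t *out_buf, int out_buf_size,
                                 const uint8_t *buf, int buf_size)
    {
        z_stream strm;
        int ret;

        memset(&strm, 0, sizeof(strm));
        strm.next_in = (uint8_t *)buf;
        strm.avail_in = buf_size;
        strm.next_out = out_buf;
        strm.avail_out = out_buf_size;

        if (inflateInit2(&strm, -12) != Z_OK) {     /* raw deflate stream */
            return -1;
        }
        ret = inflate(&strm, Z_FINISH);
        inflateEnd(&strm);

        /* Success means the whole cluster was reproduced, no matter how
         * much of the input was actually consumed. */
        return ((ret == Z_STREAM_END || ret == Z_BUF_ERROR) &&
                strm.avail_out == 0) ? 0 : -1;
    }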

I think I'll write a test case for this, it can be useful to verify that
QEMU can handle this kind of scenarios.

Berto

Patch

diff --git a/block/qcow2-cluster.c b/block/qcow2-cluster.c
index 85be7d5e340..8c4b26ceaf2 100644
--- a/block/qcow2-cluster.c
+++ b/block/qcow2-cluster.c
@@ -1603,15 +1603,9 @@  int qcow2_decompress_cluster(BlockDriverState *bs, uint64_t cluster_offset)
          * are freed in .bdrv_close().
          */
         if (!s->cluster_data) {
-            /* one more sector for decompressed data alignment */
-            s->cluster_data = qemu_try_blockalign(bs->file->bs,
-                    QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size + 512);
-            if (!s->cluster_data) {
-                return -ENOMEM;
-            }
-        }
-        if (!s->cluster_cache) {
-            s->cluster_cache = g_malloc(s->cluster_size);
+            assert(!s->cluster_cache);
+            s->cluster_data = g_try_malloc(s->cluster_size);
+            s->cluster_cache = g_try_malloc(s->cluster_size);
         }

         BLKDBG_EVENT(bs->file, BLKDBG_READ_COMPRESSED);
diff --git a/block/qcow2.c b/block/qcow2.c
index 288b5299d80..6ad3436e0e5 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2103,7 +2103,7 @@  static void qcow2_close(BlockDriverState *bs)
     g_free(s->image_backing_format);

     g_free(s->cluster_cache);
-    qemu_vfree(s->cluster_data);
+    g_free(s->cluster_data);
     qcow2_refcount_close(bs);
     qcow2_free_snapshots(bs);
 }