diff mbox

qcow2: do not allocate extra memory

Message ID 1468345431-106198-1-git-send-email-vsementsov@virtuozzo.com
State New
Headers show

Commit Message

Vladimir Sementsov-Ogievskiy July 12, 2016, 5:43 p.m. UTC
There are no needs to allocate more than one cluster, as we set
avail_out for deflate to one cluster.

Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
---

Hi all!

Please, can anybody say me what I'm missing?

I've looked through deflate documentation at
http://www.zlib.net/manual.html, and I didn't find anything about
allocating more memory for out buffer than specified in avail_out
field.. What is this magic formula?

Comments

Eric Blake July 12, 2016, 6:43 p.m. UTC | #1
On 07/12/2016 11:43 AM, Vladimir Sementsov-Ogievskiy wrote:
> There are no needs to allocate more than one cluster, as we set
> avail_out for deflate to one cluster.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> Please, can anybody say me what I'm missing?
> 

https://tools.ietf.org/html/rfc1951 states a simple fact about compression:

      A simple counting argument shows that no lossless compression
      algorithm can compress every possible input data set.  For the
      format defined here, the worst case expansion is 5 bytes per 32K-
      byte block, i.e., a size increase of 0.015% for large data sets.

So overallocating the output buffer guarantees that you will get a valid
compression result via a single function call, even when the data is
incompressible (the zlib format specifically documents that if the
normal algorithm on the data does not reduce its size, then you merely
add a fixed-length marker that documents that fact, so you at least
avoid unbounded expansion when trying to compress pathological data).

But since the qcow2 format already has a way of documenting whether a
cluster is compressed or not, we probably don't have to rely on zlib's
marker for uncompressible data, and could instead tweak the code to
specifically refuse to compress any cluster whose output would result in
more than a cluster's worth of bytes.  I'm not familiar enough with
zlib's interface to know how easy or hard this is, and whether merely
checking error codes is sufficient, nor whether qemu's use of zlib would
behave correctly in the face of such an error when the output buffer is
undersized because the data was incompressible.


> ...
> strm.avail_out = s->cluster_size;
> strm.next_out = out_buf;
> 
> ret = deflate(&strm, Z_FINISH);
> ...
> out_len = strm.next_out - out_buf;

You've skipped what is done with ret, which will be different according
to whether the entire compressed stream fit in the buffer described by
strm, and that would have to be audited as part of your proposed patch.


> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> +    out_buf = g_malloc(s->cluster_size);

Is avoiding the fudge factor really worth it? I don't know that we'll
get a noticeable performance gain with this patch, and it may be easier
to leave things alone than to audit that we are correctly handling cases
where the attempt at compression results in a zlib buffer larger than
the original data, even when the output buffer size is now constrained
differently.
John Snow July 12, 2016, 7:03 p.m. UTC | #2
On 07/12/2016 01:43 PM, Vladimir Sementsov-Ogievskiy wrote:
> There are no needs to allocate more than one cluster, as we set
> avail_out for deflate to one cluster.
> 
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
> 
> Hi all!
> 
> Please, can anybody say me what I'm missing?
> 

Not the first mystery padding I've seen from a Blue Swirl checkin.

I can't figure out the purpose here either.

> I've looked through deflate documentation at
> http://www.zlib.net/manual.html, and I didn't find anything about
> allocating more memory for out buffer than specified in avail_out
> field.. What is this magic formula?
> 

Here's my guess. the qcow2 write compressed function *tries* to compress
a sector. Sometimes it isn't able to and the output data may be larger
than the input data.

In these cases, perhaps we were trying to allow enough buffer room for
the worst cast *expansion*. We check the length of the output data after
calling deflate to see if it actually increased.

Notably, since we only ever give it s->cluster_size, though, the out_len
can only ever be as big as s->cluster_size, which makes the >=
comparison a little misleading later.

Even if we could store something compressed if it was precisely 64KB as
an example, it's probably the right answer to store such things
uncompressed, so "==" (or >=) is probably an OK condition ... even if we
limit the buffer to exactly 64KB.

Where did s->cluster_size/1000 + 128 come from? No idea: AFAICT zlib
advertises a maximum overhead of 5 bytes per 16KB with a one-time
overhead of 6 bytes, so it should look something more like:

DIV_ROUND_UP(s->cluster_size, 16384) * 5 + 6

Even so, maybe we don't need it and we can just trash the result soundly
if we fill up a single cluster_size buffer.

> ========
> 
> All uses of out_buf in the function:
> 
> uint8_t *out_buf;
> ...
> out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> ...
> strm.avail_out = s->cluster_size;
> strm.next_out = out_buf;
> 
> ret = deflate(&strm, Z_FINISH);
> ...
> out_len = strm.next_out - out_buf;
> ...
> ret = bdrv_pwrite(bs->file, cluster_offset, out_buf, out_len);
> ...
> g_free(out_buf);
> 
>  block/qcow.c  | 2 +-
>  block/qcow2.c | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/block/qcow.c b/block/qcow.c
> index ac849bd..d8826f3 100644
> --- a/block/qcow.c
> +++ b/block/qcow.c
> @@ -983,7 +983,7 @@ static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
>          return ret;
>      }
>  
> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> +    out_buf = g_malloc(s->cluster_size);
>  
>      /* best compression, small window, no zlib header */
>      memset(&strm, 0, sizeof(strm));
> diff --git a/block/qcow2.c b/block/qcow2.c
> index a5ea19b..b1c90ae 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2612,7 +2612,7 @@ static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
>          return ret;
>      }
>  
> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> +    out_buf = g_malloc(s->cluster_size);
>  
>      /* best compression, small window, no zlib header */
>      memset(&strm, 0, sizeof(strm));
>
Vladimir Sementsov-Ogievskiy July 12, 2016, 7:11 p.m. UTC | #3
On 12.07.2016 21:43, Eric Blake wrote:
> On 07/12/2016 11:43 AM, Vladimir Sementsov-Ogievskiy wrote:
>> There are no needs to allocate more than one cluster, as we set
>> avail_out for deflate to one cluster.
>>
>> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
>> ---
>>
>> Hi all!
>>
>> Please, can anybody say me what I'm missing?
>>
> https://tools.ietf.org/html/rfc1951 states a simple fact about compression:
>
>        A simple counting argument shows that no lossless compression
>        algorithm can compress every possible input data set.  For the
>        format defined here, the worst case expansion is 5 bytes per 32K-
>        byte block, i.e., a size increase of 0.015% for large data sets.
>
> So overallocating the output buffer guarantees that you will get a valid
> compression result via a single function call, even when the data is
> incompressible (the zlib format specifically documents that if the
> normal algorithm on the data does not reduce its size, then you merely
> add a fixed-length marker that documents that fact, so you at least
> avoid unbounded expansion when trying to compress pathological data).
>
> But since the qcow2 format already has a way of documenting whether a
> cluster is compressed or not, we probably don't have to rely on zlib's
> marker for uncompressible data, and could instead tweak the code to
> specifically refuse to compress any cluster whose output would result in
> more than a cluster's worth of bytes.  I'm not familiar enough with
> zlib's interface to know how easy or hard this is, and whether merely
> checking error codes is sufficient, nor whether qemu's use of zlib would
> behave correctly in the face of such an error when the output buffer is
> undersized because the data was incompressible.

zlib doc says:

"deflate compresses as much data as possible, and stops when the input 
buffer becomes empty or the output buffer becomes full"

It will not write more then avail_out bytes to out buffer.


>
>
>> ...
>> strm.avail_out = s->cluster_size;
>> strm.next_out = out_buf;
>>
>> ret = deflate(&strm, Z_FINISH);
>> ...
>> out_len = strm.next_out - out_buf;
> You've skipped what is done with ret, which will be different according
> to whether the entire compressed stream fit in the buffer described by
> strm, and that would have to be audited as part of your proposed patch.

ret would be Z_STREAM_END if it fit in and Z_OK if not. (if there are no 
errors ofcourse). What I've skipped? I just say that nobody knows about 
this extra allocation - neither zlib nor other code in this function 
(except g_free=).

>
>
>> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
>> +    out_buf = g_malloc(s->cluster_size);
> Is avoiding the fudge factor really worth it? I don't know that we'll
> get a noticeable performance gain with this patch, and it may be easier
> to leave things alone than to audit that we are correctly handling cases
> where the attempt at compression results in a zlib buffer larger than
> the original data, even when the output buffer size is now constrained
> differently.
>

I'm not insist on applying this patch, I am just trying to understand this.
Vladimir Sementsov-Ogievskiy July 12, 2016, 7:19 p.m. UTC | #4
add Fabrice Bellard

On 12.07.2016 20:43, Vladimir Sementsov-Ogievskiy wrote:
> There are no needs to allocate more than one cluster, as we set
> avail_out for deflate to one cluster.
>
> Signed-off-by: Vladimir Sementsov-Ogievskiy <vsementsov@virtuozzo.com>
> ---
>
> Hi all!
>
> Please, can anybody say me what I'm missing?
>
> I've looked through deflate documentation at
> http://www.zlib.net/manual.html, and I didn't find anything about
> allocating more memory for out buffer than specified in avail_out
> field.. What is this magic formula?
>
> ========
>
> All uses of out_buf in the function:
>
> uint8_t *out_buf;
> ...
> out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> ...
> strm.avail_out = s->cluster_size;
> strm.next_out = out_buf;
>
> ret = deflate(&strm, Z_FINISH);
> ...
> out_len = strm.next_out - out_buf;
> ...
> ret = bdrv_pwrite(bs->file, cluster_offset, out_buf, out_len);
> ...
> g_free(out_buf);
>
>   block/qcow.c  | 2 +-
>   block/qcow2.c | 2 +-
>   2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/block/qcow.c b/block/qcow.c
> index ac849bd..d8826f3 100644
> --- a/block/qcow.c
> +++ b/block/qcow.c
> @@ -983,7 +983,7 @@ static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
>           return ret;
>       }
>   
> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> +    out_buf = g_malloc(s->cluster_size);
>   
>       /* best compression, small window, no zlib header */
>       memset(&strm, 0, sizeof(strm));
> diff --git a/block/qcow2.c b/block/qcow2.c
> index a5ea19b..b1c90ae 100644
> --- a/block/qcow2.c
> +++ b/block/qcow2.c
> @@ -2612,7 +2612,7 @@ static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
>           return ret;
>       }
>   
> -    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
> +    out_buf = g_malloc(s->cluster_size);
>   
>       /* best compression, small window, no zlib header */
>       memset(&strm, 0, sizeof(strm));
Eric Blake July 12, 2016, 8:30 p.m. UTC | #5
On 07/12/2016 01:11 PM, Vladimir Sementsov-Ogievskiy wrote:
> On 12.07.2016 21:43, Eric Blake wrote:
>> On 07/12/2016 11:43 AM, Vladimir Sementsov-Ogievskiy wrote:
>>> There are no needs to allocate more than one cluster, as we set
>>> avail_out for deflate to one cluster.
>>>

>>> ...
>>> strm.avail_out = s->cluster_size;
>>> strm.next_out = out_buf;
>>>
>>> ret = deflate(&strm, Z_FINISH);
>>> ...
>>> out_len = strm.next_out - out_buf;
>> You've skipped what is done with ret, which will be different according
>> to whether the entire compressed stream fit in the buffer described by
>> strm, and that would have to be audited as part of your proposed patch.
> 
> ret would be Z_STREAM_END if it fit in and Z_OK if not. (if there are no
> errors ofcourse). What I've skipped? I just say that nobody knows about
> this extra allocation - neither zlib nor other code in this function
> (except g_free=).

Okay, I've thought about this a bit more, and chatted with John on IRC.
 It looks like the slop is indeed wasted.  And while I don't know that
performance will be noticeably better, I _do_ think you are correct that
readability is easier to understand without the slop.

And who knows - for a malloc() implementation that uses mmap for large
requests, and rounds requests up to page multiples, malloc(64k) may
indeed be a more efficient use of memory than malloc(64k+slop), which
has to burn an entire page for memory that is never touched.

So my end conclusion is that I'd like the commit message to be a bit
more comprehensive (include some of your arguments made in the follow up
messages, such as the fact that we correctly handle Z_STREAM_END vs.
Z_OK in deciding whether to go with a compressed cluster in the first
place), but the idea itself is sane.

I'll give R-b to a v2, but not right now, because I want to make sure
the final commit message is sufficient to avoid another hour of digging
through RFC and zlib documentation when it gets revisited down the road.
diff mbox

Patch

========

All uses of out_buf in the function:

uint8_t *out_buf;
...
out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
...
strm.avail_out = s->cluster_size;
strm.next_out = out_buf;

ret = deflate(&strm, Z_FINISH);
...
out_len = strm.next_out - out_buf;
...
ret = bdrv_pwrite(bs->file, cluster_offset, out_buf, out_len);
...
g_free(out_buf);

 block/qcow.c  | 2 +-
 block/qcow2.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/block/qcow.c b/block/qcow.c
index ac849bd..d8826f3 100644
--- a/block/qcow.c
+++ b/block/qcow.c
@@ -983,7 +983,7 @@  static int qcow_write_compressed(BlockDriverState *bs, int64_t sector_num,
         return ret;
     }
 
-    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
+    out_buf = g_malloc(s->cluster_size);
 
     /* best compression, small window, no zlib header */
     memset(&strm, 0, sizeof(strm));
diff --git a/block/qcow2.c b/block/qcow2.c
index a5ea19b..b1c90ae 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -2612,7 +2612,7 @@  static int qcow2_write_compressed(BlockDriverState *bs, int64_t sector_num,
         return ret;
     }
 
-    out_buf = g_malloc(s->cluster_size + (s->cluster_size / 1000) + 128);
+    out_buf = g_malloc(s->cluster_size);
 
     /* best compression, small window, no zlib header */
     memset(&strm, 0, sizeof(strm));