Patchwork [qemu-img] CPU consuming optimization

login
register
mail settings
Submitter Dmitry Konishchev
Date May 17, 2011, 2:33 p.m.
Message ID <BANLkTinHhx5jTgcH-cytci=fNdk2V7zssg@mail.gmail.com>
Download mbox | patch
Permalink /patch/95943/
State New
Headers show

Comments

Dmitry Konishchev - May 17, 2011, 2:33 p.m.
Hi! I was wondering why qemu-img consumes so much CPU when it converts
one partially allocated qcow2 image to another qcow2 image and I've
written a patch which improves the situation a little.

I have an image:
> $ qemu-img info ubuntu.10.04.qcow2
> image: ubuntu.10.04.qcow2
> file format: qcow2
> virtual size: 20G (21474836480 bytes)
> disk size: 2.2G
> cluster_size: 65536

I create a new copy on write image:
> $ qemu-img create -f qcow2 -o backing_file=ubuntu.10.04.qcow2 volume.qcow2 100G
... and use it for a while.

Then I want to create a non-copy on write image from it to send it to someone:
> qemu-img convert -O qcow2 volume.qcow2 snapshot.qcow2


The last operation consumes a lot of CPU, so I run qemu-img under
profiler and realized, that most of CPU time is consumed by
is_not_zero() function. I had made a couple of optimizations on it and
got the following output for `time qemu-img convert -O qcow2
volume.qcow2 snapshot.qcow2`:

x86_64 machine:

Original qemu-img:
real 0m56.159s
user 0m34.670s
sys  0m12.079s

Patched qemu-img:
real 0m34.805s
user 0m18.445s
sys  0m12.552s


x86 machine:

Original qemu-img:
real 1m13.991s
user 0m24.734s
sys  0m6.604s

Patched qemu-img:
real 1m6.898s
user 0m16.021s
sys  0m6.700s

Please, see on the consumed user CPU time. I think that the
optimization worth it, so it will be awesome if you accept this patch
(see the attachment).

Thanks for your attention.
Stefan Hajnoczi - May 17, 2011, 3:35 p.m.
On Tue, May 17, 2011 at 3:33 PM, Dmitry Konishchev <konishchev@gmail.com> wrote:
> Hi! I was wondering why qemu-img consumes so much CPU when it converts
> one partially allocated qcow2 image to another qcow2 image and I've
> written a patch which improves the situation a little.

Please see http://wiki.qemu.org/Contribute/SubmitAPatch, which asks
that patches are sent inline (not as attachments) for easy review and
that you follow the coding style (see the CODING_STYLE file).  Patches
also need a Signed-off-by: line.

The unrolled loop makes the function rely on len being a multiple of
sizeof(long) * 4, otherwise it accesses beyond the end of sector[].
So for this use case it's okay but the function is generic anymore.

GNU cp(1) tries to detect holes in files and image formats could tell
us about unallocated regions using bdrv_is_allocated().  So I think
there are ways to avoid comparing so much data in the first place, if
you are interested in looking into that.

Stefan
Dmitry Konishchev - May 18, 2011, 6:55 a.m.
On Tue, May 17, 2011 at 7:35 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> Please see http://wiki.qemu.org/Contribute/SubmitAPatch, which asks
> that patches are sent inline (not as attachments) for easy review and
> that you follow the coding style (see the CODING_STYLE file).  Patches
> also need a Signed-off-by: line.
OK, thanks. I'll fix this in case you are willing to accept the patch.

On Tue, May 17, 2011 at 7:35 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> The unrolled loop makes the function rely on len being a multiple of
> sizeof(long) * 4, otherwise it accesses beyond the end of sector[].
> So for this use case it's okay but the function is generic anymore.
Yeah, but this function is static and within the whole file it is used
only for comparing clusters, so I think that we can sacrifice it's
universality for the sake of the performance.

On Tue, May 17, 2011 at 7:35 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> GNU cp(1) tries to detect holes in files and image formats could tell
> us about unallocated regions using bdrv_is_allocated().  So I think
> there are ways to avoid comparing so much data in the first place, if
> you are interested in looking into that.
OK, thanks, I'll look on this function (but sorry, only after 2 weeks,
since I'm on my vacation). But actually I think, that it will be
better to use the both ways, since they can give a boost in different
usecases.

So, if you are agreed with the said above, you can accept this patch
and then I'll write an enchancement for it with bdrv_is_allocated()
because it is going to include this patch.

--
Dmitry Konishchev
mailto:konishchev@gmail.com
Stefan Hajnoczi - May 18, 2011, 7:57 a.m.
On Wed, May 18, 2011 at 7:55 AM, Dmitry Konishchev <konishchev@gmail.com> wrote:
> So, if you are agreed with the said above, you can accept this patch
> and then I'll write an enchancement for it with bdrv_is_allocated()
> because it is going to include this patch.

Yes, optimizing is_not_zero() is good.  The only additional thing I
suggest is adding a comment before the function to document the length
constraint.

Kevin Wolf is CCed, he's the QEMU block layer maintainer and may have
additional thoughts before accepting this patch.

Stefan
Kevin Wolf - May 18, 2011, 8:05 a.m.
Am 18.05.2011 09:57, schrieb Stefan Hajnoczi:
> On Wed, May 18, 2011 at 7:55 AM, Dmitry Konishchev <konishchev@gmail.com> wrote:
>> So, if you are agreed with the said above, you can accept this patch
>> and then I'll write an enchancement for it with bdrv_is_allocated()
>> because it is going to include this patch.
> 
> Yes, optimizing is_not_zero() is good.  The only additional thing I
> suggest is adding a comment before the function to document the length
> constraint.
> 
> Kevin Wolf is CCed, he's the QEMU block layer maintainer and may have
> additional thoughts before accepting this patch.

For this one not really. Except for the coding style it looks good to me.

A future bdrv_is_allocated() patch must make sure that the conversion
falls back to a simple is_not_zero() when a backing file is used.

Kevin

Patch

From 61d228c0ea0d518de48a08577cd6d282e2f97759 Mon Sep 17 00:00:00 2001
From: Dmitry Konishchev <konishchev@gmail.com>
Date: Tue, 17 May 2011 16:29:48 +0400
Subject: [PATCH] is_not_zero() optimization in qemu-img

---
 qemu-img.c |   24 +++++++++++++++++++++---
 1 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/qemu-img.c b/qemu-img.c
index e825123..41b4e32 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -498,12 +498,30 @@  static int img_commit(int argc, char **argv)
 
 static int is_not_zero(const uint8_t *sector, int len)
 {
+    /*
+     * Use long as the biggest available internal data type that fits into the
+     * CPU register and unroll the loop to smooth out the effect of memory
+     * latency.
+     */
+
     int i;
-    len >>= 2;
-    for(i = 0;i < len; i++) {
-        if (((uint32_t *)sector)[i] != 0)
+    len /= sizeof(long);
+
+    long d0;
+    long d1;
+    long d2;
+    long d3;
+
+    for(i = 0; i < len; i += 4) {
+        d0 = ((const long*) sector)[i + 0];
+        d1 = ((const long*) sector)[i + 1];
+        d2 = ((const long*) sector)[i + 2];
+        d3 = ((const long*) sector)[i + 3];
+
+        if (d0 || d1 || d2 || d3)
             return 1;
     }
+
     return 0;
 }
 
-- 
1.7.4.1