Patchwork virtio-blk: Avoid zeroing every request structure

Submitter Stefan Hajnoczi
Date May 14, 2010, 9:52 p.m.
Message ID <1273873950-25756-1-git-send-email-stefanha@linux.vnet.ibm.com>
Permalink /patch/52691/
State New

Comments

Stefan Hajnoczi - May 14, 2010, 9:52 p.m.
The VirtIOBlockRequest structure is about 40 KB in size.  This patch
avoids zeroing every request by only initializing fields that are read.
The other fields are either written to or may not be used at all.

Oprofile shows about 10% of CPU samples in memset called by
virtio_blk_alloc_request().  The workload is
dd if=/dev/vda of=/dev/null iflag=direct bs=8k running concurrently 4
times.  This patch makes memset disappear to the bottom of the profile.

Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>
---
This applies to qemu.git and qemu-kvm.git.

A related change would be a pool of requests to avoid malloc/free for every
single request.  That's a separate change and malloc/free do not show up at the
top of the profile, so I am not introducing a pool yet.

 hw/virtio-blk.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)
Christoph Hellwig - May 16, 2010, 1:25 p.m.
On Fri, May 14, 2010 at 10:52:30PM +0100, Stefan Hajnoczi wrote:
> diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
> index b05d15e..d270225 100644
> --- a/hw/virtio-blk.c
> +++ b/hw/virtio-blk.c
> @@ -105,8 +105,10 @@ static void virtio_blk_flush_complete(void *opaque, int ret)
>  
>  static VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
>  {
> -    VirtIOBlockReq *req = qemu_mallocz(sizeof(*req));
> +    VirtIOBlockReq *req = qemu_malloc(sizeof(*req));
>      req->dev = s;
> +    req->qiov.size = 0;
> +    req->next = NULL;
>      return req;

Looks good, but you shouldn't even need to initialize req->qiov.size, we
do this later by calling qemu_iovec_init_external before using it.
Stefan Hajnoczi - May 16, 2010, 3:01 p.m.
On Sun, May 16, 2010 at 2:25 PM, Christoph Hellwig <hch@lst.de> wrote:
> On Fri, May 14, 2010 at 10:52:30PM +0100, Stefan Hajnoczi wrote:
>> [...]
>
> Looks good, but you shouldn't even need to initialize req->qiov.size, we
> do this later by calling qemu_iovec_init_external before using it.

virtio_blk_req_complete() uses req->qiov.size and may be called from
virtio_blk_handle_flush() or virtio_blk_handle_scsi(), where the qiov is
never initialized.  It's a little ugly that we use the qiov like that.

Stefan
Kevin Wolf - May 18, 2010, 9:36 a.m.
Am 14.05.2010 23:52, schrieb Stefan Hajnoczi:
> The VirtIOBlockRequest structure is about 40 KB in size.  This patch
> avoids zeroing every request by only initializing fields that are read.
> The other fields are either written to or may not be used at all.
> 
> Oprofile shows about 10% of CPU samples in memset called by
> virtio_blk_alloc_request().  The workload is
> dd if=/dev/vda of=/dev/null iflag=direct bs=8k running concurrently 4
> times.  This patch makes memset disappear to the bottom of the profile.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Thanks, applied to the block branch.

Kevin
Jes Sorensen - May 18, 2010, 3:46 p.m.
On 05/14/10 23:52, Stefan Hajnoczi wrote:
> The VirtIOBlockRequest structure is about 40 KB in size.  This patch
> avoids zeroing every request by only initializing fields that are read.
> The other fields are either written to or may not be used at all.
> 
> Oprofile shows about 10% of CPU samples in memset called by
> virtio_blk_alloc_request().  The workload is
> dd if=/dev/vda of=/dev/null iflag=direct bs=8k running concurrently 4
> times.  This patch makes memset disappear to the bottom of the profile.
> 
> Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com>

Great catch!

I ran some benchmarks using a ramdisk passed to the guest as a virtio
device, and with this patch I saw improvements ranging from 5-20%. I
believe the fluctuations are due to not being able to NUMA-bind it
because of limited memory.

However a win all the way round!

Acked-by: Jes Sorensen <Jes.Sorensen@redhat.com>

Jes
Alexander Graf - May 18, 2010, 4:26 p.m.
Jes Sorensen wrote:
> [...]
>
> Great catch!
>
> I ran some benchmarks using a ramdisk passed to the guest as a virtio
> device and with this patch I saw improvements ranging from 5-20%. I
> believe the fluctuations are due to not being able to numa bind it due
> to limited memory.
>
> However a win all the way round!
>   

It looks like a fairly small change with a huge win. Sounds like a
perfect candidate for 0.12.5 to me.

Alex
Anthony Liguori - May 18, 2010, 4:29 p.m.
On 05/18/2010 11:26 AM, Alexander Graf wrote:
> Jes Sorensen wrote:
>> [...]
> It looks like a fairly small change with a huge win. Sounds like a
> perfect candidate for 0.12.5 to me.

I'd prefer to stick to bug fixes for stable releases.  Performance 
improvements are a good motivation for people to upgrade to 0.13 :-)

Regards,

Anthony Liguori

> Alex
Alexander Graf - May 18, 2010, 4:31 p.m.
Anthony Liguori wrote:
> On 05/18/2010 11:26 AM, Alexander Graf wrote:
>> [...]
>> It looks like a fairly small change with a huge win. Sounds like a
>> perfect candidate for 0.12.5 to me.
>
> I'd prefer to stick to bug fixes for stable releases.  Performance
> improvements are a good motivation for people to upgrade to 0.13 :-)

In general I agree, but this one looks like a really simple one.

Alex
Corentin Chary - May 18, 2010, 4:37 p.m.
On Fri, May 14, 2010 at 11:52 PM, Stefan Hajnoczi
<stefanha@linux.vnet.ibm.com> wrote:
> The VirtIOBlockRequest structure is about 40 KB in size.  This patch
> avoids zeroing every request by only initializing fields that are read.
> The other fields are either written to or may not be used at all.
>
> Oprofile shows about 10% of CPU samples in memset called by
> virtio_blk_alloc_request().  The workload is
> dd if=/dev/vda of=/dev/null iflag=direct bs=8k running concurrently 4
> times.  This patch makes memset disappear to the bottom of the profile.
>

Did you try to profile using calloc in qemu_mallocz instead of malloc + memset?

$ man calloc
calloc()  allocates memory for an array of nmemb elements of size
bytes each and returns a pointer to the allocated memory.  __The
memory is set to zero.__
Stefan Hajnoczi - May 18, 2010, 7:02 p.m.
On Tue, May 18, 2010 at 5:37 PM, Corentin Chary
<corentin.chary@gmail.com> wrote:
> Did you try to profile using calloc in qemu_mallocz instead of malloc + memset ?

No, I didn't try it.  I don't see how it could perform on par with not
clearing memory at all.  Something is going to have to clear that
memory when using memset() or calloc().  Either libc or the kernel (if
using mmap).

Stefan
Corentin Chary - May 18, 2010, 7:55 p.m.
On Tue, May 18, 2010 at 9:02 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Tue, May 18, 2010 at 5:37 PM, Corentin Chary
> <corentin.chary@gmail.com> wrote:
>> Did you try to profile using calloc in qemu_mallocz instead of malloc + memset ?
>
> No, I didn't try it.  I don't see how it could perform on par with not
> clearing memory at all.  Something is going to have to clear that
> memory when using memset() or calloc().  Either libc or the kernel (if
> using mmap).

I believe that if the allocation size is large enough, getting a
zeroed page can be almost free with clever memory management.
Could you try to re-run your test with calloc and see what it does ?
Speeding up all mallocz calls is probably a good idea :)
Stefan Hajnoczi - May 18, 2010, 8:43 p.m.
On Tue, May 18, 2010 at 8:55 PM, Corentin Chary
<corentin.chary@gmail.com> wrote:
> I believe that if the allocation size is large enough, getting a
> zeroed page can be almost free with clever memory management.
> Could you try to re-run your test with calloc and see what it does ?
> Speeding up all mallocz calls is probably a good idea :)

I think that scenario is libc using mmap to grab zeroed pages from the
kernel.  If the kernel manages zeroed pages so that zeroing is
performed out-of-line (e.g. there is a pool of zeroed pages waiting),
then it appears that calloc() is getting zeroed memory cheaply.

I'll rerun with profiling tomorrow to see if calloc() makes a
difference for general qemu_mallocz() usage.

In the case of virtio-blk requests, we definitely shouldn't be
clearing that memory even if calloc() were cheaper.  Much of the
request structure will not be touched in an average I/O request.
Using zeroed pages is wasteful because they aren't needed here.

Stefan
Stefan Hajnoczi - May 21, 2010, 11:17 a.m.
On Tue, May 18, 2010 at 9:43 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> I'll rerun with profiling tomorrow to see if calloc() makes a
> difference for general qemu_mallocz() usage.

The results are unchanged for direct calloc() instead of
malloc+memset.  The memset() symbol is still at the top of the profile
because glibc is using it internally for calloc().

In theory calloc() allows libc to do optimizations, so I think changing
qemu_mallocz() to use calloc() is reasonable.
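A minimal sketch of what such a change could look like (simplified; this is not the exact code of QEMU's qemu-malloc.c, and the qemu_oom_check() helper shown here is modeled on QEMU's abort-on-OOM convention):

```c
/* Sketch: qemu_mallocz() delegating to calloc() so libc can hand back
 * pre-zeroed pages instead of always running an explicit memset. */
#include <stdlib.h>

static void *qemu_oom_check(void *ptr)
{
    if (!ptr)
        abort();   /* allocation failure is treated as fatal */
    return ptr;
}

void *qemu_mallocz(size_t size)
{
    /* calloc returns zeroed memory; for fresh mmap-backed allocations
     * the kernel already zeroed the pages, so no extra memset runs. */
    return qemu_oom_check(calloc(1, size));
}
```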

Stefan
Richard Henderson - May 21, 2010, 5:37 p.m.
I was reminded of this by the discussion on the list recently of
using calloc for qemu_mallocz.  I'd vaguely remembered that I
already had a patch that did this.

The second patch is a cleanup enabled by Paul's patch:
  2e9a5713f0567fffaa3518f495b8d16a2b74f84a
which removed PAGE_RESERVED.


r~


Richard Henderson (2):
  Use calloc in qemu_mallocz.
  linux-user: Use qemu-malloc.c.

 Makefile.target   |    3 ++-
 linux-user/mmap.c |   52 ----------------------------------------------------
 qemu-malloc.c     |    8 ++++----
 3 files changed, 6 insertions(+), 57 deletions(-)
Aurelien Jarno - May 28, 2010, 10:08 p.m.
On Fri, May 21, 2010 at 10:37:50AM -0700, Richard Henderson wrote:
> I was reminded of this by the discussion on the list recently of
> using calloc for qemu_mallocz.  I'd vaguely remembered that I
> already had a patch that did this.
> 
> The second patch is a cleanup enabled by Paul's patch:
>   2e9a5713f0567fffaa3518f495b8d16a2b74f84a
> which removed PAGE_RESERVED.
> 
> 

Thanks, both applied.

> 
> Richard Henderson (2):
>   Use calloc in qemu_mallocz.
>   linux-user: Use qemu-malloc.c.
> 
>  Makefile.target   |    3 ++-
>  linux-user/mmap.c |   52 ----------------------------------------------------
>  qemu-malloc.c     |    8 ++++----
>  3 files changed, 6 insertions(+), 57 deletions(-)
Jamie Lokier - May 29, 2010, 1:07 p.m.
Alexander Graf wrote:
> Anthony Liguori wrote:
> > I'd prefer to stick to bug fixes for stable releases.  Performance
> > improvements are a good motivation for people to upgrade to 0.13 :-)
> 
> In general I agree, but this one looks like a really simple one.

Besides, there are too many reported guest regressions at the moment
to upgrade if you are affected by any of them.

-- Jamie

Patch

diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index b05d15e..d270225 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -105,8 +105,10 @@  static void virtio_blk_flush_complete(void *opaque, int ret)
 
 static VirtIOBlockReq *virtio_blk_alloc_request(VirtIOBlock *s)
 {
-    VirtIOBlockReq *req = qemu_mallocz(sizeof(*req));
+    VirtIOBlockReq *req = qemu_malloc(sizeof(*req));
     req->dev = s;
+    req->qiov.size = 0;
+    req->next = NULL;
     return req;
 }