Patchwork Re: [PATCH v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

Submitter OHMURA Kei
Date Feb. 15, 2010, 6:12 a.m.
Message ID <4B78E5C5.80802@lab.ntt.co.jp>
Permalink /patch/45346/
State New

Comments

OHMURA Kei - Feb. 15, 2010, 6:12 a.m.
dirty-bitmap traveling is carried out byte by byte in qemu-kvm.c.
But we think that traveling the dirty bitmap one long at a time is faster than
byte by byte, especially when most of the memory is not dirty.

Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 bswap.h    |    2 ++
 qemu-kvm.c |   31 ++++++++++++++++---------------
 2 files changed, 18 insertions(+), 15 deletions(-)
Alexander Graf - Feb. 15, 2010, 8:24 a.m.
On 15.02.2010, at 07:12, OHMURA Kei wrote:

> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
> But We think that dirty-bitmap-traveling by long size is faster than by byte

"We think"? I mean - yes, I think so too. But have you actually measured it? How much improvement are we talking here?
Is it still faster when a bswap is involved?

Alex
OHMURA Kei - Feb. 16, 2010, 11:16 a.m.
> "We think"? I mean - yes, I think so too. But have you actually measured it?
> How much improvement are we talking here?
> Is it still faster when a bswap is involved?

Thanks for pointing that out.
I will post the data for x86 later.
However, I don't have a test environment to check the impact of bswap.
Would you please measure the runtime of the following section if possible?

start ->
qemu-kvm.c:

static int kvm_get_dirty_bitmap_cb(unsigned long start, unsigned long len,
                                   void *bitmap, void *opaque)
{
    int64_t t1, t2;

    /* warm up each function */
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);

    /* measurement */
    t1 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range(start, bitmap, start, len);
    t1 = cpu_get_real_ticks() - t1;
    t2 = cpu_get_real_ticks();
    kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
    t2 = cpu_get_real_ticks() - t2;

    printf("## %" PRId64 ", %" PRId64 "\n", t1, t2);
    fflush(stdout);

    return kvm_get_dirty_pages_log_range_new(start, bitmap, start, len);
}
end ->
Alexander Graf - Feb. 16, 2010, 11:18 a.m.
On 16.02.2010, at 12:16, OHMURA Kei wrote:

>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>> How much improvement are we talking here?
>> Is it still faster when a bswap is involved?
> 
> Thanks for pointing out.
> I will post the data for x86 later.
> However, I don't have a test environment to check the impact of bswap.
> Would you please measure the run time between the following section if possible?

It'd make more sense to have a real standalone test program, no?
I can try to write one today, but I have some really nasty important bugs to fix first.


Alex
OHMURA Kei - Feb. 17, 2010, 9:42 a.m.
>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>> How much improvement are we talking here?
>>> Is it still faster when a bswap is involved?
>> Thanks for pointing out.
>> I will post the data for x86 later.
>> However, I don't have a test environment to check the impact of bswap.
>> Would you please measure the run time between the following section if possible?
> 
> It'd make more sense to have a real stand alone test program, no?
> I can try to write one today, but I have some really nasty important bugs to fix first.


OK.  I will prepare a test program with sample data.
Since I have a ppc machine around, I will run the code and post the results
for x86 and ppc.


By the way, the following data is a result for x86 measured in QEMU/KVM.

This data shows how many times the function was called (#called), the runtime
of the original function (orig.), the runtime with this patch (patch), and the
speedup ratio (ratio).

Test1: Guest OS reads a 3GB file, which is bigger than memory.
#called     orig.(msec)     patch(msec)     ratio
108         1.1             0.1             7.6
102         1.0             0.1             6.8
132         1.6             0.2             7.1
 
Test2: Guest OS reads/writes a 3GB file, which is bigger than memory.
#called     orig.(msec)     patch(msec)     ratio
2394        33              7.7             4.3
2100        29              7.1             4.1
2832        40              9.9             4.0
Alexander Graf - Feb. 17, 2010, 9:46 a.m.
On 17.02.2010, at 10:42, OHMURA Kei wrote:

>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time between the following section if possible?
>> It'd make more sense to have a real stand alone test program, no?
>> I can try to write one today, but I have some really nasty important bugs to fix first.
> 
> 
> OK.  I will prepare a test code with sample data.  Since I found a ppc machine around, I will run the code and post the results of
> x86 and ppc.
> 
> 
> By the way, the following data is a result of x86 measured in QEMU/KVM.  
> This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio).

That does indeed look promising!

Thanks for doing this micro-benchmark. I just want to be 100% sure that it doesn't affect performance for big endian badly.


Alex
Avi Kivity - Feb. 17, 2010, 9:47 a.m.
On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>> "We think"? I mean - yes, I think so too. But have you actually 
>>>> measured it?
>>>> How much improvement are we talking here?
>>>> Is it still faster when a bswap is involved?
>>> Thanks for pointing out.
>>> I will post the data for x86 later.
>>> However, I don't have a test environment to check the impact of bswap.
>>> Would you please measure the run time between the following section 
>>> if possible?
>>
>> It'd make more sense to have a real stand alone test program, no?
>> I can try to write one today, but I have some really nasty important 
>> bugs to fix first.
>
>
> OK.  I will prepare a test code with sample data.  Since I found a ppc 
> machine around, I will run the code and post the results of
> x86 and ppc.
>

I've applied the patch - I think the x86 results justify it, and I'll be 
very surprised if ppc doesn't show a similar gain.  Skipping 7 memory 
accesses and 7 tests must be a win.
Alexander Graf - Feb. 17, 2010, 9:49 a.m.
On 17.02.2010, at 10:47, Avi Kivity wrote:

> On 02/17/2010 11:42 AM, OHMURA Kei wrote:
>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>>> How much improvement are we talking here?
>>>>> Is it still faster when a bswap is involved?
>>>> Thanks for pointing out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time between the following section if possible?
>>> 
>>> It'd make more sense to have a real stand alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs to fix first.
>> 
>> 
>> OK.  I will prepare a test code with sample data.  Since I found a ppc machine around, I will run the code and post the results of
>> x86 and ppc.
>> 
> 
> I've applied the patch - I think the x86 results justify it, and I'll be very surprised if ppc doesn't show a similar gain.  Skipping 7 memory accesses and 7 tests must be a win.

Sounds good to me. I don't assume bswap to be horribly slow either. Just want to be sure.


Alex
OHMURA Kei - Feb. 18, 2010, 5:57 a.m.
>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>>> How much improvement are we talking here?
>>>>> Is it still faster when a bswap is involved?
>>>> Thanks for pointing out.
>>>> I will post the data for x86 later.
>>>> However, I don't have a test environment to check the impact of bswap.
>>>> Would you please measure the run time between the following section if possible?
>>> It'd make more sense to have a real stand alone test program, no?
>>> I can try to write one today, but I have some really nasty important bugs to fix first.
>>
>> OK.  I will prepare a test code with sample data.  Since I found a ppc machine around, I will run the code and post the results of
>> x86 and ppc.
>>
>>
>> By the way, the following data is a result of x86 measured in QEMU/KVM.  
>> This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio).
> 
> That does indeed look promising!
> 
> Thanks for doing this micro-benchmark. I just want to be 100% sure that it doesn't affect performance for big endian badly.


I measured the runtime of the test code with sample data.  My test environment
and results are described below.

x86 Test Environment:
CPU: 4x Intel Xeon Quad Core 2.66GHz
Mem size: 6GB

ppc Test Environment:
CPU: 2x Dual Core PPC970MP
Mem size: 2GB

The sample dirty-bitmap data was produced by QEMU/KVM while the guest OS was
being live-migrated.  To measure the runtime I copied cpu_get_real_ticks()
from QEMU into my test program.


Experimental results:
Test1: Guest OS reads a 3GB file, which is bigger than memory.
       orig.(msec)    patch(msec)    ratio
x86    0.3            0.1            6.4 
ppc    7.9            2.7            3.0 

Test2: Guest OS reads/writes a 3GB file, which is bigger than memory.
       orig.(msec)    patch(msec)    ratio
x86    12.0           3.2            3.7 
ppc    251.1          123            2.0 


I also measured the runtime of bswap itself on ppc, and found it was only
0.3% to 0.7% of the runtimes described above.
Alexander Graf - Feb. 18, 2010, 10:30 a.m.
On 18.02.2010, at 06:57, OHMURA Kei wrote:

>>>>>> "We think"? I mean - yes, I think so too. But have you actually measured it?
>>>>>> How much improvement are we talking here?
>>>>>> Is it still faster when a bswap is involved?
>>>>> Thanks for pointing out.
>>>>> I will post the data for x86 later.
>>>>> However, I don't have a test environment to check the impact of bswap.
>>>>> Would you please measure the run time between the following section if possible?
>>>> It'd make more sense to have a real stand alone test program, no?
>>>> I can try to write one today, but I have some really nasty important bugs to fix first.
>>> 
>>> OK.  I will prepare a test code with sample data.  Since I found a ppc machine around, I will run the code and post the results of
>>> x86 and ppc.
>>> 
>>> 
>>> By the way, the following data is a result of x86 measured in QEMU/KVM.  This data shows, how many times the function is called (#called), runtime of original function(orig.), runtime of this patch(patch), speedup ratio (ratio).
>> That does indeed look promising!
>> Thanks for doing this micro-benchmark. I just want to be 100% sure that it doesn't affect performance for big endian badly.
> 
> 
> I measured runtime of the test code with sample data.  My test environment and results are described below.
> 
> x86 Test Environment:
> CPU: 4x Intel Xeon Quad Core 2.66GHz
> Mem size: 6GB
> 
> ppc Test Environment:
> CPU: 2x Dual Core PPC970MP
> Mem size: 2GB
> 
> The sample data of dirty bitmap was produced by QEMU/KVM while the guest OS
> was live migrating.  To measure the runtime I copied cpu_get_real_ticks() of
> QEMU to my test program.
> 
> 
> Experimental results:
> Test1: Guest OS read 3GB file, which is bigger than memory.
>        orig.(msec)    patch(msec)    ratio
> x86    0.3            0.1            6.4
> ppc    7.9            2.7            3.0
>
> Test2: Guest OS read/write 3GB file, which is bigger than memory.
>        orig.(msec)    patch(msec)    ratio
> x86    12.0           3.2            3.7
> ppc    251.1          123            2.0
> 
> I also measured the runtime of bswap itself on ppc, and I found it was only just 0.3% ~ 0.7 % of the runtime described above. 

Awesome! Thank you so much for giving actual data to make me feel comfortable with it :-).


Alex

Patch

diff --git a/bswap.h b/bswap.h
index 4558704..1f87e6d 100644
--- a/bswap.h
+++ b/bswap.h
@@ -205,8 +205,10 @@  static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
 
 #ifdef HOST_WORDS_BIGENDIAN
 #define cpu_to_32wu cpu_to_be32wu
+#define leul_to_cpu(v) le ## HOST_LONG_BITS ## _to_cpu(v)
 #else
 #define cpu_to_32wu cpu_to_le32wu
+#define leul_to_cpu(v) (v)
 #endif
 
 #undef le_bswap
diff --git a/qemu-kvm.c b/qemu-kvm.c
index a305907..6952aa5 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2434,31 +2434,32 @@  int kvm_physical_memory_set_dirty_tracking(int enable)
 
 /* get kvm's dirty pages bitmap and update qemu's */
 static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
-                                         unsigned char *bitmap,
+                                         unsigned long *bitmap,
                                          unsigned long offset,
                                          unsigned long mem_size)
 {
-    unsigned int i, j, n = 0;
-    unsigned char c;
-    unsigned long page_number, addr, addr1;
+    unsigned int i, j;
+    unsigned long page_number, addr, addr1, c;
     ram_addr_t ram_addr;
-    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + 7) / 8;
+    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
+        HOST_LONG_BITS;
 
     /* 
      * bitmap-traveling is faster than memory-traveling (for addr...) 
      * especially when most of the memory is not dirty.
      */
     for (i = 0; i < len; i++) {
-        c = bitmap[i];
-        while (c > 0) {
-            j = ffsl(c) - 1;
-            c &= ~(1u << j);
-            page_number = i * 8 + j;
-            addr1 = page_number * TARGET_PAGE_SIZE;
-            addr = offset + addr1;
-            ram_addr = cpu_get_physical_page_desc(addr);
-            cpu_physical_memory_set_dirty(ram_addr);
-            n++;
+        if (bitmap[i] != 0) {
+            c = leul_to_cpu(bitmap[i]);
+            do {
+                j = ffsl(c) - 1;
+                c &= ~(1ul << j);
+                page_number = i * HOST_LONG_BITS + j;
+                addr1 = page_number * TARGET_PAGE_SIZE;
+                addr = offset + addr1;
+                ram_addr = cpu_get_physical_page_desc(addr);
+                cpu_physical_memory_set_dirty(ram_addr);
+            } while (c != 0);
         }
     }
     return 0;