[v2] qemu-kvm: Speed up of the dirty-bitmap-traveling

Message ID 4B728FF9.6010707@lab.ntt.co.jp
State New

Commit Message

OHMURA Kei Feb. 10, 2010, 10:52 a.m. UTC
dirty-bitmap-traveling is carried out byte by byte in qemu-kvm.c.
But we think that traveling the dirty bitmap long by long is faster than byte
by byte, especially when most of the memory is not dirty.

Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
 bswap.h    |    1 -
 qemu-kvm.c |   30 ++++++++++++++++--------------
 2 files changed, 16 insertions(+), 15 deletions(-)

Comments

Ulrich Drepper Feb. 10, 2010, 1:10 p.m. UTC | #1

On 02/10/2010 02:52 AM, OHMURA Kei wrote:

>      for (i = 0; i < len; i++) {
> -        c = bitmap[i];
> -        while (c > 0) {
> -            j = ffsl(c) - 1;
> -            c &= ~(1u << j);
> -            page_number = i * 8 + j;
> -            addr1 = page_number * TARGET_PAGE_SIZE;
> -            addr = offset + addr1;
> -            ram_addr = cpu_get_physical_page_desc(addr);
> -            cpu_physical_memory_set_dirty(ram_addr);
> -            n++;
> +        if (bitmap_ul[i] != 0) {
> +            c = le_bswap(bitmap_ul[i], HOST_LONG_BITS);
> +            while (c > 0) {
> +                j = ffsl(c) - 1;
> +                c &= ~(1ul << j);
> +                page_number = i * HOST_LONG_BITS + j;
> +                addr1 = page_number * TARGET_PAGE_SIZE;
> +                addr = offset + addr1;
> +                ram_addr = cpu_get_physical_page_desc(addr);
> +                cpu_physical_memory_set_dirty(ram_addr);
> +            }

If you're optimizing this code you might want to go all the way.  The
compiler might not see through the bswap call and may create unnecessary
data dependencies, which is especially problematic if the bitmap is really
sparse.  Also, the outer test is != while the inner test is >; be
consistent.  I suggest replacing the inner loop with

      do {
        ...
      } while (c != 0);

Depending on how sparse the bitmap is populated this might reduce the
number of data dependencies quite a bit.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
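
For reference, a sketch of the inner loop with this do/while applied (an
illustration only, not part of the posted patch); the enclosing non-zero
test guarantees c is non-zero on entry, so the condition only needs to be
checked at the bottom:

    if (bitmap_ul[i] != 0) {
        c = le_bswap(bitmap_ul[i], HOST_LONG_BITS);
        do {
            j = ffsl(c) - 1;
            c &= ~(1ul << j);
            page_number = i * HOST_LONG_BITS + j;
            addr1 = page_number * TARGET_PAGE_SIZE;
            addr = offset + addr1;
            ram_addr = cpu_get_physical_page_desc(addr);
            cpu_physical_memory_set_dirty(ram_addr);
        } while (c != 0);
    }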
Avi Kivity Feb. 10, 2010, 1:20 p.m. UTC | #2
On 02/10/2010 12:52 PM, OHMURA Kei wrote:
> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
> But We think that dirty-bitmap-traveling by long size is faster than by byte
> size especially when most of memory is not dirty.
>
> --- a/bswap.h
> +++ b/bswap.h
> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>  #define cpu_to_32wu cpu_to_le32wu
>  #endif
>  
> -#undef le_bswap
>  #undef be_bswap
>  #undef le_bswaps
>   


Anthony, is it okay to export le_bswap this way, or will you want
leul_to_cpu()?
Anthony Liguori Feb. 10, 2010, 3:54 p.m. UTC | #3
On 02/10/2010 07:20 AM, Avi Kivity wrote:
> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>   
>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>> But We think that dirty-bitmap-traveling by long size is faster than by byte
>> size especially when most of memory is not dirty.
>>
>> --- a/bswap.h
>> +++ b/bswap.h
>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>>  #define cpu_to_32wu cpu_to_le32wu
>>  #endif
>>  
>> -#undef le_bswap
>>  #undef be_bswap
>>  #undef le_bswaps
>>   
>>     
>
> Anthony, is it okay to export le_bswap this way, or will you want
> leul_to_cpu()?
>   

kvm_get_dirty_pages_log_range() is kvm-specific code. We're guaranteed
that when we're using kvm, target byte order == host byte order.

So is it really necessary to use a byte swapping function at all?

Regards,

Anthony Liguori
Anthony Liguori Feb. 10, 2010, 3:55 p.m. UTC | #4
On 02/10/2010 07:20 AM, Avi Kivity wrote:
> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>   
>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>> But We think that dirty-bitmap-traveling by long size is faster than by byte
>> size especially when most of memory is not dirty.
>>
>> --- a/bswap.h
>> +++ b/bswap.h
>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>>  #define cpu_to_32wu cpu_to_le32wu
>>  #endif
>>  
>> -#undef le_bswap
>>  #undef be_bswap
>>  #undef le_bswaps
>>   
>>     
>
> Anthony, is it okay to export le_bswap this way, or will you want
> leul_to_cpu()?
>   

Oh, I see what's happening here. Yes, I think a leul_to_cpu() makes more
sense.

Regards,

Anthony Liguori
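
A minimal sketch of what such a leul_to_cpu() helper might look like in
bswap.h (the name comes from the thread, but the exact form and placement
below are assumptions, not taken from the posted patch). Building it on the
existing le32_to_cpu()/le64_to_cpu() helpers would let le_bswap() stay
private to bswap.h:

    /* Hypothetical helper, for illustration only: convert one
     * host-long-sized little-endian bitmap word to host order. */
    #if HOST_LONG_BITS == 64
    #define leul_to_cpu(v) le64_to_cpu(v)
    #else
    #define leul_to_cpu(v) le32_to_cpu(v)
    #endif

The caller in kvm_get_dirty_pages_log_range() would then read
c = leul_to_cpu(bitmap_ul[i]); and the "#undef le_bswap" in bswap.h could
stay as it is.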
Avi Kivity Feb. 10, 2010, 3:57 p.m. UTC | #5
On 02/10/2010 05:54 PM, Anthony Liguori wrote:
> On 02/10/2010 07:20 AM, Avi Kivity wrote:
>   
>> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>>   
>>     
>>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>>> But We think that dirty-bitmap-traveling by long size is faster than by byte
>>> size especially when most of memory is not dirty.
>>>
>>> --- a/bswap.h
>>> +++ b/bswap.h
>>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>>>  #define cpu_to_32wu cpu_to_le32wu
>>>  #endif
>>>  
>>> -#undef le_bswap
>>>  #undef be_bswap
>>>  #undef le_bswaps
>>>   
>>>     
>>>       
>> Anthony, is it okay to export le_bswap this way, or will you want
>> leul_to_cpu()?
>>   
>>     
> kvm_get_dirty_pages_log_range() is kvm-specific code. We're guaranteed
> that when we're using kvm, target byte order == host byte order.
>
> So is it really necessary to use a byte swapping function at all?
>   

The dirty log bitmap is always little endian. This is so we don't have
to depend on sizeof(long) (which can vary between kernel and userspace)
or mandate some other access size.

(if native endian worked, then the previous byte-based code would have
been broken on big endian).

Seriously, those who say that big vs little endian is a matter of taste
are missing something.
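
To make the access-size independence concrete, here is a small standalone
illustration (a sketch, not from the thread): walking the same little-endian
bitmap byte by byte or long by long reports the same dirty pages, as long as
each long is converted from little endian first.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Little-endian dirty bitmap with pages 0, 9 and 63 set. */
        const uint8_t bitmap[8] = { 0x01, 0x02, 0, 0, 0, 0, 0, 0x80 };

        /* Byte-wise walk: bit j of byte i is page i * 8 + j. */
        for (int i = 0; i < 8; i++)
            for (int j = 0; j < 8; j++)
                if (bitmap[i] & (1u << j))
                    printf("byte-wise: page %d\n", i * 8 + j);

        /* Long-wise walk: assemble the 64-bit word as little endian
         * (what a leul_to_cpu() would do on a 64-bit host), then
         * bit j is page j.  The same pages come out. */
        uint64_t w = 0;
        for (int i = 0; i < 8; i++)
            w |= (uint64_t)bitmap[i] << (8 * i);
        for (int j = 0; j < 64; j++)
            if (w & (UINT64_C(1) << j))
                printf("long-wise: page %d\n", j);
        return 0;
    }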
Alexander Graf Feb. 10, 2010, 4 p.m. UTC | #6
Anthony Liguori wrote:
> On 02/10/2010 07:20 AM, Avi Kivity wrote:
>   
>> On 02/10/2010 12:52 PM, OHMURA Kei wrote:
>>   
>>     
>>> dirty-bitmap-traveling is carried out by byte size in qemu-kvm.c.
>>> But We think that dirty-bitmap-traveling by long size is faster than by byte
>>> size especially when most of memory is not dirty.
>>>
>>> --- a/bswap.h
>>> +++ b/bswap.h
>>> @@ -209,7 +209,6 @@ static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
>>>  #define cpu_to_32wu cpu_to_le32wu
>>>  #endif
>>>  
>>> -#undef le_bswap
>>>  #undef be_bswap
>>>  #undef le_bswaps
>>>   
>>>     
>>>       
>> Anthony, is it okay to export le_bswap this way, or will you want
>> leul_to_cpu()?
>>   
>>     
>
> kvm_get_dirty_pages_log_range() is kvm-specific code. We're guaranteed
> that when we're using kvm, target byte order == host byte order.
>
> So is it really necessary to use a byte swapping function at all?
>   

On PPC the bitmap is Little Endian.


Alex
Anthony Liguori Feb. 10, 2010, 4:35 p.m. UTC | #7
On 02/10/2010 10:00 AM, Alexander Graf wrote:
> On PPC the bitmap is Little Endian.
>   

Out of curiosity, why? It seems like an odd interface.

Regards,

Anthony Liguori
Alexander Graf Feb. 10, 2010, 4:43 p.m. UTC | #8
Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>   
>> On PPC the bitmap is Little Endian.
>>   
>>     
>
> Out of curiousity, why? It seems like an odd interface.
>   

Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
Unlike with x86, there's no real benefit in using 64 bit userspace.

So, thanks to the nature of big endianness, that breaks our set_bit
helpers, because they assume the bits are stored in "long"-sized words.
While that's no real issue on little endian, where the next 32-bit word is
just the high half of a u64, it messes everything up on PPC.

For more details, please look in the archives for my patches to make
the bitmap little endian.


Alex
Avi Kivity Feb. 10, 2010, 4:43 p.m. UTC | #9
On 02/10/2010 06:35 PM, Anthony Liguori wrote:
> On 02/10/2010 10:00 AM, Alexander Graf wrote:
>   
>> On PPC the bitmap is Little Endian.
>>   
>>     
> Out of curiousity, why? It seems like an odd interface.
>
>   

Exactly this issue. If you specify it as unsigned long native endian,
there is ambiguity between 32-bit and 64-bit userspace. If you specify
it as uint64_t native endian, you have an inefficient implementation on
32-bit userspace. So we went for native-endian unsigned bytes, which is
the same as little endian at any access size.

(well I think the real reason is that it just grew that way out of x86,
but the above is quite plausible).
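
A small illustration of that ambiguity (a sketch, not from the thread),
simulating the big-endian stores explicitly: with a native-endian long
bitmap on a big-endian host, the byte that holds "page 0 is dirty" depends
on whether the words are 32-bit or 64-bit longs, whereas with the
little-endian byte definition it is always byte 0.

    #include <stdint.h>
    #include <stdio.h>

    /* Store v into buf most-significant byte first (big endian). */
    static void store_be(uint8_t *buf, uint64_t v, int nbytes)
    {
        for (int i = 0; i < nbytes; i++)
            buf[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
    }

    int main(void)
    {
        uint8_t as_u64[8] = { 0 }, as_u32[8] = { 0 };

        /* "Page 0 dirty" = bit 0 set in the first word of the bitmap. */
        store_be(as_u64, 1, 8);      /* one 64-bit big-endian long    */
        store_be(as_u32, 1, 4);      /* first of two 32-bit BE longs  */

        for (int i = 0; i < 8; i++) {
            if (as_u64[i])
                printf("64-bit longs: page 0 lands in byte %d\n", i); /* 7 */
            if (as_u32[i])
                printf("32-bit longs: page 0 lands in byte %d\n", i); /* 3 */
        }
        return 0;
    }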
Avi Kivity Feb. 10, 2010, 4:46 p.m. UTC | #10
On 02/10/2010 06:43 PM, Alexander Graf wrote:
>
>> Out of curiousity, why? It seems like an odd interface.
>>   
>>     
> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
> Unlike with x86, there's no real benefit in using 64 bit userspace.
>   

btw, does 32-bit ppc qemu support large memory guests? It doesn't on
x86, and I don't remember any hacks to support large memory guests
elsewhere.
Alexander Graf Feb. 10, 2010, 4:47 p.m. UTC | #11
Avi Kivity wrote:
> On 02/10/2010 06:43 PM, Alexander Graf wrote:
>   
>>> Out of curiousity, why? It seems like an odd interface.
>>>   
>>>     
>>>       
>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>   
>>     
>
> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
> x86, and I don't remember any hacks to support large memory guests
> elsewhere.
>   


It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
GB anyway, because that needs an iommu implementation.


Alex
Avi Kivity Feb. 10, 2010, 4:52 p.m. UTC | #12
On 02/10/2010 06:47 PM, Alexander Graf wrote:
>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>>   
>>>     
>>>       
>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>> x86, and I don't remember any hacks to support large memory guests
>> elsewhere.
>>   
>>     
>
> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
> GB anyways, because that needs an iommu implementation.
>   

Oh, so you may want to revisit the "there's no real benefit in using 64
bit userspace".

Seriously, that looks like a big deficiency. What would it take to
implement an iommu?

I imagine Anthony's latest patches are a first step in that journey.
Alexander Graf Feb. 10, 2010, 4:54 p.m. UTC | #13
Avi Kivity wrote:
> On 02/10/2010 06:47 PM, Alexander Graf wrote:
>   
>>>> Because on PPC, you usually run PPC32 userspace code on a PPC64 kernel.
>>>> Unlike with x86, there's no real benefit in using 64 bit userspace.
>>>>   
>>>>     
>>>>       
>>>>         
>>> btw, does 32-bit ppc qemu support large memory guests? It doesn't on
>>> x86, and I don't remember any hacks to support large memory guests
>>> elsewhere.
>>>   
>>>     
>>>       
>> It doesn't :-). In fact, the guest we virtualize wouldn't work with > 2
>> GB anyways, because that needs an iommu implementation.
>>   
>>     
>
> Oh, so you may want to revisit the "there's no real benefit in using 64
> bit userspace".
>   

Well, for normal users there isn't one. SLES11 is 64-bit only, so we're good
there. But openSUSE uses a 32-bit userland.

> Seriously, that looks like a big deficiency. What would it take to
> implement an iommu?
>
> I imagine Anthony's latest patches are a first step in that journey.
>   

All reads/writes from PCI devices would need to go through a wrapper.
Maybe we could also define a per-device offset for memory accesses. That
way the overhead might be less.

Yes, Anthony's patches look like they are a really big step in that
direction.


Alex

Patch

diff --git a/bswap.h b/bswap.h
index 4558704..d896f01 100644
--- a/bswap.h
+++ b/bswap.h
@@ -209,7 +209,6 @@  static inline void cpu_to_be32wu(uint32_t *p, uint32_t v)
 #define cpu_to_32wu cpu_to_le32wu
 #endif
 
-#undef le_bswap
 #undef be_bswap
 #undef le_bswaps
 #undef be_bswaps
diff --git a/qemu-kvm.c b/qemu-kvm.c
index a305907..ea07912 100644
--- a/qemu-kvm.c
+++ b/qemu-kvm.c
@@ -2438,27 +2438,29 @@  static int kvm_get_dirty_pages_log_range(unsigned long start_addr,
                                          unsigned long offset,
                                          unsigned long mem_size)
 {
-    unsigned int i, j, n = 0;
-    unsigned char c;
-    unsigned long page_number, addr, addr1;
+    unsigned int i, j;
+    unsigned long page_number, addr, addr1, c;
     ram_addr_t ram_addr;
-    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + 7) / 8;
+    unsigned int len = ((mem_size / TARGET_PAGE_SIZE) + HOST_LONG_BITS - 1) /
+        HOST_LONG_BITS;
+    unsigned long *bitmap_ul = (unsigned long *)bitmap;
 
     /* 
      * bitmap-traveling is faster than memory-traveling (for addr...) 
      * especially when most of the memory is not dirty.
      */
     for (i = 0; i < len; i++) {
-        c = bitmap[i];
-        while (c > 0) {
-            j = ffsl(c) - 1;
-            c &= ~(1u << j);
-            page_number = i * 8 + j;
-            addr1 = page_number * TARGET_PAGE_SIZE;
-            addr = offset + addr1;
-            ram_addr = cpu_get_physical_page_desc(addr);
-            cpu_physical_memory_set_dirty(ram_addr);
-            n++;
+        if (bitmap_ul[i] != 0) {
+            c = le_bswap(bitmap_ul[i], HOST_LONG_BITS);
+            while (c > 0) {
+                j = ffsl(c) - 1;
+                c &= ~(1ul << j);
+                page_number = i * HOST_LONG_BITS + j;
+                addr1 = page_number * TARGET_PAGE_SIZE;
+                addr = offset + addr1;
+                ram_addr = cpu_get_physical_page_desc(addr);
+                cpu_physical_memory_set_dirty(ram_addr);
+            }
         }
     }
     return 0;