Patchwork [PATCHv5,06/10] migration: search for zero instead of dup pages

Submitter Peter Lieven
Date March 26, 2013, 9:58 a.m.
Message ID <1364291919-19563-7-git-send-email-pl@kamp.de>
Permalink /patch/231141/
State New

Comments

Peter Lieven - March 26, 2013, 9:58 a.m.
virtually all dup pages are zero pages. remove
the special is_dup_page() function and use the
optimized buffer_find_nonzero_offset() function
instead.

here buffer_find_nonzero_offset() is used directly
to avoid the unnecessary additional checks in
buffer_is_zero().

raw performance gain checking 1 GByte zeroed memory
over is_dup_page() is approx. 10-12% with SSE2
and 8-10% with unsigned long arithmetic.

Signed-off-by: Peter Lieven <pl@kamp.de>
Reviewed-by: Orit Wasserman <owasserm@redhat.com>
Reviewed-by: Eric Blake <eblake@redhat.com>
---
 arch_init.c |   21 ++++++---------------
 1 file changed, 6 insertions(+), 15 deletions(-)
Kevin Wolf - April 5, 2013, 7:23 p.m.
On 26.03.2013 at 10:58, Peter Lieven wrote:
> virtually all dup pages are zero pages. remove
> the special is_dup_page() function and use the
> optimized buffer_find_nonzero_offset() function
> instead.
> 
> here buffer_find_nonzero_offset() is used directly
> to avoid the unnecessary additional checks in
> buffer_is_zero().
> 
> raw performance gain checking 1 GByte zeroed memory
> over is_dup_page() is approx. 10-12% with SSE2
> and 8-10% with unsigned long arithmetic.
> 
> Signed-off-by: Peter Lieven <pl@kamp.de>
> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
> Reviewed-by: Eric Blake <eblake@redhat.com>

Okay, so I bisected again and this is the second patch that is involved
in the slowness of qemu-iotests case 007.

The problem seems to be that the RAM of a guest is in fact _not_ zeroed
during initialisation. It hits my test case reliably because I'm running
with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
under such test conditions, or if real guests could be affected as well
and we should make sure to get zeroed memory for RAM.

Kevin
Paolo Bonzini - April 5, 2013, 8 p.m.
On 05/04/2013 21:23, Kevin Wolf wrote:
>> > virtually all dup pages are zero pages. remove
>> > the special is_dup_page() function and use the
>> > optimized buffer_find_nonzero_offset() function
>> > instead.
>> > 
>> > here buffer_find_nonzero_offset() is used directly
>> > to avoid the unnecessary additional checks in
>> > buffer_is_zero().
>> > 
>> > raw performance gain checking 1 GByte zeroed memory
>> > over is_dup_page() is approx. 10-12% with SSE2
>> > and 8-10% with unsigned long arithmetic.
>> > 
>> > Signed-off-by: Peter Lieven <pl@kamp.de>
>> > Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>> > Reviewed-by: Eric Blake <eblake@redhat.com>
> Okay, so I bisected again and this is the second patch that is involved
> in the slowness of qemu-iotests case 007.
> 
> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
> during initialisation. It hits my test case reliably because I'm running
> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
> under such test conditions, or if real guests could be affected as well
> and we should make sure to get zeroed memory for RAM.

I think we should MADV_DONTNEED it.

Paolo
Peter Lieven - April 5, 2013, 9:44 p.m.
On 05.04.2013 at 22:00, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 05/04/2013 21:23, Kevin Wolf wrote:
>>>> virtually all dup pages are zero pages. remove
>>>> the special is_dup_page() function and use the
>>>> optimized buffer_find_nonzero_offset() function
>>>> instead.
>>>> 
>>>> here buffer_find_nonzero_offset() is used directly
>>>> to avoid the unnecessary additional checks in
>>>> buffer_is_zero().
>>>> 
>>>> raw performance gain checking 1 GByte zeroed memory
>>>> over is_dup_page() is approx. 10-12% with SSE2
>>>> and 8-10% with unsigned long arithmetic.
>>>> 
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Okay, so I bisected again and this is the second patch that is involved
>> in the slowness of qemu-iotests case 007.
>> 
>> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
>> during initialisation. It hits my test case reliably because I'm running
>> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
>> under such test conditions, or if real guests could be affected as well
>> and we should make sure to get zeroed memory for RAM.
> 
> I think we should MADV_DONTNEED it.

This does not guarantee that the memory is unmapped afaik. Sadly, I think we have to revert

migration: do not sent zero pages in bulk stage

The memory returned by posix_memalign is most likely zero, as all GFP_USER pages
are zeroed out by the kernel on allocation (at least under Linux), but if a page is reused
within the same process it is not necessarily zero anymore.

What I was trying to address with this patch is that the memset performed when receiving a
zero page at the target was allocating memory, and the MADV_DONTNEED was not immediately
dropping the page. This led to memory pressure, swapping etc. on overcommitted systems.

What I would propose as a solution, after reverting the "do not sent zero pages"
patch, is something like this when receiving a compressed page:

if (ch != 0 || !is_zero_page(host)) {
    memset(host, ch, TARGET_PAGE_SIZE);
}
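Spelled out in context, the proposed guard would look roughly like the sketch below. This is an illustration of the idea, not the actual ram_load() hunk; is_zero_page() here is a plain byte-loop stand-in for the helper this patch introduces:

```c
#include <string.h>

#define TARGET_PAGE_SIZE 4096

/* stand-in for the is_zero_page() helper introduced by this patch */
static int is_zero_page(const unsigned char *p)
{
    size_t i;
    for (i = 0; i < TARGET_PAGE_SIZE; i++) {
        if (p[i]) {
            return 0;
        }
    }
    return 1;
}

/* On RAM_SAVE_FLAG_COMPRESS the source sends a single fill byte ch.
 * Only touch the page if its contents would actually change: writing
 * zeros over an untouched (never-faulted or MADV_DONTNEED'd) page
 * would allocate it for nothing and create memory pressure on the
 * destination. */
static void handle_compressed_page(unsigned char *host, unsigned char ch)
{
    if (ch != 0 || !is_zero_page(host)) {
        memset(host, ch, TARGET_PAGE_SIZE);
    }
}
```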

Regarding Kevin's observation of the speed regression in iotest 007: if
MALLOC_PERTURB_ is used there are simply no zero pages in memory, only dup pages. On
a real system the observation is that pages are either zero or not dup at all.

Peter


> 
> Paolo
Peter Lieven - April 5, 2013, 10:06 p.m.
On 05.04.2013 at 22:00, Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 05/04/2013 21:23, Kevin Wolf wrote:
>>>> virtually all dup pages are zero pages. remove
>>>> the special is_dup_page() function and use the
>>>> optimized buffer_find_nonzero_offset() function
>>>> instead.
>>>> 
>>>> here buffer_find_nonzero_offset() is used directly
>>>> to avoid the unnecessary additional checks in
>>>> buffer_is_zero().
>>>> 
>>>> raw performance gain checking 1 GByte zeroed memory
>>>> over is_dup_page() is approx. 10-12% with SSE2
>>>> and 8-10% with unsigned long arithmetic.
>>>> 
>>>> Signed-off-by: Peter Lieven <pl@kamp.de>
>>>> Reviewed-by: Orit Wasserman <owasserm@redhat.com>
>>>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Okay, so I bisected again and this is the second patch that is involved
>> in the slowness of qemu-iotests case 007.
>> 
>> The problem seems to be that the RAM of a guest is in fact _not_ zeroed
>> during initialisation. It hits my test case reliably because I'm running
>> with MALLOC_PERTURB_. Now I'm wondering if in practice this happens only
>> under such test conditions, or if real guests could be affected as well
>> and we should make sure to get zeroed memory for RAM.
> 
> I think we should MADV_DONTNEED it.

I have to correct myself: on Linux it seems that MADV_DONTNEED guarantees that
subsequent reads will return a zero-filled page.

If that is correct, we could MADV_DONTNEED the memory on startup and keep
not sending zero pages in the bulk stage when running under Linux. Am I right about that?

Peter
Paolo Bonzini - April 8, 2013, 8:38 a.m.
On 06/04/2013 00:06, Peter Lieven wrote:
>> > I think we should MADV_DONTNEED it.
> I have to correct myself: on Linux it seems that MADV_DONTNEED guarantees that
> subsequent reads will return a zero-filled page.
> 
> If that is correct, we could MADV_DONTNEED the memory on startup and keep
> not sending zero pages in the bulk stage when running under Linux. Am I right about that?

Actually we can use mmap+munmap to allocate a well-aligned, always zero
block for the RAM.

Paolo

Patch

diff --git a/arch_init.c b/arch_init.c
index 35974c2..dd5deff 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -146,19 +146,10 @@  int qemu_read_default_config_files(bool userconfig)
     return 0;
 }
 
-static int is_dup_page(uint8_t *page)
+static inline bool is_zero_page(uint8_t *p)
 {
-    VECTYPE *p = (VECTYPE *)page;
-    VECTYPE val = SPLAT(page);
-    int i;
-
-    for (i = 0; i < TARGET_PAGE_SIZE / sizeof(VECTYPE); i++) {
-        if (!ALL_EQ(val, p[i])) {
-            return 0;
-        }
-    }
-
-    return 1;
+    return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
+        TARGET_PAGE_SIZE;
 }
 
 /* struct contains XBZRLE cache and a static page
@@ -445,12 +436,12 @@  static int ram_save_block(QEMUFile *f, bool last_stage)
 
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_dup_page(p)) {
+            if (is_zero_page(p)) {
                 acct_info.dup_pages++;
                 bytes_sent = save_block_hdr(f, block, offset, cont,
                                             RAM_SAVE_FLAG_COMPRESS);
-                qemu_put_byte(f, *p);
-                bytes_sent += 1;
+                qemu_put_byte(f, 0);
+                bytes_sent++;
             } else if (migrate_use_xbzrle()) {
                 current_addr = block->offset + offset;
                 bytes_sent = save_xbzrle_page(f, p, current_addr, block,