Patchwork broken incoming migration

login
register
mail settings
Submitter Peter Lieven
Date June 4, 2013, 2:38 p.m.
Message ID <51ADFBCE.3080200@kamp.de>
Download mbox | patch
Permalink /patch/248675/
State New
Headers show

Comments

Peter Lieven - June 4, 2013, 2:38 p.m.
On 04.06.2013 16:14, Paolo Bonzini wrote:
> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>> You could also scan the page for nonzero values before writing it.
>>>> i had this in mind, but then choosed the other approach.... turned
>>>> out to be a bad idea.
>>>>
>>>> alexey: i will prepare a patch later today, could you then please
>>>> verify it fixes your problem.
>>>>
>>>> paolo: would we still need the madvise or is it enough to not write
>>>> the zeroes?
>>> It should be enough to not write them.
>> Problem: checking the pages for zero allocates them. even at the source.
> It doesn't look like.  I tried this program and top doesn't show an
> increasing amount of reserved memory:
>
> #include <stdio.h>
> #include <stdlib.h>
> int main()
> {
>      char *x = malloc(500 << 20);
>      int i, j;
>      for (i = 0; i < 500; i += 10) {
>          for (j = 0; j < 10 << 20; j += 4096) {
>               *(volatile char*) (x + (i << 20) + j);
>          }
>          getchar();
>      }
> }
strange. we are talking about RSS size, right?
is the malloc above using mmapped memory?
which kernel version do you use?

what avoids allocating the memory for me is the following (with
whatever side effects it has ;-))



Peter




>
> Paolo
Paolo Bonzini - June 4, 2013, 2:40 p.m.
Il 04/06/2013 16:38, Peter Lieven ha scritto:
> On 04.06.2013 16:14, Paolo Bonzini wrote:
>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>> out to be a bad idea.
>>>>>
>>>>> alexey: i will prepare a patch later today, could you then please
>>>>> verify it fixes your problem.
>>>>>
>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>> the zeroes?
>>>> It should be enough to not write them.
>>> Problem: checking the pages for zero allocates them. even at the source.
>> It doesn't look like.  I tried this program and top doesn't show an
>> increasing amount of reserved memory:
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> int main()
>> {
>>      char *x = malloc(500 << 20);
>>      int i, j;
>>      for (i = 0; i < 500; i += 10) {
>>          for (j = 0; j < 10 << 20; j += 4096) {
>>               *(volatile char*) (x + (i << 20) + j);
>>          }
>>          getchar();
>>      }
>> }
> strange. we are talking about RSS size, right?

None of the three top values change, and only VIRT is >500 MB.

> is the malloc above using mmapped memory?

Yes.

> which kernel version do you use?

3.9.

> what avoids allocating the memory for me is the following (with
> whatever side effects it has ;-))

This would also fail to migrate any page that is swapped out, breaking
overcommit in a more subtle way. :)

Paolo
Peter Lieven - June 4, 2013, 2:48 p.m.
On 04.06.2013 16:40, Paolo Bonzini wrote:
> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>> out to be a bad idea.
>>>>>>
>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>> verify it fixes your problem.
>>>>>>
>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>> the zeroes?
>>>>> It should be enough to not write them.
>>>> Problem: checking the pages for zero allocates them. even at the source.
>>> It doesn't look like.  I tried this program and top doesn't show an
>>> increasing amount of reserved memory:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> int main()
>>> {
>>>       char *x = malloc(500 << 20);
>>>       int i, j;
>>>       for (i = 0; i < 500; i += 10) {
>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>                *(volatile char*) (x + (i << 20) + j);
>>>           }
>>>           getchar();
>>>       }
>>> }
>> strange. we are talking about RSS size, right?
> None of the three top values change, and only VIRT is >500 MB.
>
>> is the malloc above using mmapped memory?
> Yes.
>
>> which kernel version do you use?
> 3.9.
Still using 3.2, but strange enough the above example is also not increasing RSS size for me.

Can you try the following:
qemu git master with 1G of memory (hanging in bios with no boot device) and migrate it. Before migration RSS Size os somewhat
around 16MB. After migration its RSS size is in the order of 1G.

Peter
Peter Lieven - June 4, 2013, 3:10 p.m.
On 04.06.2013 16:40, Paolo Bonzini wrote:
> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>> out to be a bad idea.
>>>>>>
>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>> verify it fixes your problem.
>>>>>>
>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>> the zeroes?
>>>>> It should be enough to not write them.
>>>> Problem: checking the pages for zero allocates them. even at the source.
>>> It doesn't look like.  I tried this program and top doesn't show an
>>> increasing amount of reserved memory:
>>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> int main()
>>> {
>>>       char *x = malloc(500 << 20);
>>>       int i, j;
>>>       for (i = 0; i < 500; i += 10) {
>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>                *(volatile char*) (x + (i << 20) + j);
>>>           }
>>>           getchar();
>>>       }
>>> }
>> strange. we are talking about RSS size, right?
> None of the three top values change, and only VIRT is >500 MB.
>
>> is the malloc above using mmapped memory?
> Yes.
>
>> which kernel version do you use?
> 3.9.
>
>> what avoids allocating the memory for me is the following (with
>> whatever side effects it has ;-))
> This would also fail to migrate any page that is swapped out, breaking
> overcommit in a more subtle way. :)
>
> Paolo
the following does also not allocate memory, but qemu does...

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <unistd.h>
#include <sys/resource.h>
#include <inttypes.h>
#include <string.h>
#include <sys/mman.h>
#include <errno.h>

#if defined __SSE2__
#include <emmintrin.h>
#define VECTYPE        __m128i
#define SPLAT(p)       _mm_set1_epi8(*(p))
#define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
#else
#define VECTYPE        unsigned long
#define SPLAT(p)       (*(p) * (~0UL / 255))
#define ALL_EQ(v1, v2) ((v1) == (v2))
#endif

#define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8

/* Round number down to multiple */
#define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))

/* Round number up to multiple */
#define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))

#define QEMU_VMALLOC_ALIGN (256 * 4096)

/* alloc shared memory pages */
void *qemu_anon_ram_alloc(size_t size)
{
     size_t align = QEMU_VMALLOC_ALIGN;
     size_t total = size + align - getpagesize();
     void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
                      MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
     size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;

     if (ptr == MAP_FAILED) {
         fprintf(stderr, "Failed to allocate %zu B: %s\n",
                 size, strerror(errno));
         abort();
     }

     ptr += offset;
     total -= offset;

     if (offset > 0) {
         munmap(ptr - offset, offset);
     }
     if (total > size) {
         munmap(ptr + size, total - size);
     }

     return ptr;
}

static inline int
can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
{
     return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
                    * sizeof(VECTYPE)) == 0
             && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
}

size_t buffer_find_nonzero_offset(const void *buf, size_t len)
{
     const VECTYPE *p = buf;
     const VECTYPE zero = (VECTYPE){0};
     size_t i;

     if (!len) {
         return 0;
     }

     assert(can_use_buffer_find_nonzero_offset(buf, len));

     for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
         if (!ALL_EQ(p[i], zero)) {
             return i * sizeof(VECTYPE);
         }
     }

     for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
          i < len / sizeof(VECTYPE);
          i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
         VECTYPE tmp0 = p[i + 0] | p[i + 1];
         VECTYPE tmp1 = p[i + 2] | p[i + 3];
         VECTYPE tmp2 = p[i + 4] | p[i + 5];
         VECTYPE tmp3 = p[i + 6] | p[i + 7];
         VECTYPE tmp01 = tmp0 | tmp1;
         VECTYPE tmp23 = tmp2 | tmp3;
         if (!ALL_EQ(tmp01 | tmp23, zero)) {
             break;
         }
     }

     return i * sizeof(VECTYPE);
}

int main()
{
      //char *x = malloc(1024 << 20);
      char *x = qemu_anon_ram_alloc(1024 << 20);

      int i, j;
      int ret = 0;
      struct rusage rusage;
      for (i = 0; i < 500; i ++) {
          for (j = 0; j < 10 << 20; j += 4096) {
               ret += buffer_find_nonzero_offset((char*) (x + (i << 20) + j), 4096);
          }
          getrusage( RUSAGE_SELF, &rusage );
          printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10), rusage.ru_maxrss);
          getchar();
      }
      printf("%d zero pages\n", ret);
}
Paolo Bonzini - June 4, 2013, 3:17 p.m.
Il 04/06/2013 16:48, Peter Lieven ha scritto:
> Still using 3.2, but strange enough the above example is also not
> increasing RSS size for me.
> 
> Can you try the following:
> qemu git master with 1G of memory (hanging in bios with no boot device)
> and migrate it. Before migration RSS Size os somewhat
> around 16MB. After migration its RSS size is in the order of 1G.

That may be a kernel bug.  The kernel did not do the copy-on-write trick
on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.

Paolo
Peter Lieven - June 4, 2013, 7:15 p.m.
Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:

> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>> Still using 3.2, but strange enough the above example is also not
>> increasing RSS size for me.
>> 
>> Can you try the following:
>> qemu git master with 1G of memory (hanging in bios with no boot device)
>> and migrate it. Before migration RSS Size os somewhat
>> around 16MB. After migration its RSS size is in the order of 1G.
> 
> That may be a kernel bug.  The kernel did not do the copy-on-write trick
> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.

that's it. thanks for the pointer. the huge zero page was introduced in 3.8.

paolo, alexey: can you please verify the following works for you:
https://github.com/plieven/qemu/tree/fix-migration

thanks
peter


> 
> Paolo
>
Alexey Kardashevskiy - June 5, 2013, 3:37 a.m.
On 06/05/2013 05:15 AM, Peter Lieven wrote:
> 
> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
> 
>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>> Still using 3.2, but strange enough the above example is also not
>>> increasing RSS size for me.
>>>
>>> Can you try the following:
>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>> and migrate it. Before migration RSS Size os somewhat
>>> around 16MB. After migration its RSS size is in the order of 1G.
>>
>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
> 
> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
> 
> paolo, alexey: can you please verify the following works for you:
> https://github.com/plieven/qemu/tree/fix-migration

These two?
848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
overwrite zero pages
2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
not sent zero pages in bulk stage"

That works for me (qemu 1.5, kernel 3.9-rc2).
Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Peter Lieven - June 5, 2013, 6:09 a.m.
Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:

> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>> 
>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>> 
>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>> Still using 3.2, but strange enough the above example is also not
>>>> increasing RSS size for me.
>>>> 
>>>> Can you try the following:
>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>> and migrate it. Before migration RSS Size os somewhat
>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>> 
>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>> 
>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>> 
>> paolo, alexey: can you please verify the following works for you:
>> https://github.com/plieven/qemu/tree/fix-migration
> 
> These two?
> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
> overwrite zero pages
> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
> not sent zero pages in bulk stage"

Yes, sorry forgot to mention this.

> 
> That works for me (qemu 1.5, kernel 3.9-rc2).
> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>

Thank you,
Peter
Wayne Xia - June 8, 2013, 8:27 a.m.
> On 04.06.2013 16:40, Paolo Bonzini wrote:
>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>> out to be a bad idea.
>>>>>>>
>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>> verify it fixes your problem.
>>>>>>>
>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>> the zeroes?
>>>>>> It should be enough to not write them.
>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>> source.
>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>> increasing amount of reserved memory:
>>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> int main()
>>>> {
>>>>       char *x = malloc(500 << 20);
>>>>       int i, j;
>>>>       for (i = 0; i < 500; i += 10) {
>>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>>                *(volatile char*) (x + (i << 20) + j);
>>>>           }
>>>>           getchar();
>>>>       }
>>>> }
>>> strange. we are talking about RSS size, right?
>> None of the three top values change, and only VIRT is >500 MB.
>>
>>> is the malloc above using mmapped memory?
>> Yes.
>>
>>> which kernel version do you use?
>> 3.9.
>>
>>> what avoids allocating the memory for me is the following (with
>>> whatever side effects it has ;-))
>> This would also fail to migrate any page that is swapped out, breaking
>> overcommit in a more subtle way. :)
>>
>> Paolo
> the following does also not allocate memory, but qemu does...
>
Hi, Peter
   As the patch writes

"not sending zero pages breaks migration if a page is zero
at the source but not at the destination."

   I don't understand why it would be trouble, shouldn't all page
not received in dest be treated as zero pages?

Also, you mean following code is from qemu and it does not allocate
memory with you gcc right? Maybe it is related to KVM, how about
turn off KVM and retry following code in qemu?

> #include <stdio.h>
> #include <stdlib.h>
> #include <assert.h>
> #include <unistd.h>
> #include <sys/resource.h>
> #include <inttypes.h>
> #include <string.h>
> #include <sys/mman.h>
> #include <errno.h>
>
> #if defined __SSE2__
> #include <emmintrin.h>
> #define VECTYPE        __m128i
> #define SPLAT(p)       _mm_set1_epi8(*(p))
> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
> 0xFFFF)
> #else
> #define VECTYPE        unsigned long
> #define SPLAT(p)       (*(p) * (~0UL / 255))
> #define ALL_EQ(v1, v2) ((v1) == (v2))
> #endif
>
> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>
> /* Round number down to multiple */
> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>
> /* Round number up to multiple */
> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>
> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>
> /* alloc shared memory pages */
> void *qemu_anon_ram_alloc(size_t size)
> {
>      size_t align = QEMU_VMALLOC_ALIGN;
>      size_t total = size + align - getpagesize();
>      void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>      size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>
>      if (ptr == MAP_FAILED) {
>          fprintf(stderr, "Failed to allocate %zu B: %s\n",
>                  size, strerror(errno));
>          abort();
>      }
>
>      ptr += offset;
>      total -= offset;
>
>      if (offset > 0) {
>          munmap(ptr - offset, offset);
>      }
>      if (total > size) {
>          munmap(ptr + size, total - size);
>      }
>
>      return ptr;
> }
>
> static inline int
> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> {
>      return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>                     * sizeof(VECTYPE)) == 0
>              && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> }
>
> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> {
>      const VECTYPE *p = buf;
>      const VECTYPE zero = (VECTYPE){0};
>      size_t i;
>
>      if (!len) {
>          return 0;
>      }
>
>      assert(can_use_buffer_find_nonzero_offset(buf, len));
>
>      for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>          if (!ALL_EQ(p[i], zero)) {
>              return i * sizeof(VECTYPE);
>          }
>      }
>
>      for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>           i < len / sizeof(VECTYPE);
>           i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>          VECTYPE tmp0 = p[i + 0] | p[i + 1];
>          VECTYPE tmp1 = p[i + 2] | p[i + 3];
>          VECTYPE tmp2 = p[i + 4] | p[i + 5];
>          VECTYPE tmp3 = p[i + 6] | p[i + 7];
>          VECTYPE tmp01 = tmp0 | tmp1;
>          VECTYPE tmp23 = tmp2 | tmp3;
>          if (!ALL_EQ(tmp01 | tmp23, zero)) {
>              break;
>          }
>      }
>
>      return i * sizeof(VECTYPE);
> }
>
> int main()
> {
>       //char *x = malloc(1024 << 20);
>       char *x = qemu_anon_ram_alloc(1024 << 20);
>
>       int i, j;
>       int ret = 0;
>       struct rusage rusage;
>       for (i = 0; i < 500; i ++) {
>           for (j = 0; j < 10 << 20; j += 4096) {
>                ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
> + j), 4096);
>           }
>           getrusage( RUSAGE_SELF, &rusage );
>           printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
> rusage.ru_maxrss);
>           getchar();
>       }
>       printf("%d zero pages\n", ret);
> }
>
Alexey Kardashevskiy - June 8, 2013, 8:30 a.m.
On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>> out to be a bad idea.
>>>>>>>>
>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>> verify it fixes your problem.
>>>>>>>>
>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>> the zeroes?
>>>>>>> It should be enough to not write them.
>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>> source.
>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>> increasing amount of reserved memory:
>>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> int main()
>>>>> {
>>>>>       char *x = malloc(500 << 20);
>>>>>       int i, j;
>>>>>       for (i = 0; i < 500; i += 10) {
>>>>>           for (j = 0; j < 10 << 20; j += 4096) {
>>>>>                *(volatile char*) (x + (i << 20) + j);
>>>>>           }
>>>>>           getchar();
>>>>>       }
>>>>> }
>>>> strange. we are talking about RSS size, right?
>>> None of the three top values change, and only VIRT is >500 MB.
>>>
>>>> is the malloc above using mmapped memory?
>>> Yes.
>>>
>>>> which kernel version do you use?
>>> 3.9.
>>>
>>>> what avoids allocating the memory for me is the following (with
>>>> whatever side effects it has ;-))
>>> This would also fail to migrate any page that is swapped out, breaking
>>> overcommit in a more subtle way. :)
>>>
>>> Paolo
>> the following does also not allocate memory, but qemu does...
>>
> Hi, Peter
>   As the patch writes
> 
> "not sending zero pages breaks migration if a page is zero
> at the source but not at the destination."
> 
>   I don't understand why it would be trouble, shouldn't all page
> not received in dest be treated as zero pages?


How would the destination guest know if some page must be cleared? The
previous patch (which Peter reverted) did not send anything for the pages
which were zero on the source side.



> Also, you mean following code is from qemu and it does not allocate
> memory with you gcc right? Maybe it is related to KVM, how about
> turn off KVM and retry following code in qemu?
> 
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <assert.h>
>> #include <unistd.h>
>> #include <sys/resource.h>
>> #include <inttypes.h>
>> #include <string.h>
>> #include <sys/mman.h>
>> #include <errno.h>
>>
>> #if defined __SSE2__
>> #include <emmintrin.h>
>> #define VECTYPE        __m128i
>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>> 0xFFFF)
>> #else
>> #define VECTYPE        unsigned long
>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>> #endif
>>
>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>
>> /* Round number down to multiple */
>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>
>> /* Round number up to multiple */
>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>
>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>
>> /* alloc shared memory pages */
>> void *qemu_anon_ram_alloc(size_t size)
>> {
>>      size_t align = QEMU_VMALLOC_ALIGN;
>>      size_t total = size + align - getpagesize();
>>      void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>                       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>      size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>>
>>      if (ptr == MAP_FAILED) {
>>          fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>                  size, strerror(errno));
>>          abort();
>>      }
>>
>>      ptr += offset;
>>      total -= offset;
>>
>>      if (offset > 0) {
>>          munmap(ptr - offset, offset);
>>      }
>>      if (total > size) {
>>          munmap(ptr + size, total - size);
>>      }
>>
>>      return ptr;
>> }
>>
>> static inline int
>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>> {
>>      return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>                     * sizeof(VECTYPE)) == 0
>>              && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>> }
>>
>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>> {
>>      const VECTYPE *p = buf;
>>      const VECTYPE zero = (VECTYPE){0};
>>      size_t i;
>>
>>      if (!len) {
>>          return 0;
>>      }
>>
>>      assert(can_use_buffer_find_nonzero_offset(buf, len));
>>
>>      for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>          if (!ALL_EQ(p[i], zero)) {
>>              return i * sizeof(VECTYPE);
>>          }
>>      }
>>
>>      for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>           i < len / sizeof(VECTYPE);
>>           i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>          VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>          VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>          VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>          VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>          VECTYPE tmp01 = tmp0 | tmp1;
>>          VECTYPE tmp23 = tmp2 | tmp3;
>>          if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>              break;
>>          }
>>      }
>>
>>      return i * sizeof(VECTYPE);
>> }
>>
>> int main()
>> {
>>       //char *x = malloc(1024 << 20);
>>       char *x = qemu_anon_ram_alloc(1024 << 20);
>>
>>       int i, j;
>>       int ret = 0;
>>       struct rusage rusage;
>>       for (i = 0; i < 500; i ++) {
>>           for (j = 0; j < 10 << 20; j += 4096) {
>>                ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>> + j), 4096);
>>           }
>>           getrusage( RUSAGE_SELF, &rusage );
>>           printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>> rusage.ru_maxrss);
>>           getchar();
>>       }
>>       printf("%d zero pages\n", ret);
>> }
>>
> 
>
Wayne Xia - June 9, 2013, 2:16 a.m.
于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>> out to be a bad idea.
>>>>>>>>>
>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>> verify it fixes your problem.
>>>>>>>>>
>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>> the zeroes?
>>>>>>>> It should be enough to not write them.
>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>> source.
>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>> increasing amount of reserved memory:
>>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> int main()
>>>>>> {
>>>>>>        char *x = malloc(500 << 20);
>>>>>>        int i, j;
>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>            }
>>>>>>            getchar();
>>>>>>        }
>>>>>> }
>>>>> strange. we are talking about RSS size, right?
>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>
>>>>> is the malloc above using mmapped memory?
>>>> Yes.
>>>>
>>>>> which kernel version do you use?
>>>> 3.9.
>>>>
>>>>> what avoids allocating the memory for me is the following (with
>>>>> whatever side effects it has ;-))
>>>> This would also fail to migrate any page that is swapped out, breaking
>>>> overcommit in a more subtle way. :)
>>>>
>>>> Paolo
>>> the following does also not allocate memory, but qemu does...
>>>
>> Hi, Peter
>>    As the patch writes
>>
>> "not sending zero pages breaks migration if a page is zero
>> at the source but not at the destination."
>>
>>    I don't understand why it would be trouble, shouldn't all page
>> not received in dest be treated as zero pages?
>
>
> How would the destination guest know if some page must be cleared? The
> previous patch (which Peter reverted) did not send anything for the pages
> which were zero on the source side.
>
>
   If an page was not received and destination knows that page should
exist according to total size, fill it with zero at destination, would
it solve the problem?

>
>> Also, you mean following code is from qemu and it does not allocate
>> memory with you gcc right? Maybe it is related to KVM, how about
>> turn off KVM and retry following code in qemu?
>>
>>> #include <stdio.h>
>>> #include <stdlib.h>
>>> #include <assert.h>
>>> #include <unistd.h>
>>> #include <sys/resource.h>
>>> #include <inttypes.h>
>>> #include <string.h>
>>> #include <sys/mman.h>
>>> #include <errno.h>
>>>
>>> #if defined __SSE2__
>>> #include <emmintrin.h>
>>> #define VECTYPE        __m128i
>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>> 0xFFFF)
>>> #else
>>> #define VECTYPE        unsigned long
>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>> #endif
>>>
>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>
>>> /* Round number down to multiple */
>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>
>>> /* Round number up to multiple */
>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>
>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>
>>> /* alloc shared memory pages */
>>> void *qemu_anon_ram_alloc(size_t size)
>>> {
>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>       size_t total = size + align - getpagesize();
>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) - (uintptr_t)ptr;
>>>
>>>       if (ptr == MAP_FAILED) {
>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>                   size, strerror(errno));
>>>           abort();
>>>       }
>>>
>>>       ptr += offset;
>>>       total -= offset;
>>>
>>>       if (offset > 0) {
>>>           munmap(ptr - offset, offset);
>>>       }
>>>       if (total > size) {
>>>           munmap(ptr + size, total - size);
>>>       }
>>>
>>>       return ptr;
>>> }
>>>
>>> static inline int
>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>> {
>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>                      * sizeof(VECTYPE)) == 0
>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>> }
>>>
>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>> {
>>>       const VECTYPE *p = buf;
>>>       const VECTYPE zero = (VECTYPE){0};
>>>       size_t i;
>>>
>>>       if (!len) {
>>>           return 0;
>>>       }
>>>
>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>
>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>           if (!ALL_EQ(p[i], zero)) {
>>>               return i * sizeof(VECTYPE);
>>>           }
>>>       }
>>>
>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>            i < len / sizeof(VECTYPE);
>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>               break;
>>>           }
>>>       }
>>>
>>>       return i * sizeof(VECTYPE);
>>> }
>>>
>>> int main()
>>> {
>>>        //char *x = malloc(1024 << 20);
>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>
>>>        int i, j;
>>>        int ret = 0;
>>>        struct rusage rusage;
>>>        for (i = 0; i < 500; i ++) {
>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>> + j), 4096);
>>>            }
>>>            getrusage( RUSAGE_SELF, &rusage );
>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>> rusage.ru_maxrss);
>>>            getchar();
>>>        }
>>>        printf("%d zero pages\n", ret);
>>> }
>>>
>>
>>
>
>
Alexey Kardashevskiy - June 9, 2013, 2:34 a.m.
On 06/09/2013 12:16 PM, Wenchao Xia wrote:
> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>
>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>
>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>>> the zeroes?
>>>>>>>>> It should be enough to not write them.
>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>> source.
>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>> increasing amount of reserved memory:
>>>>>>>
>>>>>>> #include <stdio.h>
>>>>>>> #include <stdlib.h>
>>>>>>> int main()
>>>>>>> {
>>>>>>>        char *x = malloc(500 << 20);
>>>>>>>        int i, j;
>>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>>            }
>>>>>>>            getchar();
>>>>>>>        }
>>>>>>> }
>>>>>> strange. we are talking about RSS size, right?
>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>
>>>>>> is the malloc above using mmapped memory?
>>>>> Yes.
>>>>>
>>>>>> which kernel version do you use?
>>>>> 3.9.
>>>>>
>>>>>> what avoids allocating the memory for me is the following (with
>>>>>> whatever side effects it has ;-))
>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>> overcommit in a more subtle way. :)
>>>>>
>>>>> Paolo
>>>> the following does also not allocate memory, but qemu does...
>>>>
>>> Hi, Peter
>>>    As the patch writes
>>>
>>> "not sending zero pages breaks migration if a page is zero
>>> at the source but not at the destination."
>>>
>>>    I don't understand why it would be trouble, shouldn't all page
>>> not received in dest be treated as zero pages?
>>
>>
>> How would the destination guest know if some page must be cleared? The
>> previous patch (which Peter reverted) did not send anything for the pages
>> which were zero on the source side.
>>
>>
>   If an page was not received and destination knows that page should
> exist according to total size, fill it with zero at destination, would
> it solve the problem?

It is _live_ migration, the source sends changes, same pages can change and
be sent several times. So we would need to turn tracking on on the
destination to know if some page was received from the source or changed by
the destination itself (by writing there bios/firmware images, etc) and
then clear pages which were touched by the destination and were not sent by
the source.

Or we do not make guesses, the source sends everything and the destination
simply checks if a page which is empty on the source is empty on the
destination and avoid writing zeroes to it. Looks simpler to me and this is
what the new patch does.


> 
>>
>>> Also, you mean following code is from qemu and it does not allocate
>>> memory with you gcc right? Maybe it is related to KVM, how about
>>> turn off KVM and retry following code in qemu?
>>>
>>>> #include <stdio.h>
>>>> #include <stdlib.h>
>>>> #include <assert.h>
>>>> #include <unistd.h>
>>>> #include <sys/resource.h>
>>>> #include <inttypes.h>
>>>> #include <string.h>
>>>> #include <sys/mman.h>
>>>> #include <errno.h>
>>>>
>>>> #if defined __SSE2__
>>>> #include <emmintrin.h>
>>>> #define VECTYPE        __m128i
>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>> 0xFFFF)
>>>> #else
>>>> #define VECTYPE        unsigned long
>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>> #endif
>>>>
>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>
>>>> /* Round number down to multiple */
>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>
>>>> /* Round number up to multiple */
>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>
>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>
>>>> /* alloc shared memory pages */
>>>> void *qemu_anon_ram_alloc(size_t size)
>>>> {
>>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>>       size_t total = size + align - getpagesize();
>>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>> (uintptr_t)ptr;
>>>>
>>>>       if (ptr == MAP_FAILED) {
>>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>                   size, strerror(errno));
>>>>           abort();
>>>>       }
>>>>
>>>>       ptr += offset;
>>>>       total -= offset;
>>>>
>>>>       if (offset > 0) {
>>>>           munmap(ptr - offset, offset);
>>>>       }
>>>>       if (total > size) {
>>>>           munmap(ptr + size, total - size);
>>>>       }
>>>>
>>>>       return ptr;
>>>> }
>>>>
>>>> static inline int
>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>> {
>>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>                      * sizeof(VECTYPE)) == 0
>>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>> }
>>>>
>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>> {
>>>>       const VECTYPE *p = buf;
>>>>       const VECTYPE zero = (VECTYPE){0};
>>>>       size_t i;
>>>>
>>>>       if (!len) {
>>>>           return 0;
>>>>       }
>>>>
>>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>
>>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>           if (!ALL_EQ(p[i], zero)) {
>>>>               return i * sizeof(VECTYPE);
>>>>           }
>>>>       }
>>>>
>>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>            i < len / sizeof(VECTYPE);
>>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>               break;
>>>>           }
>>>>       }
>>>>
>>>>       return i * sizeof(VECTYPE);
>>>> }
>>>>
>>>> int main()
>>>> {
>>>>        //char *x = malloc(1024 << 20);
>>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>
>>>>        int i, j;
>>>>        int ret = 0;
>>>>        struct rusage rusage;
>>>>        for (i = 0; i < 500; i ++) {
>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>> + j), 4096);
>>>>            }
>>>>            getrusage( RUSAGE_SELF, &rusage );
>>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>> rusage.ru_maxrss);
>>>>            getchar();
>>>>        }
>>>>        printf("%d zero pages\n", ret);
>>>> }
>>>>
>>>
>>>
>>
>>
> 
>
Benjamin Herrenschmidt - June 9, 2013, 2:52 a.m.
On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote:

> It is _live_ migration, the source sends changes, same pages can change and
> be sent several times. So we would need to turn tracking on on the
> destination to know if some page was received from the source or changed by
> the destination itself (by writing there bios/firmware images, etc) and
> then clear pages which were touched by the destination and were not sent by
> the source.

Or we can set some kind of flag so that when creating a "migration
target" VM we don't load all these things into memory.

> Or we do not make guesses, the source sends everything and the destination
> simply checks if a page which is empty on the source is empty on the
> destination and avoid writing zeroes to it. Looks simpler to me and this is
> what the new patch does.

But you end up sending a lot of zero's ... is the migration compressed
(I am not familiar with it at all) ? If it is, that shouldn't be a big
deal, but else it feels to me that you should be able to send a special
packet instead that says "all zeros" because you'll potentially have an
awful lot of these.

Ben.

> > 
> >>
> >>> Also, you mean following code is from qemu and it does not allocate
> >>> memory with you gcc right? Maybe it is related to KVM, how about
> >>> turn off KVM and retry following code in qemu?
> >>>
> >>>> #include <stdio.h>
> >>>> #include <stdlib.h>
> >>>> #include <assert.h>
> >>>> #include <unistd.h>
> >>>> #include <sys/resource.h>
> >>>> #include <inttypes.h>
> >>>> #include <string.h>
> >>>> #include <sys/mman.h>
> >>>> #include <errno.h>
> >>>>
> >>>> #if defined __SSE2__
> >>>> #include <emmintrin.h>
> >>>> #define VECTYPE        __m128i
> >>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
> >>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
> >>>> 0xFFFF)
> >>>> #else
> >>>> #define VECTYPE        unsigned long
> >>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
> >>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
> >>>> #endif
> >>>>
> >>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
> >>>>
> >>>> /* Round number down to multiple */
> >>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
> >>>>
> >>>> /* Round number up to multiple */
> >>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
> >>>>
> >>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
> >>>>
> >>>> /* alloc shared memory pages */
> >>>> void *qemu_anon_ram_alloc(size_t size)
> >>>> {
> >>>>       size_t align = QEMU_VMALLOC_ALIGN;
> >>>>       size_t total = size + align - getpagesize();
> >>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
> >>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
> >>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
> >>>> (uintptr_t)ptr;
> >>>>
> >>>>       if (ptr == MAP_FAILED) {
> >>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
> >>>>                   size, strerror(errno));
> >>>>           abort();
> >>>>       }
> >>>>
> >>>>       ptr += offset;
> >>>>       total -= offset;
> >>>>
> >>>>       if (offset > 0) {
> >>>>           munmap(ptr - offset, offset);
> >>>>       }
> >>>>       if (total > size) {
> >>>>           munmap(ptr + size, total - size);
> >>>>       }
> >>>>
> >>>>       return ptr;
> >>>> }
> >>>>
> >>>> static inline int
> >>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
> >>>> {
> >>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
> >>>>                      * sizeof(VECTYPE)) == 0
> >>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
> >>>> }
> >>>>
> >>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
> >>>> {
> >>>>       const VECTYPE *p = buf;
> >>>>       const VECTYPE zero = (VECTYPE){0};
> >>>>       size_t i;
> >>>>
> >>>>       if (!len) {
> >>>>           return 0;
> >>>>       }
> >>>>
> >>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
> >>>>
> >>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
> >>>>           if (!ALL_EQ(p[i], zero)) {
> >>>>               return i * sizeof(VECTYPE);
> >>>>           }
> >>>>       }
> >>>>
> >>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
> >>>>            i < len / sizeof(VECTYPE);
> >>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
> >>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
> >>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
> >>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
> >>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
> >>>>           VECTYPE tmp01 = tmp0 | tmp1;
> >>>>           VECTYPE tmp23 = tmp2 | tmp3;
> >>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
> >>>>               break;
> >>>>           }
> >>>>       }
> >>>>
> >>>>       return i * sizeof(VECTYPE);
> >>>> }
> >>>>
> >>>> int main()
> >>>> {
> >>>>        //char *x = malloc(1024 << 20);
> >>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
> >>>>
> >>>>        int i, j;
> >>>>        int ret = 0;
> >>>>        struct rusage rusage;
> >>>>        for (i = 0; i < 500; i ++) {
> >>>>            for (j = 0; j < 10 << 20; j += 4096) {
> >>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
> >>>> + j), 4096);
> >>>>            }
> >>>>            getrusage( RUSAGE_SELF, &rusage );
> >>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
> >>>> rusage.ru_maxrss);
> >>>>            getchar();
> >>>>        }
> >>>>        printf("%d zero pages\n", ret);
> >>>> }
> >>>>
> >>>
> >>>
> >>
> >>
> > 
> > 
> 
>
Benjamin Herrenschmidt - June 9, 2013, 2:53 a.m.
On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote:
>    If an page was not received and destination knows that page should
> exist according to total size, fill it with zero at destination, would
> it solve the problem?

The easiest way to do that is to not write to those pages at the
destination to begin with, when initializing the VM... Is there any way
to know that a VM is being setup as a migration target or not ?

Ben.
Wayne Xia - June 9, 2013, 3:01 a.m.
于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>> You could also scan the page for nonzero values before writing it.
>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>
>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>
>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not write
>>>>>>>>>>> the zeroes?
>>>>>>>>>> It should be enough to not write them.
>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>> source.
>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>> increasing amount of reserved memory:
>>>>>>>>
>>>>>>>> #include <stdio.h>
>>>>>>>> #include <stdlib.h>
>>>>>>>> int main()
>>>>>>>> {
>>>>>>>>         char *x = malloc(500 << 20);
>>>>>>>>         int i, j;
>>>>>>>>         for (i = 0; i < 500; i += 10) {
>>>>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>                  *(volatile char*) (x + (i << 20) + j);
>>>>>>>>             }
>>>>>>>>             getchar();
>>>>>>>>         }
>>>>>>>> }
>>>>>>> strange. we are talking about RSS size, right?
>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>
>>>>>>> is the malloc above using mmapped memory?
>>>>>> Yes.
>>>>>>
>>>>>>> which kernel version do you use?
>>>>>> 3.9.
>>>>>>
>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>> whatever side effects it has ;-))
>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>> overcommit in a more subtle way. :)
>>>>>>
>>>>>> Paolo
>>>>> the following does also not allocate memory, but qemu does...
>>>>>
>>>> Hi, Peter
>>>>     As the patch writes
>>>>
>>>> "not sending zero pages breaks migration if a page is zero
>>>> at the source but not at the destination."
>>>>
>>>>     I don't understand why it would be trouble, shouldn't all page
>>>> not received in dest be treated as zero pages?
>>>
>>>
>>> How would the destination guest know if some page must be cleared? The
>>> previous patch (which Peter reverted) did not send anything for the pages
>>> which were zero on the source side.
>>>
>>>
>>    If an page was not received and destination knows that page should
>> exist according to total size, fill it with zero at destination, would
>> it solve the problem?
>
> It is _live_ migration, the source sends changes, same pages can change and
> be sent several times. So we would need to turn tracking on on the
> destination to know if some page was received from the source or changed by
> the destination itself (by writing there bios/firmware images, etc) and
> then clear pages which were touched by the destination and were not sent by
> the source.
   OK, I can understand the problem is, for example:
Destination boots up with 0x0000-0xFFFF filled with bios image.
Source forgot to send zero pages in 0x0000-0xFFFF.
After migration destination got 0x0000-0xFFFF dirty(different with
source)
   Thanks for explain.

   This seems refer to the migration protocol: how should the guest treat
unsent pages. The patch causing the problem, actually treat zero pages
as "not to sent" at source, but another half is missing: treat "not
received" as zero pages at destination. I guess if second half is added,
problem is gone:
after page transfer completed, before destination resume,
fill zero in "not received" pages.

>
> Or we do not make guesses, the source sends everything and the destination
> simply checks if a page which is empty on the source is empty on the
> destination and avoid writing zeroes to it. Looks simpler to me and this is
> what the new patch does.
>
>
>>
>>>
>>>> Also, you mean following code is from qemu and it does not allocate
>>>> memory with you gcc right? Maybe it is related to KVM, how about
>>>> turn off KVM and retry following code in qemu?
>>>>
>>>>> #include <stdio.h>
>>>>> #include <stdlib.h>
>>>>> #include <assert.h>
>>>>> #include <unistd.h>
>>>>> #include <sys/resource.h>
>>>>> #include <inttypes.h>
>>>>> #include <string.h>
>>>>> #include <sys/mman.h>
>>>>> #include <errno.h>
>>>>>
>>>>> #if defined __SSE2__
>>>>> #include <emmintrin.h>
>>>>> #define VECTYPE        __m128i
>>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>>> 0xFFFF)
>>>>> #else
>>>>> #define VECTYPE        unsigned long
>>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>>> #endif
>>>>>
>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>>
>>>>> /* Round number down to multiple */
>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>>
>>>>> /* Round number up to multiple */
>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>>
>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>>
>>>>> /* alloc shared memory pages */
>>>>> void *qemu_anon_ram_alloc(size_t size)
>>>>> {
>>>>>        size_t align = QEMU_VMALLOC_ALIGN;
>>>>>        size_t total = size + align - getpagesize();
>>>>>        void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>>                         MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>        size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>>> (uintptr_t)ptr;
>>>>>
>>>>>        if (ptr == MAP_FAILED) {
>>>>>            fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>>                    size, strerror(errno));
>>>>>            abort();
>>>>>        }
>>>>>
>>>>>        ptr += offset;
>>>>>        total -= offset;
>>>>>
>>>>>        if (offset > 0) {
>>>>>            munmap(ptr - offset, offset);
>>>>>        }
>>>>>        if (total > size) {
>>>>>            munmap(ptr + size, total - size);
>>>>>        }
>>>>>
>>>>>        return ptr;
>>>>> }
>>>>>
>>>>> static inline int
>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>        return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>>                       * sizeof(VECTYPE)) == 0
>>>>>                && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>> }
>>>>>
>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>> {
>>>>>        const VECTYPE *p = buf;
>>>>>        const VECTYPE zero = (VECTYPE){0};
>>>>>        size_t i;
>>>>>
>>>>>        if (!len) {
>>>>>            return 0;
>>>>>        }
>>>>>
>>>>>        assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>>
>>>>>        for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>>            if (!ALL_EQ(p[i], zero)) {
>>>>>                return i * sizeof(VECTYPE);
>>>>>            }
>>>>>        }
>>>>>
>>>>>        for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>>             i < len / sizeof(VECTYPE);
>>>>>             i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>>            VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>>            VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>>            VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>>            VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>>            VECTYPE tmp01 = tmp0 | tmp1;
>>>>>            VECTYPE tmp23 = tmp2 | tmp3;
>>>>>            if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>>                break;
>>>>>            }
>>>>>        }
>>>>>
>>>>>        return i * sizeof(VECTYPE);
>>>>> }
>>>>>
>>>>> int main()
>>>>> {
>>>>>         //char *x = malloc(1024 << 20);
>>>>>         char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>>
>>>>>         int i, j;
>>>>>         int ret = 0;
>>>>>         struct rusage rusage;
>>>>>         for (i = 0; i < 500; i ++) {
>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>                  ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>>> + j), 4096);
>>>>>             }
>>>>>             getrusage( RUSAGE_SELF, &rusage );
>>>>>             printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>>> rusage.ru_maxrss);
>>>>>             getchar();
>>>>>         }
>>>>>         printf("%d zero pages\n", ret);
>>>>> }
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
>
Alexey Kardashevskiy - June 9, 2013, 3:01 a.m.
On 06/09/2013 12:52 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2013-06-09 at 12:34 +1000, Alexey Kardashevskiy wrote:
> 
>> It is _live_ migration, the source sends changes, same pages can change and
>> be sent several times. So we would need to turn tracking on on the
>> destination to know if some page was received from the source or changed by
>> the destination itself (by writing there bios/firmware images, etc) and
>> then clear pages which were touched by the destination and were not sent by
>> the source.
> 
> Or we can set some kind of flag so that when creating a "migration
> target" VM we don't load all these things into memory.

How would we do that? The platform initialization code does not have a clue
whether it is going to receive a migrated host or not.


>> Or we do not make guesses, the source sends everything and the destination
>> simply checks if a page which is empty on the source is empty on the
>> destination and avoid writing zeroes to it. Looks simpler to me and this is
>> what the new patch does.
> 
> But you end up sending a lot of zero's ... is the migration compressed
> (I am not familiar with it at all) ? If it is, that shouldn't be a big
> deal, but else it feels to me that you should be able to send a special
> packet instead that says "all zeros" because you'll potentially have an
> awful lot of these.

It is compressed exactly as you described..


> 
> Ben.
> 
>>>
>>>>
>>>>> Also, you mean following code is from qemu and it does not allocate
>>>>> memory with you gcc right? Maybe it is related to KVM, how about
>>>>> turn off KVM and retry following code in qemu?
>>>>>
>>>>>> #include <stdio.h>
>>>>>> #include <stdlib.h>
>>>>>> #include <assert.h>
>>>>>> #include <unistd.h>
>>>>>> #include <sys/resource.h>
>>>>>> #include <inttypes.h>
>>>>>> #include <string.h>
>>>>>> #include <sys/mman.h>
>>>>>> #include <errno.h>
>>>>>>
>>>>>> #if defined __SSE2__
>>>>>> #include <emmintrin.h>
>>>>>> #define VECTYPE        __m128i
>>>>>> #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>>>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) ==
>>>>>> 0xFFFF)
>>>>>> #else
>>>>>> #define VECTYPE        unsigned long
>>>>>> #define SPLAT(p)       (*(p) * (~0UL / 255))
>>>>>> #define ALL_EQ(v1, v2) ((v1) == (v2))
>>>>>> #endif
>>>>>>
>>>>>> #define BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR 8
>>>>>>
>>>>>> /* Round number down to multiple */
>>>>>> #define QEMU_ALIGN_DOWN(n, m) ((n) / (m) * (m))
>>>>>>
>>>>>> /* Round number up to multiple */
>>>>>> #define QEMU_ALIGN_UP(n, m) QEMU_ALIGN_DOWN((n) + (m) - 1, (m))
>>>>>>
>>>>>> #define QEMU_VMALLOC_ALIGN (256 * 4096)
>>>>>>
>>>>>> /* alloc shared memory pages */
>>>>>> void *qemu_anon_ram_alloc(size_t size)
>>>>>> {
>>>>>>       size_t align = QEMU_VMALLOC_ALIGN;
>>>>>>       size_t total = size + align - getpagesize();
>>>>>>       void *ptr = mmap(0, total, PROT_READ | PROT_WRITE,
>>>>>>                        MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
>>>>>>       size_t offset = QEMU_ALIGN_UP((uintptr_t)ptr, align) -
>>>>>> (uintptr_t)ptr;
>>>>>>
>>>>>>       if (ptr == MAP_FAILED) {
>>>>>>           fprintf(stderr, "Failed to allocate %zu B: %s\n",
>>>>>>                   size, strerror(errno));
>>>>>>           abort();
>>>>>>       }
>>>>>>
>>>>>>       ptr += offset;
>>>>>>       total -= offset;
>>>>>>
>>>>>>       if (offset > 0) {
>>>>>>           munmap(ptr - offset, offset);
>>>>>>       }
>>>>>>       if (total > size) {
>>>>>>           munmap(ptr + size, total - size);
>>>>>>       }
>>>>>>
>>>>>>       return ptr;
>>>>>> }
>>>>>>
>>>>>> static inline int
>>>>>> can_use_buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>>> {
>>>>>>       return (len % (BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR
>>>>>>                      * sizeof(VECTYPE)) == 0
>>>>>>               && ((uintptr_t) buf) % sizeof(VECTYPE) == 0);
>>>>>> }
>>>>>>
>>>>>> size_t buffer_find_nonzero_offset(const void *buf, size_t len)
>>>>>> {
>>>>>>       const VECTYPE *p = buf;
>>>>>>       const VECTYPE zero = (VECTYPE){0};
>>>>>>       size_t i;
>>>>>>
>>>>>>       if (!len) {
>>>>>>           return 0;
>>>>>>       }
>>>>>>
>>>>>>       assert(can_use_buffer_find_nonzero_offset(buf, len));
>>>>>>
>>>>>>       for (i = 0; i < BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR; i++) {
>>>>>>           if (!ALL_EQ(p[i], zero)) {
>>>>>>               return i * sizeof(VECTYPE);
>>>>>>           }
>>>>>>       }
>>>>>>
>>>>>>       for (i = BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR;
>>>>>>            i < len / sizeof(VECTYPE);
>>>>>>            i += BUFFER_FIND_NONZERO_OFFSET_UNROLL_FACTOR) {
>>>>>>           VECTYPE tmp0 = p[i + 0] | p[i + 1];
>>>>>>           VECTYPE tmp1 = p[i + 2] | p[i + 3];
>>>>>>           VECTYPE tmp2 = p[i + 4] | p[i + 5];
>>>>>>           VECTYPE tmp3 = p[i + 6] | p[i + 7];
>>>>>>           VECTYPE tmp01 = tmp0 | tmp1;
>>>>>>           VECTYPE tmp23 = tmp2 | tmp3;
>>>>>>           if (!ALL_EQ(tmp01 | tmp23, zero)) {
>>>>>>               break;
>>>>>>           }
>>>>>>       }
>>>>>>
>>>>>>       return i * sizeof(VECTYPE);
>>>>>> }
>>>>>>
>>>>>> int main()
>>>>>> {
>>>>>>        //char *x = malloc(1024 << 20);
>>>>>>        char *x = qemu_anon_ram_alloc(1024 << 20);
>>>>>>
>>>>>>        int i, j;
>>>>>>        int ret = 0;
>>>>>>        struct rusage rusage;
>>>>>>        for (i = 0; i < 500; i ++) {
>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>                 ret += buffer_find_nonzero_offset((char*) (x + (i << 20)
>>>>>> + j), 4096);
>>>>>>            }
>>>>>>            getrusage( RUSAGE_SELF, &rusage );
>>>>>>            printf("read offset: %d kB, RSS size: %ld kB", ((i+1) << 10),
>>>>>> rusage.ru_maxrss);
>>>>>>            getchar();
>>>>>>        }
>>>>>>        printf("%d zero pages\n", ret);
>>>>>> }
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>>
> 
>
Alexey Kardashevskiy - June 9, 2013, 3:09 a.m.
On 06/09/2013 01:01 PM, Wenchao Xia wrote:
> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>
>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>
>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>> write
>>>>>>>>>>>> the zeroes?
>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>> source.
>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>
>>>>>>>>> #include <stdio.h>
>>>>>>>>> #include <stdlib.h>
>>>>>>>>> int main()
>>>>>>>>> {
>>>>>>>>>         char *x = malloc(500 << 20);
>>>>>>>>>         int i, j;
>>>>>>>>>         for (i = 0; i < 500; i += 10) {
>>>>>>>>>             for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>                  *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>             }
>>>>>>>>>             getchar();
>>>>>>>>>         }
>>>>>>>>> }
>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>
>>>>>>>> is the malloc above using mmapped memory?
>>>>>>> Yes.
>>>>>>>
>>>>>>>> which kernel version do you use?
>>>>>>> 3.9.
>>>>>>>
>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>> whatever side effects it has ;-))
>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>> overcommit in a more subtle way. :)
>>>>>>>
>>>>>>> Paolo
>>>>>> the following does also not allocate memory, but qemu does...
>>>>>>
>>>>> Hi, Peter
>>>>>     As the patch writes
>>>>>
>>>>> "not sending zero pages breaks migration if a page is zero
>>>>> at the source but not at the destination."
>>>>>
>>>>>     I don't understand why it would be trouble, shouldn't all page
>>>>> not received in dest be treated as zero pages?
>>>>
>>>>
>>>> How would the destination guest know if some page must be cleared? The
>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>> which were zero on the source side.
>>>>
>>>>
>>>    If an page was not received and destination knows that page should
>>> exist according to total size, fill it with zero at destination, would
>>> it solve the problem?
>>
>> It is _live_ migration, the source sends changes, same pages can change and
>> be sent several times. So we would need to turn tracking on on the
>> destination to know if some page was received from the source or changed by
>> the destination itself (by writing there bios/firmware images, etc) and
>> then clear pages which were touched by the destination and were not sent by
>> the source.
>   OK, I can understand the problem is, for example:
> Destination boots up with 0x0000-0xFFFF filled with bios image.
> Source forgot to send zero pages in 0x0000-0xFFFF.


The source did not forget, instead it zeroed these pages during its life
and thought that they must be zeroed at the destination already (as the
destination did not start and did not have a chance to write something there).


> After migration destination got 0x0000-0xFFFF dirty(different with
> source)

Yep. And those pages were empty on the source what made debugging very easy :)


>   Thanks for explain.
> 
>   This seems refer to the migration protocol: how should the guest treat
> unsent pages. The patch causing the problem, actually treat zero pages
> as "not to sent" at source, but another half is missing: treat "not
> received" as zero pages at destination. I guess if second half is added,
> problem is gone:
> after page transfer completed, before destination resume,
> fill zero in "not received" pages.



Make a working patch, we'll discuss it :) I do not see much acceleration
coming from there.
Wayne Xia - June 9, 2013, 3:31 a.m.
于 2013-6-9 11:09, Alexey Kardashevskiy 写道:
> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>>
>>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>>> write
>>>>>>>>>>>>> the zeroes?
>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>>> source.
>>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>>
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> int main()
>>>>>>>>>> {
>>>>>>>>>>          char *x = malloc(500 << 20);
>>>>>>>>>>          int i, j;
>>>>>>>>>>          for (i = 0; i < 500; i += 10) {
>>>>>>>>>>              for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>>                   *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>>              }
>>>>>>>>>>              getchar();
>>>>>>>>>>          }
>>>>>>>>>> }
>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>>
>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>> Yes.
>>>>>>>>
>>>>>>>>> which kernel version do you use?
>>>>>>>> 3.9.
>>>>>>>>
>>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>>> whatever side effects it has ;-))
>>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>>> overcommit in a more subtle way. :)
>>>>>>>>
>>>>>>>> Paolo
>>>>>>> the following does also not allocate memory, but qemu does...
>>>>>>>
>>>>>> Hi, Peter
>>>>>>      As the patch writes
>>>>>>
>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>> at the source but not at the destination."
>>>>>>
>>>>>>      I don't understand why it would be trouble, shouldn't all page
>>>>>> not received in dest be treated as zero pages?
>>>>>
>>>>>
>>>>> How would the destination guest know if some page must be cleared? The
>>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>>> which were zero on the source side.
>>>>>
>>>>>
>>>>     If an page was not received and destination knows that page should
>>>> exist according to total size, fill it with zero at destination, would
>>>> it solve the problem?
>>>
>>> It is _live_ migration, the source sends changes, same pages can change and
>>> be sent several times. So we would need to turn tracking on on the
>>> destination to know if some page was received from the source or changed by
>>> the destination itself (by writing there bios/firmware images, etc) and
>>> then clear pages which were touched by the destination and were not sent by
>>> the source.
>>    OK, I can understand the problem is, for example:
>> Destination boots up with 0x0000-0xFFFF filled with bios image.
>> Source forgot to send zero pages in 0x0000-0xFFFF.
>
>
> The source did not forget, instead it zeroed these pages during its life
> and thought that they must be zeroed at the destination already (as the
> destination did not start and did not have a chance to write something there).
>
>
>> After migration destination got 0x0000-0xFFFF dirty(different with
>> source)
>
> Yep. And those pages were empty on the source what made debugging very easy :)
>
>
>>    Thanks for explain.
>>
>>    This seems refer to the migration protocol: how should the guest treat
>> unsent pages. The patch causing the problem, actually treat zero pages
>> as "not to sent" at source, but another half is missing: treat "not
>> received" as zero pages at destination. I guess if second half is added,
>> problem is gone:
>> after page transfer completed, before destination resume,
>> fill zero in "not received" pages.
>
>
>
> Make a working patch, we'll discuss it :) I do not see much acceleration
> coming from there.
>
>
   4k zero page is compressed into header: 8 bytes flag + 1 byte tail +
( 1 + strlen(idstr) when ramblock is a new one), so take 10 bytes
as average, ram:network flow is 4000:10 = 400:1
   Then for a typical 4GB guest, sending the zero pages will take about
10M network flow, indeed not much acceleration. I think current method
is already good enough, unless there are other benefits in not sending
zero pages.
pingfan liu - June 9, 2013, 4:12 a.m.
Hi Peter,

Is it that sending zero page mostly service the first iteration, ie
bluk-stage? And for the subsequent iteration, dirty pages are normally
not zero.

Thanks

On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote:
>
> Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>
>> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>>>
>>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>
>>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>>> Still using 3.2, but strange enough the above example is also not
>>>>> increasing RSS size for me.
>>>>>
>>>>> Can you try the following:
>>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>>> and migrate it. Before migration RSS Size os somewhat
>>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>>>
>>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>>>
>>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>>>
>>> paolo, alexey: can you please verify the following works for you:
>>> https://github.com/plieven/qemu/tree/fix-migration
>>
>> These two?
>> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
>> overwrite zero pages
>> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
>> not sent zero pages in bulk stage"
>
> Yes, sorry forgot to mention this.
>
>>
>> That works for me (qemu 1.5, kernel 3.9-rc2).
>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
> Thank you,
> Peter
Peter Lieven - June 9, 2013, 7:22 a.m.
Am 09.06.2013 um 06:12 schrieb liu ping fan <qemulist@gmail.com>:

> Hi Peter,
> 
> Is it that sending zero page mostly service the first iteration, ie
> bluk-stage? And for the subsequent iteration, dirty pages are normally
> not zero.
> 

Yes most Zero Pages are sent during bulk stage except for busy windows guests or linux with page sanitization.
In these cases freed Memory is zeroed.

Peter

> Thanks
> 
> On Wed, Jun 5, 2013 at 2:09 PM, Peter Lieven <pl@kamp.de> wrote:
>> 
>> Am 05.06.2013 um 05:37 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
>> 
>>> On 06/05/2013 05:15 AM, Peter Lieven wrote:
>>>> 
>>>> Am 04.06.2013 um 17:17 schrieb Paolo Bonzini <pbonzini@redhat.com>:
>>>> 
>>>>> Il 04/06/2013 16:48, Peter Lieven ha scritto:
>>>>>> Still using 3.2, but strange enough the above example is also not
>>>>>> increasing RSS size for me.
>>>>>> 
>>>>>> Can you try the following:
>>>>>> qemu git master with 1G of memory (hanging in bios with no boot device)
>>>>>> and migrate it. Before migration RSS Size os somewhat
>>>>>> around 16MB. After migration its RSS size is in the order of 1G.
>>>>> 
>>>>> That may be a kernel bug.  The kernel did not do the copy-on-write trick
>>>>> on huge zero pages.  It was fixed last year, maybe 3.2 is not enough.
>>>>> Try adding a MADV_HUGEPAGE madvise to the testcase and see if it reproduces.
>>>> 
>>>> that's it. thanks for the pointer. the huge zero page was introduced in 3.8.
>>>> 
>>>> paolo, alexey: can you please verify the following works for you:
>>>> https://github.com/plieven/qemu/tree/fix-migration
>>> 
>>> These two?
>>> 848b796 Tue Jun 4 14:43:04 2013 +0200 Peter Lieven migration: do not
>>> overwrite zero pages
>>> 2206ac8 Tue Jun 4 14:25:33 2013 +0200 Peter Lieven Revert "migration: do
>>> not sent zero pages in bulk stage"
>> 
>> Yes, sorry forgot to mention this.
>> 
>>> 
>>> That works for me (qemu 1.5, kernel 3.9-rc2).
>>> Tested-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> 
>> Thank you,
>> Peter
Peter Lieven - June 9, 2013, 7:27 a.m.
Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:

> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>> You could also scan the page for nonzero values before
>>>>>>>>>>>>>>> writing it.
>>>>>>>>>>>>> i had this in mind, but then choosed the other approach.... turned
>>>>>>>>>>>>> out to be a bad idea.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> alexey: i will prepare a patch later today, could you then please
>>>>>>>>>>>>> verify it fixes your problem.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> paolo: would we still need the madvise or is it enough to not
>>>>>>>>>>>>> write
>>>>>>>>>>>>> the zeroes?
>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>> Problem: checking the pages for zero allocates them. even at the
>>>>>>>>>>> source.
>>>>>>>>>> It doesn't look like.  I tried this program and top doesn't show an
>>>>>>>>>> increasing amount of reserved memory:
>>>>>>>>>> 
>>>>>>>>>> #include <stdio.h>
>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>> int main()
>>>>>>>>>> {
>>>>>>>>>>        char *x = malloc(500 << 20);
>>>>>>>>>>        int i, j;
>>>>>>>>>>        for (i = 0; i < 500; i += 10) {
>>>>>>>>>>            for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>>                 *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>>            }
>>>>>>>>>>            getchar();
>>>>>>>>>>        }
>>>>>>>>>> }
>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>> None of the three top values change, and only VIRT is >500 MB.
>>>>>>>> 
>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>> Yes.
>>>>>>>> 
>>>>>>>>> which kernel version do you use?
>>>>>>>> 3.9.
>>>>>>>> 
>>>>>>>>> what avoids allocating the memory for me is the following (with
>>>>>>>>> whatever side effects it has ;-))
>>>>>>>> This would also fail to migrate any page that is swapped out, breaking
>>>>>>>> overcommit in a more subtle way. :)
>>>>>>>> 
>>>>>>>> Paolo
>>>>>>> the following does also not allocate memory, but qemu does...
>>>>>> Hi, Peter
>>>>>>    As the patch writes
>>>>>> 
>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>> at the source but not at the destination."
>>>>>> 
>>>>>>    I don't understand why it would be trouble, shouldn't all page
>>>>>> not received in dest be treated as zero pages?
>>>>> 
>>>>> 
>>>>> How would the destination guest know if some page must be cleared? The
>>>>> previous patch (which Peter reverted) did not send anything for the pages
>>>>> which were zero on the source side.
>>>>   If an page was not received and destination knows that page should
>>>> exist according to total size, fill it with zero at destination, would
>>>> it solve the problem?
>>> 
>>> It is _live_ migration, the source sends changes, same pages can change and
>>> be sent several times. So we would need to turn tracking on on the
>>> destination to know if some page was received from the source or changed by
>>> the destination itself (by writing there bios/firmware images, etc) and
>>> then clear pages which were touched by the destination and were not sent by
>>> the source.
>>  OK, I can understand the problem is, for example:
>> Destination boots up with 0x0000-0xFFFF filled with bios image.
>> Source forgot to send zero pages in 0x0000-0xFFFF.
> 
> 
> The source did not forget, instead it zeroed these pages during its life
> and thought that they must be zeroed at the destination already (as the
> destination did not start and did not have a chance to write something there).
> 
> 
>> After migration destination got 0x0000-0xFFFF dirty(different with
>> source)
> 
> Yep. And those pages were empty on the source what made debugging very easy :)
> 
> 
>>  Thanks for explain.
>> 
>>  This seems refer to the migration protocol: how should the guest treat
>> unsent pages. The patch causing the problem, actually treat zero pages
>> as "not to sent" at source, but another half is missing: treat "not
>> received" as zero pages at destination. I guess if second half is added,
>> problem is gone:
>> after page transfer completed, before destination resume,
>> fill zero in "not received" pages.
> 
> 
> 
> Make a working patch, we'll discuss it :) I do not see much acceleration
> coming from there.

I would also not spent much time with this. I would either look to find an easy way to fix the initialization code to not unneccessarily load data into RAM or i will sent a v2 of my patch following Eric's concerns.

Peter

> 
> 
> -- 
> Alexey
Alexey Kardashevskiy - June 10, 2013, 6:39 a.m.
On 06/09/2013 05:27 PM, Peter Lieven wrote:
> 
> Am 09.06.2013 um 05:09 schrieb Alexey Kardashevskiy <aik@ozlabs.ru>:
> 
>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>> 于 2013-6-9 10:34, Alexey Kardashevskiy 写道:
>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>> 于 2013-6-8 16:30, Alexey Kardashevskiy 写道:
>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>> Il 04/06/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>> Il 04/06/2013 15:52, Peter Lieven ha scritto:
>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>> Il 30/05/2013 16:38, Peter Lieven ha scritto:
>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>> i had this in mind, but then choosed the other
>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>> memory:
>>>>>>>>>>> 
>>>>>>>>>>> #include <stdio.h> #include <stdlib.h> int main() { 
>>>>>>>>>>> char *x = malloc(500 << 20); int i, j; for (i = 0; i
>>>>>>>>>>> < 500; i += 10) { for (j = 0; j < 10 << 20; j +=
>>>>>>>>>>> 4096) { *(volatile char*) (x + (i << 20) + j); } 
>>>>>>>>>>> getchar(); } }
>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>> None of the three top values change, and only VIRT is
>>>>>>>>> >500 MB.
>>>>>>>>> 
>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>> Yes.
>>>>>>>>> 
>>>>>>>>>> which kernel version do you use?
>>>>>>>>> 3.9.
>>>>>>>>> 
>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>> 
>>>>>>>>> Paolo
>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>> does...
>>>>>>> Hi, Peter As the patch writes
>>>>>>> 
>>>>>>> "not sending zero pages breaks migration if a page is zero 
>>>>>>> at the source but not at the destination."
>>>>>>> 
>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>> page not received in dest be treated as zero pages?
>>>>>> 
>>>>>> 
>>>>>> How would the destination guest know if some page must be
>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>> send anything for the pages which were zero on the source
>>>>>> side.
>>>>> If an page was not received and destination knows that page
>>>>> should exist according to total size, fill it with zero at
>>>>> destination, would it solve the problem?
>>>> 
>>>> It is _live_ migration, the source sends changes, same pages can
>>>> change and be sent several times. So we would need to turn
>>>> tracking on on the destination to know if some page was received
>>>> from the source or changed by the destination itself (by writing
>>>> there bios/firmware images, etc) and then clear pages which were
>>>> touched by the destination and were not sent by the source.
>>> OK, I can understand the problem is, for example: Destination boots
>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>> zero pages in 0x0000-0xFFFF.
>> 
>> 
>> The source did not forget, instead it zeroed these pages during its
>> life and thought that they must be zeroed at the destination already
>> (as the destination did not start and did not have a chance to write
>> something there).
>> 
>> 
>>> After migration destination got 0x0000-0xFFFF dirty(different with 
>>> source)
>> 
>> Yep. And those pages were empty on the source what made debugging very
>> easy :)
>> 
>> 
>>> Thanks for explain.
>>> 
>>> This seems refer to the migration protocol: how should the guest
>>> treat unsent pages. The patch causing the problem, actually treat
>>> zero pages as "not to sent" at source, but another half is missing:
>>> treat "not received" as zero pages at destination. I guess if second
>>> half is added, problem is gone: after page transfer completed,
>>> before destination resume, fill zero in "not received" pages.
>> 
>> 
>> 
>> Make a working patch, we'll discuss it :) I do not see much
>> acceleration coming from there.
> 

> I would also not spent much time with this. I would either look to find
> an easy way to fix the initialization code to not unneccessarily load
> data into RAM or i will sent a v2 of my patch following Eric's
> concerns.

There is no easy way to implement the flag and keep your original patch as
we have to implement this flag in all architectures which got broken by
your patch and I personally can fix only PPC64-pseries but not the others.

Furthermore your revert + new patches perfectly solve the problem, why
would we want to bother now with this new flag which nobody really needs
right now?

Please, please, revert the original patch or I'll try to do it :)
Paolo Bonzini - June 12, 2013, 2 p.m.
Il 08/06/2013 22:53, Benjamin Herrenschmidt ha scritto:
> On Sun, 2013-06-09 at 10:16 +0800, Wenchao Xia wrote:
>>    If an page was not received and destination knows that page should
>> exist according to total size, fill it with zero at destination, would
>> it solve the problem?
> 
> The easiest way to do that is to not write to those pages at the
> destination to begin with, when initializing the VM... Is there any way
> to know that a VM is being setup as a migration target or not ?

There is the "incoming" variable in vl.c (currently not a global), but I
suspect Peter's patch could have also broken loadvm.  It could quickly
become a rat hole.

The only bug we have is not a performance bug related to compression;
it's that writing zero pages breaks overcommit.  Let's fix that, and
only that.

Paolo
Benjamin Herrenschmidt - June 12, 2013, 2:11 p.m.
On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
> The only bug we have is not a performance bug related to compression;
> it's that writing zero pages breaks overcommit.  Let's fix that, and
> only that.

Right, do we have a way to madvise "throw away" these instead ? Or do we
have a way to track that the platform init code did write something
there and only clear *those* pages ?

Cheers,
Ben.
Paolo Bonzini - June 12, 2013, 8:10 p.m.
Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto:
> On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
>> The only bug we have is not a performance bug related to compression;
>> it's that writing zero pages breaks overcommit.  Let's fix that, and
>> only that.
> 
> Right, do we have a way to madvise "throw away" these instead ?

We already do that, but apparently that madvise is asynchronous.

> Or do we
> have a way to track that the platform init code did write something
> there and only clear *those* pages ?

No need for; since it's copy-on-write, not copy-on-read :) we can just
check for pages that are zero and not rewrite them with zeros.  That's
what Peter's patches do, I'll review them right away.

Paolo
Wayne Xia - June 13, 2013, 2:41 a.m.
于 2013-6-13 4:10, Paolo Bonzini 写道:
> Il 12/06/2013 10:11, Benjamin Herrenschmidt ha scritto:
>> On Wed, 2013-06-12 at 10:00 -0400, Paolo Bonzini wrote:
>>> The only bug we have is not a performance bug related to compression;
>>> it's that writing zero pages breaks overcommit.  Let's fix that, and
>>> only that.
>>
>> Right, do we have a way to madvise "throw away" these instead ?
>
> We already do that, but apparently that madvise is asynchronous.
>
>> Or do we
>> have a way to track that the platform init code did write something
>> there and only clear *those* pages ?
>
> No need for; since it's copy-on-write, not copy-on-read :) we can just
> check for pages that are zero and not rewrite them with zeros.  That's
   I think it is the right way to improve overcommit without breaking
anything.

> what Peter's patches do, I'll review them right away.
>
> Paolo
>



>

Patch

diff --git a/arch_init.c b/arch_init.c
index 642f241..25d20a9 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -148,6 +148,10 @@  int qemu_read_default_config_files(bool userconfig)

  static inline bool is_zero_page(uint8_t *p)
  {
+    uint8_t ret;
+    if (mincore(p, TARGET_PAGE_SIZE, &ret) == 0 && !(ret&0x1)) {
+        return 1;
+    }
      return buffer_find_nonzero_offset(p, TARGET_PAGE_SIZE) ==
          TARGET_PAGE_SIZE;
  }