Patchwork broken incoming migration

Submitter Peter Lieven
Date June 10, 2013, 6:50 a.m.
Message ID <51B57727.9080903@kamp.de>
Permalink /patch/250179/
State New

Comments

Peter Lieven - June 10, 2013, 6:50 a.m.
On 10.06.2013 08:39, Alexey Kardashevskiy wrote:
> On 06/09/2013 05:27 PM, Peter Lieven wrote:
>> On 09.06.2013 at 05:09, Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>> On 06/09/2013 01:01 PM, Wenchao Xia wrote:
>>>> On 2013-6-9 10:34, Alexey Kardashevskiy wrote:
>>>>> On 06/09/2013 12:16 PM, Wenchao Xia wrote:
>>>>>> On 2013-6-8 16:30, Alexey Kardashevskiy wrote:
>>>>>>> On 06/08/2013 06:27 PM, Wenchao Xia wrote:
>>>>>>>>> On 04.06.2013 16:40, Paolo Bonzini wrote:
>>>>>>>>>> On 04/06/2013 16:38, Peter Lieven wrote:
>>>>>>>>>>> On 04.06.2013 16:14, Paolo Bonzini wrote:
>>>>>>>>>>>> On 04/06/2013 15:52, Peter Lieven wrote:
>>>>>>>>>>>>> On 30.05.2013 16:41, Paolo Bonzini wrote:
>>>>>>>>>>>>>> On 30/05/2013 16:38, Peter Lieven wrote:
>>>>>>>>>>>>>>>>> You could also scan the page for nonzero
>>>>>>>>>>>>>>>>> values before writing it.
>>>>>>>>>>>>>>> i had this in mind, but then chose the other
>>>>>>>>>>>>>>> approach.... turned out to be a bad idea.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> alexey: i will prepare a patch later today,
>>>>>>>>>>>>>>> could you then please verify it fixes your
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> paolo: would we still need the madvise or is
>>>>>>>>>>>>>>> it enough to not write the zeroes?
>>>>>>>>>>>>>> It should be enough to not write them.
>>>>>>>>>>>>> Problem: checking the pages for zero allocates
>>>>>>>>>>>>> them. even at the source.
>>>>>>>>>>>> It doesn't look like.  I tried this program and top
>>>>>>>>>>>> doesn't show an increasing amount of reserved
>>>>>>>>>>>> memory:
>>>>>>>>>>>>
>>>>>>>>>>>> #include <stdio.h>
>>>>>>>>>>>> #include <stdlib.h>
>>>>>>>>>>>> int main() {
>>>>>>>>>>>>     char *x = malloc(500 << 20);
>>>>>>>>>>>>     int i, j;
>>>>>>>>>>>>     for (i = 0; i < 500; i += 10) {
>>>>>>>>>>>>         for (j = 0; j < 10 << 20; j += 4096) {
>>>>>>>>>>>>             *(volatile char*) (x + (i << 20) + j);
>>>>>>>>>>>>         }
>>>>>>>>>>>>         getchar();
>>>>>>>>>>>>     }
>>>>>>>>>>>> }
>>>>>>>>>>> strange. we are talking about RSS size, right?
>>>>>>>>>> None of the three top values change, and only VIRT is 500 MB.
>>>>>>>>>>> is the malloc above using mmapped memory?
>>>>>>>>>> Yes.
>>>>>>>>>>
>>>>>>>>>>> which kernel version do you use?
>>>>>>>>>> 3.9.
>>>>>>>>>>
>>>>>>>>>>> what avoids allocating the memory for me is the
>>>>>>>>>>> following (with whatever side effects it has ;-))
>>>>>>>>>> This would also fail to migrate any page that is swapped
>>>>>>>>>> out, breaking overcommit in a more subtle way. :)
>>>>>>>>>>
>>>>>>>>>> Paolo
>>>>>>>>> the following does also not allocate memory, but qemu
>>>>>>>>> does...
>>>>>>>> Hi, Peter As the patch writes
>>>>>>>>
>>>>>>>> "not sending zero pages breaks migration if a page is zero
>>>>>>>> at the source but not at the destination."
>>>>>>>>
>>>>>>>> I don't understand why it would be trouble, shouldn't all
>>>>>>>> page not received in dest be treated as zero pages?
>>>>>>>
>>>>>>> How would the destination guest know if some page must be
>>>>>>> cleared? The previous patch (which Peter reverted) did not
>>>>>>> send anything for the pages which were zero on the source
>>>>>>> side.
>>>>>> If a page was not received and the destination knows that page
>>>>>> should exist according to the total size, fill it with zeros at
>>>>>> the destination; would that solve the problem?
>>>>> It is _live_ migration, the source sends changes, same pages can
>>>>> change and be sent several times. So we would need to turn
>>>>> tracking on on the destination to know if some page was received
>>>>> from the source or changed by the destination itself (by writing
>>>>> there bios/firmware images, etc) and then clear pages which were
>>>>> touched by the destination and were not sent by the source.
>>>> OK, I can understand the problem is, for example: Destination boots
>>>> up with 0x0000-0xFFFF filled with bios image. Source forgot to send
>>>> zero pages in 0x0000-0xFFFF.
>>>
>>> The source did not forget, instead it zeroed these pages during its
>>> life and thought that they must be zeroed at the destination already
>>> (as the destination did not start and did not have a chance to write
>>> something there).
>>>
>>>
>>>> After migration destination got 0x0000-0xFFFF dirty(different with
>>>> source)
>>> Yep. And those pages were empty on the source what made debugging very
>>> easy :)
>>>
>>>
>>>> Thanks for the explanation.
>>>>
>>>> This seems to concern the migration protocol: how should the guest
>>>> treat unsent pages? The patch causing the problem treats zero pages
>>>> as "not to be sent" at the source, but the other half is missing:
>>>> treating "not received" pages as zero pages at the destination. I
>>>> guess if the second half is added, the problem is gone: after page
>>>> transfer completes and before the destination resumes, fill the
>>>> "not received" pages with zeros.
>>>
>>>
>>> Make a working patch, we'll discuss it :) I do not see much
>>> acceleration coming from there.
>> I would also not spend much time on this. I would either look for
>> an easy way to fix the initialization code to not unnecessarily load
>> data into RAM, or I will send a v2 of my patch following Eric's
>> concerns.
> There is no easy way to implement the flag and keep your original patch as
> we have to implement this flag in all architectures which got broken by
> your patch and I personally can fix only PPC64-pseries but not the others.
>
> Furthermore your revert + new patches perfectly solve the problem, why
> would we want to bother now with this new flag which nobody really needs
> right now?
>
> Please, please, revert the original patch or I'll try to do it :)
>
>
I tried, but there were concerns from the community. Alternatively, I found
the following solution. Please drop the 2 patches and try the
following:


This is done at setup time, so there is no additional cost for zero checking
on each compressed page coming in.

Peter
Alexey Kardashevskiy - June 10, 2013, 6:55 a.m.
On 06/10/2013 04:50 PM, Peter Lieven wrote:
> I tried, but there were concerns from the community.


Was there anybody who did not want to revert the patch (besides you)?
I did not notice.


> Alternatively, I found
> the following solution. Please drop the 2 patches and try the
> following:


How is it going to work if upstream QEMU doesn't send anything about empty
pages at all (this is why I want to revert that patch)?


> 
> diff --git a/arch_init.c b/arch_init.c
> index 5d32ecf..458bf8c 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -799,6 +799,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>                  while (total_ram_bytes) {
>                      RAMBlock *block;
>                      uint8_t len;
> +                    void *base;
> +                    ram_addr_t offset;
> 
>                      len = qemu_get_byte(f);
>                      qemu_get_buffer(f, (uint8_t *)id, len);
> @@ -822,6 +824,14 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
>                          goto done;
>                      }
> 
> +                    base = memory_region_get_ram_ptr(block->mr);
> +                    for (offset = 0; offset < block->length;
> +                         offset += TARGET_PAGE_SIZE) {
> +                        if (!is_zero_page(base + offset)) {
> +                            memset(base + offset, 0x00, TARGET_PAGE_SIZE);
> +                        }
> +                    }
> +
>                      total_ram_bytes -= length;
>                  }
>              }
> 
> This is done at setup time so there is no additional cost for zero checking
> at each compressed page coming in.
> 
> Peter
Peter Lieven - June 10, 2013, 8:44 a.m.
On 10.06.2013 08:55, Alexey Kardashevskiy wrote:
> On 06/10/2013 04:50 PM, Peter Lieven wrote:
>> I tried, but there were concerns from the community.
>
> Was there anybody who did not want to revert the patch (besides you)?
> I did not notice.
Eric said I should not drop the skipped_pages stuff in the monitor.
>
>
>> Alternatively, I found
>> the following solution. Please drop the 2 patches and try the
>> following:
>
> How is it going to work if upstream QEMU doesn't send anything about empty
> pages at all (this is why I want to revert that patch)?
I do not understand your question. The patch below zeroes out the destination
memory if it is not zero (e.g. if there is a BIOS copied to memory already during
machine init).

I would prefer not to completely drop the patch since it saves bandwidth and
resources.

Peter
Alexey Kardashevskiy - June 10, 2013, 9:10 a.m.
On 06/10/2013 06:44 PM, Peter Lieven wrote:
> On 10.06.2013 08:55, Alexey Kardashevskiy wrote:
>> On 06/10/2013 04:50 PM, Peter Lieven wrote:
>>> I tried, but there were concerns from the community.
>>
>> Was there anybody who did not want to revert the patch (besides you)?
>> I did not notice.
> Eric said I should not drop the skipped_pages stuff in the monitor.
>>
>>
>>> Alternatively, I found
>>> the following solution. Please drop the 2 patches and try the
>>> following:
>>
>> How is it going to work if upstream QEMU doesn't send anything about empty
>> pages at all (this is why I want to revert that patch)?
> I do not understand your question. The patch below zeroes out the destination
> memory if it is not zero (e.g. if there is a BIOS copied to memory already
> during machine init).
> 
> I would prefer not to completely drop the patch since it saves bandwidth and
> resources.

I would like migration to do what it should do - send pages no matter what,
this is exactly what migration is for. If there are many, many empty pages
(which I doubt is a very common real-life case), they could all be merged into
big consecutive chunks and sent at the end of migration.
Benjamin Herrenschmidt - June 10, 2013, 9:33 a.m.
On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote:
> > I would prefer not to completely drop the patch since it saves bandwidth and
> > resources.
> 
> I would like migration to do what it should do - send pages no matter what,
> this is exactly what migration is for. If there are many, many empty pages
> (which I doubt is a very common real-life case), they could all be merged into
> big consecutive chunks and sent at the end of migration.

I tend to agree. The problem of sending empty pages is purely a problem of
compression. If the current mechanism is deemed "not efficient enough" for
the case of having lots of zero-pages, then by all means invent a better
packet format for more tightly representing them on the wire, but don't
break things by not sending them at all.

Cheers,
Ben.
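
One way to picture the tighter packet format suggested above, as a rough sketch only: describe each run of consecutive zero pages with a single record instead of one record per page, and have the destination clear those ranges explicitly. Nothing here is QEMU's wire format; `struct zero_run`, `find_zero_runs`, `apply_zero_runs`, `page_is_zero`, and the 4096-byte `PAGE_SIZE` are all names and values assumed for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumption: illustrative page size */

/* Hypothetical record: one entry covers a whole run of zero pages. */
struct zero_run {
    uint64_t first_page;  /* index of the first zero page in the run */
    uint64_t page_count;  /* number of consecutive zero pages */
};

/* Hypothetical helper: returns 1 if every byte of the page is zero. */
static int page_is_zero(const uint8_t *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (page[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* Source side: collect runs of consecutive zero pages.
 * Returns the number of records written (at most max_runs). */
static size_t find_zero_runs(const uint8_t *ram, size_t num_pages,
                             struct zero_run *runs, size_t max_runs)
{
    size_t n = 0;
    size_t page = 0;

    while (page < num_pages && n < max_runs) {
        if (!page_is_zero(ram + page * PAGE_SIZE)) {
            page++;
            continue;
        }
        size_t start = page;
        while (page < num_pages && page_is_zero(ram + page * PAGE_SIZE)) {
            page++;
        }
        runs[n].first_page = start;
        runs[n].page_count = page - start;
        n++;
    }
    return n;
}

/* Destination side: the pages are still "sent", just compactly, so the
 * destination clears them no matter what it held before. */
static void apply_zero_runs(uint8_t *ram, size_t num_pages,
                            const struct zero_run *runs, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(runs[i].first_page + runs[i].page_count <= num_pages);
        memset(ram + runs[i].first_page * PAGE_SIZE, 0,
               runs[i].page_count * PAGE_SIZE);
    }
}
```

Each run costs one record regardless of its length, so the saving depends entirely on how often zero pages are actually adjacent.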
Peter Lieven - June 10, 2013, 9:42 a.m.
On 10.06.2013 11:33, Benjamin Herrenschmidt wrote:
> On Mon, 2013-06-10 at 19:10 +1000, Alexey Kardashevskiy wrote:
>>> I would prefer not to completely drop the patch since it saves bandwidth and
>>> resources.
>> I would like migration to do what it should do - send pages no matter what,
>> this is exactly what migration is for. If there are many, many empty pages
>> (which I doubt is a very common real-life case), they could all be merged into
>> big consecutive chunks and sent at the end of migration.
> I tend to agree. The problem of sending empty pages is purely a problem of
> compression. If the current mechanism is deemed "not efficient enough" for
> the case of having lots of zero-pages, then by all means invent a better
> packet format for more tightly representing them on the wire, but don't
> break things by not sending them at all.

Ok, I see the point. I think the paradigm to say that the destination
should "decide" if it needs a page or not is a sound one.

Zero pages are quite common, depending on the lifetime and the operating
system used. But a consecutive range of zero pages is only likely
in the bulk stage. I don't know if it's reasonable to add a special encoding
for that.

I will send a v2 of my previous revert patch addressing Eric's concerns shortly.

Peter

Patch

diff --git a/arch_init.c b/arch_init.c
index 5d32ecf..458bf8c 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -799,6 +799,8 @@  static int ram_load(QEMUFile *f, void *opaque, int version_id)
                  while (total_ram_bytes) {
                      RAMBlock *block;
                      uint8_t len;
+                    void *base;
+                    ram_addr_t offset;

                      len = qemu_get_byte(f);
                      qemu_get_buffer(f, (uint8_t *)id, len);
@@ -822,6 +824,14 @@  static int ram_load(QEMUFile *f, void *opaque, int version_id)
                          goto done;
                      }

+                    base = memory_region_get_ram_ptr(block->mr);
+                    for (offset = 0; offset < block->length;
+                         offset += TARGET_PAGE_SIZE) {
+                        if (!is_zero_page(base + offset)) {
+                            memset(base + offset, 0x00, TARGET_PAGE_SIZE);
+                        }
+                    }
+
                      total_ram_bytes -= length;
                  }
              }
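
To see what the loop in the patch is meant to prevent, here is a minimal stand-alone sketch, not QEMU code. It models guest RAM as a plain byte array and assumes a source that skips zero pages entirely, as discussed in the thread. `PAGE_SIZE`, `page_is_zero`, `clear_nonzero_pages`, and `migrate_skipping_zero_pages` are hypothetical names that stand in for `TARGET_PAGE_SIZE`, `is_zero_page`, and the migration stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u  /* assumption: stands in for TARGET_PAGE_SIZE */

/* Hypothetical helper: returns 1 if every byte of the page is zero. */
static int page_is_zero(const uint8_t *page)
{
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (page[i] != 0) {
            return 0;
        }
    }
    return 1;
}

/* Mirrors the loop in the patch: clear only pages that hold data, so
 * pages that are already zero are read but never written.
 * Returns the number of pages that had to be cleared. */
static size_t clear_nonzero_pages(uint8_t *base, size_t length)
{
    size_t cleared = 0;

    assert(length % PAGE_SIZE == 0);
    for (size_t offset = 0; offset < length; offset += PAGE_SIZE) {
        if (!page_is_zero(base + offset)) {
            memset(base + offset, 0, PAGE_SIZE);
            cleared++;
        }
    }
    return cleared;
}

/* Toy model of a source that skips zero pages: only non-zero pages
 * are copied, so stale data on the destination survives. */
static void migrate_skipping_zero_pages(const uint8_t *src, uint8_t *dst,
                                        size_t length)
{
    assert(length % PAGE_SIZE == 0);
    for (size_t offset = 0; offset < length; offset += PAGE_SIZE) {
        if (!page_is_zero(src + offset)) {
            memcpy(dst + offset, src + offset, PAGE_SIZE);
        }
    }
}
```

If the destination buffer holds a firmware byte on a page that is zero at the source, `migrate_skipping_zero_pages` alone leaves the two buffers different. Calling `clear_nonzero_pages` on the destination first makes them match in this toy model.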