diff mbox

[RFC] mem-prealloc: Reduce large guest start-up and migration time.

Message ID 1483601042-6435-1-git-send-email-jitendra.kolhe@hpe.com
State New
Headers show

Commit Message

Jitendra Kolhe Jan. 5, 2017, 7:24 a.m. UTC
Using "-mem-prealloc" option for a very large guest leads to huge guest
start-up and migration time. This is because with "-mem-prealloc" option
qemu tries to map every guest page (create address translations), and
make sure the pages are available during runtime. virsh/libvirt by
default, seems to use "-mem-prealloc" option in case the guest is
configured to use huge pages. The patch tries to map all guest pages
simultaneously by spawning multiple threads. Given the problem is more
prominent for large guests, the patch limits the changes to the guests
of at-least 64GB of memory size. Currently limiting the change to QEMU
library functions on POSIX compliant host only, as we are not sure if
the problem exists on win32. Below are some stats with "-mem-prealloc"
option for guest configured to use huge pages.

------------------------------------------------------------------------
Idle Guest      | Start-up time | Migration time
------------------------------------------------------------------------
Guest stats with 2M HugePage usage - single threaded (existing code)
------------------------------------------------------------------------
64 Core - 4TB   | 54m11.796s    | 75m43.843s
64 Core - 1TB   | 8m56.576s     | 14m29.049s
64 Core - 256GB | 2m11.245s     | 3m26.598s
------------------------------------------------------------------------
Guest stats with 2M HugePage usage - map guest pages using 8 threads
------------------------------------------------------------------------
64 Core - 4TB   | 5m1.027s      | 34m10.565s
64 Core - 1TB   | 1m10.366s     | 8m28.188s
64 Core - 256GB | 0m19.040s     | 2m10.148s
-----------------------------------------------------------------------
Guest stats with 2M HugePage usage - map guest pages using 16 threads
-----------------------------------------------------------------------
64 Core - 4TB   | 1m58.970s     | 31m43.400s
64 Core - 1TB   | 0m39.885s     | 7m55.289s
64 Core - 256GB | 0m11.960s     | 2m0.135s
-----------------------------------------------------------------------

Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
---
 util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 61 insertions(+), 3 deletions(-)

Comments

Juan Quintela Jan. 27, 2017, 12:53 p.m. UTC | #1
Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time. This is because with "-mem-prealloc" option
> qemu tries to map every guest page (create address translations), and
> make sure the pages are available during runtime. virsh/libvirt by
> default, seems to use "-mem-prealloc" option in case the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given the problem is more
> prominent for large guests, the patch limits the changes to the guests
> of at-least 64GB of memory size. Currently limiting the change to QEMU
> library functions on POSIX compliant host only, as we are not sure if
> the problem exists on win32. Below are some stats with "-mem-prealloc"
> option for guest configured to use huge pages.
>
> ------------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
                    ^^^^^^^^^^

> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> -----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> -----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
                    ^^^^^^^^^

Impressive, not everyday one get an speedup of 20 O:-)


> +static void *do_touch_pages(void *arg)
> +{
> +    PageRange *range = (PageRange *)arg;
> +    char *start_addr = range->addr;
> +    uint64_t numpages = range->numpages;
> +    uint64_t hpagesize = range->hpagesize;
> +    uint64_t i = 0;
> +
> +    for (i = 0; i < numpages; i++) {
> +        memset(start_addr + (hpagesize * i), 0, 1);

I would use the range->addr and similar here directly, but it is just a
question of taste.

> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +        /* touch pages simultaneously for memory >= 64G */
> +        if (memory < (1ULL << 36)) {

64GB guest already took quite a bit of time, I think I would put it
always as min(num_vcpus, 16).  So, we always execute the multiple theard
codepath?

But very nice, thanks.

Later, Juan.
Dr. David Alan Gilbert Jan. 27, 2017, 1:03 p.m. UTC | #2
* Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time. This is because with "-mem-prealloc" option
> qemu tries to map every guest page (create address translations), and
> make sure the pages are available during runtime. virsh/libvirt by
> default, seems to use "-mem-prealloc" option in case the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given the problem is more
> prominent for large guests, the patch limits the changes to the guests
> of at-least 64GB of memory size. Currently limiting the change to QEMU
> library functions on POSIX compliant host only, as we are not sure if
> the problem exists on win32. Below are some stats with "-mem-prealloc"
> option for guest configured to use huge pages.
> 
> ------------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> -----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> -----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> -----------------------------------------------------------------------

That's a nice improvement.

> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
>  util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 61 insertions(+), 3 deletions(-)
> 
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index f631464..a8bd7c2 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -55,6 +55,13 @@
>  #include "qemu/error-report.h"
>  #endif
>  
> +#define PAGE_TOUCH_THREAD_COUNT 8

It seems a shame to fix that number as a constant.

> +typedef struct {
> +    char *addr;
> +    uint64_t numpages;
> +    uint64_t hpagesize;
> +} PageRange;
> +
>  int qemu_get_thread_id(void)
>  {
>  #if defined(__linux__)
> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>      siglongjmp(sigjump, 1);
>  }
>  
> +static void *do_touch_pages(void *arg)
> +{
> +    PageRange *range = (PageRange *)arg;
> +    char *start_addr = range->addr;
> +    uint64_t numpages = range->numpages;
> +    uint64_t hpagesize = range->hpagesize;
> +    uint64_t i = 0;
> +
> +    for (i = 0; i < numpages; i++) {
> +        memset(start_addr + (hpagesize * i), 0, 1);
> +    }
> +    qemu_thread_exit(NULL);
> +
> +    return NULL;
> +}
> +
> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
> +{
> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
> +    uint64_t    numpage_per_thread, size_per_thread;
> +    int         i = 0, tcount = 0;
> +
> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
> +    size_per_thread = (hpagesize * numpage_per_thread);
> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
> +        page_range[i].addr = area;
> +        page_range[i].numpages = numpage_per_thread;
> +        page_range[i].hpagesize = hpagesize;
> +
> +        qemu_thread_create(page_threads + i, "touch_pages",
> +                           do_touch_pages, (page_range + i),
> +                           QEMU_THREAD_JOINABLE);
> +        tcount++;
> +        area += size_per_thread;
> +        numpages -= numpage_per_thread;
> +    }
> +    for (i = 0; i < numpages; i++) {
> +        memset(area + (hpagesize * i), 0, 1);
> +    }
> +    for (i = 0; i < tcount; i++) {
> +        qemu_thread_join(page_threads + i);
> +    }
> +    return 0;
> +}
> +
>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>  {
>      int ret;
> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>          size_t hpagesize = qemu_fd_getpagesize(fd);
>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>  
> -        /* MAP_POPULATE silently ignores failures */
> -        for (i = 0; i < numpages; i++) {
> -            memset(area + (hpagesize * i), 0, 1);
> +        /* touch pages simultaneously for memory >= 64G */
> +        if (memory < (1ULL << 36)) {
> +            /* MAP_POPULATE silently ignores failures */
> +            for (i = 0; i < numpages; i++) {
> +                memset(area + (hpagesize * i), 0, 1);
> +            }
> +        } else {
> +            touch_all_pages(area, hpagesize, numpages);
>          }
>      }

Maybe it's possible to do this quicker?
If we are using NUMA, and have separate memory-blocks for each NUMA node,
wont this call os_mem_prealloc separately for each node?
I wonder if it's possible to get that to run in parallel?

Dave

> -- 
> 1.8.3.1
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Paolo Bonzini Jan. 27, 2017, 1:06 p.m. UTC | #3
On 27/01/2017 13:53, Juan Quintela wrote:
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
> 
> I would use the range->addr and similar here directly, but it is just a
> question of taste.
> 
>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
> 
> 64GB guest already took quite a bit of time, I think I would put it
> always as min(num_vcpus, 16).  So, we always execute the multiple theard
> codepath?

I too would like some kind of heuristic to choose the number of threads.
 Juan's suggested usage of the VCPUs (smp_cpus) is a good one.

Paolo
Daniel P. Berrangé Jan. 27, 2017, 1:26 p.m. UTC | #4
On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
> Using "-mem-prealloc" option for a very large guest leads to huge guest
> start-up and migration time. This is because with "-mem-prealloc" option
> qemu tries to map every guest page (create address translations), and
> make sure the pages are available during runtime. virsh/libvirt by
> default, seems to use "-mem-prealloc" option in case the guest is
> configured to use huge pages. The patch tries to map all guest pages
> simultaneously by spawning multiple threads. Given the problem is more
> prominent for large guests, the patch limits the changes to the guests
> of at-least 64GB of memory size. Currently limiting the change to QEMU
> library functions on POSIX compliant host only, as we are not sure if
> the problem exists on win32. Below are some stats with "-mem-prealloc"
> option for guest configured to use huge pages.
> 
> ------------------------------------------------------------------------
> Idle Guest      | Start-up time | Migration time
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - single threaded (existing code)
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
> 64 Core - 256GB | 2m11.245s     | 3m26.598s
> ------------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 8 threads
> ------------------------------------------------------------------------
> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
> 64 Core - 256GB | 0m19.040s     | 2m10.148s
> -----------------------------------------------------------------------
> Guest stats with 2M HugePage usage - map guest pages using 16 threads
> -----------------------------------------------------------------------
> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
> 64 Core - 256GB | 0m11.960s     | 2m0.135s
> -----------------------------------------------------------------------

For comparison, what is performance like if you replace memset() in
the current code with a call to mlock().

IIUC, huge pages are non-swappable once allocated, so it feels like
we ought to be able to just call mlock() to preallocate them with
no downside, rather than spawning many threads to memset() them.

Of course you'd still need the memset() trick if qemu was given
non-hugepages in combination with --mem-prealloc, as you don't
want to lock normal pages into ram permanently.

Regards,
Daniel
Jitendra Kolhe Jan. 30, 2017, 8:19 a.m. UTC | #5
On 1/27/2017 6:23 PM, Juan Quintela wrote:
> Jitendra Kolhe <jitendra.kolhe@hpe.com> wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time. This is because with "-mem-prealloc" option
>> qemu tries to map every guest page (create address translations), and
>> make sure the pages are available during runtime. virsh/libvirt by
>> default, seems to use "-mem-prealloc" option in case the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given the problem is more
>> prominent for large guests, the patch limits the changes to the guests
>> of at-least 64GB of memory size. Currently limiting the change to QEMU
>> library functions on POSIX compliant host only, as we are not sure if
>> the problem exists on win32. Below are some stats with "-mem-prealloc"
>> option for guest configured to use huge pages.
>>
>> ------------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>                     ^^^^^^^^^^
> 
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> -----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> -----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>                     ^^^^^^^^^
> 
> Impressive, not everyday one get an speedup of 20 O:-)
> 
> 
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
> 
> I would use the range->addr and similar here directly, but it is just a
> question of taste.
> 

Thanks for your response, 
will update my next patch.

>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
> 
> 64GB guest already took quite a bit of time, I think I would put it
> always as min(num_vcpus, 16).  So, we always execute the multiple theard
> codepath?
> 

It sounds good idea to have a heuristic on vcpu count. I will add in my 
next version. But shouldn't we also restrict based on guest RAM size too, 
to avoid overhead spawning multiple threads for thin guests?

Thanks,
- Jitendra

> But very nice, thanks.
> 
> Later, Juan.
>
Jitendra Kolhe Jan. 30, 2017, 8:32 a.m. UTC | #6
On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
> * Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time. This is because with "-mem-prealloc" option
>> qemu tries to map every guest page (create address translations), and
>> make sure the pages are available during runtime. virsh/libvirt by
>> default, seems to use "-mem-prealloc" option in case the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given the problem is more
>> prominent for large guests, the patch limits the changes to the guests
>> of at-least 64GB of memory size. Currently limiting the change to QEMU
>> library functions on POSIX compliant host only, as we are not sure if
>> the problem exists on win32. Below are some stats with "-mem-prealloc"
>> option for guest configured to use huge pages.
>>
>> ------------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> -----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> -----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>> -----------------------------------------------------------------------
> 
> That's a nice improvement.
> 
>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>> ---
>>  util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>>  1 file changed, 61 insertions(+), 3 deletions(-)
>>
>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> index f631464..a8bd7c2 100644
>> --- a/util/oslib-posix.c
>> +++ b/util/oslib-posix.c
>> @@ -55,6 +55,13 @@
>>  #include "qemu/error-report.h"
>>  #endif
>>  
>> +#define PAGE_TOUCH_THREAD_COUNT 8
> 
> It seems a shame to fix that number as a constant.
> 

Yes, as per comments received we will update patch to incorporate vcpu count.

>> +typedef struct {
>> +    char *addr;
>> +    uint64_t numpages;
>> +    uint64_t hpagesize;
>> +} PageRange;
>> +
>>  int qemu_get_thread_id(void)
>>  {
>>  #if defined(__linux__)
>> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>>      siglongjmp(sigjump, 1);
>>  }
>>  
>> +static void *do_touch_pages(void *arg)
>> +{
>> +    PageRange *range = (PageRange *)arg;
>> +    char *start_addr = range->addr;
>> +    uint64_t numpages = range->numpages;
>> +    uint64_t hpagesize = range->hpagesize;
>> +    uint64_t i = 0;
>> +
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(start_addr + (hpagesize * i), 0, 1);
>> +    }
>> +    qemu_thread_exit(NULL);
>> +
>> +    return NULL;
>> +}
>> +
>> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
>> +{
>> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
>> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
>> +    uint64_t    numpage_per_thread, size_per_thread;
>> +    int         i = 0, tcount = 0;
>> +
>> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
>> +    size_per_thread = (hpagesize * numpage_per_thread);
>> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
>> +        page_range[i].addr = area;
>> +        page_range[i].numpages = numpage_per_thread;
>> +        page_range[i].hpagesize = hpagesize;
>> +
>> +        qemu_thread_create(page_threads + i, "touch_pages",
>> +                           do_touch_pages, (page_range + i),
>> +                           QEMU_THREAD_JOINABLE);
>> +        tcount++;
>> +        area += size_per_thread;
>> +        numpages -= numpage_per_thread;
>> +    }
>> +    for (i = 0; i < numpages; i++) {
>> +        memset(area + (hpagesize * i), 0, 1);
>> +    }
>> +    for (i = 0; i < tcount; i++) {
>> +        qemu_thread_join(page_threads + i);
>> +    }
>> +    return 0;
>> +}
>> +
>>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>  {
>>      int ret;
>> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>          size_t hpagesize = qemu_fd_getpagesize(fd);
>>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>>  
>> -        /* MAP_POPULATE silently ignores failures */
>> -        for (i = 0; i < numpages; i++) {
>> -            memset(area + (hpagesize * i), 0, 1);
>> +        /* touch pages simultaneously for memory >= 64G */
>> +        if (memory < (1ULL << 36)) {
>> +            /* MAP_POPULATE silently ignores failures */
>> +            for (i = 0; i < numpages; i++) {
>> +                memset(area + (hpagesize * i), 0, 1);
>> +            }
>> +        } else {
>> +            touch_all_pages(area, hpagesize, numpages);
>>          }
>>      }
> 
> Maybe it's possible to do this quicker?
> If we are using NUMA, and have separate memory-blocks for each NUMA node,
> wont this call os_mem_prealloc separately for each node?
> I wonder if it's possible to get that to run in parallel?
> 

I will investigate.

Thanks,
- Jitendra

> Dave
> 
>> -- 
>> 1.8.3.1
>>
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
Jitendra Kolhe Feb. 2, 2017, 9:35 a.m. UTC | #7
On 1/27/2017 6:56 PM, Daniel P. Berrange wrote:
> On Thu, Jan 05, 2017 at 12:54:02PM +0530, Jitendra Kolhe wrote:
>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>> start-up and migration time. This is because with "-mem-prealloc" option
>> qemu tries to map every guest page (create address translations), and
>> make sure the pages are available during runtime. virsh/libvirt by
>> default, seems to use "-mem-prealloc" option in case the guest is
>> configured to use huge pages. The patch tries to map all guest pages
>> simultaneously by spawning multiple threads. Given the problem is more
>> prominent for large guests, the patch limits the changes to the guests
>> of at-least 64GB of memory size. Currently limiting the change to QEMU
>> library functions on POSIX compliant host only, as we are not sure if
>> the problem exists on win32. Below are some stats with "-mem-prealloc"
>> option for guest configured to use huge pages.
>>
>> ------------------------------------------------------------------------
>> Idle Guest      | Start-up time | Migration time
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - single threaded (existing code)
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>> ------------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>> ------------------------------------------------------------------------
>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>> -----------------------------------------------------------------------
>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>> -----------------------------------------------------------------------
>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>> -----------------------------------------------------------------------
> 
> For comparison, what is performance like if you replace memset() in
> the current code with a call to mlock().
> 

It doesn't look like we get much benefit by replacing memset() for loop, 
with a single instance of mlock(). Here are some numbers from my system.

#hugepages    | memset 	    | memset      | memset 	| mlock (entire range)
(2M size)     | (1 thread)  | (8 threads) | (16 threads)| (1 thread)
--------------|-------------|-------------|-------------|--------------------
1048576 (2TB) |	1790661 ms  | 105577 ms   | 37331 ms    | 1789580 ms
524288 (1TB)  |	 895119 ms  | 52795 ms	  | 18686 ms	| 894199 ms
131072 (256G) |	 173081 ms  | 9337 ms	  | 4667 ms	| 172506 ms
-----------------------------------------------------------------------------

> IIUC, huge pages are non-swappable once allocated, so it feels like
> we ought to be able to just call mlock() to preallocate them with
> no downside, rather than spawning many threads to memset() them.
> 

yes, for me too it looks like mlock() should do the job in case of
hugepages.

> Of course you'd still need the memset() trick if qemu was given
> non-hugepages in combination with --mem-prealloc, as you don't
> want to lock normal pages into ram permanently.
> 

given above numbers, I think we can stick to memset() implementation for
both hugepage and non-hugepage cases?

Thanks,
- Jitendra

> Regards,
> Daniel
>
Paolo Bonzini Feb. 3, 2017, 6:59 p.m. UTC | #8
On 02/02/2017 01:35, Jitendra Kolhe wrote:
>> Of course you'd still need the memset() trick if qemu was given
>> non-hugepages in combination with --mem-prealloc, as you don't
>> want to lock normal pages into ram permanently.
>>
> given above numbers, I think we can stick to memset() implementation for
> both hugepage and non-hugepage cases?

Yes, of course!

Paolo
Jitendra Kolhe Feb. 7, 2017, 7:44 a.m. UTC | #9
On 1/30/2017 2:02 PM, Jitendra Kolhe wrote:
> On 1/27/2017 6:33 PM, Dr. David Alan Gilbert wrote:
>> * Jitendra Kolhe (jitendra.kolhe@hpe.com) wrote:
>>> Using "-mem-prealloc" option for a very large guest leads to huge guest
>>> start-up and migration time. This is because with "-mem-prealloc" option
>>> qemu tries to map every guest page (create address translations), and
>>> make sure the pages are available during runtime. virsh/libvirt by
>>> default, seems to use "-mem-prealloc" option in case the guest is
>>> configured to use huge pages. The patch tries to map all guest pages
>>> simultaneously by spawning multiple threads. Given the problem is more
>>> prominent for large guests, the patch limits the changes to the guests
>>> of at-least 64GB of memory size. Currently limiting the change to QEMU
>>> library functions on POSIX compliant host only, as we are not sure if
>>> the problem exists on win32. Below are some stats with "-mem-prealloc"
>>> option for guest configured to use huge pages.
>>>
>>> ------------------------------------------------------------------------
>>> Idle Guest      | Start-up time | Migration time
>>> ------------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - single threaded (existing code)
>>> ------------------------------------------------------------------------
>>> 64 Core - 4TB   | 54m11.796s    | 75m43.843s
>>> 64 Core - 1TB   | 8m56.576s     | 14m29.049s
>>> 64 Core - 256GB | 2m11.245s     | 3m26.598s
>>> ------------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 8 threads
>>> ------------------------------------------------------------------------
>>> 64 Core - 4TB   | 5m1.027s      | 34m10.565s
>>> 64 Core - 1TB   | 1m10.366s     | 8m28.188s
>>> 64 Core - 256GB | 0m19.040s     | 2m10.148s
>>> -----------------------------------------------------------------------
>>> Guest stats with 2M HugePage usage - map guest pages using 16 threads
>>> -----------------------------------------------------------------------
>>> 64 Core - 4TB   | 1m58.970s     | 31m43.400s
>>> 64 Core - 1TB   | 0m39.885s     | 7m55.289s
>>> 64 Core - 256GB | 0m11.960s     | 2m0.135s
>>> -----------------------------------------------------------------------
>>
>> That's a nice improvement.
>>
>>> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
>>> ---
>>>  util/oslib-posix.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++---
>>>  1 file changed, 61 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>>> index f631464..a8bd7c2 100644
>>> --- a/util/oslib-posix.c
>>> +++ b/util/oslib-posix.c
>>> @@ -55,6 +55,13 @@
>>>  #include "qemu/error-report.h"
>>>  #endif
>>>  
>>> +#define PAGE_TOUCH_THREAD_COUNT 8
>>
>> It seems a shame to fix that number as a constant.
>>
> 
> Yes, as per comments received we will update patch to incorporate vcpu count.
> 
>>> +typedef struct {
>>> +    char *addr;
>>> +    uint64_t numpages;
>>> +    uint64_t hpagesize;
>>> +} PageRange;
>>> +
>>>  int qemu_get_thread_id(void)
>>>  {
>>>  #if defined(__linux__)
>>> @@ -323,6 +330,52 @@ static void sigbus_handler(int signal)
>>>      siglongjmp(sigjump, 1);
>>>  }
>>>  
>>> +static void *do_touch_pages(void *arg)
>>> +{
>>> +    PageRange *range = (PageRange *)arg;
>>> +    char *start_addr = range->addr;
>>> +    uint64_t numpages = range->numpages;
>>> +    uint64_t hpagesize = range->hpagesize;
>>> +    uint64_t i = 0;
>>> +
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(start_addr + (hpagesize * i), 0, 1);
>>> +    }
>>> +    qemu_thread_exit(NULL);
>>> +
>>> +    return NULL;
>>> +}
>>> +
>>> +static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
>>> +{
>>> +    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
>>> +    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
>>> +    uint64_t    numpage_per_thread, size_per_thread;
>>> +    int         i = 0, tcount = 0;
>>> +
>>> +    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
>>> +    size_per_thread = (hpagesize * numpage_per_thread);
>>> +    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
>>> +        page_range[i].addr = area;
>>> +        page_range[i].numpages = numpage_per_thread;
>>> +        page_range[i].hpagesize = hpagesize;
>>> +
>>> +        qemu_thread_create(page_threads + i, "touch_pages",
>>> +                           do_touch_pages, (page_range + i),
>>> +                           QEMU_THREAD_JOINABLE);
>>> +        tcount++;
>>> +        area += size_per_thread;
>>> +        numpages -= numpage_per_thread;
>>> +    }
>>> +    for (i = 0; i < numpages; i++) {
>>> +        memset(area + (hpagesize * i), 0, 1);
>>> +    }
>>> +    for (i = 0; i < tcount; i++) {
>>> +        qemu_thread_join(page_threads + i);
>>> +    }
>>> +    return 0;
>>> +}
>>> +
>>>  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>>  {
>>>      int ret;
>>> @@ -353,9 +406,14 @@ void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
>>>          size_t hpagesize = qemu_fd_getpagesize(fd);
>>>          size_t numpages = DIV_ROUND_UP(memory, hpagesize);
>>>  
>>> -        /* MAP_POPULATE silently ignores failures */
>>> -        for (i = 0; i < numpages; i++) {
>>> -            memset(area + (hpagesize * i), 0, 1);
>>> +        /* touch pages simultaneously for memory >= 64G */
>>> +        if (memory < (1ULL << 36)) {
>>> +            /* MAP_POPULATE silently ignores failures */
>>> +            for (i = 0; i < numpages; i++) {
>>> +                memset(area + (hpagesize * i), 0, 1);
>>> +            }
>>> +        } else {
>>> +            touch_all_pages(area, hpagesize, numpages);
>>>          }
>>>      }
>>
>> Maybe it's possible to do this quicker?
>> If we are using NUMA, and have separate memory-blocks for each NUMA node,
>> wont this call os_mem_prealloc separately for each node?
>> I wonder if it's possible to get that to run in parallel?
>>
> 
> I will investigate.
> 
each numa node, seems to be getting treated as an independent qemu object. 
While parsing and creating the object itself we try to touch pages in 
os_mem_prealloc(). To parallelize numa node creation we would need to modify 
host_memory_backend_memory_complete() for the last numa object to wait for 
all previously spawned numa-node creation threads to finish there job. It 
involves parsing of cmdline options more than once (to identify if the current
numa node being serviced is the last numa node). 
Parsing cmdline in object specific implementation does not look correct?

also by parallelizing each numa node, the # memset threads would be reduced 
accordingly so that we don’t spawn too may threads. For example 
# threads spawned per numa node = min(#vcpus, 16)/(# numa nodes). 
With current implementation we would see min(#vcpus, 16) threads spawned, working 
on each numa node a time. Both implementations should have almost same performance?

Thanks,
- Jitendra
> Thanks,
> - Jitendra
> 
>> Dave
>>
>>> -- 
>>> 1.8.3.1
>>>
>>>
>> --
>> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>>
>
diff mbox

Patch

diff --git a/util/oslib-posix.c b/util/oslib-posix.c
index f631464..a8bd7c2 100644
--- a/util/oslib-posix.c
+++ b/util/oslib-posix.c
@@ -55,6 +55,13 @@ 
 #include "qemu/error-report.h"
 #endif
 
+#define PAGE_TOUCH_THREAD_COUNT 8
+typedef struct {
+    char *addr;
+    uint64_t numpages;
+    uint64_t hpagesize;
+} PageRange;
+
 int qemu_get_thread_id(void)
 {
 #if defined(__linux__)
@@ -323,6 +330,52 @@  static void sigbus_handler(int signal)
     siglongjmp(sigjump, 1);
 }
 
+static void *do_touch_pages(void *arg)
+{
+    PageRange *range = (PageRange *)arg;
+    char *start_addr = range->addr;
+    uint64_t numpages = range->numpages;
+    uint64_t hpagesize = range->hpagesize;
+    uint64_t i = 0;
+
+    for (i = 0; i < numpages; i++) {
+        memset(start_addr + (hpagesize * i), 0, 1);
+    }
+    qemu_thread_exit(NULL);
+
+    return NULL;
+}
+
+static int touch_all_pages(char *area, size_t hpagesize, size_t numpages)
+{
+    QemuThread page_threads[PAGE_TOUCH_THREAD_COUNT];
+    PageRange page_range[PAGE_TOUCH_THREAD_COUNT];
+    uint64_t    numpage_per_thread, size_per_thread;
+    int         i = 0, tcount = 0;
+
+    numpage_per_thread = (numpages / PAGE_TOUCH_THREAD_COUNT);
+    size_per_thread = (hpagesize * numpage_per_thread);
+    for (i = 0; i < (PAGE_TOUCH_THREAD_COUNT - 1); i++) {
+        page_range[i].addr = area;
+        page_range[i].numpages = numpage_per_thread;
+        page_range[i].hpagesize = hpagesize;
+
+        qemu_thread_create(page_threads + i, "touch_pages",
+                           do_touch_pages, (page_range + i),
+                           QEMU_THREAD_JOINABLE);
+        tcount++;
+        area += size_per_thread;
+        numpages -= numpage_per_thread;
+    }
+    for (i = 0; i < numpages; i++) {
+        memset(area + (hpagesize * i), 0, 1);
+    }
+    for (i = 0; i < tcount; i++) {
+        qemu_thread_join(page_threads + i);
+    }
+    return 0;
+}
+
 void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
 {
     int ret;
@@ -353,9 +406,14 @@  void os_mem_prealloc(int fd, char *area, size_t memory, Error **errp)
         size_t hpagesize = qemu_fd_getpagesize(fd);
         size_t numpages = DIV_ROUND_UP(memory, hpagesize);
 
-        /* MAP_POPULATE silently ignores failures */
-        for (i = 0; i < numpages; i++) {
-            memset(area + (hpagesize * i), 0, 1);
+        /* touch pages simultaneously for memory >= 64G */
+        if (memory < (1ULL << 36)) {
+            /* MAP_POPULATE silently ignores failures */
+            for (i = 0; i < numpages; i++) {
+                memset(area + (hpagesize * i), 0, 1);
+            }
+        } else {
+            touch_all_pages(area, hpagesize, numpages);
         }
     }