Patchwork [RFC,v2] Throttle-down guest when live migration does not converge.

Submitter Chegu Vinod
Date April 27, 2013, 8:50 p.m.
Message ID <1367095836-19318-1-git-send-email-chegu_vinod@hp.com>
Permalink /patch/240199/
State New

Comments

Chegu Vinod - April 27, 2013, 8:50 p.m.
Busy enterprise workloads hosted on large sized VM's tend to dirty
memory faster than the transfer rate achieved via live guest migration.
Despite some good recent improvements (& using dedicated 10Gig NICs
between hosts) the live migration does NOT converge.

A few options that were discussed/being-pursued to help with
the convergence issue include:

1) Slow down guest considerably via cgroup's CPU controls - requires
   libvirt client support to detect & trigger action, but conceptually
   similar to this RFC change.

2) Speed up transfer rate:
   - RDMA based Pre-copy - lower overhead and fast (Unfortunately
     has a few restrictions and some customers still choose not
     to deploy RDMA :-( ).
   - Add parallelism to improve transfer rate and use multiple 10Gig
     connections (bonded). - could add some overhead on the host.

3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
   promising but need to consider & handle newer failure scenarios.

If an enterprise user chooses to force convergence of their migration
via the new capability "auto-converge" then with this change we auto-detect
lack of convergence scenario and trigger a slow down of the workload
by explicitly disallowing the VCPUs from spending much time in the VM
context.

The migration thread tries to catchup and this eventually leads
to convergence in some "deterministic" amount of time. Yes it does
impact the performance of all the VCPUs but in my observation that
lasts only for a short duration of time. i.e. we end up entering
stage 3 (downtime phase) soon after that.

No exernal trigger is required (unlike option 1) and it can co-exist
with enhancements being pursued as part of Option 2 (e.g. RDMA).

Thanks to Juan and Paolo for their useful suggestions.

Verified the convergence using the following :
- SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
- OLTP like workload running on a 80VCPU/512G guest (~80% busy)

Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
migrate downtime set to 4seconds).

(qemu) info migrate
capabilities: xbzrle: off auto-converge: off  <----
Migration status: active
total time: 1487503 milliseconds
expected downtime: 519 milliseconds
transferred ram: 383749347 kbytes
remaining ram: 2753372 kbytes
total ram: 268444224 kbytes
duplicate: 65461532 pages
skipped: 64901568 pages
normal: 95750218 pages
normal bytes: 383000872 kbytes
dirty pages rate: 67551 pages

---

(qemu) info migrate
capabilities: xbzrle: off auto-converge: on   <----
Migration status: completed
total time: 241161 milliseconds
downtime: 6373 milliseconds
transferred ram: 28235307 kbytes
remaining ram: 0 kbytes
total ram: 268444224 kbytes
duplicate: 64946416 pages
skipped: 64903523 pages
normal: 7044971 pages
normal bytes: 28179884 kbytes

Changes from v1:
- rebased to latest qemu.git
- added auto-converge capability(default off) - suggested by Anthony Liguori &
                                                Eric Blake.

Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
---
 arch_init.c                   |   44 +++++++++++++++++++++++++++++++++++
 cpus.c                        |   12 +++++++++
 include/migration/migration.h |   12 +++++++++
 include/qemu/main-loop.h      |    3 ++
 kvm-all.c                     |   51 +++++++++++++++++++++++++++++++++++++++++
 migration.c                   |   15 ++++++++++++
 qapi-schema.json              |    6 ++++-
 7 files changed, 142 insertions(+), 1 deletions(-)
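
The convergence check that the patch adds to migration_bitmap_sync() can be
sketched in isolation as below. This is a minimal sketch, not the patch
itself: ConvergeState, converge_check() and HIGH_CNT_LIMIT are hypothetical
names, and the patch's extra test on s->dirty_pages_rate is left out. Two
details of the RFC carry over: the comparison is against half of the bytes
sent in the period, and the post-increment test means the throttle is first
switched on in the seventh period that exceeds the threshold.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the static state kept in migration_bitmap_sync(). */
typedef struct {
    int64_t bytes_xfer_prev;   /* transfer counter at the previous check */
    int     high_cnt;          /* periods in which dirtying outpaced transfer */
    bool    throttle_on;       /* latched once the guest should be slowed */
} ConvergeState;

/* Mirrors the "> 5" constant used in the RFC (assumed name). */
#define HIGH_CNT_LIMIT 5

/*
 * Called once per sync period (about one second in the patch).
 *   dirty_bytes:    bytes dirtied by the guest during the period
 *   bytes_xfer_now: cumulative bytes transferred so far
 * Returns the current throttle state.
 */
static bool converge_check(ConvergeState *st, int64_t dirty_bytes,
                           int64_t bytes_xfer_now)
{
    int64_t sent = bytes_xfer_now - st->bytes_xfer_prev;

    /* Same comparison as the RFC: dirtied more than half of what was sent. */
    if (dirty_bytes > sent / 2) {
        /* Post-increment: the old counter value is compared, as in the RFC. */
        if (st->high_cnt++ > HIGH_CNT_LIMIT) {
            st->throttle_on = true;
        }
    }
    st->bytes_xfer_prev = bytes_xfer_now;
    return st->throttle_on;
}
```

In this sketch the flag is never cleared once set, which matches the quoted
hunks, where nothing resets mig_throttle_on.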
Eric Blake - April 29, 2013, 2:53 p.m.
On 04/27/2013 02:50 PM, Chegu Vinod wrote:
> Busy enterprise workloads hosted on large sized VM's tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (& using dedicated 10Gig NICs
> between hosts) the live migration does NOT converge.
> 

> 
> No exernal trigger is required (unlike option 1) and it can co-exist

s/exernal/external/

> with enhancements being pursued as part of Option 2 (e.g. RDMA).
> 
> Thanks to Juan and Paolo for their useful suggestions.
> 

> 
> ---
> 
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on   <----

This part looks nice.

I'm not reviewing the entire patch (I'm not an expert on the internals
of migration), but just the interface:

> +++ b/qapi-schema.json
> @@ -599,10 +599,14 @@
>  #          This feature allows us to minimize migration traffic for certain work
>  #          loads, by sending compressed difference of the pages
>  #
> +# @auto-converge: Controls whether or not the we want the migration to
> +#          automaticially detect and force convergence by slowing

s/automaticially/automatically/

> +#          down the guest. Disabled by default.

Missing a (since 1.6) designation.

Also, use of first-person (us, we) in docs seems a bit unprofessional,
although you were copying pre-existing usage.  How about:

@xbzrle: Migration supports xbzrle (Xor Based Zero Run Length Encoding),
         which minimizes migration traffic for certain workloads by
         sending compressed differences of active pages

@auto-converge: Migration supports automatic throttling of guest
                activity to force convergence (since 1.6)
Chegu Vinod - April 29, 2013, 5:48 p.m.
On 4/29/2013 7:53 AM, Eric Blake wrote:
> On 04/27/2013 02:50 PM, Chegu Vinod wrote:
>> Busy enterprise workloads hosted on large sized VM's tend to dirty
>> memory faster than the transfer rate achieved via live guest migration.
>> Despite some good recent improvements (& using dedicated 10Gig NICs
>> between hosts) the live migration does NOT converge.
>>
>> No exernal trigger is required (unlike option 1) and it can co-exist
> s/exernal/external/
>
>> with enhancements being pursued as part of Option 2 (e.g. RDMA).
>>
>> Thanks to Juan and Paolo for their useful suggestions.
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on   <----
> This part looks nice.
>
> I'm not reviewing the entire patch (I'm not an expert on the internals
> of migration), but just the interface:

Thanks for taking a look at this. I shall incorporate your suggested
changes in the next version.

Hoping to hear from Juan/Orit and others on the live migration part.

Thanks,
Vinod

>> +++ b/qapi-schema.json
>> @@ -599,10 +599,14 @@
>>   #          This feature allows us to minimize migration traffic for certain work
>>   #          loads, by sending compressed difference of the pages
>>   #
>> +# @auto-converge: Controls whether or not the we want the migration to
>> +#          automaticially detect and force convergence by slowing
> s/automaticially/automatically/
>
>> +#          down the guest. Disabled by default.
> Missing a (since 1.6) designation.
>
> Also, use of first-person (us, we) in docs seems a bit unprofessional,
> although you were copying pre-existing usage.  How about:
>
> @xbzrle: Migration supports xbzrle (Xor Based Zero Run Length Encoding),
>           which minimizes migration traffic for certain workloads by
>           sending compressed differences of active pages
>
> @auto-converge: Migration supports automatic throttling of guest
>                  activity to force convergence (since 1.6)
>
Orit Wasserman - April 30, 2013, 3:04 p.m.
On 04/27/2013 11:50 PM, Chegu Vinod wrote:
> Busy enterprise workloads hosted on large sized VM's tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (& using dedicated 10Gig NICs
> between hosts) the live migration does NOT converge.
> 
> A few options that were discussed/being-pursued to help with
> the convergence issue include:
> 
> 1) Slow down guest considerably via cgroup's CPU controls - requires
>    libvirt client support to detect & trigger action, but conceptually
>    similar to this RFC change.
> 
> 2) Speed up transfer rate:
>    - RDMA based Pre-copy - lower overhead and fast (Unfortunately
>      has a few restrictions and some customers still choose not
>      to deploy RDMA :-( ).
>    - Add parallelism to improve transfer rate and use multiple 10Gig
>      connections (bonded). - could add some overhead on the host.
> 
> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
>    promising but need to consider & handle newer failure scenarios.
> 
> If an enterprise user chooses to force convergence of their migration
> via the new capability "auto-converge" then with this change we auto-detect
> lack of convergence scenario and trigger a slow down of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context.
> 
> The migration thread tries to catchup and this eventually leads
> to convergence in some "deterministic" amount of time. Yes it does
> impact the performance of all the VCPUs but in my observation that
> lasts only for a short duration of time. i.e. we end up entering
> stage 3 (downtime phase) soon after that.
> 
> No exernal trigger is required (unlike option 1) and it can co-exist
> with enhancements being pursued as part of Option 2 (e.g. RDMA).
> 
> Thanks to Juan and Paolo for their useful suggestions.
> 
> Verified the convergence using the following :
> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
> 
> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
> migrate downtime set to 4seconds).
> 
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off  <----
> Migration status: active
> total time: 1487503 milliseconds
> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
> 
> ---
> 
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on   <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds
> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
> 
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability(default off) - suggested by Anthony Liguori &
>                                                 Eric Blake.
> 
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> ---
>  arch_init.c                   |   44 +++++++++++++++++++++++++++++++++++
>  cpus.c                        |   12 +++++++++
>  include/migration/migration.h |   12 +++++++++
>  include/qemu/main-loop.h      |    3 ++
>  kvm-all.c                     |   51 +++++++++++++++++++++++++++++++++++++++++
>  migration.c                   |   15 ++++++++++++
>  qapi-schema.json              |    6 ++++-
>  7 files changed, 142 insertions(+), 1 deletions(-)
> 
> diff --git a/arch_init.c b/arch_init.c
> index 92de1bd..6dcc742 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>  #endif
>  
>  const uint32_t arch_type = QEMU_ARCH;
> +static uint64_t mig_throttle_on;
>  
>  /***********************************************************/
>  /* ram save/restore */
> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (migrate_auto_converge() && !bytes_xfer_prev) {
> +        bytes_xfer_prev = ram_bytes_transferred();
> +    }
>  
>      if (!start_time) {
>          start_time = qemu_get_clock_ms(rt_clock);
>      }
>  
> +
>      trace_migration_bitmap_sync_start();
>      memory_global_sync_dirty_bitmap(get_system_memory());
>  
> @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
>  
>      /* more than 1 second = 1000 millisecons */
>      if (end_time > start_time + 1000) {
> +        if (migrate_auto_converge()) {
> +            /* The following detection logic can be refined later. For now:
> +               Check to see if the dirtied bytes is 50% more than the approx.
> +               amount of bytes that just got transferred since the last time we
> +               were in this routine. If that happens N times (for now N==5)
> +               we turn on the throttle down logic */
> +            bytes_xfer_now = ram_bytes_transferred();
> +            if (s->dirty_pages_rate &&
> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
> +                if (dirty_rate_high_cnt++ > 5) {
> +                    DPRINTF("Unable to converge. Throtting down guest\n");
> +                    mig_throttle_on = 1;
> +                }

Why not check to see if mig_throttle_on is already on?

> +             }

> +             bytes_xfer_prev = bytes_xfer_now;
> +        }
>          s->dirty_pages_rate = num_dirty_pages_period * 1000
>              / (end_time - start_time);
>          s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
> @@ -496,6 +522,24 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>      return bytes_sent;
>  }
>  
> +bool throttling_needed(void)
> +{
> +    bool value;
> +
> +    if (!migrate_auto_converge()) {
> +        return false;
> +    }
> +
> +    qemu_mutex_lock_mig_throttle();
> +    value = mig_throttle_on;
> +    qemu_mutex_unlock_mig_throttle();
> +
> +    if (value) {
> +        return true;
> +    }
> +    return false;
> +}
> +

Why not return value here ?

Cheers,
Orit

>  static uint64_t bytes_transferred;
>  
>  static ram_addr_t ram_save_remaining(void)
> diff --git a/cpus.c b/cpus.c
> index 5a98a37..615c25a 100644
> --- a/cpus.c
> +++ b/cpus.c
> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>  #endif /* _WIN32 */
>  
>  static QemuMutex qemu_global_mutex;
> +static QemuMutex qemu_mig_throttle_mutex;
>  static QemuCond qemu_io_proceeded_cond;
>  static bool iothread_requesting_mutex;
>  
> @@ -638,6 +639,7 @@ void qemu_init_cpu_loop(void)
>      qemu_cond_init(&qemu_work_cond);
>      qemu_cond_init(&qemu_io_proceeded_cond);
>      qemu_mutex_init(&qemu_global_mutex);
> +    qemu_mutex_init(&qemu_mig_throttle_mutex);
>  
>      qemu_thread_get_self(&io_thread);
>  }
> @@ -943,6 +945,16 @@ void qemu_mutex_unlock_iothread(void)
>      qemu_mutex_unlock(&qemu_global_mutex);
>  }
>  
> +void qemu_mutex_lock_mig_throttle(void)
> +{
> +    qemu_mutex_lock(&qemu_mig_throttle_mutex);
> +}
> +
> +void qemu_mutex_unlock_mig_throttle(void)
> +{
> +    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
> +}
> +
>  static int all_vcpus_paused(void)
>  {
>      CPUArchState *penv = first_cpu;
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index e2acec6..94bdb8c 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -127,4 +127,16 @@ int migrate_use_xbzrle(void);
>  int64_t migrate_xbzrle_cache_size(void);
>  
>  int64_t xbzrle_cache_resize(int64_t new_size);
> +
> +#ifndef _QEMU_MIG_THROTTLE
> +#define _QEMU_MIG_THROTTLE

Do you need those defines?

> +
> +bool migrate_auto_converge(void);
> +
> +bool throttling_needed(void);
> +bool throttling_now(void);
> +void *migration_throttle_down(void *);
> +
> +#endif
> +
>  #endif
> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>  
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>  
>  void qemu_fd_register(int fd);
> diff --git a/kvm-all.c b/kvm-all.c
> index 2d92721..a92cb77 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -33,6 +33,8 @@
>  #include "exec/memory.h"
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>  
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_LAST_INFO
>  };
>  
> +static void mig_delay_vcpu(void);
> +
>  static KVMSlot *kvm_alloc_slot(KVMState *s)
>  {
>      int i;
> @@ -1609,6 +1613,10 @@ int kvm_cpu_exec(CPUArchState *env)
>          }
>          qemu_mutex_unlock_iothread();
>  
> +        if (throttling_needed()) {
> +            mig_delay_vcpu();
> +        }
> +
>          run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
>  
>          qemu_mutex_lock_iothread();
> @@ -2032,3 +2040,46 @@ int kvm_on_sigbus(int code, void *addr)
>  {
>      return kvm_arch_on_sigbus(code, addr);
>  }
> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    if (throttling) {
> +        return true;
> +    }
> +    return false;
> +}
> +

It would be simpler to just return throttling.

> +static void mig_delay_vcpu(void)
> +{
> +    g_usleep(50*1000);
> +}
> +
> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> +    return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catchup.
> +   Workload will experience a greater performance drop but for a shorter
> +   duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;
> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25*1000);
> +    }
> +    throttling = false;
> +    return NULL;
> +}
> diff --git a/migration.c b/migration.c
> index 3eb0fad..834156e 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -24,6 +24,7 @@
>  #include "qemu/thread.h"
>  #include "qmp-commands.h"
>  #include "trace.h"
> +#include "sysemu/cpus.h"
>  
>  //#define DEBUG_MIGRATION
>  
> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>  
> +bool migrate_auto_converge(void)
> +{
> +    MigrationState *s;
> +
> +    s = migrate_get_current();
> +
> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
> +}
> +
>  int migrate_use_xbzrle(void)
>  {
>      MigrationState *s;
> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque)
>      int64_t max_size = 0;
>      int64_t start_time = initial_time;
>      bool old_vm_running = false;
> +    QemuThread thread;
>  
>      DPRINTF("beginning savevm\n");
>      qemu_savevm_state_begin(s->file, &s->params);
> @@ -517,6 +528,10 @@ static void *migration_thread(void *opaque)
>              DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>              if (pending_size && pending_size >= max_size) {
>                  qemu_savevm_state_iterate(s->file);
> +                if (throttling_needed() && !throttling_now()) {
> +                    qemu_thread_create(&thread, migration_throttle_down,
> +                               NULL, QEMU_THREAD_DETACHED);
> +                }
>              } else {
>                  DPRINTF("done iterating\n");
>                  qemu_mutex_lock_iothread();
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 5b0fb3b..b662e33 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -599,10 +599,14 @@
>  #          This feature allows us to minimize migration traffic for certain work
>  #          loads, by sending compressed difference of the pages
>  #
> +# @auto-converge: Controls whether or not the we want the migration to
> +#          automaticially detect and force convergence by slowing
> +#          down the guest. Disabled by default.
> +#
>  # Since: 1.2
>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'auto-converge'] }
>  
>  ##
>  # @MigrationCapabilityStatus
>
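
The two simplifications asked about in this review can be sketched as
follows. This is a standalone sketch under stated assumptions, not QEMU code:
a plain pthread mutex stands in for qemu_mig_throttle_mutex,
auto_converge_enabled stands in for migrate_auto_converge(), and
request_throttle() is a hypothetical helper showing one possible reading of
the "already on" question.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the capability bit and the throttle flag. */
static bool auto_converge_enabled;
static bool mig_throttle_on;
static pthread_mutex_t mig_throttle_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Same behaviour as the RFC's throttling_needed(), without the if/else tail. */
static bool throttling_needed(void)
{
    bool value;

    if (!auto_converge_enabled) {
        return false;
    }

    pthread_mutex_lock(&mig_throttle_mutex);
    value = mig_throttle_on;
    pthread_mutex_unlock(&mig_throttle_mutex);

    return value;
}

/*
 * Sets the flag only if it is not already set.
 * Returns true when this call performed the off-to-on transition.
 */
static bool request_throttle(void)
{
    bool changed = false;

    pthread_mutex_lock(&mig_throttle_mutex);
    if (!mig_throttle_on) {
        mig_throttle_on = true;
        changed = true;
    }
    pthread_mutex_unlock(&mig_throttle_mutex);

    return changed;
}
```

Returning whether the transition happened would let a caller log the
"Unable to converge" message once rather than on every later sync period.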
Juan Quintela - April 30, 2013, 3:20 p.m.
Chegu Vinod <chegu_vinod@hp.com> wrote:
> Busy enterprise workloads hosted on large sized VM's tend to dirty
> memory faster than the transfer rate achieved via live guest migration.
> Despite some good recent improvements (& using dedicated 10Gig NICs
> between hosts) the live migration does NOT converge.
>
> A few options that were discussed/being-pursued to help with
> the convergence issue include:
>
> 1) Slow down guest considerably via cgroup's CPU controls - requires
>    libvirt client support to detect & trigger action, but conceptually
>    similar to this RFC change.
>
> 2) Speed up transfer rate:
>    - RDMA based Pre-copy - lower overhead and fast (Unfortunately
>      has a few restrictions and some customers still choose not
>      to deploy RDMA :-( ).
>    - Add parallelism to improve transfer rate and use multiple 10Gig
>      connections (bonded). - could add some overhead on the host.
>
> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
>    promising but need to consider & handle newer failure scenarios.
>
> If an enterprise user chooses to force convergence of their migration
> via the new capability "auto-converge" then with this change we auto-detect
> lack of convergence scenario and trigger a slow down of the workload
> by explicitly disallowing the VCPUs from spending much time in the VM
> context.
>
> The migration thread tries to catchup and this eventually leads
> to convergence in some "deterministic" amount of time. Yes it does
> impact the performance of all the VCPUs but in my observation that
> lasts only for a short duration of time. i.e. we end up entering
> stage 3 (downtime phase) soon after that.
>
> No exernal trigger is required (unlike option 1) and it can co-exist
> with enhancements being pursued as part of Option 2 (e.g. RDMA).
>
> Thanks to Juan and Paolo for their useful suggestions.
>
> Verified the convergence using the following :
> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>
> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
> migrate downtime set to 4seconds).
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: off  <----
> Migration status: active
> total time: 1487503 milliseconds

148 seconds

> expected downtime: 519 milliseconds
> transferred ram: 383749347 kbytes
> remaining ram: 2753372 kbytes
> total ram: 268444224 kbytes
> duplicate: 65461532 pages
> skipped: 64901568 pages
> normal: 95750218 pages
> normal bytes: 383000872 kbytes
> dirty pages rate: 67551 pages
>
> ---
>
> (qemu) info migrate
> capabilities: xbzrle: off auto-converge: on   <----
> Migration status: completed
> total time: 241161 milliseconds
> downtime: 6373 milliseconds

6.3 seconds and finished,  not bad at all O:-)

How much does the guest throughput drop when we enter auto-converge mode?

> transferred ram: 28235307 kbytes
> remaining ram: 0 kbytes
> total ram: 268444224 kbytes
> duplicate: 64946416 pages
> skipped: 64903523 pages
> normal: 7044971 pages
> normal bytes: 28179884 kbytes
>
> Changes from v1:
> - rebased to latest qemu.git
> - added auto-converge capability(default off) - suggested by Anthony Liguori &
>                                                 Eric Blake.
>
> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>      MigrationState *s = migrate_get_current();
>      static int64_t start_time;
>      static int64_t num_dirty_pages_period;
> +    static int64_t bytes_xfer_prev;
>      int64_t end_time;
> +    int64_t bytes_xfer_now;
> +    static int dirty_rate_high_cnt;
> +
> +    if (migrate_auto_converge() && !bytes_xfer_prev) {

Just do the !bytes_xfer_prev test here?  migrate_auto_converge() is more
expensive to call than just doing the assignment.

> +
> +    if (value) {
> +        return true;
> +    }
> +    return false;

this code is just:

return value;

> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
> index 6f0200a..9a3886d 100644
> --- a/include/qemu/main-loop.h
> +++ b/include/qemu/main-loop.h
> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>   */
>  void qemu_mutex_unlock_iothread(void);
>  
> +void qemu_mutex_lock_mig_throttle(void);
> +void qemu_mutex_unlock_mig_throttle(void);
> +
>  /* internal interfaces */
>  
>  void qemu_fd_register(int fd);
> diff --git a/kvm-all.c b/kvm-all.c
> index 2d92721..a92cb77 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -33,6 +33,8 @@
>  #include "exec/memory.h"
>  #include "exec/address-spaces.h"
>  #include "qemu/event_notifier.h"
> +#include "sysemu/cpus.h"
> +#include "migration/migration.h"
>  
>  /* This check must be after config-host.h is included */
>  #ifdef CONFIG_EVENTFD
> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>      KVM_CAP_LAST_INFO
>  };
>  
> +static void mig_delay_vcpu(void);
> +

move function definition to here?

> +
> +static bool throttling;
> +bool throttling_now(void)
> +{
> +    if (throttling) {
> +        return true;
> +    }
> +    return false;

  return throttling;

> +/* Stub used for getting the vcpu out of VM and into qemu via
> +   run_on_cpu()*/
> +static void mig_kick_cpu(void *opq)
> +{
> +    return;
> +}
> +
> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
> +   much time in the VM. The migration thread will try to catchup.
> +   Workload will experience a greater performance drop but for a shorter
> +   duration.
> +*/
> +void *migration_throttle_down(void *opaque)
> +{
> +    throttling = true;
> +    while (throttling_needed()) {
> +        CPUArchState *penv = first_cpu;

I am not sure that we can follow the list without the iothread lock
here.


> +        while (penv) {
> +            qemu_mutex_lock_iothread();
> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
> +            qemu_mutex_unlock_iothread();
> +            penv = penv->next_cpu;
> +        }
> +        g_usleep(25*1000);
> +    }
> +    throttling = false;
> +    return NULL;
> +}
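
The locking concern raised above can be illustrated with a standalone sketch
in which the list is walked entirely under one lock. Everything here is
hypothetical: Cpu, big_lock, kick_cpu() and kick_all_cpus() are simplified
stand-ins for the VCPU list, the iothread lock and run_on_cpu(), and whether
the real run_on_cpu() can be used this way inside QEMU is not something this
sketch settles.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the per-VCPU list in the RFC. */
typedef struct Cpu {
    int kicks;                 /* how many times this CPU has been kicked */
    struct Cpu *next_cpu;
} Cpu;

/* Plays the role of the iothread lock in this sketch. */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static Cpu *first_cpu;

/* Stand-in for forcing a VCPU out of guest context. */
static void kick_cpu(Cpu *cpu)
{
    cpu->kicks++;
}

/*
 * Walk the whole list with the lock held, so a writer that takes the same
 * lock cannot relink nodes in the middle of the walk.
 * Returns the number of CPUs visited.
 */
static int kick_all_cpus(void)
{
    int visited = 0;

    pthread_mutex_lock(&big_lock);
    for (Cpu *c = first_cpu; c != NULL; c = c->next_cpu) {
        kick_cpu(c);
        visited++;
    }
    pthread_mutex_unlock(&big_lock);

    return visited;
}
```

The cost of this shape is a longer hold time on the lock: every other thread
that needs it waits for the full walk instead of a single kick.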
Chegu Vinod - April 30, 2013, 3:55 p.m.
On 4/30/2013 8:20 AM, Juan Quintela wrote:
> Chegu Vinod <chegu_vinod@hp.com> wrote:
>> Busy enterprise workloads hosted on large sized VM's tend to dirty
>> memory faster than the transfer rate achieved via live guest migration.
>> Despite some good recent improvements (& using dedicated 10Gig NICs
>> between hosts) the live migration does NOT converge.
>>
>> A few options that were discussed/being-pursued to help with
>> the convergence issue include:
>>
>> 1) Slow down guest considerably via cgroup's CPU controls - requires
>>     libvirt client support to detect & trigger action, but conceptually
>>     similar to this RFC change.
>>
>> 2) Speed up transfer rate:
>>     - RDMA based Pre-copy - lower overhead and fast (Unfortunately
>>       has a few restrictions and some customers still choose not
>>       to deploy RDMA :-( ).
>>     - Add parallelism to improve transfer rate and use multiple 10Gig
>>       connections (bonded). - could add some overhead on the host.
>>
>> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
>>     promising but need to consider & handle newer failure scenarios.
>>
>> If an enterprise user chooses to force convergence of their migration
>> via the new capability "auto-converge" then with this change we auto-detect
>> lack of convergence scenario and trigger a slow down of the workload
>> by explicitly disallowing the VCPUs from spending much time in the VM
>> context.
>>
>> The migration thread tries to catchup and this eventually leads
>> to convergence in some "deterministic" amount of time. Yes it does
>> impact the performance of all the VCPUs but in my observation that
>> lasts only for a short duration of time. i.e. we end up entering
>> stage 3 (downtime phase) soon after that.
>>
>> No exernal trigger is required (unlike option 1) and it can co-exist
>> with enhancements being pursued as part of Option 2 (e.g. RDMA).
>>
>> Thanks to Juan and Paolo for their useful suggestions.
>>
>> Verified the convergence using the following :
>> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>
>> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>> migrate downtime set to 4seconds).
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: off  <----
>> Migration status: active
>> total time: 1487503 milliseconds
> 148 seconds

1487 seconds, and the migration has still not completed.

>
>> expected downtime: 519 milliseconds
>> transferred ram: 383749347 kbytes
>> remaining ram: 2753372 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 65461532 pages
>> skipped: 64901568 pages
>> normal: 95750218 pages
>> normal bytes: 383000872 kbytes
>> dirty pages rate: 67551 pages
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on   <----
>> Migration status: completed
>> total time: 241161 milliseconds
>> downtime: 6373 milliseconds
> 6.3 seconds and finished,  not bad at all O:-)
That's the *downtime*. The total time for the migration to complete is
241 secs. (SpecJBB is one of those workloads that dirties memory quite a bit.)

>
> How much does the guest throughput drop when we enter auto-converge mode?

Workload performance drops for a short duration, but the migration soon
switches to stage 3.

>
>> transferred ram: 28235307 kbytes
>> remaining ram: 0 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 64946416 pages
>> skipped: 64903523 pages
>> normal: 7044971 pages
>> normal bytes: 28179884 kbytes
>>
>> Changes from v1:
>> - rebased to latest qemu.git
>> - added auto-converge capability(default off) - suggested by Anthony Liguori &
>>                                                  Eric Blake.
>>
>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>>       MigrationState *s = migrate_get_current();
>>       static int64_t start_time;
>>       static int64_t num_dirty_pages_period;
>> +    static int64_t bytes_xfer_prev;
>>       int64_t end_time;
>> +    int64_t bytes_xfer_now;
>> +    static int dirty_rate_high_cnt;
>> +
>> +    if (migrate_auto_converge() && !bytes_xfer_prev) {
> Just do the !bytes_xfer_prev test here?  migrate_auto_converge() is more
> expensive to call than just doing the assignment.

Sure
>
>> +
>> +    if (value) {
>> +        return true;
>> +    }
>> +    return false;
> this code is just:
>
> return value;

ok

>
>> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
>> index 6f0200a..9a3886d 100644
>> --- a/include/qemu/main-loop.h
>> +++ b/include/qemu/main-loop.h
>> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>>    */
>>   void qemu_mutex_unlock_iothread(void);
>>   
>> +void qemu_mutex_lock_mig_throttle(void);
>> +void qemu_mutex_unlock_mig_throttle(void);
>> +
>>   /* internal interfaces */
>>   
>>   void qemu_fd_register(int fd);
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 2d92721..a92cb77 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -33,6 +33,8 @@
>>   #include "exec/memory.h"
>>   #include "exec/address-spaces.h"
>>   #include "qemu/event_notifier.h"
>> +#include "sysemu/cpus.h"
>> +#include "migration/migration.h"
>>   
>>   /* This check must be after config-host.h is included */
>>   #ifdef CONFIG_EVENTFD
>> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>       KVM_CAP_LAST_INFO
>>   };
>>   
>> +static void mig_delay_vcpu(void);
>> +
> move function definition to here?
Ok.
>> +
>> +static bool throttling;
>> +bool throttling_now(void)
>> +{
>> +    if (throttling) {
>> +        return true;
>> +    }
>> +    return false;
>    return throttling;
>
>> +/* Stub used for getting the vcpu out of VM and into qemu via
>> +   run_on_cpu()*/
>> +static void mig_kick_cpu(void *opq)
>> +{
>> +    return;
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catchup.
>> +   Workload will experience a greater performance drop but for a shorter
>> +   duration.
>> +*/
>> +void *migration_throttle_down(void *opaque)
>> +{
>> +    throttling = true;
>> +    while (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
> I am not sure that we can follow the list without the iothread lock
> here.

Hmm.. Is this due to vcpu hot plug that might happen at the time of live
migration, or due to something else? I was trying to avoid holding the
iothread lock for a longer duration and slowing down the migration thread...

>
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
>> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
>> +        }
>> +        g_usleep(25*1000);
>> +    }
>> +    throttling = false;
>> +    return NULL;
>> +}
> .
Thanks
Vinod
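
As a rough aid for reasoning about the two sleep constants in the patch (g_usleep(50*1000) before KVM_RUN and g_usleep(25*1000) in the kick loop), here is a minimal sketch of the arithmetic. It assumes an idealized cycle in which a vCPU runs in the guest for run_ms and then sleeps for delay_ms. That is only an assumption: the patch does not bound the guest run time at 25 ms, since run_on_cpu() is called for each vCPU in turn and ordinary exits also pass through the delay. The function name guest_run_percent is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: percentage of each idealized throttle cycle that a
 * vCPU spends in the guest, given run_ms in the guest followed by
 * delay_ms asleep. Integer division truncates toward zero.
 */
int64_t guest_run_percent(int64_t run_ms, int64_t delay_ms)
{
    assert(run_ms >= 0 && delay_ms >= 0 && run_ms + delay_ms > 0);
    return (run_ms * 100) / (run_ms + delay_ms);
}
```

Under that assumption, guest_run_percent(25, 50) gives 33, i.e. roughly a third of the cycle in the guest; the real figure depends on the exit rate and the number of vCPUs.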
Juan Quintela - April 30, 2013, 4:01 p.m.
Chegu Vinod <chegu_vinod@hp.com> wrote:
> On 4/30/2013 8:20 AM, Juan Quintela wrote:
>>>
>>> (qemu) info migrate
>>> capabilities: xbzrle: off auto-converge: off  <----
>>> Migration status: active
>>> total time: 1487503 milliseconds
>> 148 seconds
>
> 1487 seconds and still the Migration is not completed.
>
>>
>>> expected downtime: 519 milliseconds
>>> transferred ram: 383749347 kbytes
>>> remaining ram: 2753372 kbytes
>>> total ram: 268444224 kbytes
>>> duplicate: 65461532 pages
>>> skipped: 64901568 pages
>>> normal: 95750218 pages
>>> normal bytes: 383000872 kbytes
>>> dirty pages rate: 67551 pages
>>>
>>> ---
>>>
>>> (qemu) info migrate
>>> capabilities: xbzrle: off auto-converge: on   <----
>>> Migration status: completed
>>> total time: 241161 milliseconds
>>> downtime: 6373 milliseconds
>> 6.3 seconds and finished,  not bad at all O:-)
> That's the *downtime*..  The total time for migration to complete is
> 241 secs. (SpecJBB is
> one of those workloads that dirties memory quite a bit).

Sorry,  you are right.  Impressive anyway for such a small change.
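
For reference, the "info migrate" times quoted above are in milliseconds, so the conversion is a division by 1000. A minimal sketch follows; the helper names are hypothetical and not part of QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: whole seconds contained in a millisecond count. */
int64_t ms_to_whole_seconds(int64_t ms)
{
    assert(ms >= 0);
    return ms / 1000;
}

/* Hypothetical helper: milliseconds left over after the whole seconds. */
int64_t ms_remainder(int64_t ms)
{
    assert(ms >= 0);
    return ms % 1000;
}
```

With the figures from this thread, 1487503 ms is 1487 s plus 503 ms (not 148 s), 241161 ms is 241 s, and 6373 ms is 6 s plus 373 ms.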

>>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>>> +   much time in the VM. The migration thread will try to catchup.
>>> +   Workload will experience a greater performance drop but for a shorter
>>> +   duration.
>>> +*/
>>> +void *migration_throttle_down(void *opaque)
>>> +{
>>> +    throttling = true;
>>> +    while (throttling_needed()) {
>>> +        CPUArchState *penv = first_cpu;
>> I am not sure that we can follow the list without the iothread lock
>> here.
>
> Hmm.. Is this due to vcpu hot plug that might happen at the time of
> live migration (or) due
> to something else ? I was trying to avoid holding the iothread lock
> for longer duration and slow
> down the migration thread...

Well,  thinking back about it,  what we should do is disable cpu
hotplug/unplug during migration (it is not working well anyway as of
today).

Thanks,  Juan.
Chegu Vinod - April 30, 2013, 5:48 p.m.
On 4/30/2013 9:01 AM, Juan Quintela wrote:
> Chegu Vinod <chegu_vinod@hp.com> wrote:
>> On 4/30/2013 8:20 AM, Juan Quintela wrote:
>>>> (qemu) info migrate
>>>> capabilities: xbzrle: off auto-converge: off  <----
>>>> Migration status: active
>>>> total time: 1487503 milliseconds
>>> 148 seconds
>> 1487 seconds and still the Migration is not completed.
>>
>>>> expected downtime: 519 milliseconds
>>>> transferred ram: 383749347 kbytes
>>>> remaining ram: 2753372 kbytes
>>>> total ram: 268444224 kbytes
>>>> duplicate: 65461532 pages
>>>> skipped: 64901568 pages
>>>> normal: 95750218 pages
>>>> normal bytes: 383000872 kbytes
>>>> dirty pages rate: 67551 pages
>>>>
>>>> ---
>>>>
>>>> (qemu) info migrate
>>>> capabilities: xbzrle: off auto-converge: on   <----
>>>> Migration status: completed
>>>> total time: 241161 milliseconds
>>>> downtime: 6373 milliseconds
>>> 6.3 seconds and finished,  not bad at all O:-)
>> That's the *downtime*..  The total time for migration to complete is
>> 241 secs. (SpecJBB is
>> one of those workloads that dirties memory quite a bit).
> Sorry,  you are right.  Impressive anyway for such a small change.
>
>>>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>>>> +   much time in the VM. The migration thread will try to catchup.
>>>> +   Workload will experience a greater performance drop but for a shorter
>>>> +   duration.
>>>> +*/
>>>> +void *migration_throttle_down(void *opaque)
>>>> +{
>>>> +    throttling = true;
>>>> +    while (throttling_needed()) {
>>>> +        CPUArchState *penv = first_cpu;
>>> I am not sure that we can follow the list without the iothread lock
>>> here.
>> Hmm.. Is this due to vcpu hot plug that might happen at the time of
>> live migration (or) due
>> to something else ? I was trying to avoid holding the iothread lock
>> for longer duration and slow
>> down the migration thread...
> Well,  thinking back about it,  what we should do is disable cpu
> hotplug/unplug during migration

I tend to agree.

For now I am not going to hold the iothread lock for following the list...

> (it is not working well anyways as
> Today).

Yes...and I see that Igor, Eduardo et al. are trying to fix this.

Vinod

>
> Thanks,  Juan.
> .
>
Chegu Vinod - April 30, 2013, 5:51 p.m.
On 4/30/2013 8:04 AM, Orit Wasserman wrote:
> On 04/27/2013 11:50 PM, Chegu Vinod wrote:
>> Busy enterprise workloads hosted on large sized VM's tend to dirty
>> memory faster than the transfer rate achieved via live guest migration.
>> Despite some good recent improvements (& using dedicated 10Gig NICs
>> between hosts) the live migration does NOT converge.
>>
>> A few options that were discussed/being-pursued to help with
>> the convergence issue include:
>>
>> 1) Slow down guest considerably via cgroup's CPU controls - requires
>>     libvirt client support to detect & trigger action, but conceptually
>>     similar to this RFC change.
>>
>> 2) Speed up transfer rate:
>>     - RDMA based Pre-copy - lower overhead and fast (Unfortunately
>>       has a few restrictions and some customers still choose not
>>       to deploy RDMA :-( ).
>>     - Add parallelism to improve transfer rate and use multiple 10Gig
>>       connections (bonded). - could add some overhead on the host.
>>
>> 3) Post-copy (preferably with RDMA) or a Pre+Post copy hybrid - Sounds
>>     promising but need to consider & handle newer failure scenarios.
>>
>> If an enterprise user chooses to force convergence of their migration
>> via the new capability "auto-converge" then with this change we auto-detect
>> lack of convergence scenario and trigger a slow down of the workload
>> by explicitly disallowing the VCPUs from spending much time in the VM
>> context.
>>
>> The migration thread tries to catchup and this eventually leads
>> to convergence in some "deterministic" amount of time. Yes it does
>> impact the performance of all the VCPUs but in my observation that
>> lasts only for a short duration of time. i.e. we end up entering
>> stage 3 (downtime phase) soon after that.
>>
>> No external trigger is required (unlike option 1) and it can co-exist
>> with enhancements being pursued as part of Option 2 (e.g. RDMA).
>>
>> Thanks to Juan and Paolo for their useful suggestions.
>>
>> Verified the convergence using the following :
>> - SpecJbb2005 workload running on a 20VCPU/256G guest(~80% busy)
>> - OLTP like workload running on a 80VCPU/512G guest (~80% busy)
>>
>> Sample results with SpecJbb2005 workload : (migrate speed set to 20Gb and
>> migrate downtime set to 4seconds).
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: off  <----
>> Migration status: active
>> total time: 1487503 milliseconds
>> expected downtime: 519 milliseconds
>> transferred ram: 383749347 kbytes
>> remaining ram: 2753372 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 65461532 pages
>> skipped: 64901568 pages
>> normal: 95750218 pages
>> normal bytes: 383000872 kbytes
>> dirty pages rate: 67551 pages
>>
>> ---
>>
>> (qemu) info migrate
>> capabilities: xbzrle: off auto-converge: on   <----
>> Migration status: completed
>> total time: 241161 milliseconds
>> downtime: 6373 milliseconds
>> transferred ram: 28235307 kbytes
>> remaining ram: 0 kbytes
>> total ram: 268444224 kbytes
>> duplicate: 64946416 pages
>> skipped: 64903523 pages
>> normal: 7044971 pages
>> normal bytes: 28179884 kbytes
>>
>> Changes from v1:
>> - rebased to latest qemu.git
>> - added auto-converge capability(default off) - suggested by Anthony Liguori &
>>                                                  Eric Blake.
>>
>> Signed-off-by: Chegu Vinod <chegu_vinod@hp.com>
>> ---
>>   arch_init.c                   |   44 +++++++++++++++++++++++++++++++++++
>>   cpus.c                        |   12 +++++++++
>>   include/migration/migration.h |   12 +++++++++
>>   include/qemu/main-loop.h      |    3 ++
>>   kvm-all.c                     |   51 +++++++++++++++++++++++++++++++++++++++++
>>   migration.c                   |   15 ++++++++++++
>>   qapi-schema.json              |    6 ++++-
>>   7 files changed, 142 insertions(+), 1 deletions(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 92de1bd..6dcc742 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -104,6 +104,7 @@ int graphic_depth = 15;
>>   #endif
>>   
>>   const uint32_t arch_type = QEMU_ARCH;
>> +static uint64_t mig_throttle_on;
>>   
>>   /***********************************************************/
>>   /* ram save/restore */
>> @@ -379,12 +380,20 @@ static void migration_bitmap_sync(void)
>>       MigrationState *s = migrate_get_current();
>>       static int64_t start_time;
>>       static int64_t num_dirty_pages_period;
>> +    static int64_t bytes_xfer_prev;
>>       int64_t end_time;
>> +    int64_t bytes_xfer_now;
>> +    static int dirty_rate_high_cnt;
>> +
>> +    if (migrate_auto_converge() && !bytes_xfer_prev) {
>> +        bytes_xfer_prev = ram_bytes_transferred();
>> +    }
>>   
>>       if (!start_time) {
>>           start_time = qemu_get_clock_ms(rt_clock);
>>       }
>>   
>> +
>>       trace_migration_bitmap_sync_start();
>>       memory_global_sync_dirty_bitmap(get_system_memory());
>>   
>> @@ -404,6 +413,23 @@ static void migration_bitmap_sync(void)
>>   
>>       /* more than 1 second = 1000 millisecons */
>>       if (end_time > start_time + 1000) {
>> +        if (migrate_auto_converge()) {
>> +            /* The following detection logic can be refined later. For now:
>> +               Check to see if the dirtied bytes is 50% more than the approx.
>> +               amount of bytes that just got transferred since the last time we
>> +               were in this routine. If that happens N times (for now N==5)
>> +               we turn on the throttle down logic */
>> +            bytes_xfer_now = ram_bytes_transferred();
>> +            if (s->dirty_pages_rate &&
>> +                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
>> +                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
>> +                if (dirty_rate_high_cnt++ > 5) {
>> +                    DPRINTF("Unable to converge. Throttling down guest\n");
>> +                    mig_throttle_on = 1;
>> +                }
> Why not check to see if mig_throttle_on is already on?


Once it's set to 1, setting it to 1 again shouldn't really matter. I'm not
sure I understand the need for adding the check.
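
To make the detection logic under discussion easier to follow, here is a standalone sketch of it, outside QEMU. The struct and function names are hypothetical, and the sketch leaves out the s->dirty_pages_rate test from the patch. It keeps the same comparison (dirtied bytes against half of the bytes sent since the last sample), the same post-incremented counter, and a flag that stays set once raised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical state for the sketch; the patch keeps these as statics. */
struct converge_detector {
    int64_t bytes_xfer_prev; /* transfer counter at the previous sample */
    int high_cnt;            /* samples where dirtying beat the threshold */
    bool throttle_on;        /* stays set once raised, like mig_throttle_on */
};

/* Feed one sampling period in; returns the latched throttle decision. */
bool converge_detector_sample(struct converge_detector *d,
                              int64_t dirty_pages_period,
                              int64_t page_size,
                              int64_t bytes_xfer_now)
{
    assert(d != NULL && page_size > 0);

    int64_t dirtied = dirty_pages_period * page_size;
    int64_t sent = bytes_xfer_now - d->bytes_xfer_prev;

    /* Same comparison as the patch: dirtied bytes vs. half the bytes sent. */
    if (dirtied > sent / 2) {
        /* Post-increment compared with 5, as written in the patch. */
        if (d->high_cnt++ > 5) {
            d->throttle_on = true;
        }
    }
    d->bytes_xfer_prev = bytes_xfer_now;
    return d->throttle_on;
}
```

Two things show up when stepping through this form: the flag is first set on the seventh high sample rather than the fifth, because of the post-increment and the "> 5" test, and the counter is never reset by a low sample. Setting the flag again on later samples is harmless, as noted above.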

>
>> +             }
>> +             bytes_xfer_prev = bytes_xfer_now;
>> +        }
>>           s->dirty_pages_rate = num_dirty_pages_period * 1000
>>               / (end_time - start_time);
>>           s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
>> @@ -496,6 +522,24 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>       return bytes_sent;
>>   }
>>   
>> +bool throttling_needed(void)
>> +{
>> +    bool value;
>> +
>> +    if (!migrate_auto_converge()) {
>> +        return false;
>> +    }
>> +
>> +    qemu_mutex_lock_mig_throttle();
>> +    value = mig_throttle_on;
>> +    qemu_mutex_unlock_mig_throttle();
>> +
>> +    if (value) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
> Why not return value here ?
yes
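
A small sketch of why the shorter form is equivalent, assuming C99 bool semantics (any nonzero value converts to true). The function names are hypothetical; the uint64_t parameter stands in for mig_throttle_on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Verbose form, shaped like the code in the patch. */
bool flag_verbose(uint64_t flag)
{
    bool value = flag; /* conversion to bool yields 0 or 1 */
    if (value) {
        return true;
    }
    return false;
}

/* Direct form suggested in review. */
bool flag_direct(uint64_t flag)
{
    bool value = flag;
    return value;
}

/* Checks that both forms agree for a few sample values. */
void check_flag_forms_agree(void)
{
    const uint64_t samples[] = { 0, 1, 2, UINT64_C(1) << 40 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        assert(flag_verbose(samples[i]) == flag_direct(samples[i]));
    }
}
```

Because the conversion to bool happens at the assignment, even a value with only high bits set is reported as true by both forms.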
> Cheers,
> Orit
>
>>   static uint64_t bytes_transferred;
>>   
>>   static ram_addr_t ram_save_remaining(void)
>> diff --git a/cpus.c b/cpus.c
>> index 5a98a37..615c25a 100644
>> --- a/cpus.c
>> +++ b/cpus.c
>> @@ -616,6 +616,7 @@ static void qemu_tcg_init_cpu_signals(void)
>>   #endif /* _WIN32 */
>>   
>>   static QemuMutex qemu_global_mutex;
>> +static QemuMutex qemu_mig_throttle_mutex;
>>   static QemuCond qemu_io_proceeded_cond;
>>   static bool iothread_requesting_mutex;
>>   
>> @@ -638,6 +639,7 @@ void qemu_init_cpu_loop(void)
>>       qemu_cond_init(&qemu_work_cond);
>>       qemu_cond_init(&qemu_io_proceeded_cond);
>>       qemu_mutex_init(&qemu_global_mutex);
>> +    qemu_mutex_init(&qemu_mig_throttle_mutex);
>>   
>>       qemu_thread_get_self(&io_thread);
>>   }
>> @@ -943,6 +945,16 @@ void qemu_mutex_unlock_iothread(void)
>>       qemu_mutex_unlock(&qemu_global_mutex);
>>   }
>>   
>> +void qemu_mutex_lock_mig_throttle(void)
>> +{
>> +    qemu_mutex_lock(&qemu_mig_throttle_mutex);
>> +}
>> +
>> +void qemu_mutex_unlock_mig_throttle(void)
>> +{
>> +    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
>> +}
>> +
>>   static int all_vcpus_paused(void)
>>   {
>>       CPUArchState *penv = first_cpu;
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index e2acec6..94bdb8c 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -127,4 +127,16 @@ int migrate_use_xbzrle(void);
>>   int64_t migrate_xbzrle_cache_size(void);
>>   
>>   int64_t xbzrle_cache_resize(int64_t new_size);
>> +
>> +#ifndef _QEMU_MIG_THROTTLE
>> +#define _QEMU_MIG_THROTTLE
> Do you need those defines?
Not needed ...will remove them.
>
>> +
>> +bool migrate_auto_converge(void);
>> +
>> +bool throttling_needed(void);
>> +bool throttling_now(void);
>> +void *migration_throttle_down(void *);
>> +
>> +#endif
>> +
>>   #endif
>> diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
>> index 6f0200a..9a3886d 100644
>> --- a/include/qemu/main-loop.h
>> +++ b/include/qemu/main-loop.h
>> @@ -299,6 +299,9 @@ void qemu_mutex_lock_iothread(void);
>>    */
>>   void qemu_mutex_unlock_iothread(void);
>>   
>> +void qemu_mutex_lock_mig_throttle(void);
>> +void qemu_mutex_unlock_mig_throttle(void);
>> +
>>   /* internal interfaces */
>>   
>>   void qemu_fd_register(int fd);
>> diff --git a/kvm-all.c b/kvm-all.c
>> index 2d92721..a92cb77 100644
>> --- a/kvm-all.c
>> +++ b/kvm-all.c
>> @@ -33,6 +33,8 @@
>>   #include "exec/memory.h"
>>   #include "exec/address-spaces.h"
>>   #include "qemu/event_notifier.h"
>> +#include "sysemu/cpus.h"
>> +#include "migration/migration.h"
>>   
>>   /* This check must be after config-host.h is included */
>>   #ifdef CONFIG_EVENTFD
>> @@ -116,6 +118,8 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>       KVM_CAP_LAST_INFO
>>   };
>>   
>> +static void mig_delay_vcpu(void);
>> +
>>   static KVMSlot *kvm_alloc_slot(KVMState *s)
>>   {
>>       int i;
>> @@ -1609,6 +1613,10 @@ int kvm_cpu_exec(CPUArchState *env)
>>           }
>>           qemu_mutex_unlock_iothread();
>>   
>> +        if (throttling_needed()) {
>> +            mig_delay_vcpu();
>> +        }
>> +
>>           run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
>>   
>>           qemu_mutex_lock_iothread();
>> @@ -2032,3 +2040,46 @@ int kvm_on_sigbus(int code, void *addr)
>>   {
>>       return kvm_arch_on_sigbus(code, addr);
>>   }
>> +
>> +static bool throttling;
>> +bool throttling_now(void)
>> +{
>> +    if (throttling) {
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
> it will be simpler to just return throttling ?

yes. will fix it.


Thanks for your comments
Vinod
>
>> +static void mig_delay_vcpu(void)
>> +{
>> +    g_usleep(50*1000);
>> +}
>> +
>> +/* Stub used for getting the vcpu out of VM and into qemu via
>> +   run_on_cpu()*/
>> +static void mig_kick_cpu(void *opq)
>> +{
>> +    return;
>> +}
>> +
>> +/* To reduce the dirty rate explicitly disallow the VCPUs from spending
>> +   much time in the VM. The migration thread will try to catchup.
>> +   Workload will experience a greater performance drop but for a shorter
>> +   duration.
>> +*/
>> +void *migration_throttle_down(void *opaque)
>> +{
>> +    throttling = true;
>> +    while (throttling_needed()) {
>> +        CPUArchState *penv = first_cpu;
>> +        while (penv) {
>> +            qemu_mutex_lock_iothread();
>> +            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
>> +            qemu_mutex_unlock_iothread();
>> +            penv = penv->next_cpu;
>> +        }
>> +        g_usleep(25*1000);
>> +    }
>> +    throttling = false;
>> +    return NULL;
>> +}
>> diff --git a/migration.c b/migration.c
>> index 3eb0fad..834156e 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -24,6 +24,7 @@
>>   #include "qemu/thread.h"
>>   #include "qmp-commands.h"
>>   #include "trace.h"
>> +#include "sysemu/cpus.h"
>>   
>>   //#define DEBUG_MIGRATION
>>   
>> @@ -474,6 +475,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>>       max_downtime = (uint64_t)value;
>>   }
>>   
>> +bool migrate_auto_converge(void)
>> +{
>> +    MigrationState *s;
>> +
>> +    s = migrate_get_current();
>> +
>> +    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
>> +}
>> +
>>   int migrate_use_xbzrle(void)
>>   {
>>       MigrationState *s;
>> @@ -503,6 +513,7 @@ static void *migration_thread(void *opaque)
>>       int64_t max_size = 0;
>>       int64_t start_time = initial_time;
>>       bool old_vm_running = false;
>> +    QemuThread thread;
>>   
>>       DPRINTF("beginning savevm\n");
>>       qemu_savevm_state_begin(s->file, &s->params);
>> @@ -517,6 +528,10 @@ static void *migration_thread(void *opaque)
>>               DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>>               if (pending_size && pending_size >= max_size) {
>>                   qemu_savevm_state_iterate(s->file);
>> +                if (throttling_needed() && !throttling_now()) {
>> +                    qemu_thread_create(&thread, migration_throttle_down,
>> +                               NULL, QEMU_THREAD_DETACHED);
>> +                }
>>               } else {
>>                   DPRINTF("done iterating\n");
>>                   qemu_mutex_lock_iothread();
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 5b0fb3b..b662e33 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -599,10 +599,14 @@
>>   #          This feature allows us to minimize migration traffic for certain work
>>   #          loads, by sending compressed difference of the pages
>>   #
>> +# @auto-converge: Controls whether or not we want the migration to
>> +#          automatically detect and force convergence by slowing
>> +#          down the guest. Disabled by default.
>> +#
>>   # Since: 1.2
>>   ##
>>   { 'enum': 'MigrationCapability',
>> -  'data': ['xbzrle'] }
>> +  'data': ['xbzrle', 'auto-converge'] }
>>   
>>   ##
>>   # @MigrationCapabilityStatus



>> .
>>

Patch

diff --git a/arch_init.c b/arch_init.c
index 92de1bd..6dcc742 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -104,6 +104,7 @@  int graphic_depth = 15;
 #endif
 
 const uint32_t arch_type = QEMU_ARCH;
+static uint64_t mig_throttle_on;
 
 /***********************************************************/
 /* ram save/restore */
@@ -379,12 +380,20 @@  static void migration_bitmap_sync(void)
     MigrationState *s = migrate_get_current();
     static int64_t start_time;
     static int64_t num_dirty_pages_period;
+    static int64_t bytes_xfer_prev;
     int64_t end_time;
+    int64_t bytes_xfer_now;
+    static int dirty_rate_high_cnt;
+
+    if (migrate_auto_converge() && !bytes_xfer_prev) {
+        bytes_xfer_prev = ram_bytes_transferred();
+    }
 
     if (!start_time) {
         start_time = qemu_get_clock_ms(rt_clock);
     }
 
+
     trace_migration_bitmap_sync_start();
     memory_global_sync_dirty_bitmap(get_system_memory());
 
@@ -404,6 +413,23 @@  static void migration_bitmap_sync(void)
 
     /* more than 1 second = 1000 millisecons */
     if (end_time > start_time + 1000) {
+        if (migrate_auto_converge()) {
+            /* The following detection logic can be refined later. For now:
+               Check to see if the dirtied bytes is 50% more than the approx.
+               amount of bytes that just got transferred since the last time we
+               were in this routine. If that happens N times (for now N==5)
+               we turn on the throttle down logic */
+            bytes_xfer_now = ram_bytes_transferred();
+            if (s->dirty_pages_rate &&
+                ((num_dirty_pages_period*TARGET_PAGE_SIZE) >
+                ((bytes_xfer_now - bytes_xfer_prev)/2))) {
+                if (dirty_rate_high_cnt++ > 5) {
+                    DPRINTF("Unable to converge. Throttling down guest\n");
+                    mig_throttle_on = 1;
+                }
+             }
+             bytes_xfer_prev = bytes_xfer_now;
+        }
         s->dirty_pages_rate = num_dirty_pages_period * 1000
             / (end_time - start_time);
         s->dirty_bytes_rate = s->dirty_pages_rate * TARGET_PAGE_SIZE;
@@ -496,6 +522,24 @@  static int ram_save_block(QEMUFile *f, bool last_stage)
     return bytes_sent;
 }
 
+bool throttling_needed(void)
+{
+    bool value;
+
+    if (!migrate_auto_converge()) {
+        return false;
+    }
+
+    qemu_mutex_lock_mig_throttle();
+    value = mig_throttle_on;
+    qemu_mutex_unlock_mig_throttle();
+
+    if (value) {
+        return true;
+    }
+    return false;
+}
+
 static uint64_t bytes_transferred;
 
 static ram_addr_t ram_save_remaining(void)
diff --git a/cpus.c b/cpus.c
index 5a98a37..615c25a 100644
--- a/cpus.c
+++ b/cpus.c
@@ -616,6 +616,7 @@  static void qemu_tcg_init_cpu_signals(void)
 #endif /* _WIN32 */
 
 static QemuMutex qemu_global_mutex;
+static QemuMutex qemu_mig_throttle_mutex;
 static QemuCond qemu_io_proceeded_cond;
 static bool iothread_requesting_mutex;
 
@@ -638,6 +639,7 @@  void qemu_init_cpu_loop(void)
     qemu_cond_init(&qemu_work_cond);
     qemu_cond_init(&qemu_io_proceeded_cond);
     qemu_mutex_init(&qemu_global_mutex);
+    qemu_mutex_init(&qemu_mig_throttle_mutex);
 
     qemu_thread_get_self(&io_thread);
 }
@@ -943,6 +945,16 @@  void qemu_mutex_unlock_iothread(void)
     qemu_mutex_unlock(&qemu_global_mutex);
 }
 
+void qemu_mutex_lock_mig_throttle(void)
+{
+    qemu_mutex_lock(&qemu_mig_throttle_mutex);
+}
+
+void qemu_mutex_unlock_mig_throttle(void)
+{
+    qemu_mutex_unlock(&qemu_mig_throttle_mutex);
+}
+
 static int all_vcpus_paused(void)
 {
     CPUArchState *penv = first_cpu;
diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..94bdb8c 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -127,4 +127,16 @@  int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+#ifndef _QEMU_MIG_THROTTLE
+#define _QEMU_MIG_THROTTLE
+
+bool migrate_auto_converge(void);
+
+bool throttling_needed(void);
+bool throttling_now(void);
+void *migration_throttle_down(void *);
+
+#endif
+
 #endif
diff --git a/include/qemu/main-loop.h b/include/qemu/main-loop.h
index 6f0200a..9a3886d 100644
--- a/include/qemu/main-loop.h
+++ b/include/qemu/main-loop.h
@@ -299,6 +299,9 @@  void qemu_mutex_lock_iothread(void);
  */
 void qemu_mutex_unlock_iothread(void);
 
+void qemu_mutex_lock_mig_throttle(void);
+void qemu_mutex_unlock_mig_throttle(void);
+
 /* internal interfaces */
 
 void qemu_fd_register(int fd);
diff --git a/kvm-all.c b/kvm-all.c
index 2d92721..a92cb77 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -33,6 +33,8 @@ 
 #include "exec/memory.h"
 #include "exec/address-spaces.h"
 #include "qemu/event_notifier.h"
+#include "sysemu/cpus.h"
+#include "migration/migration.h"
 
 /* This check must be after config-host.h is included */
 #ifdef CONFIG_EVENTFD
@@ -116,6 +118,8 @@  static const KVMCapabilityInfo kvm_required_capabilites[] = {
     KVM_CAP_LAST_INFO
 };
 
+static void mig_delay_vcpu(void);
+
 static KVMSlot *kvm_alloc_slot(KVMState *s)
 {
     int i;
@@ -1609,6 +1613,10 @@  int kvm_cpu_exec(CPUArchState *env)
         }
         qemu_mutex_unlock_iothread();
 
+        if (throttling_needed()) {
+            mig_delay_vcpu();
+        }
+
         run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
 
         qemu_mutex_lock_iothread();
@@ -2032,3 +2040,46 @@  int kvm_on_sigbus(int code, void *addr)
 {
     return kvm_arch_on_sigbus(code, addr);
 }
+
+static bool throttling;
+bool throttling_now(void)
+{
+    if (throttling) {
+        return true;
+    }
+    return false;
+}
+
+static void mig_delay_vcpu(void)
+{
+    g_usleep(50*1000);
+}
+
+/* Stub used for getting the vcpu out of VM and into qemu via
+   run_on_cpu()*/
+static void mig_kick_cpu(void *opq)
+{
+    return;
+}
+
+/* To reduce the dirty rate explicitly disallow the VCPUs from spending
+   much time in the VM. The migration thread will try to catchup.
+   Workload will experience a greater performance drop but for a shorter
+   duration.
+*/
+void *migration_throttle_down(void *opaque)
+{
+    throttling = true;
+    while (throttling_needed()) {
+        CPUArchState *penv = first_cpu;
+        while (penv) {
+            qemu_mutex_lock_iothread();
+            run_on_cpu(ENV_GET_CPU(penv), mig_kick_cpu, NULL);
+            qemu_mutex_unlock_iothread();
+            penv = penv->next_cpu;
+        }
+        g_usleep(25*1000);
+    }
+    throttling = false;
+    return NULL;
+}
diff --git a/migration.c b/migration.c
index 3eb0fad..834156e 100644
--- a/migration.c
+++ b/migration.c
@@ -24,6 +24,7 @@ 
 #include "qemu/thread.h"
 #include "qmp-commands.h"
 #include "trace.h"
+#include "sysemu/cpus.h"
 
 //#define DEBUG_MIGRATION
 
@@ -474,6 +475,15 @@  void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+bool migrate_auto_converge(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_AUTO_CONVERGE];
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
@@ -503,6 +513,7 @@  static void *migration_thread(void *opaque)
     int64_t max_size = 0;
     int64_t start_time = initial_time;
     bool old_vm_running = false;
+    QemuThread thread;
 
     DPRINTF("beginning savevm\n");
     qemu_savevm_state_begin(s->file, &s->params);
@@ -517,6 +528,10 @@  static void *migration_thread(void *opaque)
             DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
             if (pending_size && pending_size >= max_size) {
                 qemu_savevm_state_iterate(s->file);
+                if (throttling_needed() && !throttling_now()) {
+                    qemu_thread_create(&thread, migration_throttle_down,
+                               NULL, QEMU_THREAD_DETACHED);
+                }
             } else {
                 DPRINTF("done iterating\n");
                 qemu_mutex_lock_iothread();
diff --git a/qapi-schema.json b/qapi-schema.json
index 5b0fb3b..b662e33 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -599,10 +599,14 @@ 
 #          This feature allows us to minimize migration traffic for certain work
 #          loads, by sending compressed difference of the pages
 #
+# @auto-converge: Controls whether or not we want the migration to
+#          automatically detect and force convergence by slowing
+#          down the guest. Disabled by default.
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'auto-converge'] }
 
 ##
 # @MigrationCapabilityStatus