diff mbox

[qemu,V3,2/2] kvmclock: reduce kvmclock difference on migration

Message ID 20161121105052.598267440@redhat.com
State New
Headers show

Commit Message

Marcelo Tosatti Nov. 21, 2016, 10:50 a.m. UTC
Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
indicates that KVM_GET_CLOCK returns a value as seen by the guest at
that moment.

For new machine types, use this value rather than reading 
from guest memory.

This reduces kvmclock difference on migration from 5s to 0.1s
(when max_downtime == 5s).

Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

---
 hw/i386/kvm/clock.c    |  107 ++++++++++++++++++++++++++++++++++++++++++-------
 include/hw/i386/pc.h   |    5 ++
 target-i386/kvm.c      |    7 +++
 target-i386/kvm_i386.h |    1 
 4 files changed, 106 insertions(+), 14 deletions(-)

v2: 
- improve variable names (Juan)
- consolidate code on kvm_get_clock function (Paolo)
- return mach_use_reliable_get_clock from needed function (Paolo)
v3: 
- simplify check for src_use_reliable_get_clock (Eduardo)
- move src_use_reliable_get_clock initialization to realize (Eduardo)

Comments

Eduardo Habkost Nov. 28, 2016, 2:13 p.m. UTC | #1
Sorry for not noticing the following issues on the previous
reviews. I was only paying attention to the vmstate and machine
code:

On Mon, Nov 21, 2016 at 08:50:04AM -0200, Marcelo Tosatti wrote:
> Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
> indicates that KVM_GET_CLOCK returns a value as seen by the guest at
> that moment.
> 
> For new machine types, use this value rather than reading 
> from guest memory.
> 
> This reduces kvmclock difference on migration from 5s to 0.1s
> (when max_downtime == 5s).
> 
> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> 
> ---
>  hw/i386/kvm/clock.c    |  107 ++++++++++++++++++++++++++++++++++++++++++-------
>  include/hw/i386/pc.h   |    5 ++
>  target-i386/kvm.c      |    7 +++
>  target-i386/kvm_i386.h |    1 
>  4 files changed, 106 insertions(+), 14 deletions(-)
> 
> v2: 
> - improve variable names (Juan)
> - consolidate code on kvm_get_clock function (Paolo)
> - return mach_use_reliable_get_clock from needed function (Paolo)
> v3: 
> - simplify check for src_use_reliable_get_clock (Eduardo)
> - move src_use_reliable_get_clock initialization to realize (Eduardo)
> 
> 
> Index: qemu-mig-advance-clock/hw/i386/kvm/clock.c
> ===================================================================
> --- qemu-mig-advance-clock.orig/hw/i386/kvm/clock.c	2016-11-17 15:07:11.220632761 -0200
> +++ qemu-mig-advance-clock/hw/i386/kvm/clock.c	2016-11-17 15:11:51.372048640 -0200
> @@ -36,6 +36,12 @@
>  
>      uint64_t clock;
>      bool clock_valid;
> +
> +    /* whether machine supports reliable KVM_GET_CLOCK */
> +    bool mach_use_reliable_get_clock;
> +
> +    /* whether source host supported reliable KVM_GET_CLOCK */
> +    bool src_use_reliable_get_clock;
>  } KVMClockState;
>  
>  struct pvclock_vcpu_time_info {
> @@ -81,6 +87,19 @@
>      return nsec + time.system_time;
>  }
>  
> +static uint64_t kvm_get_clock(void)
> +{
> +    struct kvm_clock_data data;
> +    int ret;
> +
> +    ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> +    if (ret < 0) {
> +        fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> +                abort();
> +    }
> +    return data.clock;
> +}
> +
>  static void kvmclock_vm_state_change(void *opaque, int running,
>                                       RunState state)
>  {
> @@ -91,15 +110,36 @@

Can you please use "diff -p" on your patches?

>  
>      if (running) {
>          struct kvm_clock_data data = {};
> -        uint64_t time_at_migration = kvmclock_current_nsec(s);
> +        uint64_t pvclock_via_mem = 0;
>  
> -        s->clock_valid = false;
> +        /* local (running VM) restore */
> +        if (s->clock_valid) {
> +            /*
> +             * if host does not support reliable KVM_GET_CLOCK,
> +             * read kvmclock value from memory
> +             */
> +            if (!kvm_has_adjust_clock_stable()) {

Why not use src_use_reliable_get_clock here, too? (Maybe rename
it to simply clock_is_reliable?)

You can change kvm_get_clock() to kvm_get_clock(KVMClockState*),
and make it update both s->clock and s->clock_is_reliable. With
this, s->clock and s->clock_is_reliable will be always in sync,
and have a single code path for both local restore and migration:

    /*
     * If host that set s->clock did not support reliable
     * KVM_GET_CLOCK, read kvmclock value from memory
     */
    if (!s->clock_is_reliable) {
        pvclock_via_mem = kvmclock_current_nsec(s);
    }

Updating the table from my previous message, the result would be:

--------+------------+---------------+-----------+---------+
        | kvm_has_adj_clock_stable() | clock_is_reliable   |
machine | src        | dst           | src       | dst (*) |
--------+------------+---------------+-----------+---------+
pc-2.8  | false      | false         | false     | false   |
pc-2.8  | false      | true          | false     | false   |
pc-2.8  | true       | false         | true      | true    |
pc-2.8  | true       | true          | true      | true    |
--------+------------+---------------+-----------+---------+
pc-2.7  | false      | (any)         | false     | false   |
pc-2.7  | true       | (any)         | true (**) | false   |
--------+------------+---------------+-----------+---------+

(*) More precisely: in the first kvmclock_vm_state_change() call
    just after migration. It will get updated the next time
    s->clock is set.

(**) On purpose, so the code I suggest above would work for local
     restore, too.


> +                pvclock_via_mem = kvmclock_current_nsec(s);
> +            }
> +        /* migration/savevm/init restore */
> +        } else {
> +            /*
> +             * use s->clock in case machine uses reliable
> +             * get clock and source host supported
> +             * reliable get clock
> +             */
> +            if (!s->src_use_reliable_get_clock) {
> +                pvclock_via_mem = kvmclock_current_nsec(s);
> +            }
> +        }
>  
> -        /* We can't rely on the migrated clock value, just discard it */
> -        if (time_at_migration) {
> -            s->clock = time_at_migration;
> +        /* We can't rely on the saved clock value, just discard it */
> +        if (pvclock_via_mem) {
> +            s->clock = pvclock_via_mem;
>          }
>  
> +        s->clock_valid = false;
> +
>          data.clock = s->clock;
>          ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
>          if (ret < 0) {
> @@ -120,8 +160,6 @@
>              }
>          }
>      } else {
> -        struct kvm_clock_data data;
> -        int ret;
>  
>          if (s->clock_valid) {
>              return;
> @@ -129,13 +167,7 @@
>  
>          kvm_synchronize_all_tsc();
>  
> -        ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> -        if (ret < 0) {
> -            fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> -            abort();
> -        }
> -        s->clock = data.clock;
> -
> +        s->clock = kvm_get_clock();
>          /*
>           * If the VM is stopped, declare the clock state valid to
>           * avoid re-reading it on next vmsave (which would return
> @@ -149,25 +181,72 @@
>  {
>      KVMClockState *s = KVM_CLOCK(dev);
>  
> +    /*
> +     * On machine types that support reliable KVM_GET_CLOCK,
> +     * if host kernel does provide reliable KVM_GET_CLOCK,
> +     * set src_use_reliable_get_clock=true so that destination
> +     * avoids reading kvmclock from memory.
> +     */
> +    if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
> +        s->src_use_reliable_get_clock = true;
> +    }
> +
>      qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
>  }
>  
> +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> +{
> +    KVMClockState *s = opaque;
> +
> +    return s->mach_use_reliable_get_clock;
> +}
> +
> +static const VMStateDescription kvmclock_reliable_get_clock = {
> +    .name = "kvmclock/src_use_reliable_get_clock",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = kvmclock_src_use_reliable_get_clock,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
> +        VMSTATE_END_OF_LIST()
> +    }
> +};
> +
> +static void kvmclock_pre_save(void *opaque)
> +{
> +    KVMClockState *s = opaque;
> +
> +    s->clock = kvm_get_clock();

I can't tell if s->clock_valid here is a bug, or a feature. In
either case, we seem to be contradicting the comment at
kvmclock_vm_state_change():

    /*
     * If the VM is stopped, declare the clock state valid to
     * avoid re-reading it on next vmsave (which would return
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     * a different value). Will be reset when the VM is continued.
     */
    s->clock_valid = true;

> +}
> +
>  static const VMStateDescription kvmclock_vmsd = {
>      .name = "kvmclock",
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .pre_save = kvmclock_pre_save,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT64(clock, KVMClockState),
>          VMSTATE_END_OF_LIST()
> +    },
> +    .subsections = (const VMStateDescription * []) {
> +        &kvmclock_reliable_get_clock,
> +        NULL
>      }
>  };
>  
> +static Property kvmclock_properties[] = {
> +    DEFINE_PROP_BOOL("x-mach-use-reliable-get-clock", KVMClockState,
> +                      mach_use_reliable_get_clock, true),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
>  static void kvmclock_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
>  
>      dc->realize = kvmclock_realize;
>      dc->vmsd = &kvmclock_vmsd;
> +    dc->props = kvmclock_properties;
>  }
>  
>  static const TypeInfo kvmclock_info = {
> Index: qemu-mig-advance-clock/include/hw/i386/pc.h
> ===================================================================
> --- qemu-mig-advance-clock.orig/include/hw/i386/pc.h	2016-11-17 15:07:11.220632761 -0200
> +++ qemu-mig-advance-clock/include/hw/i386/pc.h	2016-11-17 15:08:58.096791416 -0200
> @@ -370,6 +370,11 @@
>  #define PC_COMPAT_2_7 \
>      HW_COMPAT_2_7 \
>      {\
> +        .driver   = "kvmclock",\
> +        .property = "x-mach-use-reliable-get-clock",\
> +        .value    = "off",\
> +    },\
> +    {\
>          .driver   = TYPE_X86_CPU,\
>          .property = "l3-cache",\
>          .value    = "off",\
> Index: qemu-mig-advance-clock/target-i386/kvm.c
> ===================================================================
> --- qemu-mig-advance-clock.orig/target-i386/kvm.c	2016-11-17 15:07:11.221632762 -0200
> +++ qemu-mig-advance-clock/target-i386/kvm.c	2016-11-17 15:07:15.867639659 -0200
> @@ -117,6 +117,13 @@
>      return kvm_check_extension(kvm_state, KVM_CAP_X86_SMM);
>  }
>  
> +bool kvm_has_adjust_clock_stable(void)
> +{
> +    int ret = kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
> +
> +    return (ret == KVM_CLOCK_TSC_STABLE);
> +}
> +
>  bool kvm_allows_irq0_override(void)
>  {
>      return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
> Index: qemu-mig-advance-clock/target-i386/kvm_i386.h
> ===================================================================
> --- qemu-mig-advance-clock.orig/target-i386/kvm_i386.h	2016-11-17 15:07:11.222632764 -0200
> +++ qemu-mig-advance-clock/target-i386/kvm_i386.h	2016-11-17 15:07:15.867639659 -0200
> @@ -17,6 +17,7 @@
>  
>  bool kvm_allows_irq0_override(void);
>  bool kvm_has_smm(void);
> +bool kvm_has_adjust_clock_stable(void);
>  void kvm_synchronize_all_tsc(void);
>  void kvm_arch_reset_vcpu(X86CPU *cs);
>  void kvm_arch_do_init_vcpu(X86CPU *cs);
> 
>
Marcelo Tosatti Nov. 28, 2016, 4:45 p.m. UTC | #2
On Mon, Nov 28, 2016 at 12:13:22PM -0200, Eduardo Habkost wrote:
> Sorry for not noticing the following issues on the previous
> reviews. I was only paying attention to the vmstate and machine
> code:
> 
> On Mon, Nov 21, 2016 at 08:50:04AM -0200, Marcelo Tosatti wrote:
> > Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
> > indicates that KVM_GET_CLOCK returns a value as seen by the guest at
> > that moment.
> > 
> > For new machine types, use this value rather than reading 
> > from guest memory.
> > 
> > This reduces kvmclock difference on migration from 5s to 0.1s
> > (when max_downtime == 5s).
> > 
> > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > 
> > ---
> >  hw/i386/kvm/clock.c    |  107 ++++++++++++++++++++++++++++++++++++++++++-------
> >  include/hw/i386/pc.h   |    5 ++
> >  target-i386/kvm.c      |    7 +++
> >  target-i386/kvm_i386.h |    1 
> >  4 files changed, 106 insertions(+), 14 deletions(-)
> > 
> > v2: 
> > - improve variable names (Juan)
> > - consolidate code on kvm_get_clock function (Paolo)
> > - return mach_use_reliable_get_clock from needed function (Paolo)
> > v3: 
> > - simplify check for src_use_reliable_get_clock (Eduardo)
> > - move src_use_reliable_get_clock initialization to realize (Eduardo)
> > 
> > 
> > Index: qemu-mig-advance-clock/hw/i386/kvm/clock.c
> > ===================================================================
> > --- qemu-mig-advance-clock.orig/hw/i386/kvm/clock.c	2016-11-17 15:07:11.220632761 -0200
> > +++ qemu-mig-advance-clock/hw/i386/kvm/clock.c	2016-11-17 15:11:51.372048640 -0200
> > @@ -36,6 +36,12 @@
> >  
> >      uint64_t clock;
> >      bool clock_valid;
> > +
> > +    /* whether machine supports reliable KVM_GET_CLOCK */
> > +    bool mach_use_reliable_get_clock;
> > +
> > +    /* whether source host supported reliable KVM_GET_CLOCK */
> > +    bool src_use_reliable_get_clock;
> >  } KVMClockState;
> >  
> >  struct pvclock_vcpu_time_info {
> > @@ -81,6 +87,19 @@
> >      return nsec + time.system_time;
> >  }
> >  
> > +static uint64_t kvm_get_clock(void)
> > +{
> > +    struct kvm_clock_data data;
> > +    int ret;
> > +
> > +    ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > +    if (ret < 0) {
> > +        fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > +                abort();
> > +    }
> > +    return data.clock;
> > +}
> > +
> >  static void kvmclock_vm_state_change(void *opaque, int running,
> >                                       RunState state)
> >  {
> > @@ -91,15 +110,36 @@
> 
> Can you please use "diff -p" on your patches?
> 
> >  
> >      if (running) {
> >          struct kvm_clock_data data = {};
> > -        uint64_t time_at_migration = kvmclock_current_nsec(s);
> > +        uint64_t pvclock_via_mem = 0;
> >  
> > -        s->clock_valid = false;
> > +        /* local (running VM) restore */
> > +        if (s->clock_valid) {
> > +            /*
> > +             * if host does not support reliable KVM_GET_CLOCK,
> > +             * read kvmclock value from memory
> > +             */
> > +            if (!kvm_has_adjust_clock_stable()) {
> 
> Why not use src_use_reliable_get_clock here, too? (Maybe rename
> it to simply clock_is_reliable?)

Because you end up mixing two mental ideas (one: whether the source
has KVM_GET_CLOCK, second: whether the destination has KVM_GET_CLOCK 
into one variable). I find it more confusing than not.
Maybe its just my limited brain but i find it
confusing.

> You can change kvm_get_clock() to kvm_get_clock(KVMClockState*),
> and make it update both s->clock and s->clock_is_reliable. With
> this, s->clock and s->clock_is_reliable will be always in sync,
> and have a single code path for both local restore and migration:

There should not be a single path for migration/runtime cases: 
on migration, we care whether the source used reliable KVM_GET_CLOCK.

On runtime, we care whether the destination uses reliable KVM_GET_CLOCK.

>     /*
>      * If host that set s->clock did not support reliable
>      * KVM_GET_CLOCK, read kvmclock value from memory
>      */
>     if (!s->clock_is_reliable) {
>         pvclock_via_mem = kvmclock_current_nsec(s);
>     }
> 
> Updating the table from my previous message, the result would be:
> 
> --------+------------+---------------+-----------+---------+
>         | kvm_has_adj_clock_stable() | clock_is_reliable   |
> machine | src        | dst           | src       | dst (*) |
> --------+------------+---------------+-----------+---------+
> pc-2.8  | false      | false         | false     | false   |
> pc-2.8  | false      | true          | false     | false   |
> pc-2.8  | true       | false         | true      | true    |
> pc-2.8  | true       | true          | true      | true    |
> --------+------------+---------------+-----------+---------+
> pc-2.7  | false      | (any)         | false     | false   |
> pc-2.7  | true       | (any)         | true (**) | false   |
> --------+------------+---------------+-----------+---------+
> 
> (*) More precisely: in the first kvmclock_vm_state_change() call
>     just after migration. It will get updated the next time
>     s->clock is set.
> 
> (**) On purpose, so the code I suggest above would work for local
>      restore, too.
> 
> 
> > +                pvclock_via_mem = kvmclock_current_nsec(s);
> > +            }
> > +        /* migration/savevm/init restore */
> > +        } else {
> > +            /*
> > +             * use s->clock in case machine uses reliable
> > +             * get clock and source host supported
> > +             * reliable get clock
> > +             */
> > +            if (!s->src_use_reliable_get_clock) {
> > +                pvclock_via_mem = kvmclock_current_nsec(s);
> > +            }
> > +        }
> >  
> > -        /* We can't rely on the migrated clock value, just discard it */
> > -        if (time_at_migration) {
> > -            s->clock = time_at_migration;
> > +        /* We can't rely on the saved clock value, just discard it */
> > +        if (pvclock_via_mem) {
> > +            s->clock = pvclock_via_mem;
> >          }
> >  
> > +        s->clock_valid = false;
> > +
> >          data.clock = s->clock;
> >          ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> >          if (ret < 0) {
> > @@ -120,8 +160,6 @@
> >              }
> >          }
> >      } else {
> > -        struct kvm_clock_data data;
> > -        int ret;
> >  
> >          if (s->clock_valid) {
> >              return;
> > @@ -129,13 +167,7 @@
> >  
> >          kvm_synchronize_all_tsc();
> >  
> > -        ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > -        if (ret < 0) {
> > -            fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > -            abort();
> > -        }
> > -        s->clock = data.clock;
> > -
> > +        s->clock = kvm_get_clock();
> >          /*
> >           * If the VM is stopped, declare the clock state valid to
> >           * avoid re-reading it on next vmsave (which would return
> > @@ -149,25 +181,72 @@
> >  {
> >      KVMClockState *s = KVM_CLOCK(dev);
> >  
> > +    /*
> > +     * On machine types that support reliable KVM_GET_CLOCK,
> > +     * if host kernel does provide reliable KVM_GET_CLOCK,
> > +     * set src_use_reliable_get_clock=true so that destination
> > +     * avoids reading kvmclock from memory.
> > +     */
> > +    if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
> > +        s->src_use_reliable_get_clock = true;
> > +    }
> > +
> >      qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
> >  }
> >  
> > +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> > +{
> > +    KVMClockState *s = opaque;
> > +
> > +    return s->mach_use_reliable_get_clock;
> > +}
> > +
> > +static const VMStateDescription kvmclock_reliable_get_clock = {
> > +    .name = "kvmclock/src_use_reliable_get_clock",
> > +    .version_id = 1,
> > +    .minimum_version_id = 1,
> > +    .needed = kvmclock_src_use_reliable_get_clock,
> > +    .fields = (VMStateField[]) {
> > +        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
> > +        VMSTATE_END_OF_LIST()
> > +    }
> > +};
> > +
> > +static void kvmclock_pre_save(void *opaque)
> > +{
> > +    KVMClockState *s = opaque;
> > +
> > +    s->clock = kvm_get_clock();
> 
> I can't tell if s->clock_valid here is a bug, or a feature. In
> either case, we seem to be contradicting the comment at
> kvmclock_vm_state_change():
> 
>     /*
>      * If the VM is stopped, declare the clock state valid to
>      * avoid re-reading it on next vmsave (which would return
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>      * a different value). Will be reset when the VM is continued.
>      */
>     s->clock_valid = true;

Sorry, what is the problem??

> 
> > +}
> > +
> >  static const VMStateDescription kvmclock_vmsd = {
> >      .name = "kvmclock",
> >      .version_id = 1,
> >      .minimum_version_id = 1,
> > +    .pre_save = kvmclock_pre_save,
> >      .fields = (VMStateField[]) {
> >          VMSTATE_UINT64(clock, KVMClockState),
> >          VMSTATE_END_OF_LIST()
> > +    },
> > +    .subsections = (const VMStateDescription * []) {
> > +        &kvmclock_reliable_get_clock,
> > +        NULL
> >      }
> >  };
> >  
> > +static Property kvmclock_properties[] = {
> > +    DEFINE_PROP_BOOL("x-mach-use-reliable-get-clock", KVMClockState,
> > +                      mach_use_reliable_get_clock, true),
> > +    DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> >  static void kvmclock_class_init(ObjectClass *klass, void *data)
> >  {
> >      DeviceClass *dc = DEVICE_CLASS(klass);
> >  
> >      dc->realize = kvmclock_realize;
> >      dc->vmsd = &kvmclock_vmsd;
> > +    dc->props = kvmclock_properties;
> >  }
> >  
> >  static const TypeInfo kvmclock_info = {
> > Index: qemu-mig-advance-clock/include/hw/i386/pc.h
> > ===================================================================
> > --- qemu-mig-advance-clock.orig/include/hw/i386/pc.h	2016-11-17 15:07:11.220632761 -0200
> > +++ qemu-mig-advance-clock/include/hw/i386/pc.h	2016-11-17 15:08:58.096791416 -0200
> > @@ -370,6 +370,11 @@
> >  #define PC_COMPAT_2_7 \
> >      HW_COMPAT_2_7 \
> >      {\
> > +        .driver   = "kvmclock",\
> > +        .property = "x-mach-use-reliable-get-clock",\
> > +        .value    = "off",\
> > +    },\
> > +    {\
> >          .driver   = TYPE_X86_CPU,\
> >          .property = "l3-cache",\
> >          .value    = "off",\
> > Index: qemu-mig-advance-clock/target-i386/kvm.c
> > ===================================================================
> > --- qemu-mig-advance-clock.orig/target-i386/kvm.c	2016-11-17 15:07:11.221632762 -0200
> > +++ qemu-mig-advance-clock/target-i386/kvm.c	2016-11-17 15:07:15.867639659 -0200
> > @@ -117,6 +117,13 @@
> >      return kvm_check_extension(kvm_state, KVM_CAP_X86_SMM);
> >  }
> >  
> > +bool kvm_has_adjust_clock_stable(void)
> > +{
> > +    int ret = kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
> > +
> > +    return (ret == KVM_CLOCK_TSC_STABLE);
> > +}
> > +
> >  bool kvm_allows_irq0_override(void)
> >  {
> >      return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
> > Index: qemu-mig-advance-clock/target-i386/kvm_i386.h
> > ===================================================================
> > --- qemu-mig-advance-clock.orig/target-i386/kvm_i386.h	2016-11-17 15:07:11.222632764 -0200
> > +++ qemu-mig-advance-clock/target-i386/kvm_i386.h	2016-11-17 15:07:15.867639659 -0200
> > @@ -17,6 +17,7 @@
> >  
> >  bool kvm_allows_irq0_override(void);
> >  bool kvm_has_smm(void);
> > +bool kvm_has_adjust_clock_stable(void);
> >  void kvm_synchronize_all_tsc(void);
> >  void kvm_arch_reset_vcpu(X86CPU *cs);
> >  void kvm_arch_do_init_vcpu(X86CPU *cs);
> > 
> > 
> 
> -- 
> Eduardo
Eduardo Habkost Nov. 28, 2016, 5:12 p.m. UTC | #3
On Mon, Nov 28, 2016 at 02:45:24PM -0200, Marcelo Tosatti wrote:
> On Mon, Nov 28, 2016 at 12:13:22PM -0200, Eduardo Habkost wrote:
> > Sorry for not noticing the following issues on the previous
> > reviews. I was only paying attention to the vmstate and machine
> > code:
> > 
> > On Mon, Nov 21, 2016 at 08:50:04AM -0200, Marcelo Tosatti wrote:
> > > Check for KVM_CAP_ADJUST_CLOCK capability KVM_CLOCK_TSC_STABLE, which
> > > indicates that KVM_GET_CLOCK returns a value as seen by the guest at
> > > that moment.
> > > 
> > > For new machine types, use this value rather than reading 
> > > from guest memory.
> > > 
> > > This reduces kvmclock difference on migration from 5s to 0.1s
> > > (when max_downtime == 5s).
> > > 
> > > Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
> > > 
> > > ---
> > >  hw/i386/kvm/clock.c    |  107 ++++++++++++++++++++++++++++++++++++++++++-------
> > >  include/hw/i386/pc.h   |    5 ++
> > >  target-i386/kvm.c      |    7 +++
> > >  target-i386/kvm_i386.h |    1 
> > >  4 files changed, 106 insertions(+), 14 deletions(-)
> > > 
> > > v2: 
> > > - improve variable names (Juan)
> > > - consolidate code on kvm_get_clock function (Paolo)
> > > - return mach_use_reliable_get_clock from needed function (Paolo)
> > > v3: 
> > > - simplify check for src_use_reliable_get_clock (Eduardo)
> > > - move src_use_reliable_get_clock initialization to realize (Eduardo)
> > > 
> > > 
> > > Index: qemu-mig-advance-clock/hw/i386/kvm/clock.c
> > > ===================================================================
> > > --- qemu-mig-advance-clock.orig/hw/i386/kvm/clock.c	2016-11-17 15:07:11.220632761 -0200
> > > +++ qemu-mig-advance-clock/hw/i386/kvm/clock.c	2016-11-17 15:11:51.372048640 -0200
> > > @@ -36,6 +36,12 @@
> > >  
> > >      uint64_t clock;
> > >      bool clock_valid;
> > > +
> > > +    /* whether machine supports reliable KVM_GET_CLOCK */
> > > +    bool mach_use_reliable_get_clock;
> > > +
> > > +    /* whether source host supported reliable KVM_GET_CLOCK */
> > > +    bool src_use_reliable_get_clock;
> > >  } KVMClockState;
> > >  
> > >  struct pvclock_vcpu_time_info {
> > > @@ -81,6 +87,19 @@
> > >      return nsec + time.system_time;
> > >  }
> > >  
> > > +static uint64_t kvm_get_clock(void)
> > > +{
> > > +    struct kvm_clock_data data;
> > > +    int ret;
> > > +
> > > +    ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > > +    if (ret < 0) {
> > > +        fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > > +                abort();
> > > +    }
> > > +    return data.clock;
> > > +}
> > > +
> > >  static void kvmclock_vm_state_change(void *opaque, int running,
> > >                                       RunState state)
> > >  {
> > > @@ -91,15 +110,36 @@
> > 
> > Can you please use "diff -p" on your patches?
> > 
> > >  
> > >      if (running) {
> > >          struct kvm_clock_data data = {};
> > > -        uint64_t time_at_migration = kvmclock_current_nsec(s);
> > > +        uint64_t pvclock_via_mem = 0;
> > >  
> > > -        s->clock_valid = false;
> > > +        /* local (running VM) restore */
> > > +        if (s->clock_valid) {
> > > +            /*
> > > +             * if host does not support reliable KVM_GET_CLOCK,
> > > +             * read kvmclock value from memory
> > > +             */
> > > +            if (!kvm_has_adjust_clock_stable()) {
> > 
> > Why not use src_use_reliable_get_clock here, too? (Maybe rename
> > it to simply clock_is_reliable?)
> 
> Because you end up mixing two mental ideas (one: whether the source
> has KVM_GET_CLOCK, second: whether the destination has KVM_GET_CLOCK 
> into one variable). I find it more confusing than not.
> Maybe its just my limited brain but i find it
> confusing.

I find it simpler and easier to reason about.

> 
> > You can change kvm_get_clock() to kvm_get_clock(KVMClockState*),
> > and make it update both s->clock and s->clock_is_reliable. With
> > this, s->clock and s->clock_is_reliable will be always in sync,
> > and have a single code path for both local restore and migration:
> 
> There should not be a single path for migration/runtime cases: 
> on migration, we care whether the source used reliable KVM_GET_CLOCK.
> 
> On runtime, we care whether the destination uses reliable KVM_GET_CLOCK.

The conditions may be different, but both can be translated to a
"should we trust the value in s->clock" flag, to simplify the
code (just like Paolo's suggestion on his reply to v1 today).

> 
> >     /*
> >      * If host that set s->clock did not support reliable
> >      * KVM_GET_CLOCK, read kvmclock value from memory
> >      */
> >     if (!s->clock_is_reliable) {
> >         pvclock_via_mem = kvmclock_current_nsec(s);
> >     }
> > 
> > Updating the table from my previous message, the result would be:
> > 
> > --------+------------+---------------+-----------+---------+
> >         | kvm_has_adj_clock_stable() | clock_is_reliable   |
> > machine | src        | dst           | src       | dst (*) |
> > --------+------------+---------------+-----------+---------+
> > pc-2.8  | false      | false         | false     | false   |
> > pc-2.8  | false      | true          | false     | false   |
> > pc-2.8  | true       | false         | true      | true    |
> > pc-2.8  | true       | true          | true      | true    |
> > --------+------------+---------------+-----------+---------+
> > pc-2.7  | false      | (any)         | false     | false   |
> > pc-2.7  | true       | (any)         | true (**) | false   |
> > --------+------------+---------------+-----------+---------+
> > 
> > (*) More precisely: in the first kvmclock_vm_state_change() call
> >     just after migration. It will get updated the next time
> >     s->clock is set.
> > 
> > (**) On purpose, so the code I suggest above would work for local
> >      restore, too.
> > 
> > 
> > > +                pvclock_via_mem = kvmclock_current_nsec(s);
> > > +            }
> > > +        /* migration/savevm/init restore */
> > > +        } else {
> > > +            /*
> > > +             * use s->clock in case machine uses reliable
> > > +             * get clock and source host supported
> > > +             * reliable get clock
> > > +             */
> > > +            if (!s->src_use_reliable_get_clock) {
> > > +                pvclock_via_mem = kvmclock_current_nsec(s);
> > > +            }
> > > +        }
> > >  
> > > -        /* We can't rely on the migrated clock value, just discard it */
> > > -        if (time_at_migration) {
> > > -            s->clock = time_at_migration;
> > > +        /* We can't rely on the saved clock value, just discard it */
> > > +        if (pvclock_via_mem) {
> > > +            s->clock = pvclock_via_mem;
> > >          }
> > >  
> > > +        s->clock_valid = false;
> > > +
> > >          data.clock = s->clock;
> > >          ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
> > >          if (ret < 0) {
> > > @@ -120,8 +160,6 @@
> > >              }
> > >          }
> > >      } else {
> > > -        struct kvm_clock_data data;
> > > -        int ret;
> > >  
> > >          if (s->clock_valid) {
> > >              return;
> > > @@ -129,13 +167,7 @@
> > >  
> > >          kvm_synchronize_all_tsc();
> > >  
> > > -        ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
> > > -        if (ret < 0) {
> > > -            fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
> > > -            abort();
> > > -        }
> > > -        s->clock = data.clock;
> > > -
> > > +        s->clock = kvm_get_clock();
> > >          /*
> > >           * If the VM is stopped, declare the clock state valid to
> > >           * avoid re-reading it on next vmsave (which would return
> > > @@ -149,25 +181,72 @@
> > >  {
> > >      KVMClockState *s = KVM_CLOCK(dev);
> > >  
> > > +    /*
> > > +     * On machine types that support reliable KVM_GET_CLOCK,
> > > +     * if host kernel does provide reliable KVM_GET_CLOCK,
> > > +     * set src_use_reliable_get_clock=true so that destination
> > > +     * avoids reading kvmclock from memory.
> > > +     */
> > > +    if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
> > > +        s->src_use_reliable_get_clock = true;
> > > +    }
> > > +
> > >      qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
> > >  }
> > >  
> > > +static bool kvmclock_src_use_reliable_get_clock(void *opaque)
> > > +{
> > > +    KVMClockState *s = opaque;
> > > +
> > > +    return s->mach_use_reliable_get_clock;
> > > +}
> > > +
> > > +static const VMStateDescription kvmclock_reliable_get_clock = {
> > > +    .name = "kvmclock/src_use_reliable_get_clock",
> > > +    .version_id = 1,
> > > +    .minimum_version_id = 1,
> > > +    .needed = kvmclock_src_use_reliable_get_clock,
> > > +    .fields = (VMStateField[]) {
> > > +        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
> > > +        VMSTATE_END_OF_LIST()
> > > +    }
> > > +};
> > > +
> > > +static void kvmclock_pre_save(void *opaque)
> > > +{
> > > +    KVMClockState *s = opaque;
> > > +
> > > +    s->clock = kvm_get_clock();
> > 
> > I can't tell if s->clock_valid here is a bug, or a feature. In
> > either case, we seem to be contradicting the comment at
> > kvmclock_vm_state_change():
> > 
> >     /*
> >      * If the VM is stopped, declare the clock state valid to
> >      * avoid re-reading it on next vmsave (which would return
> >        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >      * a different value). Will be reset when the VM is continued.
> >      */
> >     s->clock_valid = true;
> 
> Sorry, what is the problem??

The comment says clock_valid will avoid re-reading the clock, but
kvmclock_pre_save() ignores clock_valid and will re-read it
unconditionally. Either the comment is incorrect, or
kvmclock_pre_save() is incorrect (I don't know which is the
case).
Paolo Bonzini Nov. 28, 2016, 5:31 p.m. UTC | #4
On 28/11/2016 18:12, Eduardo Habkost wrote:
>>> > > 
>>> > > Why not use src_use_reliable_get_clock here, too? (Maybe rename
>>> > > it to simply clock_is_reliable?)
>> > 
>> > Because you end up mixing two mental ideas (one: whether the source
>> > has KVM_GET_CLOCK, second: whether the destination has KVM_GET_CLOCK 
>> > into one variable). I find it more confusing than not.
>> > Maybe its just my limited brain but i find it
>> > confusing.
> I find it simpler and easier to reason about.

Me too---note that it's not "whether X had reliable KVM_GET_CLOCK" but
rather "whether the last read came from a reliable KVM_GET_CLOCK".
However...

>>> > > You can change kvm_get_clock() to kvm_get_clock(KVMClockState*),
>>> > > and make it update both s->clock and s->clock_is_reliable. With
>>> > > this, s->clock and s->clock_is_reliable will be always in sync,
>>> > > and have a single code path for both local restore and migration:
>> > 
>> > There should not be a single path for migration/runtime cases: 
>> > on migration, we care whether the source used reliable KVM_GET_CLOCK.
>> > 
>> > On runtime, we care whether the destination uses reliable KVM_GET_CLOCK.
> The conditions may be different, but both can be translated to a
> "should we trust the value in s->clock" flag, to simplify the
> code (just like Paolo's suggestion on his reply to v1 today).

... I think I understand now: Marcelo is approaching the problem
differently.  For him it's not an issue of "should we trust the value"
but rather it's "is the value going to mess up the guest with backwards
time".  Which is fine, it just requires some more comments because it's
subtle and it looks a lot like the kernel API is being misused (while
it's fine).

Paolo
diff mbox

Patch

Index: qemu-mig-advance-clock/hw/i386/kvm/clock.c
===================================================================
--- qemu-mig-advance-clock.orig/hw/i386/kvm/clock.c	2016-11-17 15:07:11.220632761 -0200
+++ qemu-mig-advance-clock/hw/i386/kvm/clock.c	2016-11-17 15:11:51.372048640 -0200
@@ -36,6 +36,12 @@ 
 
     uint64_t clock;
     bool clock_valid;
+
+    /* whether machine supports reliable KVM_GET_CLOCK */
+    bool mach_use_reliable_get_clock;
+
+    /* whether source host supported reliable KVM_GET_CLOCK */
+    bool src_use_reliable_get_clock;
 } KVMClockState;
 
 struct pvclock_vcpu_time_info {
@@ -81,6 +87,19 @@ 
     return nsec + time.system_time;
 }
 
+static uint64_t kvm_get_clock(void)
+{
+    struct kvm_clock_data data;
+    int ret;
+
+    ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
+    if (ret < 0) {
+        fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
+                abort();
+    }
+    return data.clock;
+}
+
 static void kvmclock_vm_state_change(void *opaque, int running,
                                      RunState state)
 {
@@ -91,15 +110,36 @@ 
 
     if (running) {
         struct kvm_clock_data data = {};
-        uint64_t time_at_migration = kvmclock_current_nsec(s);
+        uint64_t pvclock_via_mem = 0;
 
-        s->clock_valid = false;
+        /* local (running VM) restore */
+        if (s->clock_valid) {
+            /*
+             * if host does not support reliable KVM_GET_CLOCK,
+             * read kvmclock value from memory
+             */
+            if (!kvm_has_adjust_clock_stable()) {
+                pvclock_via_mem = kvmclock_current_nsec(s);
+            }
+        /* migration/savevm/init restore */
+        } else {
+            /*
+             * use s->clock in case machine uses reliable
+             * get clock and source host supported
+             * reliable get clock
+             */
+            if (!s->src_use_reliable_get_clock) {
+                pvclock_via_mem = kvmclock_current_nsec(s);
+            }
+        }
 
-        /* We can't rely on the migrated clock value, just discard it */
-        if (time_at_migration) {
-            s->clock = time_at_migration;
+        /* We can't rely on the saved clock value, just discard it */
+        if (pvclock_via_mem) {
+            s->clock = pvclock_via_mem;
         }
 
+        s->clock_valid = false;
+
         data.clock = s->clock;
         ret = kvm_vm_ioctl(kvm_state, KVM_SET_CLOCK, &data);
         if (ret < 0) {
@@ -120,8 +160,6 @@ 
             }
         }
     } else {
-        struct kvm_clock_data data;
-        int ret;
 
         if (s->clock_valid) {
             return;
@@ -129,13 +167,7 @@ 
 
         kvm_synchronize_all_tsc();
 
-        ret = kvm_vm_ioctl(kvm_state, KVM_GET_CLOCK, &data);
-        if (ret < 0) {
-            fprintf(stderr, "KVM_GET_CLOCK failed: %s\n", strerror(ret));
-            abort();
-        }
-        s->clock = data.clock;
-
+        s->clock = kvm_get_clock();
         /*
          * If the VM is stopped, declare the clock state valid to
          * avoid re-reading it on next vmsave (which would return
@@ -149,25 +181,72 @@ 
 {
     KVMClockState *s = KVM_CLOCK(dev);
 
+    /*
+     * On machine types that support reliable KVM_GET_CLOCK,
+     * if host kernel does provide reliable KVM_GET_CLOCK,
+     * set src_use_reliable_get_clock=true so that destination
+     * avoids reading kvmclock from memory.
+     */
+    if (s->mach_use_reliable_get_clock && kvm_has_adjust_clock_stable()) {
+        s->src_use_reliable_get_clock = true;
+    }
+
     qemu_add_vm_change_state_handler(kvmclock_vm_state_change, s);
 }
 
+static bool kvmclock_src_use_reliable_get_clock(void *opaque)
+{
+    KVMClockState *s = opaque;
+
+    return s->mach_use_reliable_get_clock;
+}
+
+static const VMStateDescription kvmclock_reliable_get_clock = {
+    .name = "kvmclock/src_use_reliable_get_clock",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = kvmclock_src_use_reliable_get_clock,
+    .fields = (VMStateField[]) {
+        VMSTATE_BOOL(src_use_reliable_get_clock, KVMClockState),
+        VMSTATE_END_OF_LIST()
+    }
+};
+
+static void kvmclock_pre_save(void *opaque)
+{
+    KVMClockState *s = opaque;
+
+    s->clock = kvm_get_clock();
+}
+
 static const VMStateDescription kvmclock_vmsd = {
     .name = "kvmclock",
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = kvmclock_pre_save,
     .fields = (VMStateField[]) {
         VMSTATE_UINT64(clock, KVMClockState),
         VMSTATE_END_OF_LIST()
+    },
+    .subsections = (const VMStateDescription * []) {
+        &kvmclock_reliable_get_clock,
+        NULL
     }
 };
 
+static Property kvmclock_properties[] = {
+    DEFINE_PROP_BOOL("x-mach-use-reliable-get-clock", KVMClockState,
+                      mach_use_reliable_get_clock, true),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
 static void kvmclock_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
 
     dc->realize = kvmclock_realize;
     dc->vmsd = &kvmclock_vmsd;
+    dc->props = kvmclock_properties;
 }
 
 static const TypeInfo kvmclock_info = {
Index: qemu-mig-advance-clock/include/hw/i386/pc.h
===================================================================
--- qemu-mig-advance-clock.orig/include/hw/i386/pc.h	2016-11-17 15:07:11.220632761 -0200
+++ qemu-mig-advance-clock/include/hw/i386/pc.h	2016-11-17 15:08:58.096791416 -0200
@@ -370,6 +370,11 @@ 
 #define PC_COMPAT_2_7 \
     HW_COMPAT_2_7 \
     {\
+        .driver   = "kvmclock",\
+        .property = "x-mach-use-reliable-get-clock",\
+        .value    = "off",\
+    },\
+    {\
         .driver   = TYPE_X86_CPU,\
         .property = "l3-cache",\
         .value    = "off",\
Index: qemu-mig-advance-clock/target-i386/kvm.c
===================================================================
--- qemu-mig-advance-clock.orig/target-i386/kvm.c	2016-11-17 15:07:11.221632762 -0200
+++ qemu-mig-advance-clock/target-i386/kvm.c	2016-11-17 15:07:15.867639659 -0200
@@ -117,6 +117,13 @@ 
     return kvm_check_extension(kvm_state, KVM_CAP_X86_SMM);
 }
 
+bool kvm_has_adjust_clock_stable(void)
+{
+    int ret = kvm_check_extension(kvm_state, KVM_CAP_ADJUST_CLOCK);
+
+    return (ret == KVM_CLOCK_TSC_STABLE);
+}
+
 bool kvm_allows_irq0_override(void)
 {
     return !kvm_irqchip_in_kernel() || kvm_has_gsi_routing();
Index: qemu-mig-advance-clock/target-i386/kvm_i386.h
===================================================================
--- qemu-mig-advance-clock.orig/target-i386/kvm_i386.h	2016-11-17 15:07:11.222632764 -0200
+++ qemu-mig-advance-clock/target-i386/kvm_i386.h	2016-11-17 15:07:15.867639659 -0200
@@ -17,6 +17,7 @@ 
 
 bool kvm_allows_irq0_override(void);
 bool kvm_has_smm(void);
+bool kvm_has_adjust_clock_stable(void);
 void kvm_synchronize_all_tsc(void);
 void kvm_arch_reset_vcpu(X86CPU *cs);
 void kvm_arch_do_init_vcpu(X86CPU *cs);