
[v2,03/20] vfio/migration: Add VFIO migration pre-copy support

Message ID 20230222174915.5647-4-avihaih@nvidia.com
State New
Series vfio: Add migration pre-copy support and device dirty tracking

Commit Message

Avihai Horon Feb. 22, 2023, 5:48 p.m. UTC
Pre-copy support allows the VFIO device data to be transferred while the
VM is running. This helps to accommodate VFIO devices that have a large
amount of data that needs to be transferred, and it can reduce migration
downtime.

Pre-copy support is optional in VFIO migration protocol v2.
Implement pre-copy of VFIO migration protocol v2 and use it for devices
that support it. Full description of it can be found here [1].

[1]
https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst |  35 +++++--
 include/hw/vfio/vfio-common.h |   3 +
 hw/vfio/common.c              |   6 +-
 hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
 hw/vfio/trace-events          |   4 +-
 5 files changed, 201 insertions(+), 22 deletions(-)

Comments

Alex Williamson Feb. 22, 2023, 8:58 p.m. UTC | #1
On Wed, 22 Feb 2023 19:48:58 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> Pre-copy support allows the VFIO device data to be transferred while the
> VM is running. This helps to accommodate VFIO devices that have a large
> amount of data that needs to be transferred, and it can reduce migration
> downtime.
> 
> Pre-copy support is optional in VFIO migration protocol v2.
> Implement pre-copy of VFIO migration protocol v2 and use it for devices
> that support it. Full description of it can be found here [1].
> 
> [1]
> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>  docs/devel/vfio-migration.rst |  35 +++++--
>  include/hw/vfio/vfio-common.h |   3 +
>  hw/vfio/common.c              |   6 +-
>  hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
>  hw/vfio/trace-events          |   4 +-
>  5 files changed, 201 insertions(+), 22 deletions(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index c214c73e28..ba80b9150d 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>  destination host. This document details how saving and restoring of VFIO
>  devices is done in QEMU.
>  
> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
> -data is transferred to the destination.
> -
> -The pre-copy phase of migration is currently not supported for VFIO devices.
> -Support for VFIO pre-copy will be added later on.
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
> +accommodate VFIO devices that have a large amount of data that needs to be
> +transferred. The iterative pre-copy phase of migration allows for the guest to
> +continue whilst the VFIO device state is transferred to the destination, this
> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.

Or alternatively for the last sentence,

  VFIO devices opt-in to pre-copy support by reporting the
  VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION
  ioctl.


>  Note that currently VFIO migration is supported only for a single device. This
>  is due to VFIO migration's lack of P2P support. However, P2P support is planned
> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>  * A ``load_setup`` function that sets the VFIO device on the destination in
>    _RESUMING state.
>  
> +* A ``state_pending_estimate`` function that reports an estimate of the
> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
> +  device.
> +
>  * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>    driver, which indicates the amount of data that the vendor driver has yet to
>    save for the VFIO device.
>  
> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
> +  active only when the VFIO device is in pre-copy states.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver during iterative pre-copy phase.
> +
>  * A ``save_state`` function to save the device config space if it is present.
>  
>  * A ``save_live_complete_precopy`` function that sets the VFIO device in
> @@ -95,8 +107,10 @@ Flow of state changes during Live migration
>  ===========================================
>  
>  Below is the flow of state change during live migration.
> -The values in the brackets represent the VM state, the migration state, and
> +The values in the parentheses represent the VM state, the migration state, and
>  the VFIO device state, respectively.
> +The text in the square brackets represents the flow if the VFIO device supports
> +pre-copy.
>  
>  Live migration save path
>  ------------------------
> @@ -108,11 +122,12 @@ Live migration save path
>                                    |
>                       migrate_init spawns migration_thread
>                  Migration thread then calls each device's .save_setup()
> -                       (RUNNING, _SETUP, _RUNNING)
> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>                                    |
> -                      (RUNNING, _ACTIVE, _RUNNING)
> -             If device is active, get pending_bytes by .state_pending_exact()
> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>            If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  [Data of VFIO device for pre-copy phase is copied]
>          Iterate till total pending bytes converge and are less than threshold
>                                    |
>    On migration completion, vCPU stops and calls .save_live_complete_precopy for
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index 87524c64a4..ee55d442b4 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>      int data_fd;
>      void *data_buffer;
>      size_t data_buffer_size;
> +    uint64_t precopy_init_size;
> +    uint64_t precopy_dirty_size;

size_t?

> +    uint64_t mig_flags;
>  } VFIOMigration;
>  
>  typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index bab83c0e55..6f5afe9f5a 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>              }
>  
>              if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>                  return false;
>              }
>          }
> @@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>                  return false;
>              }
>  
> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>                  continue;
>              } else {
>                  return false;
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index 94a4df73d0..307983d57d 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>          return "STOP_COPY";
>      case VFIO_DEVICE_STATE_RESUMING:
>          return "RESUMING";
> +    case VFIO_DEVICE_STATE_PRE_COPY:
> +        return "PRE_COPY";
>      default:
>          return "UNKNOWN STATE";
>      }
> @@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>      return 0;
>  }
>  
> +static int vfio_query_precopy_size(VFIOMigration *migration,
> +                                   uint64_t *init_size, uint64_t *dirty_size)

size_t?  Seems like a concern throughout.
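
For concreteness, the suggested change would look something like this
(a sketch only -- the in-memory bookkeeping moves to size_t, matching
data_buffer_size, with values assigned from the uAPI's 64-bit fields at
the ioctl boundary):

typedef struct VFIOMigration {
    /* ... existing fields ... */
    size_t data_buffer_size;
    size_t precopy_init_size;
    size_t precopy_dirty_size;
    uint64_t mig_flags;
} VFIOMigration;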

> +{
> +    struct vfio_precopy_info precopy = {
> +        .argsz = sizeof(precopy),
> +    };
> +
> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
> +        return -errno;
> +    }
> +
> +    *init_size = precopy.initial_bytes;
> +    *dirty_size = precopy.dirty_bytes;
> +
> +    return 0;
> +}
> +
>  /* Returns the size of saved data on success and -errno on error */
>  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>  {
> @@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>      data_size = read(migration->data_fd, migration->data_buffer,
>                       migration->data_buffer_size);
>      if (data_size < 0) {
> +        /* Pre-copy emptied all the device state for now */
> +        if (errno == ENOMSG) {
> +            return 0;
> +        }
> +
>          return -errno;
>      }
>      if (data_size == 0) {
> @@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>      return qemu_file_get_error(f) ?: data_size;
>  }
>  
> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
> +                                               uint64_t data_size)
> +{
> +    if (!data_size) {
> +        /*
> +         * Pre-copy emptied all the device state for now, update estimated sizes
> +         * accordingly.
> +         */
> +        migration->precopy_init_size = 0;
> +        migration->precopy_dirty_size = 0;
> +
> +        return;
> +    }
> +
> +    if (migration->precopy_init_size) {
> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
> +
> +        migration->precopy_init_size -= init_size;
> +        data_size -= init_size;
> +    }
> +
> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
> +                                         data_size);
> +}
> +
>  /* ---------------------------------------------------------------------- */
>  
>  static int vfio_save_setup(QEMUFile *f, void *opaque)
> @@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>          return -ENOMEM;
>      }
>  
> +    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
> +        uint64_t init_size = 0, dirty_size = 0;
> +        int ret;
> +
> +        switch (migration->device_state) {
> +        case VFIO_DEVICE_STATE_RUNNING:
> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> +                                           VFIO_DEVICE_STATE_RUNNING);
> +            if (ret) {
> +                return ret;
> +            }
> +
> +            vfio_query_precopy_size(migration, &init_size, &dirty_size);
> +            migration->precopy_init_size = init_size;
> +            migration->precopy_dirty_size = dirty_size;

Seems like we could do away with {init,dirty}_size, initialize
migration->precopy_{init,dirty}_size before the switch, pass them
directly to vfio_query_precopy_size() and remove all but the break from
the case below.  But then that also suggests we could redefine
vfio_query_precopy_size() to

static int vfio_update_precopy_info(VFIOMigration *migration)

which sets the fields directly since this is the only way it's used.
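
A minimal sketch of that refactor, reusing the ioctl and fields from
the patch above (illustrative; the author may structure the final
version differently):

static int vfio_update_precopy_info(VFIOMigration *migration)
{
    struct vfio_precopy_info precopy = {
        .argsz = sizeof(precopy),
    };

    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
        return -errno;
    }

    /* Fill the migration fields directly instead of using out-parameters */
    migration->precopy_init_size = precopy.initial_bytes;
    migration->precopy_dirty_size = precopy.dirty_bytes;

    return 0;
}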

> +
> +            break;
> +        case VFIO_DEVICE_STATE_STOP:
> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
> +
> +            migration->precopy_init_size = 0;
> +            migration->precopy_dirty_size = 0;
> +
> +            break;
> +        default:
> +            return -EINVAL;
> +        }
> +    }
> +
>      trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>  
>      qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>      trace_vfio_save_cleanup(vbasedev->name);
>  }
>  
> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> +                                        uint64_t *must_precopy,
> +                                        uint64_t *can_postcopy)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> +        return;
> +    }
> +
> +    /*
> +     * Initial size should be transferred during pre-copy phase so stop-copy
> +     * phase will not be slowed down. Report threshold_size to force another
> +     * pre-copy iteration.
> +     */
> +    *must_precopy += migration->precopy_init_size ?
> +                         threshold_size :
> +                         migration->precopy_dirty_size;

This sure feels like we're feeding false data back to the iterator to
spoof it to run another iteration, when the vfio migration protocol
only recommends that initial_bytes reaches zero before proceeding to
stop-copy, it's not a requirement.  What benefit is actually observed
from this?  Why is this required for initial pre-copy support?  It
seems devious.
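
To make the mechanics concrete, a standalone toy program (made-up
sizes, illustration only) showing why adding threshold_size keeps the
iterator from converging while initial bytes remain:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t threshold_size = 3ULL << 30;  /* e.g. 3 GiB */
    uint64_t precopy_init_size = 4096;     /* any nonzero value */
    uint64_t precopy_dirty_size = 8192;

    /* The expression from the patch: report threshold_size itself while
     * initial bytes remain, so total pending can never drop below it. */
    uint64_t must_precopy = precopy_init_size ? threshold_size
                                              : precopy_dirty_size;

    printf("pending %llu vs threshold %llu -> %s\n",
           (unsigned long long)must_precopy,
           (unsigned long long)threshold_size,
           must_precopy >= threshold_size ? "iterate again" : "may converge");
    return 0;
}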

> +
> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
> +                                      *can_postcopy,
> +                                      migration->precopy_init_size,
> +                                      migration->precopy_dirty_size);
> +}
> +
>  /*
>   * Migration size of VFIO devices can be as little as a few KBs or as big as
>   * many GBs. This value should be big enough to cover the worst case.
>   */
>  #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>  
> -/*
> - * Only exact function is implemented and not estimate function. The reason is
> - * that during pre-copy phase of migration the estimate function is called
> - * repeatedly while pending RAM size is over the threshold, thus migration
> - * can't converge and querying the VFIO device pending data size is useless.
> - */
>  static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>                                       uint64_t *must_precopy,
>                                       uint64_t *can_postcopy)
>  {
>      VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>      uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>  
>      /*
> @@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>      vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>      *must_precopy += stop_copy_size;
>  
> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
> +        uint64_t init_size = 0, dirty_size = 0;
> +
> +        vfio_query_precopy_size(migration, &init_size, &dirty_size);
> +        migration->precopy_init_size = init_size;
> +        migration->precopy_dirty_size = dirty_size;

This is the only other caller of vfio_query_precopy_size(), following
the same pattern that could be simplified if the function filled the
migration fields itself.

> +
> +        /*
> +         * Initial size should be transferred during pre-copy phase so
> +         * stop-copy phase will not be slowed down. Report threshold_size
> +         * to force another pre-copy iteration.
> +         */
> +        *must_precopy += migration->precopy_init_size ?
> +                             threshold_size :
> +                             migration->precopy_dirty_size;
> +    }

Just as sketchy as above.  Thanks,

Alex

> +
>      trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
> -                                   stop_copy_size);
> +                                   stop_copy_size, migration->precopy_init_size,
> +                                   migration->precopy_dirty_size);
> +}
> +
> +static bool vfio_is_active_iterate(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
> +}
> +
> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +    ssize_t data_size;
> +
> +    data_size = vfio_save_block(f, migration);
> +    if (data_size < 0) {
> +        return data_size;
> +    }
> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +
> +    vfio_update_estimated_pending_data(migration, data_size);
> +
> +    trace_vfio_save_iterate(vbasedev->name);
> +
> +    /*
> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
> +     * Return 1 so following handlers will not be potentially blocked.
> +     */
> +    return 1;
>  }
>  
>  static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
> @@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>      ssize_t data_size;
>      int ret;
>  
> -    /* We reach here with device state STOP only */
> +    /* We reach here with device state STOP or STOP_COPY only */
>      ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>                                     VFIO_DEVICE_STATE_STOP);
>      if (ret) {
> @@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>  static const SaveVMHandlers savevm_vfio_handlers = {
>      .save_setup = vfio_save_setup,
>      .save_cleanup = vfio_save_cleanup,
> +    .state_pending_estimate = vfio_state_pending_estimate,
>      .state_pending_exact = vfio_state_pending_exact,
> +    .is_active_iterate = vfio_is_active_iterate,
> +    .save_live_iterate = vfio_save_iterate,
>      .save_live_complete_precopy = vfio_save_complete_precopy,
>      .save_state = vfio_save_state,
>      .load_setup = vfio_load_setup,
> @@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>  static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>  {
>      VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
>      enum vfio_device_mig_state new_state;
>      int ret;
>  
>      if (running) {
>          new_state = VFIO_DEVICE_STATE_RUNNING;
>      } else {
> -        new_state = VFIO_DEVICE_STATE_STOP;
> +        new_state =
> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
> +                VFIO_DEVICE_STATE_STOP_COPY :
> +                VFIO_DEVICE_STATE_STOP;
>      }
>  
>      /*
> @@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>      migration->vbasedev = vbasedev;
>      migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>      migration->data_fd = -1;
> +    migration->mig_flags = mig_flags;
>  
>      oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
>      if (oid) {
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index 669d9fe07c..51613e02e6 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>  vfio_save_cleanup(const char *name) " (%s)"
>  vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>  vfio_save_device_config_state(const char *name) " (%s)"
> +vfio_save_iterate(const char *name) " (%s)"
>  vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>  vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
Avihai Horon Feb. 23, 2023, 3:25 p.m. UTC | #2
On 22/02/2023 22:58, Alex Williamson wrote:
>
> On Wed, 22 Feb 2023 19:48:58 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> Pre-copy support allows the VFIO device data to be transferred while the
>> VM is running. This helps to accommodate VFIO devices that have a large
>> amount of data that needs to be transferred, and it can reduce migration
>> downtime.
>>
>> Pre-copy support is optional in VFIO migration protocol v2.
>> Implement pre-copy of VFIO migration protocol v2 and use it for devices
>> that support it. Full description of it can be found here [1].
>>
>> [1]
>> https://lore.kernel.org/kvm/20221206083438.37807-3-yishaih@nvidia.com/
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst |  35 +++++--
>>   include/hw/vfio/vfio-common.h |   3 +
>>   hw/vfio/common.c              |   6 +-
>>   hw/vfio/migration.c           | 175 ++++++++++++++++++++++++++++++++--
>>   hw/vfio/trace-events          |   4 +-
>>   5 files changed, 201 insertions(+), 22 deletions(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> index c214c73e28..ba80b9150d 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the
>>   destination host. This document details how saving and restoring of VFIO
>>   devices is done in QEMU.
>>
>> -Migration of VFIO devices currently consists of a single stop-and-copy phase.
>> -During the stop-and-copy phase the guest is stopped and the entire VFIO device
>> -data is transferred to the destination.
>> -
>> -The pre-copy phase of migration is currently not supported for VFIO devices.
>> -Support for VFIO pre-copy will be added later on.
>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>> +accommodate VFIO devices that have a large amount of data that needs to be
>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>> +continue whilst the VFIO device state is transferred to the destination, this
>> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
>> +the pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
>> +flag in VFIO_DEVICE_FEATURE_MIGRATION ioctl.
> Or alternatively for the last sentence,
>
>    VFIO devices opt-in to pre-copy support by reporting the
>    VFIO_MIGRATION_PRE_COPY flag in the VFIO_DEVICE_FEATURE_MIGRATION
>    ioctl.

Sounds good, I will change it.

>
>>   Note that currently VFIO migration is supported only for a single device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows:
>>   * A ``load_setup`` function that sets the VFIO device on the destination in
>>     _RESUMING state.
>>
>> +* A ``state_pending_estimate`` function that reports an estimate of the
>> +  remaining pre-copy data that the vendor driver has yet to save for the VFIO
>> +  device.
>> +
>>   * A ``state_pending_exact`` function that reads pending_bytes from the vendor
>>     driver, which indicates the amount of data that the vendor driver has yet to
>>     save for the VFIO device.
>>
>> +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
>> +  active only when the VFIO device is in pre-copy states.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver during iterative pre-copy phase.
>> +
>>   * A ``save_state`` function to save the device config space if it is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
>> @@ -95,8 +107,10 @@ Flow of state changes during Live migration
>>   ===========================================
>>
>>   Below is the flow of state change during live migration.
>> -The values in the brackets represent the VM state, the migration state, and
>> +The values in the parentheses represent the VM state, the migration state, and
>>   the VFIO device state, respectively.
>> +The text in the square brackets represents the flow if the VFIO device supports
>> +pre-copy.
>>
>>   Live migration save path
>>   ------------------------
>> @@ -108,11 +122,12 @@ Live migration save path
>>                                     |
>>                        migrate_init spawns migration_thread
>>                   Migration thread then calls each device's .save_setup()
>> -                       (RUNNING, _SETUP, _RUNNING)
>> +                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
>>                                     |
>> -                      (RUNNING, _ACTIVE, _RUNNING)
>> -             If device is active, get pending_bytes by .state_pending_exact()
>> +                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
>> +      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
>>             If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  [Data of VFIO device for pre-copy phase is copied]
>>           Iterate till total pending bytes converge and are less than threshold
>>                                     |
>>     On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
>> index 87524c64a4..ee55d442b4 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -66,6 +66,9 @@ typedef struct VFIOMigration {
>>       int data_fd;
>>       void *data_buffer;
>>       size_t data_buffer_size;
>> +    uint64_t precopy_init_size;
>> +    uint64_t precopy_dirty_size;
> size_t?
>
>> +    uint64_t mig_flags;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
>> index bab83c0e55..6f5afe9f5a 100644
>> --- a/hw/vfio/common.c
>> +++ b/hw/vfio/common.c
>> @@ -409,7 +409,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
>>               }
>>
>>               if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
>> -                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
>>                   return false;
>>               }
>>           }
>> @@ -438,7 +439,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
>>                   return false;
>>               }
>>
>> -            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
>> +            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
>> +                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>>                   continue;
>>               } else {
>>                   return false;
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index 94a4df73d0..307983d57d 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -67,6 +67,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state)
>>           return "STOP_COPY";
>>       case VFIO_DEVICE_STATE_RESUMING:
>>           return "RESUMING";
>> +    case VFIO_DEVICE_STATE_PRE_COPY:
>> +        return "PRE_COPY";
>>       default:
>>           return "UNKNOWN STATE";
>>       }
>> @@ -240,6 +242,23 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>>       return 0;
>>   }
>>
>> +static int vfio_query_precopy_size(VFIOMigration *migration,
>> +                                   uint64_t *init_size, uint64_t *dirty_size)
> size_t?  Seems like a concern throughout.

Yes, I will change it in all places.

>> +{
>> +    struct vfio_precopy_info precopy = {
>> +        .argsz = sizeof(precopy),
>> +    };
>> +
>> +    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
>> +        return -errno;
>> +    }
>> +
>> +    *init_size = precopy.initial_bytes;
>> +    *dirty_size = precopy.dirty_bytes;
>> +
>> +    return 0;
>> +}
>> +
>>   /* Returns the size of saved data on success and -errno on error */
>>   static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>   {
>> @@ -248,6 +267,11 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       data_size = read(migration->data_fd, migration->data_buffer,
>>                        migration->data_buffer_size);
>>       if (data_size < 0) {
>> +        /* Pre-copy emptied all the device state for now */
>> +        if (errno == ENOMSG) {
>> +            return 0;
>> +        }
>> +
>>           return -errno;
>>       }
>>       if (data_size == 0) {
>> @@ -264,6 +288,31 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
>>       return qemu_file_get_error(f) ?: data_size;
>>   }
>>
>> +static void vfio_update_estimated_pending_data(VFIOMigration *migration,
>> +                                               uint64_t data_size)
>> +{
>> +    if (!data_size) {
>> +        /*
>> +         * Pre-copy emptied all the device state for now, update estimated sizes
>> +         * accordingly.
>> +         */
>> +        migration->precopy_init_size = 0;
>> +        migration->precopy_dirty_size = 0;
>> +
>> +        return;
>> +    }
>> +
>> +    if (migration->precopy_init_size) {
>> +        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
>> +
>> +        migration->precopy_init_size -= init_size;
>> +        data_size -= init_size;
>> +    }
>> +
>> +    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
>> +                                         data_size);
>> +}
>> +
>>   /* ---------------------------------------------------------------------- */
>>
>>   static int vfio_save_setup(QEMUFile *f, void *opaque)
>> @@ -284,6 +333,35 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>>           return -ENOMEM;
>>       }
>>
>> +    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
>> +        uint64_t init_size = 0, dirty_size = 0;
>> +        int ret;
>> +
>> +        switch (migration->device_state) {
>> +        case VFIO_DEVICE_STATE_RUNNING:
>> +            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
>> +                                           VFIO_DEVICE_STATE_RUNNING);
>> +            if (ret) {
>> +                return ret;
>> +            }
>> +
>> +            vfio_query_precopy_size(migration, &init_size, &dirty_size);
>> +            migration->precopy_init_size = init_size;
>> +            migration->precopy_dirty_size = dirty_size;
> Seems like we could do away with {init,dirty}_size, initialize
> migration->precopy_{init,dirty}_size before the switch, pass them
> directly to vfio_query_precopy_size() and remove all but the break from
> the case below.  But then that also suggests we could redefine
> vfio_query_precopy_size() to
>
> static int vfio_update_precopy_info(VFIOMigration *migration)
>
> which sets the fields directly since this is the only way it's used.

You are right, I will change it.

>> +
>> +            break;
>> +        case VFIO_DEVICE_STATE_STOP:
>> +            /* vfio_save_complete_precopy() will go to STOP_COPY */
>> +
>> +            migration->precopy_init_size = 0;
>> +            migration->precopy_dirty_size = 0;
>> +
>> +            break;
>> +        default:
>> +            return -EINVAL;
>> +        }
>> +    }
>> +
>>       trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
>>
>>       qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>>
>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
>> +                                        uint64_t *must_precopy,
>> +                                        uint64_t *can_postcopy)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Initial size should be transferred during pre-copy phase so stop-copy
>> +     * phase will not be slowed down. Report threshold_size to force another
>> +     * pre-copy iteration.
>> +     */
>> +    *must_precopy += migration->precopy_init_size ?
>> +                         threshold_size :
>> +                         migration->precopy_dirty_size;
> This sure feels like we're feeding false data back to the iterator to
> spoof it to run another iteration, when the vfio migration protocol
> only recommends that initial_bytes reaches zero before proceeding to
> stop-copy, it's not a requirement.  What benefit is actually observed
> from this?  Why is this required for initial pre-copy support?  It
> seems devious.

As previously discussed in the thread that added the pre-copy uAPI [1], 
the init_bytes can be used by drivers to reduce the downtime.
For example, mlx5 transfers some metadata to the target so it will be 
able to pre-allocate resources etc.

[1] 
https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/

Thanks!

>> +
>> +    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
>> +                                      *can_postcopy,
>> +                                      migration->precopy_init_size,
>> +                                      migration->precopy_dirty_size);
>> +}
>> +
>>   /*
>>    * Migration size of VFIO devices can be as little as a few KBs or as big as
>>    * many GBs. This value should be big enough to cover the worst case.
>>    */
>>   #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
>>
>> -/*
>> - * Only exact function is implemented and not estimate function. The reason is
>> - * that during pre-copy phase of migration the estimate function is called
>> - * repeatedly while pending RAM size is over the threshold, thus migration
>> - * can't converge and querying the VFIO device pending data size is useless.
>> - */
>>   static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>>                                        uint64_t *must_precopy,
>>                                        uint64_t *can_postcopy)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
>>
>>       /*
>> @@ -328,8 +427,57 @@ static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
>>       vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
>>       *must_precopy += stop_copy_size;
>>
>> +    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
>> +        uint64_t init_size = 0, dirty_size = 0;
>> +
>> +        vfio_query_precopy_size(migration, &init_size, &dirty_size);
>> +        migration->precopy_init_size = init_size;
>> +        migration->precopy_dirty_size = dirty_size;
> This is the only other caller of vfio_query_precopy_size(), following
> the same pattern that could be simplified if the function filled the
> migration fields itself.
>
>> +
>> +        /*
>> +         * Initial size should be transferred during pre-copy phase so
>> +         * stop-copy phase will not be slowed down. Report threshold_size
>> +         * to force another pre-copy iteration.
>> +         */
>> +        *must_precopy += migration->precopy_init_size ?
>> +                             threshold_size :
>> +                             migration->precopy_dirty_size;
>> +    }
> Just as sketchy as above.  Thanks,
>
> Alex
>
>> +
>>       trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
>> -                                   stop_copy_size);
>> +                                   stop_copy_size, migration->precopy_init_size,
>> +                                   migration->precopy_dirty_size);
>> +}
>> +
>> +static bool vfio_is_active_iterate(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
>> +}
>> +
>> +static int vfio_save_iterate(QEMUFile *f, void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +    ssize_t data_size;
>> +
>> +    data_size = vfio_save_block(f, migration);
>> +    if (data_size < 0) {
>> +        return data_size;
>> +    }
>> +    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +
>> +    vfio_update_estimated_pending_data(migration, data_size);
>> +
>> +    trace_vfio_save_iterate(vbasedev->name);
>> +
>> +    /*
>> +     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
>> +     * Return 1 so following handlers will not be potentially blocked.
>> +     */
>> +    return 1;
>>   }
>>
>>   static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>> @@ -338,7 +486,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
>>       ssize_t data_size;
>>       int ret;
>>
>> -    /* We reach here with device state STOP only */
>> +    /* We reach here with device state STOP or STOP_COPY only */
>>       ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
>>                                      VFIO_DEVICE_STATE_STOP);
>>       if (ret) {
>> @@ -457,7 +605,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> +    .state_pending_estimate = vfio_state_pending_estimate,
>>       .state_pending_exact = vfio_state_pending_exact,
>> +    .is_active_iterate = vfio_is_active_iterate,
>> +    .save_live_iterate = vfio_save_iterate,
>>       .save_live_complete_precopy = vfio_save_complete_precopy,
>>       .save_state = vfio_save_state,
>>       .load_setup = vfio_load_setup,
>> @@ -470,13 +621,18 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>   static void vfio_vmstate_change(void *opaque, bool running, RunState state)
>>   {
>>       VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>>       enum vfio_device_mig_state new_state;
>>       int ret;
>>
>>       if (running) {
>>           new_state = VFIO_DEVICE_STATE_RUNNING;
>>       } else {
>> -        new_state = VFIO_DEVICE_STATE_STOP;
>> +        new_state =
>> +            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
>> +             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
>> +                VFIO_DEVICE_STATE_STOP_COPY :
>> +                VFIO_DEVICE_STATE_STOP;
>>       }
>>
>>       /*
>> @@ -590,6 +746,7 @@ static int vfio_migration_init(VFIODevice *vbasedev)
>>       migration->vbasedev = vbasedev;
>>       migration->device_state = VFIO_DEVICE_STATE_RUNNING;
>>       migration->data_fd = -1;
>> +    migration->mig_flags = mig_flags;
>>
>>       oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
>>       if (oid) {
>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
>> index 669d9fe07c..51613e02e6 100644
>> --- a/hw/vfio/trace-events
>> +++ b/hw/vfio/trace-events
>> @@ -161,6 +161,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
>>   vfio_save_cleanup(const char *name) " (%s)"
>>   vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
>>   vfio_save_device_config_state(const char *name) " (%s)"
>> +vfio_save_iterate(const char *name) " (%s)"
>>   vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
>> -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
>> +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>> +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
>>   vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
Alex Williamson Feb. 23, 2023, 9:16 p.m. UTC | #3
On Thu, 23 Feb 2023 17:25:12 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 22/02/2023 22:58, Alex Williamson wrote:
> >
> > On Wed, 22 Feb 2023 19:48:58 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
> >>       trace_vfio_save_cleanup(vbasedev->name);
> >>   }
> >>
> >> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> >> +                                        uint64_t *must_precopy,
> >> +                                        uint64_t *can_postcopy)
> >> +{
> >> +    VFIODevice *vbasedev = opaque;
> >> +    VFIOMigration *migration = vbasedev->migration;
> >> +
> >> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Initial size should be transferred during pre-copy phase so stop-copy
> >> +     * phase will not be slowed down. Report threshold_size to force another
> >> +     * pre-copy iteration.
> >> +     */
> >> +    *must_precopy += migration->precopy_init_size ?
> >> +                         threshold_size :
> >> +                         migration->precopy_dirty_size;  
> > This sure feels like we're feeding false data back to the iterator to
> > spoof it to run another iteration, when the vfio migration protocol
> > only recommends that initial_bytes reaches zero before proceeding to
> > stop-copy, it's not a requirement.  What benefit is actually observed
> > from this?  Why is this required for initial pre-copy support?  It
> > seems devious.  
> 
> As previously discussed in the thread that added the pre-copy uAPI [1], 
> the init_bytes can be used by drivers to reduce the downtime.
> For example, mlx5 transfers some metadata to the target so it will be 
> able to pre-allocate resources etc.
> 
> [1] 
> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/

Yes, but how does that become a requirement to QEMU that it must
iterate until the initial segment is complete?  Especially when we need
to trigger that behavior via such nefarious means.  AIUI, QEMU should
be allowed to move to stop-copy at any point.  We should make efforts
that QEMU would never decide on its own to move from pre-copy to
stop-copy without completing the init_bytes (which sounds suspiciously
like the purpose of @must_precopy), but if, for instance a user forces a
transition to stop-copy, I don't see that we have any business to
impose a policy to delay that until the init_bytes is complete.  Thanks,

Alex
Avihai Horon Feb. 26, 2023, 4:43 p.m. UTC | #4
On 23/02/2023 23:16, Alex Williamson wrote:
>
> On Thu, 23 Feb 2023 17:25:12 +0200
> Avihai Horon <avihaih@nvidia.com> wrote:
>
>> On 22/02/2023 22:58, Alex Williamson wrote:
>>>
>>> On Wed, 22 Feb 2023 19:48:58 +0200
>>> Avihai Horon <avihaih@nvidia.com> wrote:
>>>
>>>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
>>>>        trace_vfio_save_cleanup(vbasedev->name);
>>>>    }
>>>>
>>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
>>>> +                                        uint64_t *must_precopy,
>>>> +                                        uint64_t *can_postcopy)
>>>> +{
>>>> +    VFIODevice *vbasedev = opaque;
>>>> +    VFIOMigration *migration = vbasedev->migration;
>>>> +
>>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Initial size should be transferred during pre-copy phase so stop-copy
>>>> +     * phase will not be slowed down. Report threshold_size to force another
>>>> +     * pre-copy iteration.
>>>> +     */
>>>> +    *must_precopy += migration->precopy_init_size ?
>>>> +                         threshold_size :
>>>> +                         migration->precopy_dirty_size;
>>> This sure feels like we're feeding false data back to the iterator to
>>> spoof it to run another iteration, when the vfio migration protocol
>>> only recommends that initial_bytes reaches zero before proceeding to
>>> stop-copy, it's not a requirement.  What benefit is actually observed
>>> from this?  Why is this required for initial pre-copy support?  It
>>> seems devious.
>> As previously discussed in the thread that added the pre-copy uAPI [1],
>> the init_bytes can be used by drivers to reduce the downtime.
>> For example, mlx5 transfers some metadata to the target so it will be
>> able to pre-allocate resources etc.
>>
>> [1]
>> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/
> Yes, but how does that become a requirement to QEMU that it must
> iterate until the initial segment is complete?  Especially when we need
> to trigger that behavior via such nefarious means.  AIUI, QEMU should
> be allowed to move to stop-copy at any point.  We should make efforts
> that QEMU would never decide on its own to move from pre-copy to
> stop-copy without completing the init_bytes (which sounds suspiciously
> like the purpose of @must_precopy),

@must_precopy represents the pending bytes that must be transferred 
during pre-copy or stop-copy. If it's under the threshold, then 
migration will move to stop-copy and be completed.
So simply adding init_bytes to @must_precopy will not guarantee that we 
send all init_bytes before moving to stop-copy, since the transition to 
stop-copy can happen when @must_precopy != 0.
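
(A simplified paraphrase of the decision in the migration core -- names
and structure condensed, not the exact QEMU code:)

static bool migration_should_complete(uint64_t must_precopy,
                                      uint64_t can_postcopy,
                                      uint64_t threshold_size)
{
    uint64_t pending = must_precopy + can_postcopy;

    /* The switch to stop-copy is driven purely by size vs threshold;
     * there is no separate rule that init_bytes must reach zero, so a
     * nonzero must_precopy under the threshold still completes. */
    return pending < threshold_size;
}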

>   but if, for instance a user forces a
> transition to stop-copy, I don't see that we have any business to
> impose a policy to delay that until the init_bytes is complete.

Is there a way a user can force the migration to move to stop-copy?
Looking at migration code, it seems that the only way to move to 
stop-copy is if @must_precopy is below the threshold.
If so, then this is our effort to make QEMU send all init_bytes before 
moving to stop_copy and we can only benefit from it.

Regarding how to do it -- maybe instead of spoofing @must_precopy we can
introduce a new parameter in the upper migration layer (e.g., @init_precopy)
and add another condition in the migration layer that it must be zero
before moving to stop-copy.

Thanks.
Alex Williamson Feb. 27, 2023, 4:14 p.m. UTC | #5
On Sun, 26 Feb 2023 18:43:50 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 23/02/2023 23:16, Alex Williamson wrote:
> >
> > On Thu, 23 Feb 2023 17:25:12 +0200
> > Avihai Horon <avihaih@nvidia.com> wrote:
> >  
> >> On 22/02/2023 22:58, Alex Williamson wrote:  
> >>>
> >>> On Wed, 22 Feb 2023 19:48:58 +0200
> >>> Avihai Horon <avihaih@nvidia.com> wrote:
> >>>  
> >>>> @@ -302,23 +380,44 @@ static void vfio_save_cleanup(void *opaque)
> >>>>        trace_vfio_save_cleanup(vbasedev->name);
> >>>>    }
> >>>>
> >>>> +static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
> >>>> +                                        uint64_t *must_precopy,
> >>>> +                                        uint64_t *can_postcopy)
> >>>> +{
> >>>> +    VFIODevice *vbasedev = opaque;
> >>>> +    VFIOMigration *migration = vbasedev->migration;
> >>>> +
> >>>> +    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    /*
> >>>> +     * Initial size should be transferred during pre-copy phase so stop-copy
> >>>> +     * phase will not be slowed down. Report threshold_size to force another
> >>>> +     * pre-copy iteration.
> >>>> +     */
> >>>> +    *must_precopy += migration->precopy_init_size ?
> >>>> +                         threshold_size :
> >>>> +                         migration->precopy_dirty_size;  
> >>> This sure feels like we're feeding false data back to the iterator to
> >>> spoof it to run another iteration, when the vfio migration protocol
> >>> only recommends that initial_bytes reaches zero before proceeding to
> >>> stop-copy, it's not a requirement.  What benefit is actually observed
> >>> from this?  Why is this required for initial pre-copy support?  It
> >>> seems devious.  
> >> As previously discussed in the thread that added the pre-copy uAPI [1],
> >> the init_bytes can be used by drivers to reduce the downtime.
> >> For example, mlx5 transfers some metadata to the target so it will be
> >> able to pre-allocate resources etc.
> >>
> >> [1]
> >> https://lore.kernel.org/kvm/ae4a6259-349d-0131-896c-7a6ea775cc9e@nvidia.com/  
> > Yes, but how does that become a requirement to QEMU that it must
> > iterate until the initial segment is complete?  Especially when we need
> > to trigger that behavior via such nefarious means.  AIUI, QEMU should
> > be allowed to move to stop-copy at any point.  We should make efforts
> > that QEMU would never decide on its own to move from pre-copy to
> > stop-copy without completing the init_bytes (which sounds suspiciously
> > like the purpose of @must_precopy),  
> 
> @must_precopy represents the pending bytes that must be transferred 
> during pre-copy or stop-copy. If it's under the threshold, then 
> migration will move to stop-copy and be completed.
> So simply adding init_bytes to @must_precopy will not guarantee that we 
> send all init_bytes before moving to stop-copy, since the transition to 
> stop-copy can happen when @must_precopy != 0.

But we have no requirement to send all init_bytes before stop-copy.
This is a hack to achieve a theoretical benefit that a driver might be
able to improve the latency on the target by completing another
iteration.  If drivers are filling in a "must_precopy" arg, it sounds
like even if migration moves to stop-copy, that data should be migrated
first and deferring stop-copy could potentially extend the migration in
other areas.

> >   but if, for instance a user forces a
> > transition to stop-copy, I don't see that we have any business to
> > impose a policy to delay that until the init_bytes is complete.  
> 
> Is there a way a user can force the migration to move to stop-copy?
> Looking at migration code, it seems that the only way to move to 
> stop-copy is if @must_precopy is below the threshold.
> If so, then this is our effort to make QEMU send all init_bytes before 
> moving to stop_copy and we can only benefit from it.

> Regarding how to do it -- maybe instead of spoofing @must_precopy we can 
> introduce a new parameter in upper migration layer (e.g., @init_precopy) 
> and add another condition in migration layer that it must be zero to 
> move to stop-copy.

Why not just move to stop-copy but transfer all must_precopy data
first?  That would seem to align with the naming to me.  I don't think
the device actually cares if the transfer happens while the device is
running or stopped; it just wants it at the target device early enough
to start configuration, right?

I'd drop this for an initial implementation, the uAPI does not require
that QEMU complete init_bytes before transitioning to stop-copy and
this is clearly not a very clean or well justified means to try to
achieve that as a non-requirement.  Thanks,

Alex
Jason Gunthorpe Feb. 27, 2023, 5:26 p.m. UTC | #6
On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:

> But we have no requirement to send all init_bytes before stop-copy.
> This is a hack to achieve a theoretical benefit that a driver might be
> able to improve the latency on the target by completing another
> iteration.

I think this is another half-step at this point..

The goal is to not stop the VM until the target VFIO driver has
completed loading initial_bytes.

This signals that the time consuming pre-setup is completed in the
device and we don't have to use downtime to do that work.

We've measured this in our devices and the time-shift can be
significant, like seconds levels of time removed from the downtime
period.

Stopping the VM before this pre-setup is done is simply extending the
stopped VM downtime.

Really what we want is to have the far side acknowledge that
initial_bytes has completed loading.

To remind, what mlx5 is doing here with precopy is time-shifting work,
not data. We want to put expensive work (ie time) into the period when
the VM is still running and have less downtime.

This challenges the assumption built into QEMU that all data has equal
time and that it can estimate downtime simply by scaling the estimated
data. We have a data-size independent time component to deal with as
well.
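
A toy calculation with hypothetical numbers makes the point: a
data-only model predicts a tiny downtime, while a fixed target-side
setup cost dominates it.

#include <stdio.h>

int main(void)
{
    double bandwidth = 10e9;         /* bytes/sec, measured link rate */
    double stop_copy_bytes = 100e6;  /* device stop-copy data */
    double setup_secs = 2.0;         /* fixed pre-setup time on the target */

    double predicted = stop_copy_bytes / bandwidth;  /* data-scaled estimate */
    double actual = predicted + setup_secs;  /* if setup lands in downtime */

    printf("predicted downtime %.3f s, actual %.3f s\n", predicted, actual);
    return 0;
}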

Jason
Alex Williamson Feb. 27, 2023, 5:43 p.m. UTC | #7
On Mon, 27 Feb 2023 13:26:00 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
> 
> > But we have no requirement to send all init_bytes before stop-copy.
> > This is a hack to achieve a theoretical benefit that a driver might be
> > able to improve the latency on the target by completing another
> > iteration.  
> 
> I think this is another half-step at this point..
> 
> The goal is to not stop the VM until the target VFIO driver has
> completed loading initial_bytes.
> 
> This signals that the time consuming pre-setup is completed in the
> device and we don't have to use downtime to do that work.
> 
> We've measured this in our devices and the time-shift can be
> significant, like seconds levels of time removed from the downtime
> period.
> 
> Stopping the VM before this pre-setup is done is simply extending the
> stopped VM downtime.
> 
> Really what we want is to have the far side acknowledge that
> initial_bytes has completed loading.
> 
> To remind, what mlx5 is doing here with precopy is time-shifting work,
> not data. We want to put expensive work (ie time) into the period when
> the VM is still running and have less downtime.
> 
> This challenges the assumption built into QEMU that all data has equal
> time and that it can estimate downtime simply by scaling the estimated
> data. We have a data-size independent time component to deal with as
> well.

As I mentioned before, I understand the motivation, but imo the
implementation is exploiting the interface it extended in order to force
a device-driven policy, which is specifically not a requirement of the
vfio migration uAPI.  It sounds like there's more work required in the
QEMU migration interfaces to properly factor this information into the
algorithm.  Until then, this seems like a follow-on improvement unless
you can convince the migration maintainers that providing false
information in order to force another pre-copy iteration is a valid use
of passing the threshold value to the driver.  Thanks,

Alex
Avihai Horon March 1, 2023, 6:49 p.m. UTC | #8
On 27/02/2023 19:43, Alex Williamson wrote:
> External email: Use caution opening links or attachments
>
>
> On Mon, 27 Feb 2023 13:26:00 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
>>
>>> But we have no requirement to send all init_bytes before stop-copy.
>>> This is a hack to achieve a theoretical benefit that a driver might be
>>> able to improve the latency on the target by completing another
>>> iteration.
>> I think this is another half-step at this point..
>>
>> The goal is to not stop the VM until the target VFIO driver has
>> completed loading initial_bytes.
>>
>> This signals that the time consuming pre-setup is completed in the
>> device and we don't have to use downtime to do that work.
>>
>> We've measured this in our devices and the time-shift can be
>> significant, like seconds levels of time removed from the downtime
>> period.
>>
>> Stopping the VM before this pre-setup is done is simply extending the
>> stopped VM downtime.
>>
>> Really what we want is to have the far side acknowledge that
>> initial_bytes has completed loading.
>>
>> To remind, what mlx5 is doing here with precopy is time-shifting work,
>> not data. We want to put expensive work (ie time) into the period when
>> the VM is still running and have less downtime.
>>
>> This challenges the assumption built into QEMU that all data has equal
>> time and that it can estimate downtime simply by scaling the estimated
>> data. We have a data-size independent time component to deal with as
>> well.
> As I mentioned before, I understand the motivation, but imo the
> implementation is exploiting the interface it extended in order to force
> a device driven policy which is specifically not a requirement of the
> vfio migration uAPI.  It sounds like there's more work required in the
> QEMU migration interfaces to properly factor this information into the
> algorithm.  Until then, this seems like a follow-on improvement unless
> you can convince the migration maintainers that providing false
> information in order to force another pre-copy iteration is a valid use
> of passing the threshold value to the driver.

In my previous message I suggested dropping this exploit and instead
changing the QEMU migration API to introduce the concept of pre-copy
initial bytes -- data that must be transferred before the source VM
stops (which is different from the current @must_precopy, which
represents data that can be transferred even when the VM is stopped).
We could do it by adding a new parameter, "init_precopy_size", to the
state_pending_{estimate,exact} handlers, and every migration user
(RAM, block, etc.) could use it.
We would also change the migration algorithm to take this new parameter
into account when deciding to move to stop-copy.
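
For illustration, a rough sketch of the shape this could take (the
"init_precopy_size" parameter is hypothetical, not existing QEMU API;
the handler signature follows the one in this series):

    /* Handlers would also report data that must precede the VM stop: */
    void (*state_pending_estimate)(void *opaque, uint64_t threshold_size,
                                   uint64_t *must_precopy,
                                   uint64_t *can_postcopy,
                                   uint64_t *init_precopy_size);

    /* And the migration loop would converge only once it is drained: */
    if (pending_size <= threshold_size && init_precopy_size == 0) {
        /* OK to transition to stop-copy */
    }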

Of course this will have to be approved by migration maintainers first, 
but if it's done in a standard way such as above, via the migration API, 
would it be OK by you to go this way?

Thanks.
Alex Williamson March 1, 2023, 7:55 p.m. UTC | #9
On Wed, 1 Mar 2023 20:49:28 +0200
Avihai Horon <avihaih@nvidia.com> wrote:

> On 27/02/2023 19:43, Alex Williamson wrote:
> > External email: Use caution opening links or attachments
> >
> >
> > On Mon, 27 Feb 2023 13:26:00 -0400
> > Jason Gunthorpe <jgg@nvidia.com> wrote:
> >  
> >> On Mon, Feb 27, 2023 at 09:14:44AM -0700, Alex Williamson wrote:
> >>  
> >>> But we have no requirement to send all init_bytes before stop-copy.
> >>> This is a hack to achieve a theoretical benefit that a driver might be
> >>> able to improve the latency on the target by completing another
> >>> iteration.  
> >> I think this is another half-step at this point..
> >>
> >> The goal is to not stop the VM until the target VFIO driver has
> >> completed loading initial_bytes.
> >>
> >> This signals that the time consuming pre-setup is completed in the
> >> device and we don't have to use downtime to do that work.
> >>
> >> We've measured this in our devices and the time-shift can be
> >> significant, like seconds levels of time removed from the downtime
> >> period.
> >>
> >> Stopping the VM before this pre-setup is done is simply extending the
> >> stopped VM downtime.
> >>
> >> Really what we want is to have the far side acknowledge that
> >> initial_bytes has completed loading.
> >>
> >> To remind, what mlx5 is doing here with precopy is time-shifting work,
> >> not data. We want to put expensive work (ie time) into the period when
> >> the VM is still running and have less downtime.
> >>
> >> This challenges the assumption built into QEMU that all data has equal
> >> time and that it can estimate downtime simply by scaling the estimated
> >> data. We have a data-size independent time component to deal with as
> >> well.
> > As I mentioned before, I understand the motivation, but imo the
> > implementation is exploiting the interface it extended in order to force
> > a device driven policy which is specifically not a requirement of the
> > vfio migration uAPI.  It sounds like there's more work required in the
> > QEMU migration interfaces to properly factor this information into the
> > algorithm.  Until then, this seems like a follow-on improvement unless
> > you can convince the migration maintainers that providing false
> > information in order to force another pre-copy iteration is a valid use
> > of passing the threshold value to the driver.  
> 
> In my previous message I suggested dropping this exploit and instead
> changing the QEMU migration API to introduce the concept of pre-copy
> initial bytes -- data that must be transferred before the source VM
> stops (which is different from the current @must_precopy, which
> represents data that can be transferred even when the VM is stopped).
> We could do it by adding a new parameter, "init_precopy_size", to the
> state_pending_{estimate,exact} handlers, and every migration user
> (RAM, block, etc.) could use it.
> We would also change the migration algorithm to take this new parameter
> into account when deciding to move to stop-copy.
> 
> Of course this will have to be approved by migration maintainers first, 
> but if it's done in a standard way such as above, via the migration API, 
> would it be OK by you to go this way?

I still think we're conflating information and requirements by allowing
a device to impose a policy which keeps QEMU in pre-copy.  AIUI, what
we're trying to do is maximize the time separation between the
initial_bytes from the device and the end-of-stream.  But knowing the
data size of initial_bytes is not really all that useful.

In the limit of network bandwidth, all data transfers approach zero
time, but the startup latency of the target device that we're trying to
hide here is fixed.  By prioritizing initial_bytes, we're separating in
space the beginning of target device setup from the end-of-stream, but
that's only an approximation of time, which is what QEMU really needs
to know to honor downtime requirements.

So it seems like what we need here is both a preface buffer size and a
target device latency.  The QEMU pre-copy algorithm should factor both
the remaining data size and the device latency into deciding when to
transition to stop-copy, thereby allowing the device to feed actually
relevant data into the algorithm rather than dictate its behavior.
Thanks,

Alex
Jason Gunthorpe March 1, 2023, 9:12 p.m. UTC | #10
On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:

> So it seems like what we need here is both a preface buffer size and a
> target device latency.  The QEMU pre-copy algorithm should factor both
> the remaining data size and the device latency into deciding when to
> transition to stop-copy, thereby allowing the device to feed actually
> relevant data into the algorithm rather than dictate its behavior.

I don't know that we can realistically estimate startup latency, much
less have the sender estimate latency on the receiver...

I feel like trying to overlap the device startup with the STOP phase
is an unnecessary optimization? How do you see it benefiting?

I've been thinking of this from the perspective that we should always
ensure device startup is completed; it is time that has to be paid, so
why pay it during STOP?

Jason
Alex Williamson March 1, 2023, 10:39 p.m. UTC | #11
On Wed, 1 Mar 2023 17:12:51 -0400
Jason Gunthorpe <jgg@nvidia.com> wrote:

> On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> 
> > So it seems like what we need here is both a preface buffer size and a
> > target device latency.  The QEMU pre-copy algorithm should factor both
> > the remaining data size and the device latency into deciding when to
> > transition to stop-copy, thereby allowing the device to feed actually
> > relevant data into the algorithm rather than dictate its behavior.  
> 
> I don't know that we can realistically estimate startup latency, much
> less have the sender estimate latency on the receiver...

Knowing that the target device is compatible with the source is a point
towards making an educated guess.

> I feel like trying to overlap the device startup with the STOP phase
> is an unnecessary optimization? How do you see it benefiting?

If we can't guarantee that there's some time difference between sending
initial bytes immediately at the end of pre-copy vs immediately at the
beginning of stop-copy, does that mean any handling of initial bytes is
an unnecessary optimization?

I'm imagining that completing initial bytes triggers some
initialization sequence in the target host driver which runs in
parallel to the remaining data stream, so in practice, even if sent at
the beginning of stop-copy, the target device gets a head start.

> I've been thinking of this from the perspective that we should always
> ensure device startup is completed; it is time that has to be paid, so
> why pay it during STOP?

Creating a policy for QEMU to send initial bytes in a given phase
doesn't ensure startup is complete.  There's no guaranteed time
difference between sending that data and the beginning of stop-copy.

QEMU is trying to achieve a downtime goal, where it estimates network
bandwidth to get a data size threshold, and then polls devices for
remaining data.  That downtime goal might exceed the startup latency of
the target device anyway, in which case it's the operator's choice to
pay that time in stop-copy or stalled on the target.

But if we actually want to ensure startup of the target is complete,
then drivers should be able to return both data size and estimated time
for the target device to initialize.  That time estimate should be
updated by the driver based on if/when initial_bytes is drained.  The
decision whether to continue iterating pre-copy would then be based on
both the maximum remaining device startup time and the calculated time
based on remaining data size.
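
Purely illustrative, assuming some new per-device reporting interface
that does not exist today, something like:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical per-device report (no such uAPI/struct exists): */
    struct mig_time_estimate {
        uint64_t pending_bytes;   /* remaining device data */
        uint64_t target_init_ms;  /* estimated target startup time,
                                     updated as initial_bytes drains */
    };

    /* Iterate pre-copy while either term exceeds the downtime budget */
    static bool must_keep_precopy(const struct mig_time_estimate *e,
                                  uint64_t bw_bytes_per_ms,
                                  uint64_t downtime_limit_ms)
    {
        return e->pending_bytes / bw_bytes_per_ms > downtime_limit_ms ||
               e->target_init_ms > downtime_limit_ms;
    }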

I think this provides a better guarantee than anything based simply on
transferring a given chunk of data in a specific phase of the process.
Thoughts?  Thanks,

Alex
Jason Gunthorpe March 6, 2023, 7:01 p.m. UTC | #12
On Wed, Mar 01, 2023 at 03:39:17PM -0700, Alex Williamson wrote:
> On Wed, 1 Mar 2023 17:12:51 -0400
> Jason Gunthorpe <jgg@nvidia.com> wrote:
> 
> > On Wed, Mar 01, 2023 at 12:55:59PM -0700, Alex Williamson wrote:
> > 
> > > So it seems like what we need here is both a preface buffer size and a
> > > target device latency.  The QEMU pre-copy algorithm should factor both
> > > the remaining data size and the device latency into deciding when to
> > > transition to stop-copy, thereby allowing the device to feed actually
> > > relevant data into the algorithm rather than dictate its behavior.  
> > 
> > I don't know that we can realistically estimate startup latency, much
> > less have the sender estimate latency on the receiver...
> 
> Knowing that the target device is compatible with the source is a point
> towards making an educated guess.
> 
> > I feel like trying to overlap the device startup with the STOP phase
> > is an unnecessary optimization? How do you see it benefiting?
> 
> If we can't guarantee that there's some time difference between sending
> initial bytes immediately at the end of pre-copy vs immediately at the
> beginning of stop-copy, does that mean any handling of initial bytes is
> an unnecessary optimization?

Sure, if the device doesn't implement an initial_bytes startup phase
then it is all pointless, but probably those devices should return 0
for initial_bytes? If we see initial_bytes and assume it indicates a
startup phase, why not do it?

> I'm imagining that completing initial bytes triggers some
> initialization sequence in the target host driver which runs in
> parallel to the remaining data stream, so in practice, even if sent at
> the beginning of stop-copy, the target device gets a head start.

It isn't parallel in mlx5. The load operation of the initial bytes on
the receiver will execute the load command, and that command will take
an amount of time roughly proportional to how much data is in the
device. IIRC the mlx5 VFIO driver will block the read until this finishes.

It is convoluted, but it is ultimately allocating (potentially a lot
of) pages in the hypervisor kernel, so the time predictability is not
very good.

Other device types we are looking at might make network connections at
this step -- e.g. a storage device might open a network connection to
its back end. This could be unpredictably long in degenerate cases.

> > I've been thinking of this from the perspective that we should always
> > ensure device startup is completed; it is time that has to be paid, so
> > why pay it during STOP?
> 
> Creating a policy for QEMU to send initial bytes in a given phase
> doesn't ensure startup is complete.  There's no guaranteed time
> difference between sending that data and the beginning of stop-copy.

As I've said, to really do a good job here we want to have the sender
wait until the receiver completes startup, and not just treat it as a
unidirectional byte-stream. That isn't this patch...

> QEMU is trying to achieve a downtime goal, where it estimates network
> bandwidth to get a data size threshold, and then polls devices for
> remaining data.  That downtime goal might exceed the startup latency of
> the target device anyway, in which case it's the operator's choice to
> pay that time in stop-copy or stalled on the target.

If you are saying there should be a policy flag ('optimize for total
migration time' vs 'optimize for minimum downtime'), that seems
reasonable, though I wonder who would pick the first option.
 
> But if we actually want to ensure startup of the target is complete,
> then drivers should be able to return both data size and estimated time
> for the target device to initialize.  That time estimate should be
> updated by the driver based on if/when initial_bytes is drained.  The
> decision whether to continue iterating pre-copy would then be based on
> both the maximum remaining device startup time and the calculated time
> based on remaining data size.

That seems complicated. Why not just wait for the other side to
acknowledge it has started the device? Then we aren't trying to guess.

AFAIK this sort of happens implicitly in this patch, because once the
initial bytes are pushed, the next data that follows will block on the
pending load and the single socket will backpressure until the load is
done. Horrible, yes, but it is where QEMU is at. Multi-fd is really
important :)

Jason
diff mbox series

Patch

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index c214c73e28..ba80b9150d 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -7,12 +7,14 @@  the guest is running on source host and restoring this saved state on the
 destination host. This document details how saving and restoring of VFIO
 devices is done in QEMU.
 
-Migration of VFIO devices currently consists of a single stop-and-copy phase.
-During the stop-and-copy phase the guest is stopped and the entire VFIO device
-data is transferred to the destination.
-
-The pre-copy phase of migration is currently not supported for VFIO devices.
-Support for VFIO pre-copy will be added later on.
+Migration of VFIO devices consists of two phases: the optional pre-copy phase,
+and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
+VFIO devices that have a large amount of data that needs to be transferred.
+The iterative pre-copy phase of migration allows the guest to continue running
+while the VFIO device state is transferred to the destination; this helps to
+reduce the total downtime of the VM. VFIO devices can choose to skip the
+pre-copy phase of migration by not reporting the VFIO_MIGRATION_PRE_COPY
+flag in the VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
@@ -29,10 +31,20 @@  VFIO implements the device hooks for the iterative approach as follows:
 * A ``load_setup`` function that sets the VFIO device on the destination in
   _RESUMING state.
 
+* A ``state_pending_estimate`` function that reports an estimate of the
+  remaining pre-copy data that the vendor driver has yet to save for the VFIO
+  device.
+
 * A ``state_pending_exact`` function that reads pending_bytes from the vendor
   driver, which indicates the amount of data that the vendor driver has yet to
   save for the VFIO device.
 
+* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is
+  active only when the VFIO device is in pre-copy states.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver during the iterative pre-copy phase.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
@@ -95,8 +107,10 @@  Flow of state changes during Live migration
 ===========================================
 
 Below is the flow of state change during live migration.
-The values in the brackets represent the VM state, the migration state, and
+The values in the parentheses represent the VM state, the migration state, and
 the VFIO device state, respectively.
+The text in the square brackets represents the flow if the VFIO device supports
+pre-copy.
 
 Live migration save path
 ------------------------
@@ -108,11 +122,12 @@  Live migration save path
                                   |
                      migrate_init spawns migration_thread
                 Migration thread then calls each device's .save_setup()
-                       (RUNNING, _SETUP, _RUNNING)
+                  (RUNNING, _SETUP, _RUNNING [_PRE_COPY])
                                   |
-                      (RUNNING, _ACTIVE, _RUNNING)
-             If device is active, get pending_bytes by .state_pending_exact()
+                  (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY])
+      If device is active, get pending_bytes by .state_pending_{estimate,exact}()
           If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  [Data of VFIO device for pre-copy phase is copied]
         Iterate till total pending bytes converge and are less than threshold
                                   |
   On migration completion, vCPU stops and calls .save_live_complete_precopy for
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 87524c64a4..ee55d442b4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -66,6 +66,9 @@  typedef struct VFIOMigration {
     int data_fd;
     void *data_buffer;
     size_t data_buffer_size;
+    uint64_t precopy_init_size;
+    uint64_t precopy_dirty_size;
+    uint64_t mig_flags;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index bab83c0e55..6f5afe9f5a 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -409,7 +409,8 @@  static bool vfio_devices_all_dirty_tracking(VFIOContainer *container)
             }
 
             if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF &&
-                migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+                (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                 migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) {
                 return false;
             }
         }
@@ -438,7 +439,8 @@  static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container)
                 return false;
             }
 
-            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) {
+            if (migration->device_state == VFIO_DEVICE_STATE_RUNNING ||
+                migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
                 continue;
             } else {
                 return false;
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index 94a4df73d0..307983d57d 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -67,6 +67,8 @@  static const char *mig_state_to_str(enum vfio_device_mig_state state)
         return "STOP_COPY";
     case VFIO_DEVICE_STATE_RESUMING:
         return "RESUMING";
+    case VFIO_DEVICE_STATE_PRE_COPY:
+        return "PRE_COPY";
     default:
         return "UNKNOWN STATE";
     }
@@ -240,6 +242,23 @@  static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
     return 0;
 }
 
+static int vfio_query_precopy_size(VFIOMigration *migration,
+                                   uint64_t *init_size, uint64_t *dirty_size)
+{
+    struct vfio_precopy_info precopy = {
+        .argsz = sizeof(precopy),
+    };
+
+    if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) {
+        return -errno;
+    }
+
+    *init_size = precopy.initial_bytes;
+    *dirty_size = precopy.dirty_bytes;
+
+    return 0;
+}
+
 /* Returns the size of saved data on success and -errno on error */
 static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
 {
@@ -248,6 +267,11 @@  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     data_size = read(migration->data_fd, migration->data_buffer,
                      migration->data_buffer_size);
     if (data_size < 0) {
+        /* Pre-copy emptied all the device state for now */
+        if (errno == ENOMSG) {
+            return 0;
+        }
+
         return -errno;
     }
     if (data_size == 0) {
@@ -264,6 +288,31 @@  static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration)
     return qemu_file_get_error(f) ?: data_size;
 }
 
+static void vfio_update_estimated_pending_data(VFIOMigration *migration,
+                                               uint64_t data_size)
+{
+    if (!data_size) {
+        /*
+         * Pre-copy emptied all the device state for now, update estimated sizes
+         * accordingly.
+         */
+        migration->precopy_init_size = 0;
+        migration->precopy_dirty_size = 0;
+
+        return;
+    }
+
+    if (migration->precopy_init_size) {
+        uint64_t init_size = MIN(migration->precopy_init_size, data_size);
+
+        migration->precopy_init_size -= init_size;
+        data_size -= init_size;
+    }
+
+    migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size,
+                                         data_size);
+}
+
 /* ---------------------------------------------------------------------- */
 
 static int vfio_save_setup(QEMUFile *f, void *opaque)
@@ -284,6 +333,35 @@  static int vfio_save_setup(QEMUFile *f, void *opaque)
         return -ENOMEM;
     }
 
+    if (migration->mig_flags & VFIO_MIGRATION_PRE_COPY) {
+        uint64_t init_size = 0, dirty_size = 0;
+        int ret;
+
+        switch (migration->device_state) {
+        case VFIO_DEVICE_STATE_RUNNING:
+            ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
+                                           VFIO_DEVICE_STATE_RUNNING);
+            if (ret) {
+                return ret;
+            }
+
+            vfio_query_precopy_size(migration, &init_size, &dirty_size);
+            migration->precopy_init_size = init_size;
+            migration->precopy_dirty_size = dirty_size;
+
+            break;
+        case VFIO_DEVICE_STATE_STOP:
+            /* vfio_save_complete_precopy() will go to STOP_COPY */
+
+            migration->precopy_init_size = 0;
+            migration->precopy_dirty_size = 0;
+
+            break;
+        default:
+            return -EINVAL;
+        }
+    }
+
     trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size);
 
     qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
@@ -302,23 +380,44 @@  static void vfio_save_cleanup(void *opaque)
     trace_vfio_save_cleanup(vbasedev->name);
 }
 
+static void vfio_state_pending_estimate(void *opaque, uint64_t threshold_size,
+                                        uint64_t *must_precopy,
+                                        uint64_t *can_postcopy)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) {
+        return;
+    }
+
+    /*
+     * Initial size should be transferred during pre-copy phase so stop-copy
+     * phase will not be slowed down. Report threshold_size to force another
+     * pre-copy iteration.
+     */
+    *must_precopy += migration->precopy_init_size ?
+                         threshold_size :
+                         migration->precopy_dirty_size;
+
+    trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy,
+                                      *can_postcopy,
+                                      migration->precopy_init_size,
+                                      migration->precopy_dirty_size);
+}
+
 /*
  * Migration size of VFIO devices can be as little as a few KBs or as big as
  * many GBs. This value should be big enough to cover the worst case.
  */
 #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB)
 
-/*
- * Only exact function is implemented and not estimate function. The reason is
- * that during pre-copy phase of migration the estimate function is called
- * repeatedly while pending RAM size is over the threshold, thus migration
- * can't converge and querying the VFIO device pending data size is useless.
- */
 static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
                                      uint64_t *must_precopy,
                                      uint64_t *can_postcopy)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE;
 
     /*
@@ -328,8 +427,57 @@  static void vfio_state_pending_exact(void *opaque, uint64_t threshold_size,
     vfio_query_stop_copy_size(vbasedev, &stop_copy_size);
     *must_precopy += stop_copy_size;
 
+    if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) {
+        uint64_t init_size = 0, dirty_size = 0;
+
+        vfio_query_precopy_size(migration, &init_size, &dirty_size);
+        migration->precopy_init_size = init_size;
+        migration->precopy_dirty_size = dirty_size;
+
+        /*
+         * Initial size should be transferred during pre-copy phase so
+         * stop-copy phase will not be slowed down. Report threshold_size
+         * to force another pre-copy iteration.
+         */
+        *must_precopy += migration->precopy_init_size ?
+                             threshold_size :
+                             migration->precopy_dirty_size;
+    }
+
     trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy,
-                                   stop_copy_size);
+                                   stop_copy_size, migration->precopy_init_size,
+                                   migration->precopy_dirty_size);
+}
+
+static bool vfio_is_active_iterate(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY;
+}
+
+static int vfio_save_iterate(QEMUFile *f, void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+    ssize_t data_size;
+
+    data_size = vfio_save_block(f, migration);
+    if (data_size < 0) {
+        return data_size;
+    }
+    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+
+    vfio_update_estimated_pending_data(migration, data_size);
+
+    trace_vfio_save_iterate(vbasedev->name);
+
+    /*
+     * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero.
+     * Return 1 so that the following handlers are not blocked unnecessarily.
+     */
+    return 1;
 }
 
 static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
@@ -338,7 +486,7 @@  static int vfio_save_complete_precopy(QEMUFile *f, void *opaque)
     ssize_t data_size;
     int ret;
 
-    /* We reach here with device state STOP only */
+    /* We reach here with device state STOP or STOP_COPY only */
     ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY,
                                    VFIO_DEVICE_STATE_STOP);
     if (ret) {
@@ -457,7 +605,10 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
+    .state_pending_estimate = vfio_state_pending_estimate,
     .state_pending_exact = vfio_state_pending_exact,
+    .is_active_iterate = vfio_is_active_iterate,
+    .save_live_iterate = vfio_save_iterate,
     .save_live_complete_precopy = vfio_save_complete_precopy,
     .save_state = vfio_save_state,
     .load_setup = vfio_load_setup,
@@ -470,13 +621,18 @@  static const SaveVMHandlers savevm_vfio_handlers = {
 static void vfio_vmstate_change(void *opaque, bool running, RunState state)
 {
     VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
     enum vfio_device_mig_state new_state;
     int ret;
 
     if (running) {
         new_state = VFIO_DEVICE_STATE_RUNNING;
     } else {
-        new_state = VFIO_DEVICE_STATE_STOP;
+        new_state =
+            (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY &&
+             (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ?
+                VFIO_DEVICE_STATE_STOP_COPY :
+                VFIO_DEVICE_STATE_STOP;
     }
 
     /*
@@ -590,6 +746,7 @@  static int vfio_migration_init(VFIODevice *vbasedev)
     migration->vbasedev = vbasedev;
     migration->device_state = VFIO_DEVICE_STATE_RUNNING;
     migration->data_fd = -1;
+    migration->mig_flags = mig_flags;
 
     oid = vmstate_if_get_id(VMSTATE_IF(DEVICE(obj)));
     if (oid) {
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 669d9fe07c..51613e02e6 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -161,6 +161,8 @@  vfio_save_block(const char *name, int data_size) " (%s) data_size %d"
 vfio_save_cleanup(const char *name) " (%s)"
 vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d"
 vfio_save_device_config_state(const char *name) " (%s)"
+vfio_save_iterate(const char *name) " (%s)"
 vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64
-vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64
+vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
+vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64
 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"