[v4,9/9] vfio/migration: Add support for switchover ack capability

Message ID 20230528140652.8693-10-avihaih@nvidia.com
State New
Series migration: Add switchover ack capability and VFIO precopy support

Commit Message

Avihai Horon May 28, 2023, 2:06 p.m. UTC
Loading of a VFIO device's data can take a substantial amount of time as
the device may need to allocate resources, prepare internal data
structures, etc. This can increase migration downtime, especially for
VFIO devices with a lot of resources.

To solve this, the VFIO migration uAPI defines "initial bytes" as part of
its precopy data stream. Initial bytes can be used in various ways to
improve VFIO migration performance. For example, they can be used to
transfer device metadata to pre-allocate resources in the destination.
However, for this to work we need to make sure that all initial bytes
are sent and loaded in the destination before the source VM is stopped.

Use the migration switchover ack capability to make sure a VFIO device's
initial bytes are sent and loaded in the destination before the source
stops the VM and attempts to complete the migration.
This can significantly reduce migration downtime for some devices.

As precopy support and precopy initial bytes support come together in
VFIO migration, use x-allow-pre-copy device property to control usage of
this feature as well.

Signed-off-by: Avihai Horon <avihaih@nvidia.com>
---
 docs/devel/vfio-migration.rst | 10 ++++++++
 include/hw/vfio/vfio-common.h |  2 ++
 hw/vfio/migration.c           | 48 ++++++++++++++++++++++++++++++++++-
 3 files changed, 59 insertions(+), 1 deletion(-)

Comments

Cédric Le Goater May 30, 2023, 9:58 a.m. UTC | #1
On 5/28/23 16:06, Avihai Horon wrote:
> Loading of a VFIO device's data can take a substantial amount of time as
> the device may need to allocate resources, prepare internal data
> structures, etc. This can increase migration downtime, especially for
> VFIO devices with a lot of resources.
> 
> To solve this, VFIO migration uAPI defines "initial bytes" as part of
> its precopy data stream. Initial bytes can be used in various ways to
> improve VFIO migration performance. For example, it can be used to
> transfer device metadata to pre-allocate resources in the destination.
> However, for this to work we need to make sure that all initial bytes
> are sent and loaded in the destination before the source VM is stopped.
> 
> Use migration switchover ack capability to make sure a VFIO device's
> initial bytes are sent and loaded in the destination before the source
> stops the VM and attempts to complete the migration.
> This can significantly reduce migration downtime for some devices.
> 
> As precopy support and precopy initial bytes support come together in
> VFIO migration, use x-allow-pre-copy device property to control usage of
> this feature as well.
> 
> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
> ---
>   docs/devel/vfio-migration.rst | 10 ++++++++
>   include/hw/vfio/vfio-common.h |  2 ++
>   hw/vfio/migration.c           | 48 ++++++++++++++++++++++++++++++++++-
>   3 files changed, 59 insertions(+), 1 deletion(-)
> 
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> index e896b2a673..b433cb5bb2 100644
> --- a/docs/devel/vfio-migration.rst
> +++ b/docs/devel/vfio-migration.rst
> @@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>   
> +When pre-copy is supported, it's possible to further reduce downtime by
> +enabling "switchover-ack" migration capability.
> +VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
> +and recommends that the initial bytes are sent and loaded in the destination
> +before stopping the source VM. Enabling this migration capability will
> +guarantee that and thus, can potentially reduce downtime even further.
> +
>   Note that currently VFIO migration is supported only for a single device. This
>   is due to VFIO migration's lack of P2P support. However, P2P support is planned
>   to be added later on.
> @@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative approach as follows:
>   * A ``save_live_iterate`` function that reads the VFIO device's data from the
>     vendor driver during iterative pre-copy phase.
>   
> +* A ``switchover_ack_needed`` function that checks if the VFIO device uses
> +  "switchover-ack" migration capability when this capability is enabled.
> +
>   * A ``save_state`` function to save the device config space if it is present.
>   
>   * A ``save_live_complete_precopy`` function that sets the VFIO device in
> diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
> index a53ecbe2e0..ad0562c8b7 100644
> --- a/include/hw/vfio/vfio-common.h
> +++ b/include/hw/vfio/vfio-common.h
> @@ -69,6 +69,8 @@ typedef struct VFIOMigration {
>       uint64_t mig_flags;
>       uint64_t precopy_init_size;
>       uint64_t precopy_dirty_size;
> +    bool switchover_ack_needed;

Do we really need the 'switchover_ack_needed' bool ?

It seems that each time it is used in a routine it could be computed
locally with migrate_switchover_ack() and vfio_precopy_supported().
This would simplify the code a bit more.

Thanks,

C.


  
> +    bool initial_data_sent;
>   } VFIOMigration;
>   
>   typedef struct VFIOAddressSpace {
> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
> index cb6923ed3f..ede29ffb5c 100644
> --- a/hw/vfio/migration.c
> +++ b/hw/vfio/migration.c
> @@ -18,6 +18,8 @@
>   #include "sysemu/runstate.h"
>   #include "hw/vfio/vfio-common.h"
>   #include "migration/migration.h"
> +#include "migration/options.h"
> +#include "migration/savevm.h"
>   #include "migration/vmstate.h"
>   #include "migration/qemu-file.h"
>   #include "migration/register.h"
> @@ -45,6 +47,7 @@
>   #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
>   #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
> +#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>   
>   /*
>    * This is an arbitrary size based on migration of mlx5 devices, where typically
> @@ -218,6 +221,7 @@ static void vfio_migration_cleanup(VFIODevice *vbasedev)
>   
>       close(migration->data_fd);
>       migration->data_fd = -1;
> +    migration->switchover_ack_needed = false;
>   }
>   
>   static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
> @@ -350,6 +354,10 @@ static int vfio_save_setup(QEMUFile *f, void *opaque)
>       if (vfio_precopy_supported(vbasedev)) {
>           int ret;
>   
> +        if (migrate_switchover_ack()) {
> +            migration->switchover_ack_needed = true;
> +        }
> +
>           switch (migration->device_state) {
>           case VFIO_DEVICE_STATE_RUNNING:
>               ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
> @@ -385,6 +393,7 @@ static void vfio_save_cleanup(void *opaque)
>       migration->data_buffer = NULL;
>       migration->precopy_init_size = 0;
>       migration->precopy_dirty_size = 0;
> +    migration->initial_data_sent = false;
>       vfio_migration_cleanup(vbasedev);
>       trace_vfio_save_cleanup(vbasedev->name);
>   }
> @@ -458,10 +467,17 @@ static int vfio_save_iterate(QEMUFile *f, void *opaque)
>       if (data_size < 0) {
>           return data_size;
>       }
> -    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>   
>       vfio_update_estimated_pending_data(migration, data_size);
>   
> +    if (migration->switchover_ack_needed && !migration->precopy_init_size &&
> +        !migration->initial_data_sent) {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
> +        migration->initial_data_sent = true;
> +    } else {
> +        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
> +    }
> +
>       trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size,
>                               migration->precopy_dirty_size);
>   
> @@ -526,6 +542,10 @@ static int vfio_load_setup(QEMUFile *f, void *opaque)
>   {
>       VFIODevice *vbasedev = opaque;
>   
> +    if (migrate_switchover_ack() && vfio_precopy_supported(vbasedev)) {
> +        vbasedev->migration->switchover_ack_needed = true;
> +    }
> +
>       return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
>                                      vbasedev->migration->device_state);
>   }
> @@ -580,6 +600,23 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>               }
>               break;
>           }
> +        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
> +        {
> +            if (!vbasedev->migration->switchover_ack_needed) {
> +                error_report("%s: Received INIT_DATA_SENT but switchover ack "
> +                             "is not needed", vbasedev->name);
> +                return -EINVAL;
> +            }
> +
> +            ret = qemu_loadvm_approve_switchover();
> +            if (ret) {
> +                error_report(
> +                    "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
> +                    vbasedev->name, ret, strerror(-ret));
> +            }
> +
> +            return ret;
> +        }
>           default:
>               error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
>               return -EINVAL;
> @@ -594,6 +631,14 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
>       return ret;
>   }
>   
> +static bool vfio_switchover_ack_needed(void *opaque)
> +{
> +    VFIODevice *vbasedev = opaque;
> +    VFIOMigration *migration = vbasedev->migration;
> +
> +    return migration->switchover_ack_needed;
> +}
> +
>   static const SaveVMHandlers savevm_vfio_handlers = {
>       .save_setup = vfio_save_setup,
>       .save_cleanup = vfio_save_cleanup,
> @@ -606,6 +651,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>       .load_setup = vfio_load_setup,
>       .load_cleanup = vfio_load_cleanup,
>       .load_state = vfio_load_state,
> +    .switchover_ack_needed = vfio_switchover_ack_needed,
>   };
>   
>   /* ---------------------------------------------------------------------- */
Avihai Horon May 30, 2023, 11:04 a.m. UTC | #2
On 30/05/2023 12:58, Cédric Le Goater wrote:
> On 5/28/23 16:06, Avihai Horon wrote:
>> Loading of a VFIO device's data can take a substantial amount of time as
>> the device may need to allocate resources, prepare internal data
>> structures, etc. This can increase migration downtime, especially for
>> VFIO devices with a lot of resources.
>>
>> To solve this, VFIO migration uAPI defines "initial bytes" as part of
>> its precopy data stream. Initial bytes can be used in various ways to
>> improve VFIO migration performance. For example, it can be used to
>> transfer device metadata to pre-allocate resources in the destination.
>> However, for this to work we need to make sure that all initial bytes
>> are sent and loaded in the destination before the source VM is stopped.
>>
>> Use migration switchover ack capability to make sure a VFIO device's
>> initial bytes are sent and loaded in the destination before the source
>> stops the VM and attempts to complete the migration.
>> This can significantly reduce migration downtime for some devices.
>>
>> As precopy support and precopy initial bytes support come together in
>> VFIO migration, use x-allow-pre-copy device property to control usage of
>> this feature as well.
>>
>> Signed-off-by: Avihai Horon <avihaih@nvidia.com>
>> ---
>>   docs/devel/vfio-migration.rst | 10 ++++++++
>>   include/hw/vfio/vfio-common.h |  2 ++
>>   hw/vfio/migration.c           | 48 ++++++++++++++++++++++++++++++++++-
>>   3 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/docs/devel/vfio-migration.rst 
>> b/docs/devel/vfio-migration.rst
>> index e896b2a673..b433cb5bb2 100644
>> --- a/docs/devel/vfio-migration.rst
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -16,6 +16,13 @@ helps to reduce the total downtime of the VM. VFIO 
>> devices opt-in to pre-copy
>>   support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
>>   VFIO_DEVICE_FEATURE_MIGRATION ioctl.
>>
>> +When pre-copy is supported, it's possible to further reduce downtime by
>> +enabling "switchover-ack" migration capability.
>> +VFIO migration uAPI defines "initial bytes" as part of its pre-copy 
>> data stream
>> +and recommends that the initial bytes are sent and loaded in the 
>> destination
>> +before stopping the source VM. Enabling this migration capability will
>> +guarantee that and thus, can potentially reduce downtime even further.
>> +
>>   Note that currently VFIO migration is supported only for a single 
>> device. This
>>   is due to VFIO migration's lack of P2P support. However, P2P 
>> support is planned
>>   to be added later on.
>> @@ -45,6 +52,9 @@ VFIO implements the device hooks for the iterative 
>> approach as follows:
>>   * A ``save_live_iterate`` function that reads the VFIO device's 
>> data from the
>>     vendor driver during iterative pre-copy phase.
>>
>> +* A ``switchover_ack_needed`` function that checks if the VFIO 
>> device uses
>> +  "switchover-ack" migration capability when this capability is 
>> enabled.
>> +
>>   * A ``save_state`` function to save the device config space if it 
>> is present.
>>
>>   * A ``save_live_complete_precopy`` function that sets the VFIO 
>> device in
>> diff --git a/include/hw/vfio/vfio-common.h 
>> b/include/hw/vfio/vfio-common.h
>> index a53ecbe2e0..ad0562c8b7 100644
>> --- a/include/hw/vfio/vfio-common.h
>> +++ b/include/hw/vfio/vfio-common.h
>> @@ -69,6 +69,8 @@ typedef struct VFIOMigration {
>>       uint64_t mig_flags;
>>       uint64_t precopy_init_size;
>>       uint64_t precopy_dirty_size;
>> +    bool switchover_ack_needed;
>
> Do we really need the 'switchover_ack_needed' bool ?
>
> It seems that each time it is used in a routine it could be computed
> locally with migrate_switchover_ack() and vfio_precopy_supported().
> This would simplify the code a bit more.
>
You are right.
I will drop it and send a v5 (will fix the superfluous " as well).

Thanks!

>
>
>
>> +    bool initial_data_sent;
>>   } VFIOMigration;
>>
>>   typedef struct VFIOAddressSpace {
>> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
>> index cb6923ed3f..ede29ffb5c 100644
>> --- a/hw/vfio/migration.c
>> +++ b/hw/vfio/migration.c
>> @@ -18,6 +18,8 @@
>>   #include "sysemu/runstate.h"
>>   #include "hw/vfio/vfio-common.h"
>>   #include "migration/migration.h"
>> +#include "migration/options.h"
>> +#include "migration/savevm.h"
>>   #include "migration/vmstate.h"
>>   #include "migration/qemu-file.h"
>>   #include "migration/register.h"
>> @@ -45,6 +47,7 @@
>>   #define VFIO_MIG_FLAG_DEV_CONFIG_STATE (0xffffffffef100002ULL)
>>   #define VFIO_MIG_FLAG_DEV_SETUP_STATE (0xffffffffef100003ULL)
>>   #define VFIO_MIG_FLAG_DEV_DATA_STATE (0xffffffffef100004ULL)
>> +#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
>>
>>   /*
>>    * This is an arbitrary size based on migration of mlx5 devices, 
>> where typically
>> @@ -218,6 +221,7 @@ static void vfio_migration_cleanup(VFIODevice 
>> *vbasedev)
>>
>>       close(migration->data_fd);
>>       migration->data_fd = -1;
>> +    migration->switchover_ack_needed = false;
>>   }
>>
>>   static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
>> @@ -350,6 +354,10 @@ static int vfio_save_setup(QEMUFile *f, void 
>> *opaque)
>>       if (vfio_precopy_supported(vbasedev)) {
>>           int ret;
>>
>> +        if (migrate_switchover_ack()) {
>> +            migration->switchover_ack_needed = true;
>> +        }
>> +
>>           switch (migration->device_state) {
>>           case VFIO_DEVICE_STATE_RUNNING:
>>               ret = vfio_migration_set_state(vbasedev, 
>> VFIO_DEVICE_STATE_PRE_COPY,
>> @@ -385,6 +393,7 @@ static void vfio_save_cleanup(void *opaque)
>>       migration->data_buffer = NULL;
>>       migration->precopy_init_size = 0;
>>       migration->precopy_dirty_size = 0;
>> +    migration->initial_data_sent = false;
>>       vfio_migration_cleanup(vbasedev);
>>       trace_vfio_save_cleanup(vbasedev->name);
>>   }
>> @@ -458,10 +467,17 @@ static int vfio_save_iterate(QEMUFile *f, void 
>> *opaque)
>>       if (data_size < 0) {
>>           return data_size;
>>       }
>> -    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>>
>>       vfio_update_estimated_pending_data(migration, data_size);
>>
>> +    if (migration->switchover_ack_needed && 
>> !migration->precopy_init_size &&
>> +        !migration->initial_data_sent) {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
>> +        migration->initial_data_sent = true;
>> +    } else {
>> +        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
>> +    }
>> +
>>       trace_vfio_save_iterate(vbasedev->name, 
>> migration->precopy_init_size,
>>                               migration->precopy_dirty_size);
>>
>> @@ -526,6 +542,10 @@ static int vfio_load_setup(QEMUFile *f, void 
>> *opaque)
>>   {
>>       VFIODevice *vbasedev = opaque;
>>
>> +    if (migrate_switchover_ack() && vfio_precopy_supported(vbasedev)) {
>> +        vbasedev->migration->switchover_ack_needed = true;
>> +    }
>> +
>>       return vfio_migration_set_state(vbasedev, 
>> VFIO_DEVICE_STATE_RESUMING,
>> vbasedev->migration->device_state);
>>   }
>> @@ -580,6 +600,23 @@ static int vfio_load_state(QEMUFile *f, void 
>> *opaque, int version_id)
>>               }
>>               break;
>>           }
>> +        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
>> +        {
>> +            if (!vbasedev->migration->switchover_ack_needed) {
>> +                error_report("%s: Received INIT_DATA_SENT but 
>> switchover ack "
>> +                             "is not needed", vbasedev->name);
>> +                return -EINVAL;
>> +            }
>> +
>> +            ret = qemu_loadvm_approve_switchover();
>> +            if (ret) {
>> +                error_report(
>> +                    "%s: qemu_loadvm_approve_switchover failed, 
>> err=%d (%s)",
>> +                    vbasedev->name, ret, strerror(-ret));
>> +            }
>> +
>> +            return ret;
>> +        }
>>           default:
>>               error_report("%s: Unknown tag 0x%"PRIx64, 
>> vbasedev->name, data);
>>               return -EINVAL;
>> @@ -594,6 +631,14 @@ static int vfio_load_state(QEMUFile *f, void 
>> *opaque, int version_id)
>>       return ret;
>>   }
>>
>> +static bool vfio_switchover_ack_needed(void *opaque)
>> +{
>> +    VFIODevice *vbasedev = opaque;
>> +    VFIOMigration *migration = vbasedev->migration;
>> +
>> +    return migration->switchover_ack_needed;
>> +}
>> +
>>   static const SaveVMHandlers savevm_vfio_handlers = {
>>       .save_setup = vfio_save_setup,
>>       .save_cleanup = vfio_save_cleanup,
>> @@ -606,6 +651,7 @@ static const SaveVMHandlers savevm_vfio_handlers = {
>>       .load_setup = vfio_load_setup,
>>       .load_cleanup = vfio_load_cleanup,
>>       .load_state = vfio_load_state,
>> +    .switchover_ack_needed = vfio_switchover_ack_needed,
>>   };
>>
>>   /* 
>> ---------------------------------------------------------------------- 
>> */
>
Patch

diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
index e896b2a673..b433cb5bb2 100644
--- a/docs/devel/vfio-migration.rst
+++ b/docs/devel/vfio-migration.rst
@@ -16,6 +16,13 @@  helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy
 support by reporting the VFIO_MIGRATION_PRE_COPY flag in the
 VFIO_DEVICE_FEATURE_MIGRATION ioctl.
 
+When pre-copy is supported, it's possible to further reduce downtime by
+enabling "switchover-ack" migration capability.
+VFIO migration uAPI defines "initial bytes" as part of its pre-copy data stream
+and recommends that the initial bytes are sent and loaded in the destination
+before stopping the source VM. Enabling this migration capability will
+guarantee that and thus, can potentially reduce downtime even further.
+
 Note that currently VFIO migration is supported only for a single device. This
 is due to VFIO migration's lack of P2P support. However, P2P support is planned
 to be added later on.
@@ -45,6 +52,9 @@  VFIO implements the device hooks for the iterative approach as follows:
 * A ``save_live_iterate`` function that reads the VFIO device's data from the
   vendor driver during iterative pre-copy phase.
 
+* A ``switchover_ack_needed`` function that checks if the VFIO device uses
+  "switchover-ack" migration capability when this capability is enabled.
+
 * A ``save_state`` function to save the device config space if it is present.
 
 * A ``save_live_complete_precopy`` function that sets the VFIO device in
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index a53ecbe2e0..ad0562c8b7 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -69,6 +69,8 @@  typedef struct VFIOMigration {
     uint64_t mig_flags;
     uint64_t precopy_init_size;
     uint64_t precopy_dirty_size;
+    bool switchover_ack_needed;
+    bool initial_data_sent;
 } VFIOMigration;
 
 typedef struct VFIOAddressSpace {
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index cb6923ed3f..ede29ffb5c 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -18,6 +18,8 @@ 
 #include "sysemu/runstate.h"
 #include "hw/vfio/vfio-common.h"
 #include "migration/migration.h"
+#include "migration/options.h"
+#include "migration/savevm.h"
 #include "migration/vmstate.h"
 #include "migration/qemu-file.h"
 #include "migration/register.h"
@@ -45,6 +47,7 @@ 
 #define VFIO_MIG_FLAG_DEV_CONFIG_STATE  (0xffffffffef100002ULL)
 #define VFIO_MIG_FLAG_DEV_SETUP_STATE   (0xffffffffef100003ULL)
 #define VFIO_MIG_FLAG_DEV_DATA_STATE    (0xffffffffef100004ULL)
+#define VFIO_MIG_FLAG_DEV_INIT_DATA_SENT (0xffffffffef100005ULL)
 
 /*
  * This is an arbitrary size based on migration of mlx5 devices, where typically
@@ -218,6 +221,7 @@  static void vfio_migration_cleanup(VFIODevice *vbasedev)
 
     close(migration->data_fd);
     migration->data_fd = -1;
+    migration->switchover_ack_needed = false;
 }
 
 static int vfio_query_stop_copy_size(VFIODevice *vbasedev,
@@ -350,6 +354,10 @@  static int vfio_save_setup(QEMUFile *f, void *opaque)
     if (vfio_precopy_supported(vbasedev)) {
         int ret;
 
+        if (migrate_switchover_ack()) {
+            migration->switchover_ack_needed = true;
+        }
+
         switch (migration->device_state) {
         case VFIO_DEVICE_STATE_RUNNING:
             ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY,
@@ -385,6 +393,7 @@  static void vfio_save_cleanup(void *opaque)
     migration->data_buffer = NULL;
     migration->precopy_init_size = 0;
     migration->precopy_dirty_size = 0;
+    migration->initial_data_sent = false;
     vfio_migration_cleanup(vbasedev);
     trace_vfio_save_cleanup(vbasedev->name);
 }
@@ -458,10 +467,17 @@  static int vfio_save_iterate(QEMUFile *f, void *opaque)
     if (data_size < 0) {
         return data_size;
     }
-    qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
 
     vfio_update_estimated_pending_data(migration, data_size);
 
+    if (migration->switchover_ack_needed && !migration->precopy_init_size &&
+        !migration->initial_data_sent) {
+        qemu_put_be64(f, VFIO_MIG_FLAG_DEV_INIT_DATA_SENT);
+        migration->initial_data_sent = true;
+    } else {
+        qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE);
+    }
+
     trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size,
                             migration->precopy_dirty_size);
 
@@ -526,6 +542,10 @@  static int vfio_load_setup(QEMUFile *f, void *opaque)
 {
     VFIODevice *vbasedev = opaque;
 
+    if (migrate_switchover_ack() && vfio_precopy_supported(vbasedev)) {
+        vbasedev->migration->switchover_ack_needed = true;
+    }
+
     return vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_RESUMING,
                                    vbasedev->migration->device_state);
 }
@@ -580,6 +600,23 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
             }
             break;
         }
+        case VFIO_MIG_FLAG_DEV_INIT_DATA_SENT:
+        {
+            if (!vbasedev->migration->switchover_ack_needed) {
+                error_report("%s: Received INIT_DATA_SENT but switchover ack "
+                             "is not needed", vbasedev->name);
+                return -EINVAL;
+            }
+
+            ret = qemu_loadvm_approve_switchover();
+            if (ret) {
+                error_report(
+                    "%s: qemu_loadvm_approve_switchover failed, err=%d (%s)",
+                    vbasedev->name, ret, strerror(-ret));
+            }
+
+            return ret;
+        }
         default:
             error_report("%s: Unknown tag 0x%"PRIx64, vbasedev->name, data);
             return -EINVAL;
@@ -594,6 +631,14 @@  static int vfio_load_state(QEMUFile *f, void *opaque, int version_id)
     return ret;
 }
 
+static bool vfio_switchover_ack_needed(void *opaque)
+{
+    VFIODevice *vbasedev = opaque;
+    VFIOMigration *migration = vbasedev->migration;
+
+    return migration->switchover_ack_needed;
+}
+
 static const SaveVMHandlers savevm_vfio_handlers = {
     .save_setup = vfio_save_setup,
     .save_cleanup = vfio_save_cleanup,
@@ -606,6 +651,7 @@  static const SaveVMHandlers savevm_vfio_handlers = {
     .load_setup = vfio_load_setup,
     .load_cleanup = vfio_load_cleanup,
     .load_state = vfio_load_state,
+    .switchover_ack_needed = vfio_switchover_ack_needed,
 };
 
 /* ---------------------------------------------------------------------- */