Message ID | 20230528140652.8693-8-avihaih@nvidia.com |
---|---|
State | New |
Headers | show |
Series | migration: Add switchover ack capability and VFIO precopy support | expand |
On 5/28/23 16:06, Avihai Horon wrote: > Pre-copy support allows the VFIO device data to be transferred while the > VM is running. This helps to accommodate VFIO devices that have a large > amount of data that needs to be transferred, and it can reduce migration > downtime. > > Pre-copy support is optional in VFIO migration protocol v2. > Implement pre-copy of VFIO migration protocol v2 and use it for devices > that support it. Full description of it can be found in the following > Linux commit: 4db52602a607 ("vfio: Extend the device migration protocol > with PRE_COPY"). > > Signed-off-by: Avihai Horon <avihaih@nvidia.com> LGTM, Reviewed-by: Cédric Le Goater <clg@redhat.com> one minor issue below, > --- > docs/devel/vfio-migration.rst | 35 +++++--- > include/hw/vfio/vfio-common.h | 2 + > hw/vfio/common.c | 6 +- > hw/vfio/migration.c | 165 ++++++++++++++++++++++++++++++++-- > hw/vfio/trace-events | 4 +- > 5 files changed, 190 insertions(+), 22 deletions(-) > > diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst > index 1b68ccf115..e896b2a673 100644 > --- a/docs/devel/vfio-migration.rst > +++ b/docs/devel/vfio-migration.rst > @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the > destination host. This document details how saving and restoring of VFIO > devices is done in QEMU. > > -Migration of VFIO devices currently consists of a single stop-and-copy phase. > -During the stop-and-copy phase the guest is stopped and the entire VFIO device > -data is transferred to the destination. > - > -The pre-copy phase of migration is currently not supported for VFIO devices. > -Support for VFIO pre-copy will be added later on. > +Migration of VFIO devices consists of two phases: the optional pre-copy phase, > +and the stop-and-copy phase. The pre-copy phase is iterative and allows to > +accommodate VFIO devices that have a large amount of data that needs to be > +transferred. The iterative pre-copy phase of migration allows for the guest to > +continue whilst the VFIO device state is transferred to the destination, this > +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy > +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the > +VFIO_DEVICE_FEATURE_MIGRATION ioctl. > > Note that currently VFIO migration is supported only for a single device. This > is due to VFIO migration's lack of P2P support. However, P2P support is planned > @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows: > * A ``load_setup`` function that sets the VFIO device on the destination in > _RESUMING state. > > +* A ``state_pending_estimate`` function that reports an estimate of the > + remaining pre-copy data that the vendor driver has yet to save for the VFIO > + device. > + > * A ``state_pending_exact`` function that reads pending_bytes from the vendor > driver, which indicates the amount of data that the vendor driver has yet to > save for the VFIO device. > > +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is > + active only when the VFIO device is in pre-copy states. > + > +* A ``save_live_iterate`` function that reads the VFIO device's data from the > + vendor driver during iterative pre-copy phase. > + > * A ``save_state`` function to save the device config space if it is present. > > * A ``save_live_complete_precopy`` function that sets the VFIO device in > @@ -111,8 +123,10 @@ Flow of state changes during Live migration > =========================================== > > Below is the flow of state change during live migration. > -The values in the brackets represent the VM state, the migration state, and > +The values in the parentheses represent the VM state, the migration state, and > the VFIO device state, respectively. > +The text in the square brackets represents the flow if the VFIO device supports > +pre-copy. > > Live migration save path > ------------------------ > @@ -124,11 +138,12 @@ Live migration save path > | > migrate_init spawns migration_thread > Migration thread then calls each device's .save_setup() > - (RUNNING, _SETUP, _RUNNING) > + (RUNNING, _SETUP, _RUNNING [_PRE_COPY]) > | > - (RUNNING, _ACTIVE, _RUNNING) > - If device is active, get pending_bytes by .state_pending_exact() > + (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY]) > + If device is active, get pending_bytes by .state_pending_{estimate,exact}() > If total pending_bytes >= threshold_size, call .save_live_iterate() > + [Data of VFIO device for pre-copy phase is copied] > Iterate till total pending bytes converge and are less than threshold > | > On migration completion, vCPU stops and calls .save_live_complete_precopy for > diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h > index 5f29dab839..1db901c194 100644 > --- a/include/hw/vfio/vfio-common.h > +++ b/include/hw/vfio/vfio-common.h > @@ -67,6 +67,8 @@ typedef struct VFIOMigration { > void *data_buffer; > size_t data_buffer_size; > uint64_t mig_flags; > + uint64_t precopy_init_size; > + uint64_t precopy_dirty_size; > } VFIOMigration; > > typedef struct VFIOAddressSpace { > diff --git a/hw/vfio/common.c b/hw/vfio/common.c > index 78358ede27..b73086e17a 100644 > --- a/hw/vfio/common.c > +++ b/hw/vfio/common.c > @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) > } > > if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF && > - migration->device_state == VFIO_DEVICE_STATE_RUNNING) { > + (migration->device_state == VFIO_DEVICE_STATE_RUNNING || > + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) { > return false; > } > } > @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) > return false; > } > > - if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) { > + if (migration->device_state == VFIO_DEVICE_STATE_RUNNING || > + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { > continue; > } else { > return false; > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 8d33414379..d8f6a22ae1 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state) > return "STOP_COPY"; > case VFIO_DEVICE_STATE_RESUMING: > return "RESUMING"; > + case VFIO_DEVICE_STATE_PRE_COPY: > + return "PRE_COPY"; > default: > return "UNKNOWN STATE"; > } > @@ -241,6 +243,25 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev, > return 0; > } > > +static int vfio_query_precopy_size(VFIOMigration *migration) > +{ > + struct vfio_precopy_info precopy = { > + .argsz = sizeof(precopy), > + }; > + > + migration->precopy_init_size = 0; > + migration->precopy_dirty_size = 0; > + > + if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) { > + return -errno; > + } > + > + migration->precopy_init_size = precopy.initial_bytes; > + migration->precopy_dirty_size = precopy.dirty_bytes; > + > + return 0; > +} > + > /* Returns the size of saved data on success and -errno on error */ > static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) > { > @@ -249,6 +270,14 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) > data_size = read(migration->data_fd, migration->data_buffer, > migration->data_buffer_size); > if (data_size < 0) { > + /* > + * Pre-copy emptied all the device state for now. For more information, > + * please refer to the Linux kernel VFIO uAPI. > + */ > + if (errno == ENOMSG) { > + return 0; > + } > + > return -errno; > } > if (data_size == 0) { > @@ -265,6 +294,38 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) > return qemu_file_get_error(f) ?: data_size; > } > > +static void vfio_update_estimated_pending_data(VFIOMigration *migration, > + uint64_t data_size) > +{ > + if (!data_size) { > + /* > + * Pre-copy emptied all the device state for now, update estimated sizes > + * accordingly. > + */ > + migration->precopy_init_size = 0; > + migration->precopy_dirty_size = 0; > + > + return; > + } > + > + if (migration->precopy_init_size) { > + uint64_t init_size = MIN(migration->precopy_init_size, data_size); > + > + migration->precopy_init_size -= init_size; > + data_size -= init_size; > + } > + > + migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size, > + data_size); > +} > + > +static bool vfio_precopy_supported(VFIODevice *vbasedev) > +{ > + VFIOMigration *migration = vbasedev->migration; > + > + return migration->mig_flags & VFIO_MIGRATION_PRE_COPY; > +} > + > /* ---------------------------------------------------------------------- */ > > static int vfio_save_setup(QEMUFile *f, void *opaque) > @@ -285,6 +346,28 @@ static int vfio_save_setup(QEMUFile *f, void *opaque) > return -ENOMEM; > } > > + if (vfio_precopy_supported(vbasedev)) { > + int ret; > + > + switch (migration->device_state) { > + case VFIO_DEVICE_STATE_RUNNING: > + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY, > + VFIO_DEVICE_STATE_RUNNING); > + if (ret) { > + return ret; > + } > + > + vfio_query_precopy_size(migration); > + > + break; > + case VFIO_DEVICE_STATE_STOP: > + /* vfio_save_complete_precopy() will go to STOP_COPY */ > + break; > + default: > + return -EINVAL; > + } > + } > + > trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size); > > qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > @@ -299,26 +382,42 @@ static void vfio_save_cleanup(void *opaque) > > g_free(migration->data_buffer); > migration->data_buffer = NULL; > + migration->precopy_init_size = 0; > + migration->precopy_dirty_size = 0; > vfio_migration_cleanup(vbasedev); > trace_vfio_save_cleanup(vbasedev->name); > } > > +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy, > + uint64_t *can_postcopy) > +{ > + VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > + > + if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) { > + return; > + } > + > + *must_precopy += > + migration->precopy_init_size + migration->precopy_dirty_size; > + > + trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy, > + *can_postcopy, > + migration->precopy_init_size, > + migration->precopy_dirty_size); > +} > + > /* > * Migration size of VFIO devices can be as little as a few KBs or as big as > * many GBs. This value should be big enough to cover the worst case. > */ > #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) > > -/* > - * Only exact function is implemented and not estimate function. The reason is > - * that during pre-copy phase of migration the estimate function is called > - * repeatedly while pending RAM size is over the threshold, thus migration > - * can't converge and querying the VFIO device pending data size is useless. > - */ > static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, > uint64_t *can_postcopy) > { > VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; > > /* > @@ -328,8 +427,48 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, > vfio_query_stop_copy_size(vbasedev, &stop_copy_size); > *must_precopy += stop_copy_size; > > + if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { > + vfio_query_precopy_size(migration); > + > + *must_precopy += > + migration->precopy_init_size + migration->precopy_dirty_size; > + } > + > trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy, > - stop_copy_size); > + stop_copy_size, migration->precopy_init_size, > + migration->precopy_dirty_size); > +} > + > +static bool vfio_is_active_iterate(void *opaque) > +{ > + VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > + > + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY; > +} > + > +static int vfio_save_iterate(QEMUFile *f, void *opaque) > +{ > + VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > + ssize_t data_size; > + > + data_size = vfio_save_block(f, migration); > + if (data_size < 0) { > + return data_size; > + } > + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); > + > + vfio_update_estimated_pending_data(migration, data_size); > + > + trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size, > + migration->precopy_dirty_size); > + > + /* > + * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero. > + * Return 1 so following handlers will not be potentially blocked. > + */ > + return 1; > } > > static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > @@ -338,7 +477,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) > ssize_t data_size; > int ret; > > - /* We reach here with device state STOP only */ > + /* We reach here with device state STOP or STOP_COPY only */ > ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, > VFIO_DEVICE_STATE_STOP); > if (ret) { > @@ -457,7 +596,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) > static const SaveVMHandlers savevm_vfio_handlers = { > .save_setup = vfio_save_setup, > .save_cleanup = vfio_save_cleanup, > + .state_pending_estimate = vfio_state_pending_estimate, > .state_pending_exact = vfio_state_pending_exact, > + .is_active_iterate = vfio_is_active_iterate, > + .save_live_iterate = vfio_save_iterate, > .save_live_complete_precopy = vfio_save_complete_precopy, > .save_state = vfio_save_state, > .load_setup = vfio_load_setup, > @@ -470,13 +612,18 @@ static const SaveVMHandlers savevm_vfio_handlers = { > static void vfio_vmstate_change(void *opaque, bool running, RunState state) > { > VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > enum vfio_device_mig_state new_state; > int ret; > > if (running) { > new_state = VFIO_DEVICE_STATE_RUNNING; > } else { > - new_state = VFIO_DEVICE_STATE_STOP; > + new_state = > + (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY && > + (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ? > + VFIO_DEVICE_STATE_STOP_COPY : > + VFIO_DEVICE_STATE_STOP; > } > > /* > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events > index 646e42fd27..548f9488a7 100644 > --- a/hw/vfio/trace-events > +++ b/hw/vfio/trace-events > @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d" > vfio_save_cleanup(const char *name) " (%s)" > vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" > vfio_save_device_config_state(const char *name) " (%s)" > +vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64" the extra '"' at the end breaks compile. No need to resend just for that. It can be fixed. Thanks, C. > vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64 > -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64 > +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 > +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 > vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
On 30/05/2023 12:28, Cédric Le Goater wrote: > External email: Use caution opening links or attachments > > > On 5/28/23 16:06, Avihai Horon wrote: >> Pre-copy support allows the VFIO device data to be transferred while the >> VM is running. This helps to accommodate VFIO devices that have a large >> amount of data that needs to be transferred, and it can reduce migration >> downtime. >> >> Pre-copy support is optional in VFIO migration protocol v2. >> Implement pre-copy of VFIO migration protocol v2 and use it for devices >> that support it. Full description of it can be found in the following >> Linux commit: 4db52602a607 ("vfio: Extend the device migration protocol >> with PRE_COPY"). >> >> Signed-off-by: Avihai Horon <avihaih@nvidia.com> > > LGTM, > > Reviewed-by: Cédric Le Goater <clg@redhat.com> > > one minor issue below, > >> --- >> docs/devel/vfio-migration.rst | 35 +++++--- >> include/hw/vfio/vfio-common.h | 2 + >> hw/vfio/common.c | 6 +- >> hw/vfio/migration.c | 165 ++++++++++++++++++++++++++++++++-- >> hw/vfio/trace-events | 4 +- >> 5 files changed, 190 insertions(+), 22 deletions(-) >> >> diff --git a/docs/devel/vfio-migration.rst >> b/docs/devel/vfio-migration.rst >> index 1b68ccf115..e896b2a673 100644 >> --- a/docs/devel/vfio-migration.rst >> +++ b/docs/devel/vfio-migration.rst >> @@ -7,12 +7,14 @@ the guest is running on source host and restoring >> this saved state on the >> destination host. This document details how saving and restoring of >> VFIO >> devices is done in QEMU. >> >> -Migration of VFIO devices currently consists of a single >> stop-and-copy phase. >> -During the stop-and-copy phase the guest is stopped and the entire >> VFIO device >> -data is transferred to the destination. >> - >> -The pre-copy phase of migration is currently not supported for VFIO >> devices. >> -Support for VFIO pre-copy will be added later on. >> +Migration of VFIO devices consists of two phases: the optional >> pre-copy phase, >> +and the stop-and-copy phase. The pre-copy phase is iterative and >> allows to >> +accommodate VFIO devices that have a large amount of data that needs >> to be >> +transferred. The iterative pre-copy phase of migration allows for >> the guest to >> +continue whilst the VFIO device state is transferred to the >> destination, this >> +helps to reduce the total downtime of the VM. VFIO devices opt-in to >> pre-copy >> +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the >> +VFIO_DEVICE_FEATURE_MIGRATION ioctl. >> >> Note that currently VFIO migration is supported only for a single >> device. This >> is due to VFIO migration's lack of P2P support. However, P2P >> support is planned >> @@ -29,10 +31,20 @@ VFIO implements the device hooks for the >> iterative approach as follows: >> * A ``load_setup`` function that sets the VFIO device on the >> destination in >> _RESUMING state. >> >> +* A ``state_pending_estimate`` function that reports an estimate of the >> + remaining pre-copy data that the vendor driver has yet to save for >> the VFIO >> + device. >> + >> * A ``state_pending_exact`` function that reads pending_bytes from >> the vendor >> driver, which indicates the amount of data that the vendor driver >> has yet to >> save for the VFIO device. >> >> +* An ``is_active_iterate`` function that indicates >> ``save_live_iterate`` is >> + active only when the VFIO device is in pre-copy states. >> + >> +* A ``save_live_iterate`` function that reads the VFIO device's data >> from the >> + vendor driver during iterative pre-copy phase. >> + >> * A ``save_state`` function to save the device config space if it >> is present. >> >> * A ``save_live_complete_precopy`` function that sets the VFIO >> device in >> @@ -111,8 +123,10 @@ Flow of state changes during Live migration >> =========================================== >> >> Below is the flow of state change during live migration. >> -The values in the brackets represent the VM state, the migration >> state, and >> +The values in the parentheses represent the VM state, the migration >> state, and >> the VFIO device state, respectively. >> +The text in the square brackets represents the flow if the VFIO >> device supports >> +pre-copy. >> >> Live migration save path >> ------------------------ >> @@ -124,11 +138,12 @@ Live migration save path >> | >> migrate_init spawns migration_thread >> Migration thread then calls each device's >> .save_setup() >> - (RUNNING, _SETUP, _RUNNING) >> + (RUNNING, _SETUP, _RUNNING [_PRE_COPY]) >> | >> - (RUNNING, _ACTIVE, _RUNNING) >> - If device is active, get pending_bytes by >> .state_pending_exact() >> + (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY]) >> + If device is active, get pending_bytes by >> .state_pending_{estimate,exact}() >> If total pending_bytes >= threshold_size, call >> .save_live_iterate() >> + [Data of VFIO device for pre-copy phase is copied] >> Iterate till total pending bytes converge and are less than >> threshold >> | >> On migration completion, vCPU stops and calls >> .save_live_complete_precopy for >> diff --git a/include/hw/vfio/vfio-common.h >> b/include/hw/vfio/vfio-common.h >> index 5f29dab839..1db901c194 100644 >> --- a/include/hw/vfio/vfio-common.h >> +++ b/include/hw/vfio/vfio-common.h >> @@ -67,6 +67,8 @@ typedef struct VFIOMigration { >> void *data_buffer; >> size_t data_buffer_size; >> uint64_t mig_flags; >> + uint64_t precopy_init_size; >> + uint64_t precopy_dirty_size; >> } VFIOMigration; >> >> typedef struct VFIOAddressSpace { >> diff --git a/hw/vfio/common.c b/hw/vfio/common.c >> index 78358ede27..b73086e17a 100644 >> --- a/hw/vfio/common.c >> +++ b/hw/vfio/common.c >> @@ -492,7 +492,8 @@ static bool >> vfio_devices_all_dirty_tracking(VFIOContainer *container) >> } >> >> if (vbasedev->pre_copy_dirty_page_tracking == >> ON_OFF_AUTO_OFF && >> - migration->device_state == VFIO_DEVICE_STATE_RUNNING) { >> + (migration->device_state == >> VFIO_DEVICE_STATE_RUNNING || >> + migration->device_state == >> VFIO_DEVICE_STATE_PRE_COPY)) { >> return false; >> } >> } >> @@ -537,7 +538,8 @@ static bool >> vfio_devices_all_running_and_mig_active(VFIOContainer *container) >> return false; >> } >> >> - if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) { >> + if (migration->device_state == VFIO_DEVICE_STATE_RUNNING || >> + migration->device_state == >> VFIO_DEVICE_STATE_PRE_COPY) { >> continue; >> } else { >> return false; >> diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c >> index 8d33414379..d8f6a22ae1 100644 >> --- a/hw/vfio/migration.c >> +++ b/hw/vfio/migration.c >> @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum >> vfio_device_mig_state state) >> return "STOP_COPY"; >> case VFIO_DEVICE_STATE_RESUMING: >> return "RESUMING"; >> + case VFIO_DEVICE_STATE_PRE_COPY: >> + return "PRE_COPY"; >> default: >> return "UNKNOWN STATE"; >> } >> @@ -241,6 +243,25 @@ static int vfio_query_stop_copy_size(VFIODevice >> *vbasedev, >> return 0; >> } >> >> +static int vfio_query_precopy_size(VFIOMigration *migration) >> +{ >> + struct vfio_precopy_info precopy = { >> + .argsz = sizeof(precopy), >> + }; >> + >> + migration->precopy_init_size = 0; >> + migration->precopy_dirty_size = 0; >> + >> + if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, >> &precopy)) { >> + return -errno; >> + } >> + >> + migration->precopy_init_size = precopy.initial_bytes; >> + migration->precopy_dirty_size = precopy.dirty_bytes; >> + >> + return 0; >> +} >> + >> /* Returns the size of saved data on success and -errno on error */ >> static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) >> { >> @@ -249,6 +270,14 @@ static ssize_t vfio_save_block(QEMUFile *f, >> VFIOMigration *migration) >> data_size = read(migration->data_fd, migration->data_buffer, >> migration->data_buffer_size); >> if (data_size < 0) { >> + /* >> + * Pre-copy emptied all the device state for now. For more >> information, >> + * please refer to the Linux kernel VFIO uAPI. >> + */ >> + if (errno == ENOMSG) { >> + return 0; >> + } >> + >> return -errno; >> } >> if (data_size == 0) { >> @@ -265,6 +294,38 @@ static ssize_t vfio_save_block(QEMUFile *f, >> VFIOMigration *migration) >> return qemu_file_get_error(f) ?: data_size; >> } >> >> +static void vfio_update_estimated_pending_data(VFIOMigration >> *migration, >> + uint64_t data_size) >> +{ >> + if (!data_size) { >> + /* >> + * Pre-copy emptied all the device state for now, update >> estimated sizes >> + * accordingly. >> + */ >> + migration->precopy_init_size = 0; >> + migration->precopy_dirty_size = 0; >> + >> + return; >> + } >> + >> + if (migration->precopy_init_size) { >> + uint64_t init_size = MIN(migration->precopy_init_size, >> data_size); >> + >> + migration->precopy_init_size -= init_size; >> + data_size -= init_size; >> + } >> + >> + migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size, >> + data_size); >> +} >> + >> +static bool vfio_precopy_supported(VFIODevice *vbasedev) >> +{ >> + VFIOMigration *migration = vbasedev->migration; >> + >> + return migration->mig_flags & VFIO_MIGRATION_PRE_COPY; >> +} >> + >> /* >> ---------------------------------------------------------------------- >> */ >> >> static int vfio_save_setup(QEMUFile *f, void *opaque) >> @@ -285,6 +346,28 @@ static int vfio_save_setup(QEMUFile *f, void >> *opaque) >> return -ENOMEM; >> } >> >> + if (vfio_precopy_supported(vbasedev)) { >> + int ret; >> + >> + switch (migration->device_state) { >> + case VFIO_DEVICE_STATE_RUNNING: >> + ret = vfio_migration_set_state(vbasedev, >> VFIO_DEVICE_STATE_PRE_COPY, >> + VFIO_DEVICE_STATE_RUNNING); >> + if (ret) { >> + return ret; >> + } >> + >> + vfio_query_precopy_size(migration); >> + >> + break; >> + case VFIO_DEVICE_STATE_STOP: >> + /* vfio_save_complete_precopy() will go to STOP_COPY */ >> + break; >> + default: >> + return -EINVAL; >> + } >> + } >> + >> trace_vfio_save_setup(vbasedev->name, >> migration->data_buffer_size); >> >> qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); >> @@ -299,26 +382,42 @@ static void vfio_save_cleanup(void *opaque) >> >> g_free(migration->data_buffer); >> migration->data_buffer = NULL; >> + migration->precopy_init_size = 0; >> + migration->precopy_dirty_size = 0; >> vfio_migration_cleanup(vbasedev); >> trace_vfio_save_cleanup(vbasedev->name); >> } >> >> +static void vfio_state_pending_estimate(void *opaque, uint64_t >> *must_precopy, >> + uint64_t *can_postcopy) >> +{ >> + VFIODevice *vbasedev = opaque; >> + VFIOMigration *migration = vbasedev->migration; >> + >> + if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) { >> + return; >> + } >> + >> + *must_precopy += >> + migration->precopy_init_size + migration->precopy_dirty_size; >> + >> + trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy, >> + *can_postcopy, >> + migration->precopy_init_size, >> + migration->precopy_dirty_size); >> +} >> + >> /* >> * Migration size of VFIO devices can be as little as a few KBs or >> as big as >> * many GBs. This value should be big enough to cover the worst case. >> */ >> #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) >> >> -/* >> - * Only exact function is implemented and not estimate function. The >> reason is >> - * that during pre-copy phase of migration the estimate function is >> called >> - * repeatedly while pending RAM size is over the threshold, thus >> migration >> - * can't converge and querying the VFIO device pending data size is >> useless. >> - */ >> static void vfio_state_pending_exact(void *opaque, uint64_t >> *must_precopy, >> uint64_t *can_postcopy) >> { >> VFIODevice *vbasedev = opaque; >> + VFIOMigration *migration = vbasedev->migration; >> uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; >> >> /* >> @@ -328,8 +427,48 @@ static void vfio_state_pending_exact(void >> *opaque, uint64_t *must_precopy, >> vfio_query_stop_copy_size(vbasedev, &stop_copy_size); >> *must_precopy += stop_copy_size; >> >> + if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { >> + vfio_query_precopy_size(migration); >> + >> + *must_precopy += >> + migration->precopy_init_size + >> migration->precopy_dirty_size; >> + } >> + >> trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, >> *can_postcopy, >> - stop_copy_size); >> + stop_copy_size, >> migration->precopy_init_size, >> + migration->precopy_dirty_size); >> +} >> + >> +static bool vfio_is_active_iterate(void *opaque) >> +{ >> + VFIODevice *vbasedev = opaque; >> + VFIOMigration *migration = vbasedev->migration; >> + >> + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY; >> +} >> + >> +static int vfio_save_iterate(QEMUFile *f, void *opaque) >> +{ >> + VFIODevice *vbasedev = opaque; >> + VFIOMigration *migration = vbasedev->migration; >> + ssize_t data_size; >> + >> + data_size = vfio_save_block(f, migration); >> + if (data_size < 0) { >> + return data_size; >> + } >> + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); >> + >> + vfio_update_estimated_pending_data(migration, data_size); >> + >> + trace_vfio_save_iterate(vbasedev->name, >> migration->precopy_init_size, >> + migration->precopy_dirty_size); >> + >> + /* >> + * A VFIO device's pre-copy dirty_bytes is not guaranteed to >> reach zero. >> + * Return 1 so following handlers will not be potentially blocked. >> + */ >> + return 1; >> } >> >> static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) >> @@ -338,7 +477,7 @@ static int vfio_save_complete_precopy(QEMUFile >> *f, void *opaque) >> ssize_t data_size; >> int ret; >> >> - /* We reach here with device state STOP only */ >> + /* We reach here with device state STOP or STOP_COPY only */ >> ret = vfio_migration_set_state(vbasedev, >> VFIO_DEVICE_STATE_STOP_COPY, >> VFIO_DEVICE_STATE_STOP); >> if (ret) { >> @@ -457,7 +596,10 @@ static int vfio_load_state(QEMUFile *f, void >> *opaque, int version_id) >> static const SaveVMHandlers savevm_vfio_handlers = { >> .save_setup = vfio_save_setup, >> .save_cleanup = vfio_save_cleanup, >> + .state_pending_estimate = vfio_state_pending_estimate, >> .state_pending_exact = vfio_state_pending_exact, >> + .is_active_iterate = vfio_is_active_iterate, >> + .save_live_iterate = vfio_save_iterate, >> .save_live_complete_precopy = vfio_save_complete_precopy, >> .save_state = vfio_save_state, >> .load_setup = vfio_load_setup, >> @@ -470,13 +612,18 @@ static const SaveVMHandlers >> savevm_vfio_handlers = { >> static void vfio_vmstate_change(void *opaque, bool running, >> RunState state) >> { >> VFIODevice *vbasedev = opaque; >> + VFIOMigration *migration = vbasedev->migration; >> enum vfio_device_mig_state new_state; >> int ret; >> >> if (running) { >> new_state = VFIO_DEVICE_STATE_RUNNING; >> } else { >> - new_state = VFIO_DEVICE_STATE_STOP; >> + new_state = >> + (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY && >> + (state == RUN_STATE_FINISH_MIGRATE || state == >> RUN_STATE_PAUSED)) ? >> + VFIO_DEVICE_STATE_STOP_COPY : >> + VFIO_DEVICE_STATE_STOP; >> } >> >> /* >> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events >> index 646e42fd27..548f9488a7 100644 >> --- a/hw/vfio/trace-events >> +++ b/hw/vfio/trace-events >> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) >> " (%s) data_size %d" >> vfio_save_cleanup(const char *name) " (%s)" >> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" >> vfio_save_device_config_state(const char *name) " (%s)" >> +vfio_save_iterate(const char *name, uint64_t precopy_init_size, >> uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" >> precopy dirty size 0x%"PRIx64" > > the extra '"' at the end breaks compile. No need to resend just for that. > It can be fixed. > Oh, strange that it doesn't break when I compile it. Do you have any idea why would that be? Thanks! > > >> vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) >> data buffer size 0x%"PRIx64 >> -vfio_state_pending_exact(const char *name, uint64_t precopy, >> uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" >> postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64 >> +vfio_state_pending_estimate(const char *name, uint64_t precopy, >> uint64_t postcopy, uint64_t precopy_init_size, uint64_t >> precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" >> precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 >> +vfio_state_pending_exact(const char *name, uint64_t precopy, >> uint64_t postcopy, uint64_t stopcopy_size, uint64_t >> precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy >> 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy >> initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 >> vfio_vmstate_change(const char *name, int running, const char >> *reason, const char *dev_state) " (%s) running %d reason %s device >> state %s" >
>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events >>> index 646e42fd27..548f9488a7 100644 >>> --- a/hw/vfio/trace-events >>> +++ b/hw/vfio/trace-events >>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d" >>> vfio_save_cleanup(const char *name) " (%s)" >>> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" >>> vfio_save_device_config_state(const char *name) " (%s)" >>> +vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64" >> >> the extra '"' at the end breaks compile. No need to resend just for that. >> It can be fixed. >> > Oh, strange that it doesn't break when I compile it. > Do you have any idea why would that be? It generates a -Werror=format= . Did you configure the build with --disable-werror ? C.
On 30/05/2023 13:17, Cédric Le Goater wrote: > External email: Use caution opening links or attachments > > >>>> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events >>>> index 646e42fd27..548f9488a7 100644 >>>> --- a/hw/vfio/trace-events >>>> +++ b/hw/vfio/trace-events >>>> @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int >>>> data_size) " (%s) data_size %d" >>>> vfio_save_cleanup(const char *name) " (%s)" >>>> vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" >>>> vfio_save_device_config_state(const char *name) " (%s)" >>>> +vfio_save_iterate(const char *name, uint64_t precopy_init_size, >>>> uint64_t precopy_dirty_size) " (%s) precopy initial size >>>> 0x%"PRIx64" precopy dirty size 0x%"PRIx64" >>> >>> the extra '"' at the end breaks compile. No need to resend just for >>> that. >>> It can be fixed. >>> >> Oh, strange that it doesn't break when I compile it. >> Do you have any idea why would that be? > > It generates a -Werror=format= . > > Did you configure the build with --disable-werror ? Nope. configure prints this: User defined options [...] werror : true
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst index 1b68ccf115..e896b2a673 100644 --- a/docs/devel/vfio-migration.rst +++ b/docs/devel/vfio-migration.rst @@ -7,12 +7,14 @@ the guest is running on source host and restoring this saved state on the destination host. This document details how saving and restoring of VFIO devices is done in QEMU. -Migration of VFIO devices currently consists of a single stop-and-copy phase. -During the stop-and-copy phase the guest is stopped and the entire VFIO device -data is transferred to the destination. - -The pre-copy phase of migration is currently not supported for VFIO devices. -Support for VFIO pre-copy will be added later on. +Migration of VFIO devices consists of two phases: the optional pre-copy phase, +and the stop-and-copy phase. The pre-copy phase is iterative and allows to +accommodate VFIO devices that have a large amount of data that needs to be +transferred. The iterative pre-copy phase of migration allows for the guest to +continue whilst the VFIO device state is transferred to the destination, this +helps to reduce the total downtime of the VM. VFIO devices opt-in to pre-copy +support by reporting the VFIO_MIGRATION_PRE_COPY flag in the +VFIO_DEVICE_FEATURE_MIGRATION ioctl. Note that currently VFIO migration is supported only for a single device. This is due to VFIO migration's lack of P2P support. However, P2P support is planned @@ -29,10 +31,20 @@ VFIO implements the device hooks for the iterative approach as follows: * A ``load_setup`` function that sets the VFIO device on the destination in _RESUMING state. +* A ``state_pending_estimate`` function that reports an estimate of the + remaining pre-copy data that the vendor driver has yet to save for the VFIO + device. + * A ``state_pending_exact`` function that reads pending_bytes from the vendor driver, which indicates the amount of data that the vendor driver has yet to save for the VFIO device. +* An ``is_active_iterate`` function that indicates ``save_live_iterate`` is + active only when the VFIO device is in pre-copy states. + +* A ``save_live_iterate`` function that reads the VFIO device's data from the + vendor driver during iterative pre-copy phase. + * A ``save_state`` function to save the device config space if it is present. * A ``save_live_complete_precopy`` function that sets the VFIO device in @@ -111,8 +123,10 @@ Flow of state changes during Live migration =========================================== Below is the flow of state change during live migration. -The values in the brackets represent the VM state, the migration state, and +The values in the parentheses represent the VM state, the migration state, and the VFIO device state, respectively. +The text in the square brackets represents the flow if the VFIO device supports +pre-copy. Live migration save path ------------------------ @@ -124,11 +138,12 @@ Live migration save path | migrate_init spawns migration_thread Migration thread then calls each device's .save_setup() - (RUNNING, _SETUP, _RUNNING) + (RUNNING, _SETUP, _RUNNING [_PRE_COPY]) | - (RUNNING, _ACTIVE, _RUNNING) - If device is active, get pending_bytes by .state_pending_exact() + (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY]) + If device is active, get pending_bytes by .state_pending_{estimate,exact}() If total pending_bytes >= threshold_size, call .save_live_iterate() + [Data of VFIO device for pre-copy phase is copied] Iterate till total pending bytes converge and are less than threshold | On migration completion, vCPU stops and calls .save_live_complete_precopy for diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h index 5f29dab839..1db901c194 100644 --- a/include/hw/vfio/vfio-common.h +++ b/include/hw/vfio/vfio-common.h @@ -67,6 +67,8 @@ typedef struct VFIOMigration { void *data_buffer; size_t data_buffer_size; uint64_t mig_flags; + uint64_t precopy_init_size; + uint64_t precopy_dirty_size; } VFIOMigration; typedef struct VFIOAddressSpace { diff --git a/hw/vfio/common.c b/hw/vfio/common.c index 78358ede27..b73086e17a 100644 --- a/hw/vfio/common.c +++ b/hw/vfio/common.c @@ -492,7 +492,8 @@ static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) } if (vbasedev->pre_copy_dirty_page_tracking == ON_OFF_AUTO_OFF && - migration->device_state == VFIO_DEVICE_STATE_RUNNING) { + (migration->device_state == VFIO_DEVICE_STATE_RUNNING || + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY)) { return false; } } @@ -537,7 +538,8 @@ static bool vfio_devices_all_running_and_mig_active(VFIOContainer *container) return false; } - if (migration->device_state == VFIO_DEVICE_STATE_RUNNING) { + if (migration->device_state == VFIO_DEVICE_STATE_RUNNING || + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { continue; } else { return false; diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c index 8d33414379..d8f6a22ae1 100644 --- a/hw/vfio/migration.c +++ b/hw/vfio/migration.c @@ -68,6 +68,8 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state) return "STOP_COPY"; case VFIO_DEVICE_STATE_RESUMING: return "RESUMING"; + case VFIO_DEVICE_STATE_PRE_COPY: + return "PRE_COPY"; default: return "UNKNOWN STATE"; } @@ -241,6 +243,25 @@ static int vfio_query_stop_copy_size(VFIODevice *vbasedev, return 0; } +static int vfio_query_precopy_size(VFIOMigration *migration) +{ + struct vfio_precopy_info precopy = { + .argsz = sizeof(precopy), + }; + + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; + + if (ioctl(migration->data_fd, VFIO_MIG_GET_PRECOPY_INFO, &precopy)) { + return -errno; + } + + migration->precopy_init_size = precopy.initial_bytes; + migration->precopy_dirty_size = precopy.dirty_bytes; + + return 0; +} + /* Returns the size of saved data on success and -errno on error */ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) { @@ -249,6 +270,14 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) data_size = read(migration->data_fd, migration->data_buffer, migration->data_buffer_size); if (data_size < 0) { + /* + * Pre-copy emptied all the device state for now. For more information, + * please refer to the Linux kernel VFIO uAPI. + */ + if (errno == ENOMSG) { + return 0; + } + return -errno; } if (data_size == 0) { @@ -265,6 +294,38 @@ static ssize_t vfio_save_block(QEMUFile *f, VFIOMigration *migration) return qemu_file_get_error(f) ?: data_size; } +static void vfio_update_estimated_pending_data(VFIOMigration *migration, + uint64_t data_size) +{ + if (!data_size) { + /* + * Pre-copy emptied all the device state for now, update estimated sizes + * accordingly. + */ + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; + + return; + } + + if (migration->precopy_init_size) { + uint64_t init_size = MIN(migration->precopy_init_size, data_size); + + migration->precopy_init_size -= init_size; + data_size -= init_size; + } + + migration->precopy_dirty_size -= MIN(migration->precopy_dirty_size, + data_size); +} + +static bool vfio_precopy_supported(VFIODevice *vbasedev) +{ + VFIOMigration *migration = vbasedev->migration; + + return migration->mig_flags & VFIO_MIGRATION_PRE_COPY; +} + /* ---------------------------------------------------------------------- */ static int vfio_save_setup(QEMUFile *f, void *opaque) @@ -285,6 +346,28 @@ static int vfio_save_setup(QEMUFile *f, void *opaque) return -ENOMEM; } + if (vfio_precopy_supported(vbasedev)) { + int ret; + + switch (migration->device_state) { + case VFIO_DEVICE_STATE_RUNNING: + ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_PRE_COPY, + VFIO_DEVICE_STATE_RUNNING); + if (ret) { + return ret; + } + + vfio_query_precopy_size(migration); + + break; + case VFIO_DEVICE_STATE_STOP: + /* vfio_save_complete_precopy() will go to STOP_COPY */ + break; + default: + return -EINVAL; + } + } + trace_vfio_save_setup(vbasedev->name, migration->data_buffer_size); qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); @@ -299,26 +382,42 @@ static void vfio_save_cleanup(void *opaque) g_free(migration->data_buffer); migration->data_buffer = NULL; + migration->precopy_init_size = 0; + migration->precopy_dirty_size = 0; vfio_migration_cleanup(vbasedev); trace_vfio_save_cleanup(vbasedev->name); } +static void vfio_state_pending_estimate(void *opaque, uint64_t *must_precopy, + uint64_t *can_postcopy) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + + if (migration->device_state != VFIO_DEVICE_STATE_PRE_COPY) { + return; + } + + *must_precopy += + migration->precopy_init_size + migration->precopy_dirty_size; + + trace_vfio_state_pending_estimate(vbasedev->name, *must_precopy, + *can_postcopy, + migration->precopy_init_size, + migration->precopy_dirty_size); +} + /* * Migration size of VFIO devices can be as little as a few KBs or as big as * many GBs. This value should be big enough to cover the worst case. */ #define VFIO_MIG_STOP_COPY_SIZE (100 * GiB) -/* - * Only exact function is implemented and not estimate function. The reason is - * that during pre-copy phase of migration the estimate function is called - * repeatedly while pending RAM size is over the threshold, thus migration - * can't converge and querying the VFIO device pending data size is useless. - */ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, uint64_t *can_postcopy) { VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; uint64_t stop_copy_size = VFIO_MIG_STOP_COPY_SIZE; /* @@ -328,8 +427,48 @@ static void vfio_state_pending_exact(void *opaque, uint64_t *must_precopy, vfio_query_stop_copy_size(vbasedev, &stop_copy_size); *must_precopy += stop_copy_size; + if (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY) { + vfio_query_precopy_size(migration); + + *must_precopy += + migration->precopy_init_size + migration->precopy_dirty_size; + } + trace_vfio_state_pending_exact(vbasedev->name, *must_precopy, *can_postcopy, - stop_copy_size); + stop_copy_size, migration->precopy_init_size, + migration->precopy_dirty_size); +} + +static bool vfio_is_active_iterate(void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY; +} + +static int vfio_save_iterate(QEMUFile *f, void *opaque) +{ + VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; + ssize_t data_size; + + data_size = vfio_save_block(f, migration); + if (data_size < 0) { + return data_size; + } + qemu_put_be64(f, VFIO_MIG_FLAG_END_OF_STATE); + + vfio_update_estimated_pending_data(migration, data_size); + + trace_vfio_save_iterate(vbasedev->name, migration->precopy_init_size, + migration->precopy_dirty_size); + + /* + * A VFIO device's pre-copy dirty_bytes is not guaranteed to reach zero. + * Return 1 so following handlers will not be potentially blocked. + */ + return 1; } static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) @@ -338,7 +477,7 @@ static int vfio_save_complete_precopy(QEMUFile *f, void *opaque) ssize_t data_size; int ret; - /* We reach here with device state STOP only */ + /* We reach here with device state STOP or STOP_COPY only */ ret = vfio_migration_set_state(vbasedev, VFIO_DEVICE_STATE_STOP_COPY, VFIO_DEVICE_STATE_STOP); if (ret) { @@ -457,7 +596,10 @@ static int vfio_load_state(QEMUFile *f, void *opaque, int version_id) static const SaveVMHandlers savevm_vfio_handlers = { .save_setup = vfio_save_setup, .save_cleanup = vfio_save_cleanup, + .state_pending_estimate = vfio_state_pending_estimate, .state_pending_exact = vfio_state_pending_exact, + .is_active_iterate = vfio_is_active_iterate, + .save_live_iterate = vfio_save_iterate, .save_live_complete_precopy = vfio_save_complete_precopy, .save_state = vfio_save_state, .load_setup = vfio_load_setup, @@ -470,13 +612,18 @@ static const SaveVMHandlers savevm_vfio_handlers = { static void vfio_vmstate_change(void *opaque, bool running, RunState state) { VFIODevice *vbasedev = opaque; + VFIOMigration *migration = vbasedev->migration; enum vfio_device_mig_state new_state; int ret; if (running) { new_state = VFIO_DEVICE_STATE_RUNNING; } else { - new_state = VFIO_DEVICE_STATE_STOP; + new_state = + (migration->device_state == VFIO_DEVICE_STATE_PRE_COPY && + (state == RUN_STATE_FINISH_MIGRATE || state == RUN_STATE_PAUSED)) ? + VFIO_DEVICE_STATE_STOP_COPY : + VFIO_DEVICE_STATE_STOP; } /* diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events index 646e42fd27..548f9488a7 100644 --- a/hw/vfio/trace-events +++ b/hw/vfio/trace-events @@ -162,6 +162,8 @@ vfio_save_block(const char *name, int data_size) " (%s) data_size %d" vfio_save_cleanup(const char *name) " (%s)" vfio_save_complete_precopy(const char *name, int ret) " (%s) ret %d" vfio_save_device_config_state(const char *name) " (%s)" +vfio_save_iterate(const char *name, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64" vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer size 0x%"PRIx64 -vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64 +vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 +vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"
Pre-copy support allows the VFIO device data to be transferred while the VM is running. This helps to accommodate VFIO devices that have a large amount of data that needs to be transferred, and it can reduce migration downtime. Pre-copy support is optional in VFIO migration protocol v2. Implement pre-copy of VFIO migration protocol v2 and use it for devices that support it. Full description of it can be found in the following Linux commit: 4db52602a607 ("vfio: Extend the device migration protocol with PRE_COPY"). Signed-off-by: Avihai Horon <avihaih@nvidia.com> --- docs/devel/vfio-migration.rst | 35 +++++--- include/hw/vfio/vfio-common.h | 2 + hw/vfio/common.c | 6 +- hw/vfio/migration.c | 165 ++++++++++++++++++++++++++++++++-- hw/vfio/trace-events | 4 +- 5 files changed, 190 insertions(+), 22 deletions(-)