diff mbox series

[v3,1/1] docs/devel: Add VFIO device migration documentation

Message ID 20210326131850.149337-1-targupta@nvidia.com
State New

Commit Message

Tarun Gupta March 26, 2021, 1:18 p.m. UTC
Document interfaces used for VFIO device migration. Added flow of state changes
during live migration with VFIO device. Tested by building docs with the new
vfio-migration.rst file.

v3:
- Add introductory line about VM migration in general.
- Remove occurrences of vfio_pin_pages() to describe pinning.
- Incorporated comments from v2

v2:
- Included the new vfio-migration.rst file in index.rst
- Updated dirty page tracking section, also added details about
  'pre-copy-dirty-page-tracking' opt-out option.
- Incorporated comments around wording of doc.

Signed-off-by: Tarun Gupta <targupta@nvidia.com>
Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
---
 MAINTAINERS                   |   1 +
 docs/devel/index.rst          |   1 +
 docs/devel/vfio-migration.rst | 143 ++++++++++++++++++++++++++++++++++
 3 files changed, 145 insertions(+)
 create mode 100644 docs/devel/vfio-migration.rst

Comments

Shenming Lu March 27, 2021, 6:04 a.m. UTC | #1
On 2021/3/26 21:18, Tarun Gupta wrote:
> Document interfaces used for VFIO device migration. Added flow of state changes
> during live migration with VFIO device. Tested by building docs with the new
> vfio-migration.rst file.
> 
> v3:
> - Add introductory line about VM migration in general.
> - Remove occurcences of vfio_pin_pages() to describe pinning.
> - Incorporated comments from v2
> 
> v2:
> - Included the new vfio-migration.rst file in index.rst
> - Updated dirty page tracking section, also added details about
>   'pre-copy-dirty-page-tracking' opt-out option.
> - Incorporated comments around wording of doc.
> 
> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/index.rst          |   1 +
>  docs/devel/vfio-migration.rst | 143 ++++++++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 738786146d..a2a80eee59 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>  S: Supported
>  F: hw/vfio/*
>  F: include/hw/vfio/
> +F: docs/devel/vfio-migration.rst
>  
>  vfio-ccw
>  M: Cornelia Huck <cohuck@redhat.com>
> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
> index ae664da00c..5330f1ca1d 100644
> --- a/docs/devel/index.rst
> +++ b/docs/devel/index.rst
> @@ -39,3 +39,4 @@ Contents:
>     qom
>     block-coroutine-wrapper
>     multi-process
> +   vfio-migration
> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> new file mode 100644
> index 0000000000..24cb55991a
> --- /dev/null
> +++ b/docs/devel/vfio-migration.rst
> @@ -0,0 +1,143 @@
> +=====================
> +VFIO device Migration
> +=====================
> +
> +Migration of a virtual machine involves saving the state of each device that
> +the guest is running on the source host and restoring this saved state on the
> +destination host. This document details how saving and restoring of VFIO
> +devices is done in QEMU.
> +
> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
> +and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
> +VFIO devices that have a large amount of data to transfer. The iterative
> +pre-copy phase allows the guest to continue running whilst the VFIO device
> +state is transferred to the destination; this helps to reduce the total
> +downtime of the VM. VFIO devices can choose to skip the pre-copy phase of
> +migration by returning pending_bytes as zero during the pre-copy phase.
> +
> +A detailed description of the UAPI for VFIO device migration can be found in
> +the comment for the ``vfio_device_migration_info`` structure in the header
> +file linux-headers/linux/vfio.h.
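
[Editorial note: for orientation, the v1 migration interface referenced above can be sketched as bit-flag device states plus a data window. The following C rendering is illustrative only; the flag values and field layout should be checked against the authoritative definition in linux-headers/linux/vfio.h.]

```c
#include <assert.h>
#include <stdint.h>

/* Simplified sketch of the v1 migration interface; illustrative, not a
 * substitute for the real linux/vfio.h header. */
#define VFIO_DEVICE_STATE_STOP     0
#define VFIO_DEVICE_STATE_RUNNING  (1 << 0)
#define VFIO_DEVICE_STATE_SAVING   (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING (1 << 2)

struct vfio_device_migration_info_sketch {
    uint32_t device_state;    /* combination of the flags above */
    uint32_t reserved;
    uint64_t pending_bytes;   /* data the vendor driver still has to save */
    uint64_t data_offset;     /* start of the data window in the region */
    uint64_t data_size;       /* amount of data currently in the window */
};

/* _SAVING and _RESUMING never make sense at the same time */
static int state_valid(uint32_t state)
{
    return !((state & VFIO_DEVICE_STATE_SAVING) &&
             (state & VFIO_DEVICE_STATE_RESUMING));
}
```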
> +
> +VFIO device hooks for iterative approach:
> +
> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
> +  dirty page tracking.
> +
> +* A ``load_setup`` function that sets up the migration region on the
> +  destination and sets _RESUMING flag in the VFIO device state.
> +
> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
> +  driver, which indicates the amount of data that the vendor driver has yet to
> +  save for the VFIO device.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver through the migration region during iterative phase.
> +
> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
> +  VFIO device state, saves the device config space, if any, and iteratively
> +  copies the remaining data for the VFIO device until the vendor driver
> +  indicates that no data remains (pending bytes is zero).

Hi Tarun,

We have moved the saving of the config space to the ``save_state`` function
added in commit d329f5032e1, do we need to add this change here? :-)

Thanks,
Shenming

> +
> +* A ``load_state`` function that loads the config section and the data
> +  sections that are generated by the save functions above.
> +
> +* ``cleanup`` functions for both save and load that perform any migration
> +  related cleanup, including unmapping the migration region.
> +
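
[Editorial note: the hooks listed above form a single callback table registered per device. The sketch below is a hypothetical, trimmed-down mirror of that table (QEMU's real SaveVMHandlers lives in migration/register.h and has more members and different signatures); it only illustrates the shape of the registration, with dummy callbacks standing in for the real VFIO ones.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified hook table mirroring the list above. */
typedef struct SaveVMHandlersSketch {
    int  (*save_setup)(void *opaque);
    void (*save_live_pending)(void *opaque, uint64_t threshold,
                              uint64_t *res_pending);
    int  (*save_live_iterate)(void *opaque);
    int  (*save_live_complete_precopy)(void *opaque);
    int  (*load_setup)(void *opaque);
    int  (*load_state)(void *opaque);
    void (*save_cleanup)(void *opaque);
    int  (*load_cleanup)(void *opaque);
} SaveVMHandlersSketch;

/* Dummy device: the real save_setup would map the migration region and
 * set the _SAVING flag; this one just records that it ran. */
typedef struct { int setup_done; uint64_t pending; } DummyDev;

static int dummy_save_setup(void *opaque)
{
    ((DummyDev *)opaque)->setup_done = 1;
    return 0;
}

static void dummy_save_live_pending(void *opaque, uint64_t threshold,
                                    uint64_t *res_pending)
{
    (void)threshold;
    *res_pending = ((DummyDev *)opaque)->pending;
}

/* Hooks not needed for the sketch are left NULL. */
static const SaveVMHandlersSketch dummy_handlers = {
    .save_setup = dummy_save_setup,
    .save_live_pending = dummy_save_live_pending,
};
```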
> +A VM state change handler is registered to change the VFIO device state when
> +the VM state changes.
> +
> +Similarly, a migration state change notifier is registered to get a
> +notification on migration state change. These states are translated to the
> +corresponding VFIO device state and conveyed to the vendor driver.
> +
> +System memory dirty pages tracking
> +----------------------------------
> +
> +A ``log_sync`` memory listener callback marks those system memory pages
> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
> +queried per container. All pages pinned by the vendor driver through external
> +APIs have to be marked as dirty during migration. When there are CPU writes,
> +CPU dirty page tracking can identify dirtied pages, but any page pinned by the
> +vendor driver can also be written by device. There is currently no device or
> +IOMMU support for dirty page tracking in hardware.
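
[Editorial note: the key point of the paragraph above, that pinned pages must always be reported dirty because the device can write them without the CPU noticing, can be shown with a toy per-container bitmap. Page indexing and the API names here are invented for illustration, not the VFIO IOMMU module's actual interface.]

```c
#include <assert.h>
#include <stdint.h>

/* Toy dirty log: one bit per page, at most 64 pages. */
typedef struct {
    uint64_t bitmap;   /* bit n set => page n dirtied by a CPU write */
    uint64_t pinned;   /* bit n set => page n pinned for device DMA */
} DirtyLog;

static void log_pin(DirtyLog *log, unsigned page)
{
    log->pinned |= 1ULL << page;
}

static void log_cpu_write(DirtyLog *log, unsigned page)
{
    log->bitmap |= 1ULL << page;
}

/* log_sync-style query: report CPU-dirtied pages plus every pinned page,
 * since pinned pages may have been written by the device. */
static uint64_t log_sync(DirtyLog *log)
{
    uint64_t dirty = log->bitmap | log->pinned;
    log->bitmap = 0;   /* CPU dirty state is consumed on sync */
    return dirty;
}
```

Note that a pinned page stays dirty across syncs, which is exactly why continuously pinned pages are re-copied in every pre-copy iteration.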
> +
> +By default, dirty pages are tracked when the device is in pre-copy as well as
> +stop-and-copy phase. So, a page pinned by vendor driver will be copied to
> +destination in both the phases. Copying dirty pages in pre-copy phase helps
> +QEMU to predict if it can achieve its downtime tolerances. If QEMU during
> +pre-copy phase keeps finding dirty pages continuously, then it understands
> +that even in stop-and-copy phase, it is likely to find dirty pages and can
> +predict the downtime accordingly.
> +
> +QEMU also provides per device opt-out option ``pre-copy-dirty-page-tracking``
> +which disables querying dirty bitmap during pre-copy phase. If it is set to
> +off, all dirty pages will be copied to destination in stop-and-copy phase only.
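
[Editorial note: a usage sketch of the opt-out; the PCI host address is a placeholder and the property spelling should be checked against the vfio-pci properties of the QEMU version in use.]

```shell
# Hypothetical invocation: turn off dirty-bitmap queries during pre-copy
# for one assigned device, so its dirty pages are transferred only in
# the stop-and-copy phase. 0000:65:00.0 is a placeholder address.
# (remaining machine/memory options elided)
qemu-system-x86_64 \
    -device vfio-pci,host=0000:65:00.0,pre-copy-dirty-page-tracking=off
```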
> +
> +System memory dirty pages tracking when vIOMMU is enabled
> +---------------------------------------------------------
> +
> +With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
> +phase of migration. In that case, the unmap ioctl returns any dirty pages in
> +that range and QEMU reports the corresponding guest physical pages as dirty.
> +During
> +stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
> +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
> +mapped ranges.
> +
> +Flow of state changes during Live migration
> +===========================================
> +
> +Below is the flow of state change during live migration.
> +The values in the brackets represent the VM state, the migration state, and
> +the VFIO device state, respectively.
> +
> +Live migration save path
> +------------------------
> +
> +::
> +
> +                        QEMU normal running state
> +                        (RUNNING, _NONE, _RUNNING)
> +                                  |
> +                     migrate_init spawns migration_thread
> +                Migration thread then calls each device's .save_setup()
> +                    (RUNNING, _SETUP, _RUNNING|_SAVING)
> +                                  |
> +                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
> +             If device is active, get pending_bytes by .save_live_pending()
> +          If total pending_bytes >= threshold_size, call .save_live_iterate()
> +                  Data of VFIO device for pre-copy phase is copied
> +        Iterate till total pending bytes converge and are less than threshold
> +                                  |
> +  On migration completion, vCPU stops and calls .save_live_complete_precopy for
> +   each active device. The VFIO device is then transitioned into _SAVING state
> +                   (FINISH_MIGRATE, _DEVICE, _SAVING)
> +                                  |
> +     For the VFIO device, iterate in .save_live_complete_precopy until
> +                         pending data is 0
> +                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
> +                                  |
> +                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
> +             Migration thread schedules cleanup bottom half and exits
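
[Editorial note: the save path above can be simulated with a toy state machine. The device states mirror the flags in the diagram (_RUNNING, _SAVING); the byte counts and the "half the data per iteration" policy are invented purely for illustration.]

```c
#include <assert.h>
#include <stdint.h>

enum { DEV_RUNNING = 1 << 0, DEV_SAVING = 1 << 1 };

typedef struct {
    unsigned state;
    uint64_t pending_bytes;
    unsigned iterations;
} VfioDevSketch;

static void save_setup(VfioDevSketch *d)
{
    d->state = DEV_RUNNING | DEV_SAVING;   /* device keeps running */
}

static void save_live_iterate(VfioDevSketch *d)
{
    d->pending_bytes /= 2;                 /* pretend half the data went out */
    d->iterations++;
}

static void save_live_complete_precopy(VfioDevSketch *d)
{
    d->state &= ~DEV_RUNNING;              /* _RUNNING cleared, _SAVING kept */
    while (d->pending_bytes) {             /* drain until pending data is 0 */
        d->pending_bytes /= 2;
    }
}

/* Pre-copy loop: iterate while pending_bytes is at or above the
 * threshold, then stop the device and drain the remainder. */
static void migrate_save(VfioDevSketch *d, uint64_t threshold)
{
    save_setup(d);
    while (d->pending_bytes >= threshold) {
        save_live_iterate(d);
    }
    save_live_complete_precopy(d);
}
```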
> +
> +Live migration resume path
> +--------------------------
> +
> +::
> +
> +              Incoming migration calls .load_setup for each device
> +                       (RESTORE_VM, _ACTIVE, _STOPPED)
> +                                 |
> +       For each device, .load_state is called for that device section data
> +                       (RESTORE_VM, _ACTIVE, _RESUMING)
> +                                 |
> +    At the end, .load_cleanup is called for each device and vCPUs are started
> +                       (RUNNING, _NONE, _RUNNING)
> +
> +Postcopy
> +========
> +
> +Postcopy migration is currently not supported for VFIO devices.
>
Tarun Gupta April 1, 2021, 6:58 a.m. UTC | #2
On 3/27/2021 11:34 AM, Shenming Lu wrote:
> On 2021/3/26 21:18, Tarun Gupta wrote:
>> Document interfaces used for VFIO device migration. Added flow of state changes
>> during live migration with VFIO device. Tested by building docs with the new
>> vfio-migration.rst file.
>>
>> v3:
>> - Add introductory line about VM migration in general.
>> - Remove occurcences of vfio_pin_pages() to describe pinning.
>> - Incorporated comments from v2
>>
>> v2:
>> - Included the new vfio-migration.rst file in index.rst
>> - Updated dirty page tracking section, also added details about
>>    'pre-copy-dirty-page-tracking' opt-out option.
>> - Incorporated comments around wording of doc.
>>
>> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>> ---
>>   MAINTAINERS                   |   1 +
>>   docs/devel/index.rst          |   1 +
>>   docs/devel/vfio-migration.rst | 143 ++++++++++++++++++++++++++++++++++
>>   3 files changed, 145 insertions(+)
>>   create mode 100644 docs/devel/vfio-migration.rst
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index 738786146d..a2a80eee59 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -1801,6 +1801,7 @@ M: Alex Williamson <alex.williamson@redhat.com>
>>   S: Supported
>>   F: hw/vfio/*
>>   F: include/hw/vfio/
>> +F: docs/devel/vfio-migration.rst
>>
>>   vfio-ccw
>>   M: Cornelia Huck <cohuck@redhat.com>
>> diff --git a/docs/devel/index.rst b/docs/devel/index.rst
>> index ae664da00c..5330f1ca1d 100644
>> --- a/docs/devel/index.rst
>> +++ b/docs/devel/index.rst
>> @@ -39,3 +39,4 @@ Contents:
>>      qom
>>      block-coroutine-wrapper
>>      multi-process
>> +   vfio-migration
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> new file mode 100644
>> index 0000000000..24cb55991a
>> --- /dev/null
>> +++ b/docs/devel/vfio-migration.rst
>> @@ -0,0 +1,143 @@
>> +=====================
>> +VFIO device Migration
>> +=====================
>> +
>> +Migration of virtual machine involves saving the state for each device that
>> +the guest is running on source host and restoring this saved state on the
>> +destination host. This document details how saving and restoring of VFIO
>> +devices is done in QEMU.
>> +
>> +Migration of VFIO devices consists of two phases: the optional pre-copy phase,
>> +and the stop-and-copy phase. The pre-copy phase is iterative and allows to
>> +accommodate VFIO devices that have a large amount of data that needs to be
>> +transferred. The iterative pre-copy phase of migration allows for the guest to
>> +continue whilst the VFIO device state is transferred to the destination, this
>> +helps to reduce the total downtime of the VM. VFIO devices can choose to skip
>> +the pre-copy phase of migration by returning pending_bytes as zero during the
>> +pre-copy phase.
>> +
>> +A detailed description of the UAPI for VFIO device migration can be found in
>> +the comment for the ``vfio_device_migration_info`` structure in the header
>> +file linux-headers/linux/vfio.h.
>> +
>> +VFIO device hooks for iterative approach:
>> +
>> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
>> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
>> +  dirty page tracking.
>> +
>> +* A ``load_setup`` function that sets up the migration region on the
>> +  destination and sets _RESUMING flag in the VFIO device state.
>> +
>> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
>> +  driver, which indicates the amount of data that the vendor driver has yet to
>> +  save for the VFIO device.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver through the migration region during iterative phase.
>> +
>> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
>> +  VFIO device state, saves the device config space, if any, and iteratively
>> +  copies the remaining data for the VFIO device until the vendor driver
>> +  indicates that no data remains (pending bytes is zero).
> 
> Hi Tarun,
> 
> We have moved the saving of the config space to the ``save_state`` function
> added in commit d329f5032e1, do we need to add this change here? :-)
> 

Thanks Shenming, I'll update it accordingly in the next version.

Thanks,
Tarun

> Thanks,
> Shenming
> 
>> +
>> +* A ``load_state`` function that loads the config section and the data
>> +  sections that are generated by the save functions above
>> +
>> +* ``cleanup`` functions for both save and load that perform any migration
>> +  related cleanup, including unmapping the migration region
>> +
>> +A VM state change handler is registered to change the VFIO device state when
>> +the VM state changes.
>> +
>> +Similarly, a migration state change notifier is registered to get a
>> +notification on migration state change. These states are translated to the
>> +corresponding VFIO device state and conveyed to the vendor driver.
>> +
>> +System memory dirty pages tracking
>> +----------------------------------
>> +
>> +A ``log_sync`` memory listener callback marks those system memory pages
>> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
>> +queried per container. All pages pinned by the vendor driver through external
>> +APIs have to be marked as dirty during migration. When there are CPU writes,
>> +CPU dirty page tracking can identify dirtied pages, but any page pinned by the
>> +vendor driver can also be written by device. There is currently no device or
>> +IOMMU support for dirty page tracking in hardware.
>> +
>> +By default, dirty pages are tracked when the device is in pre-copy as well as
>> +stop-and-copy phase. So, a page pinned by vendor driver will be copied to
>> +destination in both the phases. Copying dirty pages in pre-copy phase helps
>> +QEMU to predict if it can achieve its downtime tolerances. If QEMU during
>> +pre-copy phase keeps finding dirty pages continuously, then it understands
>> +that even in stop-and-copy phase, it is likely to find dirty pages and can
>> +predict the downtime accordingly
>> +
>> +QEMU also provides per device opt-out option ``pre-copy-dirty-page-tracking``
>> +which disables querying dirty bitmap during pre-copy phase. If it is set to
>> +off, all dirty pages will be copied to destination in stop-and-copy phase only
>> +
>> +System memory dirty pages tracking when vIOMMU is enabled
>> +---------------------------------------------------------
>> +
>> +With vIOMMU, an IO virtual address range can get unmapped while in pre-copy
>> +phase of migration. In that case, the unmap ioctl returns any dirty pages in
>> +that range and QEMU reports corresponding guest physical pages dirty. During
>> +stop-and-copy phase, an IOMMU notifier is used to get a callback for mapped
>> +pages and then dirty pages bitmap is fetched from VFIO IOMMU modules for those
>> +mapped ranges.
>> +
>> +Flow of state changes during Live migration
>> +===========================================
>> +
>> +Below is the flow of state change during live migration.
>> +The values in the brackets represent the VM state, the migration state, and
>> +the VFIO device state, respectively.
>> +
>> +Live migration save path
>> +------------------------
>> +
>> +::
>> +
>> +                        QEMU normal running state
>> +                        (RUNNING, _NONE, _RUNNING)
>> +                                  |
>> +                     migrate_init spawns migration_thread
>> +                Migration thread then calls each device's .save_setup()
>> +                    (RUNNING, _SETUP, _RUNNING|_SAVING)
>> +                                  |
>> +                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
>> +             If device is active, get pending_bytes by .save_live_pending()
>> +          If total pending_bytes >= threshold_size, call .save_live_iterate()
>> +                  Data of VFIO device for pre-copy phase is copied
>> +        Iterate till total pending bytes converge and are less than threshold
>> +                                  |
>> +  On migration completion, vCPU stops and calls .save_live_complete_precopy for
>> +   each active device. The VFIO device is then transitioned into _SAVING state
>> +                   (FINISH_MIGRATE, _DEVICE, _SAVING)
>> +                                  |
>> +     For the VFIO device, iterate in .save_live_complete_precopy until
>> +                         pending data is 0
>> +                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
>> +                                  |
>> +                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
>> +             Migraton thread schedules cleanup bottom half and exits
>> +
>> +Live migration resume path
>> +--------------------------
>> +
>> +::
>> +
>> +              Incoming migration calls .load_setup for each device
>> +                       (RESTORE_VM, _ACTIVE, _STOPPED)
>> +                                 |
>> +       For each device, .load_state is called for that device section data
>> +                       (RESTORE_VM, _ACTIVE, _RESUMING)
>> +                                 |
>> +    At the end, .load_cleanup is called for each device and vCPUs are started
>> +                       (RUNNING, _NONE, _RUNNING)
>> +
>> +Postcopy
>> +========
>> +
>> +Postcopy migration is currently not supported for VFIO devices.
>>
Cornelia Huck April 1, 2021, 11:05 a.m. UTC | #3
On Fri, 26 Mar 2021 18:48:50 +0530
Tarun Gupta <targupta@nvidia.com> wrote:

> Document interfaces used for VFIO device migration. Added flow of state changes
> during live migration with VFIO device. Tested by building docs with the new
> vfio-migration.rst file.

I don't think you want to include the test state in the patch
description; that should go into a --- section that is stripped off by
git am.

> 
> v3:
> - Add introductory line about VM migration in general.
> - Remove occurcences of vfio_pin_pages() to describe pinning.
> - Incorporated comments from v2
> 
> v2:
> - Included the new vfio-migration.rst file in index.rst
> - Updated dirty page tracking section, also added details about
>   'pre-copy-dirty-page-tracking' opt-out option.
> - Incorporated comments around wording of doc.

Same for the changelog; this is interesting for review, but not for the
final git log.

> 
> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>

This S-o-b chain does not look correct. Your address should be the last
one in the chain, signing off on all of the previous ones. (Maybe Kirti
also needs to be listed in a Co-developed-by: statement?)

> ---
>  MAINTAINERS                   |   1 +
>  docs/devel/index.rst          |   1 +
>  docs/devel/vfio-migration.rst | 143 ++++++++++++++++++++++++++++++++++
>  3 files changed, 145 insertions(+)
>  create mode 100644 docs/devel/vfio-migration.rst

> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
> new file mode 100644
> index 0000000000..24cb55991a
> --- /dev/null
> +++ b/docs/devel/vfio-migration.rst

(...)

> +VFIO device hooks for iterative approach:

"VFIO implements the device hooks for the iterative approach as
follows:"

?

> +
> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
> +  dirty page tracking.
> +
> +* A ``load_setup`` function that sets up the migration region on the
> +  destination and sets _RESUMING flag in the VFIO device state.
> +
> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
> +  driver, which indicates the amount of data that the vendor driver has yet to
> +  save for the VFIO device.
> +
> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
> +  vendor driver through the migration region during iterative phase.
> +
> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
> +  VFIO device state, saves the device config space, if any, and iteratively
> +  copies the remaining data for the VFIO device until the vendor driver
> +  indicates that no data remains (pending bytes is zero).
> +
> +* A ``load_state`` function that loads the config section and the data
> +  sections that are generated by the save functions above
> +
> +* ``cleanup`` functions for both save and load that perform any migration
> +  related cleanup, including unmapping the migration region
> +
> +A VM state change handler is registered to change the VFIO device state when
> +the VM state changes.

This sentence is not very informative. What about:

"The VFIO migration code uses a VM state change handler to change the
VFIO device state when the VM state changes from running to
not-running, and vice versa."

> +
> +Similarly, a migration state change notifier is registered to get a
> +notification on migration state change. These states are translated to the
> +corresponding VFIO device state and conveyed to the vendor driver.

"Similarly, a migration state change handler is used to transition the
VFIO device state back to _RUNNING in case a migration failed or was
canceled."


> +
> +System memory dirty pages tracking
> +----------------------------------
> +
> +A ``log_sync`` memory listener callback marks those system memory pages
> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
> +queried per container. All pages pinned by the vendor driver through external
> +APIs have to be marked as dirty during migration. When there are CPU writes,
> +CPU dirty page tracking can identify dirtied pages, but any page pinned by the
> +vendor driver can also be written by device. There is currently no device or

s/by/by the/

> +IOMMU support for dirty page tracking in hardware.
> +
> +By default, dirty pages are tracked when the device is in pre-copy as well as
> +stop-and-copy phase. So, a page pinned by vendor driver will be copied to

s/by/by the/
s/to/to the/

> +destination in both the phases. Copying dirty pages in pre-copy phase helps

s/both the/both/ ?

> +QEMU to predict if it can achieve its downtime tolerances. If QEMU during
> +pre-copy phase keeps finding dirty pages continuously, then it understands
> +that even in stop-and-copy phase, it is likely to find dirty pages and can
> +predict the downtime accordingly
> +
> +QEMU also provides per device opt-out option ``pre-copy-dirty-page-tracking``

s/provides/provides a/

> +which disables querying dirty bitmap during pre-copy phase. If it is set to

s/querying/querying the/

> +off, all dirty pages will be copied to destination in stop-and-copy phase only

s/to/to the/

(...)
Tarun Gupta April 5, 2021, 5:02 p.m. UTC | #4
On 4/1/2021 4:35 PM, Cornelia Huck wrote:
> 
> On Fri, 26 Mar 2021 18:48:50 +0530
> Tarun Gupta <targupta@nvidia.com> wrote:
> 
>> Document interfaces used for VFIO device migration. Added flow of state changes
>> during live migration with VFIO device. Tested by building docs with the new
>> vfio-migration.rst file.
> 
> I don't think you want to include the test state in the patch
> description; that should go into a --- section that is stripped off by
> git am.
> 
>>
>> v3:
>> - Add introductory line about VM migration in general.
>> - Remove occurcences of vfio_pin_pages() to describe pinning.
>> - Incorporated comments from v2
>>
>> v2:
>> - Included the new vfio-migration.rst file in index.rst
>> - Updated dirty page tracking section, also added details about
>>    'pre-copy-dirty-page-tracking' opt-out option.
>> - Incorporated comments around wording of doc.
> 
> Same for the changelog; this is interesting for review, but not for the
> final git log.
>

Will move these details in --- section.

>>
>> Signed-off-by: Tarun Gupta <targupta@nvidia.com>
>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> 
> This S-o-b chain does not look correct. Your address should be the last
> one in the chain, signing off on all of the previous ones. (Maybe Kirti
> also needs to be listed in a Co-developed-by: statement?)
> 
>> ---
>>   MAINTAINERS                   |   1 +
>>   docs/devel/index.rst          |   1 +
>>   docs/devel/vfio-migration.rst | 143 ++++++++++++++++++++++++++++++++++
>>   3 files changed, 145 insertions(+)
>>   create mode 100644 docs/devel/vfio-migration.rst
> 
>> diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
>> new file mode 100644
>> index 0000000000..24cb55991a
>> --- /dev/null
>> +++ b/docs/devel/vfio-migration.rst
> 
> (...)
> 
>> +VFIO device hooks for iterative approach:
> 
> "VFIO implements the device hooks for the iterative approach as
> follows:"
> 
> ?
> 
>> +
>> +* A ``save_setup`` function that sets up the migration region, sets _SAVING
>> +  flag in the VFIO device state and informs the VFIO IOMMU module to start
>> +  dirty page tracking.
>> +
>> +* A ``load_setup`` function that sets up the migration region on the
>> +  destination and sets _RESUMING flag in the VFIO device state.
>> +
>> +* A ``save_live_pending`` function that reads pending_bytes from the vendor
>> +  driver, which indicates the amount of data that the vendor driver has yet to
>> +  save for the VFIO device.
>> +
>> +* A ``save_live_iterate`` function that reads the VFIO device's data from the
>> +  vendor driver through the migration region during iterative phase.
>> +
>> +* A ``save_live_complete_precopy`` function that resets _RUNNING flag from the
>> +  VFIO device state, saves the device config space, if any, and iteratively
>> +  copies the remaining data for the VFIO device until the vendor driver
>> +  indicates that no data remains (pending bytes is zero).
>> +
>> +* A ``load_state`` function that loads the config section and the data
>> +  sections that are generated by the save functions above
>> +
>> +* ``cleanup`` functions for both save and load that perform any migration
>> +  related cleanup, including unmapping the migration region
>> +
>> +A VM state change handler is registered to change the VFIO device state when
>> +the VM state changes.
> 
> This sentence is not very informative. What about:
> 
> "The VFIO migration code uses a VM state change handler to change the
> VFIO device state when the VM state changes from running to
> not-running, and vice versa."
> 
>> +
>> +Similarly, a migration state change notifier is registered to get a
>> +notification on migration state change. These states are translated to the
>> +corresponding VFIO device state and conveyed to the vendor driver.
> 
> "Similarly, a migration state change handler is used to transition the
> VFIO device state back to _RUNNING in case a migration failed or was
> canceled."

I wanted to keep the statement generic because the VFIO device state can 
be _RUNNING, _SAVING, _RESUMING. I can use your statement as an example 
as to how the migration state can be changed back to _RUNNING in case of 
migration failure or cancel. Does that work?

Thanks,
Tarun

> 
> 
>> +
>> +System memory dirty pages tracking
>> +----------------------------------
>> +
>> +A ``log_sync`` memory listener callback marks those system memory pages
>> +as dirty which are used for DMA by the VFIO device. The dirty pages bitmap is
>> +queried per container. All pages pinned by the vendor driver through external
>> +APIs have to be marked as dirty during migration. When there are CPU writes,
>> +CPU dirty page tracking can identify dirtied pages, but any page pinned by the
>> +vendor driver can also be written by device. There is currently no device or
> 
> s/by/by the/
> 
>> +IOMMU support for dirty page tracking in hardware.
>> +
>> +By default, dirty pages are tracked when the device is in pre-copy as well as
>> +stop-and-copy phase. So, a page pinned by vendor driver will be copied to
> 
> s/by/by the/
> s/to/to the/
> 
>> +destination in both the phases. Copying dirty pages in pre-copy phase helps
> 
> s/both the/both/ ?
> 
>> +QEMU to predict if it can achieve its downtime tolerances. If QEMU during
>> +pre-copy phase keeps finding dirty pages continuously, then it understands
>> +that even in stop-and-copy phase, it is likely to find dirty pages and can
>> +predict the downtime accordingly
>> +
>> +QEMU also provides per device opt-out option ``pre-copy-dirty-page-tracking``
> 
> s/provides/provides a/
> 
>> +which disables querying dirty bitmap during pre-copy phase. If it is set to
> 
> s/querying/querying the/
> 
>> +off, all dirty pages will be copied to destination in stop-and-copy phase only
> 
> s/to/to the/
> 
> (...)
>
Cornelia Huck April 7, 2021, 10:23 a.m. UTC | #5
On Mon, 5 Apr 2021 22:32:47 +0530
"Tarun Gupta (SW-GPU)" <targupta@nvidia.com> wrote:

> On 4/1/2021 4:35 PM, Cornelia Huck wrote:
> > 
> > On Fri, 26 Mar 2021 18:48:50 +0530
> > Tarun Gupta <targupta@nvidia.com> wrote:

> >> +
> >> +Similarly, a migration state change notifier is registered to get a
> >> +notification on migration state change. These states are translated to the
> >> +corresponding VFIO device state and conveyed to the vendor driver.  
> > 
> > "Similarly, a migration state change handler is used to transition the
> > VFIO device state back to _RUNNING in case a migration failed or was
> > canceled."  
> 
> I wanted to keep the statement generic because the VFIO device state can 
> be _RUNNING, _SAVING, _RESUMING. I can use your statement as an example 
> as to how the migration state can be changed back to _RUNNING in case of 
> migration failure or cancel. Does that work?

So, maybe:

"Similarly, a migration state change handler is used to trigger a
transition of the VFIO device state when certain changes of the
migration state occur. For example, the VFIO device state is
transitioned back to _RUNNING in case a migration failed or was
canceled."
Tarun Gupta April 7, 2021, 11:33 a.m. UTC | #6
On 4/7/2021 3:53 PM, Cornelia Huck wrote:
> 
> On Mon, 5 Apr 2021 22:32:47 +0530
> "Tarun Gupta (SW-GPU)" <targupta@nvidia.com> wrote:
> 
>> On 4/1/2021 4:35 PM, Cornelia Huck wrote:
>>>
>>> On Fri, 26 Mar 2021 18:48:50 +0530
>>> Tarun Gupta <targupta@nvidia.com> wrote:
> 
>>>> +
>>>> +Similarly, a migration state change notifier is registered to get a
>>>> +notification on migration state change. These states are translated to the
>>>> +corresponding VFIO device state and conveyed to the vendor driver.
>>>
>>> "Similarly, a migration state change handler is used to transition the
>>> VFIO device state back to _RUNNING in case a migration failed or was
>>> canceled."
>>
>> I wanted to keep the statement generic because the VFIO device state can
>> be _RUNNING, _SAVING, _RESUMING. I can use your statement as an example
>> as to how the migration state can be changed back to _RUNNING in case of
>> migration failure or cancel. Does that work?
> 
> So, maybe:
> 
> "Similarly, a migration state change handler is used to trigger a
> transition of the VFIO device state when certain changes of the
> migration state occur. For example, the VFIO device state is
> transitioned back to _RUNNING in case a migration failed or was
> canceled."
> 

Yes, this looks fine to me. I'll update this in v4.

Thanks,
Tarun
diff mbox series

Patch

diff --git a/MAINTAINERS b/MAINTAINERS
index 738786146d..a2a80eee59 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1801,6 +1801,7 @@  M: Alex Williamson <alex.williamson@redhat.com>
 S: Supported
 F: hw/vfio/*
 F: include/hw/vfio/
+F: docs/devel/vfio-migration.rst
 
 vfio-ccw
 M: Cornelia Huck <cohuck@redhat.com>
diff --git a/docs/devel/index.rst b/docs/devel/index.rst
index ae664da00c..5330f1ca1d 100644
--- a/docs/devel/index.rst
+++ b/docs/devel/index.rst
@@ -39,3 +39,4 @@  Contents:
    qom
    block-coroutine-wrapper
    multi-process
+   vfio-migration
diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst
new file mode 100644
index 0000000000..24cb55991a
--- /dev/null
+++ b/docs/devel/vfio-migration.rst
@@ -0,0 +1,143 @@ 
+=====================
+VFIO device Migration
+=====================
+
+Migration of a virtual machine involves saving the state of each device that
+the guest is running on the source host and restoring this saved state on the
+destination host. This document details how saving and restoring of VFIO
+devices is done in QEMU.
+
+Migration of VFIO devices consists of two phases: the optional pre-copy phase
+and the stop-and-copy phase. The pre-copy phase is iterative and makes it
+possible to accommodate VFIO devices that have a large amount of data that
+needs to be transferred. The iterative pre-copy phase of migration allows the
+guest to continue running while the VFIO device state is transferred to the
+destination; this helps to reduce the total downtime of the VM. VFIO devices
+can choose to skip the pre-copy phase of migration by returning pending_bytes
+as zero during the pre-copy phase.
+
+A detailed description of the UAPI for VFIO device migration can be found in
+the comment for the ``vfio_device_migration_info`` structure in the header
+file linux-headers/linux/vfio.h.
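
The layout of that structure can be sketched as below. This is an illustrative
mirror of the v1 migration UAPI as described in this document, not the
authoritative header; the exact definition lives in linux/vfio.h:

```c
#include <stdint.h>

/* Illustrative mirror of struct vfio_device_migration_info from the v1
 * migration UAPI; see linux/vfio.h for the authoritative definition. */
struct vfio_device_migration_info {
    uint32_t device_state;    /* VFIO device state, bitmask of flags below */
#define VFIO_DEVICE_STATE_STOP      (0)
#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
    uint32_t reserved;
    uint64_t pending_bytes;   /* data the vendor driver has yet to save */
    uint64_t data_offset;     /* offset of data within the migration region */
    uint64_t data_size;       /* size of the data chunk to read or write */
};

/* During pre-copy the device is both running and saving, so the state is
 * the combination of the two flags. */
static inline uint32_t vfio_precopy_state(void)
{
    return VFIO_DEVICE_STATE_RUNNING | VFIO_DEVICE_STATE_SAVING;
}
```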
+
+VFIO device hooks for the iterative approach:
+
+* A ``save_setup`` function that sets up the migration region, sets the
+  _SAVING flag in the VFIO device state and informs the VFIO IOMMU module to
+  start dirty page tracking.
+
+* A ``load_setup`` function that sets up the migration region on the
+  destination and sets the _RESUMING flag in the VFIO device state.
+
+* A ``save_live_pending`` function that reads pending_bytes from the vendor
+  driver, which indicates the amount of data that the vendor driver has yet to
+  save for the VFIO device.
+
+* A ``save_live_iterate`` function that reads the VFIO device's data from the
+  vendor driver through the migration region during the iterative phase.
+
+* A ``save_live_complete_precopy`` function that clears the _RUNNING flag in
+  the VFIO device state, saves the device config space, if any, and
+  iteratively copies the remaining data for the VFIO device until the vendor
+  driver indicates that no data remains (pending bytes is zero).
+
+* A ``load_state`` function that loads the config section and the data
+  sections that are generated by the save functions above.
+
+* ``cleanup`` functions for both save and load that perform any migration
+  related cleanup, including unmapping the migration region.
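
Taken together, the save-side hooks drive a loop of the following shape. This
is a simplified sketch with a stubbed vendor driver; the real code reads device
data through the migration region via QEMU's SaveVMHandlers:

```c
#include <stdint.h>

/* Stub standing in for the vendor driver's view of unsaved device data. */
struct vendor_dev {
    uint64_t pending_bytes;   /* as reported via save_live_pending */
};

/* One save_live_iterate step: copy up to `chunk` bytes to the migration
 * stream, return the amount actually copied. */
static uint64_t save_iterate(struct vendor_dev *dev, uint64_t chunk)
{
    uint64_t n = dev->pending_bytes < chunk ? dev->pending_bytes : chunk;
    dev->pending_bytes -= n;   /* pretend the data went out on the wire */
    return n;
}

/* Iterate while pending data is at or above the downtime threshold
 * (pre-copy); once it converges, the device is stopped and
 * save_live_complete_precopy drains the remainder (stop-and-copy). */
static uint64_t save_device(struct vendor_dev *dev, uint64_t chunk,
                            uint64_t threshold)
{
    uint64_t total = 0;
    while (dev->pending_bytes >= threshold) {   /* pre-copy phase */
        total += save_iterate(dev, chunk);
    }
    while (dev->pending_bytes > 0) {            /* stop-and-copy phase */
        total += save_iterate(dev, chunk);
    }
    return total;
}
```

A device that reports pending_bytes as zero from the start skips the first
loop entirely, which is how the pre-copy phase is opted out of.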
+
+A VM state change handler is registered to change the VFIO device state when
+the VM state changes.
+
+Similarly, a migration state change handler is used to trigger a transition of
+the VFIO device state when certain changes of the migration state occur. For
+example, the VFIO device state is transitioned back to _RUNNING in case a
+migration failed or was canceled.
+
+System memory dirty pages tracking
+----------------------------------
+
+A ``log_sync`` memory listener callback marks as dirty those system memory
+pages that are used for DMA by the VFIO device. The dirty pages bitmap is
+queried per container. All pages pinned by the vendor driver through external
+APIs have to be marked as dirty during migration. When there are CPU writes,
+CPU dirty page tracking can identify dirtied pages, but any page pinned by the
+vendor driver can also be written by the device. There is currently no device
+or IOMMU support for dirty page tracking in hardware.
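
The effect of a ``log_sync`` pass can be sketched as a bitmap operation: every
page pinned by the vendor driver is reported dirty, regardless of what CPU
dirty tracking saw, because the device may have written it at any time. The
helper names below are hypothetical, one bit per page:

```c
#include <stddef.h>
#include <stdint.h>

/* Mark one page-frame number dirty in a bitmap (one bit per page). */
static void mark_dirty(uint64_t *bitmap, uint64_t pfn)
{
    bitmap[pfn / 64] |= 1ULL << (pfn % 64);
}

/* log_sync-style pass: every pinned page is reported as dirty because the
 * device may have written to it without the CPU noticing. */
static void sync_pinned_pages(uint64_t *bitmap, const uint64_t *pinned_pfns,
                              size_t count)
{
    for (size_t i = 0; i < count; i++) {
        mark_dirty(bitmap, pinned_pfns[i]);
    }
}

/* Count set bits, i.e. the number of distinct dirty pages. */
static size_t count_dirty(const uint64_t *bitmap, size_t words)
{
    size_t n = 0;
    for (size_t i = 0; i < words; i++) {
        for (uint64_t w = bitmap[i]; w; w &= w - 1) {
            n++;
        }
    }
    return n;
}
```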
+
+By default, dirty pages are tracked when the device is in the pre-copy as well
+as the stop-and-copy phase. So, a page pinned by the vendor driver will be
+copied to the destination in both phases. Copying dirty pages in the pre-copy
+phase helps QEMU to predict if it can achieve its downtime tolerances. If QEMU
+keeps finding dirty pages continuously during the pre-copy phase, it can infer
+that it is likely to find dirty pages in the stop-and-copy phase as well, and
+predict the downtime accordingly.
+
+QEMU also provides a per device opt-out option ``pre-copy-dirty-page-tracking``
+(for example, ``-device vfio-pci,...,pre-copy-dirty-page-tracking=off``) which
+disables querying the dirty bitmap during the pre-copy phase. If it is set to
+off, all dirty pages will be copied to the destination in the stop-and-copy
+phase only.
+System memory dirty pages tracking when vIOMMU is enabled
+---------------------------------------------------------
+
+With vIOMMU, an IO virtual address range can get unmapped while in the
+pre-copy phase of migration. In that case, the unmap ioctl returns any dirty
+pages in that range and QEMU reports the corresponding guest physical pages as
+dirty. During the stop-and-copy phase, an IOMMU notifier is used to get a
+callback for mapped pages and then the dirty pages bitmap is fetched from the
+VFIO IOMMU module for those mapped ranges.
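
When the unmap ioctl is asked to also return the dirty bitmap, userspace has
to supply a buffer sized for the unmapped range. A sketch of that sizing
calculation, following the one-bit-per-page, 64-bit-word convention used for
the dirty bitmaps described above (treat the rounding details as illustrative
rather than a restatement of the UAPI):

```c
#include <stdint.h>

/* Bytes needed for a one-bit-per-page dirty bitmap covering `size` bytes of
 * IOVA space at the given page size, rounded up to whole 64-bit words. */
static uint64_t dirty_bitmap_bytes(uint64_t size, uint64_t pgsize)
{
    uint64_t pages = (size + pgsize - 1) / pgsize;   /* pages in the range */
    uint64_t words = (pages + 63) / 64;              /* 64 pages per word */
    return words * 8;                                /* bytes to allocate */
}
```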
+
+Flow of state changes during Live migration
+===========================================
+
+Below is the flow of state changes during live migration.
+The values in the brackets represent the VM state, the migration state, and
+the VFIO device state, respectively.
+
+Live migration save path
+------------------------
+
+::
+
+                        QEMU normal running state
+                        (RUNNING, _NONE, _RUNNING)
+                                  |
+                     migrate_init spawns migration_thread
+                Migration thread then calls each device's .save_setup()
+                    (RUNNING, _SETUP, _RUNNING|_SAVING)
+                                  |
+                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
+             If device is active, get pending_bytes by .save_live_pending()
+          If total pending_bytes >= threshold_size, call .save_live_iterate()
+                  Data of VFIO device for pre-copy phase is copied
+        Iterate till total pending bytes converge and are less than threshold
+                                  |
+  On migration completion, vCPU stops and calls .save_live_complete_precopy for
+   each active device. The VFIO device is then transitioned into _SAVING state
+                   (FINISH_MIGRATE, _DEVICE, _SAVING)
+                                  |
+     For the VFIO device, iterate in .save_live_complete_precopy until
+                         pending data is 0
+                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
+                                  |
+                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
+             Migration thread schedules cleanup bottom half and exits
+
+Live migration resume path
+--------------------------
+
+::
+
+              Incoming migration calls .load_setup for each device
+                       (RESTORE_VM, _ACTIVE, _STOPPED)
+                                 |
+       For each device, .load_state is called for that device section data
+                       (RESTORE_VM, _ACTIVE, _RESUMING)
+                                 |
+    At the end, .load_cleanup is called for each device and vCPUs are started
+                       (RUNNING, _NONE, _RUNNING)
+
+Postcopy
+========
+
+Postcopy migration is currently not supported for VFIO devices.