
[v8,01/13] vfio: KABI for migration interface

Message ID 1566845753-18993-2-git-send-email-kwankhede@nvidia.com
State New
Series Add migration support for VFIO device

Commit Message

Kirti Wankhede Aug. 26, 2019, 6:55 p.m. UTC
- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combination of these bits defines VFIO device's state during migration
    _STOPPED => All bits 0 indicate the VFIO device is stopped.
    _RUNNING => Normal VFIO device running state.
    _SAVING | _RUNNING => vCPUs are running, VFIO device is running but starts
                          saving device state, i.e. pre-copy state
    _SAVING  => vCPUs are stopped, VFIO device should be stopped and should
                          save device state, i.e. stop-and-copy state
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING => Invalid state; _SAVING and _RESUMING must not be
                          set at the same time.
    Bits 3 - 31 are reserved for future use. The user should perform a
    read-modify-write operation on this field (see the sketch after this list).
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey the VFIO device state to be transitioned to. Only 3 bits are
        used as of now; bits 3 - 31 are reserved for future use.
    * pending_bytes: (read only)
        To get the number of bytes yet to be migrated for the VFIO device.
    * data_offset: (read only)
        To get the offset in the migration region from which data should be
        read during _SAVING, to which data should be written by the user space
        application during _RESUMING, and from which the dirty pages bitmap
        should be read.
    * data_size: (read/write)
        To get and set the size of data copied in the migration region during
        the _SAVING and _RESUMING states.
    * start_pfn, page_size, total_pfns: (write only)
        To request the bitmap of dirty pages from the vendor driver for
        total_pfns pages starting at start_pfn.
    * copied_pfns: (read only)
        To get the number of pfns for which the dirty bitmap has been copied
        to the migration region. The vendor driver should copy the bitmap with
        bits set only for pages to be marked dirty. The vendor driver should
        return VFIO_DEVICE_DIRTY_PFNS_NONE if there are no dirty pages in the
        requested range, and VFIO_DEVICE_DIRTY_PFNS_ALL to mark all pages in
        the range as dirty.
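As a rough, illustrative sketch (not part of the patch) of the read-modify-write
access to device_state described above: the helper below assumes the structure
layout proposed in this series is available from <linux/vfio.h>, and that
device_fd and region_offset (the migration region's offset within the device
file) were obtained beforehand, e.g. via VFIO_DEVICE_GET_REGION_INFO.

#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <linux/vfio.h>   /* assumes the definitions proposed in this patch */

static int set_device_state(int device_fd, off_t region_offset,
                            uint32_t set_bits, uint32_t clear_bits)
{
    off_t off = region_offset +
                offsetof(struct vfio_device_migration_info, device_state);
    uint32_t state;

    /* read-modify-write so that reserved bits 3 - 31 are preserved */
    if (pread(device_fd, &state, sizeof(state), off) != sizeof(state))
        return -1;
    state = (state | set_bits) & ~clear_bits;
    /* the vendor driver fails the write if the transition is not possible */
    if (pwrite(device_fd, &state, sizeof(state), off) != sizeof(state))
        return -1;
    return 0;
}

/* e.g. enter pre-copy: set_device_state(fd, off, VFIO_DEVICE_STATE_SAVING, 0); */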

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^                              ^
 offset 0-trapped part        data_offset                 data_size

The data section always follows the vfio_device_migration_info
structure in the region, so data_offset will always be non-0.
The offset from which data is copied is decided by the kernel driver; the
data section can be trapped or mapped depending on how the kernel driver
defines it. If mmapped, data_offset should be page aligned, whereas the
initial section which contains the vfio_device_migration_info structure
might not end at a page-aligned offset.
data_offset can be the same or different for device data and the dirty pages
bitmap. The vendor driver should decide whether and how to partition the
data section, and should return data_offset accordingly.

For the user application, the data is opaque. The user should write data in
the same order as it was received.
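As an illustration only (not part of the patch), a minimal user space loop
following the _SAVING read sequence documented in the header below might look
as follows; the migration region offset is assumed to have been discovered
already, and sink() is a placeholder for feeding the opaque data into the
migration stream.

#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <linux/vfio.h>   /* assumes the proposed migration definitions */

#define MIG_FIELD(f) offsetof(struct vfio_device_migration_info, f)

static int read_u64(int fd, off_t off, uint64_t *val)
{
    return pread(fd, val, sizeof(*val), off) == sizeof(*val) ? 0 : -1;
}

static int save_device_data(int fd, off_t region,
                            void (*sink)(const void *, size_t))
{
    uint64_t pending, data_offset, data_size;
    void *buf;

    for (;;) {
        /* step a: how much device data is still pending? */
        if (read_u64(fd, region + MIG_FIELD(pending_bytes), &pending))
            return -1;
        if (!pending)
            break;
        /* step b: reading data_offset asks the driver to stage the data */
        if (read_u64(fd, region + MIG_FIELD(data_offset), &data_offset))
            return -1;
        /* step c: size of the data the vendor driver staged */
        if (read_u64(fd, region + MIG_FIELD(data_size), &data_size))
            return -1;
        /* steps d and e: read the opaque data and hand it to the stream */
        buf = malloc(data_size);
        if (!buf || pread(fd, buf, data_size, region + data_offset) < 0) {
            free(buf);
            return -1;
        }
        sink(buf, data_size);
        free(buf);
    }
    return 0;
}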

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 linux-headers/linux/vfio.h | 148 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 148 insertions(+)

Comments

Alex Williamson Aug. 28, 2019, 8:50 p.m. UTC | #1
On Tue, 27 Aug 2019 00:25:41 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _STOPPED => All bits 0 indicates VFIO device stopped.
>     _RUNNING => Normal VFIO device running state.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but start
>                           saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
>                           save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits are set
>     Bits 3 - 31 are reserved for future use. User should perform
>     read-modify-write operation on this field.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are used
>         as of now, Bits 3 - 31 are reserved for future use.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration region from where data exist during
>         _SAVING, from where data should be written by user space application
>         during _RESUMING state and while read dirty pages bitmap.
>     * data_size: (read/write)
>         To get and set size of data copied in migration region during _SAVING
>         and _RESUMING state.
>     * start_pfn, page_size, total_pfns: (write only)
>         To get bitmap of dirty pages from vendor driver from given
>         start address for total_pfns.
>     * copied_pfns: (read only)
>         To get number of pfns bitmap copied in migration region.
>         Vendor driver should copy the bitmap with bits set only for
>         pages to be marked dirty in migration region. Vendor driver
>         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages dirty in
>         requested range. Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL
>         to mark all pages in the section as dirty.
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^                              ^
>  offset 0-trapped part        data_offset                 data_size
> 
> Data section is always followed by vfio_device_migration_info
> structure in the region, so data_offset will always be non-0.
> Offset from where data is copied is decided by kernel driver, data
> section can be trapped or mapped depending on how kernel driver
> defines data section. If mmapped, then data_offset should be page
> aligned, where as initial section which contain vfio_device_migration_info
> structure might not end at offset which is page aligned.
> Data_offset can be same or different for device data and dirty pages bitmap.
> Vendor driver should decide whether to partition data section and how to
> partition the data section. Vendor driver should return data_offset
> accordingly.
> 
> For user application, data is opaque. User should write data in the same
> order as received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  linux-headers/linux/vfio.h | 148 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 148 insertions(+)
> 
> diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> index 24f505199f83..4bc0236b0898 100644
> --- a/linux-headers/linux/vfio.h
> +++ b/linux-headers/linux/vfio.h
> @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
>   */
>  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
>  
> +/* Migration region type and sub-type */
> +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> +
> +/**
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state.
> + *      - If bit 2 set, indicates _RESUMING state.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state or read dirty pages bitmap. See below
> + *      for detail of sequence to be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      migration region during _SAVING state and write size of data copied in
> + *      migration region during _RESUMING state.
> + *
> + * start_pfn: (write only)
> + *      Start address pfn to get bitmap of dirty pages from vendor driver duing
> + *      _SAVING state.
> + *
> + * page_size: (write only)
> + *      User application should write the page_size of pfn.
> + *
> + * total_pfns: (write only)
> + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> + *
> + * copied_pfns: (read only)
> + *      pfn count for which dirty bitmap is copied to migration region.
> + *      Vendor driver should copy the bitmap with bits set only for pages to be
> + *      marked dirty in migration region.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
> + *        pages are dirty in requested range or rest of the range.
> + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *        pages dirty in the given range or rest of the range.
> + *      - Vendor driver should return pfn count for which bitmap is written in
> + *        the region.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^                             ^
> + *  offset 0-trapped part        data_offset                 data_size
> + *
> + * Data section is always followed by vfio_device_migration_info structure
> + * in the region, so data_offset will always be non-0. Offset from where data
> + * is copied is decided by kernel driver, data section can be trapped or
> + * mapped or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned.
> + * Data_offset can be same or different for device data and dirty pages bitmap.
> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> + * c. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * d. read data_size bytes of data from data_offset in the migration region.
> + * e. process data.
> + * f. Loop through a to e.

Something needs to be said here about the availability of the data, for
example, what indicates to the vendor driver that the above operation is
complete?  Is the data immutable?

> + *
> + * To copy system memory content during migration, vendor driver should be able
> + * to report system memory pages which are dirtied by that driver. For such
> + * dirty page reporting, user application should query for a range of GFNs
> + * relative to device address space (IOVA), then vendor driver should provide
> + * the bitmap of pages from this range which are dirtied by him through
> + * migration region where each bit represents a page and bit set to 1 represents
> + * that the page is dirty.
> + * User space application should take care of copying content of system memory
> + * for those pages.

Can we say that device state and dirty pfn operations on the data
area may be intermixed in any order the user chooses?

Should we specify that bits accumulate since EITHER a) _SAVING state is
enabled or b) a given pfn was last reported via the below sequence (ie.
dirty bits are cleared once reported)?

How does QEMU handle the fact that IOVAs are potentially dynamic while
performing the live portion of a migration?  For example, each time a
guest driver calls dma_map_page() or dma_unmap_page(), a
MemoryRegionSection pops in or out of the AddressSpace for the device
(I'm assuming a vIOMMU where the device AddressSpace is not
system_memory).  I don't see any QEMU code that intercepts that change
in the AddressSpace such that the IOVA dirty pfns could be recorded and
translated to GFNs.  The vendor driver can't track these beyond getting
an unmap notification since it only knows the IOVA pfns, which can be
re-used with different GFN backing.  Once the DMA mapping is torn down,
it seems those dirty pfns are lost in the ether.  If this works in QEMU,
please help me find the code that handles it.

> + *
> + * Steps to get dirty page bitmap:
> + * a. write start_pfn, page_size and total_pfns.

This is not well specified.  Is it intended that the user write all
three of these on every iteration, or could they write start_pfn=0,
page_size=4K, total_pfns=1, complete the steps below, then write
start_pfn=1 and immediately begin the next iteration?  They've written
all three, though not all on the current iteration, does that count?
Furthermore, could the user simply re-read copied_pfns to determine if
anything in the previously setup range has been re-dirtied?

IOW, are these three "registers" sticky or do the below operations
invalidate them?  If they're invalidated, then there needs to be a
mechanism to generate an error, such as below.

> + * b. read copied_pfns. Vendor driver should take one of the below action:
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
> + *       doesn't have any page to report dirty in given range or rest of the
> + *       range. Exit the loop.
> + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> + *       pages dirty for given range or rest of the range. User space
> + *       application mark all pages in the range as dirty and exit the loop.
> + *     - Vendor driver should return copied_pfns and provide bitmap for
> + *       copied_pfn in migration region.

Read returns errno if the pre-requisite registers are not valid?

> + * c. read data_offset, where vendor driver has written bitmap.
> + * d. read bitmap from the migration region from data_offset.
> + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)

It seems like the intent here is that the user effectively does:

start_pfn += copied_pfns
total_pfns -= copied_pfns
page_size = page_size?

But are they under any obligation to do so?

Also same question above regarding data availability/life-cycle.  Is
the vendor driver responsible for making the data available
indefinitely?  Seems it's only released at the next iteration, or
re-use of the data area for another operation, or clearing of the
_SAVING state bit.
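
Purely as a sketch of the iteration inferred above (not an authoritative
reading of the interface), the dirty-bitmap query loop could look like the
following; MIG_FIELD()/read_u64() are the offsetof/pread helpers from the
earlier sketch, write_u64() is the obvious pwrite() counterpart, and
mark_all_dirty()/merge_bitmap() are placeholders for the user application's
own bookkeeping.

static int query_dirty_pages(int fd, off_t region, uint64_t start_pfn,
                             uint64_t total_pfns, uint64_t page_size)
{
    uint64_t copied, data_offset;

    while (total_pfns) {
        /* step a: program the requested range */
        write_u64(fd, region + MIG_FIELD(start_pfn), start_pfn);
        write_u64(fd, region + MIG_FIELD(page_size), page_size);
        write_u64(fd, region + MIG_FIELD(total_pfns), total_pfns);

        /* step b: how many pfns worth of bitmap did the driver produce? */
        if (read_u64(fd, region + MIG_FIELD(copied_pfns), &copied))
            return -1;
        if (copied == VFIO_DEVICE_DIRTY_PFNS_NONE)
            break;                   /* nothing dirty in the remaining range */
        if (copied == VFIO_DEVICE_DIRTY_PFNS_ALL) {
            mark_all_dirty(start_pfn, total_pfns);
            break;
        }

        /* steps c and d: fetch the bitmap covering 'copied' pfns */
        if (read_u64(fd, region + MIG_FIELD(data_offset), &data_offset))
            return -1;
        merge_bitmap(fd, region + data_offset, start_pfn, copied);

        /* step e: advance -- exactly the bookkeeping questioned above */
        start_pfn += copied;
        total_pfns -= copied;
    }
    return 0;
}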

> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer.
> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.

Additionally, implicit synchronization between _SAVING and _RESUMING
ends within the vendor driver is assumed.

Are there any assumptions we haven't covered with respect to mmaps?
For instance, can the user setup mmaps at any time or only during
certain device states?  Are there recommended best practices for users
to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
mechanism, it might be nice to use it when either of these bits are
cleared.  Thanks,

Alex
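
(Illustrative sketch only, reusing the MIG_FIELD()/read_u64()/write_u64()
helpers assumed in the earlier sketches: a _RESUMING loop following the quoted
steps a-c, where next_chunk() is a placeholder that returns the next opaque
chunk, in order, from the incoming migration stream.)

static int resume_device_data(int fd, off_t region)
{
    uint64_t size, data_offset;
    const void *buf;

    while ((buf = next_chunk(&size)) != NULL) {
        /* step a: where the vendor driver wants this chunk written */
        if (read_u64(fd, region + MIG_FIELD(data_offset), &data_offset))
            return -1;
        /* step b: write the opaque data in the order it was received */
        if (pwrite(fd, buf, size, region + data_offset) < 0)
            return -1;
        /* step c: writing data_size tells the driver the chunk is staged */
        write_u64(fd, region + MIG_FIELD(data_size), size);
    }
    return 0;
}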

> + */
> +
> +struct vfio_device_migration_info {
> +        __u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +                                     VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> +                                     VFIO_DEVICE_STATE_RESUMING)
> +        __u32 reserved;
> +        __u64 pending_bytes;
> +        __u64 data_offset;
> +        __u64 data_size;
> +        __u64 start_pfn;
> +        __u64 page_size;
> +        __u64 total_pfns;
> +        __u64 copied_pfns;
> +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
Tian, Kevin Aug. 30, 2019, 7:25 a.m. UTC | #2
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, August 29, 2019 4:51 AM
> 
> On Tue, 27 Aug 2019 00:25:41 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _STOPPED => All bits 0 indicates VFIO device stopped.
> >     _RUNNING => Normal VFIO device running state.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> start
> >                           saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> >                           save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits
> are set
> >     Bits 3 - 31 are reserved for future use. User should perform
> >     read-modify-write operation on this field.
> > - Defined vfio_device_migration_info structure which will be placed at 0th
> >   offset of migration region to get/set VFIO device related information.
> >   Defined members of structure and usage on read/write access:
> >     * device_state: (read/write)
> >         To convey VFIO device state to be transitioned to. Only 3 bits are used
> >         as of now, Bits 3 - 31 are reserved for future use.
> >     * pending bytes: (read only)
> >         To get pending bytes yet to be migrated for VFIO device.
> >     * data_offset: (read only)
> >         To get data offset in migration region from where data exist during
> >         _SAVING, from where data should be written by user space application
> >         during _RESUMING state and while read dirty pages bitmap.
> >     * data_size: (read/write)
> >         To get and set size of data copied in migration region during _SAVING
> >         and _RESUMING state.
> >     * start_pfn, page_size, total_pfns: (write only)
> >         To get bitmap of dirty pages from vendor driver from given
> >         start address for total_pfns.
> >     * copied_pfns: (read only)
> >         To get number of pfns bitmap copied in migration region.
> >         Vendor driver should copy the bitmap with bits set only for
> >         pages to be marked dirty in migration region. Vendor driver
> >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> dirty in
> >         requested range. Vendor driver should return
> VFIO_DEVICE_DIRTY_PFNS_ALL
> >         to mark all pages in the section as dirty.
> >
> > Migration region looks like:
> >  ------------------------------------------------------------------
> > |vfio_device_migration_info|    data section                      |
> > |                          |     ///////////////////////////////  |
> >  ------------------------------------------------------------------
> >  ^                              ^                              ^
> >  offset 0-trapped part        data_offset                 data_size
> >
> > Data section is always followed by vfio_device_migration_info
> > structure in the region, so data_offset will always be non-0.
> > Offset from where data is copied is decided by kernel driver, data
> > section can be trapped or mapped depending on how kernel driver
> > defines data section. If mmapped, then data_offset should be page
> > aligned, where as initial section which contain vfio_device_migration_info
> > structure might not end at offset which is page aligned.
> > Data_offset can be same or different for device data and dirty pages bitmap.
> > Vendor driver should decide whether to partition data section and how to
> > partition the data section. Vendor driver should return data_offset
> > accordingly.
> >
> > For user application, data is opaque. User should write data in the same
> > order as received.
> >
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  linux-headers/linux/vfio.h | 148
> +++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 148 insertions(+)
> >
> > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > index 24f505199f83..4bc0236b0898 100644
> > --- a/linux-headers/linux/vfio.h
> > +++ b/linux-headers/linux/vfio.h
> > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> >   */
> >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> >
> > +/* Migration region type and sub-type */
> > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > +
> > +/**
> > + * Structure vfio_device_migration_info is placed at 0th offset of
> > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> related migration
> > + * information. Field accesses from this structure are only supported at
> their
> > + * native width and alignment, otherwise the result is undefined and
> vendor
> > + * drivers should return an error.
> > + *
> > + * device_state: (read/write)
> > + *      To indicate vendor driver the state VFIO device should be
> transitioned
> > + *      to. If device state transition fails, write on this field return error.
> > + *      It consists of 3 bits:
> > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> > + *        _STOPPED state. When device is changed to _STOPPED, driver should
> stop
> > + *        device before write() returns.
> > + *      - If bit 1 set, indicates _SAVING state.
> > + *      - If bit 2 set, indicates _RESUMING state.

please add a few words to explain _SAVING and _RESUMING, similar to 
what you did for _RUNNING.

> > + *      Bits 3 - 31 are reserved for future use. User should perform
> > + *      read-modify-write operation on this field.
> > + *      _SAVING and _RESUMING bits set at the same time is invalid state.

what about _RUNNING | _RESUMING? Is it allowed?

> > + *
> > + * pending bytes: (read only)
> > + *      Number of pending bytes yet to be migrated from vendor driver
> > + *
> > + * data_offset: (read only)
> > + *      User application should read data_offset in migration region from
> where
> > + *      user application should read device data during _SAVING state or
> write
> > + *      device data during _RESUMING state or read dirty pages bitmap. See
> below
> > + *      for detail of sequence to be followed.
> > + *
> > + * data_size: (read/write)
> > + *      User application should read data_size to get size of data copied in
> > + *      migration region during _SAVING state and write size of data copied
> in
> > + *      migration region during _RESUMING state.
> > + *
> > + * start_pfn: (write only)
> > + *      Start address pfn to get bitmap of dirty pages from vendor driver
> duing
> > + *      _SAVING state.
> > + *
> > + * page_size: (write only)
> > + *      User application should write the page_size of pfn.
> > + *
> > + * total_pfns: (write only)
> > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > + *
> > + * copied_pfns: (read only)

'copied' gives the impression that the page content is copied. What about
'dirty_pfns'? BTW, can this field be merged with total_pfns?

> > + *      pfn count for which dirty bitmap is copied to migration region.
> > + *      Vendor driver should copy the bitmap with bits set only for pages to
> be
> > + *      marked dirty in migration region.
> > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> none of the
> > + *        pages are dirty in requested range or rest of the range.
> > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark
> all
> > + *        pages dirty in the given range or rest of the range.
> > + *      - Vendor driver should return pfn count for which bitmap is written in
> > + *        the region.
> > + *
> > + * Migration region looks like:
> > + *  ------------------------------------------------------------------
> > + * |vfio_device_migration_info|    data section                      |
> > + * |                          |     ///////////////////////////////  |
> > + * ------------------------------------------------------------------
> > + *   ^                              ^                             ^
> > + *  offset 0-trapped part        data_offset                 data_size

'data_size' -> "data_offset + data_size"

> > + *
> > + * Data section is always followed by vfio_device_migration_info structure

Data section is always following ..., or ... structure is always followed by
data section.

> > + * in the region, so data_offset will always be non-0. Offset from where
> data
> > + * is copied is decided by kernel driver, data section can be trapped or
> > + * mapped or partitioned, depending on how kernel driver defines data
> section.
> > + * Data section partition can be defined as mapped by sparse mmap
> capability.
> > + * If mmapped, then data_offset should be page aligned, where as initial
> section
> > + * which contain vfio_device_migration_info structure might not end at
> offset
> > + * which is page aligned.
> > + * Data_offset can be same or different for device data and dirty pages
> bitmap.
> > + * Vendor driver should decide whether to partition data section and how
> to
> > + * partition the data section. Vendor driver should return data_offset
> > + * accordingly.

This lacks a high-level summary of how to differentiate reading device
state from reading the dirty page bitmap, when both use the same interface
(data_offset) to convey information to user space.

From the sequence example below, it looks like reading device state is
initiated by reading pending_bytes, while reading the dirty bitmap is
initiated by writing start_pfn. With a data region shared between the two
operations, they have to be mutually exclusive, i.e. one must wait for the
other to complete. Even when the region is partitioned, data_offset itself
could be raced if pending_bytes and start_pfn are accessed at the same time.
How do we expect the application to cope with that? Isn't such a design too
limiting?

Since you anyway invent different sets of fields for the two operations, why
not force a partitioned layout and introduce a separate data_offset field for
each? This way the application is free to intermix device state and dirty
page collection for whatever it needs (a purely illustrative sketch of such a
layout follows).
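
(Purely illustrative, not part of the proposed uAPI: one way the partitioned
layout suggested above, with a separate offset per use, might look.)

/* hypothetical variant of the proposed layout, for illustration only */
struct vfio_device_migration_info {
        __u32 device_state;
        __u32 reserved;
        __u64 pending_bytes;
        __u64 device_data_offset;   /* device state save/restore area */
        __u64 device_data_size;
        __u64 dirty_bitmap_offset;  /* dirty page bitmap area */
        __u64 start_pfn;
        __u64 page_size;
        __u64 total_pfns;
        __u64 copied_pfns;
};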

> > + *
> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-
> copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging
> buffer.
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e.
> 
> Something needs to be said here about the availability of the data, for
> example, what indicates to the vendor driver that the above operation is
> complete?  Is the data immutable?

I guess the vendor driver just continues to track pending_bytes for dirtied
device state until it exits the _SAVING state. Data is copied only when
pending_bytes is read by userspace. Copied data is immutable if the data
region is mmapped. But yes, this part needs clarification.

> 
> > + *
> > + * To copy system memory content during migration, vendor driver should
> be able
> > + * to report system memory pages which are dirtied by that driver. For such
> > + * dirty page reporting, user application should query for a range of GFNs
> > + * relative to device address space (IOVA), then vendor driver should
> provide
> > + * the bitmap of pages from this range which are dirtied by him through
> > + * migration region where each bit represents a page and bit set to 1
> represents
> > + * that the page is dirty.
> > + * User space application should take care of copying content of system
> memory
> > + * for those pages.
> 
> Can we say that device state and dirty pfn operations on the data
> area may be intermixed in any order the user chooses?

This part is very opaque in the previous description. Now I'm inclined to
vote for a no-intermix design. There is really no benefit in forcing a
dependency between retrieving device state and the dirty page bitmap.

> 
> Should we specify that bits accumulate since EITHER a) _SAVING state is
> enabled or b) a given pfn was last reported via the below sequence (ie.
> dirty bits are cleared once reported)?
> 
> How does QEMU handle the fact that IOVAs are potentially dynamic while
> performing the live portion of a migration?  For example, each time a
> guest driver calls dma_map_page() or dma_unmap_page(), a
> MemoryRegionSection pops in or out of the AddressSpace for the device
> (I'm assuming a vIOMMU where the device AddressSpace is not
> system_memory).  I don't see any QEMU code that intercepts that change
> in the AddressSpace such that the IOVA dirty pfns could be recorded and
> translated to GFNs.  The vendor driver can't track these beyond getting
> an unmap notification since it only knows the IOVA pfns, which can be
> re-used with different GFN backing.  Once the DMA mapping is torn down,
> it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> please help me find the code that handles it.

I'm curious about this part too. Interestingly, I didn't find any log_sync
callback registered by emulated devices in Qemu. It looks like dirty pages
from emulated DMAs are recorded in some implicit way. But KVM always
reports dirty pages by GFN instead of IOVA, regardless of the presence of a
vIOMMU. If Qemu also tracks dirty pages by GFN for emulated DMAs
(translation can be done when the DMA happens), then we don't need to
worry about the transient mapping from IOVA to GFN. Along this line we
also want a GFN-based dirty bitmap to be reported through VFIO,
similar to what KVM does. A vendor driver then needs to translate
from IOVA to HVA to GFN when tracking DMA activity on VFIO
devices. IOVA->HVA is provided by VFIO; for HVA->GFN, it can be
provided by KVM, but I'm not sure whether it's exposed now.

> 
> > + *
> > + * Steps to get dirty page bitmap:
> > + * a. write start_pfn, page_size and total_pfns.
> 
> This is not well specified.  Is it intended that the user write all
> three of these on every iteration, or could they write start_pfn=0,
> page_size=4K, total_pfns=1, complete the steps below, then write
> start_pfn=1 and immediately begin the next iteration?  They've written
> all three, though not all on the current iteration, does that count?
> Furthermore, could the user simple re-read copied_pfns to determine if
> anything in the previously setup range has been re-dirtied?
> 
> IOW, are these three "registers" sticky or do the below operations
> invalidate them?  If they're invalidated, then there needs to be a
> mechanism to generate an error, such as below.
> 
> > + * b. read copied_pfns. Vendor driver should take one of the below action:
> > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> driver
> > + *       doesn't have any page to report dirty in given range or rest of the
> > + *       range. Exit the loop.
> > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark
> all
> > + *       pages dirty for given range or rest of the range. User space
> > + *       application mark all pages in the range as dirty and exit the loop.
> > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > + *       copied_pfn in migration region.
> 
> Read returns errno if the pre-requisite registers are not valid?
> 
> > + * c. read data_offset, where vendor driver has written bitmap.
> > + * d. read bitmap from the migration region from data_offset.
> > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
> 
> It seems like the intent here is that the user effectively does:
> 
> start_pfn += copied_pfns
> total_pfns -= copied_pfns
> page_size = page_size?
> 
> But are they under any obligation to do so?
> 
> Also same question above regarding data availability/life-cycle.  Is
> the vendor driver responsible for making the data available
> indefinitely?  Seems it's only released at the next iteration, or
> re-use of the data area for another operation, or clearing of the
> _SAVING state bit.
> 
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer.
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> 
> Additionally, implicit synchronization between _SAVING and _RESUMING
> ends within the vendor driver is assumed.
> 
> Are there any assumptions we haven't covered with respect to mmaps?
> For instance, can the user setup mmaps at any time or only during
> certain device states?  Are there recommended best practices for users
> to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> mechanism, it might be nice to use it when either of these bits are
> cleared.  Thanks,

Another open question for mmaps is how many pages to map. copied_pfns
carries only the number of copied pages. They are scattered in the bitmap,
while the application doesn't know how long the bitmap is. Do we
expect the application to map pages one by one, until the number of scanned
dirty pages equals copied_pfns?

> 
> Alex
> 
> > + */
> > +
> > +struct vfio_device_migration_info {
> > +        __u32 device_state;         /* VFIO device state */
> > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > +                                     VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)
> > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > +                                     VFIO_DEVICE_STATE_RESUMING)
> > +        __u32 reserved;
> > +        __u64 pending_bytes;
> > +        __u64 data_offset;
> > +        __u64 data_size;
> > +        __u64 start_pfn;
> > +        __u64 page_size;
> > +        __u64 total_pfns;
> > +        __u64 copied_pfns;
> > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > +} __attribute__((packed));
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be
> mmapped
> >   * which allows direct access to non-MSIX registers which happened to be
> within
Tian, Kevin Aug. 30, 2019, 8:06 a.m. UTC | #3
> From: Tian, Kevin
> Sent: Friday, August 30, 2019 3:26 PM
> 
[...]
> > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > performing the live portion of a migration?  For example, each time a
> > guest driver calls dma_map_page() or dma_unmap_page(), a
> > MemoryRegionSection pops in or out of the AddressSpace for the device
> > (I'm assuming a vIOMMU where the device AddressSpace is not
> > system_memory).  I don't see any QEMU code that intercepts that change
> > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > translated to GFNs.  The vendor driver can't track these beyond getting
> > an unmap notification since it only knows the IOVA pfns, which can be
> > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > please help me find the code that handles it.
> 
> I'm curious about this part too. Interestingly, I didn't find any log_sync
> callback registered by emulated devices in Qemu. Looks dirty pages
> by emulated DMAs are recorded in some implicit way. But KVM always
> reports dirty page in GFN instead of IOVA, regardless of the presence of
> vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
>  (translation can be done when DMA happens), then we don't need
> worry about transient mapping from IOVA to GFN. Along this way we
> also want GFN-based dirty bitmap being reported through VFIO,
> similar to what KVM does. For vendor drivers, it needs to translate
> from IOVA to HVA to GFN when tracking DMA activities on VFIO
> devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> provided by KVM but I'm not sure whether it's exposed now.
> 

HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.

The above flow works for a software-tracked dirty mechanism, e.g. in
KVMGT, where the GFN-based 'dirty' bit is marked when a guest page is
mapped into the device MMU. The IOVA->HPA->GFN translation is done
at that time, and is thus immune to further IOVA->GFN changes.

When the hardware IOMMU supports the D-bit in 2nd-level translation (e.g.
VT-d rev3.0), there are two scenarios:

1) nested translation: the guest manages the 1st-level translation
(IOVA->GPA) and the host manages the 2nd-level translation (GPA->HPA). The
2nd level is not affected by guest mapping operations, so it's OK for the
IOMMU driver to retrieve GFN-based dirty pages by directly scanning the
2nd-level structure upon request from user space.

2) shadowed translation (IOVA->HPA) in the 2nd level: in this case the dirty
information is tied to the IOVA, and the IOMMU driver is expected to maintain
an internal dirty bitmap. Upon any IOVA->GPA change notification from VFIO,
the IOMMU driver should flush the dirty status of the affected 2nd-level
entries to the internal GFN-based bitmap. At this time, again an
IOVA->HVA->GPA translation is required for GFN-based recording. When
userspace queries the dirty bitmap, the IOMMU driver needs to flush the
latest 2nd-level dirty status to the internal bitmap, which is then copied
to user space.

Given the trickiness of 2), we aim to enable 1) in the intel-iommu driver.

Thanks
Kevin
Alex Williamson Aug. 30, 2019, 4:15 p.m. UTC | #4
On Fri, 30 Aug 2019 07:25:59 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, August 29, 2019 4:51 AM
> > 
> > On Tue, 27 Aug 2019 00:25:41 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > - Defined MIGRATION region type and sub-type.
> > > - Used 3 bits to define VFIO device states.
> > >     Bit 0 => _RUNNING
> > >     Bit 1 => _SAVING
> > >     Bit 2 => _RESUMING
> > >     Combination of these bits defines VFIO device's state during migration
> > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > >     _RUNNING => Normal VFIO device running state.
> > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but  
> > start  
> > >                           saving state of device i.e. pre-copy state
> > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > >                           save device state,i.e. stop-n-copy state
> > >     _RESUMING => VFIO device resuming state.
> > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING bits  
> > are set  
> > >     Bits 3 - 31 are reserved for future use. User should perform
> > >     read-modify-write operation on this field.
> > > - Defined vfio_device_migration_info structure which will be placed at 0th
> > >   offset of migration region to get/set VFIO device related information.
> > >   Defined members of structure and usage on read/write access:
> > >     * device_state: (read/write)
> > >         To convey VFIO device state to be transitioned to. Only 3 bits are used
> > >         as of now, Bits 3 - 31 are reserved for future use.
> > >     * pending bytes: (read only)
> > >         To get pending bytes yet to be migrated for VFIO device.
> > >     * data_offset: (read only)
> > >         To get data offset in migration region from where data exist during
> > >         _SAVING, from where data should be written by user space application
> > >         during _RESUMING state and while read dirty pages bitmap.
> > >     * data_size: (read/write)
> > >         To get and set size of data copied in migration region during _SAVING
> > >         and _RESUMING state.
> > >     * start_pfn, page_size, total_pfns: (write only)
> > >         To get bitmap of dirty pages from vendor driver from given
> > >         start address for total_pfns.
> > >     * copied_pfns: (read only)
> > >         To get number of pfns bitmap copied in migration region.
> > >         Vendor driver should copy the bitmap with bits set only for
> > >         pages to be marked dirty in migration region. Vendor driver
> > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages  
> > dirty in  
> > >         requested range. Vendor driver should return  
> > VFIO_DEVICE_DIRTY_PFNS_ALL  
> > >         to mark all pages in the section as dirty.
> > >
> > > Migration region looks like:
> > >  ------------------------------------------------------------------
> > > |vfio_device_migration_info|    data section                      |
> > > |                          |     ///////////////////////////////  |
> > >  ------------------------------------------------------------------
> > >  ^                              ^                              ^
> > >  offset 0-trapped part        data_offset                 data_size
> > >
> > > Data section is always followed by vfio_device_migration_info
> > > structure in the region, so data_offset will always be non-0.
> > > Offset from where data is copied is decided by kernel driver, data
> > > section can be trapped or mapped depending on how kernel driver
> > > defines data section. If mmapped, then data_offset should be page
> > > aligned, where as initial section which contain vfio_device_migration_info
> > > structure might not end at offset which is page aligned.
> > > Data_offset can be same or different for device data and dirty pages bitmap.
> > > Vendor driver should decide whether to partition data section and how to
> > > partition the data section. Vendor driver should return data_offset
> > > accordingly.
> > >
> > > For user application, data is opaque. User should write data in the same
> > > order as received.
> > >
> > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > ---
> > >  linux-headers/linux/vfio.h | 148  
> > +++++++++++++++++++++++++++++++++++++++++++++  
> > >  1 file changed, 148 insertions(+)
> > >
> > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > index 24f505199f83..4bc0236b0898 100644
> > > --- a/linux-headers/linux/vfio.h
> > > +++ b/linux-headers/linux/vfio.h
> > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > >   */
> > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > >
> > > +/* Migration region type and sub-type */
> > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > +
> > > +/**
> > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device  
> > related migration  
> > > + * information. Field accesses from this structure are only supported at  
> > their  
> > > + * native width and alignment, otherwise the result is undefined and  
> > vendor  
> > > + * drivers should return an error.
> > > + *
> > > + * device_state: (read/write)
> > > + *      To indicate vendor driver the state VFIO device should be  
> > transitioned  
> > > + *      to. If device state transition fails, write on this field return error.
> > > + *      It consists of 3 bits:
> > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> > > + *        _STOPPED state. When device is changed to _STOPPED, driver should  
> > stop  
> > > + *        device before write() returns.
> > > + *      - If bit 1 set, indicates _SAVING state.
> > > + *      - If bit 2 set, indicates _RESUMING state.  
> 
> please add a few words to explain _SAVING and _RESUMING, similar to 
> what you did for _RUNNING.
> 
> > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > + *      read-modify-write operation on this field.
> > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.  
> 
> what about _RUNNING | _RESUMING? Is it allowed?

I think this would be post-copy migration, which I assume is
theoretically supportable, but not necessarily (and not currently)
supported by vendor drivers.  I'm not sure how a vendor driver reports
the lack of this support though.

> > > + *
> > > + * pending bytes: (read only)
> > > + *      Number of pending bytes yet to be migrated from vendor driver
> > > + *
> > > + * data_offset: (read only)
> > > + *      User application should read data_offset in migration region from  
> > where  
> > > + *      user application should read device data during _SAVING state or  
> > write  
> > > + *      device data during _RESUMING state or read dirty pages bitmap. See  
> > below  
> > > + *      for detail of sequence to be followed.
> > > + *
> > > + * data_size: (read/write)
> > > + *      User application should read data_size to get size of data copied in
> > > + *      migration region during _SAVING state and write size of data copied  
> > in  
> > > + *      migration region during _RESUMING state.
> > > + *
> > > + * start_pfn: (write only)
> > > + *      Start address pfn to get bitmap of dirty pages from vendor driver  
> > duing  
> > > + *      _SAVING state.
> > > + *
> > > + * page_size: (write only)
> > > + *      User application should write the page_size of pfn.
> > > + *
> > > + * total_pfns: (write only)
> > > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > > + *
> > > + * copied_pfns: (read only)  
> 
> 'copied' gives the impression as if the page content is copied. what about
> 'dirty_pfns'? btw can this field merge with total_pfns?

I don't agree with the implication that copied implies page content;
we're working with a bitmap, so this is easily disproved.  I think the
intent of 'copied' is to differentiate that 'total' is how many the
user asked for, and 'copied' is how many they got.  It's not obvious to
me whether there's anything that would prevent duplicate use of one
register, ie. write total, read copied.  I'm not sure I see a huge
advantage to it either though.
 
> > > + *      pfn count for which dirty bitmap is copied to migration region.
> > > + *      Vendor driver should copy the bitmap with bits set only for pages to  
> > be  
> > > + *      marked dirty in migration region.
> > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if  
> > none of the  
> > > + *        pages are dirty in requested range or rest of the range.
> > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark  
> > all  
> > > + *        pages dirty in the given range or rest of the range.
> > > + *      - Vendor driver should return pfn count for which bitmap is written in
> > > + *        the region.
> > > + *
> > > + * Migration region looks like:
> > > + *  ------------------------------------------------------------------
> > > + * |vfio_device_migration_info|    data section                      |
> > > + * |                          |     ///////////////////////////////  |
> > > + * ------------------------------------------------------------------
> > > + *   ^                              ^                             ^
> > > + *  offset 0-trapped part        data_offset                 data_size  
> 
> 'data_size' -> "data_offset + data_size"

The diagram sort of implies that, but I don't think this is correct.  I
believe it's more like:

data_offset = data_start_offset
data_size = data_end_offset - data_start_offset

> > > + *
> > > + * Data section is always followed by vfio_device_migration_info structure  
> 
> Data section is always following ..., or ... structure is always followed by
> data section.

The vfio_device_migration_info struct is always at the zeroth offset of the
region; the data section is theoretically the remainder, but the used
range of the data section is defined by the vendor driver via
data_offset.

> > > + * in the region, so data_offset will always be non-0. Offset from where  
> > data  
> > > + * is copied is decided by kernel driver, data section can be trapped or
> > > + * mapped or partitioned, depending on how kernel driver defines data  
> > section.  
> > > + * Data section partition can be defined as mapped by sparse mmap  
> > capability.  
> > > + * If mmapped, then data_offset should be page aligned, where as initial  
> > section  
> > > + * which contain vfio_device_migration_info structure might not end at  
> > offset  
> > > + * which is page aligned.
> > > + * Data_offset can be same or different for device data and dirty pages  
> > bitmap.  
> > > + * Vendor driver should decide whether to partition data section and how  
> > to  
> > > + * partition the data section. Vendor driver should return data_offset
> > > + * accordingly.  
> 
> Here lacks of high-level summary about how to differentiate reading device
> state from reading dirty page bitmap, when both use the same interface 
> (data_offset) to convey information to user space.
> 
> From below sequence example, looks reading device state is initiated by
> reading pending_bytes, while reading dirty bitmap is marked by writing
> start_pfn. In case of shared data region between two operations, they have
> to be mutually-exclusive i.e. one must wait for the other to complete. Even
> when the region is partitioned, data_offset itself could be raced if pending_
> bytes and start_pfn are accessed at the same time. How do we expect the
> application to cope with it? Isn't it too limiting with such design?
> 
> Since you anyway invent different sets of fields for two operations, why not
> forcing partitioned flavor and then introduce two data_offset fields for each
> other? This way the application is free to intermix device state and dirty
> page collection for whatever needs.

AIUI, it's the user's responsibility to consume the data they've
asked to be provided before they perform the next operation, but the
user can alternate between device state and dirty pages at will.  I
agree though that the lifecycle of the data with regard to the vendor
driver is lacking.  Nothing seems to indicate to the vendor driver that
the data is consumed other than starting the next operation or turning
off _SAVING.

> > > + *
> > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-  
> > copy phase  
> > > + * and for _SAVING device state or stop-and-copy phase:
> > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > + * b. read data_offset, indicates kernel driver to write data to staging  
> > buffer.  
> > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > + *    migration region.
> > > + * d. read data_size bytes of data from data_offset in the migration region.
> > > + * e. process data.
> > > + * f. Loop through a to e.  
> > 
> > Something needs to be said here about the availability of the data, for
> > example, what indicates to the vendor driver that the above operation is
> > complete?  Is the data immutable?  
> 
> I guess the vendor driver just continues to track pending_bytes for dirtied 
> device state, until exiting _SAVING state. Data is copied only when
> pending_bytes is read by userspace. Copied data is immutable if data region
> is mmapped. But yes, this part needs clarification.
> 
> >   
> > > + *
> > > + * To copy system memory content during migration, vendor driver should  
> > be able  
> > > + * to report system memory pages which are dirtied by that driver. For such
> > > + * dirty page reporting, user application should query for a range of GFNs
> > > + * relative to device address space (IOVA), then vendor driver should  
> > provide  
> > > + * the bitmap of pages from this range which are dirtied by him through
> > > + * migration region where each bit represents a page and bit set to 1  
> > represents  
> > > + * that the page is dirty.
> > > + * User space application should take care of copying content of system  
> > memory  
> > > + * for those pages.  
> > 
> > Can we say that device state and dirty pfn operations on the data
> > area may be intermixed in any order the user chooses?  
> 
> this part is very opaque from previous description. Now I'm inclined to
> vote for no-intermix design. There is really no good to force some
> dependency between retrieving device state and dirty page bitmap.

I'm confused; the state of the device continues to change so long as it
is _RUNNING.  The device may also continue to dirty pages in the
_RUNNING state.  So how could we not have both occurring at the same
time and therefore must support the user deciding which to sample in
each iteration?

> > Should we specify that bits accumulate since EITHER a) _SAVING state is
> > enabled or b) a given pfn was last reported via the below sequence (ie.
> > dirty bits are cleared once reported)?
> > 
> > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > performing the live portion of a migration?  For example, each time a
> > guest driver calls dma_map_page() or dma_unmap_page(), a
> > MemoryRegionSection pops in or out of the AddressSpace for the device
> > (I'm assuming a vIOMMU where the device AddressSpace is not
> > system_memory).  I don't see any QEMU code that intercepts that change
> > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > translated to GFNs.  The vendor driver can't track these beyond getting
> > an unmap notification since it only knows the IOVA pfns, which can be
> > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > please help me find the code that handles it.  
> 
> I'm curious about this part too. Interestingly, I didn't find any log_sync
> callback registered by emulated devices in Qemu. Looks dirty pages
> by emulated DMAs are recorded in some implicit way. But KVM always
> reports dirty page in GFN instead of IOVA, regardless of the presence of
> vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
>  (translation can be done when DMA happens), then we don't need 
> worry about transient mapping from IOVA to GFN. Along this way we
> also want GFN-based dirty bitmap being reported through VFIO, 
> similar to what KVM does. For vendor drivers, it needs to translate
> from IOVA to HVA to GFN when tracking DMA activities on VFIO 
> devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> provided by KVM but I'm not sure whether it's exposed now.
> 
> >   
> > > + *
> > > + * Steps to get dirty page bitmap:
> > > + * a. write start_pfn, page_size and total_pfns.  
> > 
> > This is not well specified.  Is it intended that the user write all
> > three of these on every iteration, or could they write start_pfn=0,
> > page_size=4K, total_pfns=1, complete the steps below, then write
> > start_pfn=1 and immediately begin the next iteration?  They've written
> > all three, though not all on the current iteration, does that count?
> > Furthermore, could the user simple re-read copied_pfns to determine if
> > anything in the previously setup range has been re-dirtied?
> > 
> > IOW, are these three "registers" sticky or do the below operations
> > invalidate them?  If they're invalidated, then there needs to be a
> > mechanism to generate an error, such as below.
> >   
> > > + * b. read copied_pfns. Vendor driver should take one of the below action:
> > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if  
> > driver  
> > > + *       doesn't have any page to report dirty in given range or rest of the
> > > + *       range. Exit the loop.
> > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark  
> > all  
> > > + *       pages dirty for given range or rest of the range. User space
> > > + *       application mark all pages in the range as dirty and exit the loop.
> > > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > > + *       copied_pfn in migration region.  
> > 
> > Read returns errno if the pre-requisite registers are not valid?
> >   
> > > + * c. read data_offset, where vendor driver has written bitmap.
> > > + * d. read bitmap from the migration region from data_offset.
> > > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)  
> > 
> > It seems like the intent here is that the user effectively does:
> > 
> > start_pfn += copied_pfns
> > total_pfns -= copied_pfns
> > page_size = page_size?
> > 
> > But are they under any obligation to do so?
> > 
> > Also same question above regarding data availability/life-cycle.  Is
> > the vendor driver responsible for making the data available
> > indefinitely?  Seems it's only released at the next iteration, or
> > re-use of the data area for another operation, or clearing of the
> > _SAVING state bit.
> >   
> > > + * Sequence to be followed while _RESUMING device state:
> > > + * While data for this device is available, repeat below steps:
> > > + * a. read data_offset from where user application should write data.
> > > + * b. write data of data_size to migration region from data_offset.
> > > + * c. write data_size which indicates vendor driver that data is written in
> > > + *    staging buffer.
> > > + *
> > > + * For user application, data is opaque. User should write data in the same
> > > + * order as received.  
> > 
> > Additionally, implicit synchronization between _SAVING and _RESUMING
> > ends within the vendor driver is assumed.
> > 
> > Are there any assumptions we haven't covered with respect to mmaps?
> > For instance, can the user setup mmaps at any time or only during
> > certain device states?  Are there recommended best practices for users
> > to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> > mechanism, it might be nice to use it when either of these bits are
> > cleared.  Thanks,  
> 
> another open for mmaps is how many pages to map. copied_pfns 
> carries only the number of copied pages. They scatter in the bitmap
> while application doesn't know how long the bitmap is. Do we
> expect the application to map pages one-by-one, until scanned 
> dirty pages is equal to copied_pfns?

I think we expect userspace to mmap according to the sparse mmap
capability of the region.  The vendor driver then chooses whether to
expose the current data set within the mmap region, outside the mmap
region, or some mix of both, depending on their requirements.  Thanks,

Alex
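
For reference, the sparse mmap areas mentioned above are discovered
through the existing VFIO_DEVICE_GET_REGION_INFO ioctl and its
capability chain; a rough sketch of that walk (error handling and
cleanup omitted, migration_region_index assumed already known):

struct vfio_region_info *info = calloc(1, sizeof(*info));
struct vfio_info_cap_header *hdr;
struct vfio_region_info_cap_sparse_mmap *sparse;
uint32_t off, i;

info->argsz = sizeof(*info);
info->index = migration_region_index;
ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);
if (info->argsz > sizeof(*info)) {              /* capability chain present */
    info = realloc(info, info->argsz);
    ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, info);
}
if (info->flags & VFIO_REGION_INFO_FLAG_CAPS) {
    for (off = info->cap_offset; off; off = hdr->next) {
        hdr = (struct vfio_info_cap_header *)((char *)info + off);
        if (hdr->id != VFIO_REGION_INFO_CAP_SPARSE_MMAP)
            continue;
        sparse = (struct vfio_region_info_cap_sparse_mmap *)hdr;
        for (i = 0; i < sparse->nr_areas; i++)
            mmap(NULL, sparse->areas[i].size, PROT_READ | PROT_WRITE,
                 MAP_SHARED, device_fd,
                 info->offset + sparse->areas[i].offset);
    }
}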
 
> > > + */
> > > +
> > > +struct vfio_device_migration_info {
> > > +        __u32 device_state;         /* VFIO device state */
> > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > > +                                     VFIO_DEVICE_STATE_SAVING | \
> > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > +        __u32 reserved;
> > > +        __u64 pending_bytes;
> > > +        __u64 data_offset;
> > > +        __u64 data_size;
> > > +        __u64 start_pfn;
> > > +        __u64 page_size;
> > > +        __u64 total_pfns;
> > > +        __u64 copied_pfns;
> > > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > > +} __attribute__((packed));
> > > +
> > >  /*
> > >   * The MSIX mappable capability informs that MSIX data of a BAR can be  
> > mmapped  
> > >   * which allows direct access to non-MSIX registers which happened to be  
> > within  
>
Alex Williamson Aug. 30, 2019, 4:32 p.m. UTC | #5
On Fri, 30 Aug 2019 08:06:32 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Friday, August 30, 2019 3:26 PM
> >   
> [...]
> > > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > > performing the live portion of a migration?  For example, each time a
> > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > system_memory).  I don't see any QEMU code that intercepts that change
> > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > an unmap notification since it only knows the IOVA pfns, which can be
> > > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > please help me find the code that handles it.  
> > 
> > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > callback registered by emulated devices in Qemu. Looks dirty pages
> > by emulated DMAs are recorded in some implicit way. But KVM always
> > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> >  (translation can be done when DMA happens), then we don't need
> > worry about transient mapping from IOVA to GFN. Along this way we
> > also want GFN-based dirty bitmap being reported through VFIO,
> > similar to what KVM does. For vendor drivers, it needs to translate
> > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > provided by KVM but I'm not sure whether it's exposed now.
> >   
> 
> HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.

I thought it was bad enough that we have vendor drivers that depend on
KVM, but designing a vfio interface that only supports a KVM interface
is more undesirable.  I also note without comment that gfn_to_memslot()
is a GPL symbol.  Thanks,

Alex

> Above flow works for software-tracked dirty mechanism, e.g. in
> KVMGT, where GFN-based 'dirty' is marked when a guest page is 
> mapped into device mmu. IOVA->HPA->GFN translation is done 
> at that time, thus immune from further IOVA->GFN changes.
> 
> When hardware IOMMU supports D-bit in 2nd level translation (e.g.
> VT-d rev3.0), there are two scenarios:
> 
> 1) nested translation: guest manages 1st-level translation (IOVA->GPA)
> and host manages 2nd-level translation (GPA->HPA). The 2nd-level
> is not affected by guest mapping operations. So it's OK for IOMMU
> driver to retrieve GFN-based dirty pages by directly scanning the 2nd-
> level structure, upon request from user space. 
> 
> 2) shadowed translation (IOVA->HPA) in 2nd level: in such case the dirty
> information is tied to IOVA. the IOMMU driver is expected to maintain
> an internal dirty bitmap. Upon any change of IOVA->GPA notification
> from VFIO, the IOMMU driver should flush dirty status of affected 2nd-level
> entries to the internal GFN-based bitmap. At this time, again IOVA->HVA
> ->GPA translation required for GFN-based recording. When userspace   
> queries dirty bitmap, the IOMMU driver needs to flush latest 2nd-level 
> dirty status to internal bitmap, which is then copied to user space.
> 
> Given the trickiness of 2), we aim to enable 1) on intel-iommu driver.
> 
> Thanks
> Kevin
Tian, Kevin Sept. 3, 2019, 6:05 a.m. UTC | #6
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, August 31, 2019 12:15 AM
> 
> On Fri, 30 Aug 2019 07:25:59 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, August 29, 2019 4:51 AM
> > >
> > > On Tue, 27 Aug 2019 00:25:41 +0530
> > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > >
> > > > - Defined MIGRATION region type and sub-type.
> > > > - Used 3 bits to define VFIO device states.
> > > >     Bit 0 => _RUNNING
> > > >     Bit 1 => _SAVING
> > > >     Bit 2 => _RESUMING
> > > >     Combination of these bits defines VFIO device's state during migration
> > > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > > >     _RUNNING => Normal VFIO device running state.
> > > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running
> but
> > > start
> > > >                           saving state of device i.e. pre-copy state
> > > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > > >                           save device state,i.e. stop-n-copy state
> > > >     _RESUMING => VFIO device resuming state.
> > > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING
> bits
> > > are set
> > > >     Bits 3 - 31 are reserved for future use. User should perform
> > > >     read-modify-write operation on this field.
> > > > - Defined vfio_device_migration_info structure which will be placed at
> 0th
> > > >   offset of migration region to get/set VFIO device related information.
> > > >   Defined members of structure and usage on read/write access:
> > > >     * device_state: (read/write)
> > > >         To convey VFIO device state to be transitioned to. Only 3 bits are
> used
> > > >         as of now, Bits 3 - 31 are reserved for future use.
> > > >     * pending bytes: (read only)
> > > >         To get pending bytes yet to be migrated for VFIO device.
> > > >     * data_offset: (read only)
> > > >         To get data offset in migration region from where data exist during
> > > >         _SAVING, from where data should be written by user space
> application
> > > >         during _RESUMING state and while read dirty pages bitmap.
> > > >     * data_size: (read/write)
> > > >         To get and set size of data copied in migration region during
> _SAVING
> > > >         and _RESUMING state.
> > > >     * start_pfn, page_size, total_pfns: (write only)
> > > >         To get bitmap of dirty pages from vendor driver from given
> > > >         start address for total_pfns.
> > > >     * copied_pfns: (read only)
> > > >         To get number of pfns bitmap copied in migration region.
> > > >         Vendor driver should copy the bitmap with bits set only for
> > > >         pages to be marked dirty in migration region. Vendor driver
> > > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> > > dirty in
> > > >         requested range. Vendor driver should return
> > > VFIO_DEVICE_DIRTY_PFNS_ALL
> > > >         to mark all pages in the section as dirty.
> > > >
> > > > Migration region looks like:
> > > >  ------------------------------------------------------------------
> > > > |vfio_device_migration_info|    data section                      |
> > > > |                          |     ///////////////////////////////  |
> > > >  ------------------------------------------------------------------
> > > >  ^                              ^                              ^
> > > >  offset 0-trapped part        data_offset                 data_size
> > > >
> > > > Data section is always followed by vfio_device_migration_info
> > > > structure in the region, so data_offset will always be non-0.
> > > > Offset from where data is copied is decided by kernel driver, data
> > > > section can be trapped or mapped depending on how kernel driver
> > > > defines data section. If mmapped, then data_offset should be page
> > > > aligned, where as initial section which contain
> vfio_device_migration_info
> > > > structure might not end at offset which is page aligned.
> > > > Data_offset can be same or different for device data and dirty pages
> bitmap.
> > > > Vendor driver should decide whether to partition data section and how
> to
> > > > partition the data section. Vendor driver should return data_offset
> > > > accordingly.
> > > >
> > > > For user application, data is opaque. User should write data in the same
> > > > order as received.
> > > >
> > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > ---
> > > >  linux-headers/linux/vfio.h | 148
> > > +++++++++++++++++++++++++++++++++++++++++++++
> > > >  1 file changed, 148 insertions(+)
> > > >
> > > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > > index 24f505199f83..4bc0236b0898 100644
> > > > --- a/linux-headers/linux/vfio.h
> > > > +++ b/linux-headers/linux/vfio.h
> > > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > > >   */
> > > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > > >
> > > > +/* Migration region type and sub-type */
> > > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > > +
> > > > +/**
> > > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> > > related migration
> > > > + * information. Field accesses from this structure are only supported at
> > > their
> > > > + * native width and alignment, otherwise the result is undefined and
> > > vendor
> > > > + * drivers should return an error.
> > > > + *
> > > > + * device_state: (read/write)
> > > > + *      To indicate vendor driver the state VFIO device should be
> > > transitioned
> > > > + *      to. If device state transition fails, write on this field return error.
> > > > + *      It consists of 3 bits:
> > > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that
> indicates
> > > > + *        _STOPPED state. When device is changed to _STOPPED, driver
> should
> > > stop
> > > > + *        device before write() returns.
> > > > + *      - If bit 1 set, indicates _SAVING state.
> > > > + *      - If bit 2 set, indicates _RESUMING state.
> >
> > please add a few words to explain _SAVING and _RESUMING, similar to
> > what you did for _RUNNING.
> >
> > > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > > + *      read-modify-write operation on this field.
> > > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> >
> > what about _RUNNING | _RESUMING? Is it allowed?
> 
> I think this would be post-copy migration, which I assume is
> theoretically supportable, but not necessarily (and not currently)
> supported by vendor drivers.  I'm not sure how a vendor driver reports
> the lack of this support though.

Yan is working on post-copy now. I talked to her about this open
question; she will respond later after some thinking. Ideally we need a
way for user space to verify which state combinations are supported
before starting the post-copy process. It's not good to blindly start
post-copy and then fail because the destination cannot switch to the
desired state.

> 
> > > > + *
> > > > + * pending bytes: (read only)
> > > > + *      Number of pending bytes yet to be migrated from vendor driver
> > > > + *
> > > > + * data_offset: (read only)
> > > > + *      User application should read data_offset in migration region from
> > > where
> > > > + *      user application should read device data during _SAVING state or
> > > write
> > > > + *      device data during _RESUMING state or read dirty pages bitmap.
> See
> > > below
> > > > + *      for detail of sequence to be followed.
> > > > + *
> > > > + * data_size: (read/write)
> > > > + *      User application should read data_size to get size of data copied
> in
> > > > + *      migration region during _SAVING state and write size of data
> copied
> > > in
> > > > + *      migration region during _RESUMING state.
> > > > + *
> > > > + * start_pfn: (write only)
> > > > + *      Start address pfn to get bitmap of dirty pages from vendor driver
> > > duing
> > > > + *      _SAVING state.
> > > > + *
> > > > + * page_size: (write only)
> > > > + *      User application should write the page_size of pfn.
> > > > + *
> > > > + * total_pfns: (write only)
> > > > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > > > + *
> > > > + * copied_pfns: (read only)
> >
> > 'copied' gives the impression as if the page content is copied. what about
> > 'dirty_pfns'? btw can this field merge with total_pfns?
> 
> I don't agree with the implication that copied implies page content,
> we're working with a bitmap, so this is easily disproved.  I think the
> intent of 'copied' is to differentiate that 'total' is how many the
> user asked for, and 'copied' is how many they got.  It's not obvious to
> me whether there's anything that would prevent duplicate use of one
> register, ie. write total, read copied.  I'm not sure I see a huge
> advantage to it either though.

It's fine. I just had the impression that many kernel interfaces are
designed in such a way, i.e. the same field carries in the
user-requested count and carries out the count actually handled.

> 
> > > > + *      pfn count for which dirty bitmap is copied to migration region.
> > > > + *      Vendor driver should copy the bitmap with bits set only for pages
> to
> > > be
> > > > + *      marked dirty in migration region.
> > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> > > none of the
> > > > + *        pages are dirty in requested range or rest of the range.
> > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to
> mark
> > > all
> > > > + *        pages dirty in the given range or rest of the range.
> > > > + *      - Vendor driver should return pfn count for which bitmap is
> written in
> > > > + *        the region.
> > > > + *
> > > > + * Migration region looks like:
> > > > + *  ------------------------------------------------------------------
> > > > + * |vfio_device_migration_info|    data section                      |
> > > > + * |                          |     ///////////////////////////////  |
> > > > + * ------------------------------------------------------------------
> > > > + *   ^                              ^                             ^
> > > > + *  offset 0-trapped part        data_offset                 data_size
> >
> > 'data_size' -> "data_offset + data_size"
> 
> The diagram sort of implies that, but I don't think this is correct.  I
> believe it's more like:
> 
> data_offset = data_start_offset
> data_size = data_end_offset - data_start_offset

I'm fine with replacing data_offset & data_size with
data_start_offset & data_end_offset.

> 
> > > > + *
> > > > + * Data section is always followed by vfio_device_migration_info
> structure
> >
> > Data section is always following ..., or ... structure is always followed by
> > data section.
> 
> The vfio_device_migration struct is always at the zero'th offset of the
> region, the data section is theoretically the remainder, but the used
> range of the data section is defined by the vendor driver via
> data_offset.

Isn't the original description incorrect here? It implied the data
section sits at offset zero, followed by the migration info...

> 
> > > > + * in the region, so data_offset will always be non-0. Offset from where
> > > data
> > > > + * is copied is decided by kernel driver, data section can be trapped or
> > > > + * mapped or partitioned, depending on how kernel driver defines data
> > > section.
> > > > + * Data section partition can be defined as mapped by sparse mmap
> > > capability.
> > > > + * If mmapped, then data_offset should be page aligned, where as
> initial
> > > section
> > > > + * which contain vfio_device_migration_info structure might not end at
> > > offset
> > > > + * which is page aligned.
> > > > + * Data_offset can be same or different for device data and dirty pages
> > > bitmap.
> > > > + * Vendor driver should decide whether to partition data section and
> how
> > > to
> > > > + * partition the data section. Vendor driver should return data_offset
> > > > + * accordingly.
> >
> > Here lacks of high-level summary about how to differentiate reading device
> > state from reading dirty page bitmap, when both use the same interface
> > (data_offset) to convey information to user space.
> >
> > From below sequence example, looks reading device state is initiated by
> > reading pending_bytes, while reading dirty bitmap is marked by writing
> > start_pfn. In case of shared data region between two operations, they have
> > to be mutually-exclusive i.e. one must wait for the other to complete. Even
> > when the region is partitioned, data_offset itself could be raced if pending_
> > bytes and start_pfn are accessed at the same time. How do we expect the
> > application to cope with it? Isn't it too limiting with such design?
> >
> > Since you anyway invent different sets of fields for two operations, why not
> > forcing partitioned flavor and then introduce two data_offset fields for each
> > other? This way the application is free to intermix device state and dirty
> > page collection for whatever needs.
> 
> AIUI, it's the user's responsibility to consume the data they've
> asked to be provided before they perform the next operation, but the
> user can alternate between device state and dirty pages at will.  I
> agree though that the lifecycle of the data with regard to the vendor
> driver is lacking.  Nothing seems to indicate to the vendor driver that
> the data is consumed other than starting the next operation or turning
> off _SAVING.
> 
> > > > + *
> > > > + * Sequence to be followed for _SAVING|_RUNNING device state or
> pre-
> > > copy phase
> > > > + * and for _SAVING device state or stop-and-copy phase:
> > > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > > + * b. read data_offset, indicates kernel driver to write data to staging
> > > buffer.
> > > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > > + *    migration region.
> > > > + * d. read data_size bytes of data from data_offset in the migration
> region.
> > > > + * e. process data.
> > > > + * f. Loop through a to e.
> > >
> > > Something needs to be said here about the availability of the data, for
> > > example, what indicates to the vendor driver that the above operation is
> > > complete?  Is the data immutable?
> >
> > I guess the vendor driver just continues to track pending_bytes for dirtied
> > device state, until exiting _SAVING state. Data is copied only when
> > pending_bytes is read by userspace. Copied data is immutable if data
> region
> > is mmapped. But yes, this part needs clarification.
> >
> > >
> > > > + *
> > > > + * To copy system memory content during migration, vendor driver
> should
> > > be able
> > > > + * to report system memory pages which are dirtied by that driver. For
> such
> > > > + * dirty page reporting, user application should query for a range of
> GFNs
> > > > + * relative to device address space (IOVA), then vendor driver should
> > > provide
> > > > + * the bitmap of pages from this range which are dirtied by him through
> > > > + * migration region where each bit represents a page and bit set to 1
> > > represents
> > > > + * that the page is dirty.
> > > > + * User space application should take care of copying content of system
> > > memory
> > > > + * for those pages.
> > >
> > > Can we say that device state and dirty pfn operations on the data
> > > area may be intermixed in any order the user chooses?
> >
> > this part is very opaque from previous description. Now I'm inclined to
> > vote for no-intermix design. There is really no good to force some
> > dependency between retrieving device state and dirty page bitmap.
> 
> I'm confused, the state of the device continues to change so long as it
> is _RUNNING.  The device may also continue to dirty pages in the
> _RUNNING state.  So how could we not have both occurring at the same
> time and therefore must support the user deciding which to sample in
> each iteration?

I may not have made it clear. Actually I'm with you that device state
and dirty pages change in parallel in the _RUNNING state, and the two
are independent by definition. Then why design an interface, as this
patch suggests, where the two operations share the same region, forcing
the application to manage unnecessary contention between them? For
example, if the application creates two threads, one to collect dirty
page bitmaps and the other to dump device state, such a design requires
the application to protect the data region between the two threads -
one thread has to wait until the other drains the new data reported by
the kernel driver, otherwise the data region gets corrupted.

On the other hand, if we explicitly define two sub-regions for device
state and dirty bitmap and split the control interface, the two can be
queried simultaneously. As discussed earlier, the two operations
already have most of their control fields separate, except data_offset.
Moving one step further to two data_offset fields decouples them
completely, and the application can access either sub-region at any
time without locking. A rough sketch of such a split layout follows.
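
Just to illustrate (purely hypothetical field names, not a concrete
proposal), the decoupled info block could look something like:

/* Hypothetical sketch: fully split device-state / dirty-bitmap controls */
struct vfio_device_migration_info_v2 {
        __u32 device_state;
        __u32 reserved;

        /* device state channel */
        __u64 pending_bytes;
        __u64 state_data_offset;        /* sub-region holding device state data */
        __u64 state_data_size;

        /* dirty page bitmap channel */
        __u64 start_pfn;
        __u64 page_size;
        __u64 total_pfns;
        __u64 copied_pfns;
        __u64 dirty_data_offset;        /* separate sub-region holding the bitmap */
        __u64 dirty_data_size;
} __attribute__((packed));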

> 
> > > Should we specify that bits accumulate since EITHER a) _SAVING state is
> > > enabled or b) a given pfn was last reported via the below sequence (ie.
> > > dirty bits are cleared once reported)?
> > >
> > > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > > performing the live portion of a migration?  For example, each time a
> > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > system_memory).  I don't see any QEMU code that intercepts that change
> > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > an unmap notification since it only knows the IOVA pfns, which can be
> > > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > please help me find the code that handles it.
> >
> > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > callback registered by emulated devices in Qemu. Looks dirty pages
> > by emulated DMAs are recorded in some implicit way. But KVM always
> > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> >  (translation can be done when DMA happens), then we don't need
> > worry about transient mapping from IOVA to GFN. Along this way we
> > also want GFN-based dirty bitmap being reported through VFIO,
> > similar to what KVM does. For vendor drivers, it needs to translate
> > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > provided by KVM but I'm not sure whether it's exposed now.
> >
> > >
> > > > + *
> > > > + * Steps to get dirty page bitmap:
> > > > + * a. write start_pfn, page_size and total_pfns.
> > >
> > > This is not well specified.  Is it intended that the user write all
> > > three of these on every iteration, or could they write start_pfn=0,
> > > page_size=4K, total_pfns=1, complete the steps below, then write
> > > start_pfn=1 and immediately begin the next iteration?  They've written
> > > all three, though not all on the current iteration, does that count?
> > > Furthermore, could the user simple re-read copied_pfns to determine if
> > > anything in the previously setup range has been re-dirtied?
> > >
> > > IOW, are these three "registers" sticky or do the below operations
> > > invalidate them?  If they're invalidated, then there needs to be a
> > > mechanism to generate an error, such as below.
> > >
> > > > + * b. read copied_pfns. Vendor driver should take one of the below
> action:
> > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> > > driver
> > > > + *       doesn't have any page to report dirty in given range or rest of the
> > > > + *       range. Exit the loop.
> > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to
> mark
> > > all
> > > > + *       pages dirty for given range or rest of the range. User space
> > > > + *       application mark all pages in the range as dirty and exit the loop.
> > > > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > > > + *       copied_pfn in migration region.
> > >
> > > Read returns errno if the pre-requisite registers are not valid?
> > >
> > > > + * c. read data_offset, where vendor driver has written bitmap.
> > > > + * d. read bitmap from the migration region from data_offset.
> > > > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
> > >
> > > It seems like the intent here is that the user effectively does:
> > >
> > > start_pfn += copied_pfns
> > > total_pfns -= copied_pfns
> > > page_size = page_size?
> > >
> > > But are they under any obligation to do so?
> > >
> > > Also same question above regarding data availability/life-cycle.  Is
> > > the vendor driver responsible for making the data available
> > > indefinitely?  Seems it's only released at the next iteration, or
> > > re-use of the data area for another operation, or clearing of the
> > > _SAVING state bit.
> > >
> > > > + * Sequence to be followed while _RESUMING device state:
> > > > + * While data for this device is available, repeat below steps:
> > > > + * a. read data_offset from where user application should write data.
> > > > + * b. write data of data_size to migration region from data_offset.
> > > > + * c. write data_size which indicates vendor driver that data is written
> in
> > > > + *    staging buffer.
> > > > + *
> > > > + * For user application, data is opaque. User should write data in the
> same
> > > > + * order as received.
> > >
> > > Additionally, implicit synchronization between _SAVING and _RESUMING
> > > ends within the vendor driver is assumed.
> > >
> > > Are there any assumptions we haven't covered with respect to mmaps?
> > > For instance, can the user setup mmaps at any time or only during
> > > certain device states?  Are there recommended best practices for users
> > > to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> > > mechanism, it might be nice to use it when either of these bits are
> > > cleared.  Thanks,
> >
> > another open for mmaps is how many pages to map. copied_pfns
> > carries only the number of copied pages. They scatter in the bitmap
> > while application doesn't know how long the bitmap is. Do we
> > expect the application to map pages one-by-one, until scanned
> > dirty pages is equal to copied_pfns?
> 
> I think we expect userspace to mmap according to the sparse mmap
> capability of the region.  The vendor driver then chooses whether to
> expose the current data set within the mmap region, outside the mmap
> region, or some mix of both, depending on their requirements.  Thanks,

Thanks, it makes sense.

> 
> Alex
> 
> > > > + */
> > > > +
> > > > +struct vfio_device_migration_info {
> > > > +        __u32 device_state;         /* VFIO device state */
> > > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > > +#define VFIO_DEVICE_STATE_MASK
> (VFIO_DEVICE_STATE_RUNNING | \
> > > > +                                     VFIO_DEVICE_STATE_SAVING | \
> > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING
> | \
> > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > +        __u32 reserved;
> > > > +        __u64 pending_bytes;
> > > > +        __u64 data_offset;
> > > > +        __u64 data_size;
> > > > +        __u64 start_pfn;
> > > > +        __u64 page_size;
> > > > +        __u64 total_pfns;
> > > > +        __u64 copied_pfns;
> > > > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > > > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > > > +} __attribute__((packed));
> > > > +
> > > >  /*
> > > >   * The MSIX mappable capability informs that MSIX data of a BAR can
> be
> > > mmapped
> > > >   * which allows direct access to non-MSIX registers which happened to
> be
> > > within
> >
Tian, Kevin Sept. 3, 2019, 6:57 a.m. UTC | #7
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Saturday, August 31, 2019 12:33 AM
> 
> On Fri, 30 Aug 2019 08:06:32 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Friday, August 30, 2019 3:26 PM
> > >
> > [...]
> > > > How does QEMU handle the fact that IOVAs are potentially dynamic
> while
> > > > performing the live portion of a migration?  For example, each time a
> > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > system_memory).  I don't see any QEMU code that intercepts that
> change
> > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > re-used with different GFN backing.  Once the DMA mapping is torn
> down,
> > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > please help me find the code that handles it.
> > >
> > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > >  (translation can be done when DMA happens), then we don't need
> > > worry about transient mapping from IOVA to GFN. Along this way we
> > > also want GFN-based dirty bitmap being reported through VFIO,
> > > similar to what KVM does. For vendor drivers, it needs to translate
> > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > provided by KVM but I'm not sure whether it's exposed now.
> > >
> >
> > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> 
> I thought it was bad enough that we have vendor drivers that depend on
> KVM, but designing a vfio interface that only supports a KVM interface
> is more undesirable.  I also note without comment that gfn_to_memslot()
> is a GPL symbol.  Thanks,

Yes, it is bad, but sometimes inevitable. If you recall our discussions
from about 3 years ago (when discussing the 1st mdev framework), there
were similar hypervisor dependencies in GVT-g, e.g. querying gpa->hpa
when creating some shadow structures. gpa->hpa is definitely
hypervisor-specific knowledge, which is easy in KVM (gpa->hva->hpa) but
needs a hypercall in Xen. But VFIO already makes a KVM-only assumption
when implementing vfio_{un}pin_page_external, so GVT-g has to maintain
an internal abstraction layer to support both Xen and KVM. Maybe
someday we will reconsider introducing a hypervisor abstraction layer
in VFIO, if this issue starts to hurt other devices and the Xen guys
are willing to support VFIO.

Back to this IOVA issue: I discussed it with Yan and we found another
hypervisor-agnostic alternative, learning from vhost. vhost is very
similar to VFIO - DMA also happens in the kernel - yet it already
supports vIOMMU.

Generally speaking, there are three paths of dirty page collection
in Qemu so far (as previously noted, Qemu always tracks the dirty
bitmap in GFN):

1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps are
updated directly when guest memory is updated. For example, PCI writes
are completed through pci_dma_write, which goes through the vIOMMU to
translate the IOVA into a GPA and then updates the bitmap through
cpu_physical_memory_set_dirty_range.

2) Memory writes that are not tracked by Qemu are collected by
registering a .log_sync() callback, which is invoked during the dirty
logging process. There are currently two users: kvm and vhost.

  2.1) KVM tracks CPU-side memory writes, through write-protection
or EPT A/D bits (+PML). This part is always based on GFN and returned
to Qemu when kvm_log_sync is invoked;

  2.2) vhost tracks kernel-side DMA writes by interpreting the vring
data structures. It maintains an internal iotlb which is synced with
the Qemu vIOMMU through a specific interface:
	- a new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
for Qemu to keep the vhost iotlb in sync
	- a new VHOST_IOTLB_MISS message to notify Qemu in case of
a miss in the vhost iotlb
	- Qemu registers a log buffer with the kernel vhost driver; the
latter updates the buffer (using the internal iotlb to get the GFN)
when serving vring descriptors.

VFIO could also implement an internal iotlb, so vendor drivers can use
the iotlb to update the GFN-based dirty bitmap. Ideally we don't need
to re-invent another iotlb protocol as vhost does. The vIOMMU already
sends map/unmap ioctl cmds upon any change of IOVA mapping. We may
introduce a v2 map/unmap interface, allowing Qemu to pass {iova, gpa,
hva} together to keep the internal iotlb in sync (a hypothetical sketch
below). But we may also need an iotlb_miss_upcall interface, if VFIO
doesn't want to cache the full-size vIOMMU mappings.
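
Purely hypothetical, just to make the idea concrete - a v2 map request
could carry the GPA alongside the existing {iova, vaddr} pair of the
type1 map ioctl:

/* Hypothetical sketch: v2 map request keeping a VFIO-internal iotlb in sync */
struct vfio_iommu_type1_dma_map_v2 {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;    /* HVA, as in the existing interface */
        __u64 iova;     /* IOVA of the mapping */
        __u64 gpa;      /* new: GPA, so dirty pages can be logged in GFN */
        __u64 size;
};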

Definitely this alternative needs more work and is possibly less
performant (if maintaining a small iotlb) than calling straight into a
KVM interface. But the gain is also obvious, since it is fully
contained within VFIO.

Thoughts? :-)

Thanks
Kevin

> 
> Alex
> 
> > Above flow works for software-tracked dirty mechanism, e.g. in
> > KVMGT, where GFN-based 'dirty' is marked when a guest page is
> > mapped into device mmu. IOVA->HPA->GFN translation is done
> > at that time, thus immune from further IOVA->GFN changes.
> >
> > When hardware IOMMU supports D-bit in 2nd level translation (e.g.
> > VT-d rev3.0), there are two scenarios:
> >
> > 1) nested translation: guest manages 1st-level translation (IOVA->GPA)
> > and host manages 2nd-level translation (GPA->HPA). The 2nd-level
> > is not affected by guest mapping operations. So it's OK for IOMMU
> > driver to retrieve GFN-based dirty pages by directly scanning the 2nd-
> > level structure, upon request from user space.
> >
> > 2) shadowed translation (IOVA->HPA) in 2nd level: in such case the dirty
> > information is tied to IOVA. the IOMMU driver is expected to maintain
> > an internal dirty bitmap. Upon any change of IOVA->GPA notification
> > from VFIO, the IOMMU driver should flush dirty status of affected 2nd-level
> > entries to the internal GFN-based bitmap. At this time, again IOVA->HVA
> > ->GPA translation required for GFN-based recording. When userspace
> > queries dirty bitmap, the IOMMU driver needs to flush latest 2nd-level
> > dirty status to internal bitmap, which is then copied to user space.
> >
> > Given the trickiness of 2), we aim to enable 1) on intel-iommu driver.
> >
> > Thanks
> > Kevin
Yan Zhao Sept. 4, 2019, 8:28 a.m. UTC | #8
On Tue, Sep 03, 2019 at 02:05:46PM +0800, Tian, Kevin wrote:
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Saturday, August 31, 2019 12:15 AM
> > 
> > On Fri, 30 Aug 2019 07:25:59 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > 
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, August 29, 2019 4:51 AM
> > > >
> > > > On Tue, 27 Aug 2019 00:25:41 +0530
> > > > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> > > >
> > > > > - Defined MIGRATION region type and sub-type.
> > > > > - Used 3 bits to define VFIO device states.
> > > > >     Bit 0 => _RUNNING
> > > > >     Bit 1 => _SAVING
> > > > >     Bit 2 => _RESUMING
> > > > >     Combination of these bits defines VFIO device's state during migration
> > > > >     _STOPPED => All bits 0 indicates VFIO device stopped.
> > > > >     _RUNNING => Normal VFIO device running state.
> > > > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running
> > but
> > > > start
> > > > >                           saving state of device i.e. pre-copy state
> > > > >     _SAVING  => vCPUs are stoppped, VFIO device should be stopped, and
> > > > >                           save device state,i.e. stop-n-copy state
> > > > >     _RESUMING => VFIO device resuming state.
> > > > >     _SAVING | _RESUMING => Invalid state if _SAVING and _RESUMING
> > bits
> > > > are set
> > > > >     Bits 3 - 31 are reserved for future use. User should perform
> > > > >     read-modify-write operation on this field.
> > > > > - Defined vfio_device_migration_info structure which will be placed at
> > 0th
> > > > >   offset of migration region to get/set VFIO device related information.
> > > > >   Defined members of structure and usage on read/write access:
> > > > >     * device_state: (read/write)
> > > > >         To convey VFIO device state to be transitioned to. Only 3 bits are
> > used
> > > > >         as of now, Bits 3 - 31 are reserved for future use.
> > > > >     * pending bytes: (read only)
> > > > >         To get pending bytes yet to be migrated for VFIO device.
> > > > >     * data_offset: (read only)
> > > > >         To get data offset in migration region from where data exist during
> > > > >         _SAVING, from where data should be written by user space
> > application
> > > > >         during _RESUMING state and while read dirty pages bitmap.
> > > > >     * data_size: (read/write)
> > > > >         To get and set size of data copied in migration region during
> > _SAVING
> > > > >         and _RESUMING state.
> > > > >     * start_pfn, page_size, total_pfns: (write only)
> > > > >         To get bitmap of dirty pages from vendor driver from given
> > > > >         start address for total_pfns.
> > > > >     * copied_pfns: (read only)
> > > > >         To get number of pfns bitmap copied in migration region.
> > > > >         Vendor driver should copy the bitmap with bits set only for
> > > > >         pages to be marked dirty in migration region. Vendor driver
> > > > >         should return VFIO_DEVICE_DIRTY_PFNS_NONE if there are 0 pages
> > > > dirty in
> > > > >         requested range. Vendor driver should return
> > > > VFIO_DEVICE_DIRTY_PFNS_ALL
> > > > >         to mark all pages in the section as dirty.
> > > > >
> > > > > Migration region looks like:
> > > > >  ------------------------------------------------------------------
> > > > > |vfio_device_migration_info|    data section                      |
> > > > > |                          |     ///////////////////////////////  |
> > > > >  ------------------------------------------------------------------
> > > > >  ^                              ^                              ^
> > > > >  offset 0-trapped part        data_offset                 data_size
> > > > >
> > > > > Data section is always followed by vfio_device_migration_info
> > > > > structure in the region, so data_offset will always be non-0.
> > > > > Offset from where data is copied is decided by kernel driver, data
> > > > > section can be trapped or mapped depending on how kernel driver
> > > > > defines data section. If mmapped, then data_offset should be page
> > > > > aligned, where as initial section which contain
> > vfio_device_migration_info
> > > > > structure might not end at offset which is page aligned.
> > > > > Data_offset can be same or different for device data and dirty pages
> > bitmap.
> > > > > Vendor driver should decide whether to partition data section and how
> > to
> > > > > partition the data section. Vendor driver should return data_offset
> > > > > accordingly.
> > > > >
> > > > > For user application, data is opaque. User should write data in the same
> > > > > order as received.
> > > > >
> > > > > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > > > > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > > > > ---
> > > > >  linux-headers/linux/vfio.h | 148
> > > > +++++++++++++++++++++++++++++++++++++++++++++
> > > > >  1 file changed, 148 insertions(+)
> > > > >
> > > > > diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
> > > > > index 24f505199f83..4bc0236b0898 100644
> > > > > --- a/linux-headers/linux/vfio.h
> > > > > +++ b/linux-headers/linux/vfio.h
> > > > > @@ -372,6 +372,154 @@ struct vfio_region_gfx_edid {
> > > > >   */
> > > > >  #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
> > > > >
> > > > > +/* Migration region type and sub-type */
> > > > > +#define VFIO_REGION_TYPE_MIGRATION	        (3)
> > > > > +#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
> > > > > +
> > > > > +/**
> > > > > + * Structure vfio_device_migration_info is placed at 0th offset of
> > > > > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device
> > > > related migration
> > > > > + * information. Field accesses from this structure are only supported at
> > > > their
> > > > > + * native width and alignment, otherwise the result is undefined and
> > > > vendor
> > > > > + * drivers should return an error.
> > > > > + *
> > > > > + * device_state: (read/write)
> > > > > + *      To indicate vendor driver the state VFIO device should be
> > > > transitioned
> > > > > + *      to. If device state transition fails, write on this field return error.
> > > > > + *      It consists of 3 bits:
> > > > > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that
> > indicates
> > > > > + *        _STOPPED state. When device is changed to _STOPPED, driver
> > should
> > > > stop
> > > > > + *        device before write() returns.
> > > > > + *      - If bit 1 set, indicates _SAVING state.
> > > > > + *      - If bit 2 set, indicates _RESUMING state.
> > >
> > > please add a few words to explain _SAVING and _RESUMING, similar to
> > > what you did for _RUNNING.
> > >
> > > > > + *      Bits 3 - 31 are reserved for future use. User should perform
> > > > > + *      read-modify-write operation on this field.
> > > > > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> > >
> > > what about _RUNNING | _RESUMING? Is it allowed?
> > 
> > I think this would be post-copy migration, which I assume is
> > theoretically supportable, but not necessarily (and not currently)
> > supported by vendor drivers.  I'm not sure how a vendor driver reports
> > the lack of this support though.
> 
> Yan is working on post-copy now. I talked to her about this open.
> She will respond later after some thinking. Ideally we need a way
> that user space can use to verify which combinations are available 
> before starting the post-copy process. It's not good to blindly start
> post-copy and then fail due to failure to switch to desired state on
> the dest machine.
>
hi,
May I know what a vendor driver is supposed to do in the _RESUMING
state currently? Could I understand it as loading device data being
permitted only in the _RESUMING state, and rejected otherwise?
If this understanding is right, is _LOADING a better name for it?
Then _SAVING and _LOADING are obviously mutually exclusive, and we can
clear the _LOADING state in .load_cleanup in Qemu. The state
_LOADING | _RUNNING would then only appear when the VM has started up
and the migration is not yet completed.

And for post-copy, I guess an extra POSTCOPY state needs to be
introduced anyway, because in the POSTCOPY state the vendor driver on
the target side needs to listen for faulted pages and request them from
the source VM. It has to be set while the migration is in the post-copy
stage (before the VM is running and after .load_setup), and it is not
necessarily equal to _LOADING or _LOADING | _RUNNING. A purely
illustrative sketch of such state bits follows.
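
Hypothetical defines, only to illustrate the naming question above (not
a proposal of concrete values):

#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
#define VFIO_DEVICE_STATE_LOADING   (1 << 2)    /* was _RESUMING */
#define VFIO_DEVICE_STATE_POSTCOPY  (1 << 3)    /* hypothetical: set during the
                                                 * post-copy stage, after
                                                 * .load_setup and before the
                                                 * VM runs */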

Thanks
Yan


> > 
> > > > > + *
> > > > > + * pending bytes: (read only)
> > > > > + *      Number of pending bytes yet to be migrated from vendor driver
> > > > > + *
> > > > > + * data_offset: (read only)
> > > > > + *      User application should read data_offset in migration region from
> > > > where
> > > > > + *      user application should read device data during _SAVING state or
> > > > write
> > > > > + *      device data during _RESUMING state or read dirty pages bitmap.
> > See
> > > > below
> > > > > + *      for detail of sequence to be followed.
> > > > > + *
> > > > > + * data_size: (read/write)
> > > > > + *      User application should read data_size to get size of data copied
> > in
> > > > > + *      migration region during _SAVING state and write size of data
> > copied
> > > > in
> > > > > + *      migration region during _RESUMING state.
> > > > > + *
> > > > > + * start_pfn: (write only)
> > > > > + *      Start address pfn to get bitmap of dirty pages from vendor driver
> > > > duing
> > > > > + *      _SAVING state.
> > > > > + *
> > > > > + * page_size: (write only)
> > > > > + *      User application should write the page_size of pfn.
> > > > > + *
> > > > > + * total_pfns: (write only)
> > > > > + *      Total pfn count from start_pfn for which dirty bitmap is requested.
> > > > > + *
> > > > > + * copied_pfns: (read only)
> > >
> > > 'copied' gives the impression as if the page content is copied. what about
> > > 'dirty_pfns'? btw can this field merge with total_pfns?
> > 
> > I don't agree with the implication that copied implies page content,
> > we're working with a bitmap, so this is easily disproved.  I think the
> > intent of 'copied' is to differentiate that 'total' is how many the
> > user asked for, and 'copied' is how many they got.  It's not obvious to
> > me whether there's anything that would prevent duplicate use of one
> > register, ie. write total, read copied.  I'm not sure I see a huge
> > advantage to it either though.
> 
> It's fine. I just had the impression that many kernel interfaces are 
> designed in such way, i.e., having the same field to carry-in user
> requested number and then carry-out actually handled number.
> 
> > 
> > > > > + *      pfn count for which dirty bitmap is copied to migration region.
> > > > > + *      Vendor driver should copy the bitmap with bits set only for pages
> > to
> > > > be
> > > > > + *      marked dirty in migration region.
> > > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if
> > > > none of the
> > > > > + *        pages are dirty in requested range or rest of the range.
> > > > > + *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to
> > mark
> > > > all
> > > > > + *        pages dirty in the given range or rest of the range.
> > > > > + *      - Vendor driver should return pfn count for which bitmap is
> > written in
> > > > > + *        the region.
> > > > > + *
> > > > > + * Migration region looks like:
> > > > > + *  ------------------------------------------------------------------
> > > > > + * |vfio_device_migration_info|    data section                      |
> > > > > + * |                          |     ///////////////////////////////  |
> > > > > + * ------------------------------------------------------------------
> > > > > + *   ^                              ^                             ^
> > > > > + *  offset 0-trapped part        data_offset                 data_size
> > >
> > > 'data_size' -> "data_offset + data_size"
> > 
> > The diagram sort of implies that, but I don't think this is correct.  I
> > believe it's more like:
> > 
> > data_offset = data_start_offset
> > data_size = data_end_offset - data_start_offset
> 
> I'm fine with replacing data_offset & data_size with data_start_
> offset & data_end_offset.
> 
> > 
> > > > > + *
> > > > > + * Data section is always followed by vfio_device_migration_info
> > structure
> > >
> > > Data section is always following ..., or ... structure is always followed by
> > > data section.
> > 
> > The vfio_device_migration struct is always at the zero'th offset of the
> > region, the data section is theoretically the remainder, but the used
> > range of the data section is defined by the vendor driver via
> > data_offset.
> 
> Isn't original description incorrect here? It implied data section sits at
> offset zero, and then migration info...
> 
> > 
> > > > > + * in the region, so data_offset will always be non-0. Offset from where
> > > > data
> > > > > + * is copied is decided by kernel driver, data section can be trapped or
> > > > > + * mapped or partitioned, depending on how kernel driver defines data
> > > > section.
> > > > > + * Data section partition can be defined as mapped by sparse mmap
> > > > capability.
> > > > > + * If mmapped, then data_offset should be page aligned, where as
> > initial
> > > > section
> > > > > + * which contain vfio_device_migration_info structure might not end at
> > > > offset
> > > > > + * which is page aligned.
> > > > > + * Data_offset can be same or different for device data and dirty pages
> > > > bitmap.
> > > > > + * Vendor driver should decide whether to partition data section and
> > how
> > > > to
> > > > > + * partition the data section. Vendor driver should return data_offset
> > > > > + * accordingly.
> > >
> > > Here lacks of high-level summary about how to differentiate reading device
> > > state from reading dirty page bitmap, when both use the same interface
> > > (data_offset) to convey information to user space.
> > >
> > > From below sequence example, looks reading device state is initiated by
> > > reading pending_bytes, while reading dirty bitmap is marked by writing
> > > start_pfn. In case of shared data region between two operations, they have
> > > to be mutually-exclusive i.e. one must wait for the other to complete. Even
> > > when the region is partitioned, data_offset itself could be raced if pending_
> > > bytes and start_pfn are accessed at the same time. How do we expect the
> > > application to cope with it? Isn't it too limiting with such design?
> > >
> > > Since you anyway invent different sets of fields for two operations, why not
> > > forcing partitioned flavor and then introduce two data_offset fields for each
> > > other? This way the application is free to intermix device state and dirty
> > > page collection for whatever needs.
> > 
> > AIUI, it's the user's responsibility to consume the data they've
> > asked to be provided before they perform the next operation, but the
> > user can alternate between device state and dirty pages at will.  I
> > agree though that the lifecycle of the data with regard to the vendor
> > driver is lacking.  Nothing seems to indicate to the vendor driver that
> > the data is consumed other than starting the next operation or turning
> > off _SAVING.
> > 
> > > > > + *
> > > > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > > > > + * and for _SAVING device state or stop-and-copy phase:
> > > > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > > > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > > > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > > > + *    migration region.
> > > > > + * d. read data_size bytes of data from data_offset in the migration region.
> > > > > + * e. process data.
> > > > > + * f. Loop through a to e.
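To make the save-side sequence above concrete, here is a minimal userspace
sketch, assuming the field layout of struct vfio_device_migration_info
proposed in this patch and a migration region starting at file offset
region_off of the device fd; error handling and short reads are omitted:

#include <stdio.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/types.h>
#include <linux/types.h>
/* struct vfio_device_migration_info as defined by this patch */

static int save_device_data(int fd, off_t region_off, FILE *stream)
{
	__u64 pending, data_offset, data_size;
	char buf[65536];

	for (;;) {
		/* a. read pending_bytes; stop when nothing is left */
		pread(fd, &pending, sizeof(pending), region_off +
		      offsetof(struct vfio_device_migration_info, pending_bytes));
		if (!pending)
			break;
		/* b. read data_offset, which asks the vendor driver to stage data */
		pread(fd, &data_offset, sizeof(data_offset), region_off +
		      offsetof(struct vfio_device_migration_info, data_offset));
		/* c. read data_size, the amount of data staged this iteration */
		pread(fd, &data_size, sizeof(data_size), region_off +
		      offsetof(struct vfio_device_migration_info, data_size));
		/* d + e. read the staged data and process it (here: append to a stream) */
		while (data_size) {
			size_t chunk = data_size < sizeof(buf) ? data_size : sizeof(buf);

			pread(fd, buf, chunk, region_off + data_offset);
			fwrite(buf, 1, chunk, stream);
			data_offset += chunk;
			data_size -= chunk;
		}
		/* f. loop back to a */
	}
	return 0;
}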
> > > >
> > > > Something needs to be said here about the availability of the data, for
> > > > example, what indicates to the vendor driver that the above operation is
> > > > complete?  Is the data immutable?
> > >
> > > I guess the vendor driver just continues to track pending_bytes for dirtied
> > > device state, until exiting _SAVING state. Data is copied only when
> > > pending_bytes is read by userspace. Copied data is immutable if data
> > region
> > > is mmapped. But yes, this part needs clarification.
> > >
> > > >
> > > > > + *
> > > > > + * To copy system memory content during migration, vendor driver should be able
> > > > > + * to report system memory pages which are dirtied by that driver. For such
> > > > > + * dirty page reporting, user application should query for a range of GFNs
> > > > > + * relative to device address space (IOVA), then vendor driver should provide
> > > > > + * the bitmap of pages from this range which are dirtied by him through
> > > > > + * migration region where each bit represents a page and bit set to 1 represents
> > > > > + * that the page is dirty.
> > > > > + * User space application should take care of copying content of system memory
> > > > > + * for those pages.
> > > >
> > > > Can we say that device state and dirty pfn operations on the data
> > > > area may be intermixed in any order the user chooses?
> > >
> > > this part is very opaque in the previous description. Now I'm inclined to
> > > vote for a no-intermix design. There is really no benefit in forcing a
> > > dependency between retrieving device state and the dirty page bitmap.
> > 
> > I'm confused, the state of the device continues to change so long as it
> > is _RUNNING.  The device may also continue to dirty pages in the
> > _RUNNING state.  So how could we not have both occurring at the same
> > time and therefore must support the user deciding which to sample in
> > each iteration?
> 
> I may not have made it clear. Actually I'm with you that device state and
> dirty pages change in parallel in the _RUNNING state, and the two are
> independent by definition. Then why did we design an interface, as this
> patch suggests, where the two operations share the same region, forcing
> the application to manage unnecessary contention in between? For
> example, if the application creates two threads, one to collect
> dirty page bitmaps and the other to dump device state, such a design
> requires the application to protect the data region between the two
> threads - one thread has to wait until the other drains the new state
> reported by the kernel driver, otherwise the data region will be corrupted.
> 
> On the other hand, if we explicitly design two sub-regions, one for device
> state and one for the dirty bitmap, and split the control interface, the two
> can be queried simultaneously. As discussed earlier, the two operations
> already have most of their control fields separate except data_offset. Now
> just go one step further and have two data_offset fields; then the two
> operations are completely decoupled and the application can access either
> sub-region at any time without locking.
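Purely to illustrate the split being argued for here (this is not what the
patch proposes), the control block for such a layout might look roughly like:

#include <linux/types.h>

/* Hypothetical, decoupled layout: each operation gets its own control
 * fields and its own data_offset into its own sub-region, so device
 * state and the dirty bitmap can be accessed concurrently without locking.
 */
struct vfio_device_migration_info_split {
	__u32 device_state;
	__u32 reserved;
	/* device state channel */
	__u64 pending_bytes;
	__u64 state_data_offset;
	__u64 state_data_size;
	/* dirty page bitmap channel */
	__u64 start_pfn;
	__u64 page_size;
	__u64 total_pfns;
	__u64 copied_pfns;
	__u64 bitmap_data_offset;
};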
> 
> > 
> > > > Should we specify that bits accumulate since EITHER a) _SAVING state is
> > > > enabled or b) a given pfn was last reported via the below sequence (ie.
> > > > dirty bits are cleared once reported)?
> > > >
> > > > How does QEMU handle the fact that IOVAs are potentially dynamic while
> > > > performing the live portion of a migration?  For example, each time a
> > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > system_memory).  I don't see any QEMU code that intercepts that change
> > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > re-used with different GFN backing.  Once the DMA mapping is torn down,
> > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > please help me find the code that handles it.
> > >
> > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > >  (translation can be done when DMA happens), then we don't need
> > > worry about transient mapping from IOVA to GFN. Along this way we
> > > also want GFN-based dirty bitmap being reported through VFIO,
> > > similar to what KVM does. For vendor drivers, it needs to translate
> > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > provided by KVM but I'm not sure whether it's exposed now.
> > >
> > > >
> > > > > + *
> > > > > + * Steps to get dirty page bitmap:
> > > > > + * a. write start_pfn, page_size and total_pfns.
> > > >
> > > > This is not well specified.  Is it intended that the user write all
> > > > three of these on every iteration, or could they write start_pfn=0,
> > > > page_size=4K, total_pfns=1, complete the steps below, then write
> > > > start_pfn=1 and immediately begin the next iteration?  They've written
> > > > all three, though not all on the current iteration, does that count?
> > > > Furthermore, could the user simply re-read copied_pfns to determine if
> > > > anything in the previously setup range has been re-dirtied?
> > > >
> > > > IOW, are these three "registers" sticky or do the below operations
> > > > invalidate them?  If they're invalidated, then there needs to be a
> > > > mechanism to generate an error, such as below.
> > > >
> > > > > + * b. read copied_pfns. Vendor driver should take one of the below action:
> > > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
> > > > > + *       doesn't have any page to report dirty in given range or rest of the
> > > > > + *       range. Exit the loop.
> > > > > + *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
> > > > > + *       pages dirty for given range or rest of the range. User space
> > > > > + *       application mark all pages in the range as dirty and exit the loop.
> > > > > + *     - Vendor driver should return copied_pfns and provide bitmap for
> > > > > + *       copied_pfn in migration region.
> > > >
> > > > Read returns errno if the pre-requisite registers are not valid?
> > > >
> > > > > + * c. read data_offset, where vendor driver has written bitmap.
> > > > > + * d. read bitmap from the migration region from data_offset.
> > > > > + * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
> > > >
> > > > It seems like the intent here is that the user effectively does:
> > > >
> > > > start_pfn += copied_pfns
> > > > total_pfns -= copied_pfns
> > > > page_size = page_size?
> > > >
> > > > But are they under any obligation to do so?
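Spelling out that interpretation as a sketch (start_pfn advances and
total_pfns shrinks by copied_pfns each iteration; whether the vendor driver
actually requires this update rule is exactly the open question above;
mark_all_dirty() and read_bitmap() stand in for application code, and the
struct, dirty-pfn constants and region_off are as in the save-side sketch
earlier in this thread):

void mark_all_dirty(__u64 start_pfn, __u64 npfns);                  /* application-provided */
void read_bitmap(int fd, off_t off, __u64 start_pfn, __u64 npfns);  /* application-provided */

static void query_dirty(int fd, off_t region_off,
			__u64 start_pfn, __u64 page_size, __u64 total_pfns)
{
	while (total_pfns) {
		__u64 copied, data_offset;

		/* a. write start_pfn, page_size and total_pfns */
		pwrite(fd, &start_pfn, sizeof(start_pfn), region_off +
		       offsetof(struct vfio_device_migration_info, start_pfn));
		pwrite(fd, &page_size, sizeof(page_size), region_off +
		       offsetof(struct vfio_device_migration_info, page_size));
		pwrite(fd, &total_pfns, sizeof(total_pfns), region_off +
		       offsetof(struct vfio_device_migration_info, total_pfns));

		/* b. read copied_pfns and act on the special values */
		pread(fd, &copied, sizeof(copied), region_off +
		      offsetof(struct vfio_device_migration_info, copied_pfns));
		if (copied == VFIO_DEVICE_DIRTY_PFNS_NONE)
			break;
		if (copied == VFIO_DEVICE_DIRTY_PFNS_ALL) {
			mark_all_dirty(start_pfn, total_pfns);
			break;
		}

		/* c + d. read data_offset, then the bitmap (one bit per pfn) */
		pread(fd, &data_offset, sizeof(data_offset), region_off +
		      offsetof(struct vfio_device_migration_info, data_offset));
		read_bitmap(fd, region_off + data_offset, start_pfn, copied);

		/* e. move past the pfns just reported */
		start_pfn += copied;
		total_pfns -= copied;
	}
}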
> > > >
> > > > Also same question above regarding data availability/life-cycle.  Is
> > > > the vendor driver responsible for making the data available
> > > > indefinitely?  Seems it's only released at the next iteration, or
> > > > re-use of the data area for another operation, or clearing of the
> > > > _SAVING state bit.
> > > >
> > > > > + * Sequence to be followed while _RESUMING device state:
> > > > > + * While data for this device is available, repeat below steps:
> > > > > + * a. read data_offset from where user application should write data.
> > > > > + * b. write data of data_size to migration region from data_offset.
> > > > > + * c. write data_size which indicates vendor driver that data is written in
> > > > > + *    staging buffer.
> > > > > + *
> > > > > + * For user application, data is opaque. User should write data in the same
> > > > > + * order as received.
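A correspondingly minimal resume-side sketch of steps a-c, under the same
assumptions as the save-side sketch above (how chunks are framed in the
migration stream is left to the application):

static void resume_one_chunk(int fd, off_t region_off,
			     const void *chunk, __u64 size)
{
	__u64 data_offset;

	/* a. read data_offset: where the vendor driver wants this chunk */
	pread(fd, &data_offset, sizeof(data_offset), region_off +
	      offsetof(struct vfio_device_migration_info, data_offset));
	/* b. write the chunk into the data section */
	pwrite(fd, chunk, size, region_off + data_offset);
	/* c. write data_size to tell the vendor driver the data is staged */
	pwrite(fd, &size, sizeof(size), region_off +
	       offsetof(struct vfio_device_migration_info, data_size));
}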
> > > >
> > > > Additionally, implicit synchronization between _SAVING and _RESUMING
> > > > ends within the vendor driver is assumed.
> > > >
> > > > Are there any assumptions we haven't covered with respect to mmaps?
> > > > For instance, can the user setup mmaps at any time or only during
> > > > certain device states?  Are there recommended best practices for users
> > > > to only setup mmaps during _SAVING or _RESUMING?  If we had a revoke
> > > > mechanism, it might be nice to use it when either of these bits are
> > > > cleared.  Thanks,
> > >
> > > another open question for mmaps is how many pages to map. copied_pfns
> > > carries only the number of copied pages. They are scattered in the bitmap
> > > while the application doesn't know how long the bitmap is. Do we
> > > expect the application to map pages one-by-one, until the number of
> > > scanned dirty pages equals copied_pfns?
> > 
> > I think we expect userspace to mmap according to the sparse mmap
> > capability of the region.  The vendor driver then chooses whether to
> > expose the current data set within the mmap region, outside the mmap
> > region, or some mix of both, depending on their requirements.  Thanks,
> 
> Thanks, it makes sense.
> 
> > 
> > Alex
> > 
> > > > > + */
> > > > > +
> > > > > +struct vfio_device_migration_info {
> > > > > +        __u32 device_state;         /* VFIO device state */
> > > > > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > > > > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > > > > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > > > > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > > > > +                                     VFIO_DEVICE_STATE_SAVING | \
> > > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > > +#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
> > > > > +                                     VFIO_DEVICE_STATE_RESUMING)
> > > > > +        __u32 reserved;
> > > > > +        __u64 pending_bytes;
> > > > > +        __u64 data_offset;
> > > > > +        __u64 data_size;
> > > > > +        __u64 start_pfn;
> > > > > +        __u64 page_size;
> > > > > +        __u64 total_pfns;
> > > > > +        __u64 copied_pfns;
> > > > > +#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
> > > > > +#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
> > > > > +} __attribute__((packed));
> > > > > +
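As a usage note, state transitions would be driven by a read-modify-write of
device_state so that bits outside the defined ones are preserved; e.g. a
sketch of entering the pre-copy phase (_SAVING while still _RUNNING), with fd
and region_off as in the earlier sketches:

	__u32 state;

	pread(fd, &state, sizeof(state), region_off +
	      offsetof(struct vfio_device_migration_info, device_state));
	state |= VFIO_DEVICE_STATE_SAVING;	/* _RUNNING stays set: pre-copy */
	pwrite(fd, &state, sizeof(state), region_off +
	       offsetof(struct vfio_device_migration_info, device_state));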
> > > > >  /*
> > > > >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> > > > >   * which allows direct access to non-MSIX registers which happened to be within
> > >
>
Alex Williamson Sept. 12, 2019, 2:41 p.m. UTC | #9
On Tue, 3 Sep 2019 06:57:27 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Saturday, August 31, 2019 12:33 AM
> > 
> > On Fri, 30 Aug 2019 08:06:32 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Friday, August 30, 2019 3:26 PM
> > > >  
> > > [...]  
> > > > > How does QEMU handle the fact that IOVAs are potentially dynamic  
> > while  
> > > > > performing the live portion of a migration?  For example, each time a
> > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > MemoryRegionSection pops in or out of the AddressSpace for the device
> > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > system_memory).  I don't see any QEMU code that intercepts that  
> > change  
> > > > > in the AddressSpace such that the IOVA dirty pfns could be recorded and
> > > > > translated to GFNs.  The vendor driver can't track these beyond getting
> > > > > an unmap notification since it only knows the IOVA pfns, which can be
> > > > > re-used with different GFN backing.  Once the DMA mapping is torn  
> > down,  
> > > > > it seems those dirty pfns are lost in the ether.  If this works in QEMU,
> > > > > please help me find the code that handles it.  
> > > >
> > > > I'm curious about this part too. Interestingly, I didn't find any log_sync
> > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > by emulated DMAs are recorded in some implicit way. But KVM always
> > > > reports dirty page in GFN instead of IOVA, regardless of the presence of
> > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > >  (translation can be done when DMA happens), then we don't need
> > > > worry about transient mapping from IOVA to GFN. Along this way we
> > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > provided by KVM but I'm not sure whether it's exposed now.
> > > >  
> > >
> > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > 
> > I thought it was bad enough that we have vendor drivers that depend on
> > KVM, but designing a vfio interface that only supports a KVM interface
> > is more undesirable.  I also note without comment that gfn_to_memslot()
> > is a GPL symbol.  Thanks,  
> 
> yes it is bad, but sometimes inevitable. If you recall our discussions
> back to 3yrs (when discussing the 1st mdev framework), there were similar
> hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> creating some shadow structures. gpa->hpa is definitely hypervisor
> specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> hypercall in Xen. but VFIO already makes assumption based on KVM-
> only flavor when implementing vfio_{un}pin_page_external.

Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA and
HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
to get an HPA and provide an array of HPA pfns back to the caller.  The
other vGPU mdev vendor manages to make use of this without KVM... the
KVM interface used by GVT-g is GPL-only.

> So GVT-g
> has to maintain an internal abstraction layer to support both Xen and
> KVM. Maybe someday we will re-consider introducing some hypervisor
> abstraction layer in VFIO, if this issue starts to hurt other devices and
> Xen guys are willing to support VFIO.

Once upon a time, we had a KVM specific device assignment interface,
ie. legacy KVM device assignment.  We developed VFIO specifically to get
KVM out of the business of being a (bad) device driver.  We do have
some awareness and interaction between VFIO and KVM in the vfio-kvm
pseudo device, but we still try to keep those interfaces generic.  In
some cases we're not very successful at that, see vfio_group_set_kvm(),
but that's largely just a mechanism to associate a cookie with a group
to be consumed by the mdev vendor driver such that it can work with kvm
external to vfio.  I don't intend to add further hypervisor awareness
to vfio.

> Back to this IOVA issue, I discussed with Yan and we found another 
> hypervisor-agnostic alternative, by learning from vhost. vhost is very
> similar to VFIO - DMA also happens in the kernel, while it already 
> supports vIOMMU.
> 
> Generally speaking, there are three paths of dirty page collection
> in Qemu so far (as previously noted, Qemu always tracks the dirty
> bitmap in GFN):

GFNs or simply PFNs within an AddressSpace?
 
> 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps 
> are updated directly when the guest memory is being updated. For 
> example, PCI writes are completed through pci_dma_write, which 
> goes through vIOMMU to translate IOVA into GPA and then update 
> the bitmap through cpu_physical_memory_set_dirty_range.

Right, so the IOVA to GPA (GFN) occurs through an explicit translation
on the IOMMU AddressSpace.
 
> 2) Memory writes that are not tracked by Qemu are collected by
> registering .log_sync() callback, which is invoked in the dirty logging
> process. Now there are two users: kvm and vhost.
> 
>   2.1) KVM tracks CPU-side memory writes, through write-protection
> or EPT A/D bits (+PML). This part is always based on GFN and returned
> to Qemu when kvm_log_sync is invoked;
> 
>   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> data structure. It maintains an internal iotlb which is synced with
> Qemu vIOMMU through a specific interface:
> 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> for Qemu to keep vhost iotlb in sync
> 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> a miss in vhost iotlb.
> 	- Qemu registers a log buffer to kernel vhost driver. The latter
> update the buffer (using internal iotlb to get GFN) when serving vring
> descriptor.
> 
> VFIO could also implement an internal iotlb, so vendor drivers can
> utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> don't need re-invent another iotlb protocol as vhost does. vIOMMU
> already sends map/unmap ioctl cmds upon any change of IOVA
> mapping. We may introduce a v2 map/unmap interface, allowing
> Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> doesn't want to cache full-size vIOMMU mappings. 
> 
> Definitely this alternative needs more work and possibly less 
> performant (if maintaining a small size iotlb) than straightforward
> calling into KVM interface. But the gain is also obvious, since it
> is fully constrained with VFIO.
> 
> Thoughts? :-)

So vhost must then be configuring a listener across system memory
rather than only against the device AddressSpace like we do in vfio,
such that it gets log_sync() callbacks for the actual GPA space rather
than only the IOVA space.  OTOH, QEMU could understand that the device
AddressSpace has a translate function and apply the IOVA dirty bits to
the system memory AddressSpace.  Wouldn't it make more sense for QEMU
to perform a log_sync() prior to removing a MemoryRegionSection within
an AddressSpace and update the GPA rather than pushing GPA awareness
and potentially large tracking structures into the host kernel?  Thanks,

Alex
Tian, Kevin Sept. 12, 2019, 11 p.m. UTC | #10
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 12, 2019 10:41 PM
> 
> On Tue, 3 Sep 2019 06:57:27 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Saturday, August 31, 2019 12:33 AM
> > >
> > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Tian, Kevin
> > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > >
> > > > [...]
> > > > > > How does QEMU handle the fact that IOVAs are potentially
> dynamic
> > > while
> > > > > > performing the live portion of a migration?  For example, each
> time a
> > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > MemoryRegionSection pops in or out of the AddressSpace for the
> device
> > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > system_memory).  I don't see any QEMU code that intercepts that
> > > change
> > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> recorded and
> > > > > > translated to GFNs.  The vendor driver can't track these beyond
> getting
> > > > > > an unmap notification since it only knows the IOVA pfns, which
> can be
> > > > > > re-used with different GFN backing.  Once the DMA mapping is
> torn
> > > down,
> > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> QEMU,
> > > > > > please help me find the code that handles it.
> > > > >
> > > > > I'm curious about this part too. Interestingly, I didn't find any
> log_sync
> > > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > > by emulated DMAs are recorded in some implicit way. But KVM
> always
> > > > > reports dirty page in GFN instead of IOVA, regardless of the
> presence of
> > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > > >  (translation can be done when DMA happens), then we don't need
> > > > > worry about transient mapping from IOVA to GFN. Along this way
> we
> > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > >
> > > >
> > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> > >
> > > I thought it was bad enough that we have vendor drivers that depend
> on
> > > KVM, but designing a vfio interface that only supports a KVM interface
> > > is more undesirable.  I also note without comment that
> gfn_to_memslot()
> > > is a GPL symbol.  Thanks,
> >
> > yes it is bad, but sometimes inevitable. If you recall our discussions
> > back to 3yrs (when discussing the 1st mdev framework), there were
> similar
> > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > creating some shadow structures. gpa->hpa is definitely hypervisor
> > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > only flavor when implementing vfio_{un}pin_page_external.
> 
> Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> and
> HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> to get an HPA and provide an array of HPA pfns back to the caller.  The
> other vGPU mdev vendor manages to make use of this without KVM... the
> KVM interface used by GVT-g is GPL-only.

To be clear, it's an assumption about host-based hypervisors, e.g. KVM.
GUP is a perfect example: it doesn't work for Xen since DomU's
memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
through Xen-specific hypercalls.

> 
> > So GVT-g
> > has to maintain an internal abstraction layer to support both Xen and
> > KVM. Maybe someday we will re-consider introducing some hypervisor
> > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > Xen guys are willing to support VFIO.
> 
> Once upon a time, we had a KVM specific device assignment interface,
> ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> KVM out of the business of being a (bad) device driver.  We do have
> some awareness and interaction between VFIO and KVM in the vfio-kvm
> pseudo device, but we still try to keep those interfaces generic.  In
> some cases we're not very successful at that, see vfio_group_set_kvm(),
> but that's largely just a mechanism to associate a cookie with a group
> to be consumed by the mdev vendor driver such that it can work with kvm
> external to vfio.  I don't intend to add further hypervisor awareness
> to vfio.
> 
> > Back to this IOVA issue, I discussed with Yan and we found another
> > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > similar to VFIO - DMA also happens in the kernel, while it already
> > supports vIOMMU.
> >
> > Generally speaking, there are three paths of dirty page collection
> > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > bitmap in GFN):
> 
> GFNs or simply PFNs within an AddressSpace?
> 
> > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > are updated directly when the guest memory is being updated. For
> > example, PCI writes are completed through pci_dma_write, which
> > goes through vIOMMU to translate IOVA into GPA and then update
> > the bitmap through cpu_physical_memory_set_dirty_range.
> 
> Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> on the IOMMU AddressSpace.
> 
> > 2) Memory writes that are not tracked by Qemu are collected by
> > registering .log_sync() callback, which is invoked in the dirty logging
> > process. Now there are two users: kvm and vhost.
> >
> >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > to Qemu when kvm_log_sync is invoked;
> >
> >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > data structure. It maintains an internal iotlb which is synced with
> > Qemu vIOMMU through a specific interface:
> > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > for Qemu to keep vhost iotlb in sync
> > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > a miss in vhost iotlb.
> > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > update the buffer (using internal iotlb to get GFN) when serving vring
> > descriptor.
> >
> > VFIO could also implement an internal iotlb, so vendor drivers can
> > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > already sends map/unmap ioctl cmds upon any change of IOVA
> > mapping. We may introduce a v2 map/unmap interface, allowing
> > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > doesn't want to cache full-size vIOMMU mappings.
> >
> > Definitely this alternative needs more work and possibly less
> > performant (if maintaining a small size iotlb) than straightforward
> > calling into KVM interface. But the gain is also obvious, since it
> > is fully constrained with VFIO.
> >
> > Thoughts? :-)
> 
> So vhost must then be configuring a listener across system memory
> rather than only against the device AddressSpace like we do in vfio,
> such that it get's log_sync() callbacks for the actual GPA space rather
> than only the IOVA space.  OTOH, QEMU could understand that the device
> AddressSpace has a translate function and apply the IOVA dirty bits to
> the system memory AddressSpace.  Wouldn't it make more sense for
> QEMU
> to perform a log_sync() prior to removing a MemoryRegionSection within
> an AddressSpace and update the GPA rather than pushing GPA awareness
> and potentially large tracking structures into the host kernel?  Thanks,
> 

It is an interesting idea.  One drawback is that log_sync might be
invoked frequently in the IOVA case, but I guess the overhead is small
compared to the total overhead of emulating the IOTLB invalidation.
Maybe other folks can better comment on why this model was not
considered before, e.g. when the vhost iotlb was introduced.

Thanks
Kevin
Alex Williamson Sept. 13, 2019, 3:47 p.m. UTC | #11
On Thu, 12 Sep 2019 23:00:03 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> > 
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >  
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >  
> > > > > [...]  
> > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > dynamic  
> > > > while  
> > > > > > > performing the live portion of a migration?  For example, each  
> > time a  
> > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > MemoryRegionSection pops in or out of the AddressSpace for the  
> > device  
> > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > system_memory).  I don't see any QEMU code that intercepts that  
> > > > change  
> > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > recorded and  
> > > > > > > translated to GFNs.  The vendor driver can't track these beyond  
> > getting  
> > > > > > > an unmap notification since it only knows the IOVA pfns, which  
> > can be  
> > > > > > > re-used with different GFN backing.  Once the DMA mapping is  
> > torn  
> > > > down,  
> > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > QEMU,  
> > > > > > > please help me find the code that handles it.  
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > log_sync  
> > > > > > callback registered by emulated devices in Qemu. Looks dirty pages
> > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > always  
> > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > presence of  
> > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated DMAs
> > > > > >  (translation can be done when DMA happens), then we don't need
> > > > > > worry about transient mapping from IOVA to GFN. Along this way  
> > we  
> > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > >  
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend  
> > on  
> > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > is more undesirable.  I also note without comment that  
> > gfn_to_memslot()  
> > > > is a GPL symbol.  Thanks,  
> > >
> > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > similar  
> > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > only flavor when implementing vfio_{un}pin_page_external.  
> > 
> > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > and
> > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the HVA
> > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > other vGPU mdev vendor manages to make use of this without KVM... the
> > KVM interface used by GVT-g is GPL-only.  
> 
> To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> through Xen specific hypercalls.

VFIO does not assume a hypervisor at all.  Yes, it happens to work well
with a host-based hypervisor like KVM where we can simply use GUP, but
I'd hardly call using the standard mechanism to pin a user page and get
the pfn within the Linux kernel a KVM assumption.  The fact that Dom0
Xen requires work here while KVM does not doesn't mean VFIO assumes
KVM.  Thanks,

Alex
 
> > > So GVT-g
> > > has to maintain an internal abstraction layer to support both Xen and
> > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > Xen guys are willing to support VFIO.  
> > 
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > KVM out of the business of being a (bad) device driver.  We do have
> > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > pseudo device, but we still try to keep those interfaces generic.  In
> > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > but that's largely just a mechanism to associate a cookie with a group
> > to be consumed by the mdev vendor driver such that it can work with kvm
> > external to vfio.  I don't intend to add further hypervisor awareness
> > to vfio.
> >   
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > similar to VFIO - DMA also happens in the kernel, while it already
> > > supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):  
> > 
> > GFNs or simply PFNs within an AddressSpace?
> >   
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > are updated directly when the guest memory is being updated. For
> > > example, PCI writes are completed through pci_dma_write, which
> > > goes through vIOMMU to translate IOVA into GPA and then update
> > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > 
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >   
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering .log_sync() callback, which is invoked in the dirty logging
> > > process. Now there are two users: kvm and vhost.
> > >
> > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > to Qemu when kvm_log_sync is invoked;
> > >
> > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > data structure. It maintains an internal iotlb which is synced with
> > > Qemu vIOMMU through a specific interface:
> > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > for Qemu to keep vhost iotlb in sync
> > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > a miss in vhost iotlb.
> > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > descriptor.
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and possibly less
> > > performant (if maintaining a small size iotlb) than straightforward
> > > calling into KVM interface. But the gain is also obvious, since it
> > > is fully constrained with VFIO.
> > >
> > > Thoughts? :-)  
> > 
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it get's log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for
> > QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA awareness
> > and potentially large tracking structures into the host kernel?  Thanks,
> >   
> 
> It is an interesting idea.  One drawback is that log_sync might be
> frequently invoked in IOVA case, but I guess the overhead is not much 
> compared to the total overhead of emulating the IOTLB invalidation. 
> Maybe other folks can better comment why this model was not 
> considered before, e.g. when vhost iotlb was introduced.
> 
> Thanks
> Kevin
Tian, Kevin Sept. 16, 2019, 1:53 a.m. UTC | #12
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, September 13, 2019 11:48 PM
> 
> On Thu, 12 Sep 2019 23:00:03 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 12, 2019 10:41 PM
> > >
> > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > >
> > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >
> > > > > > > From: Tian, Kevin
> > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > >
> > > > > > [...]
> > > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > > dynamic
> > > > > while
> > > > > > > > performing the live portion of a migration?  For example, each
> > > time a
> > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> the
> > > device
> > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is
> not
> > > > > > > > system_memory).  I don't see any QEMU code that intercepts
> that
> > > > > change
> > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > > recorded and
> > > > > > > > translated to GFNs.  The vendor driver can't track these beyond
> > > getting
> > > > > > > > an unmap notification since it only knows the IOVA pfns, which
> > > can be
> > > > > > > > re-used with different GFN backing.  Once the DMA mapping is
> > > torn
> > > > > down,
> > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > > QEMU,
> > > > > > > > please help me find the code that handles it.
> > > > > > >
> > > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > > log_sync
> > > > > > > callback registered by emulated devices in Qemu. Looks dirty
> pages
> > > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > > always
> > > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > > presence of
> > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> DMAs
> > > > > > >  (translation can be done when DMA happens), then we don't
> need
> > > > > > > worry about transient mapping from IOVA to GFN. Along this
> way
> > > we
> > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > similar to what KVM does. For vendor drivers, it needs to
> translate
> > > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > >
> > > > > >
> > > > > > HVA->GFN can be done through hva_to_gfn_memslot in
> kvm_host.h.
> > > > >
> > > > > I thought it was bad enough that we have vendor drivers that
> depend
> > > on
> > > > > KVM, but designing a vfio interface that only supports a KVM
> interface
> > > > > is more undesirable.  I also note without comment that
> > > gfn_to_memslot()
> > > > > is a GPL symbol.  Thanks,
> > > >
> > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > back to 3yrs (when discussing the 1st mdev framework), there were
> > > similar
> > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > > only flavor when implementing vfio_{un}pin_page_external.
> > >
> > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an
> IOVA
> > > and
> > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> HVA
> > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > other vGPU mdev vendor manages to make use of this without KVM...
> the
> > > KVM interface used by GVT-g is GPL-only.
> >
> > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > GUP is a perfect example, which doesn't work for Xen since DomU's
> > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > through Xen specific hypercalls.
> 
> VFIO does not assume a hypervisor at all.  Yes, it happens to work well
> with a host-based hypervisor like KVM were we can simply use GUP, but
> I'd hardly call using the standard mechanism to pin a user page and get
> the pfn within the Linux kernel a KVM assumption.  The fact that Dom0
> Xen requires work here while KVM does not does is not an equivalency to
> VFIO assuming KVM.  Thanks,
> 

Agree, thanks for clarification.

Thanks
Kevin
Tian, Kevin Sept. 24, 2019, 2:19 a.m. UTC | #13
> From: Tian, Kevin
> Sent: Friday, September 13, 2019 7:00 AM
> 
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 12, 2019 10:41 PM
> >
> > On Tue, 3 Sep 2019 06:57:27 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > >
> > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >
> > > > > > From: Tian, Kevin
> > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > >
> > > > > [...]
> > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > dynamic
> > > > while
> > > > > > > performing the live portion of a migration?  For example, each
> > time a
> > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> the
> > device
> > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > system_memory).  I don't see any QEMU code that intercepts
> that
> > > > change
> > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > recorded and
> > > > > > > translated to GFNs.  The vendor driver can't track these beyond
> > getting
> > > > > > > an unmap notification since it only knows the IOVA pfns, which
> > can be
> > > > > > > re-used with different GFN backing.  Once the DMA mapping is
> > torn
> > > > down,
> > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > QEMU,
> > > > > > > please help me find the code that handles it.
> > > > > >
> > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > log_sync
> > > > > > callback registered by emulated devices in Qemu. Looks dirty
> pages
> > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > always
> > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > presence of
> > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> DMAs
> > > > > >  (translation can be done when DMA happens), then we don't
> need
> > > > > > worry about transient mapping from IOVA to GFN. Along this way
> > we
> > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > >
> > > > >
> > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.
> > > >
> > > > I thought it was bad enough that we have vendor drivers that depend
> > on
> > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > is more undesirable.  I also note without comment that
> > gfn_to_memslot()
> > > > is a GPL symbol.  Thanks,
> > >
> > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > back to 3yrs (when discussing the 1st mdev framework), there were
> > similar
> > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > only flavor when implementing vfio_{un}pin_page_external.
> >
> > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > and
> > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> HVA
> > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > other vGPU mdev vendor manages to make use of this without KVM... the
> > KVM interface used by GVT-g is GPL-only.
> 
> To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> GUP is a perfect example, which doesn't work for Xen since DomU's
> memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> through Xen specific hypercalls.
> 
> >
> > > So GVT-g
> > > has to maintain an internal abstraction layer to support both Xen and
> > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > Xen guys are willing to support VFIO.
> >
> > Once upon a time, we had a KVM specific device assignment interface,
> > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > KVM out of the business of being a (bad) device driver.  We do have
> > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > pseudo device, but we still try to keep those interfaces generic.  In
> > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > but that's largely just a mechanism to associate a cookie with a group
> > to be consumed by the mdev vendor driver such that it can work with
> kvm
> > external to vfio.  I don't intend to add further hypervisor awareness
> > to vfio.
> >
> > > Back to this IOVA issue, I discussed with Yan and we found another
> > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > similar to VFIO - DMA also happens in the kernel, while it already
> > > supports vIOMMU.
> > >
> > > Generally speaking, there are three paths of dirty page collection
> > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > bitmap in GFN):
> >
> > GFNs or simply PFNs within an AddressSpace?
> >
> > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > are updated directly when the guest memory is being updated. For
> > > example, PCI writes are completed through pci_dma_write, which
> > > goes through vIOMMU to translate IOVA into GPA and then update
> > > the bitmap through cpu_physical_memory_set_dirty_range.
> >
> > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > on the IOMMU AddressSpace.
> >
> > > 2) Memory writes that are not tracked by Qemu are collected by
> > > registering .log_sync() callback, which is invoked in the dirty logging
> > > process. Now there are two users: kvm and vhost.
> > >
> > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > to Qemu when kvm_log_sync is invoked;
> > >
> > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > data structure. It maintains an internal iotlb which is synced with
> > > Qemu vIOMMU through a specific interface:
> > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > for Qemu to keep vhost iotlb in sync
> > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > a miss in vhost iotlb.
> > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > descriptor.
> > >
> > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > doesn't want to cache full-size vIOMMU mappings.
> > >
> > > Definitely this alternative needs more work and possibly less
> > > performant (if maintaining a small size iotlb) than straightforward
> > > calling into KVM interface. But the gain is also obvious, since it
> > > is fully constrained with VFIO.
> > >
> > > Thoughts? :-)
> >
> > So vhost must then be configuring a listener across system memory
> > rather than only against the device AddressSpace like we do in vfio,
> > such that it get's log_sync() callbacks for the actual GPA space rather
> > than only the IOVA space.  OTOH, QEMU could understand that the
> device
> > AddressSpace has a translate function and apply the IOVA dirty bits to
> > the system memory AddressSpace.  Wouldn't it make more sense for
> > QEMU
> > to perform a log_sync() prior to removing a MemoryRegionSection within
> > an AddressSpace and update the GPA rather than pushing GPA
> awareness
> > and potentially large tracking structures into the host kernel?  Thanks,
> >
> 

Hi, Alex,

I moved the VFIO-related discussion back to this thread, so as not to mix it
with the vhost-related discussions here.

https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html

Your latest reply still prefers the userspace approach:

> > Same as last time, you're asking VFIO to be aware of an entirely new
> > address space and implement tracking structures of that address space
> > to make life easier for QEMU.  Don't we typically push such complexity
> > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> >

I answered two points but didn't hear your further thoughts. Can you
take a look and respond?

The first point is about complexity and performance:
> 
> Is it really complex? There is no need for a new tracking structure - just
> allow the MAP interface to carry a new parameter and then record it in the
> existing vfio_dma objects.
> 
> Note the frequency of guest DMA map/unmap could be very high. We
> saw >100K invocations per second with a 40G NIC. To do the right
> translation Qemu requires a log_sync for every unmap, before the
> mapping for the logged dirty IOVA becomes stale. In Kirti's current patch,
> each log_sync requires several system calls through the migration
> info, e.g. setting start_pfn/page_size/total_pfns and then reading
> data_offset/data_size. That design is fine for doing log_sync in every
> pre-copy round, but too costly if doing so for every IOVA unmap. If
> a small extension in the kernel can lead to a great reduction in overhead,
> why not?
> 
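For illustration only (none of this exists today), the extended MAP interface
hinted at above might look roughly like the current vfio_iommu_type1_dma_map
plus a GPA field that the kernel records in its vfio_dma objects:

#include <linux/types.h>

/* Hypothetical v2 of the type1 MAP ioctl argument: userspace passes the
 * GPA alongside the IOVA and HVA, so the kernel can later report dirty
 * pages (or offer IOVA->GPA translation to vendor drivers) in GFN terms.
 */
struct vfio_iommu_type1_dma_map_v2 {
	__u32 argsz;
	__u32 flags;
	__u64 vaddr;	/* HVA, as today */
	__u64 iova;	/* IOVA, as today */
	__u64 gpa;	/* new: guest physical address backing this IOVA */
	__u64 size;
};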

The second point is about write-protection:

> There is another benefit of recording the GPA in VFIO. Vendor drivers (e.g.
> GVT-g) may need to selectively write-protect guest memory pages
> when interpreting certain workload descriptors. Those pages are
> recorded by IOVA when vIOMMU is enabled, however the KVM
> write-protection API only knows GPAs. So currently vIOMMU must
> be disabled on Intel vGPUs when GVT-g is enabled. To make it work
> we need a way to translate IOVA into GPA in the vendor drivers. There
> are two options. One is having KVM export a new API for this
> translation. But as you explained earlier it's not good to
> have vendor drivers depend on KVM. The other is having VFIO
> maintain such knowledge through an extended MAP interface,
> then provide a uniform API for all vendor drivers to use.

Thanks
Kevin
Alex Williamson Sept. 24, 2019, 6:03 p.m. UTC | #14
On Tue, 24 Sep 2019 02:19:15 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Tian, Kevin
> > Sent: Friday, September 13, 2019 7:00 AM
> >   
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 12, 2019 10:41 PM
> > >
> > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > >  
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > >
> > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >  
> > > > > > > From: Tian, Kevin
> > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > >  
> > > > > > [...]  
> > > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > > dynamic  
> > > > > while  
> > > > > > > > performing the live portion of a migration?  For example, each  
> > > time a  
> > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for  
> > the  
> > > device  
> > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is not
> > > > > > > > system_memory).  I don't see any QEMU code that intercepts  
> > that  
> > > > > change  
> > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > > recorded and  
> > > > > > > > translated to GFNs.  The vendor driver can't track these beyond  
> > > getting  
> > > > > > > > an unmap notification since it only knows the IOVA pfns, which  
> > > can be  
> > > > > > > > re-used with different GFN backing.  Once the DMA mapping is  
> > > torn  
> > > > > down,  
> > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > > QEMU,  
> > > > > > > > please help me find the code that handles it.  
> > > > > > >
> > > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > > log_sync  
> > > > > > > callback registered by emulated devices in Qemu. Looks dirty  
> > pages  
> > > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > > always  
> > > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > > presence of  
> > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated  
> > DMAs  
> > > > > > >  (translation can be done when DMA happens), then we don't  
> > need  
> > > > > > > worry about transient mapping from IOVA to GFN. Along this way  
> > > we  
> > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > similar to what KVM does. For vendor drivers, it needs to translate
> > > > > > > from IOVA to HVA to GFN when tracking DMA activities on VFIO
> > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can be
> > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > >  
> > > > > >
> > > > > > HVA->GFN can be done through hva_to_gfn_memslot in kvm_host.h.  
> > > > >
> > > > > I thought it was bad enough that we have vendor drivers that depend  
> > > on  
> > > > > KVM, but designing a vfio interface that only supports a KVM interface
> > > > > is more undesirable.  I also note without comment that  
> > > gfn_to_memslot()  
> > > > > is a GPL symbol.  Thanks,  
> > > >
> > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > > similar  
> > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but needs
> > > > hypercall in Xen. but VFIO already makes assumption based on KVM-
> > > > only flavor when implementing vfio_{un}pin_page_external.  
> > >
> > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an IOVA
> > > and
> > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the  
> > HVA  
> > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > other vGPU mdev vendor manages to make use of this without KVM... the
> > > KVM interface used by GVT-g is GPL-only.  
> > 
> > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > GUP is a perfect example, which doesn't work for Xen since DomU's
> > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > through Xen specific hypercalls.
> >   
> > >  
> > > > So GVT-g
> > > > has to maintain an internal abstraction layer to support both Xen and
> > > > KVM. Maybe someday we will re-consider introducing some hypervisor
> > > > abstraction layer in VFIO, if this issue starts to hurt other devices and
> > > > Xen guys are willing to support VFIO.  
> > >
> > > Once upon a time, we had a KVM specific device assignment interface,
> > > ie. legacy KVM devie assignment.  We developed VFIO specifically to get
> > > KVM out of the business of being a (bad) device driver.  We do have
> > > some awareness and interaction between VFIO and KVM in the vfio-kvm
> > > pseudo device, but we still try to keep those interfaces generic.  In
> > > some cases we're not very successful at that, see vfio_group_set_kvm(),
> > > but that's largely just a mechanism to associate a cookie with a group
> > > to be consumed by the mdev vendor driver such that it can work with  
> > kvm  
> > > external to vfio.  I don't intend to add further hypervisor awareness
> > > to vfio.
> > >  
> > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > supports vIOMMU.
> > > >
> > > > Generally speaking, there are three paths of dirty page collection
> > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > bitmap in GFN):  
> > >
> > > GFNs or simply PFNs within an AddressSpace?
> > >  
> > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty bitmaps
> > > > are updated directly when the guest memory is being updated. For
> > > > example, PCI writes are completed through pci_dma_write, which
> > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > >
> > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > on the IOMMU AddressSpace.
> > >  
> > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > process. Now there are two users: kvm and vhost.
> > > >
> > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > or EPT A/D bits (+PML). This part is always based on GFN and returned
> > > > to Qemu when kvm_log_sync is invoked;
> > > >
> > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > data structure. It maintains an internal iotlb which is synced with
> > > > Qemu vIOMMU through a specific interface:
> > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > for Qemu to keep vhost iotlb in sync
> > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > a miss in vhost iotlb.
> > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > update the buffer (using internal iotlb to get GFN) when serving vring
> > > > descriptor.
> > > >
> > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > doesn't want to cache full-size vIOMMU mappings.
> > > >
> > > > Definitely this alternative needs more work and possibly less
> > > > performant (if maintaining a small size iotlb) than straightforward
> > > > calling into KVM interface. But the gain is also obvious, since it
> > > > is fully constrained with VFIO.
> > > >
> > > > Thoughts? :-)  
> > >
> > > So vhost must then be configuring a listener across system memory
> > > rather than only against the device AddressSpace like we do in vfio,
> > > such that it get's log_sync() callbacks for the actual GPA space rather
> > > than only the IOVA space.  OTOH, QEMU could understand that the  
> > device  
> > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > QEMU
> > > to perform a log_sync() prior to removing a MemoryRegionSection within
> > > an AddressSpace and update the GPA rather than pushing GPA  
> > awareness  
> > > and potentially large tracking structures into the host kernel?  Thanks,
> > >  
> >   
> 
> Hi, Alex,
> 
> I moved back the VFIO related discussion to this thread, to not mix with
> vhost related discussions here.
> 
> https://lists.nongnu.org/archive/html/qemu-devel/2019-09/msg03126.html
> 
> Your latest reply still prefers to the userspace approach:
> 
> > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > address space and implement tracking structures of that address space
> > > to make life easier for QEMU.  Don't we typically push such complexity
> > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > >  
> 
> I answered two points but didn't hear your further thoughts. Can you
> take a look and respond?
> 
> The first point is about complexity and performance:
> > 
> > Is it really complex? No need of a new tracking structure. Just allowing
> > the MAP interface to carry a new parameter and then record it in the
> > existing vfio_dma objects.
> > 
> > Note the frequency of guest DMA map/unmap could be very high. We
> > saw >100K invocations per second with a 40G NIC. To do the right
> > translation Qemu requires log_sync for every unmap, before the
> > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > each log_sync requires several system_calls through the migration
> > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > data_offset/data_size. That design is fine for doing log_sync in every
> > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > small extension in kernel can lead to great overhead reduction,
> > why not?

You're citing a workload that already performs abysmally with vfio and
vIOMMU; we cannot handle those rates efficiently with the current vfio
DMA API.  The current use cases of vIOMMU and vfio are predominantly
for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are
relatively static or else performance problems are already very
apparent.  In that sort of model, I don't see that QEMU doing a
log_sync on unmap is really an issue; unmaps should be relatively
rare.  Of course I don't want to compound the issue, but the current
vfio DMA mapping interface needs to be scrapped to make this remotely
performant even before we look at migration performance, so does it
really make sense to introduce GPAs for a workload the ioctls are
unsuited for?
 
> The second point is about write-protection:
> 
> > There is another value of recording GPA in VFIO. Vendor drivers
> > (e.g. GVT-g) may need to selectively write-protect guest memory
> > pages when interpreting certain workload descriptors. Those pages
> > are recorded in IOVA when vIOMMU is enabled, however the KVM 
> > write-protection API only knows GPA. So currently vIOMMU must
> > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > we need a way to translate IOVA into GPA in the vendor drivers.
> > There are two options. One is having KVM export a new API for such 
> > translation purpose. But as you explained earlier it's not good to
> > have vendor drivers depend on KVM. The other is having VFIO
> > maintaining such knowledge through extended MAP interface, 
> > then providing a uniform API for all vendor drivers to use.  

So the argument is that in order to interact with KVM (write protecting
guest memory) there's a missing feature (IOVA to GPA translation), but
we don't want to add an API to KVM for this feature because that would
create a dependency on KVM (for interacting with KVM), so let's add an
API to vfio instead.  That makes no sense to me.  What am I missing?
Thanks,

Alex
Tian, Kevin Sept. 24, 2019, 11:04 p.m. UTC | #15
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Wednesday, September 25, 2019 2:03 AM
> 
> On Tue, 24 Sep 2019 02:19:15 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Tian, Kevin
> > > Sent: Friday, September 13, 2019 7:00 AM
> > >
> > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > Sent: Thursday, September 12, 2019 10:41 PM
> > > >
> > > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > >
> > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > > >
> > > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > >
> > > > > > > > From: Tian, Kevin
> > > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > > >
> > > > > > > [...]
> > > > > > > > > How does QEMU handle the fact that IOVAs are potentially
> > > > dynamic
> > > > > > while
> > > > > > > > > performing the live portion of a migration?  For example,
> each
> > > > time a
> > > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for
> > > the
> > > > device
> > > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is
> not
> > > > > > > > > system_memory).  I don't see any QEMU code that intercepts
> > > that
> > > > > > change
> > > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be
> > > > recorded and
> > > > > > > > > translated to GFNs.  The vendor driver can't track these
> beyond
> > > > getting
> > > > > > > > > an unmap notification since it only knows the IOVA pfns,
> which
> > > > can be
> > > > > > > > > re-used with different GFN backing.  Once the DMA mapping
> is
> > > > torn
> > > > > > down,
> > > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in
> > > > QEMU,
> > > > > > > > > please help me find the code that handles it.
> > > > > > > >
> > > > > > > > I'm curious about this part too. Interestingly, I didn't find any
> > > > log_sync
> > > > > > > > callback registered by emulated devices in Qemu. Looks dirty
> > > pages
> > > > > > > > by emulated DMAs are recorded in some implicit way. But KVM
> > > > always
> > > > > > > > reports dirty page in GFN instead of IOVA, regardless of the
> > > > presence of
> > > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated
> > > DMAs
> > > > > > > >  (translation can be done when DMA happens), then we don't
> > > need
> > > > > > > > worry about transient mapping from IOVA to GFN. Along this
> way
> > > > we
> > > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > > similar to what KVM does. For vendor drivers, it needs to
> translate
> > > > > > > > from IOVA to HVA to GFN when tracking DMA activities on
> VFIO
> > > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can
> be
> > > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > > >
> > > > > > >
> > > > > > > HVA->GFN can be done through hva_to_gfn_memslot in
> kvm_host.h.
> > > > > >
> > > > > > I thought it was bad enough that we have vendor drivers that
> depend
> > > > on
> > > > > > KVM, but designing a vfio interface that only supports a KVM
> interface
> > > > > > is more undesirable.  I also note without comment that
> > > > gfn_to_memslot()
> > > > > > is a GPL symbol.  Thanks,
> > > > >
> > > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > > back to 3yrs (when discussing the 1st mdev framework), there were
> > > > similar
> > > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but
> needs
> > > > > hypercall in Xen. but VFIO already makes assumption based on
> KVM-
> > > > > only flavor when implementing vfio_{un}pin_page_external.
> > > >
> > > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an
> IOVA
> > > > and
> > > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the
> > > HVA
> > > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > > other vGPU mdev vendor manages to make use of this without KVM...
> the
> > > > KVM interface used by GVT-g is GPL-only.
> > >
> > > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > > GUP is a perfect example, which doesn't work for Xen since DomU's
> > > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > > through Xen specific hypercalls.
> > >
> > > >
> > > > > So GVT-g
> > > > > has to maintain an internal abstraction layer to support both Xen
> and
> > > > > KVM. Maybe someday we will re-consider introducing some
> hypervisor
> > > > > abstraction layer in VFIO, if this issue starts to hurt other devices
> and
> > > > > Xen guys are willing to support VFIO.
> > > >
> > > > Once upon a time, we had a KVM specific device assignment interface,
> > > > ie. legacy KVM devie assignment.  We developed VFIO specifically to
> get
> > > > KVM out of the business of being a (bad) device driver.  We do have
> > > > some awareness and interaction between VFIO and KVM in the vfio-
> kvm
> > > > pseudo device, but we still try to keep those interfaces generic.  In
> > > > some cases we're not very successful at that, see
> vfio_group_set_kvm(),
> > > > but that's largely just a mechanism to associate a cookie with a group
> > > > to be consumed by the mdev vendor driver such that it can work with
> > > kvm
> > > > external to vfio.  I don't intend to add further hypervisor awareness
> > > > to vfio.
> > > >
> > > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > > supports vIOMMU.
> > > > >
> > > > > Generally speaking, there are three paths of dirty page collection
> > > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > > bitmap in GFN):
> > > >
> > > > GFNs or simply PFNs within an AddressSpace?
> > > >
> > > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty
> bitmaps
> > > > > are updated directly when the guest memory is being updated. For
> > > > > example, PCI writes are completed through pci_dma_write, which
> > > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > > the bitmap through cpu_physical_memory_set_dirty_range.
> > > >
> > > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > > on the IOMMU AddressSpace.
> > > >
> > > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > > process. Now there are two users: kvm and vhost.
> > > > >
> > > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > > or EPT A/D bits (+PML). This part is always based on GFN and
> returned
> > > > > to Qemu when kvm_log_sync is invoked;
> > > > >
> > > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > > data structure. It maintains an internal iotlb which is synced with
> > > > > Qemu vIOMMU through a specific interface:
> > > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > > for Qemu to keep vhost iotlb in sync
> > > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > > a miss in vhost iotlb.
> > > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > > update the buffer (using internal iotlb to get GFN) when serving
> vring
> > > > > descriptor.
> > > > >
> > > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > > doesn't want to cache full-size vIOMMU mappings.
> > > > >
> > > > > Definitely this alternative needs more work and possibly less
> > > > > performant (if maintaining a small size iotlb) than straightforward
> > > > > calling into KVM interface. But the gain is also obvious, since it
> > > > > is fully constrained with VFIO.
> > > > >
> > > > > Thoughts? :-)
> > > >
> > > > So vhost must then be configuring a listener across system memory
> > > > rather than only against the device AddressSpace like we do in vfio,
> > > > such that it get's log_sync() callbacks for the actual GPA space rather
> > > > than only the IOVA space.  OTOH, QEMU could understand that the
> > > device
> > > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > > QEMU
> > > > to perform a log_sync() prior to removing a MemoryRegionSection
> within
> > > > an AddressSpace and update the GPA rather than pushing GPA
> > > awareness
> > > > and potentially large tracking structures into the host kernel?  Thanks,
> > > >
> > >
> >
> > Hi, Alex,
> >
> > I moved back the VFIO related discussion to this thread, to not mix with
> > vhost related discussions here.
> >
> > https://lists.nongnu.org/archive/html/qemu-devel/2019-
> 09/msg03126.html
> >
> > Your latest reply still prefers to the userspace approach:
> >
> > > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > > address space and implement tracking structures of that address
> space
> > > > to make life easier for QEMU.  Don't we typically push such complexity
> > > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > > >
> >
> > I answered two points but didn't hear your further thoughts. Can you
> > take a look and respond?
> >
> > The first point is about complexity and performance:
> > >
> > > Is it really complex? No need of a new tracking structure. Just allowing
> > > the MAP interface to carry a new parameter and then record it in the
> > > existing vfio_dma objects.
> > >
> > > Note the frequency of guest DMA map/unmap could be very high. We
> > > saw >100K invocations per second with a 40G NIC. To do the right
> > > translation Qemu requires log_sync for every unmap, before the
> > > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > > each log_sync requires several system_calls through the migration
> > > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > > data_offset/data_size. That design is fine for doing log_sync in every
> > > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > > small extension in kernel can lead to great overhead reduction,
> > > why not?
> 
> You're citing a workload that already performs abysmally with vfio and
> vIOMMU, we cannot handle those rates efficiently with the current vfio
> DMA API.  The current use cases of vIOMMU and vfio are predominantly
> for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are
> relatively static or else performance problems are already very
> apparent.  In that sort of model, I don't see that QEMU doing a
> log_sync on unmap is really an issue, unmaps should be relatively
> rare.  Of course I don't want to compound the issue, but the current
> vfio DMA mapping interfaces needs to be scrapped to make this remotely
> performant even before we look at migration performance, so does it
> really make sense to introduce GPAs for a workload the ioctls are
> unsuited for?
> 
> > The second point is about write-protection:
> >
> > > There is another value of recording GPA in VFIO. Vendor drivers
> > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > pages when interpreting certain workload descriptors. Those pages
> > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > write-protection API only knows GPA. So currently vIOMMU must
> > > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > There are two options. One is having KVM export a new API for such
> > > translation purpose. But as you explained earlier it's not good to
> > > have vendor drivers depend on KVM. The other is having VFIO
> > > maintaining such knowledge through extended MAP interface,
> > > then providing a uniform API for all vendor drivers to use.
> 
> So the argument is that in order to interact with KVM (write protecting
> guest memory) there's a missing feature (IOVA to GPA translation), but
> we don't want to add an API to KVM for this feature because that would
> create a dependency on KVM (for interacting with KVM), so lets add an
> API to vfio instead.  That makes no sense to me.  What am I missing?
> Thanks,
> 

Then do you have a recommendation for how such a feature can be
implemented cleanly in the vendor driver, without introducing a direct
dependency on KVM?

Thanks
Kevin
Alex Williamson Sept. 25, 2019, 7:06 p.m. UTC | #16
On Tue, 24 Sep 2019 23:04:22 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Wednesday, September 25, 2019 2:03 AM
> > 
> > On Tue, 24 Sep 2019 02:19:15 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Tian, Kevin
> > > > Sent: Friday, September 13, 2019 7:00 AM
> > > >  
> > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > Sent: Thursday, September 12, 2019 10:41 PM
> > > > >
> > > > > On Tue, 3 Sep 2019 06:57:27 +0000
> > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > >  
> > > > > > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > > > > > Sent: Saturday, August 31, 2019 12:33 AM
> > > > > > >
> > > > > > > On Fri, 30 Aug 2019 08:06:32 +0000
> > > > > > > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> > > > > > >  
> > > > > > > > > From: Tian, Kevin
> > > > > > > > > Sent: Friday, August 30, 2019 3:26 PM
> > > > > > > > >  
> > > > > > > > [...]  
> > > > > > > > > > How does QEMU handle the fact that IOVAs are potentially  
> > > > > dynamic  
> > > > > > > while  
> > > > > > > > > > performing the live portion of a migration?  For example,  
> > each  
> > > > > time a  
> > > > > > > > > > guest driver calls dma_map_page() or dma_unmap_page(), a
> > > > > > > > > > MemoryRegionSection pops in or out of the AddressSpace for  
> > > > the  
> > > > > device  
> > > > > > > > > > (I'm assuming a vIOMMU where the device AddressSpace is  
> > not  
> > > > > > > > > > system_memory).  I don't see any QEMU code that intercepts  
> > > > that  
> > > > > > > change  
> > > > > > > > > > in the AddressSpace such that the IOVA dirty pfns could be  
> > > > > recorded and  
> > > > > > > > > > translated to GFNs.  The vendor driver can't track these  
> > beyond  
> > > > > getting  
> > > > > > > > > > an unmap notification since it only knows the IOVA pfns,  
> > which  
> > > > > can be  
> > > > > > > > > > re-used with different GFN backing.  Once the DMA mapping  
> > is  
> > > > > torn  
> > > > > > > down,  
> > > > > > > > > > it seems those dirty pfns are lost in the ether.  If this works in  
> > > > > QEMU,  
> > > > > > > > > > please help me find the code that handles it.  
> > > > > > > > >
> > > > > > > > > I'm curious about this part too. Interestingly, I didn't find any  
> > > > > log_sync  
> > > > > > > > > callback registered by emulated devices in Qemu. Looks dirty  
> > > > pages  
> > > > > > > > > by emulated DMAs are recorded in some implicit way. But KVM  
> > > > > always  
> > > > > > > > > reports dirty page in GFN instead of IOVA, regardless of the  
> > > > > presence of  
> > > > > > > > > vIOMMU. If Qemu also tracks dirty pages in GFN for emulated  
> > > > DMAs  
> > > > > > > > >  (translation can be done when DMA happens), then we don't  
> > > > need  
> > > > > > > > > worry about transient mapping from IOVA to GFN. Along this  
> > way  
> > > > > we  
> > > > > > > > > also want GFN-based dirty bitmap being reported through VFIO,
> > > > > > > > > similar to what KVM does. For vendor drivers, it needs to  
> > translate  
> > > > > > > > > from IOVA to HVA to GFN when tracking DMA activities on  
> > VFIO  
> > > > > > > > > devices. IOVA->HVA is provided by VFIO. for HVA->GFN, it can  
> > be  
> > > > > > > > > provided by KVM but I'm not sure whether it's exposed now.
> > > > > > > > >  
> > > > > > > >
> > > > > > > > HVA->GFN can be done through hva_to_gfn_memslot in  
> > kvm_host.h.  
> > > > > > >
> > > > > > > I thought it was bad enough that we have vendor drivers that  
> > depend  
> > > > > on  
> > > > > > > KVM, but designing a vfio interface that only supports a KVM  
> > interface  
> > > > > > > is more undesirable.  I also note without comment that  
> > > > > gfn_to_memslot()  
> > > > > > > is a GPL symbol.  Thanks,  
> > > > > >
> > > > > > yes it is bad, but sometimes inevitable. If you recall our discussions
> > > > > > back to 3yrs (when discussing the 1st mdev framework), there were  
> > > > > similar  
> > > > > > hypervisor dependencies in GVT-g, e.g. querying gpa->hpa when
> > > > > > creating some shadow structures. gpa->hpa is definitely hypervisor
> > > > > > specific knowledge, which is easy in KVM (gpa->hva->hpa), but  
> > needs  
> > > > > > hypercall in Xen. but VFIO already makes assumption based on  
> > KVM-  
> > > > > > only flavor when implementing vfio_{un}pin_page_external.  
> > > > >
> > > > > Where's the KVM assumption there?  The MAP_DMA ioctl takes an  
> > IOVA  
> > > > > and
> > > > > HVA.  When an mdev vendor driver calls vfio_pin_pages(), we GUP the  
> > > > HVA  
> > > > > to get an HPA and provide an array of HPA pfns back to the caller.  The
> > > > > other vGPU mdev vendor manages to make use of this without KVM...  
> > the  
> > > > > KVM interface used by GVT-g is GPL-only.  
> > > >
> > > > To be clear it's the assumption on the host-based hypervisors e.g. KVM.
> > > > GUP is a perfect example, which doesn't work for Xen since DomU's
> > > > memory doesn't belong to Dom0. VFIO in Dom0 has to find the HPA
> > > > through Xen specific hypercalls.
> > > >  
> > > > >  
> > > > > > So GVT-g
> > > > > > has to maintain an internal abstraction layer to support both Xen  
> > and  
> > > > > > KVM. Maybe someday we will re-consider introducing some  
> > hypervisor  
> > > > > > abstraction layer in VFIO, if this issue starts to hurt other devices  
> > and  
> > > > > > Xen guys are willing to support VFIO.  
> > > > >
> > > > > Once upon a time, we had a KVM specific device assignment interface,
> > > > > ie. legacy KVM devie assignment.  We developed VFIO specifically to  
> > get  
> > > > > KVM out of the business of being a (bad) device driver.  We do have
> > > > > some awareness and interaction between VFIO and KVM in the vfio-  
> > kvm  
> > > > > pseudo device, but we still try to keep those interfaces generic.  In
> > > > > some cases we're not very successful at that, see  
> > vfio_group_set_kvm(),  
> > > > > but that's largely just a mechanism to associate a cookie with a group
> > > > > to be consumed by the mdev vendor driver such that it can work with  
> > > > kvm  
> > > > > external to vfio.  I don't intend to add further hypervisor awareness
> > > > > to vfio.
> > > > >  
> > > > > > Back to this IOVA issue, I discussed with Yan and we found another
> > > > > > hypervisor-agnostic alternative, by learning from vhost. vhost is very
> > > > > > similar to VFIO - DMA also happens in the kernel, while it already
> > > > > > supports vIOMMU.
> > > > > >
> > > > > > Generally speaking, there are three paths of dirty page collection
> > > > > > in Qemu so far (as previously noted, Qemu always tracks the dirty
> > > > > > bitmap in GFN):  
> > > > >
> > > > > GFNs or simply PFNs within an AddressSpace?
> > > > >  
> > > > > > 1) Qemu-tracked memory writes (e.g. emulated DMAs). Dirty  
> > bitmaps  
> > > > > > are updated directly when the guest memory is being updated. For
> > > > > > example, PCI writes are completed through pci_dma_write, which
> > > > > > goes through vIOMMU to translate IOVA into GPA and then update
> > > > > > the bitmap through cpu_physical_memory_set_dirty_range.  
> > > > >
> > > > > Right, so the IOVA to GPA (GFN) occurs through an explicit translation
> > > > > on the IOMMU AddressSpace.
> > > > >  
> > > > > > 2) Memory writes that are not tracked by Qemu are collected by
> > > > > > registering .log_sync() callback, which is invoked in the dirty logging
> > > > > > process. Now there are two users: kvm and vhost.
> > > > > >
> > > > > >   2.1) KVM tracks CPU-side memory writes, through write-protection
> > > > > > or EPT A/D bits (+PML). This part is always based on GFN and  
> > returned  
> > > > > > to Qemu when kvm_log_sync is invoked;
> > > > > >
> > > > > >   2.2) vhost tracks kernel-side DMA writes, by interpreting vring
> > > > > > data structure. It maintains an internal iotlb which is synced with
> > > > > > Qemu vIOMMU through a specific interface:
> > > > > > 	- new vhost message type (VHOST_IOTLB_UPDATE/INVALIDATE)
> > > > > > for Qemu to keep vhost iotlb in sync
> > > > > > 	- new VHOST_IOTLB_MISS message to notify Qemu in case of
> > > > > > a miss in vhost iotlb.
> > > > > > 	- Qemu registers a log buffer to kernel vhost driver. The latter
> > > > > > update the buffer (using internal iotlb to get GFN) when serving  
> > vring  
> > > > > > descriptor.
> > > > > >
> > > > > > VFIO could also implement an internal iotlb, so vendor drivers can
> > > > > > utilize the iotlb to update the GFN-based dirty bitmap. Ideally we
> > > > > > don't need re-invent another iotlb protocol as vhost does. vIOMMU
> > > > > > already sends map/unmap ioctl cmds upon any change of IOVA
> > > > > > mapping. We may introduce a v2 map/unmap interface, allowing
> > > > > > Qemu to pass both {iova, gpa, hva} together to keep internal iotlb
> > > > > > in-sync. But we may also need a iotlb_miss_upcall interface, if VFIO
> > > > > > doesn't want to cache full-size vIOMMU mappings.
> > > > > >
> > > > > > Definitely this alternative needs more work and possibly less
> > > > > > performant (if maintaining a small size iotlb) than straightforward
> > > > > > calling into KVM interface. But the gain is also obvious, since it
> > > > > > is fully constrained with VFIO.
> > > > > >
> > > > > > Thoughts? :-)  
> > > > >
> > > > > So vhost must then be configuring a listener across system memory
> > > > > rather than only against the device AddressSpace like we do in vfio,
> > > > > such that it get's log_sync() callbacks for the actual GPA space rather
> > > > > than only the IOVA space.  OTOH, QEMU could understand that the  
> > > > device  
> > > > > AddressSpace has a translate function and apply the IOVA dirty bits to
> > > > > the system memory AddressSpace.  Wouldn't it make more sense for
> > > > > QEMU
> > > > > to perform a log_sync() prior to removing a MemoryRegionSection  
> > within  
> > > > > an AddressSpace and update the GPA rather than pushing GPA  
> > > > awareness  
> > > > > and potentially large tracking structures into the host kernel?  Thanks,
> > > > >  
> > > >  
> > >
> > > Hi, Alex,
> > >
> > > I moved back the VFIO related discussion to this thread, to not mix with
> > > vhost related discussions here.
> > >
> > > https://lists.nongnu.org/archive/html/qemu-devel/2019-  
> > 09/msg03126.html  
> > >
> > > Your latest reply still prefers to the userspace approach:
> > >  
> > > > > Same as last time, you're asking VFIO to be aware of an entirely new
> > > > > address space and implement tracking structures of that address  
> > space  
> > > > > to make life easier for QEMU.  Don't we typically push such complexity
> > > > > to userspace rather than into the kernel?  I'm not convinced.  Thanks,
> > > > >  
> > >
> > > I answered two points but didn't hear your further thoughts. Can you
> > > take a look and respond?
> > >
> > > The first point is about complexity and performance:  
> > > >
> > > > Is it really complex? No need of a new tracking structure. Just allowing
> > > > the MAP interface to carry a new parameter and then record it in the
> > > > existing vfio_dma objects.
> > > >
> > > > Note the frequency of guest DMA map/unmap could be very high. We
> > > > saw >100K invocations per second with a 40G NIC. To do the right
> > > > translation Qemu requires log_sync for every unmap, before the
> > > > mapping for logged dirty IOVA becomes stale. In current Kirti's patch,
> > > > each log_sync requires several system_calls through the migration
> > > > info, e.g. setting start_pfn/page_size/total_pfns and then reading
> > > > data_offset/data_size. That design is fine for doing log_sync in every
> > > > pre-copy round, but too costly if doing so for every IOVA unmap. If
> > > > small extension in kernel can lead to great overhead reduction,
> > > > why not?  
> > 
> > You're citing a workload that already performs abysmally with vfio and
> > vIOMMU, we cannot handle those rates efficiently with the current vfio
> > DMA API.  The current use cases of vIOMMU and vfio are predominantly
> > for nesting vfio, ex. DPDK/SPDK, where we assume the mappings are
> > relatively static or else performance problems are already very
> > apparent.  In that sort of model, I don't see that QEMU doing a
> > log_sync on unmap is really an issue, unmaps should be relatively
> > rare.  Of course I don't want to compound the issue, but the current
> > vfio DMA mapping interfaces needs to be scrapped to make this remotely
> > performant even before we look at migration performance, so does it
> > really make sense to introduce GPAs for a workload the ioctls are
> > unsuited for?
> >   
> > > The second point is about write-protection:
> > >  
> > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > pages when interpreting certain workload descriptors. Those pages
> > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it working
> > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > There are two options. One is having KVM export a new API for such
> > > > translation purpose. But as you explained earlier it's not good to
> > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > maintaining such knowledge through extended MAP interface,
> > > > then providing a uniform API for all vendor drivers to use.  
> > 
> > So the argument is that in order to interact with KVM (write protecting
> > guest memory) there's a missing feature (IOVA to GPA translation), but
> > we don't want to add an API to KVM for this feature because that would
> > create a dependency on KVM (for interacting with KVM), so lets add an
> > API to vfio instead.  That makes no sense to me.  What am I missing?
> > Thanks,
> >   
> 
> Then do you have a recommendation how such feature can be 
> implemented cleanly in vendor driver, without introducing direct
> dependency on KVM? 

I think the disconnect is that these sorts of extensions don't reflect
things that a physical device can actually do.  The idea of vfio is
that it's a userspace driver interface.  It provides a channel for the
user to interact with the device, map device resources, receive
interrupts, map system memory through the iommu, etc.  Mediated devices
augment this by replacing the physical device the user accesses with a
software virtualized device.  So then the question becomes why this
device virtualizing software, ie. the mdev vendor driver, needs to do
things that a physical device clearly cannot do.  For example, how can
a physical device write-protect portions of system memory?  Or even,
why would it need to?  It makes me suspect that mdev is being used to
bypass the hypervisor, or maybe fill in the gaps for hardware that
isn't as "mediation friendly" as it claims to be.

In the case of a physical device discovering an iova translation, this
is what device iotlbs are for, but as an acceleration and offload
mechanism for the system iommu rather than a lookup mechanism as seems
to be wanted here.  If we had a system iommu with dirty page tracking,
I believe that tracking would live in the iommu page tables and
therefore reflect dirty pages relative to iova.  We'd need to consume
those dirty page bits before we tear down the iova mappings, much like
we're suggesting QEMU do here.

Unfortunately I also think that KVM and vhost are not really the best
examples of what we need to do for vfio.  KVM is intimately involved
with GPAs, so clearly dirty page tracking at that level is not an
issue.  Vhost tends to circumvent the viommu; it's trying to poke
directly into guest memory without the help of a physical iommu.  So
I can't say that I have much faith that QEMU is already properly wired
with respect to viommu and dirty page tracking, leaving open the
possibility that a log_sync on iommu region unmap is simply a gap in
the QEMU migration story.  The vfio migration interface we have on the
table seems like it could work, but QEMU needs an update and we need to
define the interface in terms of pfns relative to the address space.

If GPAs are still needed, what are they for?  The write-protect example
is clearly a hypervisor level interaction as I assume it's write
protection relative to the vCPU.  It's a hypervisor specific interface
to perform that write-protection, so why wouldn't we use a hypervisor
specific interface to collect the data to perform that operation?
IOW, if GVT-g already has a KVM dependency, why be concerned about
adding another GVT-g KVM dependency?  It seems like vfio is just a
potentially convenient channel, but as discussed above, vfio has no
business in GPAs because devices don't operate on GPAs and I've not
been sold that there's value in vfio getting involved in that address
space.  Convince me otherwise ;)  Thanks,

Alex
Tian, Kevin Sept. 26, 2019, 3:07 a.m. UTC | #17
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Thursday, September 26, 2019 3:06 AM
[...]
> > > > The second point is about write-protection:
> > > >
> > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > pages when interpreting certain workload descriptors. Those pages
> > > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it
> working
> > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > There are two options. One is having KVM export a new API for such
> > > > > translation purpose. But as you explained earlier it's not good to
> > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > maintaining such knowledge through extended MAP interface,
> > > > > then providing a uniform API for all vendor drivers to use.
> > >
> > > So the argument is that in order to interact with KVM (write protecting
> > > guest memory) there's a missing feature (IOVA to GPA translation), but
> > > we don't want to add an API to KVM for this feature because that would
> > > create a dependency on KVM (for interacting with KVM), so lets add an
> > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > Thanks,
> > >
> >
> > Then do you have a recommendation how such feature can be
> > implemented cleanly in vendor driver, without introducing direct
> > dependency on KVM?
> 
> I think the disconnect is that these sorts of extensions don't reflect
> things that a physical device can actually do.  The idea of vfio is
> that it's a userspace driver interface.  It provides a channel for the
> user to interact with the device, map device resources, receive
> interrupts, map system memory through the iommu, etc.  Mediated
> devices
> augment this by replacing the physical device the user accesses with a
> software virtualized device.  So then the question becomes why this
> device virtualizing software, ie. the mdev vendor driver, needs to do
> things that a physical device clearly cannot do.  For example, how can
> a physical device write-protect portions of system memory?  Or even,
> why would it need to?  It makes me suspect that mdev is being used to
> bypass the hypervisor, or maybe fill in the gaps for hardware that
> isn't as "mediation friendly" as it claims to be.

We do have one such example on the Intel GPU. To support direct cmd
submission from userspace (SVA), the kernel driver allocates a doorbell
page (in system memory) for each application and then registers
the page with the GPU. Once the doorbell is armed, the GPU starts
to monitor CPU writes to that page. Then the application can ring the
GPU by simply writing to the doorbell page to submit cmds. This
possibly makes sense only for integrated devices.

In case direct submission is not allowed in a mediated device
(some auditing work is required in GVT-g), we need to write-protect
the doorbell page with hypervisor help to mimic the hardware
behavior. We have prototype work internally, but haven't sent it out.

> 
> In the case of a physical device discovering an iova translation, this
> is what device iotlbs are for, but as an acceleration and offload
> mechanism for the system iommu rather than a lookup mechanism as
> seems
> to be wanted here.  If we had a system iommu with dirty page tracking,
> I believe that tracking would live in the iommu page tables and
> therefore reflect dirty pages relative to iova.  We'd need to consume
> those dirty page bits before we tear down the iova mappings, much like
> we're suggesting QEMU do here.

Yes. There are two cases:

1) iova shadowing. Say using only the 2nd level as today. Here the dirty
bits are associated with the iova. When Qemu is revised to invoke
log_sync before tearing down any iova mapping, vfio can get the dirty
info from the iommu driver for the affected range (see the rough sketch
after case 2).

2) iova nesting, where iova->gpa is in the 1st level and gpa->hpa is in
the 2nd level. In that case the iova carried in the map/unmap ioctl is
actually a gpa, thus the dirty bits are associated with the gpa. In such
a case, Qemu should continue to consume the gpa-based dirty bitmap, as
if the viommu were disabled.
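
For case 1), a very rough sketch of the ordering Qemu would need on an
iova unmap is below. All helper names are made up for illustration (they
are not existing Qemu or vfio interfaces); the point is only that the
dirty bits must be synced and translated while the iova->gpa mapping is
still known, before the mapping is torn down.

/* Illustrative only: hypothetical helpers, not existing Qemu/vfio API. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t hwaddr;
#define PAGE_SIZE 4096UL    /* assume 4K pages for the sketch */

/* hypothetical: ask vfio for dirty iova pfns in [iova, iova + size) */
extern int vfio_log_sync_range(hwaddr iova, hwaddr size, unsigned long *bitmap);
/* hypothetical: look up iova->gpa in Qemu's cached viommu mappings */
extern bool viommu_cached_translate(hwaddr iova, hwaddr *gpa);
/* hypothetical: set the gpa-based dirty bitmap consumed by migration */
extern void migration_mark_gpa_dirty(hwaddr gpa, hwaddr size);
/* hypothetical: wrapper around VFIO_IOMMU_UNMAP_DMA */
extern int vfio_dma_unmap(hwaddr iova, hwaddr size);

static bool pfn_dirty(const unsigned long *bitmap, unsigned long pfn)
{
    unsigned long bits = 8 * sizeof(unsigned long);

    return bitmap[pfn / bits] & (1UL << (pfn % bits));
}

static int unmap_with_dirty_sync(hwaddr iova, hwaddr size,
                                 unsigned long *bitmap)
{
    hwaddr off, gpa;

    /* 1) collect dirty bits for the affected iova range */
    if (vfio_log_sync_range(iova, size, bitmap)) {
        return -1;
    }

    /* 2) translate dirty iova pages to gpa while the mapping is valid */
    for (off = 0; off < size; off += PAGE_SIZE) {
        if (!pfn_dirty(bitmap, off / PAGE_SIZE)) {
            continue;
        }
        if (viommu_cached_translate(iova + off, &gpa)) {
            migration_mark_gpa_dirty(gpa, PAGE_SIZE);
        }
    }

    /* 3) only now tear down the iova mapping */
    return vfio_dma_unmap(iova, size);
}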

> 
> Unfortunately I also think that KVM and vhost are not really the best
> examples of what we need to do for vfio.  KVM is intimately involved
> with GPAs, so clearly dirty page tracking at that level is not an
> issue.  Vhost tends to circumvent the viommu; it's trying to poke
> directly into guest memory without the help of a physical iommu.  So
> I can't say that I have much faith that QEMU is already properly wired
> with respect to viommu and dirty page tracking, leaving open the
> possibility that a log_sync on iommu region unmap is simply a gap in
> the QEMU migration story.  The vfio migration interface we have on the
> table seems like it could work, but QEMU needs an update and we need to
> define the interface in terms of pfns relative to the address space.

Yan and I had a brief discussion on this. Besides the basic change of
doing log_sync for every iova unmap, there are two other gaps to
be fixed:

1) Today the iova->gpa mapping is maintained in two places: the viommu
page table in guest memory and the viotlb in Qemu. The viotlb is filled
when a walk on the viommu page table happens, due to emulation of a
virtual DMA operation from emulated devices or a request from vhost
devices. It's not affected by passthrough device activities though,
since those go through the physical iommu. Per the iommu spec, the
guest iommu driver first clears the viommu page table, followed by a
viotlb invalidation request. It's the latter that is trapped by Qemu,
and vfio is notified at that point, where iova->gpa translation will
simply fail since there is no valid mapping left in the viommu page
table and very likely no hit in the viotlb. To fix this gap, we need to
extend Qemu to cache all the valid iova mappings in the viommu page
table, similar to how vfio does (a rough sketch of such a cache follows
after 2).

2) Then there will be parallel log_sync requests on each vfio device.
One is from the vcpu thread, when an iotlb invalidation request is being
emulated. The other is from the migration thread, where log_sync is
requested for the entire guest memory in iterative copies. The
contention among multiple vCPU threads is already protected through the
iommu lock, but we didn't find such protection between the migration
thread and the vcpu threads. Maybe we overlooked something, but ideally
the whole iova address space should be locked while the migration thread
is doing its mega-sync/translation.
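
For 1), a minimal sketch of the iova->gpa cache Qemu could maintain is
below, assuming it is updated from the existing viommu map/unmap
notifications. The names are invented for illustration; this is not
existing Qemu code.

/* Illustrative only: a shadow of the valid iova->gpa mappings kept in
 * Qemu, so translation is still possible at viotlb invalidation time. */
#include <glib.h>
#include <stdint.h>

typedef struct IovaMapping {
    uint64_t gpa;
    uint64_t size;
} IovaMapping;

static GHashTable *iova_cache;    /* iova -> IovaMapping */

static void iova_cache_init(void)
{
    iova_cache = g_hash_table_new_full(g_int64_hash, g_int64_equal,
                                       g_free, g_free);
}

/* called when the viommu establishes an iova->gpa mapping */
static void iova_cache_map(uint64_t iova, uint64_t gpa, uint64_t size)
{
    int64_t *key = g_new(int64_t, 1);
    IovaMapping *m = g_new(IovaMapping, 1);

    *key = iova;
    m->gpa = gpa;
    m->size = size;
    g_hash_table_replace(iova_cache, key, m);
}

/* called at log_sync/unmap time, before the mapping is dropped */
static const IovaMapping *iova_cache_lookup(uint64_t iova)
{
    int64_t key = iova;

    return g_hash_table_lookup(iova_cache, &key);
}

/* called after the dirty bits for this iova have been consumed */
static void iova_cache_unmap(uint64_t iova)
{
    int64_t key = iova;

    g_hash_table_remove(iova_cache, &key);
}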

+Yi and Peter for their opinions.

> 
> If GPAs are still needed, what are they for?  The write-protect example
> is clearly a hypervisor level interaction as I assume it's write
> protection relative to the vCPU.  It's a hypervisor specific interface
> to perform that write-protection, so why wouldn't we use a hypervisor
> specific interface to collect the data to perform that operation?
> IOW, if GVT-g already has a KVM dependency, why be concerned about
> adding another GVT-g KVM dependency?  It seems like vfio is just a

This is possibly the way that we have to go, based on the discussions
so far. Earlier I just held the same argument as you emphasized
for vfio - although there are existing KVM dependencies, we want to
minimize them. :-) Another worry is what if other vendor drivers have
similar requirements; could we then invent some generic way and avoid
pushing each of them to do the same tricky thing again? Of course,
we may revisit this later if the issue does become a common
requirement.

> potentially convenient channel, but as discussed above, vfio has no
> business in GPAs because devices don't operate on GPAs and I've not
> been sold that there's value in vfio getting involved in that address
> space.  Convince me otherwise ;)  Thanks,
> 

It looks like none of my arguments is convincing to you :-), so we'll
move on to investigating what should be changed in qemu to support your
proposal (as discussed above). While that part is on-going, let me
have a last try on my original idea. ;) I'm just curious what your
further thoughts are regarding the earlier doorbell monitoring
example, where the device operates on GPA. If it's an Intel-GPU-only
thing, yes, we can still fix it in GVT-g itself as you suggested. But
I'm just not sure about other integrated devices, and also new
accelerators with a coherent bus connected to the cpu package. Also we
don't need to call it GPA - it could be named user_target_address, i.e.
the address that the iova is mapped to, and the address space in which
userspace expects the device to operate for purposes (logging,
monitoring, etc.) other than dma (using iova) and accessing
userspace/guest memory (hva).

Thanks
Kevin
Alex Williamson Sept. 26, 2019, 9:33 p.m. UTC | #18
On Thu, 26 Sep 2019 03:07:08 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, September 26, 2019 3:06 AM  
> [...]
> > > > > The second point is about write-protection:
> > > > >  
> > > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > > pages when interpreting certain workload descriptors. Those pages
> > > > > > are recorded in IOVA when vIOMMU is enabled, however the KVM
> > > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it  
> > working  
> > > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > > There are two options. One is having KVM export a new API for such
> > > > > > translation purpose. But as you explained earlier it's not good to
> > > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > > maintaining such knowledge through extended MAP interface,
> > > > > > then providing a uniform API for all vendor drivers to use.  
> > > >
> > > > So the argument is that in order to interact with KVM (write protecting
> > > > guest memory) there's a missing feature (IOVA to GPA translation), but
> > > > we don't want to add an API to KVM for this feature because that would
> > > > create a dependency on KVM (for interacting with KVM), so lets add an
> > > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > > Thanks,
> > > >  
> > >
> > > Then do you have a recommendation how such feature can be
> > > implemented cleanly in vendor driver, without introducing direct
> > > dependency on KVM?  
> > 
> > I think the disconnect is that these sorts of extensions don't reflect
> > things that a physical device can actually do.  The idea of vfio is
> > that it's a userspace driver interface.  It provides a channel for the
> > user to interact with the device, map device resources, receive
> > interrupts, map system memory through the iommu, etc.  Mediated
> > devices
> > augment this by replacing the physical device the user accesses with a
> > software virtualized device.  So then the question becomes why this
> > device virtualizing software, ie. the mdev vendor driver, needs to do
> > things that a physical device clearly cannot do.  For example, how can
> > a physical device write-protect portions of system memory?  Or even,
> > why would it need to?  It makes me suspect that mdev is being used to
> > bypass the hypervisor, or maybe fill in the gaps for hardware that
> > isn't as "mediation friendly" as it claims to be.  
> 
> We do have one such example on Intel GPU. To support direct cmd
> submission from userspace (SVA), kernel driver allocates a doorbell
> page (in system memory) for each application and then registers
> the page to the GPU. Once the doorbell is armed, the GPU starts
> to monitor CPU writes to that page. Then the application can ring the 
> GPU by simply writing to the doorbell page to submit cmds. This
> possibly makes sense only for integrated devices.
> 
> In case that direction submission is not allowed in mediated device
> (some auditing work is required in GVT-g), we need to write-protect 
> the doorbell page with hypervisor help to mimic the hardware 
> behavior. We have prototype work internally, but hasn't sent it out.

What would it look like for QEMU to orchestrate this?  Maybe the mdev
device could expose a doorbell page as a device specific region.
Possibly a quirk in QEMU vfio-pci could detect the guest driver
registering a doorbell (or vendor driver could tell QEMU via a device
specific interrupt) and setup a MemoryRegion overlay of the doorbell
page in the guest.  If the physical GPU has the resources, maybe the
doorbell page is real, otherwise it could be emulated in the vendor
driver.  Trying to do this without QEMU seems to be where we're running
into trouble and is what I'd classify as bypassing the hypervisor.
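
Very roughly, and only as a sketch: assuming the vendor driver exposes
the doorbell page as an extra mmap'able device-specific region, and
assuming QEMU has somehow learned the guest-registered doorbell GPA
(both are assumptions here, not existing interfaces), the overlay could
look something like the following.

/* Illustrative sketch only; the region offset and the way the GPA is
 * learned are assumptions, not existing vfio or QEMU interfaces. */
#include "qemu/osdep.h"
#include "exec/memory.h"
#include "hw/vfio/vfio-common.h"

static void vfio_setup_doorbell_overlay(VFIODevice *vbasedev,
                                        MemoryRegion *sysmem,
                                        hwaddr doorbell_gpa,
                                        off_t doorbell_region_offset)
{
    MemoryRegion *mr = g_new0(MemoryRegion, 1);
    size_t pgsize = getpagesize();
    void *ptr;

    /* mmap the device-specific doorbell region from the vfio device fd */
    ptr = mmap(NULL, pgsize, PROT_READ | PROT_WRITE, MAP_SHARED,
               vbasedev->fd, doorbell_region_offset);
    if (ptr == MAP_FAILED) {
        return;
    }

    /* expose it to the guest as RAM-like device memory... */
    memory_region_init_ram_device_ptr(mr, OBJECT(vbasedev->dev),
                                      "vfio-doorbell", pgsize, ptr);
    /* ...overlaid on the guest page the driver registered as doorbell */
    memory_region_add_subregion_overlap(sysmem, doorbell_gpa, mr, 1);
}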

> > In the case of a physical device discovering an iova translation, this
> > is what device iotlbs are for, but as an acceleration and offload
> > mechanism for the system iommu rather than a lookup mechanism as
> > seems
> > to be wanted here.  If we had a system iommu with dirty page tracking,
> > I believe that tracking would live in the iommu page tables and
> > therefore reflect dirty pages relative to iova.  We'd need to consume
> > those dirty page bits before we tear down the iova mappings, much like
> > we're suggesting QEMU do here.  
> 
> Yes. There are two cases:
> 
> 1) iova shadowing. Say using only 2nd level as today. Here the dirty 
> bits are associated to iova. When Qemu is revised to invoke log_sync 
> before tearing down any iova mapping, vfio can get the dirty info 
> from iommu driver for affected range.

Maybe we need two mechanisms, log_sync for the "occasional" polling of
dirty bits and an UNMAP_DMA ioctl extension that allows the user to
provide a buffer into which the unmap ioctl would set dirty bits.
Userspace could potentially chunk MAP/UNMAP_DMA calls in order to bound
the size of the bitmap buffer (modulo the difficulties of assuming
physical page alignment).  The QEMU vfio-pci driver would then perform
the translation and mark the GPA pages dirty.  Seems that might relieve
your efficiency concerns.  The unpin path and unpin notifier would need
to relay dirty info for mdev support, where a generic implementation
might simply assume everything that has been or is currently pinned by
mdev is dirty.
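
To make that concrete, something along these lines is what I have in
mind; this is purely a sketch with invented field and flag names, not a
proposed uapi:

/* Purely illustrative; names are invented for this sketch. */
#include <linux/types.h>

struct vfio_iommu_type1_dma_unmap_dirty {
	__u32	argsz;
	__u32	flags;
#define VFIO_DMA_UNMAP_FLAG_GET_DIRTY	(1 << 0)	/* hypothetical */
	__u64	iova;			/* in: iova of the range to unmap */
	__u64	size;			/* in: size of the range to unmap */
	__u64	dirty_pgsize;		/* in: page size the bitmap is based on */
	__u64	dirty_bitmap_size;	/* in: size of the user buffer in bytes */
	__u64	dirty_bitmap;		/* in: user pointer; on return one bit per
					 *     page, set if dirtied while mapped */
};

The user could issue the unmap with the flag set and a large enough
buffer; on return it would walk the bitmap, translate each dirty iova
page to a GPA through its own (still cached) vIOMMU mapping, and mark
the corresponding GPA dirty for migration. For mdev, as noted, the
unpin path could conservatively report anything that has been pinned as
dirty.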

Now I'm wondering if we should be consolidating vfio dirty page
tracking per container rather than per device if we do something like
this...

> 2) iova nesting, where iova->gpa is in 1st level and gpa->hpa is in
> 2nd level. In that case the iova carried in the map/unmap ioctl is
> actually gpa, thus the dirty bits are associated to gpa. In such case,
> Qemu should continue to consume gpa-based dirty bitmap, as if
> viommu is disabled.

Seems again that it's QEMU that really knows which AddressSpace the
kernel vfio is operating in and how to apply it to a dirty bitmap.

> > Unfortunately I also think that KVM and vhost are not really the best
> > examples of what we need to do for vfio.  KVM is intimately involved
> > with GPAs, so clearly dirty page tracking at that level is not an
> > issue.  Vhost tends to circumvent the viommu; it's trying to poke
> > directly into guest memory without the help of a physical iommu.  So
> > I can't say that I have much faith that QEMU is already properly wired
> > with respect to viommu and dirty page tracking, leaving open the
> > possibility that a log_sync on iommu region unmap is simply a gap in
> > the QEMU migration story.  The vfio migration interface we have on the
> > table seems like it could work, but QEMU needs an update and we need to
> > define the interface in terms of pfns relative to the address space.  
> 
> Yan and I did a brief discussion on this. Besides the basic change of
> doing log_sync for every iova unmap, there are two others gaps to
> be fixed:
> 
> 1) Today the iova->gpa mapping is maintained in two places: viommu 
> page table in guest memory and viotlb in Qemu. viotlb is filled when 
> a walk on the viommu page table happens, due to emulation of a virtual
> DMA operation from emulated devices or request from vhost devices. 
> It's not affected by passthrough device activities though, since the latter 
> goes through physical iommu. Per iommu spec, guest iommu driver 
> first clears the viommu page table, followed by viotlb invalidation 
> request. It's the latter being trapped by Qemu, then vfio is notified 
> at that point, where iova->gpa translation will simply fail since no 
> valid mapping in viommu page table and very likely no hit in viotlb. 
> To fix this gap, we need to extend Qemu to cache all the valid iova
> mappings in the viommu page table, similar to what vfio does.
> 
> 2) Then there will be parallel log_sync requests on each vfio device. 
> One is from the vcpu thread, when iotlb invalidation request is being 
> emulated. The other is from the migration thread, where log_sync is 
> requested for the entire guest memory in iterative copies. The 
> contention among multiple vCPU threads is already protected through 
> iommu lock, but we didn't find such protection between the migration
> thread and vcpu threads. Maybe we overlooked something, but ideally the
> whole iova address space should be locked when the migration thread 
> is doing mega-sync/translation.
> 
> +Yi and Peter for their opinions.

Good points, not sure I have anything to add to that.  As above, we can
think about whether the device or the container is the right place to
do dirty page tracking.

> > If GPAs are still needed, what are they for?  The write-protect example
> > is clearly a hypervisor level interaction as I assume it's write
> > protection relative to the vCPU.  It's a hypervisor specific interface
> > to perform that write-protection, so why wouldn't we use a hypervisor
> > specific interface to collect the data to perform that operation?
> > IOW, if GVT-g already has a KVM dependency, why be concerned about
> > adding another GVT-g KVM dependency?  It seems like vfio is just a  
> 
> This is possibly the way that we have to go, based on discussions
> so far. Earlier I just held the same argument as you emphasized
> for vfio - although there are existing KVM dependencies, we want to
> minimize them. :-) Another worry is whether other vendor drivers may
> have similar requirements; if so, can we invent some generic way to
> avoid pushing them to do the same tricky thing again? Of course,
> we may revisit this later if the issue does become a common
> requirement.

I know we don't see quite the same on the degree to which vfio has KVM
dependencies, but it's possible, and regularly done, to use vfio without
KVM.  The same cannot be said of GVT-g.  Therefore it's not very
motivating for me to entertain a proposal for a new vfio interface
whose main purpose is to fill in a gap in the already existing GVT-g
dependency on KVM (and in no way reduce that existing KVM dependency).
I maintain vfio's lack of dependencies on KVM either way ;)
 
> > potentially convenient channel, but as discussed above, vfio has no
> > business in GPAs because devices don't operate on GPAs and I've not
> > been sold that there's value in vfio getting involved in that
> > address space.  Convince me otherwise ;)  Thanks,
> >   
> 
> It looks like none of my arguments is convincing to you :-), so we'll move
> on to investigate what should be changed in qemu to support your
> proposal (as discussed above). While this part is ongoing, let me
> have a last try on my original idea. ;) I'm just curious what your
> further thoughts are regarding the earlier doorbell monitoring
> example of operating on GPA in a device. If it's an Intel GPU only
> thing, yes we can still fix it in GVT-g itself as you suggested. But
> I'm just not sure about other integrated devices, and also new
> accelerators with a coherent bus connected to the cpu package. Also we
> don't need to call it GPA - it could be named user_target_address:
> the address that the iova is mapped to, and the address space in which
> userspace expects the device to operate for purposes (logging,
> monitoring, etc.) other than dma (using iova) and accessing
> userspace/guest memory (hva).

My impression of the doorbell example is that we're taking the wrong
approach to solving how that interaction should work.  To setup an
interaction between a device and a vCPU, we should be going through
QEMU.  This is the model we've used from the start with vfio.  KVM is
an accelerator, but it's up to QEMU to connect KVM and vfio together.
I'm curious if there's any merit to my proposal above about how this
could be redesigned.  Thanks,

Alex
Tian, Kevin Oct. 24, 2019, 11:41 a.m. UTC | #19
Sorry for the late reply...

> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, September 27, 2019 5:33 AM
> 
> On Thu, 26 Sep 2019 03:07:08 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > > Sent: Thursday, September 26, 2019 3:06 AM
> > [...]
> > > > > > The second point is about write-protection:
> > > > > >
> > > > > > > There is another value of recording GPA in VFIO. Vendor drivers
> > > > > > > (e.g. GVT-g) may need to selectively write-protect guest memory
> > > > > > > pages when interpreting certain workload descriptors. Those
> pages
> > > > > > > are recorded in IOVA when vIOMMU is enabled, however the
> KVM
> > > > > > > write-protection API only knows GPA. So currently vIOMMU must
> > > > > > > be disabled on Intel vGPUs when GVT-g is enabled. To make it
> > > working
> > > > > > > we need a way to translate IOVA into GPA in the vendor drivers.
> > > > > > > There are two options. One is having KVM export a new API for
> such
> > > > > > > translation purpose. But as you explained earlier it's not good to
> > > > > > > have vendor drivers depend on KVM. The other is having VFIO
> > > > > > > maintaining such knowledge through extended MAP interface,
> > > > > > > then providing a uniform API for all vendor drivers to use.
> > > > >
> > > > > So the argument is that in order to interact with KVM (write
> protecting
> > > > > guest memory) there's a missing feature (IOVA to GPA translation),
> but
> > > > > we don't want to add an API to KVM for this feature because that
> would
> > > > > create a dependency on KVM (for interacting with KVM), so lets add
> an
> > > > > API to vfio instead.  That makes no sense to me.  What am I missing?
> > > > > Thanks,
> > > > >
> > > >
> > > > Then do you have a recommendation how such feature can be
> > > > implemented cleanly in vendor driver, without introducing direct
> > > > dependency on KVM?
> > >
> > > I think the disconnect is that these sorts of extensions don't reflect
> > > things that a physical device can actually do.  The idea of vfio is
> > > that it's a userspace driver interface.  It provides a channel for the
> > > user to interact with the device, map device resources, receive
> > > interrupts, map system memory through the iommu, etc.  Mediated
> > > devices
> > > augment this by replacing the physical device the user accesses with a
> > > software virtualized device.  So then the question becomes why this
> > > device virtualizing software, ie. the mdev vendor driver, needs to do
> > > things that a physical device clearly cannot do.  For example, how can
> > > a physical device write-protect portions of system memory?  Or even,
> > > why would it need to?  It makes me suspect that mdev is being used to
> > > bypass the hypervisor, or maybe fill in the gaps for hardware that
> > > isn't as "mediation friendly" as it claims to be.
> >
> > We do have one such example on Intel GPU. To support direct cmd
> > submission from userspace (SVA), kernel driver allocates a doorbell
> > page (in system memory) for each application and then registers
> > the page to the GPU. Once the doorbell is armed, the GPU starts
> > to monitor CPU writes to that page. Then the application can ring the
> > GPU by simply writing to the doorbell page to submit cmds. This
> > possibly makes sense only for integrated devices.
> >
> > In case direct submission is not allowed in a mediated device
> > (some auditing work is required in GVT-g), we need to write-protect
> > the doorbell page with hypervisor help to mimic the hardware
> > behavior. We have prototype work internally, but hasn't sent it out.
> 
> What would it look like for QEMU to orchestrate this?  Maybe the mdev
> device could expose a doorbell page as a device specific region.
> Possibly a quirk in QEMU vfio-pci could detect the guest driver
> registering a doorbell (or vendor driver could tell QEMU via a device
> specific interrupt) and setup a MemoryRegion overlay of the doorbell
> page in the guest.  If the physical GPU has the resources, maybe the
> doorbell page is real, otherwise it could be emulated in the vendor
> driver.  Trying to do this without QEMU seems to be where we're running
> into trouble and is what I'd classify as bypassing the hypervisor.

Hmm, let's think about this approach. We are used to solving all the
problems in the kernel (thus deeply relying on what VFIO affords). It
would be a different story to offload some mediation knowledge
to userspace. This is possibly fine for doorbell virtualization, as its
registration interface is not complex. But it may not work well for
device page table shadowing. It's hard to split this knowledge
between user and kernel, while simply notifying Qemu to request
write-protection is likely slow. But as said, this is a good direction
for further study. :-)
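
For reference, a minimal sketch of what the overlay side of that proposal
could look like in QEMU terms. The trigger (a quirk or a device specific
interrupt) and the helper below are assumptions, not existing code; only the
two memory_region_* calls are real QEMU APIs.

#include "qemu/osdep.h"
#include "exec/memory.h"

/*
 * Hypothetical helper: once QEMU learns that the guest registered a
 * doorbell at guest-physical address 'gpa', overlay the (mmap'ed or
 * emulated) doorbell page on top of guest RAM so vCPU writes land on it.
 */
static void overlay_doorbell(MemoryRegion *sysmem, Object *owner,
                             hwaddr gpa, void *doorbell_host,
                             uint64_t page_size)
{
    MemoryRegion *mr = g_new0(MemoryRegion, 1);

    memory_region_init_ram_device_ptr(mr, owner, "vfio-doorbell",
                                      page_size, doorbell_host);
    /* higher priority than the RAM region it overlaps */
    memory_region_add_subregion_overlap(sysmem, gpa, mr, 1);
}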

> 
> > > In the case of a physical device discovering an iova translation, this
> > > is what device iotlbs are for, but as an acceleration and offload
> > > mechanism for the system iommu rather than a lookup mechanism as
> > > seems
> > > to be wanted here.  If we had a system iommu with dirty page tracking,
> > > I believe that tracking would live in the iommu page tables and
> > > therefore reflect dirty pages relative to iova.  We'd need to consume
> > > those dirty page bits before we tear down the iova mappings, much like
> > > we're suggesting QEMU do here.
> >
> > Yes. There are two cases:
> >
> > 1) iova shadowing. Say using only 2nd level as today. Here the dirty
> > bits are associated to iova. When Qemu is revised to invoke log_sync
> > before tearing down any iova mapping, vfio can get the dirty info
> > from iommu driver for affected range.
> 
> Maybe we need two mechanisms, log_sync for the "occasional" polling of
> dirty bits and an UNMAP_DMA ioctl extension that allows the user to
> provide a buffer into which the unmap ioctl would set dirty bits.
> Userspace could potentially chunk MAP/UNMAP_DMA calls in order to
> bound
> the size of the bitmap buffer (modulo the difficulties of assuming
> physical page alignment).  The QEMU vfio-pci driver would then perform
> the translation and mark the GPA pages dirty.  Seems that might relieve
> your efficiency concerns.  The unpin path and unpin notifier would need

Yes, this approach can address my concern. It carries the dirty info
in existing syscalls and is thus much better than tying it to the
migration region interface.
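
A minimal sketch of the userspace half, under the same assumptions: the
bitmap handed back with the unmap is indexed by iova page, and QEMU
translates each dirty page to GPA before marking it. vfio_iova_to_gpa() and
mark_gpa_dirty() are placeholders for whatever lookup and dirty-memory calls
QEMU actually provides, not existing helpers.

#include <stdint.h>
#include <stdbool.h>

/* Placeholders, not existing QEMU APIs */
extern bool vfio_iova_to_gpa(uint64_t iova, uint64_t *gpa);
extern void mark_gpa_dirty(uint64_t gpa, uint64_t size);

/* Walk an iova-relative dirty bitmap returned by the (hypothetical)
 * unmap extension and mark the corresponding GPA pages dirty. */
static void sync_unmap_dirty(const uint64_t *bitmap, uint64_t iova,
                             uint64_t npages, uint64_t pgsize)
{
    for (uint64_t i = 0; i < npages; i++) {
        if (bitmap[i / 64] & (1ULL << (i % 64))) {
            uint64_t gpa;

            if (vfio_iova_to_gpa(iova + i * pgsize, &gpa))
                mark_gpa_dirty(gpa, pgsize);
        }
    }
}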

> to relay dirty info for mdev support, where a generic implementation
> might simply assume everything that has been or is currently pinned by
> mdev is dirty.
> 
> Now I'm wondering if we should be consolidating vfio dirty page
> tracking per container rather than per device if we do something like
> this...

I sort of agree. UNMAP_DMA is per container by definition. If
we continue to keep the log_sync path per device (through the migration
region), it looks inconsistent. Possibly we should keep the
migration region solely for poking the device state, while doing
dirty page tracking through the container interface.
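
To make that split concrete, a container-level query along the lines below
would carry the "occasional" log_sync-style polling, leaving the per-device
migration region only for device state and device data. This is a
hypothetical sketch; neither the structure nor the ioctl name exists today.

#include <linux/types.h>

/* Hypothetical container-level dirty bitmap query for the log_sync path */
struct vfio_iommu_dirty_bitmap {
	__u32	argsz;
	__u32	flags;
	__u64	iova;	/* start of the range to query */
	__u64	size;	/* length of the range in bytes */
	__u64	pgsize;	/* page size each bitmap bit represents */
	__u64	bitmap;	/* user pointer filled by the kernel */
};

/* e.g. ioctl(container_fd, VFIO_IOMMU_GET_DIRTY_BITMAP, &db);  -- hypothetical */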

> 
> > 2) iova nesting, where iova->gpa is in 1st level and gpa->hpa is in
> > 2nd level. In that case the iova carried in the map/unmap ioctl is
> > actually gpa, thus the dirty bits are associated to gpa. In such case,
> > Qemu should continue to consume gpa-based dirty bitmap, as if
> > viommu is disabled.
> 
> Seems again that it's QEMU that really knows which AddressSpace
> the kernel vfio is operating in and how to apply it to a dirty bitmap.
> 
> > > Unfortunately I also think that KVM and vhost are not really the best
> > > examples of what we need to do for vfio.  KVM is intimately involved
> > > with GPAs, so clearly dirty page tracking at that level is not an
> > > issue.  Vhost tends to circumvent the viommu; it's trying to poke
> > > directly into guest memory without the help of a physical iommu.  So
> > > I can't say that I have much faith that QEMU is already properly wired
> > > with respect to viommu and dirty page tracking, leaving open the
> > > possibility that a log_sync on iommu region unmap is simply a gap in
> > > the QEMU migration story.  The vfio migration interface we have on the
> > > table seems like it could work, but QEMU needs an update and we need
> to
> > > define the interface in terms of pfns relative to the address space.
> >
> > Yan and I had a brief discussion on this. Besides the basic change of
> > doing log_sync for every iova unmap, there are two other gaps to
> > be fixed:
> >
> > 1) Today the iova->gpa mapping is maintained in two places: viommu
> > page table in guest memory and viotlb in Qemu. viotlb is filled when
> > a walk on the viommu page table happens, due to emulation of a virtual
> > DMA operation from emulated devices or request from vhost devices.
> > It's not affected by passthrough device activities though, since the latter
> > goes through physical iommu. Per iommu spec, guest iommu driver
> > first clears the viommu page table, followed by viotlb invalidation
> > request. It's the latter being trapped by Qemu, then vfio is notified
> > at that point, where iova->gpa translation will simply fail since no
> > valid mapping in viommu page table and very likely no hit in viotlb.
> > To fix this gap, we need to extend Qemu to cache all the valid iova
> > mappings in the viommu page table, similar to what vfio does.
> >
> > 2) Then there will be parallel log_sync requests on each vfio device.
> > One is from the vcpu thread, when iotlb invalidation request is being
> > emulated. The other is from the migration thread, where log_sync is
> > requested for the entire guest memory in iterative copies. The
> > contention among multiple vCPU threads is already protected through
> > iommu lock, but we didn't find such protection between the migration
> > thread and vcpu threads. Maybe we overlooked something, but ideally the
> > whole iova address space should be locked when the migration thread
> > is doing mega-sync/translation.
> >
> > +Yi and Peter for their opinions.
> 
> Good points, not sure I have anything to add to that.  As above, we can
> think about whether the device or the container is the right place to
> do dirty page tracking.

Sure. And the above two points are orthogonal to whether this is done
at the container or device level. We haven't had time to continue the
investigation since the last post; we will resume after KVM Forum.

> 
> > > If GPAs are still needed, what are they for?  The write-protect example
> > > is clearly a hypervisor level interaction as I assume it's write
> > > protection relative to the vCPU.  It's a hypervisor specific interface
> > > to perform that write-protection, so why wouldn't we use a hypervisor
> > > specific interface to collect the data to perform that operation?
> > > IOW, if GVT-g already has a KVM dependency, why be concerned about
> > > adding another GVT-g KVM dependency?  It seems like vfio is just a
> >
> > This is possibly the way that we have to go, based on discussions
> > so far. Earlier I just held the same argument as you emphasized
> > for vfio - although there are existing KVM dependencies, we want to
> > minimize them. :-) Another worry is whether other vendor drivers may
> > have similar requirements; if so, can we invent some generic way to
> > avoid pushing them to do the same tricky thing again? Of course,
> > we may revisit this later if the issue does become a common
> > requirement.
> 
> I know we don't see quite the same on the degree to which vfio has KVM
> dependencies, but it's possible, and regularly done, to use vfio without
> KVM.  The same cannot be said of GVT-g.  Therefore it's not very
> motivating for me to entertain a proposal for a new vfio interface
> whose main purpose is to fill in a gap in the already existing GVT-g
> dependency on KVM (and in no way reduce that existing KVM dependency).
> I maintain vfio's lack of dependencies on KVM either way ;)

Understand.

> 
> > > potentially convenient channel, but as discussed above, vfio has no
> > > business in GPAs because devices don't operate on GPAs and I've not
> > > been sold that there's value in vfio getting involved in that
> > > address space.  Convince me otherwise ;)  Thanks,
> > >
> >
> > It looks like none of my arguments is convincing to you :-), so we'll move
> > on to investigate what should be changed in qemu to support your
> > proposal (as discussed above). While this part is ongoing, let me
> > have a last try on my original idea. ;) I'm just curious what your
> > further thoughts are regarding the earlier doorbell monitoring
> > example of operating on GPA in a device. If it's an Intel GPU only
> > thing, yes we can still fix it in GVT-g itself as you suggested. But
> > I'm just not sure about other integrated devices, and also new
> > accelerators with a coherent bus connected to the cpu package. Also we
> > don't need to call it GPA - it could be named user_target_address:
> > the address that the iova is mapped to, and the address space in which
> > userspace expects the device to operate for purposes (logging,
> > monitoring, etc.) other than dma (using iova) and accessing
> > userspace/guest memory (hva).
> 
> My impression of the doorbell example is that we're taking the wrong
> approach to solving how that interaction should work.  To setup an
> interaction between a device and a vCPU, we should be going through
> QEMU.  This is the model we've used from the start with vfio.  KVM is
> an accelerator, but it's up to QEMU to connect KVM and vfio together.
> I'm curious if there's any merit to my proposal above about how this
> could be redesigned.  Thanks,
> 

There are merits for sure. We will explore that direction and see how
it works. Even if some extension is still required between QEMU and KVM
(in case of functionality or efficiency concerns), that might be a better
fit than extending VFIO.

Thanks
Kevin
diff mbox series

Patch

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 24f505199f83..4bc0236b0898 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -372,6 +372,154 @@  struct vfio_region_gfx_edid {
  */
 #define VFIO_REGION_SUBTYPE_IBM_NVLINK2_ATSD	(1)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION	        (3)
+#define VFIO_REGION_SUBTYPE_MIGRATION	        (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate to the vendor driver the state the VFIO device should be
+ *      transitioned to. If the state transition fails, a write to this field
+ *      returns an error. It consists of 3 bits:
+ *      - If bit 0 is set, it indicates the _RUNNING state; when cleared, it
+ *        indicates the _STOPPED state. When the device is changed to _STOPPED,
+ *        the driver should stop the device before write() returns.
+ *      - If bit 1 is set, it indicates the _SAVING state.
+ *      - If bit 2 is set, it indicates the _RESUMING state.
+ *      Bits 3 - 31 are reserved for future use. The user should perform a
+ *      read-modify-write operation on this field.
+ *      Setting the _SAVING and _RESUMING bits at the same time is invalid.
+ *
+ * pending_bytes: (read only)
+ *      Number of pending bytes yet to be migrated from the vendor driver.
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset from the migration region; this
+ *      is the offset at which device data should be read during the _SAVING
+ *      state or written during the _RESUMING state, and at which the dirty pages
+ *      bitmap is read. See below for details of the sequence to be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get the size of data copied into
+ *      the migration region during the _SAVING state, and write the size of data
+ *      copied into the migration region during the _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *      Start pfn from which to get the bitmap of dirty pages from the vendor
+ *      driver during the _SAVING state.
+ *
+ * page_size: (write only)
+ *      User application should write the page size corresponding to the pfns.
+ *
+ * total_pfns: (write only)
+ *      Total pfn count from start_pfn for which dirty bitmap is requested.
+ *
+ * copied_pfns: (read only)
+ *      pfn count for which dirty bitmap is copied to migration region.
+ *      Vendor driver should copy the bitmap with bits set only for pages to be
+ *      marked dirty in migration region.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if none of the
+ *        pages are dirty in requested range or rest of the range.
+ *      - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *        pages dirty in the given range or rest of the range.
+ *      - Vendor driver should return pfn count for which bitmap is written in
+ *        the region.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^                             ^
+ *  offset 0-trapped part        data_offset                 data_size
+ *
+ * The data section always follows the vfio_device_migration_info structure
+ * in the region, so data_offset will always be non-0. The offset from which
+ * data is copied is decided by the kernel driver; the data section can be
+ * trapped, mapped, or partitioned, depending on how the kernel driver defines
+ * it. A data section partition can be defined as mapped via the sparse mmap
+ * capability. If mmapped, data_offset should be page aligned, whereas the
+ * initial section which contains the vfio_device_migration_info structure
+ * might not end at a page aligned offset.
+ * data_offset can be the same or different for device data and the dirty
+ * pages bitmap.
+ * The vendor driver should decide whether to partition the data section and
+ * how to partition it, and should return data_offset accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through the steps below.
+ * b. read data_offset; this tells the kernel driver to write data to the staging buffer.
+ * c. read data_size, the amount of data in bytes written by the vendor driver
+ *    in the migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e.
+ *
+ * To copy system memory content during migration, vendor driver should be able
+ * to report system memory pages which are dirtied by that driver. For such
+ * dirty page reporting, user application should query for a range of GFNs
+ * relative to device address space (IOVA), then vendor driver should provide
+ * the bitmap of pages from this range which are dirtied by it through
+ * migration region where each bit represents a page and bit set to 1 represents
+ * that the page is dirty.
+ * User space application should take care of copying content of system memory
+ * for those pages.
+ *
+ * Steps to get dirty page bitmap:
+ * a. write start_pfn, page_size and total_pfns.
+ * b. read copied_pfns. The vendor driver should take one of the actions below:
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_NONE if driver
+ *       doesn't have any page to report dirty in given range or rest of the
+ *       range. Exit the loop.
+ *     - Vendor driver should return VFIO_DEVICE_DIRTY_PFNS_ALL to mark all
+ *       pages dirty for the given range or rest of the range. The user space
+ *       application marks all pages in the range as dirty and exits the loop.
+ *     - Vendor driver should return copied_pfns and provide bitmap for
+ *       copied_pfns in the migration region.
+ * c. read data_offset, where vendor driver has written bitmap.
+ * d. read bitmap from the migration region from data_offset.
+ * e. Iterate through steps a to d while (total copied_pfns < total_pfns)
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat the steps below:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size, which indicates to the vendor driver that data has been
+ *    written to the staging buffer.
+ *
+ * For the user application, the data is opaque. The user should write data in
+ * the same order as received.
+ */
+
+struct vfio_device_migration_info {
+        __u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+                                     VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+#define VFIO_DEVICE_STATE_INVALID   (VFIO_DEVICE_STATE_SAVING | \
+                                     VFIO_DEVICE_STATE_RESUMING)
+        __u32 reserved;
+        __u64 pending_bytes;
+        __u64 data_offset;
+        __u64 data_size;
+        __u64 start_pfn;
+        __u64 page_size;
+        __u64 total_pfns;
+        __u64 copied_pfns;
+#define VFIO_DEVICE_DIRTY_PFNS_NONE     (0)
+#define VFIO_DEVICE_DIRTY_PFNS_ALL      (~0ULL)
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within