[v9,Kernel,1/5] vfio: KABI for migration interface for device state
diff mbox series

Message ID 1573578220-7530-2-git-send-email-kwankhede@nvidia.com
State New
Headers show
Series
  • Add KABIs to support migration for VFIO devices
Related show

Commit Message

Kirti Wankhede Nov. 12, 2019, 5:03 p.m. UTC
- Defined MIGRATION region type and sub-type.
- Used 3 bits to define VFIO device states.
    Bit 0 => _RUNNING
    Bit 1 => _SAVING
    Bit 2 => _RESUMING
    Combination of these bits defines VFIO device's state during migration
    _RUNNING => Normal VFIO device running state. When its reset, it
		indicates _STOPPED state. when device is changed to
		_STOPPED, driver should stop device before write()
		returns.
    _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
                          start saving state of device i.e. pre-copy state
    _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
                save device state,i.e. stop-n-copy state
    _RESUMING => VFIO device resuming state.
    _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
    Bits 3 - 31 are reserved for future use. User should perform
    read-modify-write operation on this field.
- Defined vfio_device_migration_info structure which will be placed at 0th
  offset of migration region to get/set VFIO device related information.
  Defined members of structure and usage on read/write access:
    * device_state: (read/write)
        To convey VFIO device state to be transitioned to. Only 3 bits are
	used as of now, Bits 3 - 31 are reserved for future use.
    * pending bytes: (read only)
        To get pending bytes yet to be migrated for VFIO device.
    * data_offset: (read only)
        To get data offset in migration region from where data exist
	during _SAVING and from where data should be written by user space
	application during _RESUMING state.
    * data_size: (read/write)
        To get and set size in bytes of data copied in migration region
	during _SAVING and _RESUMING state.

Migration region looks like:
 ------------------------------------------------------------------
|vfio_device_migration_info|    data section                      |
|                          |     ///////////////////////////////  |
 ------------------------------------------------------------------
 ^                              ^
 offset 0-trapped part        data_offset

Structure vfio_device_migration_info is always followed by data section
in the region, so data_offset will always be non-0. Offset from where data
to be copied is decided by kernel driver, data section can be trapped or
mapped depending on how kernel driver defines data section.
Data section partition can be defined as mapped by sparse mmap capability.
If mmapped, then data_offset should be page aligned, where as initial
section which contain vfio_device_migration_info structure might not end
at offset which is page aligned.
Vendor driver should decide whether to partition data section and how to
partition the data section. Vendor driver should return data_offset
accordingly.

For user application, data is opaque. User should write data in the same
order as received.

Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
Reviewed-by: Neo Jia <cjia@nvidia.com>
---
 include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 108 insertions(+)

Comments

Alex Williamson Nov. 12, 2019, 10:30 p.m. UTC | #1
On Tue, 12 Nov 2019 22:33:36 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> - Defined MIGRATION region type and sub-type.
> - Used 3 bits to define VFIO device states.
>     Bit 0 => _RUNNING
>     Bit 1 => _SAVING
>     Bit 2 => _RESUMING
>     Combination of these bits defines VFIO device's state during migration
>     _RUNNING => Normal VFIO device running state. When its reset, it
> 		indicates _STOPPED state. when device is changed to
> 		_STOPPED, driver should stop device before write()
> 		returns.
>     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>                           start saving state of device i.e. pre-copy state
>     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and

s/should/must/

>                 save device state,i.e. stop-n-copy state
>     _RESUMING => VFIO device resuming state.
>     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states

A table might be useful here and in the uapi header to indicate valid
states:

| _RESUMING | _SAVING | _RUNNING | Description
+-----------+---------+----------+------------------------------------------
|     0     |    0    |     0    | Stopped, not saving or resuming (a)
+-----------+---------+----------+------------------------------------------
|     0     |    0    |     1    | Running, default state
+-----------+---------+----------+------------------------------------------
|     0     |    1    |     0    | Stopped, migration interface in save mode
+-----------+---------+----------+------------------------------------------
|     0     |    1    |     1    | Running, save mode interface, iterative
+-----------+---------+----------+------------------------------------------
|     1     |    0    |     0    | Stopped, migration resume interface active
+-----------+---------+----------+------------------------------------------
|     1     |    0    |     1    | Invalid (b)
+-----------+---------+----------+------------------------------------------
|     1     |    1    |     0    | Invalid (c)
+-----------+---------+----------+------------------------------------------
|     1     |    1    |     1    | Invalid (d)

I think we need to consider whether we define (a) as generally
available, for instance we might want to use it for diagnostics or a
fatal error condition outside of migration.

Are there hidden assumptions between state transitions here or are
there specific next possible state diagrams that we need to include as
well?

I'm curious if Intel agrees with the states marked invalid with their
push for post-copy support.

>     Bits 3 - 31 are reserved for future use. User should perform
>     read-modify-write operation on this field.
> - Defined vfio_device_migration_info structure which will be placed at 0th
>   offset of migration region to get/set VFIO device related information.
>   Defined members of structure and usage on read/write access:
>     * device_state: (read/write)
>         To convey VFIO device state to be transitioned to. Only 3 bits are
> 	used as of now, Bits 3 - 31 are reserved for future use.
>     * pending bytes: (read only)
>         To get pending bytes yet to be migrated for VFIO device.
>     * data_offset: (read only)
>         To get data offset in migration region from where data exist
> 	during _SAVING and from where data should be written by user space
> 	application during _RESUMING state.
>     * data_size: (read/write)
>         To get and set size in bytes of data copied in migration region
> 	during _SAVING and _RESUMING state.
> 
> Migration region looks like:
>  ------------------------------------------------------------------
> |vfio_device_migration_info|    data section                      |
> |                          |     ///////////////////////////////  |
>  ------------------------------------------------------------------
>  ^                              ^
>  offset 0-trapped part        data_offset
> 
> Structure vfio_device_migration_info is always followed by data section
> in the region, so data_offset will always be non-0. Offset from where data
> to be copied is decided by kernel driver, data section can be trapped or
> mapped depending on how kernel driver defines data section.
> Data section partition can be defined as mapped by sparse mmap capability.
> If mmapped, then data_offset should be page aligned, where as initial
> section which contain vfio_device_migration_info structure might not end
> at offset which is page aligned.
> Vendor driver should decide whether to partition data section and how to
> partition the data section. Vendor driver should return data_offset
> accordingly.
> 
> For user application, data is opaque. User should write data in the same
> order as received.
> 
> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> Reviewed-by: Neo Jia <cjia@nvidia.com>
> ---
>  include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 108 insertions(+)
> 
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index 9e843a147ead..35b09427ad9f 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>  #define VFIO_REGION_TYPE_GFX                    (1)
>  #define VFIO_REGION_TYPE_CCW			(2)
> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>  
> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
>  /* sub-types for VFIO_REGION_TYPE_CCW */
>  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>  
> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> +
> +/*
> + * Structure vfio_device_migration_info is placed at 0th offset of
> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> + * information. Field accesses from this structure are only supported at their
> + * native width and alignment, otherwise the result is undefined and vendor
> + * drivers should return an error.
> + *
> + * device_state: (read/write)
> + *      To indicate vendor driver the state VFIO device should be transitioned
> + *      to. If device state transition fails, write on this field return error.
> + *      It consists of 3 bits:
> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates

Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
ambiguous.

> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> + *        device before write() returns.
> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> + *        should start gathering device state information which will be provided
> + *        to VFIO user space application to save device's state.
> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> + *        prepare to resume device, data provided through migration region
> + *        should be used to resume device.
> + *      Bits 3 - 31 are reserved for future use. User should perform
> + *      read-modify-write operation on this field.
> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> + *
> + * pending bytes: (read only)
> + *      Number of pending bytes yet to be migrated from vendor driver
> + *
> + * data_offset: (read only)
> + *      User application should read data_offset in migration region from where
> + *      user application should read device data during _SAVING state or write
> + *      device data during _RESUMING state. See below for detail of sequence to
> + *      be followed.
> + *
> + * data_size: (read/write)
> + *      User application should read data_size to get size of data copied in
> + *      bytes in migration region during _SAVING state and write size of data
> + *      copied in bytes in migration region during _RESUMING state.
> + *
> + * Migration region looks like:
> + *  ------------------------------------------------------------------
> + * |vfio_device_migration_info|    data section                      |
> + * |                          |     ///////////////////////////////  |
> + * ------------------------------------------------------------------
> + *   ^                              ^
> + *  offset 0-trapped part        data_offset
> + *
> + * Structure vfio_device_migration_info is always followed by data section in
> + * the region, so data_offset will always be non-0. Offset from where data is
> + * copied is decided by kernel driver, data section can be trapped or mapped
> + * or partitioned, depending on how kernel driver defines data section.
> + * Data section partition can be defined as mapped by sparse mmap capability.
> + * If mmapped, then data_offset should be page aligned, where as initial section
> + * which contain vfio_device_migration_info structure might not end at offset
> + * which is page aligned.

"The user is not required to to access via mmap regardless of the
region mmap capabilities."

> + * Vendor driver should decide whether to partition data section and how to
> + * partition the data section. Vendor driver should return data_offset
> + * accordingly.
> + *
> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> + * and for _SAVING device state or stop-and-copy phase:
> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> + *    Kernel driver should return this read operation only after writing data to
> + *    staging buffer is done.

"staging buffer" implies a vendor driver implementation, perhaps we
could just state that data is available from (region + data_offset) to
(region + data_offset + data_size) upon return of this read operation.

> + * c. read data_size, amount of data in bytes written by vendor driver in
> + *    migration region.
> + * d. read data_size bytes of data from data_offset in the migration region.
> + * e. process data.
> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> + *    operation from migration region for previous iteration is done.

I think this indicate that step (f) should be to read pending_bytes, the
read sequence is not complete until this step.  Optionally the user can
then proceed to step (b).  There are no read side-effects of (a) afaict.

Is the use required to reach pending_bytes == 0 before changing
device_state, particularly transitioning to !_RUNNING?  Presumably the
user can exit this sequence at any time by clearing _SAVING.

> + *
> + * Sequence to be followed while _RESUMING device state:
> + * While data for this device is available, repeat below steps:
> + * a. read data_offset from where user application should write data.
> + * b. write data of data_size to migration region from data_offset.
> + * c. write data_size which indicates vendor driver that data is written in
> + *    staging buffer. Vendor driver should read this data from migration
> + *    region and resume device's state.

The device defaults to _RUNNING state, so a prerequisite is to set
_RESUMING and clear _RUNNING, right?

> + *
> + * For user application, data is opaque. User should write data in the same
> + * order as received.
> + */
> +
> +struct vfio_device_migration_info {
> +	__u32 device_state;         /* VFIO device state */
> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> +				     VFIO_DEVICE_STATE_SAVING |  \
> +				     VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> +					    VFIO_DEVICE_STATE_RESUMING)
> +
> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> +					    VFIO_DEVICE_STATE_RESUMING)

These seem difficult to use, maybe we just need a
VFIO_DEVICE_STATE_VALID macro?

#define VFIO_DEVICE_STATE_VALID(state) \
  (state & VFIO_DEVICE_STATE_RESUMING ? \
  (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)

Thanks,
Alex

> +	__u32 reserved;
> +	__u64 pending_bytes;
> +	__u64 data_offset;
> +	__u64 data_size;
> +} __attribute__((packed));
> +
>  /*
>   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>   * which allows direct access to non-MSIX registers which happened to be within
Yan Zhao Nov. 13, 2019, 3:23 a.m. UTC | #2
On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
> On Tue, 12 Nov 2019 22:33:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _RUNNING => Normal VFIO device running state. When its reset, it
> > 		indicates _STOPPED state. when device is changed to
> > 		_STOPPED, driver should stop device before write()
> > 		returns.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >                           start saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> 
> s/should/must/
> 
> >                 save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
> 
> A table might be useful here and in the uapi header to indicate valid
> states:
> 
> | _RESUMING | _SAVING | _RUNNING | Description
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     1    | Running, default state
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     0    | Stopped, migration interface in save mode
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     1    | Running, save mode interface, iterative
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     0    | Stopped, migration resume interface active
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     1    | Invalid (b)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     0    | Invalid (c)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     1    | Invalid (d)
> 
> I think we need to consider whether we define (a) as generally
> available, for instance we might want to use it for diagnostics or a
> fatal error condition outside of migration.
> 
> Are there hidden assumptions between state transitions here or are
> there specific next possible state diagrams that we need to include as
> well?
> 
> I'm curious if Intel agrees with the states marked invalid with their
> push for post-copy support.
> 
hi Alex and Kirti,
Actually, for postcopy, I think we anyway need an extra POSTCOPY state
introduced. Reasons as below:
- in the target side, _RSESUMING state is set in the beginning of
  migration. It cannot be used as a state to inform device of that
  currently it's in postcopy state and device DMAs are to be trapped and
  pre-faulted.
  we also cannot use state (_RESUMING + _RUNNING) as an indicator of
  postcopy, because before device & vm running in target side, some device
  state are already loaded (e.g. page tables, pending workloads),
  target side can do pre-pagefault at that period before target vm up.
- in the source side, after device is stopped, postcopy needs saving
  device state only (as compared to device state + remaining dirty
  pages in precopy). state (!_RUNNING + _SAVING) here again cannot
  differentiate precopy and postcopy here.

> >     Bits 3 - 31 are reserved for future use. User should perform
> >     read-modify-write operation on this field.
> > - Defined vfio_device_migration_info structure which will be placed at 0th
> >   offset of migration region to get/set VFIO device related information.
> >   Defined members of structure and usage on read/write access:
> >     * device_state: (read/write)
> >         To convey VFIO device state to be transitioned to. Only 3 bits are
> > 	used as of now, Bits 3 - 31 are reserved for future use.
> >     * pending bytes: (read only)
> >         To get pending bytes yet to be migrated for VFIO device.
> >     * data_offset: (read only)
> >         To get data offset in migration region from where data exist
> > 	during _SAVING and from where data should be written by user space
> > 	application during _RESUMING state.
> >     * data_size: (read/write)
> >         To get and set size in bytes of data copied in migration region
> > 	during _SAVING and _RESUMING state.
> > 
> > Migration region looks like:
> >  ------------------------------------------------------------------
> > |vfio_device_migration_info|    data section                      |
> > |                          |     ///////////////////////////////  |
> >  ------------------------------------------------------------------
> >  ^                              ^
> >  offset 0-trapped part        data_offset
> > 
> > Structure vfio_device_migration_info is always followed by data section
> > in the region, so data_offset will always be non-0. Offset from where data
> > to be copied is decided by kernel driver, data section can be trapped or
> > mapped depending on how kernel driver defines data section.
> > Data section partition can be defined as mapped by sparse mmap capability.
> > If mmapped, then data_offset should be page aligned, where as initial
> > section which contain vfio_device_migration_info structure might not end
> > at offset which is page aligned.
> > Vendor driver should decide whether to partition data section and how to
> > partition the data section. Vendor driver should return data_offset
> > accordingly.
> > 
> > For user application, data is opaque. User should write data in the same
> > order as received.
> > 
> > Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> > Reviewed-by: Neo Jia <cjia@nvidia.com>
> > ---
> >  include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 108 insertions(+)
> > 
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index 9e843a147ead..35b09427ad9f 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >  #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >  #define VFIO_REGION_TYPE_GFX                    (1)
> >  #define VFIO_REGION_TYPE_CCW			(2)
> > +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >  
> >  /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >  
> > @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
> >  /* sub-types for VFIO_REGION_TYPE_CCW */
> >  #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >  
> > +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> > +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> > +
> > +/*
> > + * Structure vfio_device_migration_info is placed at 0th offset of
> > + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> > + * information. Field accesses from this structure are only supported at their
> > + * native width and alignment, otherwise the result is undefined and vendor
> > + * drivers should return an error.
> > + *
> > + * device_state: (read/write)
> > + *      To indicate vendor driver the state VFIO device should be transitioned
> > + *      to. If device state transition fails, write on this field return error.
> > + *      It consists of 3 bits:
> > + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> 
> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
> ambiguous.
> 
> > + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> > + *        device before write() returns.
> > + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> > + *        should start gathering device state information which will be provided
> > + *        to VFIO user space application to save device's state.
> > + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> > + *        prepare to resume device, data provided through migration region
> > + *        should be used to resume device.
> > + *      Bits 3 - 31 are reserved for future use. User should perform
> > + *      read-modify-write operation on this field.
> > + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> > + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> > + *
> > + * pending bytes: (read only)
> > + *      Number of pending bytes yet to be migrated from vendor driver
> > + *
> > + * data_offset: (read only)
> > + *      User application should read data_offset in migration region from where
> > + *      user application should read device data during _SAVING state or write
> > + *      device data during _RESUMING state. See below for detail of sequence to
> > + *      be followed.
> > + *
> > + * data_size: (read/write)
> > + *      User application should read data_size to get size of data copied in
> > + *      bytes in migration region during _SAVING state and write size of data
> > + *      copied in bytes in migration region during _RESUMING state.
> > + *
> > + * Migration region looks like:
> > + *  ------------------------------------------------------------------
> > + * |vfio_device_migration_info|    data section                      |
> > + * |                          |     ///////////////////////////////  |
> > + * ------------------------------------------------------------------
> > + *   ^                              ^
> > + *  offset 0-trapped part        data_offset
> > + *
> > + * Structure vfio_device_migration_info is always followed by data section in
> > + * the region, so data_offset will always be non-0. Offset from where data is
> > + * copied is decided by kernel driver, data section can be trapped or mapped
> > + * or partitioned, depending on how kernel driver defines data section.
> > + * Data section partition can be defined as mapped by sparse mmap capability.
> > + * If mmapped, then data_offset should be page aligned, where as initial section
> > + * which contain vfio_device_migration_info structure might not end at offset
> > + * which is page aligned.
> 
> "The user is not required to to access via mmap regardless of the
> region mmap capabilities."
>
But once the user decides to access via mmap, it has to read data of
data_size each time, otherwise the vendor driver has no idea of how many
data are already read from user. Agree?

> > + * Vendor driver should decide whether to partition data section and how to
> > + * partition the data section. Vendor driver should return data_offset
> > + * accordingly.
> > + *
> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > + *    Kernel driver should return this read operation only after writing data to
> > + *    staging buffer is done.
May I know under what condition this data_offset will be changed per
each iteration from a-f ?

> 
> "staging buffer" implies a vendor driver implementation, perhaps we
> could just state that data is available from (region + data_offset) to
> (region + data_offset + data_size) upon return of this read operation.
> 
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > + *    operation from migration region for previous iteration is done.
> 
> I think this indicate that step (f) should be to read pending_bytes, the
> read sequence is not complete until this step.  Optionally the user can
> then proceed to step (b).  There are no read side-effects of (a) afaict.
> 
> Is the use required to reach pending_bytes == 0 before changing
> device_state, particularly transitioning to !_RUNNING?  Presumably the
> user can exit this sequence at any time by clearing _SAVING.
> 
> > + *
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
before proceed to step b, need to read data_size from vendor driver to determine
the max len of data to write. I think it's necessary in such a condition
that source vendor driver and target vendor driver do not offer data sections of
the same size. e.g. in source side, the data section is of size 100M,
but in target side, the data section is only of 50M size.
rather than simply fail, loop and write seems better.

Thanks
Yan
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer. Vendor driver should read this data from migration
> > + *    region and resume device's state.
> 
> The device defaults to _RUNNING state, so a prerequisite is to set
> _RESUMING and clear _RUNNING, right?
> 
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> > + */
> > +
> > +struct vfio_device_migration_info {
> > +	__u32 device_state;         /* VFIO device state */
> > +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> > +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> > +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> > +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> > +				     VFIO_DEVICE_STATE_SAVING |  \
> > +				     VFIO_DEVICE_STATE_RESUMING)
> > +
> > +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> > +					    VFIO_DEVICE_STATE_RESUMING)
> > +
> > +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> > +					    VFIO_DEVICE_STATE_RESUMING)
> 
> These seem difficult to use, maybe we just need a
> VFIO_DEVICE_STATE_VALID macro?
> 
> #define VFIO_DEVICE_STATE_VALID(state) \
>   (state & VFIO_DEVICE_STATE_RESUMING ? \
>   (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> 
> Thanks,
> Alex
> 
> > +	__u32 reserved;
> > +	__u64 pending_bytes;
> > +	__u64 data_offset;
> > +	__u64 data_size;
> > +} __attribute__((packed));
> > +
> >  /*
> >   * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >   * which allows direct access to non-MSIX registers which happened to be within
>
Cornelia Huck Nov. 13, 2019, 10:24 a.m. UTC | #3
On Tue, 12 Nov 2019 15:30:05 -0700
Alex Williamson <alex.williamson@redhat.com> wrote:

> On Tue, 12 Nov 2019 22:33:36 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
> > - Defined MIGRATION region type and sub-type.
> > - Used 3 bits to define VFIO device states.
> >     Bit 0 => _RUNNING
> >     Bit 1 => _SAVING
> >     Bit 2 => _RESUMING
> >     Combination of these bits defines VFIO device's state during migration
> >     _RUNNING => Normal VFIO device running state. When its reset, it
> > 		indicates _STOPPED state. when device is changed to
> > 		_STOPPED, driver should stop device before write()
> > 		returns.
> >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >                           start saving state of device i.e. pre-copy state
> >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> 
> s/should/must/
> 
> >                 save device state,i.e. stop-n-copy state
> >     _RESUMING => VFIO device resuming state.
> >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> 
> A table might be useful here and in the uapi header to indicate valid
> states:

I like that.

> 
> | _RESUMING | _SAVING | _RUNNING | Description
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> +-----------+---------+----------+------------------------------------------
> |     0     |    0    |     1    | Running, default state
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     0    | Stopped, migration interface in save mode
> +-----------+---------+----------+------------------------------------------
> |     0     |    1    |     1    | Running, save mode interface, iterative
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     0    | Stopped, migration resume interface active
> +-----------+---------+----------+------------------------------------------
> |     1     |    0    |     1    | Invalid (b)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     0    | Invalid (c)
> +-----------+---------+----------+------------------------------------------
> |     1     |    1    |     1    | Invalid (d)
> 
> I think we need to consider whether we define (a) as generally
> available, for instance we might want to use it for diagnostics or a
> fatal error condition outside of migration.
> 
> Are there hidden assumptions between state transitions here or are
> there specific next possible state diagrams that we need to include as
> well?

Some kind of state-change diagram might be useful in addition to the
textual description anyway. Let me try, just to make sure I understand
this correctly:

1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
2) 0/0/1 ---(tell driver to stop)---> 0/0/0
3) 0/1/1 ---(tell driver to stop)---> 0/1/0
4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
5) 1/0/0 ---(driver is ready)---> 0/0/1
6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1

Not sure about the usefulness of 2). Also, is 4) the only way to
trigger resuming? And is the change in 5) performed by the driver, or
by userspace?

Are any other state transitions valid?

(...)

> > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > + * and for _SAVING device state or stop-and-copy phase:
> > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > + *    Kernel driver should return this read operation only after writing data to
> > + *    staging buffer is done.  
> 
> "staging buffer" implies a vendor driver implementation, perhaps we
> could just state that data is available from (region + data_offset) to
> (region + data_offset + data_size) upon return of this read operation.
> 
> > + * c. read data_size, amount of data in bytes written by vendor driver in
> > + *    migration region.
> > + * d. read data_size bytes of data from data_offset in the migration region.
> > + * e. process data.
> > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > + *    operation from migration region for previous iteration is done.  
> 
> I think this indicate that step (f) should be to read pending_bytes, the
> read sequence is not complete until this step.  Optionally the user can
> then proceed to step (b).  There are no read side-effects of (a) afaict.
> 
> Is the use required to reach pending_bytes == 0 before changing
> device_state, particularly transitioning to !_RUNNING?  Presumably the
> user can exit this sequence at any time by clearing _SAVING.

That would be transition 6) above (abort saving and continue). I think
it makes sense not to forbid this.

> 
> > + *
> > + * Sequence to be followed while _RESUMING device state:
> > + * While data for this device is available, repeat below steps:
> > + * a. read data_offset from where user application should write data.
> > + * b. write data of data_size to migration region from data_offset.
> > + * c. write data_size which indicates vendor driver that data is written in
> > + *    staging buffer. Vendor driver should read this data from migration
> > + *    region and resume device's state.  
> 
> The device defaults to _RUNNING state, so a prerequisite is to set
> _RESUMING and clear _RUNNING, right?

Transition 4) above. Do we need
7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
as well? (Probably depends on how sensible the 0/0/0 state is.)

> 
> > + *
> > + * For user application, data is opaque. User should write data in the same
> > + * order as received.
> > + */
Alex Williamson Nov. 13, 2019, 6:27 p.m. UTC | #4
On Wed, 13 Nov 2019 11:24:17 +0100
Cornelia Huck <cohuck@redhat.com> wrote:

> On Tue, 12 Nov 2019 15:30:05 -0700
> Alex Williamson <alex.williamson@redhat.com> wrote:
> 
> > On Tue, 12 Nov 2019 22:33:36 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> > > - Defined MIGRATION region type and sub-type.
> > > - Used 3 bits to define VFIO device states.
> > >     Bit 0 => _RUNNING
> > >     Bit 1 => _SAVING
> > >     Bit 2 => _RESUMING
> > >     Combination of these bits defines VFIO device's state during migration
> > >     _RUNNING => Normal VFIO device running state. When its reset, it
> > > 		indicates _STOPPED state. when device is changed to
> > > 		_STOPPED, driver should stop device before write()
> > > 		returns.
> > >     _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> > >                           start saving state of device i.e. pre-copy state
> > >     _SAVING  => vCPUs are stopped, VFIO device should be stopped, and    
> > 
> > s/should/must/
> >   
> > >                 save device state,i.e. stop-n-copy state
> > >     _RESUMING => VFIO device resuming state.
> > >     _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states    
> > 
> > A table might be useful here and in the uapi header to indicate valid
> > states:  
> 
> I like that.
> 
> > 
> > | _RESUMING | _SAVING | _RUNNING | Description
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    0    |     1    | Running, default state
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    1    |     0    | Stopped, migration interface in save mode
> > +-----------+---------+----------+------------------------------------------
> > |     0     |    1    |     1    | Running, save mode interface, iterative
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    0    |     0    | Stopped, migration resume interface active
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    0    |     1    | Invalid (b)
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    1    |     0    | Invalid (c)
> > +-----------+---------+----------+------------------------------------------
> > |     1     |    1    |     1    | Invalid (d)
> > 
> > I think we need to consider whether we define (a) as generally
> > available, for instance we might want to use it for diagnostics or a
> > fatal error condition outside of migration.
> > 
> > Are there hidden assumptions between state transitions here or are
> > there specific next possible state diagrams that we need to include as
> > well?  
> 
> Some kind of state-change diagram might be useful in addition to the
> textual description anyway. Let me try, just to make sure I understand
> this correctly:
> 
> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0

I think this is to switch into resuming mode, the data will follow

> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1

I think also:

0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy

0/0/0 and 0/0/1 should be reachable from any state, though I could see
that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
received state is incomplete.  Somehow though a user always needs to
return the device to the initial state, so how does device_state
interact with the reset ioctl?  Would this automatically manipulate
device_state back to 0/0/1?
 
> Not sure about the usefulness of 2). Also, is 4) the only way to
> trigger resuming? And is the change in 5) performed by the driver, or
> by userspace?
> 
> Are any other state transitions valid?
> 
> (...)
> 
> > > + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> > > + * and for _SAVING device state or stop-and-copy phase:
> > > + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> > > + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> > > + *    Kernel driver should return this read operation only after writing data to
> > > + *    staging buffer is done.    
> > 
> > "staging buffer" implies a vendor driver implementation, perhaps we
> > could just state that data is available from (region + data_offset) to
> > (region + data_offset + data_size) upon return of this read operation.
> >   
> > > + * c. read data_size, amount of data in bytes written by vendor driver in
> > > + *    migration region.
> > > + * d. read data_size bytes of data from data_offset in the migration region.
> > > + * e. process data.
> > > + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> > > + *    operation from migration region for previous iteration is done.    
> > 
> > I think this indicate that step (f) should be to read pending_bytes, the
> > read sequence is not complete until this step.  Optionally the user can
> > then proceed to step (b).  There are no read side-effects of (a) afaict.
> > 
> > Is the use required to reach pending_bytes == 0 before changing
> > device_state, particularly transitioning to !_RUNNING?  Presumably the
> > user can exit this sequence at any time by clearing _SAVING.  
> 
> That would be transition 6) above (abort saving and continue). I think
> it makes sense not to forbid this.
> 
> >   
> > > + *
> > > + * Sequence to be followed while _RESUMING device state:
> > > + * While data for this device is available, repeat below steps:
> > > + * a. read data_offset from where user application should write data.
> > > + * b. write data of data_size to migration region from data_offset.
> > > + * c. write data_size which indicates vendor driver that data is written in
> > > + *    staging buffer. Vendor driver should read this data from migration
> > > + *    region and resume device's state.    
> > 
> > The device defaults to _RUNNING state, so a prerequisite is to set
> > _RESUMING and clear _RUNNING, right?  
> 
> Transition 4) above. Do we need
> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> as well? (Probably depends on how sensible the 0/0/0 state is.)

I think we must unless we require the user to transition from 0/0/1 to
1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
available.  Thanks,

Alex
Kirti Wankhede Nov. 13, 2019, 7:02 p.m. UTC | #5
On 11/13/2019 8:53 AM, Yan Zhao wrote:
> On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
>> On Tue, 12 Nov 2019 22:33:36 +0530
>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>
>>> - Defined MIGRATION region type and sub-type.
>>> - Used 3 bits to define VFIO device states.
>>>      Bit 0 => _RUNNING
>>>      Bit 1 => _SAVING
>>>      Bit 2 => _RESUMING
>>>      Combination of these bits defines VFIO device's state during migration
>>>      _RUNNING => Normal VFIO device running state. When its reset, it
>>> 		indicates _STOPPED state. when device is changed to
>>> 		_STOPPED, driver should stop device before write()
>>> 		returns.
>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>                            start saving state of device i.e. pre-copy state
>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>
>> s/should/must/
>>
>>>                  save device state,i.e. stop-n-copy state
>>>      _RESUMING => VFIO device resuming state.
>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>
>> A table might be useful here and in the uapi header to indicate valid
>> states:
>>
>> | _RESUMING | _SAVING | _RUNNING | Description
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    0    |     1    | Running, default state
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>> +-----------+---------+----------+------------------------------------------
>> |     0     |    1    |     1    | Running, save mode interface, iterative
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    0    |     0    | Stopped, migration resume interface active
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    0    |     1    | Invalid (b)
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    1    |     0    | Invalid (c)
>> +-----------+---------+----------+------------------------------------------
>> |     1     |    1    |     1    | Invalid (d)
>>
>> I think we need to consider whether we define (a) as generally
>> available, for instance we might want to use it for diagnostics or a
>> fatal error condition outside of migration.
>>

We have to set it as init state. I'll add this it.

>> Are there hidden assumptions between state transitions here or are
>> there specific next possible state diagrams that we need to include as
>> well?
>>
>> I'm curious if Intel agrees with the states marked invalid with their
>> push for post-copy support.
>>
> hi Alex and Kirti,
> Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> introduced. Reasons as below:
> - in the target side, _RSESUMING state is set in the beginning of
>    migration. It cannot be used as a state to inform device of that
>    currently it's in postcopy state and device DMAs are to be trapped and
>    pre-faulted.
>    we also cannot use state (_RESUMING + _RUNNING) as an indicator of
>    postcopy, because before device & vm running in target side, some device
>    state are already loaded (e.g. page tables, pending workloads),
>    target side can do pre-pagefault at that period before target vm up.
> - in the source side, after device is stopped, postcopy needs saving
>    device state only (as compared to device state + remaining dirty
>    pages in precopy). state (!_RUNNING + _SAVING) here again cannot
>    differentiate precopy and postcopy here.
> 
>>>      Bits 3 - 31 are reserved for future use. User should perform
>>>      read-modify-write operation on this field.
>>> - Defined vfio_device_migration_info structure which will be placed at 0th
>>>    offset of migration region to get/set VFIO device related information.
>>>    Defined members of structure and usage on read/write access:
>>>      * device_state: (read/write)
>>>          To convey VFIO device state to be transitioned to. Only 3 bits are
>>> 	used as of now, Bits 3 - 31 are reserved for future use.
>>>      * pending bytes: (read only)
>>>          To get pending bytes yet to be migrated for VFIO device.
>>>      * data_offset: (read only)
>>>          To get data offset in migration region from where data exist
>>> 	during _SAVING and from where data should be written by user space
>>> 	application during _RESUMING state.
>>>      * data_size: (read/write)
>>>          To get and set size in bytes of data copied in migration region
>>> 	during _SAVING and _RESUMING state.
>>>
>>> Migration region looks like:
>>>   ------------------------------------------------------------------
>>> |vfio_device_migration_info|    data section                      |
>>> |                          |     ///////////////////////////////  |
>>>   ------------------------------------------------------------------
>>>   ^                              ^
>>>   offset 0-trapped part        data_offset
>>>
>>> Structure vfio_device_migration_info is always followed by data section
>>> in the region, so data_offset will always be non-0. Offset from where data
>>> to be copied is decided by kernel driver, data section can be trapped or
>>> mapped depending on how kernel driver defines data section.
>>> Data section partition can be defined as mapped by sparse mmap capability.
>>> If mmapped, then data_offset should be page aligned, where as initial
>>> section which contain vfio_device_migration_info structure might not end
>>> at offset which is page aligned.
>>> Vendor driver should decide whether to partition data section and how to
>>> partition the data section. Vendor driver should return data_offset
>>> accordingly.
>>>
>>> For user application, data is opaque. User should write data in the same
>>> order as received.
>>>
>>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
>>> Reviewed-by: Neo Jia <cjia@nvidia.com>
>>> ---
>>>   include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
>>>   1 file changed, 108 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>> index 9e843a147ead..35b09427ad9f 100644
>>> --- a/include/uapi/linux/vfio.h
>>> +++ b/include/uapi/linux/vfio.h
>>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
>>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
>>>   #define VFIO_REGION_TYPE_GFX                    (1)
>>>   #define VFIO_REGION_TYPE_CCW			(2)
>>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
>>>   
>>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
>>>   
>>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
>>>   /* sub-types for VFIO_REGION_TYPE_CCW */
>>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
>>>   
>>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
>>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
>>> +
>>> +/*
>>> + * Structure vfio_device_migration_info is placed at 0th offset of
>>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
>>> + * information. Field accesses from this structure are only supported at their
>>> + * native width and alignment, otherwise the result is undefined and vendor
>>> + * drivers should return an error.
>>> + *
>>> + * device_state: (read/write)
>>> + *      To indicate vendor driver the state VFIO device should be transitioned
>>> + *      to. If device state transition fails, write on this field return error.
>>> + *      It consists of 3 bits:
>>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
>>
>> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
>> ambiguous.

Ok. Updating it.

>>
>>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
>>> + *        device before write() returns.
>>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
>>> + *        should start gathering device state information which will be provided
>>> + *        to VFIO user space application to save device's state.
>>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
>>> + *        prepare to resume device, data provided through migration region
>>> + *        should be used to resume device.
>>> + *      Bits 3 - 31 are reserved for future use. User should perform
>>> + *      read-modify-write operation on this field.
>>> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
>>> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
>>> + *
>>> + * pending bytes: (read only)
>>> + *      Number of pending bytes yet to be migrated from vendor driver
>>> + *
>>> + * data_offset: (read only)
>>> + *      User application should read data_offset in migration region from where
>>> + *      user application should read device data during _SAVING state or write
>>> + *      device data during _RESUMING state. See below for detail of sequence to
>>> + *      be followed.
>>> + *
>>> + * data_size: (read/write)
>>> + *      User application should read data_size to get size of data copied in
>>> + *      bytes in migration region during _SAVING state and write size of data
>>> + *      copied in bytes in migration region during _RESUMING state.
>>> + *
>>> + * Migration region looks like:
>>> + *  ------------------------------------------------------------------
>>> + * |vfio_device_migration_info|    data section                      |
>>> + * |                          |     ///////////////////////////////  |
>>> + * ------------------------------------------------------------------
>>> + *   ^                              ^
>>> + *  offset 0-trapped part        data_offset
>>> + *
>>> + * Structure vfio_device_migration_info is always followed by data section in
>>> + * the region, so data_offset will always be non-0. Offset from where data is
>>> + * copied is decided by kernel driver, data section can be trapped or mapped
>>> + * or partitioned, depending on how kernel driver defines data section.
>>> + * Data section partition can be defined as mapped by sparse mmap capability.
>>> + * If mmapped, then data_offset should be page aligned, where as initial section
>>> + * which contain vfio_device_migration_info structure might not end at offset
>>> + * which is page aligned.
>>
>> "The user is not required to to access via mmap regardless of the
>> region mmap capabilities."
>>
> But once the user decides to access via mmap, it has to read data of
> data_size each time, otherwise the vendor driver has no idea of how many
> data are already read from user. Agree?
> 

that's right.

>>> + * Vendor driver should decide whether to partition data section and how to
>>> + * partition the data section. Vendor driver should return data_offset
>>> + * accordingly.
>>> + *
>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>> + * and for _SAVING device state or stop-and-copy phase:
>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>> + *    Kernel driver should return this read operation only after writing data to
>>> + *    staging buffer is done.
> May I know under what condition this data_offset will be changed per
> each iteration from a-f ?
> 

Its upto vendor driver, if vendor driver maintains multiple partitions 
in data section.

>>
>> "staging buffer" implies a vendor driver implementation, perhaps we
>> could just state that data is available from (region + data_offset) to
>> (region + data_offset + data_size) upon return of this read operation.
>>

Makes sense. Updating it.

>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>> + *    migration region.
>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>> + * e. process data.
>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>> + *    operation from migration region for previous iteration is done.
>>
>> I think this indicate that step (f) should be to read pending_bytes, the
>> read sequence is not complete until this step.  Optionally the user can
>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>

I tried to reword this sequence to be more specific:

* Sequence to be followed for _SAVING|_RUNNING device state or pre-copy 
* phase and for _SAVING device state or stop-and-copy phase:
* a. read pending_bytes, indicates start of new iteration to get device 
*    data. If there was previous iteration, then this read operation
*    indicates previous iteration is done. If pending_bytes > 0, go
*    through below steps.
* b. read data_offset, indicates kernel driver to make data available
*    through data section. Kernel driver should return this read
*    operation only after data is available from (region + data_offset)
*    to (region + data_offset + data_size).
* c. read data_size, amount of data in bytes available through migration
*    region.
* d. read data of data_size bytes from (region + data_offset) from
*    migration region.
* e. process data.
* f. Loop through a to e.

Hope this is more clear.

>> Is the use required to reach pending_bytes == 0 before changing
>> device_state, particularly transitioning to !_RUNNING?

No, its not necessary to reach till pending_bytes==0 in pre-copy phase.

>>  Presumably the
>> user can exit this sequence at any time by clearing _SAVING.

In that case device state data is not complete, which will result in not 
able to resume device with that data.
In stop-and-copy phase, user should iterate till pending_bytes is 0.

>>
>>> + *
>>> + * Sequence to be followed while _RESUMING device state:
>>> + * While data for this device is available, repeat below steps:
>>> + * a. read data_offset from where user application should write data.
> before proceed to step b, need to read data_size from vendor driver to determine
> the max len of data to write. I think it's necessary in such a condition
> that source vendor driver and target vendor driver do not offer data sections of
> the same size. e.g. in source side, the data section is of size 100M,
> but in target side, the data section is only of 50M size.
> rather than simply fail, loop and write seems better.
> 

Makes sense. Doing this change for next version.

> Thanks
> Yan
>>> + * b. write data of data_size to migration region from data_offset.
>>> + * c. write data_size which indicates vendor driver that data is written in
>>> + *    staging buffer. Vendor driver should read this data from migration
>>> + *    region and resume device's state.
>>
>> The device defaults to _RUNNING state, so a prerequisite is to set
>> _RESUMING and clear _RUNNING, right?

Yes.

>>
>>> + *
>>> + * For user application, data is opaque. User should write data in the same
>>> + * order as received.
>>> + */
>>> +
>>> +struct vfio_device_migration_info {
>>> +	__u32 device_state;         /* VFIO device state */
>>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
>>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
>>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
>>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
>>> +				     VFIO_DEVICE_STATE_SAVING |  \
>>> +				     VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>> +
>>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
>>> +					    VFIO_DEVICE_STATE_RESUMING)
>>
>> These seem difficult to use, maybe we just need a
>> VFIO_DEVICE_STATE_VALID macro?
>>
>> #define VFIO_DEVICE_STATE_VALID(state) \
>>    (state & VFIO_DEVICE_STATE_RESUMING ? \
>>    (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
>>

This will not be work when use of other bits gets added in future. 
That's the reason I preferred to add individual invalid states which 
user should check.

Thanks,
Kirti

>> Thanks,
>> Alex
>>
>>> +	__u32 reserved;
>>> +	__u64 pending_bytes;
>>> +	__u64 data_offset;
>>> +	__u64 data_size;
>>> +} __attribute__((packed));
>>> +
>>>   /*
>>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
>>>    * which allows direct access to non-MSIX registers which happened to be within
>>
Kirti Wankhede Nov. 13, 2019, 7:29 p.m. UTC | #6
On 11/13/2019 11:57 PM, Alex Williamson wrote:
> On Wed, 13 Nov 2019 11:24:17 +0100
> Cornelia Huck <cohuck@redhat.com> wrote:
> 
>> On Tue, 12 Nov 2019 15:30:05 -0700
>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>
>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> - Defined MIGRATION region type and sub-type.
>>>> - Used 3 bits to define VFIO device states.
>>>>      Bit 0 => _RUNNING
>>>>      Bit 1 => _SAVING
>>>>      Bit 2 => _RESUMING
>>>>      Combination of these bits defines VFIO device's state during migration
>>>>      _RUNNING => Normal VFIO device running state. When its reset, it
>>>> 		indicates _STOPPED state. when device is changed to
>>>> 		_STOPPED, driver should stop device before write()
>>>> 		returns.
>>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>                            start saving state of device i.e. pre-copy state
>>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>
>>> s/should/must/
>>>    
>>>>                  save device state,i.e. stop-n-copy state
>>>>      _RESUMING => VFIO device resuming state.
>>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>
>>> A table might be useful here and in the uapi header to indicate valid
>>> states:
>>
>> I like that.
>>
>>>
>>> | _RESUMING | _SAVING | _RUNNING | Description
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    0    |     1    | Running, default state
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>> +-----------+---------+----------+------------------------------------------
>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    0    |     1    | Invalid (b)
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    1    |     0    | Invalid (c)
>>> +-----------+---------+----------+------------------------------------------
>>> |     1     |    1    |     1    | Invalid (d)
>>>
>>> I think we need to consider whether we define (a) as generally
>>> available, for instance we might want to use it for diagnostics or a
>>> fatal error condition outside of migration.
>>>
>>> Are there hidden assumptions between state transitions here or are
>>> there specific next possible state diagrams that we need to include as
>>> well?
>>
>> Some kind of state-change diagram might be useful in addition to the
>> textual description anyway. Let me try, just to make sure I understand
>> this correctly:
>>

During User application initialization, there is one more state change:

0) 0/0/0 ---- stop to running -----> 0/0/1

>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1

not just gathering state info, but also copy device state to be 
transferred during pre-copy phase.

Below 2 state are not just to tell driver to stop, those 2 differ.
2) is device state changed from running to stop, this is when VM 
shutdowns cleanly, no need to save device state

>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0 

>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0

above is transition from pre-copy phase to stop-and-copy phase, where 
device data should be made available to user to transfer to destination 
or to save it to file in case of save VM or suspend.


>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
> 
> I think this is to switch into resuming mode, the data will follow >
>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>

above can occur on migration cancelled or failed.


> I think also:
> 
> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy

that's right, this happens in case of save VM or suspend VM.

> 
> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> received state is incomplete.  Somehow though a user always needs to
> return the device to the initial state, so how does device_state
> interact with the reset ioctl?  Would this automatically manipulate
> device_state back to 0/0/1?

why would reset occur on 1/0/0 -> 0/0/1 failure?

1/0/0 -> 0/0/1 fails, then user should convey that to source that 
migration has failed, then resume at source.

>   
>> Not sure about the usefulness of 2).

I explained this above.

>> Also, is 4) the only way to
>> trigger resuming? 
Yes.

>> And is the change in 5) performed by the driver, or
>> by userspace?
>>
By userspace.

>> Are any other state transitions valid?
>>
>> (...)
>>
>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>> + *    Kernel driver should return this read operation only after writing data to
>>>> + *    staging buffer is done.
>>>
>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>> could just state that data is available from (region + data_offset) to
>>> (region + data_offset + data_size) upon return of this read operation.
>>>    
>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>> + *    migration region.
>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>> + * e. process data.
>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>> + *    operation from migration region for previous iteration is done.
>>>
>>> I think this indicate that step (f) should be to read pending_bytes, the
>>> read sequence is not complete until this step.  Optionally the user can
>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>
>>> Is the use required to reach pending_bytes == 0 before changing
>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>> user can exit this sequence at any time by clearing _SAVING.
>>
>> That would be transition 6) above (abort saving and continue). I think
>> it makes sense not to forbid this.
>>
>>>    
>>>> + *
>>>> + * Sequence to be followed while _RESUMING device state:
>>>> + * While data for this device is available, repeat below steps:
>>>> + * a. read data_offset from where user application should write data.
>>>> + * b. write data of data_size to migration region from data_offset.
>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>> + *    region and resume device's state.
>>>
>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>> _RESUMING and clear _RUNNING, right?
>>

Sorry, I replied yes in my previous reply, but no. Default device state 
is _STOPPED. During resume _STOPPED -> _RESUMING

>> Transition 4) above. Do we need

I think, its not required.

>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>> as well? (Probably depends on how sensible the 0/0/0 state is.)
> 
> I think we must unless we require the user to transition from 0/0/1 to
> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> available.  Thanks,
> 

its 0/0/0 -> 1/0/0 while resuming.

Thanks,
Kirti

> Alex
>
Alex Williamson Nov. 13, 2019, 7:48 p.m. UTC | #7
On Thu, 14 Nov 2019 00:59:52 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/13/2019 11:57 PM, Alex Williamson wrote:
> > On Wed, 13 Nov 2019 11:24:17 +0100
> > Cornelia Huck <cohuck@redhat.com> wrote:
> >   
> >> On Tue, 12 Nov 2019 15:30:05 -0700
> >> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>  
> >>> On Tue, 12 Nov 2019 22:33:36 +0530
> >>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>      
> >>>> - Defined MIGRATION region type and sub-type.
> >>>> - Used 3 bits to define VFIO device states.
> >>>>      Bit 0 => _RUNNING
> >>>>      Bit 1 => _SAVING
> >>>>      Bit 2 => _RESUMING
> >>>>      Combination of these bits defines VFIO device's state during migration
> >>>>      _RUNNING => Normal VFIO device running state. When its reset, it
> >>>> 		indicates _STOPPED state. when device is changed to
> >>>> 		_STOPPED, driver should stop device before write()
> >>>> 		returns.
> >>>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>>                            start saving state of device i.e. pre-copy state
> >>>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> >>>
> >>> s/should/must/
> >>>      
> >>>>                  save device state,i.e. stop-n-copy state
> >>>>      _RESUMING => VFIO device resuming state.
> >>>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> >>>
> >>> A table might be useful here and in the uapi header to indicate valid
> >>> states:  
> >>
> >> I like that.
> >>  
> >>>
> >>> | _RESUMING | _SAVING | _RUNNING | Description
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    0    |     1    | Running, default state
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     0     |    1    |     1    | Running, save mode interface, iterative
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    0    |     0    | Stopped, migration resume interface active
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    0    |     1    | Invalid (b)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    1    |     0    | Invalid (c)
> >>> +-----------+---------+----------+------------------------------------------
> >>> |     1     |    1    |     1    | Invalid (d)
> >>>
> >>> I think we need to consider whether we define (a) as generally
> >>> available, for instance we might want to use it for diagnostics or a
> >>> fatal error condition outside of migration.
> >>>
> >>> Are there hidden assumptions between state transitions here or are
> >>> there specific next possible state diagrams that we need to include as
> >>> well?  
> >>
> >> Some kind of state-change diagram might be useful in addition to the
> >> textual description anyway. Let me try, just to make sure I understand
> >> this correctly:
> >>  
> 
> During User application initialization, there is one more state change:
> 
> 0) 0/0/0 ---- stop to running -----> 0/0/1

0/0/0 cannot be the initial state of the device, that would imply that
a device supporting this migration interface breaks backwards
compatibility with all existing vfio userspace code and that code needs
to learn to set the device running as part of its initialization.
That's absolutely unacceptable.  The initial device state must be 0/0/1.

> >> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1  
> 
> not just gathering state info, but also copy device state to be 
> transferred during pre-copy phase.
> 
> Below 2 state are not just to tell driver to stop, those 2 differ.
> 2) is device state changed from running to stop, this is when VM 
> shutdowns cleanly, no need to save device state

Userspace is under no obligation to perform this state change though,
backwards compatibility dictates this.
 
> >> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0   
> 
> >> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0  
> 
> above is transition from pre-copy phase to stop-and-copy phase, where 
> device data should be made available to user to transfer to destination 
> or to save it to file in case of save VM or suspend.
> 
> 
> >> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0  
> > 
> > I think this is to switch into resuming mode, the data will follow >  
> >> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> >> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1  
> >  
> 
> above can occur on migration cancelled or failed.
> 
> 
> > I think also:
> > 
> > 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy  
> 
> that's right, this happens in case of save VM or suspend VM.
> 
> > 
> > 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> > that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> > received state is incomplete.  Somehow though a user always needs to
> > return the device to the initial state, so how does device_state
> > interact with the reset ioctl?  Would this automatically manipulate
> > device_state back to 0/0/1?  
> 
> why would reset occur on 1/0/0 -> 0/0/1 failure?

The question is whether the reset ioctl automatically puts the device
back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
presumably discards much of the device state we just restored, so
clearly that would be undesirable.
 
> 1/0/0 -> 0/0/1 fails, then user should convey that to source that 
> migration has failed, then resume at source.

In the scheme of the migration yet, but as far as the vfio interface is
concerned the user should have a path to make use of a device after
this point without closing it and starting over.  Thus, if a 1/0/0 ->
0/0/1 transition fails, would we define the device reset ioctl as a
mechanism to flush the bogus state and place the device into the 0/0/1
initial state?
 
> >     
> >> Not sure about the usefulness of 2).  
> 
> I explained this above.
> 
> >> Also, is 4) the only way to
> >> trigger resuming?   
> Yes.
> 
> >> And is the change in 5) performed by the driver, or
> >> by userspace?
> >>  
> By userspace.
> 
> >> Are any other state transitions valid?
> >>
> >> (...)
> >>  
> >>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>>> + *    Kernel driver should return this read operation only after writing data to
> >>>> + *    staging buffer is done.  
> >>>
> >>> "staging buffer" implies a vendor driver implementation, perhaps we
> >>> could just state that data is available from (region + data_offset) to
> >>> (region + data_offset + data_size) upon return of this read operation.
> >>>      
> >>>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>>> + *    migration region.
> >>>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>>> + * e. process data.
> >>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>>> + *    operation from migration region for previous iteration is done.  
> >>>
> >>> I think this indicate that step (f) should be to read pending_bytes, the
> >>> read sequence is not complete until this step.  Optionally the user can
> >>> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>>
> >>> Is the use required to reach pending_bytes == 0 before changing
> >>> device_state, particularly transitioning to !_RUNNING?  Presumably the
> >>> user can exit this sequence at any time by clearing _SAVING.  
> >>
> >> That would be transition 6) above (abort saving and continue). I think
> >> it makes sense not to forbid this.
> >>  
> >>>      
> >>>> + *
> >>>> + * Sequence to be followed while _RESUMING device state:
> >>>> + * While data for this device is available, repeat below steps:
> >>>> + * a. read data_offset from where user application should write data.
> >>>> + * b. write data of data_size to migration region from data_offset.
> >>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>> + *    region and resume device's state.  
> >>>
> >>> The device defaults to _RUNNING state, so a prerequisite is to set
> >>> _RESUMING and clear _RUNNING, right?  
> >>  
> 
> Sorry, I replied yes in my previous reply, but no. Default device state 
> is _STOPPED. During resume _STOPPED -> _RESUMING

Nope, it can't be, it must be _RUNNING.

> >> Transition 4) above. Do we need  
> 
> I think, its not required.

But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
1/0/0).

> >> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> >> as well? (Probably depends on how sensible the 0/0/0 state is.)  
> > 
> > I think we must unless we require the user to transition from 0/0/1 to
> > 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> > available.  Thanks,
> >   
> 
> its 0/0/0 -> 1/0/0 while resuming.

I think we're starting with different initial states, IMO there is
absolutely no way around 0/0/1 being the initial device state.
Anything otherwise means that we cannot add migration support to an
existing device and maintain compatibility with existing userspace.
Thanks,

Alex
Kirti Wankhede Nov. 13, 2019, 8:17 p.m. UTC | #8
On 11/14/2019 1:18 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 00:59:52 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/13/2019 11:57 PM, Alex Williamson wrote:
>>> On Wed, 13 Nov 2019 11:24:17 +0100
>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>    
>>>> On Tue, 12 Nov 2019 15:30:05 -0700
>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>   
>>>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>       
>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>       Bit 0 => _RUNNING
>>>>>>       Bit 1 => _SAVING
>>>>>>       Bit 2 => _RESUMING
>>>>>>       Combination of these bits defines VFIO device's state during migration
>>>>>>       _RUNNING => Normal VFIO device running state. When its reset, it
>>>>>> 		indicates _STOPPED state. when device is changed to
>>>>>> 		_STOPPED, driver should stop device before write()
>>>>>> 		returns.
>>>>>>       _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>>>                             start saving state of device i.e. pre-copy state
>>>>>>       _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>
>>>>> s/should/must/
>>>>>       
>>>>>>                   save device state,i.e. stop-n-copy state
>>>>>>       _RESUMING => VFIO device resuming state.
>>>>>>       _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>>>
>>>>> A table might be useful here and in the uapi header to indicate valid
>>>>> states:
>>>>
>>>> I like that.
>>>>   
>>>>>
>>>>> | _RESUMING | _SAVING | _RUNNING | Description
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    0    |     1    | Running, default state
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    0    |     1    | Invalid (b)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    1    |     0    | Invalid (c)
>>>>> +-----------+---------+----------+------------------------------------------
>>>>> |     1     |    1    |     1    | Invalid (d)
>>>>>
>>>>> I think we need to consider whether we define (a) as generally
>>>>> available, for instance we might want to use it for diagnostics or a
>>>>> fatal error condition outside of migration.
>>>>>
>>>>> Are there hidden assumptions between state transitions here or are
>>>>> there specific next possible state diagrams that we need to include as
>>>>> well?
>>>>
>>>> Some kind of state-change diagram might be useful in addition to the
>>>> textual description anyway. Let me try, just to make sure I understand
>>>> this correctly:
>>>>   
>>
>> During User application initialization, there is one more state change:
>>
>> 0) 0/0/0 ---- stop to running -----> 0/0/1
> 
> 0/0/0 cannot be the initial state of the device, that would imply that
> a device supporting this migration interface breaks backwards
> compatibility with all existing vfio userspace code and that code needs
> to learn to set the device running as part of its initialization.
> That's absolutely unacceptable.  The initial device state must be 0/0/1.
> 

There isn't any device state for all existing vfio userspace code right 
now. So default its assumed to be always running.

With migration support, device states are explicitly getting added. For 
example, in case of QEMU, while device is getting initialized, i.e. from 
vfio_realize(), device_state is set to 0/0/0, but not required to convey 
it to vendor driver. Then with vfio_vmstate_change() notifier, device 
state is changed to 0/0/1 when VM/vCPU are transitioned to running, at 
this moment device state is conveyed to vendor driver. So vendor driver 
doesn't see 0/0/0 state.

While resuming, for userspace, for example QEMU, device state change is 
from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic 
initialization is done.


>>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
>>
>> not just gathering state info, but also copy device state to be
>> transferred during pre-copy phase.
>>
>> Below 2 state are not just to tell driver to stop, those 2 differ.
>> 2) is device state changed from running to stop, this is when VM
>> shutdowns cleanly, no need to save device state
> 
> Userspace is under no obligation to perform this state change though,
> backwards compatibility dictates this.
>   
>>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
>>
>>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
>>
>> above is transition from pre-copy phase to stop-and-copy phase, where
>> device data should be made available to user to transfer to destination
>> or to save it to file in case of save VM or suspend.
>>
>>
>>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
>>>
>>> I think this is to switch into resuming mode, the data will follow >
>>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>>>   
>>
>> above can occur on migration cancelled or failed.
>>
>>
>>> I think also:
>>>
>>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy
>>
>> that's right, this happens in case of save VM or suspend VM.
>>
>>>
>>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
>>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
>>> received state is incomplete.  Somehow though a user always needs to
>>> return the device to the initial state, so how does device_state
>>> interact with the reset ioctl?  Would this automatically manipulate
>>> device_state back to 0/0/1?
>>
>> why would reset occur on 1/0/0 -> 0/0/1 failure?
> 
> The question is whether the reset ioctl automatically puts the device
> back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
> presumably discards much of the device state we just restored, so
> clearly that would be undesirable.
>   
>> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
>> migration has failed, then resume at source.
> 
> In the scheme of the migration yet, but as far as the vfio interface is
> concerned the user should have a path to make use of a device after
> this point without closing it and starting over.  Thus, if a 1/0/0 ->
> 0/0/1 transition fails, would we define the device reset ioctl as a
> mechanism to flush the bogus state and place the device into the 0/0/1
> initial state?
>

Ok, userspace applications can be designed to do that. As of now with 
QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure.


>>>      
>>>> Not sure about the usefulness of 2).
>>
>> I explained this above.
>>
>>>> Also, is 4) the only way to
>>>> trigger resuming?
>> Yes.
>>
>>>> And is the change in 5) performed by the driver, or
>>>> by userspace?
>>>>   
>> By userspace.
>>
>>>> Are any other state transitions valid?
>>>>
>>>> (...)
>>>>   
>>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>>> + *    staging buffer is done.
>>>>>
>>>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>>>> could just state that data is available from (region + data_offset) to
>>>>> (region + data_offset + data_size) upon return of this read operation.
>>>>>       
>>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>>>> + *    migration region.
>>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>>>> + * e. process data.
>>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>>>> + *    operation from migration region for previous iteration is done.
>>>>>
>>>>> I think this indicate that step (f) should be to read pending_bytes, the
>>>>> read sequence is not complete until this step.  Optionally the user can
>>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>>>
>>>>> Is the use required to reach pending_bytes == 0 before changing
>>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>>>> user can exit this sequence at any time by clearing _SAVING.
>>>>
>>>> That would be transition 6) above (abort saving and continue). I think
>>>> it makes sense not to forbid this.
>>>>   
>>>>>       
>>>>>> + *
>>>>>> + * Sequence to be followed while _RESUMING device state:
>>>>>> + * While data for this device is available, repeat below steps:
>>>>>> + * a. read data_offset from where user application should write data.
>>>>>> + * b. write data of data_size to migration region from data_offset.
>>>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>>>> + *    region and resume device's state.
>>>>>
>>>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>>>> _RESUMING and clear _RUNNING, right?
>>>>   
>>
>> Sorry, I replied yes in my previous reply, but no. Default device state
>> is _STOPPED. During resume _STOPPED -> _RESUMING
> 
> Nope, it can't be, it must be _RUNNING.
> 
>>>> Transition 4) above. Do we need
>>
>> I think, its not required.
> 
> But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
> 1/0/0).
> 
>>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>>>> as well? (Probably depends on how sensible the 0/0/0 state is.)
>>>
>>> I think we must unless we require the user to transition from 0/0/1 to
>>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
>>> available.  Thanks,
>>>    
>>
>> its 0/0/0 -> 1/0/0 while resuming.
> 
> I think we're starting with different initial states, IMO there is
> absolutely no way around 0/0/1 being the initial device state.
> Anything otherwise means that we cannot add migration support to an
> existing device and maintain compatibility with existing userspace.
> Thanks,
> 
Hope above explanation helps to resolve this concern.

Thanks,
Kirti
Alex Williamson Nov. 13, 2019, 8:40 p.m. UTC | #9
On Thu, 14 Nov 2019 01:47:04 +0530
Kirti Wankhede <kwankhede@nvidia.com> wrote:

> On 11/14/2019 1:18 AM, Alex Williamson wrote:
> > On Thu, 14 Nov 2019 00:59:52 +0530
> > Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >   
> >> On 11/13/2019 11:57 PM, Alex Williamson wrote:  
> >>> On Wed, 13 Nov 2019 11:24:17 +0100
> >>> Cornelia Huck <cohuck@redhat.com> wrote:
> >>>      
> >>>> On Tue, 12 Nov 2019 15:30:05 -0700
> >>>> Alex Williamson <alex.williamson@redhat.com> wrote:
> >>>>     
> >>>>> On Tue, 12 Nov 2019 22:33:36 +0530
> >>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>>>>         
> >>>>>> - Defined MIGRATION region type and sub-type.
> >>>>>> - Used 3 bits to define VFIO device states.
> >>>>>>       Bit 0 => _RUNNING
> >>>>>>       Bit 1 => _SAVING
> >>>>>>       Bit 2 => _RESUMING
> >>>>>>       Combination of these bits defines VFIO device's state during migration
> >>>>>>       _RUNNING => Normal VFIO device running state. When its reset, it
> >>>>>> 		indicates _STOPPED state. when device is changed to
> >>>>>> 		_STOPPED, driver should stop device before write()
> >>>>>> 		returns.
> >>>>>>       _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>>>>                             start saving state of device i.e. pre-copy state
> >>>>>>       _SAVING  => vCPUs are stopped, VFIO device should be stopped, and  
> >>>>>
> >>>>> s/should/must/
> >>>>>         
> >>>>>>                   save device state,i.e. stop-n-copy state
> >>>>>>       _RESUMING => VFIO device resuming state.
> >>>>>>       _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states  
> >>>>>
> >>>>> A table might be useful here and in the uapi header to indicate valid
> >>>>> states:  
> >>>>
> >>>> I like that.
> >>>>     
> >>>>>
> >>>>> | _RESUMING | _SAVING | _RUNNING | Description
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    0    |     1    | Running, default state
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    0    |     1    | Invalid (b)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    1    |     0    | Invalid (c)
> >>>>> +-----------+---------+----------+------------------------------------------
> >>>>> |     1     |    1    |     1    | Invalid (d)
> >>>>>
> >>>>> I think we need to consider whether we define (a) as generally
> >>>>> available, for instance we might want to use it for diagnostics or a
> >>>>> fatal error condition outside of migration.
> >>>>>
> >>>>> Are there hidden assumptions between state transitions here or are
> >>>>> there specific next possible state diagrams that we need to include as
> >>>>> well?  
> >>>>
> >>>> Some kind of state-change diagram might be useful in addition to the
> >>>> textual description anyway. Let me try, just to make sure I understand
> >>>> this correctly:
> >>>>     
> >>
> >> During User application initialization, there is one more state change:
> >>
> >> 0) 0/0/0 ---- stop to running -----> 0/0/1  
> > 
> > 0/0/0 cannot be the initial state of the device, that would imply that
> > a device supporting this migration interface breaks backwards
> > compatibility with all existing vfio userspace code and that code needs
> > to learn to set the device running as part of its initialization.
> > That's absolutely unacceptable.  The initial device state must be 0/0/1.
> >   
> 
> There isn't any device state for all existing vfio userspace code right 
> now. So default its assumed to be always running.

Exactly, there is no representation of device state, therefore it's
assumed to be running, therefore when adding a representation of device
state it must default to running.

> With migration support, device states are explicitly getting added. For 
> example, in case of QEMU, while device is getting initialized, i.e. from 
> vfio_realize(), device_state is set to 0/0/0, but not required to convey 
> it to vendor driver.

But we have a 0/0/0 state, why would we intentionally keep an internal
state that's inconsistent with the device?

> Then with vfio_vmstate_change() notifier, device 
> state is changed to 0/0/1 when VM/vCPU are transitioned to running, at 
> this moment device state is conveyed to vendor driver. So vendor driver 
> doesn't see 0/0/0 state.

But the running state is the state of the device, not the VM or the
vCPU.  Sure we might want to stop the device if the VM/vCPU state is
stopped, but we must accept that the device is running when it's opened
and we shouldn't intentionally maintain inconsistent state.
 
> While resuming, for userspace, for example QEMU, device state change is 
> from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic 
> initialization is done.

I don't see why this matters, all device_state transitions are written
directly to the vendor driver.  The device is initially in 0/0/1 and
can be set to 1/0/0 for resuming with an optional transition through
0/0/0 and the vendor driver can see each state change.

> >>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1  
> >>
> >> not just gathering state info, but also copy device state to be
> >> transferred during pre-copy phase.
> >>
> >> Below 2 state are not just to tell driver to stop, those 2 differ.
> >> 2) is device state changed from running to stop, this is when VM
> >> shutdowns cleanly, no need to save device state  
> > 
> > Userspace is under no obligation to perform this state change though,
> > backwards compatibility dictates this.
> >     
> >>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0  
> >>  
> >>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0  
> >>
> >> above is transition from pre-copy phase to stop-and-copy phase, where
> >> device data should be made available to user to transfer to destination
> >> or to save it to file in case of save VM or suspend.
> >>
> >>  
> >>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0  
> >>>
> >>> I think this is to switch into resuming mode, the data will follow >  
> >>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
> >>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1  
> >>>     
> >>
> >> above can occur on migration cancelled or failed.
> >>
> >>  
> >>> I think also:
> >>>
> >>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy  
> >>
> >> that's right, this happens in case of save VM or suspend VM.
> >>  
> >>>
> >>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
> >>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
> >>> received state is incomplete.  Somehow though a user always needs to
> >>> return the device to the initial state, so how does device_state
> >>> interact with the reset ioctl?  Would this automatically manipulate
> >>> device_state back to 0/0/1?  
> >>
> >> why would reset occur on 1/0/0 -> 0/0/1 failure?  
> > 
> > The question is whether the reset ioctl automatically puts the device
> > back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
> > presumably discards much of the device state we just restored, so
> > clearly that would be undesirable.
> >     
> >> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
> >> migration has failed, then resume at source.  
> > 
> > In the scheme of the migration yet, but as far as the vfio interface is
> > concerned the user should have a path to make use of a device after
> > this point without closing it and starting over.  Thus, if a 1/0/0 ->
> > 0/0/1 transition fails, would we define the device reset ioctl as a
> > mechanism to flush the bogus state and place the device into the 0/0/1
> > initial state?
> >  
> 
> Ok, userspace applications can be designed to do that. As of now with 
> QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure.

It's simply an ioctl, we must already have access to the device file
descriptor to perform the device_state transition.  QEMU is not
necessarily the consumer of this behavior though, if transition 1/0/0
-> 0/0/1 fails in QEMU, it very well may just exit.  The vfio API
should support a defined mechanism to recover the device from this
state though, which I propose is the existing reset ioctl, which
logically implies that any device reset returns the device_state to
0/0/1.

> >>>> Not sure about the usefulness of 2).  
> >>
> >> I explained this above.
> >>  
> >>>> Also, is 4) the only way to
> >>>> trigger resuming?  
> >> Yes.
> >>  
> >>>> And is the change in 5) performed by the driver, or
> >>>> by userspace?
> >>>>     
> >> By userspace.
> >>  
> >>>> Are any other state transitions valid?
> >>>>
> >>>> (...)
> >>>>     
> >>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>>>>> + * and for _SAVING device state or stop-and-copy phase:
> >>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>>>>> + *    Kernel driver should return this read operation only after writing data to
> >>>>>> + *    staging buffer is done.  
> >>>>>
> >>>>> "staging buffer" implies a vendor driver implementation, perhaps we
> >>>>> could just state that data is available from (region + data_offset) to
> >>>>> (region + data_offset + data_size) upon return of this read operation.
> >>>>>         
> >>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>>>>> + *    migration region.
> >>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>>>>> + * e. process data.
> >>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>>>>> + *    operation from migration region for previous iteration is done.  
> >>>>>
> >>>>> I think this indicate that step (f) should be to read pending_bytes, the
> >>>>> read sequence is not complete until this step.  Optionally the user can
> >>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>>>>
> >>>>> Is the use required to reach pending_bytes == 0 before changing
> >>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
> >>>>> user can exit this sequence at any time by clearing _SAVING.  
> >>>>
> >>>> That would be transition 6) above (abort saving and continue). I think
> >>>> it makes sense not to forbid this.
> >>>>     
> >>>>>         
> >>>>>> + *
> >>>>>> + * Sequence to be followed while _RESUMING device state:
> >>>>>> + * While data for this device is available, repeat below steps:
> >>>>>> + * a. read data_offset from where user application should write data.
> >>>>>> + * b. write data of data_size to migration region from data_offset.
> >>>>>> + * c. write data_size which indicates vendor driver that data is written in
> >>>>>> + *    staging buffer. Vendor driver should read this data from migration
> >>>>>> + *    region and resume device's state.  
> >>>>>
> >>>>> The device defaults to _RUNNING state, so a prerequisite is to set
> >>>>> _RESUMING and clear _RUNNING, right?  
> >>>>     
> >>
> >> Sorry, I replied yes in my previous reply, but no. Default device state
> >> is _STOPPED. During resume _STOPPED -> _RESUMING  
> > 
> > Nope, it can't be, it must be _RUNNING.
> >   
> >>>> Transition 4) above. Do we need  
> >>
> >> I think, its not required.  
> > 
> > But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
> > 1/0/0).
> >   
> >>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
> >>>> as well? (Probably depends on how sensible the 0/0/0 state is.)  
> >>>
> >>> I think we must unless we require the user to transition from 0/0/1 to
> >>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
> >>> available.  Thanks,
> >>>      
> >>
> >> its 0/0/0 -> 1/0/0 while resuming.  
> > 
> > I think we're starting with different initial states, IMO there is
> > absolutely no way around 0/0/1 being the initial device state.
> > Anything otherwise means that we cannot add migration support to an
> > existing device and maintain compatibility with existing userspace.
> > Thanks,
> >   
> Hope above explanation helps to resolve this concern.

Not really, I stand by that the default state must reflect previous
assumptions and therefore it must be 0/0/1 and additionally we should
not maintain state in QEMU intentionally inconsistent with the device
state.  Thanks,

Alex
Yan Zhao Nov. 14, 2019, 12:36 a.m. UTC | #10
On Thu, Nov 14, 2019 at 03:02:55AM +0800, Kirti Wankhede wrote:
> 
> 
> On 11/13/2019 8:53 AM, Yan Zhao wrote:
> > On Wed, Nov 13, 2019 at 06:30:05AM +0800, Alex Williamson wrote:
> >> On Tue, 12 Nov 2019 22:33:36 +0530
> >> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> >>
> >>> - Defined MIGRATION region type and sub-type.
> >>> - Used 3 bits to define VFIO device states.
> >>>      Bit 0 => _RUNNING
> >>>      Bit 1 => _SAVING
> >>>      Bit 2 => _RESUMING
> >>>      Combination of these bits defines VFIO device's state during migration
> >>>      _RUNNING => Normal VFIO device running state. When its reset, it
> >>> 		indicates _STOPPED state. when device is changed to
> >>> 		_STOPPED, driver should stop device before write()
> >>> 		returns.
> >>>      _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
> >>>                            start saving state of device i.e. pre-copy state
> >>>      _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
> >>
> >> s/should/must/
> >>
> >>>                  save device state,i.e. stop-n-copy state
> >>>      _RESUMING => VFIO device resuming state.
> >>>      _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
> >>
> >> A table might be useful here and in the uapi header to indicate valid
> >> states:
> >>
> >> | _RESUMING | _SAVING | _RUNNING | Description
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    0    |     1    | Running, default state
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    1    |     0    | Stopped, migration interface in save mode
> >> +-----------+---------+----------+------------------------------------------
> >> |     0     |    1    |     1    | Running, save mode interface, iterative
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    0    |     0    | Stopped, migration resume interface active
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    0    |     1    | Invalid (b)
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    1    |     0    | Invalid (c)
> >> +-----------+---------+----------+------------------------------------------
> >> |     1     |    1    |     1    | Invalid (d)
> >>
> >> I think we need to consider whether we define (a) as generally
> >> available, for instance we might want to use it for diagnostics or a
> >> fatal error condition outside of migration.
> >>
> 
> We have to set it as init state. I'll add this it.
> 
> >> Are there hidden assumptions between state transitions here or are
> >> there specific next possible state diagrams that we need to include as
> >> well?
> >>
> >> I'm curious if Intel agrees with the states marked invalid with their
> >> push for post-copy support.
> >>
> > hi Alex and Kirti,
> > Actually, for postcopy, I think we anyway need an extra POSTCOPY state
> > introduced. Reasons as below:
> > - in the target side, _RSESUMING state is set in the beginning of
> >    migration. It cannot be used as a state to inform device of that
> >    currently it's in postcopy state and device DMAs are to be trapped and
> >    pre-faulted.
> >    we also cannot use state (_RESUMING + _RUNNING) as an indicator of
> >    postcopy, because before device & vm running in target side, some device
> >    state are already loaded (e.g. page tables, pending workloads),
> >    target side can do pre-pagefault at that period before target vm up.
> > - in the source side, after device is stopped, postcopy needs saving
> >    device state only (as compared to device state + remaining dirty
> >    pages in precopy). state (!_RUNNING + _SAVING) here again cannot
> >    differentiate precopy and postcopy here.
> > 
> >>>      Bits 3 - 31 are reserved for future use. User should perform
> >>>      read-modify-write operation on this field.
> >>> - Defined vfio_device_migration_info structure which will be placed at 0th
> >>>    offset of migration region to get/set VFIO device related information.
> >>>    Defined members of structure and usage on read/write access:
> >>>      * device_state: (read/write)
> >>>          To convey VFIO device state to be transitioned to. Only 3 bits are
> >>> 	used as of now, Bits 3 - 31 are reserved for future use.
> >>>      * pending bytes: (read only)
> >>>          To get pending bytes yet to be migrated for VFIO device.
> >>>      * data_offset: (read only)
> >>>          To get data offset in migration region from where data exist
> >>> 	during _SAVING and from where data should be written by user space
> >>> 	application during _RESUMING state.
> >>>      * data_size: (read/write)
> >>>          To get and set size in bytes of data copied in migration region
> >>> 	during _SAVING and _RESUMING state.
> >>>
> >>> Migration region looks like:
> >>>   ------------------------------------------------------------------
> >>> |vfio_device_migration_info|    data section                      |
> >>> |                          |     ///////////////////////////////  |
> >>>   ------------------------------------------------------------------
> >>>   ^                              ^
> >>>   offset 0-trapped part        data_offset
> >>>
> >>> Structure vfio_device_migration_info is always followed by data section
> >>> in the region, so data_offset will always be non-0. Offset from where data
> >>> to be copied is decided by kernel driver, data section can be trapped or
> >>> mapped depending on how kernel driver defines data section.
> >>> Data section partition can be defined as mapped by sparse mmap capability.
> >>> If mmapped, then data_offset should be page aligned, where as initial
> >>> section which contain vfio_device_migration_info structure might not end
> >>> at offset which is page aligned.
> >>> Vendor driver should decide whether to partition data section and how to
> >>> partition the data section. Vendor driver should return data_offset
> >>> accordingly.
> >>>
> >>> For user application, data is opaque. User should write data in the same
> >>> order as received.
> >>>
> >>> Signed-off-by: Kirti Wankhede <kwankhede@nvidia.com>
> >>> Reviewed-by: Neo Jia <cjia@nvidia.com>
> >>> ---
> >>>   include/uapi/linux/vfio.h | 108 ++++++++++++++++++++++++++++++++++++++++++++++
> >>>   1 file changed, 108 insertions(+)
> >>>
> >>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> >>> index 9e843a147ead..35b09427ad9f 100644
> >>> --- a/include/uapi/linux/vfio.h
> >>> +++ b/include/uapi/linux/vfio.h
> >>> @@ -305,6 +305,7 @@ struct vfio_region_info_cap_type {
> >>>   #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
> >>>   #define VFIO_REGION_TYPE_GFX                    (1)
> >>>   #define VFIO_REGION_TYPE_CCW			(2)
> >>> +#define VFIO_REGION_TYPE_MIGRATION              (3)
> >>>   
> >>>   /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >>>   
> >>> @@ -379,6 +380,113 @@ struct vfio_region_gfx_edid {
> >>>   /* sub-types for VFIO_REGION_TYPE_CCW */
> >>>   #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
> >>>   
> >>> +/* sub-types for VFIO_REGION_TYPE_MIGRATION */
> >>> +#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
> >>> +
> >>> +/*
> >>> + * Structure vfio_device_migration_info is placed at 0th offset of
> >>> + * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
> >>> + * information. Field accesses from this structure are only supported at their
> >>> + * native width and alignment, otherwise the result is undefined and vendor
> >>> + * drivers should return an error.
> >>> + *
> >>> + * device_state: (read/write)
> >>> + *      To indicate vendor driver the state VFIO device should be transitioned
> >>> + *      to. If device state transition fails, write on this field return error.
> >>> + *      It consists of 3 bits:
> >>> + *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
> >>
> >> Let's use set/cleared or 1/0 to indicate bit values, 'reset' is somewhat
> >> ambiguous.
> 
> Ok. Updating it.
> 
> >>
> >>> + *        _STOPPED state. When device is changed to _STOPPED, driver should stop
> >>> + *        device before write() returns.
> >>> + *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
> >>> + *        should start gathering device state information which will be provided
> >>> + *        to VFIO user space application to save device's state.
> >>> + *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
> >>> + *        prepare to resume device, data provided through migration region
> >>> + *        should be used to resume device.
> >>> + *      Bits 3 - 31 are reserved for future use. User should perform
> >>> + *      read-modify-write operation on this field.
> >>> + *      _SAVING and _RESUMING bits set at the same time is invalid state.
> >>> + *	Similarly _RUNNING and _RESUMING bits set is invalid state.
> >>> + *
> >>> + * pending bytes: (read only)
> >>> + *      Number of pending bytes yet to be migrated from vendor driver
> >>> + *
> >>> + * data_offset: (read only)
> >>> + *      User application should read data_offset in migration region from where
> >>> + *      user application should read device data during _SAVING state or write
> >>> + *      device data during _RESUMING state. See below for detail of sequence to
> >>> + *      be followed.
> >>> + *
> >>> + * data_size: (read/write)
> >>> + *      User application should read data_size to get size of data copied in
> >>> + *      bytes in migration region during _SAVING state and write size of data
> >>> + *      copied in bytes in migration region during _RESUMING state.
> >>> + *
> >>> + * Migration region looks like:
> >>> + *  ------------------------------------------------------------------
> >>> + * |vfio_device_migration_info|    data section                      |
> >>> + * |                          |     ///////////////////////////////  |
> >>> + * ------------------------------------------------------------------
> >>> + *   ^                              ^
> >>> + *  offset 0-trapped part        data_offset
> >>> + *
> >>> + * Structure vfio_device_migration_info is always followed by data section in
> >>> + * the region, so data_offset will always be non-0. Offset from where data is
> >>> + * copied is decided by kernel driver, data section can be trapped or mapped
> >>> + * or partitioned, depending on how kernel driver defines data section.
> >>> + * Data section partition can be defined as mapped by sparse mmap capability.
> >>> + * If mmapped, then data_offset should be page aligned, where as initial section
> >>> + * which contain vfio_device_migration_info structure might not end at offset
> >>> + * which is page aligned.
> >>
> >> "The user is not required to to access via mmap regardless of the
> >> region mmap capabilities."
> >>
> > But once the user decides to access via mmap, it has to read data of
> > data_size each time, otherwise the vendor driver has no idea of how many
> > data are already read from user. Agree?
> > 
> 
> that's right.
> 
> >>> + * Vendor driver should decide whether to partition data section and how to
> >>> + * partition the data section. Vendor driver should return data_offset
> >>> + * accordingly.
> >>> + *
> >>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
> >>> + * and for _SAVING device state or stop-and-copy phase:
> >>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
> >>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
> >>> + *    Kernel driver should return this read operation only after writing data to
> >>> + *    staging buffer is done.
> > May I know under what condition this data_offset will be changed per
> > each iteration from a-f ?
> > 
> 
> Its upto vendor driver, if vendor driver maintains multiple partitions 
> in data section.
>
So, do you mean each time before doing b (reading data_offset), step a
(reading pending_bytes) is mandatory, otherwise the vendor driver does
not know which data_offset is?
Then, any lock to wrap step a and b to ensure atomic?

Thanks
Yan


> >>
> >> "staging buffer" implies a vendor driver implementation, perhaps we
> >> could just state that data is available from (region + data_offset) to
> >> (region + data_offset + data_size) upon return of this read operation.
> >>
> 
> Makes sense. Updating it.
> 
> >>> + * c. read data_size, amount of data in bytes written by vendor driver in
> >>> + *    migration region.
> >>> + * d. read data_size bytes of data from data_offset in the migration region.
> >>> + * e. process data.
> >>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
> >>> + *    operation from migration region for previous iteration is done.
> >>
> >> I think this indicate that step (f) should be to read pending_bytes, the
> >> read sequence is not complete until this step.  Optionally the user can
> >> then proceed to step (b).  There are no read side-effects of (a) afaict.
> >>
> 
> I tried to reword this sequence to be more specific:
> 
> * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy 
> * phase and for _SAVING device state or stop-and-copy phase:
> * a. read pending_bytes, indicates start of new iteration to get device 
> *    data. If there was previous iteration, then this read operation
> *    indicates previous iteration is done. If pending_bytes > 0, go
> *    through below steps.
> * b. read data_offset, indicates kernel driver to make data available
> *    through data section. Kernel driver should return this read
> *    operation only after data is available from (region + data_offset)
> *    to (region + data_offset + data_size).
> * c. read data_size, amount of data in bytes available through migration
> *    region.
> * d. read data of data_size bytes from (region + data_offset) from
> *    migration region.
> * e. process data.
> * f. Loop through a to e.
> 
> Hope this is more clear.
> 
> >> Is the use required to reach pending_bytes == 0 before changing
> >> device_state, particularly transitioning to !_RUNNING?
> 
> No, its not necessary to reach till pending_bytes==0 in pre-copy phase.
> 
> >>  Presumably the
> >> user can exit this sequence at any time by clearing _SAVING.
> 
> In that case device state data is not complete, which will result in not 
> able to resume device with that data.
> In stop-and-copy phase, user should iterate till pending_bytes is 0.
> 
> >>
> >>> + *
> >>> + * Sequence to be followed while _RESUMING device state:
> >>> + * While data for this device is available, repeat below steps:
> >>> + * a. read data_offset from where user application should write data.
> > before proceed to step b, need to read data_size from vendor driver to determine
> > the max len of data to write. I think it's necessary in such a condition
> > that source vendor driver and target vendor driver do not offer data sections of
> > the same size. e.g. in source side, the data section is of size 100M,
> > but in target side, the data section is only of 50M size.
> > rather than simply fail, loop and write seems better.
> > 
> 
> Makes sense. Doing this change for next version.
> 
> > Thanks
> > Yan
> >>> + * b. write data of data_size to migration region from data_offset.
> >>> + * c. write data_size which indicates vendor driver that data is written in
> >>> + *    staging buffer. Vendor driver should read this data from migration
> >>> + *    region and resume device's state.
> >>
> >> The device defaults to _RUNNING state, so a prerequisite is to set
> >> _RESUMING and clear _RUNNING, right?
> 
> Yes.
> 
> >>
> >>> + *
> >>> + * For user application, data is opaque. User should write data in the same
> >>> + * order as received.
> >>> + */
> >>> +
> >>> +struct vfio_device_migration_info {
> >>> +	__u32 device_state;         /* VFIO device state */
> >>> +#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
> >>> +#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
> >>> +#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
> >>> +#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
> >>> +				     VFIO_DEVICE_STATE_SAVING |  \
> >>> +				     VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
> >>> +					    VFIO_DEVICE_STATE_RESUMING)
> >>> +
> >>> +#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
> >>> +					    VFIO_DEVICE_STATE_RESUMING)
> >>
> >> These seem difficult to use, maybe we just need a
> >> VFIO_DEVICE_STATE_VALID macro?
> >>
> >> #define VFIO_DEVICE_STATE_VALID(state) \
> >>    (state & VFIO_DEVICE_STATE_RESUMING ? \
> >>    (state & VFIO_DEVICE_STATE_MASK) == VFIO_DEVICE_STATE_RESUMING : 1)
> >>
> 
> This will not be work when use of other bits gets added in future. 
> That's the reason I preferred to add individual invalid states which 
> user should check.
> 
> Thanks,
> Kirti
> 
> >> Thanks,
> >> Alex
> >>
> >>> +	__u32 reserved;
> >>> +	__u64 pending_bytes;
> >>> +	__u64 data_offset;
> >>> +	__u64 data_size;
> >>> +} __attribute__((packed));
> >>> +
> >>>   /*
> >>>    * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
> >>>    * which allows direct access to non-MSIX registers which happened to be within
> >>
Kirti Wankhede Nov. 14, 2019, 6:49 p.m. UTC | #11
On 11/14/2019 2:10 AM, Alex Williamson wrote:
> On Thu, 14 Nov 2019 01:47:04 +0530
> Kirti Wankhede <kwankhede@nvidia.com> wrote:
> 
>> On 11/14/2019 1:18 AM, Alex Williamson wrote:
>>> On Thu, 14 Nov 2019 00:59:52 +0530
>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>    
>>>> On 11/13/2019 11:57 PM, Alex Williamson wrote:
>>>>> On Wed, 13 Nov 2019 11:24:17 +0100
>>>>> Cornelia Huck <cohuck@redhat.com> wrote:
>>>>>       
>>>>>> On Tue, 12 Nov 2019 15:30:05 -0700
>>>>>> Alex Williamson <alex.williamson@redhat.com> wrote:
>>>>>>      
>>>>>>> On Tue, 12 Nov 2019 22:33:36 +0530
>>>>>>> Kirti Wankhede <kwankhede@nvidia.com> wrote:
>>>>>>>          
>>>>>>>> - Defined MIGRATION region type and sub-type.
>>>>>>>> - Used 3 bits to define VFIO device states.
>>>>>>>>        Bit 0 => _RUNNING
>>>>>>>>        Bit 1 => _SAVING
>>>>>>>>        Bit 2 => _RESUMING
>>>>>>>>        Combination of these bits defines VFIO device's state during migration
>>>>>>>>        _RUNNING => Normal VFIO device running state. When its reset, it
>>>>>>>> 		indicates _STOPPED state. when device is changed to
>>>>>>>> 		_STOPPED, driver should stop device before write()
>>>>>>>> 		returns.
>>>>>>>>        _SAVING | _RUNNING => vCPUs are running, VFIO device is running but
>>>>>>>>                              start saving state of device i.e. pre-copy state
>>>>>>>>        _SAVING  => vCPUs are stopped, VFIO device should be stopped, and
>>>>>>>
>>>>>>> s/should/must/
>>>>>>>          
>>>>>>>>                    save device state,i.e. stop-n-copy state
>>>>>>>>        _RESUMING => VFIO device resuming state.
>>>>>>>>        _SAVING | _RESUMING and _RUNNING | _RESUMING => Invalid states
>>>>>>>
>>>>>>> A table might be useful here and in the uapi header to indicate valid
>>>>>>> states:
>>>>>>
>>>>>> I like that.
>>>>>>      
>>>>>>>
>>>>>>> | _RESUMING | _SAVING | _RUNNING | Description
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    0    |     0    | Stopped, not saving or resuming (a)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    0    |     1    | Running, default state
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    1    |     0    | Stopped, migration interface in save mode
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     0     |    1    |     1    | Running, save mode interface, iterative
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    0    |     0    | Stopped, migration resume interface active
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    0    |     1    | Invalid (b)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    1    |     0    | Invalid (c)
>>>>>>> +-----------+---------+----------+------------------------------------------
>>>>>>> |     1     |    1    |     1    | Invalid (d)
>>>>>>>
>>>>>>> I think we need to consider whether we define (a) as generally
>>>>>>> available, for instance we might want to use it for diagnostics or a
>>>>>>> fatal error condition outside of migration.
>>>>>>>
>>>>>>> Are there hidden assumptions between state transitions here or are
>>>>>>> there specific next possible state diagrams that we need to include as
>>>>>>> well?
>>>>>>
>>>>>> Some kind of state-change diagram might be useful in addition to the
>>>>>> textual description anyway. Let me try, just to make sure I understand
>>>>>> this correctly:
>>>>>>      
>>>>
>>>> During User application initialization, there is one more state change:
>>>>
>>>> 0) 0/0/0 ---- stop to running -----> 0/0/1
>>>
>>> 0/0/0 cannot be the initial state of the device, that would imply that
>>> a device supporting this migration interface breaks backwards
>>> compatibility with all existing vfio userspace code and that code needs
>>> to learn to set the device running as part of its initialization.
>>> That's absolutely unacceptable.  The initial device state must be 0/0/1.
>>>    
>>
>> There isn't any device state for all existing vfio userspace code right
>> now. So default its assumed to be always running.
> 
> Exactly, there is no representation of device state, therefore it's
> assumed to be running, therefore when adding a representation of device
> state it must default to running.
> 
>> With migration support, device states are explicitly getting added. For
>> example, in case of QEMU, while device is getting initialized, i.e. from
>> vfio_realize(), device_state is set to 0/0/0, but not required to convey
>> it to vendor driver.
> 
> But we have a 0/0/0 state, why would we intentionally keep an internal
> state that's inconsistent with the device?
> 
>> Then with vfio_vmstate_change() notifier, device
>> state is changed to 0/0/1 when VM/vCPU are transitioned to running, at
>> this moment device state is conveyed to vendor driver. So vendor driver
>> doesn't see 0/0/0 state.
> 
> But the running state is the state of the device, not the VM or the
> vCPU.  Sure we might want to stop the device if the VM/vCPU state is
> stopped, but we must accept that the device is running when it's opened
> and we shouldn't intentionally maintain inconsistent state.
>   
>> While resuming, for userspace, for example QEMU, device state change is
>> from 0/0/0 to 1/0/0, vendor driver see 1/0/0 after device basic
>> initialization is done.
> 
> I don't see why this matters, all device_state transitions are written
> directly to the vendor driver.  The device is initially in 0/0/1 and
> can be set to 1/0/0 for resuming with an optional transition through
> 0/0/0 and the vendor driver can see each state change.
> 
>>>>>> 1) 0/0/1 ---(trigger driver to start gathering state info)---> 0/1/1
>>>>
>>>> not just gathering state info, but also copy device state to be
>>>> transferred during pre-copy phase.
>>>>
>>>> Below 2 state are not just to tell driver to stop, those 2 differ.
>>>> 2) is device state changed from running to stop, this is when VM
>>>> shutdowns cleanly, no need to save device state
>>>
>>> Userspace is under no obligation to perform this state change though,
>>> backwards compatibility dictates this.
>>>      
>>>>>> 2) 0/0/1 ---(tell driver to stop)---> 0/0/0
>>>>   
>>>>>> 3) 0/1/1 ---(tell driver to stop)---> 0/1/0
>>>>
>>>> above is transition from pre-copy phase to stop-and-copy phase, where
>>>> device data should be made available to user to transfer to destination
>>>> or to save it to file in case of save VM or suspend.
>>>>
>>>>   
>>>>>> 4) 0/0/1 ---(tell driver to resume with provided info)---> 1/0/0
>>>>>
>>>>> I think this is to switch into resuming mode, the data will follow >
>>>>>> 5) 1/0/0 ---(driver is ready)---> 0/0/1
>>>>>> 6) 0/1/1 ---(tell driver to stop saving)---> 0/0/1
>>>>>      
>>>>
>>>> above can occur on migration cancelled or failed.
>>>>
>>>>   
>>>>> I think also:
>>>>>
>>>>> 0/0/1 --> 0/1/0 If user chooses to go directly to stop and copy
>>>>
>>>> that's right, this happens in case of save VM or suspend VM.
>>>>   
>>>>>
>>>>> 0/0/0 and 0/0/1 should be reachable from any state, though I could see
>>>>> that a vendor driver could fail transition from 1/0/0 -> 0/0/1 if the
>>>>> received state is incomplete.  Somehow though a user always needs to
>>>>> return the device to the initial state, so how does device_state
>>>>> interact with the reset ioctl?  Would this automatically manipulate
>>>>> device_state back to 0/0/1?
>>>>
>>>> why would reset occur on 1/0/0 -> 0/0/1 failure?
>>>
>>> The question is whether the reset ioctl automatically puts the device
>>> back into the initial state, 0/0/1.  A reset from 1/0/0 -> 0/0/1
>>> presumably discards much of the device state we just restored, so
>>> clearly that would be undesirable.
>>>      
>>>> 1/0/0 -> 0/0/1 fails, then user should convey that to source that
>>>> migration has failed, then resume at source.
>>>
>>> In the scheme of the migration yet, but as far as the vfio interface is
>>> concerned the user should have a path to make use of a device after
>>> this point without closing it and starting over.  Thus, if a 1/0/0 ->
>>> 0/0/1 transition fails, would we define the device reset ioctl as a
>>> mechanism to flush the bogus state and place the device into the 0/0/1
>>> initial state?
>>>   
>>
>> Ok, userspace applications can be designed to do that. As of now with
>> QEMU, I don't see a way to reset device on 1/0/0-> 0/0/1 failure.
> 
> It's simply an ioctl, we must already have access to the device file
> descriptor to perform the device_state transition.  QEMU is not
> necessarily the consumer of this behavior though, if transition 1/0/0
> -> 0/0/1 fails in QEMU, it very well may just exit.  The vfio API
> should support a defined mechanism to recover the device from this
> state though, which I propose is the existing reset ioctl, which
> logically implies that any device reset returns the device_state to
> 0/0/1.
> 

Ok.

>>> >>> Not sure about the usefulness of 2).
>>>>
>>>> I explained this above.
>>>>   
>>>>>> Also, is 4) the only way to
>>>>>> trigger resuming?
>>>> Yes.
>>>>   
>>>>>> And is the change in 5) performed by the driver, or
>>>>>> by userspace?
>>>>>>      
>>>> By userspace.
>>>>   
>>>>>> Are any other state transitions valid?
>>>>>>
>>>>>> (...)
>>>>>>      
>>>>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>>>>> + *    staging buffer is done.
>>>>>>>
>>>>>>> "staging buffer" implies a vendor driver implementation, perhaps we
>>>>>>> could just state that data is available from (region + data_offset) to
>>>>>>> (region + data_offset + data_size) upon return of this read operation.
>>>>>>>          
>>>>>>>> + * c. read data_size, amount of data in bytes written by vendor driver in
>>>>>>>> + *    migration region.
>>>>>>>> + * d. read data_size bytes of data from data_offset in the migration region.
>>>>>>>> + * e. process data.
>>>>>>>> + * f. Loop through a to e. Next read on pending_bytes indicates that read data
>>>>>>>> + *    operation from migration region for previous iteration is done.
>>>>>>>
>>>>>>> I think this indicate that step (f) should be to read pending_bytes, the
>>>>>>> read sequence is not complete until this step.  Optionally the user can
>>>>>>> then proceed to step (b).  There are no read side-effects of (a) afaict.
>>>>>>>
>>>>>>> Is the use required to reach pending_bytes == 0 before changing
>>>>>>> device_state, particularly transitioning to !_RUNNING?  Presumably the
>>>>>>> user can exit this sequence at any time by clearing _SAVING.
>>>>>>
>>>>>> That would be transition 6) above (abort saving and continue). I think
>>>>>> it makes sense not to forbid this.
>>>>>>      
>>>>>>>          
>>>>>>>> + *
>>>>>>>> + * Sequence to be followed while _RESUMING device state:
>>>>>>>> + * While data for this device is available, repeat below steps:
>>>>>>>> + * a. read data_offset from where user application should write data.
>>>>>>>> + * b. write data of data_size to migration region from data_offset.
>>>>>>>> + * c. write data_size which indicates vendor driver that data is written in
>>>>>>>> + *    staging buffer. Vendor driver should read this data from migration
>>>>>>>> + *    region and resume device's state.
>>>>>>>
>>>>>>> The device defaults to _RUNNING state, so a prerequisite is to set
>>>>>>> _RESUMING and clear _RUNNING, right?
>>>>>>      
>>>>
>>>> Sorry, I replied yes in my previous reply, but no. Default device state
>>>> is _STOPPED. During resume _STOPPED -> _RESUMING
>>>
>>> Nope, it can't be, it must be _RUNNING.
>>>    
>>>>>> Transition 4) above. Do we need
>>>>
>>>> I think, its not required.
>>>
>>> But above we say it's the only way to trigger resuming (4 was 0/0/1 ->
>>> 1/0/0).
>>>    
>>>>>> 7) 0/0/0 ---(tell driver to resume with provided info)---> 1/0/0
>>>>>> as well? (Probably depends on how sensible the 0/0/0 state is.)
>>>>>
>>>>> I think we must unless we require the user to transition from 0/0/1 to
>>>>> 1/0/0 in a single operation, but I'd prefer to make 0/0/0 generally
>>>>> available.  Thanks,
>>>>>       
>>>>
>>>> its 0/0/0 -> 1/0/0 while resuming.
>>>
>>> I think we're starting with different initial states, IMO there is
>>> absolutely no way around 0/0/1 being the initial device state.
>>> Anything otherwise means that we cannot add migration support to an
>>> existing device and maintain compatibility with existing userspace.
>>> Thanks,
>>>    
>> Hope above explanation helps to resolve this concern.
> 
> Not really, I stand by that the default state must reflect previous
> assumptions and therefore it must be 0/0/1 and additionally we should
> not maintain state in QEMU intentionally inconsistent with the device
> state.  Thanks,
> 

Ok. Will change that

Thanks,
Kirti
Kirti Wankhede Nov. 14, 2019, 6:55 p.m. UTC | #12
>>>>> + * Vendor driver should decide whether to partition data section and how to
>>>>> + * partition the data section. Vendor driver should return data_offset
>>>>> + * accordingly.
>>>>> + *
>>>>> + * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
>>>>> + * and for _SAVING device state or stop-and-copy phase:
>>>>> + * a. read pending_bytes. If pending_bytes > 0, go through below steps.
>>>>> + * b. read data_offset, indicates kernel driver to write data to staging buffer.
>>>>> + *    Kernel driver should return this read operation only after writing data to
>>>>> + *    staging buffer is done.
>>> May I know under what condition this data_offset will be changed per
>>> each iteration from a-f ?
>>>
>>
>> Its upto vendor driver, if vendor driver maintains multiple partitions
>> in data section.
>>
> So, do you mean each time before doing b (reading data_offset), step a
> (reading pending_bytes) is mandatory, otherwise the vendor driver does
> not know which data_offset is?

pending_bytes will only tell bytes remaining to transfer from vendor 
driver. On read operation on data_offset, vendor driver should decide 
what to send depending on where he is making data available to userspace.

> Then, any lock to wrap step a and b to ensure atomic?

With current QEMU implementation, where migration is single thread, 
there is not need of lock yet. But when we add multi-threaded support 
may be in future then locks will be required in userspace side.

Thanks,
Kirti

Patch
diff mbox series

diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..35b09427ad9f 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -305,6 +305,7 @@  struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_PCI_VENDOR_MASK	(0xffff)
 #define VFIO_REGION_TYPE_GFX                    (1)
 #define VFIO_REGION_TYPE_CCW			(2)
+#define VFIO_REGION_TYPE_MIGRATION              (3)
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */
 
@@ -379,6 +380,113 @@  struct vfio_region_gfx_edid {
 /* sub-types for VFIO_REGION_TYPE_CCW */
 #define VFIO_REGION_SUBTYPE_CCW_ASYNC_CMD	(1)
 
+/* sub-types for VFIO_REGION_TYPE_MIGRATION */
+#define VFIO_REGION_SUBTYPE_MIGRATION           (1)
+
+/*
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related migration
+ * information. Field accesses from this structure are only supported at their
+ * native width and alignment, otherwise the result is undefined and vendor
+ * drivers should return an error.
+ *
+ * device_state: (read/write)
+ *      To indicate vendor driver the state VFIO device should be transitioned
+ *      to. If device state transition fails, write on this field return error.
+ *      It consists of 3 bits:
+ *      - If bit 0 set, indicates _RUNNING state. When its reset, that indicates
+ *        _STOPPED state. When device is changed to _STOPPED, driver should stop
+ *        device before write() returns.
+ *      - If bit 1 set, indicates _SAVING state. When set, that indicates driver
+ *        should start gathering device state information which will be provided
+ *        to VFIO user space application to save device's state.
+ *      - If bit 2 set, indicates _RESUMING state. When set, that indicates
+ *        prepare to resume device, data provided through migration region
+ *        should be used to resume device.
+ *      Bits 3 - 31 are reserved for future use. User should perform
+ *      read-modify-write operation on this field.
+ *      _SAVING and _RESUMING bits set at the same time is invalid state.
+ *	Similarly _RUNNING and _RESUMING bits set is invalid state.
+ *
+ * pending bytes: (read only)
+ *      Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *      User application should read data_offset in migration region from where
+ *      user application should read device data during _SAVING state or write
+ *      device data during _RESUMING state. See below for detail of sequence to
+ *      be followed.
+ *
+ * data_size: (read/write)
+ *      User application should read data_size to get size of data copied in
+ *      bytes in migration region during _SAVING state and write size of data
+ *      copied in bytes in migration region during _RESUMING state.
+ *
+ * Migration region looks like:
+ *  ------------------------------------------------------------------
+ * |vfio_device_migration_info|    data section                      |
+ * |                          |     ///////////////////////////////  |
+ * ------------------------------------------------------------------
+ *   ^                              ^
+ *  offset 0-trapped part        data_offset
+ *
+ * Structure vfio_device_migration_info is always followed by data section in
+ * the region, so data_offset will always be non-0. Offset from where data is
+ * copied is decided by kernel driver, data section can be trapped or mapped
+ * or partitioned, depending on how kernel driver defines data section.
+ * Data section partition can be defined as mapped by sparse mmap capability.
+ * If mmapped, then data_offset should be page aligned, where as initial section
+ * which contain vfio_device_migration_info structure might not end at offset
+ * which is page aligned.
+ * Vendor driver should decide whether to partition data section and how to
+ * partition the data section. Vendor driver should return data_offset
+ * accordingly.
+ *
+ * Sequence to be followed for _SAVING|_RUNNING device state or pre-copy phase
+ * and for _SAVING device state or stop-and-copy phase:
+ * a. read pending_bytes. If pending_bytes > 0, go through below steps.
+ * b. read data_offset, indicates kernel driver to write data to staging buffer.
+ *    Kernel driver should return this read operation only after writing data to
+ *    staging buffer is done.
+ * c. read data_size, amount of data in bytes written by vendor driver in
+ *    migration region.
+ * d. read data_size bytes of data from data_offset in the migration region.
+ * e. process data.
+ * f. Loop through a to e. Next read on pending_bytes indicates that read data
+ *    operation from migration region for previous iteration is done.
+ *
+ * Sequence to be followed while _RESUMING device state:
+ * While data for this device is available, repeat below steps:
+ * a. read data_offset from where user application should write data.
+ * b. write data of data_size to migration region from data_offset.
+ * c. write data_size which indicates vendor driver that data is written in
+ *    staging buffer. Vendor driver should read this data from migration
+ *    region and resume device's state.
+ *
+ * For user application, data is opaque. User should write data in the same
+ * order as received.
+ */
+
+struct vfio_device_migration_info {
+	__u32 device_state;         /* VFIO device state */
+#define VFIO_DEVICE_STATE_RUNNING   (1 << 0)
+#define VFIO_DEVICE_STATE_SAVING    (1 << 1)
+#define VFIO_DEVICE_STATE_RESUMING  (1 << 2)
+#define VFIO_DEVICE_STATE_MASK      (VFIO_DEVICE_STATE_RUNNING | \
+				     VFIO_DEVICE_STATE_SAVING |  \
+				     VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE1    (VFIO_DEVICE_STATE_SAVING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+
+#define VFIO_DEVICE_STATE_INVALID_CASE2    (VFIO_DEVICE_STATE_RUNNING | \
+					    VFIO_DEVICE_STATE_RESUMING)
+	__u32 reserved;
+	__u64 pending_bytes;
+	__u64 data_offset;
+	__u64 data_size;
+} __attribute__((packed));
+
 /*
  * The MSIX mappable capability informs that MSIX data of a BAR can be mmapped
  * which allows direct access to non-MSIX registers which happened to be within