[RFC,4/9] vfio-pci: register default dynamic-trap-bar-info region
diff mbox series

Message ID 20191205032650.29794-1-yan.y.zhao@intel.com
State New
Headers show
Series
  • Introduce mediate ops in vfio-pci
Related show

Commit Message

Yan Zhao Dec. 5, 2019, 3:26 a.m. UTC
Dynamic trap bar info region is a channel for QEMU and vendor driver to
communicate dynamic trap info. It is of type
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.

This region has two fields: dt_fd and trap.
When QEMU detects a device regions of this type, it will create an
eventfd and write its eventfd id to dt_fd field.
When vendor drivre signals this eventfd, QEMU reads trap field of this
info region.
- If trap is true, QEMU would search the device's PCI BAR
regions and disable all the sparse mmaped subregions (if the sparse
mmaped subregion is disablable).
- If trap is false, QEMU would re-enable those subregions.

A typical usage is
1. vendor driver first cuts its bar 0 into several sections, all in a
sparse mmap array. So initally, all its bar 0 are passthroughed.
2. vendor driver specifys part of bar 0 sections to be disablable.
3. on migration starts, vendor driver signals dt_fd and set trap to true
to notify QEMU disabling the bar 0 sections of disablable flags on.
4. QEMU disables those bar 0 section and hence let vendor driver be able
to trap access of bar 0 registers and make dirty page tracking possible.
5. on migration failure, vendor driver signals dt_fd to QEMU again.
QEMU reads trap field of this info region which is false and QEMU
re-passthrough the whole bar 0 region.

Vendor driver specifies whether it supports dynamic-trap-bar-info region
through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it will create a default
dynamic_trap_bar_info region on behalf of vendor driver with region len=0
and region->ops=null.
Vvendor driver should override this region's len, flags, rw, mmap in its
vfio_pci_mediate_ops.

Cc: Kevin Tian <kevin.tian@intel.com>

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
 include/linux/vfio.h        |  3 ++-
 include/uapi/linux/vfio.h   | 11 +++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

Comments

Alex Williamson Dec. 5, 2019, 11:55 p.m. UTC | #1
On Wed,  4 Dec 2019 22:26:50 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> Dynamic trap bar info region is a channel for QEMU and vendor driver to
> communicate dynamic trap info. It is of type
> VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> 
> This region has two fields: dt_fd and trap.
> When QEMU detects a device regions of this type, it will create an
> eventfd and write its eventfd id to dt_fd field.
> When vendor drivre signals this eventfd, QEMU reads trap field of this
> info region.
> - If trap is true, QEMU would search the device's PCI BAR
> regions and disable all the sparse mmaped subregions (if the sparse
> mmaped subregion is disablable).
> - If trap is false, QEMU would re-enable those subregions.
> 
> A typical usage is
> 1. vendor driver first cuts its bar 0 into several sections, all in a
> sparse mmap array. So initally, all its bar 0 are passthroughed.
> 2. vendor driver specifys part of bar 0 sections to be disablable.
> 3. on migration starts, vendor driver signals dt_fd and set trap to true
> to notify QEMU disabling the bar 0 sections of disablable flags on.
> 4. QEMU disables those bar 0 section and hence let vendor driver be able
> to trap access of bar 0 registers and make dirty page tracking possible.
> 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> QEMU reads trap field of this info region which is false and QEMU
> re-passthrough the whole bar 0 region.
> 
> Vendor driver specifies whether it supports dynamic-trap-bar-info region
> through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> vfio_pci_mediate_ops->open().
> 
> If vfio-pci detects this cap, it will create a default
> dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> and region->ops=null.
> Vvendor driver should override this region's len, flags, rw, mmap in its
> vfio_pci_mediate_ops.

TBH, I don't like this interface at all.  Userspace doesn't pass data
to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
configuring user signaling with eventfds.  I think we only need to
define an IRQ type that tells the user to re-evaluate the sparse mmap
information for a region.  The user would enumerate the device IRQs via
GET_IRQ_INFO, find one of this type where the IRQ info would also
indicate which region(s) should be re-evaluated on signaling.  The user
would enable that signaling via SET_IRQS and simply re-evaluate the
sparse mmap capability for the associated regions when signaled.
Thanks,

Alex

> 
> Cc: Kevin Tian <kevin.tian@intel.com>
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
>  include/linux/vfio.h        |  3 ++-
>  include/uapi/linux/vfio.h   | 11 +++++++++++
>  3 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> index 059660328be2..62b811ca43e4 100644
> --- a/drivers/vfio/pci/vfio_pci.c
> +++ b/drivers/vfio/pci/vfio_pci.c
> @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
>  		NULL);
>  }
>  
> +/**
> + * register a region to hold info for dynamically trap bar regions
> + */
> +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> +{
> +	vfio_pci_register_dev_region(vdev,
> +		VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> +		VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> +		NULL, 0,
> +		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> +		NULL);
> +}
> +
>  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
>  {
>  	struct resource *res;
> @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
>  				if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
>  					init_migration_region(vdev);
>  
> +				if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> +					init_dynamic_trap_bar_info_region(vdev);
> +
>  				pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
>  						vdev->mediate_ops->name, caps,
>  						handle, vdev->pdev->vendor,
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index cddea8e9dcb2..cf8ecf687bee 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
>  
>  struct vfio_pci_mediate_ops {
>  	char	*name;
> -#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> +#define VFIO_PCI_DEVICE_CAP_MIGRATION		(0x01)
> +#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR	(0x02)
>  	int	(*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
>  	void	(*release)(int handle);
>  	void	(*get_region_info)(int handle,
> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> index caf8845a67a6..74a2d0b57741 100644
> --- a/include/uapi/linux/vfio.h
> +++ b/include/uapi/linux/vfio.h
> @@ -258,6 +258,9 @@ struct vfio_region_info {
>  struct vfio_region_sparse_mmap_area {
>  	__u64	offset;	/* Offset of mmap'able area within region */
>  	__u64	size;	/* Size of mmap'able area */
> +	__u32	disablable;	/* whether this mmap'able are able to
> +				 *  be dynamically disabled
> +				 */
>  };
>  
>  struct vfio_region_info_cap_sparse_mmap {
> @@ -454,6 +457,14 @@ struct vfio_device_migration_info {
>  #define VFIO_DEVICE_DIRTY_PFNS_ALL	(~0ULL)
>  } __attribute__((packed));
>  
> +/* Region type and sub-type to hold info to dynamically trap bars */
> +#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO		(4)
> +#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO	(1)
> +
> +struct vfio_device_dt_bar_info_region {
> +	__u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
> +	__u32 trap;   /* trap/untrap bar regions */
> +};
>  
>  /* sub-types for VFIO_REGION_TYPE_PCI_* */
>
Yan Zhao Dec. 6, 2019, 6:04 a.m. UTC | #2
On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> On Wed,  4 Dec 2019 22:26:50 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > communicate dynamic trap info. It is of type
> > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > 
> > This region has two fields: dt_fd and trap.
> > When QEMU detects a device regions of this type, it will create an
> > eventfd and write its eventfd id to dt_fd field.
> > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > info region.
> > - If trap is true, QEMU would search the device's PCI BAR
> > regions and disable all the sparse mmaped subregions (if the sparse
> > mmaped subregion is disablable).
> > - If trap is false, QEMU would re-enable those subregions.
> > 
> > A typical usage is
> > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > to trap access of bar 0 registers and make dirty page tracking possible.
> > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > QEMU reads trap field of this info region which is false and QEMU
> > re-passthrough the whole bar 0 region.
> > 
> > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > vfio_pci_mediate_ops->open().
> > 
> > If vfio-pci detects this cap, it will create a default
> > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > and region->ops=null.
> > Vvendor driver should override this region's len, flags, rw, mmap in its
> > vfio_pci_mediate_ops.
> 
> TBH, I don't like this interface at all.  Userspace doesn't pass data
> to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> configuring user signaling with eventfds.  I think we only need to
> define an IRQ type that tells the user to re-evaluate the sparse mmap
> information for a region.  The user would enumerate the device IRQs via
> GET_IRQ_INFO, find one of this type where the IRQ info would also
> indicate which region(s) should be re-evaluated on signaling.  The user
> would enable that signaling via SET_IRQS and simply re-evaluate the
ok. I'll try to switch to this way. Thanks for this suggestion.

> sparse mmap capability for the associated regions when signaled.

Do you like the "disablable" flag of sparse mmap ?
I think it's a lightweight way for user to switch mmap state of a whole region,
otherwise going through a complete flow of GET_REGION_INFO and re-setup
region might be too heavy.

Thanks
Yan

> Thanks,
> 
> Alex
>




> > 
> > Cc: Kevin Tian <kevin.tian@intel.com>
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 16 ++++++++++++++++
> >  include/linux/vfio.h        |  3 ++-
> >  include/uapi/linux/vfio.h   | 11 +++++++++++
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 059660328be2..62b811ca43e4 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
> >  		NULL);
> >  }
> >  
> > +/**
> > + * register a region to hold info for dynamically trap bar regions
> > + */
> > +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> > +{
> > +	vfio_pci_register_dev_region(vdev,
> > +		VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> > +		VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> > +		NULL, 0,
> > +		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> > +		NULL);
> > +}
> > +
> >  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >  {
> >  	struct resource *res;
> > @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
> >  				if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> >  					init_migration_region(vdev);
> >  
> > +				if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> > +					init_dynamic_trap_bar_info_region(vdev);
> > +
> >  				pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> >  						vdev->mediate_ops->name, caps,
> >  						handle, vdev->pdev->vendor,
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > index cddea8e9dcb2..cf8ecf687bee 100644
> > --- a/include/linux/vfio.h
> > +++ b/include/linux/vfio.h
> > @@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
> >  
> >  struct vfio_pci_mediate_ops {
> >  	char	*name;
> > -#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
> > +#define VFIO_PCI_DEVICE_CAP_MIGRATION		(0x01)
> > +#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR	(0x02)
> >  	int	(*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
> >  	void	(*release)(int handle);
> >  	void	(*get_region_info)(int handle,
> > diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
> > index caf8845a67a6..74a2d0b57741 100644
> > --- a/include/uapi/linux/vfio.h
> > +++ b/include/uapi/linux/vfio.h
> > @@ -258,6 +258,9 @@ struct vfio_region_info {
> >  struct vfio_region_sparse_mmap_area {
> >  	__u64	offset;	/* Offset of mmap'able area within region */
> >  	__u64	size;	/* Size of mmap'able area */
> > +	__u32	disablable;	/* whether this mmap'able are able to
> > +				 *  be dynamically disabled
> > +				 */
> >  };
> >  
> >  struct vfio_region_info_cap_sparse_mmap {
> > @@ -454,6 +457,14 @@ struct vfio_device_migration_info {
> >  #define VFIO_DEVICE_DIRTY_PFNS_ALL	(~0ULL)
> >  } __attribute__((packed));
> >  
> > +/* Region type and sub-type to hold info to dynamically trap bars */
> > +#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO		(4)
> > +#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO	(1)
> > +
> > +struct vfio_device_dt_bar_info_region {
> > +	__u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
> > +	__u32 trap;   /* trap/untrap bar regions */
> > +};
> >  
> >  /* sub-types for VFIO_REGION_TYPE_PCI_* */
> >  
>
Alex Williamson Dec. 6, 2019, 3:20 p.m. UTC | #3
On Fri, 6 Dec 2019 01:04:07 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > On Wed,  4 Dec 2019 22:26:50 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > communicate dynamic trap info. It is of type
> > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > 
> > > This region has two fields: dt_fd and trap.
> > > When QEMU detects a device regions of this type, it will create an
> > > eventfd and write its eventfd id to dt_fd field.
> > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > info region.
> > > - If trap is true, QEMU would search the device's PCI BAR
> > > regions and disable all the sparse mmaped subregions (if the sparse
> > > mmaped subregion is disablable).
> > > - If trap is false, QEMU would re-enable those subregions.
> > > 
> > > A typical usage is
> > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > QEMU reads trap field of this info region which is false and QEMU
> > > re-passthrough the whole bar 0 region.
> > > 
> > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > vfio_pci_mediate_ops->open().
> > > 
> > > If vfio-pci detects this cap, it will create a default
> > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > and region->ops=null.
> > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > vfio_pci_mediate_ops.  
> > 
> > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > configuring user signaling with eventfds.  I think we only need to
> > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > information for a region.  The user would enumerate the device IRQs via
> > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > indicate which region(s) should be re-evaluated on signaling.  The user
> > would enable that signaling via SET_IRQS and simply re-evaluate the  
> ok. I'll try to switch to this way. Thanks for this suggestion.
> 
> > sparse mmap capability for the associated regions when signaled.  
> 
> Do you like the "disablable" flag of sparse mmap ?
> I think it's a lightweight way for user to switch mmap state of a whole region,
> otherwise going through a complete flow of GET_REGION_INFO and re-setup
> region might be too heavy.

No, I don't like the disable-able flag.  At what frequency do we expect
regions to change?  It seems like we'd only change when switching into
and out of the _SAVING state, which is rare.  It seems easy for
userspace, at least QEMU, to drop the entire mmap configuration and
re-read it.  Another concern here is how do we synchronize the event?
Are we assuming that this event would occur when a user switch to
_SAVING mode on the device?  That operation is synchronous, the device
must be in saving mode after the write to device state completes, but
it seems like this might be trying to add an asynchronous dependency.
Will the write to device_state only complete once the user handles the
eventfd?  How would the kernel know when the mmap re-evaluation is
complete.  It seems like there are gaps here that the vendor driver
could miss traps required for migration because the user hasn't
completed the mmap transition yet.  Thanks,

Alex
Yan Zhao Dec. 9, 2019, 6:22 a.m. UTC | #4
On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> On Fri, 6 Dec 2019 01:04:07 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > communicate dynamic trap info. It is of type
> > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > 
> > > > This region has two fields: dt_fd and trap.
> > > > When QEMU detects a device regions of this type, it will create an
> > > > eventfd and write its eventfd id to dt_fd field.
> > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > info region.
> > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > mmaped subregion is disablable).
> > > > - If trap is false, QEMU would re-enable those subregions.
> > > > 
> > > > A typical usage is
> > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > QEMU reads trap field of this info region which is false and QEMU
> > > > re-passthrough the whole bar 0 region.
> > > > 
> > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > vfio_pci_mediate_ops->open().
> > > > 
> > > > If vfio-pci detects this cap, it will create a default
> > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > and region->ops=null.
> > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > vfio_pci_mediate_ops.  
> > > 
> > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > configuring user signaling with eventfds.  I think we only need to
> > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > information for a region.  The user would enumerate the device IRQs via
> > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > would enable that signaling via SET_IRQS and simply re-evaluate the  
> > ok. I'll try to switch to this way. Thanks for this suggestion.
> > 
> > > sparse mmap capability for the associated regions when signaled.  
> > 
> > Do you like the "disablable" flag of sparse mmap ?
> > I think it's a lightweight way for user to switch mmap state of a whole region,
> > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > region might be too heavy.
> 
> No, I don't like the disable-able flag.  At what frequency do we expect
> regions to change?  It seems like we'd only change when switching into
> and out of the _SAVING state, which is rare.  It seems easy for
> userspace, at least QEMU, to drop the entire mmap configuration and
ok. I'll try this way.

> re-read it.  Another concern here is how do we synchronize the event?
> Are we assuming that this event would occur when a user switch to
> _SAVING mode on the device?  That operation is synchronous, the device
> must be in saving mode after the write to device state completes, but
> it seems like this might be trying to add an asynchronous dependency.
> Will the write to device_state only complete once the user handles the
> eventfd?  How would the kernel know when the mmap re-evaluation is
> complete.  It seems like there are gaps here that the vendor driver
> could miss traps required for migration because the user hasn't
> completed the mmap transition yet.  Thanks,
> 
> Alex

yes, this asynchronous event notification will cause vendor driver miss
traps. But it's supposed to be of very short period time. That's also a
reason for us to wish the re-evaluation to be lightweight. E.g. if it's
able to be finished before the first iterate, it's still safe.

But I agree, the timing is not guaranteed, and so it's best for kernel
to wait for mmap re-evaluation to complete. 

migration_thread
    |->qemu_savevm_state_setup
    |   |->ram_save_setup
    |   |   |->migration_bitmap_sync
    |   |       |->kvm_log_sync
    |   |       |->vfio_log_sync
    |   |
    |   |->vfio_save_setup
    |       |->set_device_state(_SAVING)
    |
    |->qemu_savevm_state_pending
    |   |->ram_save_pending
    |   |   |->migration_bitmap_sync 
    |   |      |->kvm_log_sync
    |   |      |->vfio_log_sync
    |   |->vfio_save_pending
    |
    |->qemu_savevm_state_iterate
    |   |->ram_save_iterate //send pages
    |   |->vfio_save_iterate
    ...


Actually, we previously let qemu trigger the re-evaluation when migration starts.
And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
there're other two possible use cases:
(1) keep passing through devices when migration starts and track dirty pages
    using hardware IOMMU. Then when migration is about to complete, stop the
    device and start trap PCI BARs for software emulation. (we made some
    changes to let device stop ahead of vcpu )
(2) performance optimization. There's an example in GVT (mdev case): 
    PCI BARs are passed through on vGPU initialization and are mmaped to a host
    dummy buffer. Then after initialization done, start trap of PCI BARs of
    vGPUs and start normal host mediation. The initial pass-through can save
    1000000 times of mmio trap.

Thanks
Yan
Alex Williamson Dec. 9, 2019, 9:16 p.m. UTC | #5
On Mon, 9 Dec 2019 01:22:12 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > On Fri, 6 Dec 2019 01:04:07 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:  
> > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > communicate dynamic trap info. It is of type
> > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > 
> > > > > This region has two fields: dt_fd and trap.
> > > > > When QEMU detects a device regions of this type, it will create an
> > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > info region.
> > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > mmaped subregion is disablable).
> > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > 
> > > > > A typical usage is
> > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > re-passthrough the whole bar 0 region.
> > > > > 
> > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > vfio_pci_mediate_ops->open().
> > > > > 
> > > > > If vfio-pci detects this cap, it will create a default
> > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > and region->ops=null.
> > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > vfio_pci_mediate_ops.    
> > > > 
> > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > configuring user signaling with eventfds.  I think we only need to
> > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > information for a region.  The user would enumerate the device IRQs via
> > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > would enable that signaling via SET_IRQS and simply re-evaluate the    
> > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > >   
> > > > sparse mmap capability for the associated regions when signaled.    
> > > 
> > > Do you like the "disablable" flag of sparse mmap ?
> > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > region might be too heavy.  
> > 
> > No, I don't like the disable-able flag.  At what frequency do we expect
> > regions to change?  It seems like we'd only change when switching into
> > and out of the _SAVING state, which is rare.  It seems easy for
> > userspace, at least QEMU, to drop the entire mmap configuration and  
> ok. I'll try this way.
> 
> > re-read it.  Another concern here is how do we synchronize the event?
> > Are we assuming that this event would occur when a user switch to
> > _SAVING mode on the device?  That operation is synchronous, the device
> > must be in saving mode after the write to device state completes, but
> > it seems like this might be trying to add an asynchronous dependency.
> > Will the write to device_state only complete once the user handles the
> > eventfd?  How would the kernel know when the mmap re-evaluation is
> > complete.  It seems like there are gaps here that the vendor driver
> > could miss traps required for migration because the user hasn't
> > completed the mmap transition yet.  Thanks,
> > 
> > Alex  
> 
> yes, this asynchronous event notification will cause vendor driver miss
> traps. But it's supposed to be of very short period time. That's also a
> reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> able to be finished before the first iterate, it's still safe.

Making the re-evaluation lightweight cannot solve the race, it only
masks it.

> But I agree, the timing is not guaranteed, and so it's best for kernel
> to wait for mmap re-evaluation to complete. 
> 
> migration_thread
>     |->qemu_savevm_state_setup
>     |   |->ram_save_setup
>     |   |   |->migration_bitmap_sync
>     |   |       |->kvm_log_sync
>     |   |       |->vfio_log_sync
>     |   |
>     |   |->vfio_save_setup
>     |       |->set_device_state(_SAVING)
>     |
>     |->qemu_savevm_state_pending
>     |   |->ram_save_pending
>     |   |   |->migration_bitmap_sync 
>     |   |      |->kvm_log_sync
>     |   |      |->vfio_log_sync
>     |   |->vfio_save_pending
>     |
>     |->qemu_savevm_state_iterate
>     |   |->ram_save_iterate //send pages
>     |   |->vfio_save_iterate
>     ...
> 
> 
> Actually, we previously let qemu trigger the re-evaluation when migration starts.
> And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> there're other two possible use cases:
> (1) keep passing through devices when migration starts and track dirty pages
>     using hardware IOMMU. Then when migration is about to complete, stop the
>     device and start trap PCI BARs for software emulation. (we made some
>     changes to let device stop ahead of vcpu )

How is that possible?  I/O devices need to continue to work until the
vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU
should assume all mmaps should be dropped on vfio device after we pass
some point of the migration process.

If there are a fixed set of mmap settings for a region and discrete
conditions under which they become active (ex. switch device to SAVING
mode) then QEMU could choose the right mapping itself and we wouldn't
need to worry about this asynchronous signaling problem, it would just
be defined as part of the protocol userspace needs to use.

> (2) performance optimization. There's an example in GVT (mdev case): 
>     PCI BARs are passed through on vGPU initialization and are mmaped to a host
>     dummy buffer. Then after initialization done, start trap of PCI BARs of
>     vGPUs and start normal host mediation. The initial pass-through can save
>     1000000 times of mmio trap.

Much of this discussion has me worried that many assumptions are being
made about the user and device interaction.  Backwards compatible
behavior is required.  If a mdev device presents an initial sparse mmap
capability for this acceleration, how do you support an existing
userspace that doesn't understand the new dynamic mmap semantics and
continues to try to operate with the initial sparse mmap?  Doesn't this
introduce another example of the raciness of the device trying to
switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
sufficient timeliness the switch back to trap behavior could miss an
important transaction.  This also seems like an optimization targeted
at VMs running for only a short time, where it's not obvious to me that
GVT-g overlaps those sorts of use cases.  How much initialization time
is actually being saved with such a hack?  Thanks,

Alex
Yan Zhao Dec. 10, 2019, 7:44 a.m. UTC | #6
On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> On Mon, 9 Dec 2019 01:22:12 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:  
> > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > communicate dynamic trap info. It is of type
> > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > 
> > > > > > This region has two fields: dt_fd and trap.
> > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > info region.
> > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > mmaped subregion is disablable).
> > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > 
> > > > > > A typical usage is
> > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > re-passthrough the whole bar 0 region.
> > > > > > 
> > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > vfio_pci_mediate_ops->open().
> > > > > > 
> > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > and region->ops=null.
> > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > vfio_pci_mediate_ops.    
> > > > > 
> > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > would enable that signaling via SET_IRQS and simply re-evaluate the    
> > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > >   
> > > > > sparse mmap capability for the associated regions when signaled.    
> > > > 
> > > > Do you like the "disablable" flag of sparse mmap ?
> > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > region might be too heavy.  
> > > 
> > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > regions to change?  It seems like we'd only change when switching into
> > > and out of the _SAVING state, which is rare.  It seems easy for
> > > userspace, at least QEMU, to drop the entire mmap configuration and  
> > ok. I'll try this way.
> > 
> > > re-read it.  Another concern here is how do we synchronize the event?
> > > Are we assuming that this event would occur when a user switch to
> > > _SAVING mode on the device?  That operation is synchronous, the device
> > > must be in saving mode after the write to device state completes, but
> > > it seems like this might be trying to add an asynchronous dependency.
> > > Will the write to device_state only complete once the user handles the
> > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > complete.  It seems like there are gaps here that the vendor driver
> > > could miss traps required for migration because the user hasn't
> > > completed the mmap transition yet.  Thanks,
> > > 
> > > Alex  
> > 
> > yes, this asynchronous event notification will cause vendor driver miss
> > traps. But it's supposed to be of very short period time. That's also a
> > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > able to be finished before the first iterate, it's still safe.
> 
> Making the re-evaluation lightweight cannot solve the race, it only
> masks it.
> 
> > But I agree, the timing is not guaranteed, and so it's best for kernel
> > to wait for mmap re-evaluation to complete. 
> > 
> > migration_thread
> >     |->qemu_savevm_state_setup
> >     |   |->ram_save_setup
> >     |   |   |->migration_bitmap_sync
> >     |   |       |->kvm_log_sync
> >     |   |       |->vfio_log_sync
> >     |   |
> >     |   |->vfio_save_setup
> >     |       |->set_device_state(_SAVING)
> >     |
> >     |->qemu_savevm_state_pending
> >     |   |->ram_save_pending
> >     |   |   |->migration_bitmap_sync 
> >     |   |      |->kvm_log_sync
> >     |   |      |->vfio_log_sync
> >     |   |->vfio_save_pending
> >     |
> >     |->qemu_savevm_state_iterate
> >     |   |->ram_save_iterate //send pages
> >     |   |->vfio_save_iterate
> >     ...
> > 
> > 
> > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > there're other two possible use cases:
> > (1) keep passing through devices when migration starts and track dirty pages
> >     using hardware IOMMU. Then when migration is about to complete, stop the
> >     device and start trap PCI BARs for software emulation. (we made some
> >     changes to let device stop ahead of vcpu )
> 
> How is that possible?  I/O devices need to continue to work until the
> vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU
hi Alex
For devices like DSA [1], it can support SVM mode. In this mode, when a
page fault happens, the Intel DSA device blocks until the page fault is
resolved, if PRS is enabled; otherwise it is reported as an error.

Therefore, to pass through DSA into guest and do live migration with it,
it is desired to stop DSA before stopping vCPU, as there may be an
outstanding page fault to be resolved.

During the period when DSA is stopped and vCPUs are still running, all the
pass-through resources are trapped and emulated by host mediation driver until
vCPUs stop.


[1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf


> should assume all mmaps should be dropped on vfio device after we pass
> some point of the migration process.
> 
yes, it should be workable for the use case of DSA.

> If there are a fixed set of mmap settings for a region and discrete
> conditions under which they become active (ex. switch device to SAVING
> mode) then QEMU could choose the right mapping itself and we wouldn't
> need to worry about this asynchronous signaling problem, it would just
> be defined as part of the protocol userspace needs to use.
>
It's ok to let QEMU trigger dynamic trap on certain condition (like switching
device to SAVING mode), but it seems that there's no fixed set of mmap settings
for a region.
For example, some devices may want to trap the whole BARs, but some devices
only requires to trap a range of pages in a BAR for performance consideration.

If the "disable-able" flag is not preferable, maybe re-evaluation way is
the only choice? But it is a burden to ask for re-evaluation if they are
not required.

What about introducing a "region_bitmask" in ctl header of the migration region?
when QEMU writes a region index to the "region_bitmask", it can read back
from this field a bitmask to know which mmap to disable.

> > (2) performance optimization. There's an example in GVT (mdev case): 
> >     PCI BARs are passed through on vGPU initialization and are mmaped to a host
> >     dummy buffer. Then after initialization done, start trap of PCI BARs of
> >     vGPUs and start normal host mediation. The initial pass-through can save
> >     1000000 times of mmio trap.
> 
> Much of this discussion has me worried that many assumptions are being
> made about the user and device interaction.  Backwards compatible
> behavior is required.  If a mdev device presents an initial sparse mmap
> capability for this acceleration, how do you support an existing
> userspace that doesn't understand the new dynamic mmap semantics and
> continues to try to operate with the initial sparse mmap?  Doesn't this
> introduce another example of the raciness of the device trying to
> switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
> sufficient timeliness the switch back to trap behavior could miss an
> important transaction.  This also seems like an optimization targeted
> at VMs running for only a short time, where it's not obvious to me that
> GVT-g overlaps those sorts of use cases.  How much initialization time
> is actually being saved with such a hack?  Thanks,
>
It can save about 4s initialization time with such a hack. But you are
right, the backward compatibility is a problem and we are not going to
upstream that. Just an example to show the usage.
It's fine if we drop the way of asynchronous kernel notification.

Thanks
Yan
Alex Williamson Dec. 10, 2019, 4:38 p.m. UTC | #7
On Tue, 10 Dec 2019 02:44:44 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > On Mon, 9 Dec 2019 01:22:12 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:  
> > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:    
> > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > communicate dynamic trap info. It is of type
> > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > 
> > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > info region.
> > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > mmaped subregion is disablable).
> > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > 
> > > > > > > A typical usage is
> > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > 
> > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > 
> > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > and region->ops=null.
> > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > vfio_pci_mediate_ops.      
> > > > > > 
> > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the      
> > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > >     
> > > > > > sparse mmap capability for the associated regions when signaled.      
> > > > > 
> > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > region might be too heavy.    
> > > > 
> > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > regions to change?  It seems like we'd only change when switching into
> > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > userspace, at least QEMU, to drop the entire mmap configuration and    
> > > ok. I'll try this way.
> > >   
> > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > Are we assuming that this event would occur when a user switch to
> > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > must be in saving mode after the write to device state completes, but
> > > > it seems like this might be trying to add an asynchronous dependency.
> > > > Will the write to device_state only complete once the user handles the
> > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > complete.  It seems like there are gaps here that the vendor driver
> > > > could miss traps required for migration because the user hasn't
> > > > completed the mmap transition yet.  Thanks,
> > > > 
> > > > Alex    
> > > 
> > > yes, this asynchronous event notification will cause vendor driver miss
> > > traps. But it's supposed to be of very short period time. That's also a
> > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > able to be finished before the first iterate, it's still safe.  
> > 
> > Making the re-evaluation lightweight cannot solve the race, it only
> > masks it.
> >   
> > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > to wait for mmap re-evaluation to complete. 
> > > 
> > > migration_thread
> > >     |->qemu_savevm_state_setup
> > >     |   |->ram_save_setup
> > >     |   |   |->migration_bitmap_sync
> > >     |   |       |->kvm_log_sync
> > >     |   |       |->vfio_log_sync
> > >     |   |
> > >     |   |->vfio_save_setup
> > >     |       |->set_device_state(_SAVING)
> > >     |
> > >     |->qemu_savevm_state_pending
> > >     |   |->ram_save_pending
> > >     |   |   |->migration_bitmap_sync 
> > >     |   |      |->kvm_log_sync
> > >     |   |      |->vfio_log_sync
> > >     |   |->vfio_save_pending
> > >     |
> > >     |->qemu_savevm_state_iterate
> > >     |   |->ram_save_iterate //send pages
> > >     |   |->vfio_save_iterate
> > >     ...
> > > 
> > > 
> > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > there're other two possible use cases:
> > > (1) keep passing through devices when migration starts and track dirty pages
> > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > >     device and start trap PCI BARs for software emulation. (we made some
> > >     changes to let device stop ahead of vcpu )  
> > 
> > How is that possible?  I/O devices need to continue to work until the
> > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU  
> hi Alex
> For devices like DSA [1], it can support SVM mode. In this mode, when a
> page fault happens, the Intel DSA device blocks until the page fault is
> resolved, if PRS is enabled; otherwise it is reported as an error.
> 
> Therefore, to pass through DSA into guest and do live migration with it,
> it is desired to stop DSA before stopping vCPU, as there may be an
> outstanding page fault to be resolved.
> 
> During the period when DSA is stopped and vCPUs are still running, all the
> pass-through resources are trapped and emulated by host mediation driver until
> vCPUs stop.

If the DSA is stopped and resources are trapped and emulated, then is
the device really stopped from a QEMU perspective or has it simply
switched modes underneath QEMU?  If the device is truly stopped, then
I'd like to understand how a vCPU doing a PIO read from the device
wouldn't wedge the VM.
 
> [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> 
> 
> > should assume all mmaps should be dropped on vfio device after we pass
> > some point of the migration process.
> >   
> yes, it should be workable for the use case of DSA.
> 
> > If there are a fixed set of mmap settings for a region and discrete
> > conditions under which they become active (ex. switch device to SAVING
> > mode) then QEMU could choose the right mapping itself and we wouldn't
> > need to worry about this asynchronous signaling problem, it would just
> > be defined as part of the protocol userspace needs to use.
> >  
> It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> device to SAVING mode), but it seems that there's no fixed set of mmap settings
> for a region.
> For example, some devices may want to trap the whole BARs, but some devices
> only requires to trap a range of pages in a BAR for performance consideration.
> 
> If the "disable-able" flag is not preferable, maybe re-evaluation way is
> the only choice? But it is a burden to ask for re-evaluation if they are
> not required.
> 
> What about introducing a "region_bitmask" in ctl header of the migration region?
> when QEMU writes a region index to the "region_bitmask", it can read back
> from this field a bitmask to know which mmap to disable.

If a vendor driver wanted to have a migration sparse mmap that's
different from its runtime sparse mmap, we could simply add a new
capability in the region_info.  Userspace would only need to switch to
a different mapping for regions which advertise a new migration sparse
mmap capability.  Doesn't that serve the same purpose as the proposed
bitmap?

> > > (2) performance optimization. There's an example in GVT (mdev case): 
> > >     PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > >     dummy buffer. Then after initialization done, start trap of PCI BARs of
> > >     vGPUs and start normal host mediation. The initial pass-through can save
> > >     1000000 times of mmio trap.  
> > 
> > Much of this discussion has me worried that many assumptions are being
> > made about the user and device interaction.  Backwards compatible
> > behavior is required.  If a mdev device presents an initial sparse mmap
> > capability for this acceleration, how do you support an existing
> > userspace that doesn't understand the new dynamic mmap semantics and
> > continues to try to operate with the initial sparse mmap?  Doesn't this
> > introduce another example of the raciness of the device trying to
> > switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
> > sufficient timeliness the switch back to trap behavior could miss an
> > important transaction.  This also seems like an optimization targeted
> > at VMs running for only a short time, where it's not obvious to me that
> > GVT-g overlaps those sorts of use cases.  How much initialization time
> > is actually being saved with such a hack?  Thanks,
> >  
> It can save about 4s initialization time with such a hack. But you are
> right, the backward compatibility is a problem and we are not going to
> upstream that. Just an example to show the usage.
> It's fine if we drop the way of asynchronous kernel notification.

I think to handle such a situation we'd need a mechanism to revoke the
user's mmap.  We can make use of an asynchronous mechanism to improve
performance of a device, but we need a synchronous mechanism to
maintain correctness.  For this example, the sparse mmap capability
could advertise the section of the BAR as mmap'able and revoke that
user mapping after the device finishes the initialization phase.
Potentially the user re-evaluating region_info after the initialization
phase would see a different sparse mmap capability excluding these
sections, but then we might need to think whether we want to suggest
that the user always re-read the region_info after device reset.  AFAIK,
we currently have no mechanism to revoke user mmaps. Thanks,

Alex
Yan Zhao Dec. 11, 2019, 6:25 a.m. UTC | #8
On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> On Tue, 10 Dec 2019 02:44:44 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:  
> > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:    
> > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > 
> > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > info region.
> > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > mmaped subregion is disablable).
> > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > 
> > > > > > > > A typical usage is
> > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > 
> > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > 
> > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > and region->ops=null.
> > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > vfio_pci_mediate_ops.      
> > > > > > > 
> > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the      
> > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > >     
> > > > > > > sparse mmap capability for the associated regions when signaled.      
> > > > > > 
> > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > region might be too heavy.    
> > > > > 
> > > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > > regions to change?  It seems like we'd only change when switching into
> > > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > > userspace, at least QEMU, to drop the entire mmap configuration and    
> > > > ok. I'll try this way.
> > > >   
> > > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > > Are we assuming that this event would occur when a user switch to
> > > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > > must be in saving mode after the write to device state completes, but
> > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > Will the write to device_state only complete once the user handles the
> > > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > > complete.  It seems like there are gaps here that the vendor driver
> > > > > could miss traps required for migration because the user hasn't
> > > > > completed the mmap transition yet.  Thanks,
> > > > > 
> > > > > Alex    
> > > > 
> > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > traps. But it's supposed to be of very short period time. That's also a
> > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > able to be finished before the first iterate, it's still safe.  
> > > 
> > > Making the re-evaluation lightweight cannot solve the race, it only
> > > masks it.
> > >   
> > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > to wait for mmap re-evaluation to complete. 
> > > > 
> > > > migration_thread
> > > >     |->qemu_savevm_state_setup
> > > >     |   |->ram_save_setup
> > > >     |   |   |->migration_bitmap_sync
> > > >     |   |       |->kvm_log_sync
> > > >     |   |       |->vfio_log_sync
> > > >     |   |
> > > >     |   |->vfio_save_setup
> > > >     |       |->set_device_state(_SAVING)
> > > >     |
> > > >     |->qemu_savevm_state_pending
> > > >     |   |->ram_save_pending
> > > >     |   |   |->migration_bitmap_sync 
> > > >     |   |      |->kvm_log_sync
> > > >     |   |      |->vfio_log_sync
> > > >     |   |->vfio_save_pending
> > > >     |
> > > >     |->qemu_savevm_state_iterate
> > > >     |   |->ram_save_iterate //send pages
> > > >     |   |->vfio_save_iterate
> > > >     ...
> > > > 
> > > > 
> > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > there're other two possible use cases:
> > > > (1) keep passing through devices when migration starts and track dirty pages
> > > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > > >     device and start trap PCI BARs for software emulation. (we made some
> > > >     changes to let device stop ahead of vcpu )  
> > > 
> > > How is that possible?  I/O devices need to continue to work until the
> > > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU  
> > hi Alex
> > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > page fault happens, the Intel DSA device blocks until the page fault is
> > resolved, if PRS is enabled; otherwise it is reported as an error.
> > 
> > Therefore, to pass through DSA into guest and do live migration with it,
> > it is desired to stop DSA before stopping vCPU, as there may be an
> > outstanding page fault to be resolved.
> > 
> > During the period when DSA is stopped and vCPUs are still running, all the
> > pass-through resources are trapped and emulated by host mediation driver until
> > vCPUs stop.
> 
> If the DSA is stopped and resources are trapped and emulated, then is
> the device really stopped from a QEMU perspective or has it simply
> switched modes underneath QEMU?  If the device is truly stopped, then
> I'd like to understand how a vCPU doing a PIO read from the device
> wouldn't wedge the VM.
>
It doesn't matter if the device is truly stopped or not (although from
my point of view, just draining commands and keeping device running is
better as it handles live migration failure better).
PIOs also need to be trapped and emulated if a vCPU accesses them. 

> > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> > 
> > 
> > > should assume all mmaps should be dropped on vfio device after we pass
> > > some point of the migration process.
> > >   
> > yes, it should be workable for the use case of DSA.
> > 
> > > If there are a fixed set of mmap settings for a region and discrete
> > > conditions under which they become active (ex. switch device to SAVING
> > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > need to worry about this asynchronous signaling problem, it would just
> > > be defined as part of the protocol userspace needs to use.
> > >  
> > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > for a region.
> > For example, some devices may want to trap the whole BARs, but some devices
> > only requires to trap a range of pages in a BAR for performance consideration.
> > 
> > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > the only choice? But it is a burden to ask for re-evaluation if they are
> > not required.
> > 
> > What about introducing a "region_bitmask" in ctl header of the migration region?
> > when QEMU writes a region index to the "region_bitmask", it can read back
> > from this field a bitmask to know which mmap to disable.
> 
> If a vendor driver wanted to have a migration sparse mmap that's
> different from its runtime sparse mmap, we could simply add a new
> capability in the region_info.  Userspace would only need to switch to
> a different mapping for regions which advertise a new migration sparse
> mmap capability.  Doesn't that serve the same purpose as the proposed
> bitmap?

yes, it does.
I will try this way in next version.

> > > > (2) performance optimization. There's an example in GVT (mdev case): 
> > > >     PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > >     dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > >     vGPUs and start normal host mediation. The initial pass-through can save
> > > >     1000000 times of mmio trap.  
> > > 
> > > Much of this discussion has me worried that many assumptions are being
> > > made about the user and device interaction.  Backwards compatible
> > > behavior is required.  If a mdev device presents an initial sparse mmap
> > > capability for this acceleration, how do you support an existing
> > > userspace that doesn't understand the new dynamic mmap semantics and
> > > continues to try to operate with the initial sparse mmap?  Doesn't this
> > > introduce another example of the raciness of the device trying to
> > > switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
> > > sufficient timeliness the switch back to trap behavior could miss an
> > > important transaction.  This also seems like an optimization targeted
> > > at VMs running for only a short time, where it's not obvious to me that
> > > GVT-g overlaps those sorts of use cases.  How much initialization time
> > > is actually being saved with such a hack?  Thanks,
> > >  
> > It can save about 4s initialization time with such a hack. But you are
> > right, the backward compatibility is a problem and we are not going to
> > upstream that. Just an example to show the usage.
> > It's fine if we drop the way of asynchronous kernel notification.
> 
> I think to handle such a situation we'd need a mechanism to revoke the
> user's mmap.  We can make use of an asynchronous mechanism to improve
> performance of a device, but we need a synchronous mechanism to
> maintain correctness.  For this example, the sparse mmap capability
> could advertise the section of the BAR as mmap'able and revoke that
> user mapping after the device finishes the initialization phase.
> Potentially the user re-evaluating region_info after the initialization
> phase would see a different sparse mmap capability excluding these
> sections, but then we might need to think whether we want to suggest
> that the user always re-read the region_info after device reset.  AFAIK,
> we currently have no mechanism to revoke user mmaps. Thanks,
> 
Actually I think the "disable-able" flag is good except for its backward
compatibility :)
 
Thanks
Yan
Alex Williamson Dec. 11, 2019, 6:56 p.m. UTC | #9
On Wed, 11 Dec 2019 01:25:55 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > On Tue, 10 Dec 2019 02:44:44 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:  
> > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:    
> > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:      
> > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >         
> > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > 
> > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > info region.
> > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > 
> > > > > > > > > A typical usage is
> > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > 
> > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > 
> > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > and region->ops=null.
> > > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > vfio_pci_mediate_ops.        
> > > > > > > > 
> > > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the        
> > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > >       
> > > > > > > > sparse mmap capability for the associated regions when signaled.        
> > > > > > > 
> > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > region might be too heavy.      
> > > > > > 
> > > > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > > > regions to change?  It seems like we'd only change when switching into
> > > > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > > > userspace, at least QEMU, to drop the entire mmap configuration and      
> > > > > ok. I'll try this way.
> > > > >     
> > > > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > > > must be in saving mode after the write to device state completes, but
> > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > Will the write to device_state only complete once the user handles the
> > > > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > > > complete.  It seems like there are gaps here that the vendor driver
> > > > > > could miss traps required for migration because the user hasn't
> > > > > > completed the mmap transition yet.  Thanks,
> > > > > > 
> > > > > > Alex      
> > > > > 
> > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > able to be finished before the first iterate, it's still safe.    
> > > > 
> > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > masks it.
> > > >     
> > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > to wait for mmap re-evaluation to complete. 
> > > > > 
> > > > > migration_thread
> > > > >     |->qemu_savevm_state_setup
> > > > >     |   |->ram_save_setup
> > > > >     |   |   |->migration_bitmap_sync
> > > > >     |   |       |->kvm_log_sync
> > > > >     |   |       |->vfio_log_sync
> > > > >     |   |
> > > > >     |   |->vfio_save_setup
> > > > >     |       |->set_device_state(_SAVING)
> > > > >     |
> > > > >     |->qemu_savevm_state_pending
> > > > >     |   |->ram_save_pending
> > > > >     |   |   |->migration_bitmap_sync 
> > > > >     |   |      |->kvm_log_sync
> > > > >     |   |      |->vfio_log_sync
> > > > >     |   |->vfio_save_pending
> > > > >     |
> > > > >     |->qemu_savevm_state_iterate
> > > > >     |   |->ram_save_iterate //send pages
> > > > >     |   |->vfio_save_iterate
> > > > >     ...
> > > > > 
> > > > > 
> > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > there're other two possible use cases:
> > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > > > >     device and start trap PCI BARs for software emulation. (we made some
> > > > >     changes to let device stop ahead of vcpu )    
> > > > 
> > > > How is that possible?  I/O devices need to continue to work until the
> > > > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU    
> > > hi Alex
> > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > page fault happens, the Intel DSA device blocks until the page fault is
> > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > 
> > > Therefore, to pass through DSA into guest and do live migration with it,
> > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > outstanding page fault to be resolved.
> > > 
> > > During the period when DSA is stopped and vCPUs are still running, all the
> > > pass-through resources are trapped and emulated by host mediation driver until
> > > vCPUs stop.  
> > 
> > If the DSA is stopped and resources are trapped and emulated, then is
> > the device really stopped from a QEMU perspective or has it simply
> > switched modes underneath QEMU?  If the device is truly stopped, then
> > I'd like to understand how a vCPU doing a PIO read from the device
> > wouldn't wedge the VM.
> >  
> It doesn't matter if the device is truly stopped or not (although from
> my point of view, just draining commands and keeping device running is
> better as it handles live migration failure better).
> PIOs also need to be trapped and emulated if a vCPU accesses them.

We seem to be talking around each other here.  If PIOs are trapped and
emulated then the device is not "stopped" as far as QEMU is concerned,
right?  "Stopping" a device suggests to me that a running vCPU doing a
PIO read from the device would block and cause problems in the still
running VM.  So I think you're suggesting some sort of mode switch in
the device where direct access is disabled an emulation takes over
until the vCPUs are stopped.
 
> > > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> > > 
> > >   
> > > > should assume all mmaps should be dropped on vfio device after we pass
> > > > some point of the migration process.
> > > >     
> > > yes, it should be workable for the use case of DSA.
> > >   
> > > > If there are a fixed set of mmap settings for a region and discrete
> > > > conditions under which they become active (ex. switch device to SAVING
> > > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > > need to worry about this asynchronous signaling problem, it would just
> > > > be defined as part of the protocol userspace needs to use.
> > > >    
> > > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > > for a region.
> > > For example, some devices may want to trap the whole BARs, but some devices
> > > only requires to trap a range of pages in a BAR for performance consideration.
> > > 
> > > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > > the only choice? But it is a burden to ask for re-evaluation if they are
> > > not required.
> > > 
> > > What about introducing a "region_bitmask" in ctl header of the migration region?
> > > when QEMU writes a region index to the "region_bitmask", it can read back
> > > from this field a bitmask to know which mmap to disable.  
> > 
> > If a vendor driver wanted to have a migration sparse mmap that's
> > different from its runtime sparse mmap, we could simply add a new
> > capability in the region_info.  Userspace would only need to switch to
> > a different mapping for regions which advertise a new migration sparse
> > mmap capability.  Doesn't that serve the same purpose as the proposed
> > bitmap?  
> 
> yes, it does.
> I will try this way in next version.
> 
> > > > > (2) performance optimization. There's an example in GVT (mdev case): 
> > > > >     PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > > >     dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > > >     vGPUs and start normal host mediation. The initial pass-through can save
> > > > >     1000000 times of mmio trap.    
> > > > 
> > > > Much of this discussion has me worried that many assumptions are being
> > > > made about the user and device interaction.  Backwards compatible
> > > > behavior is required.  If a mdev device presents an initial sparse mmap
> > > > capability for this acceleration, how do you support an existing
> > > > userspace that doesn't understand the new dynamic mmap semantics and
> > > > continues to try to operate with the initial sparse mmap?  Doesn't this
> > > > introduce another example of the raciness of the device trying to
> > > > switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
> > > > sufficient timeliness the switch back to trap behavior could miss an
> > > > important transaction.  This also seems like an optimization targeted
> > > > at VMs running for only a short time, where it's not obvious to me that
> > > > GVT-g overlaps those sorts of use cases.  How much initialization time
> > > > is actually being saved with such a hack?  Thanks,
> > > >    
> > > It can save about 4s initialization time with such a hack. But you are
> > > right, the backward compatibility is a problem and we are not going to
> > > upstream that. Just an example to show the usage.
> > > It's fine if we drop the way of asynchronous kernel notification.  
> > 
> > I think to handle such a situation we'd need a mechanism to revoke the
> > user's mmap.  We can make use of an asynchronous mechanism to improve
> > performance of a device, but we need a synchronous mechanism to
> > maintain correctness.  For this example, the sparse mmap capability
> > could advertise the section of the BAR as mmap'able and revoke that
> > user mapping after the device finishes the initialization phase.
> > Potentially the user re-evaluating region_info after the initialization
> > phase would see a different sparse mmap capability excluding these
> > sections, but then we might need to think whether we want to suggest
> > that the user always re-read the region_info after device reset.  AFAIK,
> > we currently have no mechanism to revoke user mmaps. Thanks,
> >   
> Actually I think the "disable-able" flag is good except for its backward
> compatibility :)

Setting a flag on a section of a region doesn't solve the asynchronous
problem.  Thanks,

Alex
Yan Zhao Dec. 12, 2019, 2:02 a.m. UTC | #10
On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 01:25:55 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:  
> > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:    
> > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:      
> > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > >         
> > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > 
> > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > info region.
> > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > 
> > > > > > > > > > A typical usage is
> > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > 
> > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > 
> > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > and region->ops=null.
> > > > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > vfio_pci_mediate_ops.        
> > > > > > > > > 
> > > > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the        
> > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > >       
> > > > > > > > > sparse mmap capability for the associated regions when signaled.        
> > > > > > > > 
> > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > region might be too heavy.      
> > > > > > > 
> > > > > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > > > > regions to change?  It seems like we'd only change when switching into
> > > > > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and      
> > > > > > ok. I'll try this way.
> > > > > >     
> > > > > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > > > > complete.  It seems like there are gaps here that the vendor driver
> > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > completed the mmap transition yet.  Thanks,
> > > > > > > 
> > > > > > > Alex      
> > > > > > 
> > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > able to be finished before the first iterate, it's still safe.    
> > > > > 
> > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > masks it.
> > > > >     
> > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > to wait for mmap re-evaluation to complete. 
> > > > > > 
> > > > > > migration_thread
> > > > > >     |->qemu_savevm_state_setup
> > > > > >     |   |->ram_save_setup
> > > > > >     |   |   |->migration_bitmap_sync
> > > > > >     |   |       |->kvm_log_sync
> > > > > >     |   |       |->vfio_log_sync
> > > > > >     |   |
> > > > > >     |   |->vfio_save_setup
> > > > > >     |       |->set_device_state(_SAVING)
> > > > > >     |
> > > > > >     |->qemu_savevm_state_pending
> > > > > >     |   |->ram_save_pending
> > > > > >     |   |   |->migration_bitmap_sync 
> > > > > >     |   |      |->kvm_log_sync
> > > > > >     |   |      |->vfio_log_sync
> > > > > >     |   |->vfio_save_pending
> > > > > >     |
> > > > > >     |->qemu_savevm_state_iterate
> > > > > >     |   |->ram_save_iterate //send pages
> > > > > >     |   |->vfio_save_iterate
> > > > > >     ...
> > > > > > 
> > > > > > 
> > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > there're other two possible use cases:
> > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > >     device and start trap PCI BARs for software emulation. (we made some
> > > > > >     changes to let device stop ahead of vcpu )    
> > > > > 
> > > > > How is that possible?  I/O devices need to continue to work until the
> > > > > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU    
> > > > hi Alex
> > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > > 
> > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > outstanding page fault to be resolved.
> > > > 
> > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > vCPUs stop.  
> > > 
> > > If the DSA is stopped and resources are trapped and emulated, then is
> > > the device really stopped from a QEMU perspective or has it simply
> > > switched modes underneath QEMU?  If the device is truly stopped, then
> > > I'd like to understand how a vCPU doing a PIO read from the device
> > > wouldn't wedge the VM.
> > >  
> > It doesn't matter if the device is truly stopped or not (although from
> > my point of view, just draining commands and keeping device running is
> > better as it handles live migration failure better).
> > PIOs also need to be trapped and emulated if a vCPU accesses them.
> 
> We seem to be talking around each other here.  If PIOs are trapped and
> emulated then the device is not "stopped" as far as QEMU is concerned,
> right?  "Stopping" a device suggests to me that a running vCPU doing a
> PIO read from the device would block and cause problems in the still
> running VM.  So I think you're suggesting some sort of mode switch in
> the device where direct access is disabled an emulation takes over
> until the vCPUs are stopped.

sorry for this confusion.
yes, it's a kind of mode switch from a QEMU perspective.
Currently, its implementation in our local branch is like that:
1. before migration thread stopping vCPUs, a migration state
(COMPLETING) notification is sent to vfio migration state notifier, and
this notifier would put device state to !RUNNING, and put all BARs to trap
state.
2. in the kernel, when device state is set to !RUNNING, draining all
pending device requests, and starts emulation.

This implementation has two issues:
1. it requires hardcode in QEMU to put all BARs trapped and the time
spending on revoking mmaps is not necessary for devices that do not need it.
2. !RUNNING state here is not accurate and it will confuse vendor
drivers who stop devices after vCPUs stop.

For the 2nd issue, I think we can propose a new device state like
PRE-STOPPING.

But for the 1st issue, not sure how to fix it right now.
Maybe we can still add an asynchronous kernel notification and wait until
QEMU have switched the region mmap state?


> > > > [1] https://software.intel.com/sites/default/files/341204-intel-data-streaming-accelerator-spec.pdf
> > > > 
> > > >   
> > > > > should assume all mmaps should be dropped on vfio device after we pass
> > > > > some point of the migration process.
> > > > >     
> > > > yes, it should be workable for the use case of DSA.
> > > >   
> > > > > If there are a fixed set of mmap settings for a region and discrete
> > > > > conditions under which they become active (ex. switch device to SAVING
> > > > > mode) then QEMU could choose the right mapping itself and we wouldn't
> > > > > need to worry about this asynchronous signaling problem, it would just
> > > > > be defined as part of the protocol userspace needs to use.
> > > > >    
> > > > It's ok to let QEMU trigger dynamic trap on certain condition (like switching
> > > > device to SAVING mode), but it seems that there's no fixed set of mmap settings
> > > > for a region.
> > > > For example, some devices may want to trap the whole BARs, but some devices
> > > > only requires to trap a range of pages in a BAR for performance consideration.
> > > > 
> > > > If the "disable-able" flag is not preferable, maybe re-evaluation way is
> > > > the only choice? But it is a burden to ask for re-evaluation if they are
> > > > not required.
> > > > 
> > > > What about introducing a "region_bitmask" in ctl header of the migration region?
> > > > when QEMU writes a region index to the "region_bitmask", it can read back
> > > > from this field a bitmask to know which mmap to disable.  
> > > 
> > > If a vendor driver wanted to have a migration sparse mmap that's
> > > different from its runtime sparse mmap, we could simply add a new
> > > capability in the region_info.  Userspace would only need to switch to
> > > a different mapping for regions which advertise a new migration sparse
> > > mmap capability.  Doesn't that serve the same purpose as the proposed
> > > bitmap?  
> > 
> > yes, it does.
> > I will try this way in next version.
> > 
> > > > > > (2) performance optimization. There's an example in GVT (mdev case): 
> > > > > >     PCI BARs are passed through on vGPU initialization and are mmaped to a host
> > > > > >     dummy buffer. Then after initialization done, start trap of PCI BARs of
> > > > > >     vGPUs and start normal host mediation. The initial pass-through can save
> > > > > >     1000000 times of mmio trap.    
> > > > > 
> > > > > Much of this discussion has me worried that many assumptions are being
> > > > > made about the user and device interaction.  Backwards compatible
> > > > > behavior is required.  If a mdev device presents an initial sparse mmap
> > > > > capability for this acceleration, how do you support an existing
> > > > > userspace that doesn't understand the new dynamic mmap semantics and
> > > > > continues to try to operate with the initial sparse mmap?  Doesn't this
> > > > > introduce another example of the raciness of the device trying to
> > > > > switch mmaps?  Seems that if QEMU doesn't handle the eventfd with
> > > > > sufficient timeliness the switch back to trap behavior could miss an
> > > > > important transaction.  This also seems like an optimization targeted
> > > > > at VMs running for only a short time, where it's not obvious to me that
> > > > > GVT-g overlaps those sorts of use cases.  How much initialization time
> > > > > is actually being saved with such a hack?  Thanks,
> > > > >    
> > > > It can save about 4s initialization time with such a hack. But you are
> > > > right, the backward compatibility is a problem and we are not going to
> > > > upstream that. Just an example to show the usage.
> > > > It's fine if we drop the way of asynchronous kernel notification.  
> > > 
> > > I think to handle such a situation we'd need a mechanism to revoke the
> > > user's mmap.  We can make use of an asynchronous mechanism to improve
> > > performance of a device, but we need a synchronous mechanism to
> > > maintain correctness.  For this example, the sparse mmap capability
> > > could advertise the section of the BAR as mmap'able and revoke that
> > > user mapping after the device finishes the initialization phase.
> > > Potentially the user re-evaluating region_info after the initialization
> > > phase would see a different sparse mmap capability excluding these
> > > sections, but then we might need to think whether we want to suggest
> > > that the user always re-read the region_info after device reset.  AFAIK,
> > > we currently have no mechanism to revoke user mmaps. Thanks,
> > >   
> > Actually I think the "disable-able" flag is good except for its backward
> > compatibility :)
> 
> Setting a flag on a section of a region doesn't solve the asynchronous
> problem.  Thanks,
> 
yes. I mean we are ok to give up the way of kernel to trigger re-evaluation
for now, as currently the use cases of DSA and GVT are not in upstream
phase and we can add that if necessary in future.
And I think "disable-able" flag is a way to re-use existing sparse mmap,
because otherwise we have to either re-evaluate the region_info or introduce
new caps like migration_sparse_mmap, reset_sparse_mmap. But I agree,
this flag may cause problem for old QEMUs.

Thanks
Yan
Alex Williamson Dec. 12, 2019, 3:07 a.m. UTC | #11
On Wed, 11 Dec 2019 21:02:40 -0500
Yan Zhao <yan.y.zhao@intel.com> wrote:

> On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > On Wed, 11 Dec 2019 01:25:55 -0500
> > Yan Zhao <yan.y.zhao@intel.com> wrote:
> >   
> > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:  
> > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > >     
> > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:    
> > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > >       
> > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:      
> > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > >         
> > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:        
> > > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > >           
> > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > > 
> > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > > info region.
> > > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > > 
> > > > > > > > > > > A typical usage is
> > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > > 
> > > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > > 
> > > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > > and region->ops=null.
> > > > > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > > vfio_pci_mediate_ops.          
> > > > > > > > > > 
> > > > > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the          
> > > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > > >         
> > > > > > > > > > sparse mmap capability for the associated regions when signaled.          
> > > > > > > > > 
> > > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > > region might be too heavy.        
> > > > > > > > 
> > > > > > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > > > > > regions to change?  It seems like we'd only change when switching into
> > > > > > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and        
> > > > > > > ok. I'll try this way.
> > > > > > >       
> > > > > > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > > > > > complete.  It seems like there are gaps here that the vendor driver
> > > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > > completed the mmap transition yet.  Thanks,
> > > > > > > > 
> > > > > > > > Alex        
> > > > > > > 
> > > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > > able to be finished before the first iterate, it's still safe.      
> > > > > > 
> > > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > > masks it.
> > > > > >       
> > > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > > to wait for mmap re-evaluation to complete. 
> > > > > > > 
> > > > > > > migration_thread
> > > > > > >     |->qemu_savevm_state_setup
> > > > > > >     |   |->ram_save_setup
> > > > > > >     |   |   |->migration_bitmap_sync
> > > > > > >     |   |       |->kvm_log_sync
> > > > > > >     |   |       |->vfio_log_sync
> > > > > > >     |   |
> > > > > > >     |   |->vfio_save_setup
> > > > > > >     |       |->set_device_state(_SAVING)
> > > > > > >     |
> > > > > > >     |->qemu_savevm_state_pending
> > > > > > >     |   |->ram_save_pending
> > > > > > >     |   |   |->migration_bitmap_sync 
> > > > > > >     |   |      |->kvm_log_sync
> > > > > > >     |   |      |->vfio_log_sync
> > > > > > >     |   |->vfio_save_pending
> > > > > > >     |
> > > > > > >     |->qemu_savevm_state_iterate
> > > > > > >     |   |->ram_save_iterate //send pages
> > > > > > >     |   |->vfio_save_iterate
> > > > > > >     ...
> > > > > > > 
> > > > > > > 
> > > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > > there're other two possible use cases:
> > > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > > >     device and start trap PCI BARs for software emulation. (we made some
> > > > > > >     changes to let device stop ahead of vcpu )      
> > > > > > 
> > > > > > How is that possible?  I/O devices need to continue to work until the
> > > > > > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU      
> > > > > hi Alex
> > > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > > > 
> > > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > > outstanding page fault to be resolved.
> > > > > 
> > > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > > vCPUs stop.    
> > > > 
> > > > If the DSA is stopped and resources are trapped and emulated, then is
> > > > the device really stopped from a QEMU perspective or has it simply
> > > > switched modes underneath QEMU?  If the device is truly stopped, then
> > > > I'd like to understand how a vCPU doing a PIO read from the device
> > > > wouldn't wedge the VM.
> > > >    
> > > It doesn't matter if the device is truly stopped or not (although from
> > > my point of view, just draining commands and keeping device running is
> > > better as it handles live migration failure better).
> > > PIOs also need to be trapped and emulated if a vCPU accesses them.  
> > 
> > We seem to be talking around each other here.  If PIOs are trapped and
> > emulated then the device is not "stopped" as far as QEMU is concerned,
> > right?  "Stopping" a device suggests to me that a running vCPU doing a
> > PIO read from the device would block and cause problems in the still
> > running VM.  So I think you're suggesting some sort of mode switch in
> > the device where direct access is disabled an emulation takes over
> > until the vCPUs are stopped.  
> 
> sorry for this confusion.
> yes, it's a kind of mode switch from a QEMU perspective.
> Currently, its implementation in our local branch is like that:
> 1. before migration thread stopping vCPUs, a migration state
> (COMPLETING) notification is sent to vfio migration state notifier, and
> this notifier would put device state to !RUNNING, and put all BARs to trap
> state.
> 2. in the kernel, when device state is set to !RUNNING, draining all
> pending device requests, and starts emulation.
> 
> This implementation has two issues:
> 1. it requires hardcode in QEMU to put all BARs trapped and the time
> spending on revoking mmaps is not necessary for devices that do not need it.
> 2. !RUNNING state here is not accurate and it will confuse vendor
> drivers who stop devices after vCPUs stop.
> 
> For the 2nd issue, I think we can propose a new device state like
> PRE-STOPPING.

Yes, this is absolutely abusing the !RUNNING state, if the device is
still processing accesses by the vCPU, it's still running.
 
> But for the 1st issue, not sure how to fix it right now.
> Maybe we can still add an asynchronous kernel notification and wait until
> QEMU have switched the region mmap state?

It seems like you're preemptively trying to optimize the SAVING state
before we even have migration working.  Shouldn't SAVING be the point
at which you switch to trapping the device in order to track it?
Thanks,

Alex
Yan Zhao Dec. 12, 2019, 3:11 a.m. UTC | #12
On Thu, Dec 12, 2019 at 11:07:42AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 21:02:40 -0500
> Yan Zhao <yan.y.zhao@intel.com> wrote:
> 
> > On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > > On Wed, 11 Dec 2019 01:25:55 -0500
> > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > >   
> > > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:  
> > > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > >     
> > > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:    
> > > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > >       
> > > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:      
> > > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > >         
> > > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:        
> > > > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > > > Yan Zhao <yan.y.zhao@intel.com> wrote:
> > > > > > > > > > >           
> > > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > > > 
> > > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > > When QEMU detects a device regions of this type, it will create an
> > > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > > When vendor drivre signals this eventfd, QEMU reads trap field of this
> > > > > > > > > > > > info region.
> > > > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > > > 
> > > > > > > > > > > > A typical usage is
> > > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > > > > > > > > sparse mmap array. So initally, all its bar 0 are passthroughed.
> > > > > > > > > > > > 2. vendor driver specifys part of bar 0 sections to be disablable.
> > > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > > > > > > > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > > > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > > > 
> > > > > > > > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > > > 
> > > > > > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region len=0
> > > > > > > > > > > > and region->ops=null.
> > > > > > > > > > > > Vvendor driver should override this region's len, flags, rw, mmap in its
> > > > > > > > > > > > vfio_pci_mediate_ops.          
> > > > > > > > > > > 
> > > > > > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > > > > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > > > > > > > information for a region.  The user would enumerate the device IRQs via
> > > > > > > > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > > > > > > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > > > > > > > > > would enable that signaling via SET_IRQS and simply re-evaluate the          
> > > > > > > > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > > > > > > > >         
> > > > > > > > > > > sparse mmap capability for the associated regions when signaled.          
> > > > > > > > > > 
> > > > > > > > > > Do you like the "disablable" flag of sparse mmap ?
> > > > > > > > > > I think it's a lightweight way for user to switch mmap state of a whole region,
> > > > > > > > > > otherwise going through a complete flow of GET_REGION_INFO and re-setup
> > > > > > > > > > region might be too heavy.        
> > > > > > > > > 
> > > > > > > > > No, I don't like the disable-able flag.  At what frequency do we expect
> > > > > > > > > regions to change?  It seems like we'd only change when switching into
> > > > > > > > > and out of the _SAVING state, which is rare.  It seems easy for
> > > > > > > > > userspace, at least QEMU, to drop the entire mmap configuration and        
> > > > > > > > ok. I'll try this way.
> > > > > > > >       
> > > > > > > > > re-read it.  Another concern here is how do we synchronize the event?
> > > > > > > > > Are we assuming that this event would occur when a user switch to
> > > > > > > > > _SAVING mode on the device?  That operation is synchronous, the device
> > > > > > > > > must be in saving mode after the write to device state completes, but
> > > > > > > > > it seems like this might be trying to add an asynchronous dependency.
> > > > > > > > > Will the write to device_state only complete once the user handles the
> > > > > > > > > eventfd?  How would the kernel know when the mmap re-evaluation is
> > > > > > > > > complete.  It seems like there are gaps here that the vendor driver
> > > > > > > > > could miss traps required for migration because the user hasn't
> > > > > > > > > completed the mmap transition yet.  Thanks,
> > > > > > > > > 
> > > > > > > > > Alex        
> > > > > > > > 
> > > > > > > > yes, this asynchronous event notification will cause vendor driver miss
> > > > > > > > traps. But it's supposed to be of very short period time. That's also a
> > > > > > > > reason for us to wish the re-evaluation to be lightweight. E.g. if it's
> > > > > > > > able to be finished before the first iterate, it's still safe.      
> > > > > > > 
> > > > > > > Making the re-evaluation lightweight cannot solve the race, it only
> > > > > > > masks it.
> > > > > > >       
> > > > > > > > But I agree, the timing is not guaranteed, and so it's best for kernel
> > > > > > > > to wait for mmap re-evaluation to complete. 
> > > > > > > > 
> > > > > > > > migration_thread
> > > > > > > >     |->qemu_savevm_state_setup
> > > > > > > >     |   |->ram_save_setup
> > > > > > > >     |   |   |->migration_bitmap_sync
> > > > > > > >     |   |       |->kvm_log_sync
> > > > > > > >     |   |       |->vfio_log_sync
> > > > > > > >     |   |
> > > > > > > >     |   |->vfio_save_setup
> > > > > > > >     |       |->set_device_state(_SAVING)
> > > > > > > >     |
> > > > > > > >     |->qemu_savevm_state_pending
> > > > > > > >     |   |->ram_save_pending
> > > > > > > >     |   |   |->migration_bitmap_sync 
> > > > > > > >     |   |      |->kvm_log_sync
> > > > > > > >     |   |      |->vfio_log_sync
> > > > > > > >     |   |->vfio_save_pending
> > > > > > > >     |
> > > > > > > >     |->qemu_savevm_state_iterate
> > > > > > > >     |   |->ram_save_iterate //send pages
> > > > > > > >     |   |->vfio_save_iterate
> > > > > > > >     ...
> > > > > > > > 
> > > > > > > > 
> > > > > > > > Actually, we previously let qemu trigger the re-evaluation when migration starts.
> > > > > > > > And now the reason for we to wish kernel to trigger the mmap re-evaluation is that
> > > > > > > > there're other two possible use cases:
> > > > > > > > (1) keep passing through devices when migration starts and track dirty pages
> > > > > > > >     using hardware IOMMU. Then when migration is about to complete, stop the
> > > > > > > >     device and start trap PCI BARs for software emulation. (we made some
> > > > > > > >     changes to let device stop ahead of vcpu )      
> > > > > > > 
> > > > > > > How is that possible?  I/O devices need to continue to work until the
> > > > > > > vCPU stops otherwise the vCPU can get blocked on the device.  Maybe QEMU      
> > > > > > hi Alex
> > > > > > For devices like DSA [1], it can support SVM mode. In this mode, when a
> > > > > > page fault happens, the Intel DSA device blocks until the page fault is
> > > > > > resolved, if PRS is enabled; otherwise it is reported as an error.
> > > > > > 
> > > > > > Therefore, to pass through DSA into guest and do live migration with it,
> > > > > > it is desired to stop DSA before stopping vCPU, as there may be an
> > > > > > outstanding page fault to be resolved.
> > > > > > 
> > > > > > During the period when DSA is stopped and vCPUs are still running, all the
> > > > > > pass-through resources are trapped and emulated by host mediation driver until
> > > > > > vCPUs stop.    
> > > > > 
> > > > > If the DSA is stopped and resources are trapped and emulated, then is
> > > > > the device really stopped from a QEMU perspective or has it simply
> > > > > switched modes underneath QEMU?  If the device is truly stopped, then
> > > > > I'd like to understand how a vCPU doing a PIO read from the device
> > > > > wouldn't wedge the VM.
> > > > >    
> > > > It doesn't matter if the device is truly stopped or not (although from
> > > > my point of view, just draining commands and keeping device running is
> > > > better as it handles live migration failure better).
> > > > PIOs also need to be trapped and emulated if a vCPU accesses them.  
> > > 
> > > We seem to be talking around each other here.  If PIOs are trapped and
> > > emulated then the device is not "stopped" as far as QEMU is concerned,
> > > right?  "Stopping" a device suggests to me that a running vCPU doing a
> > > PIO read from the device would block and cause problems in the still
> > > running VM.  So I think you're suggesting some sort of mode switch in
> > > the device where direct access is disabled an emulation takes over
> > > until the vCPUs are stopped.  
> > 
> > sorry for this confusion.
> > yes, it's a kind of mode switch from a QEMU perspective.
> > Currently, its implementation in our local branch is like that:
> > 1. before migration thread stopping vCPUs, a migration state
> > (COMPLETING) notification is sent to vfio migration state notifier, and
> > this notifier would put device state to !RUNNING, and put all BARs to trap
> > state.
> > 2. in the kernel, when device state is set to !RUNNING, draining all
> > pending device requests, and starts emulation.
> > 
> > This implementation has two issues:
> > 1. it requires hardcode in QEMU to put all BARs trapped and the time
> > spending on revoking mmaps is not necessary for devices that do not need it.
> > 2. !RUNNING state here is not accurate and it will confuse vendor
> > drivers who stop devices after vCPUs stop.
> > 
> > For the 2nd issue, I think we can propose a new device state like
> > PRE-STOPPING.
> 
> Yes, this is absolutely abusing the !RUNNING state, if the device is
> still processing accesses by the vCPU, it's still running.
>  
> > But for the 1st issue, not sure how to fix it right now.
> > Maybe we can still add an asynchronous kernel notification and wait until
> > QEMU have switched the region mmap state?
> 
> It seems like you're preemptively trying to optimize the SAVING state
> before we even have migration working.  Shouldn't SAVING be the point
> at which you switch to trapping the device in order to track it?
> Thanks,

But for some devices, start trapping on entering SAVING state is too
early. They don't really need the trapping until PRE_STOPPING stage.
E.g. for DSA, it can get dirty pages without trapping. The intention for
it to enter trap is not for SAVING, but for emulation.

Thanks
Yan

Patch
diff mbox series

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 059660328be2..62b811ca43e4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -127,6 +127,19 @@  void init_migration_region(struct vfio_pci_device *vdev)
 		NULL);
 }
 
+/**
+ * register a region to hold info for dynamically trap bar regions
+ */
+void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
+{
+	vfio_pci_register_dev_region(vdev,
+		VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
+		VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
+		NULL, 0,
+		VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+		NULL);
+}
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
 	struct resource *res;
@@ -538,6 +551,9 @@  static int vfio_pci_open(void *device_data)
 				if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
 					init_migration_region(vdev);
 
+				if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
+					init_dynamic_trap_bar_info_region(vdev);
+
 				pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
 						vdev->mediate_ops->name, caps,
 						handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index cddea8e9dcb2..cf8ecf687bee 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,7 +197,8 @@  extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
 struct vfio_pci_mediate_ops {
 	char	*name;
-#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
+#define VFIO_PCI_DEVICE_CAP_MIGRATION		(0x01)
+#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR	(0x02)
 	int	(*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
 	void	(*release)(int handle);
 	void	(*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index caf8845a67a6..74a2d0b57741 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -258,6 +258,9 @@  struct vfio_region_info {
 struct vfio_region_sparse_mmap_area {
 	__u64	offset;	/* Offset of mmap'able area within region */
 	__u64	size;	/* Size of mmap'able area */
+	__u32	disablable;	/* whether this mmap'able are able to
+				 *  be dynamically disabled
+				 */
 };
 
 struct vfio_region_info_cap_sparse_mmap {
@@ -454,6 +457,14 @@  struct vfio_device_migration_info {
 #define VFIO_DEVICE_DIRTY_PFNS_ALL	(~0ULL)
 } __attribute__((packed));
 
+/* Region type and sub-type to hold info to dynamically trap bars */
+#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO		(4)
+#define VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO	(1)
+
+struct vfio_device_dt_bar_info_region {
+	__u32 dt_fd; /* fd of eventfd to notify qemu trap/untrap bars*/
+	__u32 trap;   /* trap/untrap bar regions */
+};
 
 /* sub-types for VFIO_REGION_TYPE_PCI_* */