
[RFC,v2,00/18] Add VFIO mediated device support and DEV-MSI support for the idxd driver

Message ID 159534667974.28840.2045034360240786644.stgit@djiang5-desk3.ch.intel.com

Message

Dave Jiang July 21, 2020, 4:02 p.m. UTC
v2:
IMS (now dev-msi):
With recommendations from Jason/Thomas/Dan on making IMS more generic:
Pass a non-PCI generic device (struct device) for IMS management instead of an mdev
Remove all references to mdev and symbol_get/put
Remove all references to IMS in common code and replace with dev-msi
Remove dynamic allocation of platform-msi interrupts: no groups, no new msi list or list helpers
Create a generic dev-msi domain with and without interrupt remapping enabled.
Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs APIs

mdev: 
Removed unrelated bits from SVA enabling that are not necessary for this submission (Kevin)
Restructured entire mdev driver series to make reviewing easier (Kevin)
Made rw emulation more robust (Kevin)
Removed uuid wq type and added single dedicated wq type (Kevin)
Locking fixes for vdev (Yan Zhao)
VFIO MSIX trigger fixes (Yan Zhao)

Link to previous discussions with Jason:
https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/
The portion of emulation that could be moved to user space is very small, because the majority of
the emulated registers are control bits that need to reside in the kernel. We can revisit moving
that small emulation part to userspace, and the architectural changes it would require, at a later time.

This RFC series has been reviewed by Dan Williams <dan.j.williams@intel.com>

The actual code is independent of the stage 2 driver submission that adds support for SVM,
ENQCMD(S), PASID, and shared workqueues. This series matches the support of the 5.6 kernel
(stage 1) driver, but in the guest. The code depends on Baolu's iommu aux-domain API extension
patches, which are still under review:
https://lkml.org/lkml/2020/7/14/48

Stage 1 of the driver was accepted into the v5.6 kernel. It supports dedicated workqueues (wq)
without Shared Virtual Memory (SVM). Stage 2 adds shared wq and SVM support; it is pending
upstream review and targets kernel v5.9.

The VFIO mediated device framework allows vendor drivers to wrap a portion of device resources into
virtual devices (mdevs). Each mdev can be assigned to a different guest using the same set of VFIO
uAPIs used to assign a physical device. Access to mdev resources is served with mixed policies.
For example, vendor drivers typically mark the data-path interface as pass-through for fast guest
operations, and trap-and-mediate the control-path interface to avoid undesired interference
between mdevs. Some level of emulation is necessary behind a vfio mdev to compose the virtual device
interface.

This series brings mdev to the idxd driver to enable Intel Scalable IOV (SIOV), a hardware-assisted
mediated pass-through technology. SIOV makes each DSA wq independently assignable through
PASID-granular resource/DMA isolation. It improves scalability and reduces mediation complexity
compared to purely software-based mdev implementations. Each assigned wq is configured by the host
and exposed to the guest in a read-only configuration mode, which allows the guest to use the
wq without additional setup. This design greatly reduces the amount of emulation, which can focus
on handling commands from guests.

This series introduces a new mdev type, "1dwq", which allows allocation of a single dedicated wq
from the available dedicated wqs. After a workqueue (wq) is enabled, the user generates a uuid. On
mdev creation, the mdev driver code finds a dedicated wq matching the mdev type. When the create
operation succeeds, the user-generated uuid can be passed to qemu. When the guest boots, it should
discover a DSA device during PCI discovery.

Example flow for the "1dwq" type:
1. Enable a wq with the "mdev" wq type.
2. The user generates a uuid.
3. The uuid is written to the mdev creation sysfs path:
echo $UUID > /sys/class/mdev_bus/0000\:00\:0a.0/mdev_supported_types/idxd-wq/create
4. Pass the following parameter to qemu:
"-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:0a.0/$UUID"
 
The wq exported through mdev will have the read-only config bit set, which means the device does not
require the typical configuration. After enabling the device, the user only needs to set the wq type
and name; that is all that is necessary to enable the wq and start using it. The single-wq
configuration is not the only way to create an mdev; multi-wq support for mdev is planned as
future work.
 
The mdev utilizes Interrupt Message Store, or IMS[3], a device-specific MSI implementation, instead
of MSI-X for guest interrupts. This preserves MSI-X for host usage and also allows a
significantly larger number of interrupt vectors for guest usage.

The idxd driver implements IMS as on-device, memory-mapped, unified storage. Each interrupt message
is stored as a DWORD-size data payload and a 64-bit address (same as MSI-X). Access to the IMS is
through the host idxd driver.
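
As a rough illustration of that layout, an IMS entry can be pictured as the struct below. This is
a sketch only; the field names and the control word are assumptions for illustration, and the
authoritative layout is defined by the DSA specification [4] and the ims.c code in this series.

#include <linux/types.h>

/* Illustrative sketch of one on-device IMS entry (not taken from the patch). */
struct ims_entry_sketch {
	u32 addr_lo;	/* low 32 bits of the 64-bit message address */
	u32 addr_hi;	/* high 32 bits of the message address */
	u32 data;	/* DWORD data payload, as in an MSI-X table entry */
	u32 ctrl;	/* assumed device-specific control bits (e.g. per-vector mask) */
};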

This patchset extends the existing platform-msi framework (which provides a generic mechanism to
support non-PCI-compliant MSI interrupts) to benefit any driver that wants to allocate MSI-like
(dev-msi) interrupts and provide its own ops functions (mask/unmask, etc.).

Callback functions defined by the kernel and implemented by the driver are used to:
1. program the interrupt addr/data values, instead of the kernel programming them directly;
2. mask/unmask the interrupt source.

The kernel can specify the requirements for these callback functions (e.g., the driver is not
expected to block, or not expected to take a lock in the callback function).

Support for two new IRQ chips/domains is added (with and without IRQ_REMAP support: DEV-MSI and
IR-DEV-MSI).
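
A minimal sketch of how a driver might plug into this is shown below. The struct platform_msi_ops
callback names/signatures and the dev_msi_domain_alloc_irqs()/dev_msi_domain_free_irqs()
prototypes are assumptions based on the descriptions above (the patches in this series are
authoritative), and the foo_* names are hypothetical:

#include <linux/device.h>
#include <linux/msi.h>

/* Driver-side callbacks: the driver, not the core kernel, programs the
 * address/data pair into device storage and masks/unmasks the vector. */
static void foo_write_msg(struct msi_desc *desc, struct msi_msg *msg)
{
	/* write msg->address_lo, msg->address_hi and msg->data into the
	 * device's interrupt message storage for this vector */
}

static void foo_irq_mask(struct msi_desc *desc)
{
	/* set the device-specific per-vector mask bit */
}

static void foo_irq_unmask(struct msi_desc *desc)
{
	/* clear the device-specific per-vector mask bit */
}

static const struct platform_msi_ops foo_msi_ops = {
	.irq_mask	= foo_irq_mask,
	.irq_unmask	= foo_irq_unmask,
	.write_msg	= foo_write_msg,
};

/* Allocation takes a plain struct device (not necessarily a PCI device) plus
 * the ops above; vectors are later freed with dev_msi_domain_free_irqs(dev). */
static int foo_alloc_dev_msi(struct device *dev, unsigned int nvec)
{
	return dev_msi_domain_alloc_irqs(dev, nvec, &foo_msi_ops);
}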

[1]: https://lore.kernel.org/lkml/157965011794.73301.15960052071729101309.stgit@djiang5-desk3.ch.intel.com/
[2]: https://software.intel.com/en-us/articles/intel-sdm
[3]: https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
[4]: https://software.intel.com/en-us/download/intel-data-streaming-accelerator-preliminary-architecture-specification
[5]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
[6]: https://intel.github.io/idxd/
[7]: https://github.com/intel/idxd-driver idxd-stage2.5

---

Dave Jiang (13):
      dmaengine: idxd: add support for readonly config devices
      dmaengine: idxd: add interrupt handle request support
      dmaengine: idxd: add DEV-MSI support in base driver
      dmaengine: idxd: add device support functions in prep for mdev
      dmaengine: idxd: add basic mdev registration and helper functions
      dmaengine: idxd: add emulation rw routines
      dmaengine: idxd: prep for virtual device commands
      dmaengine: idxd: virtual device commands emulation
      dmaengine: idxd: ims setup for the vdcm
      dmaengine: idxd: add mdev type as a new wq type
      dmaengine: idxd: add dedicated wq mdev type
      dmaengine: idxd: add new wq state for mdev
      dmaengine: idxd: add error notification from host driver to mediated device

Jing Lin (1):
      dmaengine: idxd: add ABI documentation for mediated device support

Megha Dey (4):
      platform-msi: Introduce platform_msi_ops
      irq/dev-msi: Add support for a new DEV_MSI irq domain
      irq/dev-msi: Create IR-DEV-MSI irq domain
      irq/dev-msi: Introduce APIs to allocate/free dev-msi interrupts


 Documentation/ABI/stable/sysfs-driver-dma-idxd |   15 
 arch/x86/include/asm/hw_irq.h                  |    6 
 arch/x86/kernel/apic/msi.c                     |   12 
 drivers/base/Kconfig                           |    7 
 drivers/base/Makefile                          |    1 
 drivers/base/dev-msi.c                         |  170 ++++
 drivers/base/platform-msi.c                    |   62 +
 drivers/base/platform-msi.h                    |   23 
 drivers/dma/Kconfig                            |    7 
 drivers/dma/idxd/Makefile                      |    2 
 drivers/dma/idxd/cdev.c                        |    6 
 drivers/dma/idxd/device.c                      |  266 +++++-
 drivers/dma/idxd/idxd.h                        |   62 +
 drivers/dma/idxd/ims.c                         |  174 ++++
 drivers/dma/idxd/ims.h                         |   17 
 drivers/dma/idxd/init.c                        |  100 ++
 drivers/dma/idxd/irq.c                         |    6 
 drivers/dma/idxd/mdev.c                        | 1106 ++++++++++++++++++++++++
 drivers/dma/idxd/mdev.h                        |  118 +++
 drivers/dma/idxd/registers.h                   |   24 -
 drivers/dma/idxd/submit.c                      |   37 +
 drivers/dma/idxd/sysfs.c                       |   55 +
 drivers/dma/idxd/vdev.c                        |  962 +++++++++++++++++++++
 drivers/dma/idxd/vdev.h                        |   28 +
 drivers/dma/mv_xor_v2.c                        |    6 
 drivers/dma/qcom/hidma.c                       |    6 
 drivers/iommu/arm-smmu-v3.c                    |    6 
 drivers/iommu/intel/irq_remapping.c            |   11 
 drivers/irqchip/irq-mbigen.c                   |    8 
 drivers/irqchip/irq-mvebu-icu.c                |    6 
 drivers/mailbox/bcm-flexrm-mailbox.c           |    6 
 drivers/perf/arm_smmuv3_pmu.c                  |    6 
 include/linux/intel-iommu.h                    |    1 
 include/linux/irqdomain.h                      |   11 
 include/linux/msi.h                            |   35 +
 include/uapi/linux/idxd.h                      |    2 
 36 files changed, 3270 insertions(+), 100 deletions(-)
 create mode 100644 drivers/base/dev-msi.c
 create mode 100644 drivers/base/platform-msi.h
 create mode 100644 drivers/dma/idxd/ims.c
 create mode 100644 drivers/dma/idxd/ims.h
 create mode 100644 drivers/dma/idxd/mdev.c
 create mode 100644 drivers/dma/idxd/mdev.h
 create mode 100644 drivers/dma/idxd/vdev.c
 create mode 100644 drivers/dma/idxd/vdev.h

--

Comments

Greg Kroah-Hartman July 21, 2020, 4:28 p.m. UTC | #1
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:

"RFC" to me means "I don't really think this is mergable, so I'm
throwing it out there."  Which implies you know it needs more work
before others should review it as you are not comfortable with it :(

So, back-of-the-queue you go...

greg k-h
Jason Gunthorpe July 21, 2020, 4:45 p.m. UTC | #2
On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> v2:
> IMS (now dev-msi):
> With recommendations from Jason/Thomas/Dan on making IMS more generic:
> Pass a non-pci generic device(struct device) for IMS management instead of mdev
> Remove all references to mdev and symbol_get/put
> Remove all references to IMS in common code and replace with dev-msi
> remove dynamic allocation of platform-msi interrupts: no groups,no new msi list or list helpers
> Create a generic dev-msi domain with and without interrupt remapping enabled.
> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis

I didn't dig into the details of irq handling to really check this,
but the big picture of this is much more in line with what I would
expect for this kind of ability.

> Link to previous discussions with Jason:
> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/
> The emulation part that can be moved to user space is very small due to the majority of the
> emulations being control bits and need to reside in the kernel. We can revisit the necessity of
> moving the small emulation part to userspace and required architectural changes at a later time.

The point here is that you already have a user space interface for
these queues that already has kernel support to twiddle the control
bits. Generally I'd expect extending that existing kernel code to do
the small bit more needed for mapping the queue through to PCI
emulation to be smaller than the 2kloc of new code here to put all the
emulation and support framework in the kernel, and exposes a lower
attack surface of kernel code to the guest.

> The kernel can specify the requirements for these callback functions
> (e.g., the driver is not expected to block, or not expected to take
> a lock in the callback function).

I didn't notice any of this in the patch series? What is the calling
context for the platform_msi_ops ? I think I already mentioned that
ideally we'd need blocking/sleeping. The big selling point is that IMS
allows this data to move off-chip, which means accessing it is no
longer just an atomic write to some on-chip memory.

These details should be documented in the comment on top of
platform_msi_ops

I'm actually a little confused how idxd_ims_irq_mask() manages this -
I thought IRQ masking should be synchronous, shouldn't there at least be a
flushing read to ensure that new MSI's are stopped and any in flight
are flushed to the APIC?

Jason
Dave Jiang July 21, 2020, 5:17 p.m. UTC | #3
On 7/21/2020 9:28 AM, Greg KH wrote:
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>> v2:
> 
> "RFC" to me means "I don't really think this is mergable, so I'm
> throwing it out there."  Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(
> 
> So, back-of-the-queue you go...
> 
> greg k-h
> 

Hi Greg! Yes this absolutely needs more work! I think it's in pretty good shape, 
but it has reached the point where it needs the talented eyes of reviewers from 
outside of Intel. I was really hoping to get feedback from folks like Jason 
(Thanks Jason!!) and KVM and VFIO experts like Alex, Paolo, Eric, and Kirti.

I can understand that you are quite busy and can not necessarily provide a 
detailed review at this phase. Would you prefer to be cc'd on code at this phase 
in the future? Or, should we reserve putting you on the cc for times when we 
know it's ready for merge?
Dave Jiang July 21, 2020, 6 p.m. UTC | #4
On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>> v2:
>> IMS (now dev-msi):
>> With recommendations from Jason/Thomas/Dan on making IMS more generic:
>> Pass a non-pci generic device(struct device) for IMS management instead of mdev
>> Remove all references to mdev and symbol_get/put
>> Remove all references to IMS in common code and replace with dev-msi
>> remove dynamic allocation of platform-msi interrupts: no groups,no new msi list or list helpers
>> Create a generic dev-msi domain with and without interrupt remapping enabled.
>> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
> 
> I didn't dig into the details of irq handling to really check this,
> but the big picture of this is much more in line with what I would
> expect for this kind of ability.
> 
>> Link to previous discussions with Jason:
>> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/
>> The emulation part that can be moved to user space is very small due to the majority of the
>> emulations being control bits and need to reside in the kernel. We can revisit the necessity of
>> moving the small emulation part to userspace and required architectural changes at a later time.
> 
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
> 
>> The kernel can specify the requirements for these callback functions
>> (e.g., the driver is not expected to block, or not expected to take
>> a lock in the callback function).
> 
> I didn't notice any of this in the patch series? What is the calling
> context for the platform_msi_ops ? I think I already mentioned that
> ideally we'd need blocking/sleeping. The big selling point is that IMS
> allows this data to move off-chip, which means accessing it is no
> longer just an atomic write to some on-chip memory.
> 
> These details should be documented in the comment on top of
> platform_msi_ops
> 
> I'm actually a little confused how idxd_ims_irq_mask() manages this -
> I thought IRQ masking should be synchronous, shouldn't there at least be a
> flushing read to ensure that new MSI's are stopped and any in flight
> are flushed to the APIC?

You are right Jason. It's missing a flushing read.
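
Something along these lines, perhaps (purely an illustrative sketch; the register offset, the mask
bit, and the helper name are assumptions, not taken from the idxd patch), just to show the mask
write followed by a flushing read-back:

#include <linux/io.h>
#include <linux/types.h>

static void idxd_ims_mask_sketch(void __iomem *ims_entry)
{
	u32 ctrl;

	ctrl = ioread32(ims_entry + 0xc);	/* assumed per-entry control word */
	iowrite32(ctrl | 0x1, ims_entry + 0xc);	/* assumed vector mask bit */

	/* flushing read: ensure the mask write is posted to the device
	 * before masking is considered complete */
	ioread32(ims_entry + 0xc);
}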

> 
> Jason
>
Dan Williams July 21, 2020, 9:35 p.m. UTC | #5
On Tue, Jul 21, 2020 at 9:29 AM Greg KH <gregkh@linuxfoundation.org> wrote:
>
> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
> > v2:
>
> "RFC" to me means "I don't really think this is mergable, so I'm
> throwing it out there."  Which implies you know it needs more work
> before others should review it as you are not comfortable with it :(

There's full blown reviewed-by from me on the irq changes. The VFIO /
mdev changes looked ok to me, but I did not feel comfortable / did not
have time to sign-off on them. At the same time I did not see much to
be gained by keeping those internal. So "RFC" in this case is a bit
modest. It's more that the internal reviewer said this looks like it is going
in the right direction, but wants more community discussion on the
approach.

> So, back-of-the-queue you go...

Let's consider this not RFC in that context. The drivers/base/ pieces
have my review for you, the rest are dmaengine and vfio subsystem
concerns that could use some commentary.
Tian, Kevin July 21, 2020, 11:54 p.m. UTC | #6
> From: Jason Gunthorpe <jgg@mellanox.com>
> Sent: Wednesday, July 22, 2020 12:45 AM
> 
> > Link to previous discussions with Jason:
> > https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-
> 46d823ca0b5f@intel.com/
> > The emulation part that can be moved to user space is very small due to
> the majority of the
> > emulations being control bits and need to reside in the kernel. We can
> revisit the necessity of
> > moving the small emulation part to userspace and required architectural
> changes at a later time.
> 
> The point here is that you already have a user space interface for
> these queues that already has kernel support to twiddle the control
> bits. Generally I'd expect extending that existing kernel code to do
> the small bit more needed for mapping the queue through to PCI
> emulation to be smaller than the 2kloc of new code here to put all the
> emulation and support framework in the kernel, and exposes a lower
> attack surface of kernel code to the guest.
> 

We discussed in v1 why extending the user space interface is not a
strong motivation at the current stage:

https://lore.kernel.org/lkml/20200513124016.GG19158@mellanox.com/

In a nutshell, applications don't require raw WQ controllability as guest
kernel drivers may expect. Extending DSA user space interface to be another
passthrough interface just for virtualization needs is less compelling than
leveraging established VFIO/mdev framework (with the major merit that
existing user space VMMs just work w/o any change as long as they already
support VFIO uAPI).

And in this version we split previous 2kloc mdev patch into three parts:
[09] mdev framework and callbacks; [10] mmio/pci_cfg emulation; and
[11] handling of control commands. Only patch[10] is purely about
emulation (~500LOC), while the other two parts are tightly coupled to
physical resource management.

In last review you said that you didn't hard nak this approach and would
like to hear opinion from virtualization guys. In this version we CCed KVM
mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
hypervisor), etc. Let's see how they feel about this approach.

Thanks
Kevin
Dey, Megha July 22, 2020, 5:31 p.m. UTC | #7
On 7/21/2020 11:00 AM, Dave Jiang wrote:
> 
> 
> On 7/21/2020 9:45 AM, Jason Gunthorpe wrote:
>> On Tue, Jul 21, 2020 at 09:02:15AM -0700, Dave Jiang wrote:
>>> v2:
>>> IMS (now dev-msi):
>>> With recommendations from Jason/Thomas/Dan on making IMS more generic:
>>> Pass a non-pci generic device(struct device) for IMS management 
>>> instead of mdev
>>> Remove all references to mdev and symbol_get/put
>>> Remove all references to IMS in common code and replace with dev-msi
>>> remove dynamic allocation of platform-msi interrupts: no groups,no 
>>> new msi list or list helpers
>>> Create a generic dev-msi domain with and without interrupt remapping 
>>> enabled.
>>> Introduce dev_msi_domain_alloc_irqs and dev_msi_domain_free_irqs apis
>>
>> I didn't dig into the details of irq handling to really check this,
>> but the big picture of this is much more in line with what I would
>> expect for this kind of ability.
>>
>>> Link to previous discussions with Jason:
>>> https://lore.kernel.org/lkml/57296ad1-20fe-caf2-b83f-46d823ca0b5f@intel.com/ 
>>>
>>> The emulation part that can be moved to user space is very small due 
>>> to the majority of the
>>> emulations being control bits and need to reside in the kernel. We 
>>> can revisit the necessity of
>>> moving the small emulation part to userspace and required 
>>> architectural changes at a later time.
>>
>> The point here is that you already have a user space interface for
>> these queues that already has kernel support to twiddle the control
>> bits. Generally I'd expect extending that existing kernel code to do
>> the small bit more needed for mapping the queue through to PCI
>> emulation to be smaller than the 2kloc of new code here to put all the
>> emulation and support framework in the kernel, and exposes a lower
>> attack surface of kernel code to the guest.
>>
>>> The kernel can specify the requirements for these callback functions
>>> (e.g., the driver is not expected to block, or not expected to take
>>> a lock in the callback function).
>>
>> I didn't notice any of this in the patch series? What is the calling
>> context for the platform_msi_ops ? I think I already mentioned that
>> ideally we'd need blocking/sleeping. The big selling point is that IMS
>> allows this data to move off-chip, which means accessing it is no
>> longer just an atomic write to some on-chip memory.
>>
>> These details should be documented in the comment on top of
>> platform_msi_ops

so the platform_msi_ops are called from the same context as the
existing msi_ops, for instance; we are not adding anything new. I think
the above comment is a little misleading, I will remove it next time around.

Also, I thought even the current write to on-chip memory is not atomic...
could you let me know which piece of code you are referring to?
Since the driver gets to write to the off-chip memory, shouldn't it be
the driver's responsibility to call it from a sleeping/blocking context?

>>
>> I'm actually a little confused how idxd_ims_irq_mask() manages this -
>> I thought IRQ masking should be synchronous, shouldn't there at least 
>> be a
>> flushing read to ensure that new MSI's are stopped and any in flight
>> are flushed to the APIC?
> 
> You are right Jason. It's missing a flushing read.
> 
>>
>> Jason
>>
> .
Jason Gunthorpe July 22, 2020, 6:16 p.m. UTC | #8
On Wed, Jul 22, 2020 at 10:31:28AM -0700, Dey, Megha wrote:
> > > I didn't notice any of this in the patch series? What is the calling
> > > context for the platform_msi_ops ? I think I already mentioned that
> > > ideally we'd need blocking/sleeping. The big selling point is that IMS
> > > allows this data to move off-chip, which means accessing it is no
> > > longer just an atomic write to some on-chip memory.
> > > 
> > > These details should be documented in the comment on top of
> > > platform_msi_ops
> 
> so the platform_msi_ops care called from the same context as the existing
> msi_ops for instance, we are not adding anything new. I think the above
> comment is a little misleading I will remove it next time around.

If it is true that all calls are under driver control then I think it
would be good to document that. I actually don't know off hand if
mask/unmask are restricted like that

As this is an op a driver has to implement vs the arch, it probably needs
a bit more hand holding.

> Also, I thought even the current write to on-chip memory is not
> atomic..

The writel to the MSI-X table in MMIO memory is 'atomic'

Jason
Jason Gunthorpe July 24, 2020, 12:19 a.m. UTC | #9
On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote:
> In a nutshell, applications don't require raw WQ controllability as guest
> kernel drivers may expect. Extending DSA user space interface to be another
> passthrough interface just for virtualization needs is less compelling than
> leveraging established VFIO/mdev framework (with the major merit that
> existing user space VMMs just work w/o any change as long as they already
> support VFIO uAPI).

Sure, but the above is how the cover letter should have summarized
that discussion, not as "it is not much code difference"

> In last review you said that you didn't hard nak this approach and would
> like to hear opinion from virtualization guys. In this version we CCed KVM
> mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
> hypervisor), etc. Let's see how they feel about this approach.

Yes, the VFIO community should decide.

If we are doing emulation tasks in the kernel now, then I can think of
several nice semi-emulated mdevs to propose.

This will not be some one off, but the start of a widely copied
pattern.

Jason
Alex Williamson Aug. 6, 2020, 1:22 a.m. UTC | #10
On Thu, 23 Jul 2020 21:19:30 -0300
Jason Gunthorpe <jgg@mellanox.com> wrote:

> On Tue, Jul 21, 2020 at 11:54:49PM +0000, Tian, Kevin wrote:
> > In a nutshell, applications don't require raw WQ controllability as guest
> > kernel drivers may expect. Extending DSA user space interface to be another
> > passthrough interface just for virtualization needs is less compelling than
> > leveraging established VFIO/mdev framework (with the major merit that
> > existing user space VMMs just work w/o any change as long as they already
> > support VFIO uAPI).  
> 
> Sure, but the above is how the cover letter should have summarized
> that discussion, not as "it is not much code difference"
> 
> > In last review you said that you didn't hard nak this approach and would
> > like to hear opinion from virtualization guys. In this version we CCed KVM
> > mailing list, Paolo (VFIO/Qemu), Alex (VFIO), Samuel (Rust-VMM/Cloud
> > hypervisor), etc. Let's see how they feel about this approach.  
> 
> Yes, the VFIO community should decide.
> 
> If we are doing emulation tasks in the kernel now, then I can think of
> several nice semi-emulated mdevs to propose.
> 
> This will not be some one off, but the start of a widely copied
> pattern.

And that's definitely a concern, there should be a reason for
implementing device emulation in the kernel beyond an easy path to get
a device exposed up through a virtualization stack.  The entire idea of
mdev is the mediation of access to a device to make it safe for a user
and to fit within the vfio device API.  Mediation, emulation, and
virtualization can be hard to differentiate, and there is some degree of
emulation required to fill out the device API, for vfio-pci itself
included.  So I struggle with a specific measure of where to draw the
line, and also whose authority it is to draw that line.  I don't think
it's solely mine, that's something we need to decide as a community.

If you see this as an abuse of the framework, then let's identify those
specific issues and come up with a better approach.  As we've discussed
before, things like basic PCI config space emulation are acceptable
overhead and low risk (imo) and some degree of register emulation is
well within the territory of an mdev driver.  Drivers are accepting
some degree of increased attack surface by each addition of a uAPI and
the complexity of those uAPIs, but it seems largely a decision for
those drivers whether they're willing to take on that responsibility
and burden.

At some point, possibly in the near-ish future, we might have a
vfio-user interface with userspace vfio-over-socket servers that might
be able to consume existing uAPIs and offload some of this complexity
and emulation to userspace while still providing an easy path to insert
devices into the virtualization stack.  Hopefully if/when that comes
along, it would provide these sorts of drivers an opportunity to
offload some of the current overhead out to userspace, but I'm not sure
it's worth denying a mainline implementation now.  Thanks,

Alex
Jason Gunthorpe Aug. 7, 2020, 12:19 p.m. UTC | #11
On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:

> If you see this as an abuse of the framework, then let's identify those
> specific issues and come up with a better approach.  As we've discussed
> before, things like basic PCI config space emulation are acceptable
> overhead and low risk (imo) and some degree of register emulation is
> well within the territory of an mdev driver.  

What troubles me is that idxd already has a direct userspace interface
to its HW, and does userspace DMA. The purpose of this mdev is to
provide a second direct userspace interface that is a little different
and trivially plugs into the virtualization stack.

I don't think VFIO should be the only entry point to
virtualization. If we say the universe of devices doing user space DMA
must also implement a VFIO mdev to plug into virtualization, then it
will be a lot of mdevs.

I would prefer to see that the existing userspace interface have the
extra needed bits for virtualization (eg by having appropriate
internal kernel APIs to make this easy) and all the emulation to build
the synthetic PCI device be done in userspace.

Not only is it better for security, it keeps things to one device
driver per device..

Jason
Tian, Kevin Aug. 10, 2020, 7:32 a.m. UTC | #12
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 7, 2020 8:20 PM
> 
> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> 
> > If you see this as an abuse of the framework, then let's identify those
> > specific issues and come up with a better approach.  As we've discussed
> > before, things like basic PCI config space emulation are acceptable
> > overhead and low risk (imo) and some degree of register emulation is
> > well within the territory of an mdev driver.
> 
> What troubles me is that idxd already has a direct userspace interface
> to its HW, and does userspace DMA. The purpose of this mdev is to
> provide a second direct userspace interface that is a little different
> and trivially plugs into the virtualization stack.

No. Userspace DMA and subdevice passthrough (what mdev provides)
are two distinct usages IMO (at least in the idxd context), and this might
be the main divergence between us, thus let me put more words here. 
If we could reach consensus in this matter, which direction to go 
would be clearer.

First, a passthrough interface requires some unique requirements 
which are not commonly observed in an userspace DMA interface, e.g.:

- Tracking DMA dirty pages for live migration;
- A set of interfaces for using SVA inside guest;
	* PASID allocation/free (on some platforms);
	* bind/unbind guest mm/page table (nested translation);
	* invalidate IOMMU cache/iotlb for guest page table changes;
	* report page request from device to guest;
	* forward page response from guest to device;
- Configuring irqbypass for posted interrupt;
- ...

Second, a passthrough interface requires delegating raw controllability
of subdevice to guest driver, while the same delegation might not be
required for implementing an userspace DMA interface (especially for
modern devices which support SVA). For example, idxd allows following
setting per wq (guest driver may configure them in any combination):
	- put in dedicated or shared mode;
	- enable/disable SVA;
	- Associate guest-provided PASID to MSI/IMS entry;
	- set threshold;
	- allow/deny privileged access;
	- allocate/free interrupt handle (enlightened for guest);
	- collect error status;
	- ...

We plan to support idxd userspace DMA with SVA. The driver just needs 
to prepare a wq with a predefined configuration (e.g. shared, SVA, 
etc.), bind the process mm to IOMMU (non-nested) and then map 
the portal to userspace. The goal that userspace can do DMA to 
associated wq doesn't change the fact that the wq is still *owned* 
and *controlled* by kernel driver. However as far as passthrough 
is concerned, the wq is considered 'owned' by the guest driver thus 
we need an interface which can support low-level *controllability* 
from guest driver. It is sort of a mess in uAPI when mixing the
two together.

Based on above two reasons, we see distinct requirements between 
userspace DMA and passthrough interfaces, at least in idxd context 
(though other devices may have less distinction in-between). Therefore,
we didn't see the value/necessity of reinventing the wheel that mdev 
already handles well to evolve an simple application-oriented usespace 
DMA interface to a complex guest-driver-oriented passthrough interface. 
The complexity of doing so would incur far more kernel-side changes 
than the portion of emulation code that you've been concerned about...
 
> 
> I don't think VFIO should be the only entry point to
> virtualization. If we say the universe of devices doing user space DMA
> must also implement a VFIO mdev to plug into virtualization then it
> will be alot of mdevs.

Certainly VFIO will not be the only entry point, and this has to be a
case-by-case decision.  If an userspace DMA interface can be easily 
adapted to be a passthrough one, it might be the choice. But for idxd, 
we see mdev a much better fit here, given the big difference between 
what userspace DMA requires and what guest driver requires in this hw.

> 
> I would prefer to see that the existing userspace interface have the
> extra needed bits for virtualization (eg by having appropriate
> internal kernel APIs to make this easy) and all the emulation to build
> the synthetic PCI device be done in userspace.

In the end what decides the direction is the amount of changes that
we have to put in kernel, not whether we call it 'emulation'. For idxd,
adding special passthrough requirements (guest SVA, dirty tracking,
etc.) and raw controllability to the simple userspace DMA interface 
is for sure making kernel more complex than reusing the mdev
framework (plus some degree of emulation mockup behind). Not to
mention the merit of uAPI compatibility with mdev...

Thanks
Kevin
Alex Williamson Aug. 11, 2020, 5 p.m. UTC | #13
On Mon, 10 Aug 2020 07:32:24 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Friday, August 7, 2020 8:20 PM
> > 
> > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >   
> > > If you see this as an abuse of the framework, then let's identify those
> > > specific issues and come up with a better approach.  As we've discussed
> > > before, things like basic PCI config space emulation are acceptable
> > > overhead and low risk (imo) and some degree of register emulation is
> > > well within the territory of an mdev driver.  
> > 
> > What troubles me is that idxd already has a direct userspace interface
> > to its HW, and does userspace DMA. The purpose of this mdev is to
> > provide a second direct userspace interface that is a little different
> > and trivially plugs into the virtualization stack.  
> 
> No. Userspace DMA and subdevice passthrough (what mdev provides)
> are two distinct usages IMO (at least in idxd context). and this might 
> be the main divergence between us, thus let me put more words here. 
> If we could reach consensus in this matter, which direction to go 
> would be clearer.
> 
> First, a passthrough interface requires some unique requirements 
> which are not commonly observed in an userspace DMA interface, e.g.:
> 
> - Tracking DMA dirty pages for live migration;
> - A set of interfaces for using SVA inside guest;
> 	* PASID allocation/free (on some platforms);
> 	* bind/unbind guest mm/page table (nested translation);
> 	* invalidate IOMMU cache/iotlb for guest page table changes;
> 	* report page request from device to guest;
> 	* forward page response from guest to device;
> - Configuring irqbypass for posted interrupt;
> - ...
> 
> Second, a passthrough interface requires delegating raw controllability
> of subdevice to guest driver, while the same delegation might not be
> required for implementing an userspace DMA interface (especially for
> modern devices which support SVA). For example, idxd allows following
> setting per wq (guest driver may configure them in any combination):
> 	- put in dedicated or shared mode;
> 	- enable/disable SVA;
> 	- Associate guest-provided PASID to MSI/IMS entry;
> 	- set threshold;
> 	- allow/deny privileged access;
> 	- allocate/free interrupt handle (enlightened for guest);
> 	- collect error status;
> 	- ...
> 
> We plan to support idxd userspace DMA with SVA. The driver just needs 
> to prepare a wq with a predefined configuration (e.g. shared, SVA, 
> etc.), bind the process mm to IOMMU (non-nested) and then map 
> the portal to userspace. The goal that userspace can do DMA to 
> associated wq doesn't change the fact that the wq is still *owned* 
> and *controlled* by kernel driver. However as far as passthrough 
> is concerned, the wq is considered 'owned' by the guest driver thus 
> we need an interface which can support low-level *controllability* 
> from guest driver. It is sort of a mess in uAPI when mixing the
> two together.
> 
> Based on above two reasons, we see distinct requirements between 
> userspace DMA and passthrough interfaces, at least in idxd context 
> (though other devices may have less distinction in-between). Therefore,
> we didn't see the value/necessity of reinventing the wheel that mdev 
> already handles well to evolve an simple application-oriented usespace 
> DMA interface to a complex guest-driver-oriented passthrough interface. 
> The complexity of doing so would incur far more kernel-side changes 
> than the portion of emulation code that you've been concerned about...
>  
> > 
> > I don't think VFIO should be the only entry point to
> > virtualization. If we say the universe of devices doing user space DMA
> > must also implement a VFIO mdev to plug into virtualization then it
> > will be alot of mdevs.  
> 
> Certainly VFIO will not be the only entry point. and This has to be a 
> case-by-case decision.  If an userspace DMA interface can be easily 
> adapted to be a passthrough one, it might be the choice. But for idxd, 
> we see mdev a much better fit here, given the big difference between 
> what userspace DMA requires and what guest driver requires in this hw.
> 
> > 
> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.  
> 
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. For idxd,
> adding special passthrough requirements (guest SVA, dirty tracking,
> etc.) and raw controllability to the simple userspace DMA interface 
> is for sure making kernel more complex than reusing the mdev
> framework (plus some degree of emulation mockup behind). Not to
> mention the merit of uAPI compatibility with mdev...

I agree with a lot of this argument, exposing a device through a
userspace interface versus allowing user access to a device through a
userspace interface are different levels of abstraction and control.
In an ideal world, perhaps we could compose one from the other, but I
don't think the existence of one is proof that the other is redundant.
That's not to say that mdev/vfio isn't ripe for abuse in this space,
but I'm afraid the test for that abuse is probably much more subtle.

I'll also remind folks that LPC is coming up in just a couple short
weeks and this might be something we should discuss (virtually)
in-person.  uconf CfPs are currently open. </plug>   Thanks,

Alex
Tian, Kevin Aug. 12, 2020, 1:58 a.m. UTC | #14
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, August 12, 2020 1:01 AM
> 
> On Mon, 10 Aug 2020 07:32:24 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Friday, August 7, 2020 8:20 PM
> > >
> > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > >
> > > > If you see this as an abuse of the framework, then let's identify those
> > > > specific issues and come up with a better approach.  As we've discussed
> > > > before, things like basic PCI config space emulation are acceptable
> > > > overhead and low risk (imo) and some degree of register emulation is
> > > > well within the territory of an mdev driver.
> > >
> > > What troubles me is that idxd already has a direct userspace interface
> > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > provide a second direct userspace interface that is a little different
> > > and trivially plugs into the virtualization stack.
> >
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > 	* PASID allocation/free (on some platforms);
> > 	* bind/unbind guest mm/page table (nested translation);
> > 	* invalidate IOMMU cache/iotlb for guest page table changes;
> > 	* report page request from device to guest;
> > 	* forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > 	- put in dedicated or shared mode;
> > 	- enable/disable SVA;
> > 	- Associate guest-provided PASID to MSI/IMS entry;
> > 	- set threshold;
> > 	- allow/deny privileged access;
> > 	- allocate/free interrupt handle (enlightened for guest);
> > 	- collect error status;
> > 	- ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve an simple application-oriented usespace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> > >
> > > I don't think VFIO should be the only entry point to
> > > virtualization. If we say the universe of devices doing user space DMA
> > > must also implement a VFIO mdev to plug into virtualization then it
> > > will be alot of mdevs.
> >
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision.  If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice. But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.
> >
> > >
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > adding special passthrough requirements (guest SVA, dirty tracking,
> > etc.) and raw controllability to the simple userspace DMA interface
> > is for sure making kernel more complex than reusing the mdev
> > framework (plus some degree of emulation mockup behind). Not to
> > mention the merit of uAPI compatibility with mdev...
> 
> I agree with a lot of this argument, exposing a device through a
> userspace interface versus allowing user access to a device through a
> userspace interface are different levels of abstraction and control.
> In an ideal world, perhaps we could compose one from the other, but I
> don't think the existence of one is proof that the other is redundant.
> That's not to say that mdev/vfio isn't ripe for abuse in this space,
> but I'm afraid the test for that abuse is probably much more subtle.
> 
> I'll also remind folks that LPC is coming up in just a couple short
> weeks and this might be something we should discuss (virtually)
> in-person.  uconf CfPs are currently open. </plug>   Thanks,
> 

Yes, LPC is a good place to reach consensus. btw I saw there is 
already one VFIO topic called "device assignment/sub-assignment".
Do you think whether this can be covered under that topic, or
makes more sense to be a new one?

Thanks
Kevin
Alex Williamson Aug. 12, 2020, 2:36 a.m. UTC | #15
On Wed, 12 Aug 2020 01:58:00 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Alex Williamson <alex.williamson@redhat.com>
> > Sent: Wednesday, August 12, 2020 1:01 AM
> > 
> > On Mon, 10 Aug 2020 07:32:24 +0000
> > "Tian, Kevin" <kevin.tian@intel.com> wrote:
> >   
> > > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > > Sent: Friday, August 7, 2020 8:20 PM
> > > >
> > > > On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> > > >  
> > > > > If you see this as an abuse of the framework, then let's identify those
> > > > > specific issues and come up with a better approach.  As we've discussed
> > > > > before, things like basic PCI config space emulation are acceptable
> > > > > overhead and low risk (imo) and some degree of register emulation is
> > > > > well within the territory of an mdev driver.  
> > > >
> > > > What troubles me is that idxd already has a direct userspace interface
> > > > to its HW, and does userspace DMA. The purpose of this mdev is to
> > > > provide a second direct userspace interface that is a little different
> > > > and trivially plugs into the virtualization stack.  
> > >
> > > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > > are two distinct usages IMO (at least in idxd context). and this might
> > > be the main divergence between us, thus let me put more words here.
> > > If we could reach consensus in this matter, which direction to go
> > > would be clearer.
> > >
> > > First, a passthrough interface requires some unique requirements
> > > which are not commonly observed in an userspace DMA interface, e.g.:
> > >
> > > - Tracking DMA dirty pages for live migration;
> > > - A set of interfaces for using SVA inside guest;
> > > 	* PASID allocation/free (on some platforms);
> > > 	* bind/unbind guest mm/page table (nested translation);
> > > 	* invalidate IOMMU cache/iotlb for guest page table changes;
> > > 	* report page request from device to guest;
> > > 	* forward page response from guest to device;
> > > - Configuring irqbypass for posted interrupt;
> > > - ...
> > >
> > > Second, a passthrough interface requires delegating raw controllability
> > > of subdevice to guest driver, while the same delegation might not be
> > > required for implementing an userspace DMA interface (especially for
> > > modern devices which support SVA). For example, idxd allows following
> > > setting per wq (guest driver may configure them in any combination):
> > > 	- put in dedicated or shared mode;
> > > 	- enable/disable SVA;
> > > 	- Associate guest-provided PASID to MSI/IMS entry;
> > > 	- set threshold;
> > > 	- allow/deny privileged access;
> > > 	- allocate/free interrupt handle (enlightened for guest);
> > > 	- collect error status;
> > > 	- ...
> > >
> > > We plan to support idxd userspace DMA with SVA. The driver just needs
> > > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > > etc.), bind the process mm to IOMMU (non-nested) and then map
> > > the portal to userspace. The goal that userspace can do DMA to
> > > associated wq doesn't change the fact that the wq is still *owned*
> > > and *controlled* by kernel driver. However as far as passthrough
> > > is concerned, the wq is considered 'owned' by the guest driver thus
> > > we need an interface which can support low-level *controllability*
> > > from guest driver. It is sort of a mess in uAPI when mixing the
> > > two together.
> > >
> > > Based on above two reasons, we see distinct requirements between
> > > userspace DMA and passthrough interfaces, at least in idxd context
> > > (though other devices may have less distinction in-between). Therefore,
> > > we didn't see the value/necessity of reinventing the wheel that mdev
> > > already handles well to evolve an simple application-oriented usespace
> > > DMA interface to a complex guest-driver-oriented passthrough interface.
> > > The complexity of doing so would incur far more kernel-side changes
> > > than the portion of emulation code that you've been concerned about...
> > >  
> > > >
> > > > I don't think VFIO should be the only entry point to
> > > > virtualization. If we say the universe of devices doing user space DMA
> > > > must also implement a VFIO mdev to plug into virtualization then it
> > > > will be alot of mdevs.  
> > >
> > > Certainly VFIO will not be the only entry point. and This has to be a
> > > case-by-case decision.  If an userspace DMA interface can be easily
> > > adapted to be a passthrough one, it might be the choice. But for idxd,
> > > we see mdev a much better fit here, given the big difference between
> > > what userspace DMA requires and what guest driver requires in this hw.
> > >  
> > > >
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.  
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > > adding special passthrough requirements (guest SVA, dirty tracking,
> > > etc.) and raw controllability to the simple userspace DMA interface
> > > is for sure making kernel more complex than reusing the mdev
> > > framework (plus some degree of emulation mockup behind). Not to
> > > mention the merit of uAPI compatibility with mdev...  
> > 
> > I agree with a lot of this argument, exposing a device through a
> > userspace interface versus allowing user access to a device through a
> > userspace interface are different levels of abstraction and control.
> > In an ideal world, perhaps we could compose one from the other, but I
> > don't think the existence of one is proof that the other is redundant.
> > That's not to say that mdev/vfio isn't ripe for abuse in this space,
> > but I'm afraid the test for that abuse is probably much more subtle.
> > 
> > I'll also remind folks that LPC is coming up in just a couple short
> > weeks and this might be something we should discuss (virtually)
> > in-person.  uconf CfPs are currently open. </plug>   Thanks,
> >   
> 
> Yes, LPC is a good place to reach consensus. btw I saw there is 
> already one VFIO topic called "device assignment/sub-assignment".
> Do you think whether this can be covered under that topic, or
> makes more sense to be a new one?

All the things listed in the CFP are only potential topics to get ideas
flowing, there is currently no proposal to talk about sub-assignment.
I'd suggest submitting separate topics for each and if we run into time
constraints we can ask that they might be combined together.  Thanks,

Alex
Jason Wang Aug. 12, 2020, 3:28 a.m. UTC | #16
On 2020/8/10 下午3:32, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Friday, August 7, 2020 8:20 PM
>>
>> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
>>
>>> If you see this as an abuse of the framework, then let's identify those
>>> specific issues and come up with a better approach.  As we've discussed
>>> before, things like basic PCI config space emulation are acceptable
>>> overhead and low risk (imo) and some degree of register emulation is
>>> well within the territory of an mdev driver.
>> What troubles me is that idxd already has a direct userspace interface
>> to its HW, and does userspace DMA. The purpose of this mdev is to
>> provide a second direct userspace interface that is a little different
>> and trivially plugs into the virtualization stack.
> No. Userspace DMA and subdevice passthrough (what mdev provides)
> are two distinct usages IMO (at least in idxd context). and this might
> be the main divergence between us, thus let me put more words here.
> If we could reach consensus in this matter, which direction to go
> would be clearer.
>
> First, a passthrough interface requires some unique requirements
> which are not commonly observed in an userspace DMA interface, e.g.:
>
> - Tracking DMA dirty pages for live migration;
> - A set of interfaces for using SVA inside guest;
> 	* PASID allocation/free (on some platforms);
> 	* bind/unbind guest mm/page table (nested translation);
> 	* invalidate IOMMU cache/iotlb for guest page table changes;
> 	* report page request from device to guest;
> 	* forward page response from guest to device;
> - Configuring irqbypass for posted interrupt;
> - ...
>
> Second, a passthrough interface requires delegating raw controllability
> of subdevice to guest driver, while the same delegation might not be
> required for implementing an userspace DMA interface (especially for
> modern devices which support SVA). For example, idxd allows following
> setting per wq (guest driver may configure them in any combination):
> 	- put in dedicated or shared mode;
> 	- enable/disable SVA;
> 	- Associate guest-provided PASID to MSI/IMS entry;
> 	- set threshold;
> 	- allow/deny privileged access;
> 	- allocate/free interrupt handle (enlightened for guest);
> 	- collect error status;
> 	- ...
>
> We plan to support idxd userspace DMA with SVA. The driver just needs
> to prepare a wq with a predefined configuration (e.g. shared, SVA,
> etc.), bind the process mm to IOMMU (non-nested) and then map
> the portal to userspace. The goal that userspace can do DMA to
> associated wq doesn't change the fact that the wq is still *owned*
> and *controlled* by kernel driver. However as far as passthrough
> is concerned, the wq is considered 'owned' by the guest driver thus
> we need an interface which can support low-level *controllability*
> from guest driver. It is sort of a mess in uAPI when mixing the
> two together.


So for userspace drivers like DPDK, it can use both of the two uAPIs?


>
> Based on above two reasons, we see distinct requirements between
> userspace DMA and passthrough interfaces, at least in idxd context
> (though other devices may have less distinction in-between). Therefore,
> we didn't see the value/necessity of reinventing the wheel that mdev
> already handles well to evolve an simple application-oriented usespace
> DMA interface to a complex guest-driver-oriented passthrough interface.
> The complexity of doing so would incur far more kernel-side changes
> than the portion of emulation code that you've been concerned about...
>   
>> I don't think VFIO should be the only entry point to
>> virtualization. If we say the universe of devices doing user space DMA
>> must also implement a VFIO mdev to plug into virtualization then it
>> will be alot of mdevs.
> Certainly VFIO will not be the only entry point. and This has to be a
> case-by-case decision.


The problem is that if we tie all these controls to the VFIO uAPI, other 
subsystems like vDPA are likely to duplicate them. I wonder if there is a 
way to decouple vSVA from the VFIO uAPI?


>   If an userspace DMA interface can be easily
> adapted to be a passthrough one, it might be the choice.


It's not that easy even for VFIO, which required a lot of new uAPIs and 
infrastructure (e.g. mdev) to be invented.


> But for idxd,
> we see mdev a much better fit here, given the big difference between
> what userspace DMA requires and what guest driver requires in this hw.


A weak point of mdev is that it can't serve kernel subsystems other than 
VFIO. In that case, you need some other infrastructure (like [1]) to do 
this.

(For idxd, you probably don't need this, but it's pretty common for 
networking or storage devices.)

Thanks

[1] https://patchwork.kernel.org/patch/11280547/


>
>> I would prefer to see that the existing userspace interface have the
>> extra needed bits for virtualization (eg by having appropriate
>> internal kernel APIs to make this easy) and all the emulation to build
>> the synthetic PCI device be done in userspace.
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. For idxd,
> adding special passthrough requirements (guest SVA, dirty tracking,
> etc.) and raw controllability to the simple userspace DMA interface
> is for sure making kernel more complex than reusing the mdev
> framework (plus some degree of emulation mockup behind). Not to
> mention the merit of uAPI compatibility with mdev...
>
> Thanks
> Kevin
>
Tian, Kevin Aug. 12, 2020, 3:35 a.m. UTC | #17
> From: Alex Williamson <alex.williamson@redhat.com>
> Sent: Wednesday, August 12, 2020 10:36 AM
> On Wed, 12 Aug 2020 01:58:00 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
> 
> > >
> > > I'll also remind folks that LPC is coming up in just a couple short
> > > weeks and this might be something we should discuss (virtually)
> > > in-person.  uconf CfPs are currently open. </plug>   Thanks,
> > >
> >
> > Yes, LPC is a good place to reach consensus. btw I saw there is
> > already one VFIO topic called "device assignment/sub-assignment".
> > Do you think whether this can be covered under that topic, or
> > makes more sense to be a new one?
> 
> All the things listed in the CFP are only potential topics to get ideas
> flowing, there is currently no proposal to talk about sub-assignment.
> I'd suggest submitting separate topics for each and if we run into time
> constraints we can ask that they might be combined together.  Thanks,
> 

Done.
--
title: Criteria of using VFIO mdev (vs. userspace DMA)

Content:
VFIO mdev provides a framework for subdevice assignment and reuses 
existing VFIO uAPI to handle common passthrough-related requirements. 
However, a subdevice (e.g. an ADI as defined in Intel Scalable IOV) might 
not be a PCI endpoint (it may be just a work queue), and thus requires 
some degree of emulation/mediation in the kernel to fit into the VFIO 
device API. This raises a concern about putting emulation in the kernel, 
and about how to judge whether the mdev framework is being abused simply 
as an easy path to hook into the virtualization stack. An associated open 
question is how to differentiate mdev from userspace DMA frameworks (such 
as uacce), and whether building passthrough features on top of a 
userspace DMA framework is a better choice than using mdev. 

Thanks
Kevin
Tian, Kevin Aug. 12, 2020, 4:05 a.m. UTC | #18
> From: Jason Wang <jasowang@redhat.com>
> Sent: Wednesday, August 12, 2020 11:28 AM
> 
> 
> On 2020/8/10 3:32 PM, Tian, Kevin wrote:
> >> From: Jason Gunthorpe <jgg@nvidia.com>
> >> Sent: Friday, August 7, 2020 8:20 PM
> >>
> >> On Wed, Aug 05, 2020 at 07:22:58PM -0600, Alex Williamson wrote:
> >>
> >>> If you see this as an abuse of the framework, then let's identify those
> >>> specific issues and come up with a better approach.  As we've discussed
> >>> before, things like basic PCI config space emulation are acceptable
> >>> overhead and low risk (imo) and some degree of register emulation is
> >>> well within the territory of an mdev driver.
> >> What troubles me is that idxd already has a direct userspace interface
> >> to its HW, and does userspace DMA. The purpose of this mdev is to
> >> provide a second direct userspace interface that is a little different
> >> and trivially plugs into the virtualization stack.
> > No. Userspace DMA and subdevice passthrough (what mdev provides)
> > are two distinct usages IMO (at least in idxd context). and this might
> > be the main divergence between us, thus let me put more words here.
> > If we could reach consensus in this matter, which direction to go
> > would be clearer.
> >
> > First, a passthrough interface requires some unique requirements
> > which are not commonly observed in an userspace DMA interface, e.g.:
> >
> > - Tracking DMA dirty pages for live migration;
> > - A set of interfaces for using SVA inside guest;
> > 	* PASID allocation/free (on some platforms);
> > 	* bind/unbind guest mm/page table (nested translation);
> > 	* invalidate IOMMU cache/iotlb for guest page table changes;
> > 	* report page request from device to guest;
> > 	* forward page response from guest to device;
> > - Configuring irqbypass for posted interrupt;
> > - ...
> >
> > Second, a passthrough interface requires delegating raw controllability
> > of subdevice to guest driver, while the same delegation might not be
> > required for implementing an userspace DMA interface (especially for
> > modern devices which support SVA). For example, idxd allows following
> > setting per wq (guest driver may configure them in any combination):
> > 	- put in dedicated or shared mode;
> > 	- enable/disable SVA;
> > 	- Associate guest-provided PASID to MSI/IMS entry;
> > 	- set threshold;
> > 	- allow/deny privileged access;
> > 	- allocate/free interrupt handle (enlightened for guest);
> > 	- collect error status;
> > 	- ...
> >
> > We plan to support idxd userspace DMA with SVA. The driver just needs
> > to prepare a wq with a predefined configuration (e.g. shared, SVA,
> > etc.), bind the process mm to IOMMU (non-nested) and then map
> > the portal to userspace. The goal that userspace can do DMA to
> > associated wq doesn't change the fact that the wq is still *owned*
> > and *controlled* by kernel driver. However as far as passthrough
> > is concerned, the wq is considered 'owned' by the guest driver thus
> > we need an interface which can support low-level *controllability*
> > from guest driver. It is sort of a mess in uAPI when mixing the
> > two together.
> 
> 
> So for userspace drivers like DPDK, it can use both of the two uAPIs?

yes.

> 
> 
> >
> > Based on above two reasons, we see distinct requirements between
> > userspace DMA and passthrough interfaces, at least in idxd context
> > (though other devices may have less distinction in-between). Therefore,
> > we didn't see the value/necessity of reinventing the wheel that mdev
> > already handles well to evolve an simple application-oriented usespace
> > DMA interface to a complex guest-driver-oriented passthrough interface.
> > The complexity of doing so would incur far more kernel-side changes
> > than the portion of emulation code that you've been concerned about...
> >
> >> I don't think VFIO should be the only entry point to
> >> virtualization. If we say the universe of devices doing user space DMA
> >> must also implement a VFIO mdev to plug into virtualization then it
> >> will be alot of mdevs.
> > Certainly VFIO will not be the only entry point. and This has to be a
> > case-by-case decision.
> 
> 
> The problem is that if we tie all controls via VFIO uAPI, the other
> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> way to decouple the vSVA out of VFIO uAPI?

vSVA is a per-device (either pdev or mdev) feature and thus should 
naturally be managed by its device driver (VFIO or vDPA). From this
angle some duplication is inevitable, given that VFIO and vDPA are
orthogonal passthrough frameworks. Within the kernel the majority of
vSVA handling is done by the IOMMU and IOASID modules, thus most of
the logic is shared.

> 
> 
> >   If an userspace DMA interface can be easily
> > adapted to be a passthrough one, it might be the choice.
> 
> 
> It's not that easy even for VFIO which requires a lot of new uAPIs and
> infrastructures(e.g mdev) to be invented.
> 
> 
> > But for idxd,
> > we see mdev a much better fit here, given the big difference between
> > what userspace DMA requires and what guest driver requires in this hw.
> 
> 
> A weak point for mdev is that it can't serve kernel subsystem other than
> VFIO. In this case, you need some other infrastructures (like [1]) to do
> this.

mdev is not exclusive of kernel usages. It's perfectly fine for a driver
to reserve some work queues for host usage while wrapping others
into mdevs.

Thanks
Kevin

> 
> (For idxd, you probably don't need this, but it's pretty common in the
> case of networking or storage device.)
> 
> Thanks
> 
> [1] https://patchwork.kernel.org/patch/11280547/
> 
> 
> >
> >> I would prefer to see that the existing userspace interface have the
> >> extra needed bits for virtualization (eg by having appropriate
> >> internal kernel APIs to make this easy) and all the emulation to build
> >> the synthetic PCI device be done in userspace.
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'. For idxd,
> > adding special passthrough requirements (guest SVA, dirty tracking,
> > etc.) and raw controllability to the simple userspace DMA interface
> > is for sure making kernel more complex than reusing the mdev
> > framework (plus some degree of emulation mockup behind). Not to
> > mention the merit of uAPI compatibility with mdev...
> >
> > Thanks
> > Kevin
> >
Jason Wang Aug. 13, 2020, 4:33 a.m. UTC | #19
On 2020/8/12 12:05 PM, Tian, Kevin wrote:
>> The problem is that if we tie all controls via VFIO uAPI, the other
>> subsystem like vDPA is likely to duplicate them. I wonder if there is a
>> way to decouple the vSVA out of VFIO uAPI?
> vSVA is a per-device (either pdev or mdev) feature thus naturally should
> be managed by its device driver (VFIO or vDPA). From this angle some
> duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> frameworks. Within the kernel the majority of vSVA handling is done by
> IOMMU and IOASID modules thus most logic are shared.


So why not introduce vSVA uAPI at IOMMU or IOASID layer?


>
>>>    If an userspace DMA interface can be easily
>>> adapted to be a passthrough one, it might be the choice.
>> It's not that easy even for VFIO which requires a lot of new uAPIs and
>> infrastructures(e.g mdev) to be invented.
>>
>>
>>> But for idxd,
>>> we see mdev a much better fit here, given the big difference between
>>> what userspace DMA requires and what guest driver requires in this hw.
>> A weak point for mdev is that it can't serve kernel subsystem other than
>> VFIO. In this case, you need some other infrastructures (like [1]) to do
>> this.
> mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> to reserve some work queues for host usages, while wrapping others
> into mdevs.


I meant you may want slices to be independent devices from the kernel's 
point of view:

E.g. for ethernet devices, you may want 10K mdevs to be passed to guests.

Similarly, you may want 10K net devices which are connected to the kernel 
networking subsystem.

In this case it's not simply reserving queues; you need some other 
type of device abstraction. There could be some kind of duplication 
between this and mdev.

Thanks


>
> Thanks
> Kevin
>
Tian, Kevin Aug. 13, 2020, 5:26 a.m. UTC | #20
> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, August 13, 2020 12:34 PM
> 
> 
> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> >> The problem is that if we tie all controls via VFIO uAPI, the other
> >> subsystem like vDPA is likely to duplicate them. I wonder if there is a
> >> way to decouple the vSVA out of VFIO uAPI?
> > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > be managed by its device driver (VFIO or vDPA). From this angle some
> > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > frameworks. Within the kernel the majority of vSVA handling is done by
> > IOMMU and IOASID modules thus most logic are shared.
> 
> 
> So why not introduce vSVA uAPI at IOMMU or IOASID layer?

One may ask a similar question: why doesn't the IOMMU expose map/unmap
as uAPI...

> 
> 
> >
> >>>    If an userspace DMA interface can be easily
> >>> adapted to be a passthrough one, it might be the choice.
> >> It's not that easy even for VFIO which requires a lot of new uAPIs and
> >> infrastructures(e.g mdev) to be invented.
> >>
> >>
> >>> But for idxd,
> >>> we see mdev a much better fit here, given the big difference between
> >>> what userspace DMA requires and what guest driver requires in this hw.
> >> A weak point for mdev is that it can't serve kernel subsystem other than
> >> VFIO. In this case, you need some other infrastructures (like [1]) to do
> >> this.
> > mdev is not exclusive from kernel usages. It's perfectly fine for a driver
> > to reserve some work queues for host usages, while wrapping others
> > into mdevs.
> 
> 
> I meant you may want slices to be an independent device from the kernel
> point of view:
> 
> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
> 
> Similarly, you may want 10K net devices which is connected to the kernel
> networking subsystems.
> 
> In this case it's not simply reserving queues but you need some other
> type of device abstraction. There could be some kind of duplication
> between this and mdev.
> 

Yes, some abstraction is required, but isn't that something the driver
should care about rather than the mdev framework itself? If the driver
reports the same set of resources to both mdev and networking, it needs
to make sure that when a resource is claimed through one interface it is
marked in-use in the other. E.g. each mdev type includes an
available_instances attribute; the driver could report 10K available
instances initially and then update it to 5K when the other 5K are used
for net devices later.
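
To make that concrete, here is a minimal sketch of such driver-side
accounting (purely illustrative, not idxd code: the pool structure and
helper names are made up, and the sysfs show callback just follows the
usual mdev type-attribute shape):

#include <linux/device.h>
#include <linux/kernel.h>
#include <linux/mdev.h>
#include <linux/spinlock.h>

/* One pool of slices shared by the mdev path and an in-kernel user. */
struct slice_pool {
	spinlock_t lock;
	unsigned int total;	/* e.g. 10K slices in the device */
	unsigned int used;	/* claimed through either interface */
};

static struct slice_pool pool = {
	.lock  = __SPIN_LOCK_UNLOCKED(pool.lock),
	.total = 10000,
};

/* Called from both the mdev create path and the kernel (netdev) path. */
static int slice_claim(struct slice_pool *p)
{
	int ret = -ENOSPC;

	spin_lock(&p->lock);
	if (p->used < p->total) {
		p->used++;
		ret = 0;
	}
	spin_unlock(&p->lock);
	return ret;
}

static void slice_release(struct slice_pool *p)
{
	spin_lock(&p->lock);
	p->used--;
	spin_unlock(&p->lock);
}

/* mdev type attribute: always reports what is left after *all* users. */
static ssize_t available_instances_show(struct kobject *kobj,
					struct device *dev, char *buf)
{
	unsigned int avail;

	spin_lock(&pool.lock);
	avail = pool.total - pool.used;
	spin_unlock(&pool.lock);

	return sprintf(buf, "%u\n", avail);
}
MDEV_TYPE_ATTR_RO(available_instances);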

Mdev definitely has its usage limitations. Some may be improved 
in the future, some may not. But those are distracting from the
original purpose of this thread (mdev vs. userspace DMA) and are
better discussed in other places, e.g. LPC... 

Thanks
Kevin
Jason Wang Aug. 13, 2020, 6:01 a.m. UTC | #21
On 2020/8/13 1:26 PM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, August 13, 2020 12:34 PM
>>
>>
>> On 2020/8/12 12:05 PM, Tian, Kevin wrote:
>>>> The problem is that if we tie all controls via VFIO uAPI, the other
>>>> subsystem like vDPA is likely to duplicate them. I wonder if there is a
>>>> way to decouple the vSVA out of VFIO uAPI?
>>> vSVA is a per-device (either pdev or mdev) feature thus naturally should
>>> be managed by its device driver (VFIO or vDPA). From this angle some
>>> duplication is inevitable given VFIO and vDPA are orthogonal passthrough
>>> frameworks. Within the kernel the majority of vSVA handling is done by
>>> IOMMU and IOASID modules thus most logic are shared.
>>
>> So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> One may ask a similar question why IOMMU doesn't expose map/unmap
> as uAPI...


I think this is probably a good idea as well. If there's anything missing 
in the infrastructure, we can invent it. Besides vhost-vDPA, there are 
other subsystems that relay their uAPI to the IOMMU API. Duplicating 
uAPIs is usually a hint of code duplication. Simple map/unmap could 
be easy, but the vSVA uAPI is much more complicated.


>
>>
>>>>>     If an userspace DMA interface can be easily
>>>>> adapted to be a passthrough one, it might be the choice.
>>>> It's not that easy even for VFIO which requires a lot of new uAPIs and
>>>> infrastructures(e.g mdev) to be invented.
>>>>
>>>>
>>>>> But for idxd,
>>>>> we see mdev a much better fit here, given the big difference between
>>>>> what userspace DMA requires and what guest driver requires in this hw.
>>>> A weak point for mdev is that it can't serve kernel subsystem other than
>>>> VFIO. In this case, you need some other infrastructures (like [1]) to do
>>>> this.
>>> mdev is not exclusive from kernel usages. It's perfectly fine for a driver
>>> to reserve some work queues for host usages, while wrapping others
>>> into mdevs.
>>
>> I meant you may want slices to be an independent device from the kernel
>> point of view:
>>
>> E.g for ethernet devices, you may want 10K mdevs to be passed to guest.
>>
>> Similarly, you may want 10K net devices which is connected to the kernel
>> networking subsystems.
>>
>> In this case it's not simply reserving queues but you need some other
>> type of device abstraction. There could be some kind of duplication
>> between this and mdev.
>>
> yes, some abstraction required but isn't it what the driver should
> care about instead of mdev framework itself?


With mdev you present a "PCI" device, but what kind of device does it try 
to present to the kernel? If it's still PCI, there's duplication with mdev; 
if it's something new, maybe we can switch to that API.


> If the driver reports
> the same set of resource to both mdev and networking, it needs to
> make sure when the resource is claimed in one interface then it
> should be marked in-use in another. e.g. each mdev includes a
> available_intances attribute. the driver could report 10k available
> instances initially and then update it to 5K when another 5K is used
> for net devices later.


Right but this probably means you need another management layer under mdev.


>
> Mdev definitely has its usage limitations. Some may be improved
> in the future, some may not. But those are distracting from the
> original purpose of this thread (mdev vs. userspace DMA) and better
> be discussed in other places e.g. LPC...


Ok.

Thanks


>
> Thanks
> Kevin
Jason Gunthorpe Aug. 14, 2020, 1:23 p.m. UTC | #22
On Thu, Aug 13, 2020 at 02:01:58PM +0800, Jason Wang wrote:
> 
> On 2020/8/13 1:26 PM, Tian, Kevin wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Thursday, August 13, 2020 12:34 PM
> > > 
> > > 
> > > On 2020/8/12 12:05 PM, Tian, Kevin wrote:
> > > > > The problem is that if we tie all controls via VFIO uAPI, the other
> > > > > subsystem like vDPA is likely to duplicate them. I wonder if there is a
> > > > > way to decouple the vSVA out of VFIO uAPI?
> > > > vSVA is a per-device (either pdev or mdev) feature thus naturally should
> > > > be managed by its device driver (VFIO or vDPA). From this angle some
> > > > duplication is inevitable given VFIO and vDPA are orthogonal passthrough
> > > > frameworks. Within the kernel the majority of vSVA handling is done by
> > > > IOMMU and IOASID modules thus most logic are shared.
> > > 
> > > So why not introduce vSVA uAPI at IOMMU or IOASID layer?
> > One may ask a similar question why IOMMU doesn't expose map/unmap
> > as uAPI...
> 
> 
> I think this is probably a good idea as well. If there's anything missed in
> the infrastructure, we can invent. Besides vhost-vDPA, there are other
> subsystems that relaying their uAPI to IOMMU API. Duplicating uAPIs is
> usually a hint of the codes duplication. Simple map/unmap could be easy but
> vSVA uAPI is much more complicated.

A way to create the vSVA objects unrelated to VFIO and then pass those
objects for device use into existing uAPIs, to me, makes a lot of
sense.

You should not have to use the VFIO API just to get vSVA.

Or stated another way, the existing user driver should be able to get
a PASID from the vSVA components as well as just create a PASID from
the local mm_struct.

The same basic argument goes for all the points: the issue is really
that the only uAPI we have for this stuff is under VFIO, and the better
solution is to disaggregate that uAPI, not to try and make everything
pretend to be a VFIO device.

Jason
Jason Gunthorpe Aug. 14, 2020, 1:35 p.m. UTC | #23
On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:

> > I would prefer to see that the existing userspace interface have the
> > extra needed bits for virtualization (eg by having appropriate
> > internal kernel APIs to make this easy) and all the emulation to build
> > the synthetic PCI device be done in userspace.
> 
> In the end what decides the direction is the amount of changes that
> we have to put in kernel, not whether we call it 'emulation'. 

No, this is not right. The decision should be based on what will end
up more maintainable in the long run.

Yes, it would be more code to dis-aggregate some of the things
currently only bundled as uAPI inside VFIO (e.g. your vSVA argument
above), but once it is disaggregated the maintainability of the whole
solution will be better overall, and more drivers will be able to use
this functionality.

Jason
Tian, Kevin Aug. 17, 2020, 2:12 a.m. UTC | #24
> From: Jason Gunthorpe
> Sent: Friday, August 14, 2020 9:35 PM
> 
> On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> 
> > > I would prefer to see that the existing userspace interface have the
> > > extra needed bits for virtualization (eg by having appropriate
> > > internal kernel APIs to make this easy) and all the emulation to build
> > > the synthetic PCI device be done in userspace.
> >
> > In the end what decides the direction is the amount of changes that
> > we have to put in kernel, not whether we call it 'emulation'.
> 
> No, this is not right. The decision should be based on what will end
> up more maintable in the long run.
> 
> Yes it would be more code to dis-aggregate some of the things
> currently only bundled as uAPI inside VFIO (eg your vSVA argument
> above) but once it is disaggregated the maintability of the whole
> solution will be better overall, and more drivers will be able to use
> this functionality.
> 

Disaggregation is an orthogonal topic to the main divergence in 
this thread, which is passthrough vs. userspace DMA. I gave a detailed
explanation of the difference between the two in my last reply.
The possibility of dis-aggregating something between passthrough
frameworks (e.g. VFIO and vDPA) is not a reason for growing 
every userspace DMA framework into a passthrough framework.
Doing that would instead hurt maintainability in general...

Thanks
Kevin
Tian, Kevin Aug. 17, 2020, 2:24 a.m. UTC | #25
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, August 14, 2020 9:24 PM
> 
> The same basic argument goes for all the points - the issue is really
> the only uAPI we have for this stuff is under VFIO, and the better
> solution is to disagregate that uAPI, not to try and make everything
> pretend to be a VFIO device.
> 

Nobody is proposing to make everything VFIO. There must be some
criteria, which can be brainstormed at LPC. But the opposite also holds: 
the fact that we should not make everything VFIO doesn't prohibit
anyone from using it. There is a clear difference between the 
passthrough and userspace DMA requirements in the idxd context, and we
see good reasons to use VFIO for our passthrough requirements.


Thanks
Kevin
Jason Gunthorpe Aug. 18, 2020, 12:43 a.m. UTC | #26
On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe
> > Sent: Friday, August 14, 2020 9:35 PM
> > 
> > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > 
> > > > I would prefer to see that the existing userspace interface have the
> > > > extra needed bits for virtualization (eg by having appropriate
> > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > the synthetic PCI device be done in userspace.
> > >
> > > In the end what decides the direction is the amount of changes that
> > > we have to put in kernel, not whether we call it 'emulation'.
> > 
> > No, this is not right. The decision should be based on what will end
> > up more maintable in the long run.
> > 
> > Yes it would be more code to dis-aggregate some of the things
> > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > above) but once it is disaggregated the maintability of the whole
> > solution will be better overall, and more drivers will be able to use
> > this functionality.
> > 
> 
> Disaggregation is an orthogonal topic to the main divergence in 
> this thread, which is passthrough vs. userspace DMA. I gave detail
> explanation about the difference between the two in last reply.

You said the first issue was related to SVA, which is understandable
because we have no SVA uAPIs outside VFIO.

Imagine if we had some /dev/sva that provided this API and user space
DMA drivers could simply accept an FD and work properly. It is not
such a big leap anymore, nor is it customized code in idxd.
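
Just to sketch the shape of that idea (purely hypothetical: neither
/dev/sva nor any of these ioctls exist, and the wq device node below is
made up too), the flow from a plain userspace DMA driver might look
like:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Made-up ioctl numbers, for illustration only. */
#define SVA_BIND_MM	_IO('S', 0)		/* bind current mm, return a PASID */
#define WQ_ATTACH_SVA	_IOW('W', 0, int)	/* hand the sva fd to the wq */

int main(void)
{
	int sva_fd, wq_fd, pasid;

	/* Generic SVA object, created without any VFIO involvement. */
	sva_fd = open("/dev/sva", O_RDWR);		/* hypothetical node */
	if (sva_fd < 0)
		return 1;

	pasid = ioctl(sva_fd, SVA_BIND_MM);		/* hypothetical ioctl */
	if (pasid < 0)
		return 1;

	/* The existing device uAPI consumes the object as a plain fd. */
	wq_fd = open("/dev/wq0", O_RDWR);		/* made-up device node */
	if (wq_fd < 0)
		return 1;

	if (ioctl(wq_fd, WQ_ATTACH_SVA, &sva_fd))	/* hypothetical ioctl */
		return 1;

	/* ... mmap the portal and submit descriptors tagged with pasid ... */
	(void)pasid;

	close(wq_fd);
	close(sva_fd);
	return 0;
}

The device-specific uAPI then stays small, and the same SVA object could
in principle be consumed by VFIO, vDPA or a plain userspace driver.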

The other passthrough issue was IRQ, which, last time I looked, was
fairly trivial to connect via interrupt remapping in the kernel, or
could be made extremely trivial.

The last seemed to be a concern that the current uAPI for idxd was
lacking some idxd-specific features, which seems like quite a weak
reason to use VFIO.

Jason
Tian, Kevin Aug. 18, 2020, 1:09 a.m. UTC | #27
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 18, 2020 8:44 AM
> 
> On Mon, Aug 17, 2020 at 02:12:44AM +0000, Tian, Kevin wrote:
> > > From: Jason Gunthorpe
> > > Sent: Friday, August 14, 2020 9:35 PM
> > >
> > > On Mon, Aug 10, 2020 at 07:32:24AM +0000, Tian, Kevin wrote:
> > >
> > > > > I would prefer to see that the existing userspace interface have the
> > > > > extra needed bits for virtualization (eg by having appropriate
> > > > > internal kernel APIs to make this easy) and all the emulation to build
> > > > > the synthetic PCI device be done in userspace.
> > > >
> > > > In the end what decides the direction is the amount of changes that
> > > > we have to put in kernel, not whether we call it 'emulation'.
> > >
> > > No, this is not right. The decision should be based on what will end
> > > up more maintable in the long run.
> > >
> > > Yes it would be more code to dis-aggregate some of the things
> > > currently only bundled as uAPI inside VFIO (eg your vSVA argument
> > > above) but once it is disaggregated the maintability of the whole
> > > solution will be better overall, and more drivers will be able to use
> > > this functionality.
> > >
> >
> > Disaggregation is an orthogonal topic to the main divergence in
> > this thread, which is passthrough vs. userspace DMA. I gave detail
> > explanation about the difference between the two in last reply.
> 
> You said the first issue was related to SVA, which is understandable
> because we have no SVA uAPIs outside VFIO.
> 
> Imagine if we had some /dev/sva that provided this API and user space
> DMA drivers could simply accept an FD and work properly. It is not
> such a big leap anymore, nor is it customized code in idxd.
> 
> The other pass through issue was IRQ, which last time I looked, was
> fairly trivial to connect via interrupt remapping in the kernel, or
> could be made extremely trivial.
> 
> The last, seemed to be a concern that the current uapi for idxd was
> lacking seems idxd specific features, which seems like an quite weak
> reason to use VFIO.
> 

The difference in my reply is not just about the implementation gap
of growing a userspace DMA framework into a passthrough framework.
My real point is about the different goals that each wants to achieve.
Userspace DMA is purely about allowing userspace to directly access
the portal and do DMA, but the wq configuration is always under the 
kernel driver's control. On the other hand, passthrough means delegating 
full control of the wq to the guest, plus the associated mock-up (live 
migration, vSVA, posted interrupts, etc.) needed for that to work. I 
really didn't see the value of mixing them together when there is 
already a good candidate to handle passthrough...
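
For reference, the userspace DMA side has roughly the following shape on
the kernel side (a simplified sketch only, not the actual idxd code: the
structures and function names are made up, error paths are trimmed, and
it assumes the current host SVA API):

#include <linux/cdev.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/iommu.h>
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct my_wq {				/* stand-in for the driver's wq state */
	struct device *dev;
	struct cdev cdev;
	unsigned long portal_pfn;	/* pfn of the MMIO submission portal */
};

struct wq_ctx {
	struct my_wq *wq;
	struct iommu_sva *sva;
	int pasid;
};

static int wq_cdev_open(struct inode *inode, struct file *filp)
{
	struct my_wq *wq = container_of(inode->i_cdev, struct my_wq, cdev);
	struct wq_ctx *ctx;

	ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
	if (!ctx)
		return -ENOMEM;

	/* Non-nested bind of the caller's own mm; the host owns the PASID. */
	ctx->sva = iommu_sva_bind_device(wq->dev, current->mm, NULL);
	if (IS_ERR(ctx->sva)) {
		kfree(ctx);
		return PTR_ERR(ctx->sva);
	}
	ctx->wq = wq;
	ctx->pasid = iommu_sva_get_pasid(ctx->sva);
	filp->private_data = ctx;
	return 0;
}

static int wq_cdev_mmap(struct file *filp, struct vm_area_struct *vma)
{
	struct wq_ctx *ctx = filp->private_data;

	/* Only the submission portal is exposed; wq config stays in kernel. */
	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
	return io_remap_pfn_range(vma, vma->vm_start, ctx->wq->portal_pfn,
				  PAGE_SIZE, vma->vm_page_prot);
}

static int wq_cdev_release(struct inode *inode, struct file *filp)
{
	struct wq_ctx *ctx = filp->private_data;

	iommu_sva_unbind_device(ctx->sva);
	kfree(ctx);
	return 0;
}

Nothing in this path touches how the wq itself is configured, which is
the point.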

Thanks
Kevin
Jason Gunthorpe Aug. 18, 2020, 11:50 a.m. UTC | #28
On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> The difference in my reply is not just about the implementation gap
> of growing a userspace DMA framework to a passthrough framework.
> My real point is about the different goals that each wants to achieve.
> Userspace DMA is purely about allowing userspace to directly access
> the portal and do DMA, but the wq configuration is always under kernel 
> driver's control. On the other hand, passthrough means delegating full 
> control of the wq to the guest and then associated mock-up (live migration,
> vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> value of mixing them together when there is already a good candidate
> to handle passthrough...

In Linux a 'VM' and virtualization has always been a normal system
process that uses a few extra kernel features. This has been more or
less the cornerstone of that design since the start.

In that view it doesn't make any sense to say that uAPI from idxd that
is useful for virtualization somehow doesn't belong as part of the
standard uAPI.

Especially when it is such a small detail like what APIs are used to
configure the wq.

For instance, what about suspend/resume of containers using idxd?
Wouldn't you want to have the same basic approach of controlling the
wq from userspace that virtualization uses?

Jason
Paolo Bonzini Aug. 18, 2020, 4:27 p.m. UTC | #29
On 18/08/20 13:50, Jason Gunthorpe wrote:
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?

The difference is that VFIO more or less standardizes the approach you
use for live migration.  With another interface you'd have to come up
with something for every driver, and add support in CRIU for every
driver as well.

Paolo
Jason Gunthorpe Aug. 18, 2020, 4:49 p.m. UTC | #30
On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> On 18/08/20 13:50, Jason Gunthorpe wrote:
> > For instance, what about suspend/resume of containers using idxd?
> > Wouldn't you want to have the same basic approach of controlling the
> > wq from userspace that virtualization uses?
> 
> The difference is that VFIO more or less standardizes the approach you
> use for live migration.  With another interface you'd have to come up
> with something for every driver, and add support in CRIU for every
> driver as well.

VFIO is very unsuitable for use as a general userspace interface. It is
only 1:1 with a single process and just can't absorb what the existing
idxd userspace is doing.

So VFIO is already not a solution for normal userspace idxd where CRIU
becomes interesting. Not sure what you are trying to say?

My point was the opposite, if you want to enable CRIU for idxd then
you probably need all the same stuff as for qemu/VFIO except in the
normal idxd user API.

Jason
Paolo Bonzini Aug. 18, 2020, 5:05 p.m. UTC | #31
On 18/08/20 18:49, Jason Gunthorpe wrote:
> On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
>> On 18/08/20 13:50, Jason Gunthorpe wrote:
>>> For instance, what about suspend/resume of containers using idxd?
>>> Wouldn't you want to have the same basic approach of controlling the
>>> wq from userspace that virtualization uses?
>>
>> The difference is that VFIO more or less standardizes the approach you
>> use for live migration.  With another interface you'd have to come up
>> with something for every driver, and add support in CRIU for every
>> driver as well.
> 
> VFIO is very unsuitable for use as some general userspace. It only 1:1
> with a single process and just can't absorb what the existing idxd
> userspace is doing.

The point of mdev is that it's not 1:1 anymore.

Paolo

> So VFIO is already not a solution for normal userspace idxd where CRIU
> becomes interesting. Not sure what you are trying to say?
> 
> My point was the opposite, if you want to enable CRIU for idxd then
> you probably need all the same stuff as for qemu/VFIO except in the
> normal idxd user API.
> 
> Jason
>
Jason Gunthorpe Aug. 18, 2020, 5:18 p.m. UTC | #32
On Tue, Aug 18, 2020 at 07:05:16PM +0200, Paolo Bonzini wrote:
> On 18/08/20 18:49, Jason Gunthorpe wrote:
> > On Tue, Aug 18, 2020 at 06:27:21PM +0200, Paolo Bonzini wrote:
> >> On 18/08/20 13:50, Jason Gunthorpe wrote:
> >>> For instance, what about suspend/resume of containers using idxd?
> >>> Wouldn't you want to have the same basic approach of controlling the
> >>> wq from userspace that virtualization uses?
> >>
> >> The difference is that VFIO more or less standardizes the approach you
> >> use for live migration.  With another interface you'd have to come up
> >> with something for every driver, and add support in CRIU for every
> >> driver as well.
> > 
> > VFIO is very unsuitable for use as some general userspace. It only 1:1
> > with a single process and just can't absorb what the existing idxd
> > userspace is doing.
> 
> The point of mdev is that it's not 1:1 anymore.

The lifecycle model of mdev is not compatible with how something like
idxd works; it needs a multi-open cdev.

Jason
Tian, Kevin Aug. 19, 2020, 7:29 a.m. UTC | #33
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, August 18, 2020 7:50 PM
> 
> On Tue, Aug 18, 2020 at 01:09:01AM +0000, Tian, Kevin wrote:
> > The difference in my reply is not just about the implementation gap
> > of growing a userspace DMA framework to a passthrough framework.
> > My real point is about the different goals that each wants to achieve.
> > Userspace DMA is purely about allowing userspace to directly access
> > the portal and do DMA, but the wq configuration is always under kernel
> > driver's control. On the other hand, passthrough means delegating full
> > control of the wq to the guest and then associated mock-up (live migration,
> > vSVA, posted interrupt, etc.) for that to work. I really didn't see the
> > value of mixing them together when there is already a good candidate
> > to handle passthrough...
> 
> In Linux a 'VM' and virtualization has always been a normal system
> process that uses a few extra kernel features. This has been more or
> less the cornerstone of that design since the start.
> 
> In that view it doesn't make any sense to say that uAPI from idxd that
> is useful for virtualization somehow doesn't belong as part of the
> standard uAPI.

The point is that we already have a more standard uAPI (VFIO) which
is unified and vendor-agnostic to userspace. Creating an idxd-specific
uAPI to absorb similar requirements that VFIO already handles is not 
compelling, and instead causes more trouble for Qemu or other VMMs, 
as they need to deal with every such driver uAPI even though Qemu 
itself has no interest in the device details (since the real user is 
inside the guest). 

> 
> Especially when it is such a small detail like what APIs are used to
> configure the wq.
> 
> For instance, what about suspend/resume of containers using idxd?
> Wouldn't you want to have the same basic approach of controlling the
> wq from userspace that virtualization uses?
> 

I'm not familiar with how container suspend/resume is done today.
But my gut feeling is that it's different from virtualization. For 
virtualization, the whole wq is assigned to the guest, thus the uAPI
must provide a way to save the wq state, including its configuration, 
at suspend, and then restore that state to what the guest expects at
resume. However, in the container case, which does userspace DMA, the
wq is managed by the host kernel and could be shared between multiple
containers. So the wq state is irrelevant to the container. The only
relevant state is the in-flight work, which needs a draining interface. 
In this view I think the two have a major difference.

Thanks
Kevin