Patchwork [1/5] vfio: Introduce documentation for VFIO driver

login
register
mail settings
Submitter Alex Williamson
Date Dec. 21, 2011, 9:42 p.m.
Message ID <20111221214212.27028.78947.stgit@bling.home>
Download mbox | patch
Permalink /patch/132739/
State New
Headers show

Comments

Alex Williamson - Dec. 21, 2011, 9:42 p.m.
Including rationale for design, example usage and API description.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 352 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt
Ronen Hod - Dec. 28, 2011, 5:16 p.m.
On 12/21/2011 11:42 PM, Alex Williamson wrote:
> Including rationale for design, example usage and API description.
>
> Signed-off-by: Alex Williamson<alex.williamson@redhat.com>
> ---
>
>   Documentation/vfio.txt |  352 ++++++++++++++++++++++++++++++++++++++++++++++++
>   1 files changed, 352 insertions(+), 0 deletions(-)
>   create mode 100644 Documentation/vfio.txt
>
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..09a5a5b
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,352 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
> +POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
> +systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
> +agnostic framework for exposing direct device access to userspace, in
> +a secure, IOMMU protected environment.  In other words, this allows
> +safe[2], non-privileged, userspace drivers.
> +
> +Why do we want that?  Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance.  From a device and host perspective, this simply
> +turns the VM into a userspace driver, with the benefits of
> +significantly reduced latency, higher bandwidth, and direct use of
> +bare-metal device drivers[3].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace.  Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.  Prior to VFIO, these drivers had to either
> +go through the full development cycle to become proper upstream
> +driver, be maintained out of tree, or make use of the UIO framework,
> +which has no notion of IOMMU protection, limited interrupt support,
> +and requires root privileges to access things like PCI configuration
> +space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment code as well as provide a more
> +secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, and IOMMUs
> +-------------------------------------------------------------------------------
> +
> +Userspace drivers are primarily concerned with manipulating individual
> +devices and setting up mappings in the IOMMU for those devices.
> +Unfortunately, the IOMMU doesn't always have the granularity to track
> +mappings for an individual device.  Sometimes this is a topology
> +barrier, such as a PCIe-to-PCI bridge interposing the device and
> +IOMMU, other times this is an IOMMU limitation.  In any case, the
> +reality is that devices are not always independent with respect to the
> +IOMMU.  Translations setup for one device can be used by another
> +device in these scenarios.
> +
> +The IOMMU API exposes these relationships by identifying an "IOMMU
> +group" for these dependent devices.  Devices on the same bus with the
> +same IOMMU group (or just "group" for this document) are not isolated
> +from each other with respect to DMA mappings.  For userspace usage,
> +this logically means that instead of being able to grant ownership of
> +an individual device, we must grant ownership of a group, which may
> +contain one or more devices.
> +
> +These groups therefore become a fundamental component of VFIO and the
> +working unit we use for exposing devices and granting permissions to
> +userspace.  In addition, VFIO make efforts to ensure the integrity of
> +the group for user access.  This includes ensuring that all devices
> +within the group are controlled by VFIO (vs native host drivers)
> +before allowing a user to access any member of the group or the IOMMU
> +mappings, as well as maintaining the group viability as devices are
> +dynamically added or removed from the system.
> +
> +To access a device through VFIO, a user must open a character device
> +for the group that the device belongs to and then issue an ioctl to
> +retrieve a file descriptor for the individual device.  This ensures
> +that the user has permissions to the group (file based access to the
> +/dev entry) and allows a check point at which VFIO can deny access to
> +the device if the group is not viable (all devices within the group
> +controlled by VFIO).  A file descriptor for the IOMMU is obtain in the
> +same fashion.
> +
> +VFIO defines a standard set of APIs for access to devices and a
> +modular interface for adding new, bus-specific VFIO device drivers.
> +We call these "VFIO bus drivers".  The vfio-pci module is an example
> +of a bus driver for exposing PCI devices.  When the bus driver module
> +is loaded it enumerates all of the devices for it's bus, registering
> +each device with the vfio core along with a set of callbacks.  For
> +buses that support hotplug, the bus driver also adds itself to the
> +notification chain for such events.  The callbacks registered with
> +each device implement the VFIO device access API for that bus.
> +
> +The VFIO device API includes ioctls for describing the device, the I/O
> +regions and their read/write/mmap offsets on the device descriptor, as
> +well as mechanisms for describing and registering interrupt
> +notifications.
> +
> +The VFIO IOMMU object is accessed in a similar way; an ioctl on the
> +group provides a file descriptor for programming the IOMMU.  Like
> +devices, the IOMMU file descriptor is only accessible when a group is
> +viable.  The API for the IOMMU is effectively a userspace extension of
> +the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
> +IOMMU domain as well as to setup and teardown DMA mappings.  As the
> +IOMMU API is extended to support more esoteric IOMMU implementations,
> +it's expected that the VFIO interface will also evolve.
> +
> +To facilitate this evolution, all of the VFIO interfaces are designed
> +for extensions.  Particularly, for all structures passed via ioctl, we
> +include a structure size and flags field.  We also define the ioctl
> +request to be independent of passed structure size.  This allows us to
> +later add structure fields and define flags as necessary.  It's
> +expected that each additional field will have an associated flag to
> +indicate whether the data is valid.  Additionally, we provide an
> +"info" ioctl for each file descriptor, which allows us to flag new
> +features as they're added (ex. an IOMMU domain configuration ioctl).
> +
> +The final aspect of VFIO is the notion of merging groups.  In both the
> +assignment of devices to virtual machines and the pure userspace
> +driver model, it's expect that a single user instance is likely to
> +have multiple groups in use simultaneously.  If these groups are all
> +using the same set of IOMMU mappings, the overhead of userspace
> +setting up and tearing down the mappings, as well as the internal
> +IOMMU driver overhead of managing those mappings can be non-trivial.
> +Some IOMMU implementations are able to easily reduce this overhead by
> +simply using the same set of page tables across multiple groups.
> +VFIO allows users to take advantage of this option by merging groups
> +together, effectively creating a super group (IOMMU groups only define
> +the minimum granularity).
> +
> +A user can attempt to merge groups together by calling the merge ioctl
> +on one group (the "merger") and passing a file descriptor for the
> +group to be merged in (the "mergee").  Note that existing DMA mappings
> +cannot be atomically merged between groups, it's therefore a
> +requirement that the mergee group is not in use.  This is enforced by
> +not allowing open device or iommu file descriptors on the mergee group
> +at the time of merging.  The merger group can be actively in use at
> +the time of merging.  Likewise, to unmerge a group, none of the device
> +file descriptors for the group being removed can be in use.  The
> +remaining merged group can be actively in use.
> +

Can you elaborate on the scenario that led a user to merge groups?
Does it make sense to try to "automatically" merge a (new) group with 
all the existing groups sometime prior to its first device open?

As always, it is a pleasure to read your documentation.
Ronen.

> +If the groups cannot be merged, the ioctl will fail and the user will
> +need to manage the groups independently.  Users should have no
> +expectation for group merging to be successful.  Some platforms may
> +not support it at all, others may only enable merging of sufficiently
> +similar groups.  If the ioctl succeeds, then the group file
> +descriptors are effectively fungible between the groups.  That is,
> +instead of their actions being isolated to the individual group, each
> +of them are gateways into the combined, merged group.  For instance,
> +retrieving an IOMMU file descriptor from any group returns a reference
> +to the same object, mappings to that IOMMU descriptor are visible to
> +all devices in the merged group, and device descriptors can be
> +retrieved for any device in the merged group from any one of the group
> +file descriptors.  In effect, a user can manage devices and the IOMMU
> +of a merged group using a single file descriptor (saving the merged
> +groups file descriptors away only for unmerged) without the
> +permission complications of creating a separate "super group" character
> +device.
> +
> +VFIO Usage Example
> +-------------------------------------------------------------------------------
> +
> +Assume user wants to access PCI device 0000:06:0d.0
> +
> +$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
> +240
> +
> +Since this device is on the "pci" bus, the user can then find the
> +character device for interacting with the VFIO group as:
> +
> +$ ls -l /dev/vfio/pci:240
> +crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
> +
> +We can also examine other members of the group through sysfs:
> +
> +$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
> +total 0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 ->  \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
> +lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 ->  \
> +		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
> +
> +This group therefore contains two devices[4].  VFIO will prevent
> +device or iommu manipulation unless all group members are attached to
> +the vfio bus driver, so we simply unbind the devices from their
> +current driver and rebind them to vfio:
> +
> +# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
> +	dir=$(readlink -f $i)
> +	if [ -L $dir/driver ]; then
> +		echo $(basename $i)>  $dir/driver/unbind
> +	fi
> +	vendor=$(cat $dir/vendor)
> +	device=$(cat $dir/device)
> +	echo $vendor $device>  /sys/bus/pci/drivers/vfio/new_id
> +	echo $(basename $i)>  /sys/bus/pci/drivers/vfio/bind
> +done
> +
> +# chown user:user /dev/vfio/pci:240
> +
> +The user now has full access to all the devices and the iommu for this
> +group and can access them as follows:
> +
> +	int group, iommu, device, i;
> +	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
> +	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
> +	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
> +	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
> +
> +	/* Open the group */
> +	group = open("/dev/vfio/pci:240", O_RDWR);
> +
> +	/* Test the group is viable and available */
> +	ioctl(group, VFIO_GROUP_GET_INFO,&group_info);
> +
> +	if (!(group_info.flags&  VFIO_GROUP_FLAGS_VIABLE))
> +		/* Group is not viable */
> +
> +	if ((group_info.flags&  VFIO_GROUP_FLAGS_MM_LOCKED))
> +		/* Already in use by someone else */
> +
> +	/* Get a file descriptor for the IOMMU */
> +	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
> +
> +	/* Test the IOMMU is what we expect */
> +	ioctl(iommu, VFIO_IOMMU_GET_INFO,&iommu_info);
> +
> +	/* Allocate some space and setup a DMA mapping */
> +	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
> +			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
> +	dma_map.size = 1024 * 1024;
> +	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
> +	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
> +
> +	ioctl(iommu, VFIO_IOMMU_MAP_DMA,&dma_map);
> +
> +	/* Get a file descriptor for the device */
> +	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
> +
> +	/* Test and setup the device */
> +	ioctl(device, VFIO_DEVICE_GET_INFO,&device_info);
> +
> +	for (i = 0; i<  device_info.num_regions; i++) {
> +		struct vfio_region_info reg = { .argsz = sizeof(reg) };
> +
> +		reg.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_REGION_INFO,&reg);
> +
> +		/* Setup mappings... read/write offsets, mmaps
> +		 * For PCI devices, config space is a region */
> +	}
> +
> +	for (i = 0; i<  device_info.num_irqs; i++) {
> +		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
> +
> +		irq.index = i;
> +
> +		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO,&reg);
> +
> +		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
> +	}
> +
> +	/* Gratuitous device reset and go... */
> +	ioctl(device, VFIO_DEVICE_RESET);
> +
> +VFIO User API
> +-------------------------------------------------------------------------------
> +
> +Please see include/linux/vfio.h for complete API documentation.
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on its
> +bus and call vfio_group_add_dev() for each device.  If the bus
> +supports hotplug, notifiers should be enabled to track devices being
> +added and removed.  vfio_group_del_dev() removes a previously added
> +device from vfio.
> +
> +extern int vfio_group_add_dev(struct device *dev,
> +                              const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *dev);
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> +	bool	(*match)(struct device *dev, char *buf);
> +	int	(*claim)(struct device *dev);
> +	int	(*open)(void *device_data);
> +	void	(*release)(void *device_data);
> +	ssize_t	(*read)(void *device_data, char __user *buf,
> +			size_t count, loff_t *ppos);
> +	ssize_t	(*write)(void *device_data, const char __user *buf,
> +			 size_t size, loff_t *ppos);
> +	long	(*ioctl)(void *device_data, unsigned int cmd,
> +			 unsigned long arg);
> +	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
> +};
> +
> +For buses supporting hotplug, all functions are required to be
> +implemented.  Non-hotplug buses do not need to implement claim().
> +
> +match() provides a device specific method for associating a struct
> +device to a user provided string.  Many drivers may simply strcmp the
> +buffer to dev_name().
> +
> +claim() is used when a device is hot-added to a group that is already
> +in use.  This is how VFIO requests that a bus driver manually takes
> +ownership of a device.  The expected call path for this is triggered
> +from the bus add notifier.  The bus driver calls vfio_group_add_dev for
> +the newly added device, vfio-core determines this group is already in
> +use and calls claim on the bus driver.  This triggers the bus driver
> +to call it's own probe function, including calling vfio_bind_dev to
> +mark the device as controlled by vfio.  The device is then available
> +for use by the group.
> +
> +The remaining vfio_device_ops are similar to a simplified struct
> +file_operations except a device_data pointer is provided rather than a
> +file pointer.  The device_data is an opaque structure registered by
> +the bus driver when a device is bound to the vfio bus driver:
> +
> +extern int vfio_bind_dev(struct device *dev, void *device_data);
> +extern void *vfio_unbind_dev(struct device *dev);
> +
> +When the device is unbound from the driver, the bus driver will call
> +vfio_unbind_dev() which will return the device_data for any bus driver
> +specific cleanup and freeing of the structure.  The vfio_unbind_dev
> +call may block if the group is currently in use.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> +initial implementation by Tom Lyon while as Cisco.  We've since
> +outgrown the acronym, but it's catchy.
> +
> +[2] "safe" also depends upon a device being "well behaved".  It's
> +possible for multi-function devices to have backdoors between
> +functions and even for single function devices to have alternative
> +access to things like PCI config space through MMIO registers.  To
> +guard against the former we can include additional precautions in the
> +IOMMU driver to group multi-function PCI devices together
> +(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
> +still provide isolation.  For PCI, Virtual Functions are the best
> +indicator of "well behaved", as these are designed for virtualization
> +usage models.
> +
> +[3] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO.  It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> +
> +[4] In this case the device is below a PCI bridge:
> +
> +-[0000:00]-+-1e.0-[06]--+-0d.0
> +                        \-0d.1
> +
> +00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)
>
>
Alex Williamson - Jan. 3, 2012, 3:21 p.m.
On Wed, 2011-12-28 at 19:16 +0200, Ronen Hod wrote:
> On 12/21/2011 11:42 PM, Alex Williamson wrote:
> > +The final aspect of VFIO is the notion of merging groups.  In both the
> > +assignment of devices to virtual machines and the pure userspace
> > +driver model, it's expect that a single user instance is likely to
> > +have multiple groups in use simultaneously.  If these groups are all
> > +using the same set of IOMMU mappings, the overhead of userspace
> > +setting up and tearing down the mappings, as well as the internal
> > +IOMMU driver overhead of managing those mappings can be non-trivial.
> > +Some IOMMU implementations are able to easily reduce this overhead by
> > +simply using the same set of page tables across multiple groups.
> > +VFIO allows users to take advantage of this option by merging groups
> > +together, effectively creating a super group (IOMMU groups only define
> > +the minimum granularity).
> > +
> > +A user can attempt to merge groups together by calling the merge ioctl
> > +on one group (the "merger") and passing a file descriptor for the
> > +group to be merged in (the "mergee").  Note that existing DMA mappings
> > +cannot be atomically merged between groups, it's therefore a
> > +requirement that the mergee group is not in use.  This is enforced by
> > +not allowing open device or iommu file descriptors on the mergee group
> > +at the time of merging.  The merger group can be actively in use at
> > +the time of merging.  Likewise, to unmerge a group, none of the device
> > +file descriptors for the group being removed can be in use.  The
> > +remaining merged group can be actively in use.
> > +
> 
> Can you elaborate on the scenario that led a user to merge groups?

Anytime a user is managing multiple groups with the same set of iommu
mappings, they probably wants to try to merge the groups so they only
have to program the iommu once, instead of once per group.  This works
well in the qemu/x86 device assignment model where we map full guest
memory into the iommu to allow transparent device assignment, ie. the
guest doesn't need to be enlightened or forced to use a specific DMA
mechanism.  If instead we exposed an emulated or paravirtualized iommu
and expected the guest to make use of it to limit the number of pinned
guest pages, it might make sense to expose that per group, especially if
the I/O virtual address window is small or lives at different offsets,
so merging wouldn't make sense then.  I'll add something to this effect
in the documentation.

> Does it make sense to try to "automatically" merge a (new) group with 
> all the existing groups sometime prior to its first device open?

I don't think so.  As above, it would assume a usage model.

> As always, it is a pleasure to read your documentation.

Thanks!

Alex

Patch

diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
new file mode 100644
index 0000000..09a5a5b
--- /dev/null
+++ b/Documentation/vfio.txt
@@ -0,0 +1,352 @@ 
+VFIO - "Virtual Function I/O"[1]
+-------------------------------------------------------------------------------
+Many modern system now provide DMA and interrupt remapping facilities
+to help ensure I/O devices behave within the boundaries they've been
+allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d,
+POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
+systems such as Freescale PAMU.  The VFIO driver is an IOMMU/device
+agnostic framework for exposing direct device access to userspace, in
+a secure, IOMMU protected environment.  In other words, this allows
+safe[2], non-privileged, userspace drivers.
+
+Why do we want that?  Virtual machines often make use of direct device
+access ("device assignment") when configured for the highest possible
+I/O performance.  From a device and host perspective, this simply
+turns the VM into a userspace driver, with the benefits of
+significantly reduced latency, higher bandwidth, and direct use of
+bare-metal device drivers[3].
+
+Some applications, particularly in the high performance computing
+field, also benefit from low-overhead, direct device access from
+userspace.  Examples include network adapters (often non-TCP/IP based)
+and compute accelerators.  Prior to VFIO, these drivers had to either
+go through the full development cycle to become proper upstream
+driver, be maintained out of tree, or make use of the UIO framework,
+which has no notion of IOMMU protection, limited interrupt support,
+and requires root privileges to access things like PCI configuration
+space.
+
+The VFIO driver framework intends to unify these, replacing both the
+KVM PCI specific device assignment code as well as provide a more
+secure, more featureful userspace driver environment than UIO.
+
+Groups, Devices, and IOMMUs
+-------------------------------------------------------------------------------
+
+Userspace drivers are primarily concerned with manipulating individual
+devices and setting up mappings in the IOMMU for those devices.
+Unfortunately, the IOMMU doesn't always have the granularity to track
+mappings for an individual device.  Sometimes this is a topology
+barrier, such as a PCIe-to-PCI bridge interposing the device and
+IOMMU, other times this is an IOMMU limitation.  In any case, the
+reality is that devices are not always independent with respect to the
+IOMMU.  Translations setup for one device can be used by another
+device in these scenarios.
+
+The IOMMU API exposes these relationships by identifying an "IOMMU
+group" for these dependent devices.  Devices on the same bus with the
+same IOMMU group (or just "group" for this document) are not isolated
+from each other with respect to DMA mappings.  For userspace usage,
+this logically means that instead of being able to grant ownership of
+an individual device, we must grant ownership of a group, which may
+contain one or more devices.
+
+These groups therefore become a fundamental component of VFIO and the
+working unit we use for exposing devices and granting permissions to
+userspace.  In addition, VFIO make efforts to ensure the integrity of
+the group for user access.  This includes ensuring that all devices
+within the group are controlled by VFIO (vs native host drivers)
+before allowing a user to access any member of the group or the IOMMU
+mappings, as well as maintaining the group viability as devices are
+dynamically added or removed from the system.
+
+To access a device through VFIO, a user must open a character device
+for the group that the device belongs to and then issue an ioctl to
+retrieve a file descriptor for the individual device.  This ensures
+that the user has permissions to the group (file based access to the
+/dev entry) and allows a check point at which VFIO can deny access to
+the device if the group is not viable (all devices within the group
+controlled by VFIO).  A file descriptor for the IOMMU is obtain in the
+same fashion.
+
+VFIO defines a standard set of APIs for access to devices and a
+modular interface for adding new, bus-specific VFIO device drivers.
+We call these "VFIO bus drivers".  The vfio-pci module is an example
+of a bus driver for exposing PCI devices.  When the bus driver module
+is loaded it enumerates all of the devices for it's bus, registering
+each device with the vfio core along with a set of callbacks.  For
+buses that support hotplug, the bus driver also adds itself to the
+notification chain for such events.  The callbacks registered with
+each device implement the VFIO device access API for that bus.
+
+The VFIO device API includes ioctls for describing the device, the I/O
+regions and their read/write/mmap offsets on the device descriptor, as
+well as mechanisms for describing and registering interrupt
+notifications.
+
+The VFIO IOMMU object is accessed in a similar way; an ioctl on the
+group provides a file descriptor for programming the IOMMU.  Like
+devices, the IOMMU file descriptor is only accessible when a group is
+viable.  The API for the IOMMU is effectively a userspace extension of
+the kernel IOMMU API.  The IOMMU provides an ioctl to describe the
+IOMMU domain as well as to setup and teardown DMA mappings.  As the
+IOMMU API is extended to support more esoteric IOMMU implementations,
+it's expected that the VFIO interface will also evolve.
+
+To facilitate this evolution, all of the VFIO interfaces are designed
+for extensions.  Particularly, for all structures passed via ioctl, we
+include a structure size and flags field.  We also define the ioctl
+request to be independent of passed structure size.  This allows us to
+later add structure fields and define flags as necessary.  It's
+expected that each additional field will have an associated flag to
+indicate whether the data is valid.  Additionally, we provide an
+"info" ioctl for each file descriptor, which allows us to flag new
+features as they're added (ex. an IOMMU domain configuration ioctl).
+
+The final aspect of VFIO is the notion of merging groups.  In both the
+assignment of devices to virtual machines and the pure userspace
+driver model, it's expect that a single user instance is likely to
+have multiple groups in use simultaneously.  If these groups are all
+using the same set of IOMMU mappings, the overhead of userspace
+setting up and tearing down the mappings, as well as the internal
+IOMMU driver overhead of managing those mappings can be non-trivial.
+Some IOMMU implementations are able to easily reduce this overhead by
+simply using the same set of page tables across multiple groups.
+VFIO allows users to take advantage of this option by merging groups
+together, effectively creating a super group (IOMMU groups only define
+the minimum granularity).
+
+A user can attempt to merge groups together by calling the merge ioctl
+on one group (the "merger") and passing a file descriptor for the
+group to be merged in (the "mergee").  Note that existing DMA mappings
+cannot be atomically merged between groups, it's therefore a
+requirement that the mergee group is not in use.  This is enforced by
+not allowing open device or iommu file descriptors on the mergee group
+at the time of merging.  The merger group can be actively in use at
+the time of merging.  Likewise, to unmerge a group, none of the device
+file descriptors for the group being removed can be in use.  The
+remaining merged group can be actively in use.
+
+If the groups cannot be merged, the ioctl will fail and the user will
+need to manage the groups independently.  Users should have no
+expectation for group merging to be successful.  Some platforms may
+not support it at all, others may only enable merging of sufficiently
+similar groups.  If the ioctl succeeds, then the group file
+descriptors are effectively fungible between the groups.  That is,
+instead of their actions being isolated to the individual group, each
+of them are gateways into the combined, merged group.  For instance,
+retrieving an IOMMU file descriptor from any group returns a reference
+to the same object, mappings to that IOMMU descriptor are visible to
+all devices in the merged group, and device descriptors can be
+retrieved for any device in the merged group from any one of the group
+file descriptors.  In effect, a user can manage devices and the IOMMU
+of a merged group using a single file descriptor (saving the merged
+groups file descriptors away only for unmerged) without the
+permission complications of creating a separate "super group" character
+device.
+
+VFIO Usage Example
+-------------------------------------------------------------------------------
+
+Assume user wants to access PCI device 0000:06:0d.0
+
+$ cat /sys/bus/pci/devices/0000:06:0d.0/iommu_group
+240
+
+Since this device is on the "pci" bus, the user can then find the
+character device for interacting with the VFIO group as:
+
+$ ls -l /dev/vfio/pci:240
+crw-rw---- 1 root root 252, 27 Dec 15 15:13 /dev/vfio/pci:240
+
+We can also examine other members of the group through sysfs:
+
+$ ls -l /sys/devices/virtual/vfio/pci:240/devices/
+total 0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.0 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.0
+lrwxrwxrwx 1 root root 0 Dec 20 12:01 0000:06:0d.1 -> \
+		../../../../pci0000:00/0000:00:1e.0/0000:06:0d.1
+
+This group therefore contains two devices[4].  VFIO will prevent
+device or iommu manipulation unless all group members are attached to
+the vfio bus driver, so we simply unbind the devices from their
+current driver and rebind them to vfio:
+
+# for i in /sys/devices/virtual/vfio/pci:240/devices/*; do
+	dir=$(readlink -f $i)
+	if [ -L $dir/driver ]; then
+		echo $(basename $i) > $dir/driver/unbind
+	fi
+	vendor=$(cat $dir/vendor)
+	device=$(cat $dir/device)
+	echo $vendor $device > /sys/bus/pci/drivers/vfio/new_id
+	echo $(basename $i) > /sys/bus/pci/drivers/vfio/bind
+done
+
+# chown user:user /dev/vfio/pci:240
+
+The user now has full access to all the devices and the iommu for this
+group and can access them as follows:
+
+	int group, iommu, device, i;
+	struct vfio_group_info group_info = { .argsz = sizeof(group_info) };
+	struct vfio_iommu_info iommu_info = { .argsz = sizeof(iommu_info) };
+	struct vfio_dma_map dma_map = { .argsz = sizeof(dma_map) };
+	struct vfio_device_info device_info = { .argsz = sizeof(device_info) };
+
+	/* Open the group */
+	group = open("/dev/vfio/pci:240", O_RDWR);
+
+	/* Test the group is viable and available */
+	ioctl(group, VFIO_GROUP_GET_INFO, &group_info);
+
+	if (!(group_info.flags & VFIO_GROUP_FLAGS_VIABLE))
+		/* Group is not viable */
+
+	if ((group_info.flags & VFIO_GROUP_FLAGS_MM_LOCKED))
+		/* Already in use by someone else */
+
+	/* Get a file descriptor for the IOMMU */
+	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
+
+	/* Test the IOMMU is what we expect */
+	ioctl(iommu, VFIO_IOMMU_GET_INFO, &iommu_info);
+
+	/* Allocate some space and setup a DMA mapping */
+	dma_map.vaddr = mmap(0, 1024 * 1024, PROT_READ | PROT_WRITE,
+			     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
+	dma_map.size = 1024 * 1024;
+	dma_map.iova = 0; /* 1MB starting at 0x0 from device view */
+	dma_map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
+
+	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dma_map);
+
+	/* Get a file descriptor for the device */
+	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
+
+	/* Test and setup the device */
+	ioctl(device, VFIO_DEVICE_GET_INFO, &device_info);
+
+	for (i = 0; i < device_info.num_regions; i++) {
+		struct vfio_region_info reg = { .argsz = sizeof(reg) };
+
+		reg.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_REGION_INFO, &reg);
+
+		/* Setup mappings... read/write offsets, mmaps
+		 * For PCI devices, config space is a region */
+	}
+
+	for (i = 0; i < device_info.num_irqs; i++) {
+		struct vfio_irq_info irq = { .argsz = sizeof(irq) };
+
+		irq.index = i;
+
+		ioctl(device, VFIO_DEVICE_GET_IRQ_INFO, &reg);
+
+		/* Setup IRQs... eventfds, VFIO_DEVICE_SET_IRQ_EVENTFDS */
+	}
+
+	/* Gratuitous device reset and go... */
+	ioctl(device, VFIO_DEVICE_RESET);
+
+VFIO User API
+-------------------------------------------------------------------------------
+
+Please see include/linux/vfio.h for complete API documentation.
+
+VFIO bus driver API
+-------------------------------------------------------------------------------
+
+Bus drivers, such as PCI, have three jobs:
+ 1) Add/remove devices from vfio
+ 2) Provide vfio_device_ops for device access
+ 3) Device binding and unbinding
+
+When initialized, the bus driver should enumerate the devices on its
+bus and call vfio_group_add_dev() for each device.  If the bus
+supports hotplug, notifiers should be enabled to track devices being
+added and removed.  vfio_group_del_dev() removes a previously added
+device from vfio.
+
+extern int vfio_group_add_dev(struct device *dev,
+                              const struct vfio_device_ops *ops);
+extern void vfio_group_del_dev(struct device *dev);
+
+Adding a device registers a vfio_device_ops function pointer structure
+for the device:
+
+struct vfio_device_ops {
+	bool	(*match)(struct device *dev, char *buf);
+	int	(*claim)(struct device *dev);
+	int	(*open)(void *device_data);
+	void	(*release)(void *device_data);
+	ssize_t	(*read)(void *device_data, char __user *buf,
+			size_t count, loff_t *ppos);
+	ssize_t	(*write)(void *device_data, const char __user *buf,
+			 size_t size, loff_t *ppos);
+	long	(*ioctl)(void *device_data, unsigned int cmd,
+			 unsigned long arg);
+	int	(*mmap)(void *device_data, struct vm_area_struct *vma);
+};
+
+For buses supporting hotplug, all functions are required to be
+implemented.  Non-hotplug buses do not need to implement claim().
+
+match() provides a device specific method for associating a struct
+device to a user provided string.  Many drivers may simply strcmp the
+buffer to dev_name().
+
+claim() is used when a device is hot-added to a group that is already
+in use.  This is how VFIO requests that a bus driver manually takes
+ownership of a device.  The expected call path for this is triggered
+from the bus add notifier.  The bus driver calls vfio_group_add_dev for
+the newly added device, vfio-core determines this group is already in
+use and calls claim on the bus driver.  This triggers the bus driver
+to call it's own probe function, including calling vfio_bind_dev to
+mark the device as controlled by vfio.  The device is then available
+for use by the group.
+
+The remaining vfio_device_ops are similar to a simplified struct
+file_operations except a device_data pointer is provided rather than a
+file pointer.  The device_data is an opaque structure registered by
+the bus driver when a device is bound to the vfio bus driver:
+
+extern int vfio_bind_dev(struct device *dev, void *device_data);
+extern void *vfio_unbind_dev(struct device *dev);
+
+When the device is unbound from the driver, the bus driver will call
+vfio_unbind_dev() which will return the device_data for any bus driver
+specific cleanup and freeing of the structure.  The vfio_unbind_dev
+call may block if the group is currently in use.
+
+-------------------------------------------------------------------------------
+
+[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
+initial implementation by Tom Lyon while as Cisco.  We've since
+outgrown the acronym, but it's catchy.
+
+[2] "safe" also depends upon a device being "well behaved".  It's
+possible for multi-function devices to have backdoors between
+functions and even for single function devices to have alternative
+access to things like PCI config space through MMIO registers.  To
+guard against the former we can include additional precautions in the
+IOMMU driver to group multi-function PCI devices together
+(iommu=group_mf).  The latter we can't prevent, but the IOMMU should
+still provide isolation.  For PCI, Virtual Functions are the best
+indicator of "well behaved", as these are designed for virtualization
+usage models.
+
+[3] As always there are trade-offs to virtual machine device
+assignment that are beyond the scope of VFIO.  It's expected that
+future IOMMU technologies will reduce some, but maybe not all, of
+these trade-offs.
+
+[4] In this case the device is below a PCI bridge:
+
+-[0000:00]-+-1e.0-[06]--+-0d.0
+                        \-0d.1
+
+00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)