Patchwork [RFC] vfio: VFIO Driver core framework

Submitter Alex Williamson
Date Nov. 3, 2011, 8:12 p.m.
Message ID <20111103195452.21259.93021.stgit@bling.home>
Permalink /patch/123504/
State New

Comments

Alex Williamson - Nov. 3, 2011, 8:12 p.m.
VFIO provides a secure, IOMMU based interface for user space
drivers, including device assignment to virtual machines.
This provides the base management of IOMMU groups, devices,
and IOMMU objects.  See Documentation/vfio.txt included in
this patch for user and kernel API description.

Note, this implements the new API discussed at KVM Forum
2011, as represented by the driver version 0.2.  It's hoped
that this provides a modular enough interface to support PCI
and non-PCI userspace drivers across various architectures
and IOMMU implementations.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

Fingers crossed, this is the last RFC for VFIO, but we need
the iommu group support before this can go upstream
(http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
hoping this helps push that along.

Since the last posting, this version completely modularizes
the device backends and better defines the APIs between the
core VFIO code and the device backends.  I expect that we
might also adopt a modular IOMMU interface as iommu_ops learns
about different types of hardware.  Also many, many cleanups.
Check the complete git history for details:

git://github.com/awilliam/linux-vfio.git vfio-ng

(matching qemu tree: git://github.com/awilliam/qemu-vfio.git)

This version, along with the supporting VFIO PCI backend can
be found here:

git://github.com/awilliam/linux-vfio.git vfio-next-20111103

I've held off on implementing a kernel->user signaling
mechanism for now since the previous netlink version produced
too many gag reflexes.  It's easy enough to set a bit in the
group flags to indicate such support in the future, so I
think we can move ahead without it.
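
For example, userspace could probe for it with something along these
lines once a bit is defined (the flag name below is hypothetical, only
VIABLE and MM_LOCKED exist today):

    __u64 flags;

    if (!ioctl(group_fd, VFIO_GROUP_GET_FLAGS, &flags) &&
        (flags & VFIO_GROUP_FLAGS_SIGNAL /* hypothetical bit */))
            /* kernel->user signaling is available */;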

Appreciate any feedback or suggestions.  Thanks,

Alex

 Documentation/ioctl/ioctl-number.txt |    1 
 Documentation/vfio.txt               |  304 +++++++++
 MAINTAINERS                          |    8 
 drivers/Kconfig                      |    2 
 drivers/Makefile                     |    1 
 drivers/vfio/Kconfig                 |    8 
 drivers/vfio/Makefile                |    3 
 drivers/vfio/vfio_iommu.c            |  530 ++++++++++++++++
 drivers/vfio/vfio_main.c             | 1151 ++++++++++++++++++++++++++++++++++
 drivers/vfio/vfio_private.h          |   34 +
 include/linux/vfio.h                 |  155 +++++
 11 files changed, 2197 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/vfio.txt
 create mode 100644 drivers/vfio/Kconfig
 create mode 100644 drivers/vfio/Makefile
 create mode 100644 drivers/vfio/vfio_iommu.c
 create mode 100644 drivers/vfio/vfio_main.c
 create mode 100644 drivers/vfio/vfio_private.h
 create mode 100644 include/linux/vfio.h
Aaron Fabbri - Nov. 9, 2011, 4:17 a.m.
I'm going to send out chunks of comments as I go over this stuff.  Below
I've covered the documentation file and vfio_iommu.c.  More comments coming
soon...

On 11/3/11 1:12 PM, "Alex Williamson" <alex.williamson@redhat.com> wrote:

> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects.  See Documentation/vfio.txt included in
> this patch for user and kernel API description.
> 
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the drvier version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
<snip>
> +
> +Groups, Devices, IOMMUs, oh my
> +-----------------------------------------------------------------------------
> --
> +
> +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system.  Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a

Can you define this acronym the first time you use it, i.e.

+ PEs (partitionable endpoints), ...

> +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for

restrictions

> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices

distinguish

> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process.  Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
<snip>
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y).  Merged groups can be dissolved either explictly with UNMERGE

explicitly

<snip>
> +
> +Device tree devices also invlude ioctls for further defining the

include

<snip>
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
<snip>
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> +                      dma_addr_t start, size_t size)
> +{
> +    struct list_head *pos;
> +    struct dma_map_page *mlp;
> +
> +    list_for_each(pos, &iommu->dm_list) {
> +        mlp = list_entry(pos, struct dma_map_page, list);
> +        if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +                   start, size))
> +            return mlp;
> +    }
> +    return NULL;
> +}
> +

This function below should be static.

> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +                size_t size, struct dma_map_page *mlp)
> +{
> +    struct dma_map_page *split;
> +    int npage_lo, npage_hi;
> +
> +    /* Existing dma region is completely covered, unmap all */
> +    if (start <= mlp->daddr &&
> +        start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +        vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +        list_del(&mlp->list);
> +        npage_lo = mlp->npage;
> +        kfree(mlp);
> +        return npage_lo;
> +    }
> +
> +    /* Overlap low address of existing range */
> +    if (start <= mlp->daddr) {
> +        size_t overlap;
> +
> +        overlap = start + size - mlp->daddr;
> +        npage_lo = overlap >> PAGE_SHIFT;
> +        npage_hi = mlp->npage - npage_lo;

npage_hi not used.. Delete this line ^

> +
> +        vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> +        mlp->daddr += overlap;
> +        mlp->vaddr += overlap;
> +        mlp->npage -= npage_lo;
> +        return npage_lo;
> +    }
> +
> +    /* Overlap high address of existing range */
> +    if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +        size_t overlap;
> +
> +        overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> +        npage_hi = overlap >> PAGE_SHIFT;
> +        npage_lo = mlp->npage - npage_hi;
> +
> +        vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> +        mlp->npage -= npage_hi;
> +        return npage_hi;
> +    }
> +
> +    /* Split existing */
> +    npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> +    npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> +    split = kzalloc(sizeof *split, GFP_KERNEL);
> +    if (!split)
> +        return -ENOMEM;
> +
> +    vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> +    mlp->npage = npage_lo;
> +
> +    split->npage = npage_hi;
> +    split->daddr = start + size;
> +    split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> +    split->rdwr = mlp->rdwr;
> +    list_add(&split->list, &iommu->dm_list);
> +    return size >> PAGE_SHIFT;
> +}
> +

Function should be static.

> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +    int ret = 0;
> +    size_t npage = dmp->size >> PAGE_SHIFT;
> +    struct list_head *pos, *n;
> +
> +    if (dmp->dmaaddr & ~PAGE_MASK)
> +        return -EINVAL;
> +    if (dmp->size & ~PAGE_MASK)
> +        return -EINVAL;
> +
> +    mutex_lock(&iommu->dgate);
> +
> +    list_for_each_safe(pos, n, &iommu->dm_list) {
> +        struct dma_map_page *mlp;
> +
> +        mlp = list_entry(pos, struct dma_map_page, list);
> +        if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +                   dmp->dmaaddr, dmp->size)) {
> +            ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> +                              dmp->size, mlp);
> +            if (ret > 0)
> +                npage -= NPAGE_TO_SIZE(ret);

Why NPAGE_TO_SIZE here?

> +            if (ret < 0 || npage == 0)
> +                break;
> +        }
> +    }
> +    mutex_unlock(&iommu->dgate);
> +    return ret > 0 ? 0 : ret;
> +}
> +

Function should be static.

> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +    int npage;
> +    struct dma_map_page *mlp, *mmlp = NULL;
> +    dma_addr_t daddr = dmp->dmaaddr;
Alex Williamson - Nov. 9, 2011, 4:41 a.m.
On Tue, 2011-11-08 at 20:17 -0800, Aaron Fabbri wrote:
> I'm going to send out chunks of comments as I go over this stuff.  Below
> I've covered the documentation file and vfio_iommu.c.  More comments coming
> soon...
> 
> On 11/3/11 1:12 PM, "Alex Williamson" <alex.williamson@redhat.com> wrote:
> 
> > VFIO provides a secure, IOMMU based interface for user space
> > drivers, including device assignment to virtual machines.
> > This provides the base management of IOMMU groups, devices,
> > and IOMMU objects.  See Documentation/vfio.txt included in
> > this patch for user and kernel API description.
> > 
> > Note, this implements the new API discussed at KVM Forum
> > 2011, as represented by the drvier version 0.2.  It's hoped
> > that this provides a modular enough interface to support PCI
> > and non-PCI userspace drivers across various architectures
> > and IOMMU implementations.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> <snip>
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +-----------------------------------------------------------------------------
> > --
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system.  Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> 
> Can you define this acronym the first time you use it, i.e.
> 
> + PEs (partitionable endpoints), ...

It was actually up in the <snip>:

... POWER systems with Partitionable Endpoints (PEs) ...

I tried to make sure I defined them, but let me know if anything else is
missing/non-obvious.

> > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > +devices created by these restictions IOMMU groups (or just "groups" for
> 
> restrictions

Ugh, lost w/o a spell checker.  Fixed all these.

> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
> <snip>
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > +                      dma_addr_t start, size_t size)
> > +{
> > +    struct list_head *pos;
> > +    struct dma_map_page *mlp;
> > +
> > +    list_for_each(pos, &iommu->dm_list) {
> > +        mlp = list_entry(pos, struct dma_map_page, list);
> > +        if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +                   start, size))
> > +            return mlp;
> > +    }
> > +    return NULL;
> > +}
> > +
> 
> This function below should be static.

Fixed

> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > +                size_t size, struct dma_map_page *mlp)
> > +{
> > +    struct dma_map_page *split;
> > +    int npage_lo, npage_hi;
> > +
> > +    /* Existing dma region is completely covered, unmap all */
> > +    if (start <= mlp->daddr &&
> > +        start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +        vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +        list_del(&mlp->list);
> > +        npage_lo = mlp->npage;
> > +        kfree(mlp);
> > +        return npage_lo;
> > +    }
> > +
> > +    /* Overlap low address of existing range */
> > +    if (start <= mlp->daddr) {
> > +        size_t overlap;
> > +
> > +        overlap = start + size - mlp->daddr;
> > +        npage_lo = overlap >> PAGE_SHIFT;
> > +        npage_hi = mlp->npage - npage_lo;
> 
> npage_hi not used.. Delete this line ^

Yep, and npage_lo in the next block.  I was setting them just for
symmetry, but they can be removed now.

> > +
> > +        vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +        mlp->daddr += overlap;
> > +        mlp->vaddr += overlap;
> > +        mlp->npage -= npage_lo;
> > +        return npage_lo;
> > +    }
> > +
> > +    /* Overlap high address of existing range */
> > +    if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +        size_t overlap;
> > +
> > +        overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +        npage_hi = overlap >> PAGE_SHIFT;
> > +        npage_lo = mlp->npage - npage_hi;
> > +
> > +        vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +        mlp->npage -= npage_hi;
> > +        return npage_hi;
> > +    }
> > +
> > +    /* Split existing */
> > +    npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > +    npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > +    split = kzalloc(sizeof *split, GFP_KERNEL);
> > +    if (!split)
> > +        return -ENOMEM;
> > +
> > +    vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > +    mlp->npage = npage_lo;
> > +
> > +    split->npage = npage_hi;
> > +    split->daddr = start + size;
> > +    split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > +    split->rdwr = mlp->rdwr;
> > +    list_add(&split->list, &iommu->dm_list);
> > +    return size >> PAGE_SHIFT;
> > +}
> > +
> 
> Function should be static.

Fixed

> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +    int ret = 0;
> > +    size_t npage = dmp->size >> PAGE_SHIFT;
> > +    struct list_head *pos, *n;
> > +
> > +    if (dmp->dmaaddr & ~PAGE_MASK)
> > +        return -EINVAL;
> > +    if (dmp->size & ~PAGE_MASK)
> > +        return -EINVAL;
> > +
> > +    mutex_lock(&iommu->dgate);
> > +
> > +    list_for_each_safe(pos, n, &iommu->dm_list) {
> > +        struct dma_map_page *mlp;
> > +
> > +        mlp = list_entry(pos, struct dma_map_page, list);
> > +        if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +                   dmp->dmaaddr, dmp->size)) {
> > +            ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > +                              dmp->size, mlp);
> > +            if (ret > 0)
> > +                npage -= NPAGE_TO_SIZE(ret);
> 
> Why NPAGE_TO_SIZE here?

Looks like a bug, I'll change and test.
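
Since vfio_remove_dma_overlap() returns a page count and npage is also
counted in pages, presumably that hunk just becomes:

            ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
                                          dmp->size, mlp);
            if (ret > 0)
                npage -= ret;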

> > +            if (ret < 0 || npage == 0)
> > +                break;
> > +        }
> > +    }
> > +    mutex_unlock(&iommu->dgate);
> > +    return ret > 0 ? 0 : ret;
> > +}
> > +
> 
> Function should be static.

Fixed.

> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +    int npage;
> > +    struct dma_map_page *mlp, *mmlp = NULL;
> > +    dma_addr_t daddr = dmp->dmaaddr;
> 

Thanks!

Alex
Christian Benvenuti - Nov. 9, 2011, 8:11 a.m.
I have not gone through the whole patch yet, but here are
my first comments/questions about the code in vfio_main.c
(and pci/vfio_pci.c).

> -----Original Message-----

> From: Alex Williamson [mailto:alex.williamson@redhat.com]

> Sent: Thursday, November 03, 2011 1:12 PM

> To: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;

> dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Christian

> Benvenuti (benve); Aaron Fabbri (aafabbri); B08248@freescale.com;

> B07421@freescale.com; avi@redhat.com; konrad.wilk@oracle.com;

> kvm@vger.kernel.org; qemu-devel@nongnu.org; iommu@lists.linux-

> foundation.org; linux-pci@vger.kernel.org

> Subject: [RFC PATCH] vfio: VFIO Driver core framework


<snip>

> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c

> new file mode 100644

> index 0000000..6169356

> --- /dev/null

> +++ b/drivers/vfio/vfio_main.c

> @@ -0,0 +1,1151 @@

> +/*

> + * VFIO framework

> + *

> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.

> + *     Author: Alex Williamson <alex.williamson@redhat.com>

> + *

> + * This program is free software; you can redistribute it and/or

> modify

> + * it under the terms of the GNU General Public License version 2 as

> + * published by the Free Software Foundation.

> + *

> + * Derived from original vfio:

> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.

> + * Author: Tom Lyon, pugs@cisco.com

> + */

> +

> +#include <linux/cdev.h>

> +#include <linux/compat.h>

> +#include <linux/device.h>

> +#include <linux/file.h>

> +#include <linux/anon_inodes.h>

> +#include <linux/fs.h>

> +#include <linux/idr.h>

> +#include <linux/iommu.h>

> +#include <linux/mm.h>

> +#include <linux/module.h>

> +#include <linux/slab.h>

> +#include <linux/string.h>

> +#include <linux/uaccess.h>

> +#include <linux/vfio.h>

> +#include <linux/wait.h>

> +

> +#include "vfio_private.h"

> +

> +#define DRIVER_VERSION	"0.2"

> +#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"

> +#define DRIVER_DESC	"VFIO - User Level meta-driver"

> +

> +static int allow_unsafe_intrs;

> +module_param(allow_unsafe_intrs, int, 0);

> +MODULE_PARM_DESC(allow_unsafe_intrs,

> +        "Allow use of IOMMUs which do not support interrupt

> remapping");

> +

> +static struct vfio {

> +	dev_t			devt;

> +	struct cdev		cdev;

> +	struct list_head	group_list;

> +	struct mutex		lock;

> +	struct kref		kref;

> +	struct class		*class;

> +	struct idr		idr;

> +	wait_queue_head_t	release_q;

> +} vfio;

> +

> +static const struct file_operations vfio_group_fops;

> +extern const struct file_operations vfio_iommu_fops;

> +

> +struct vfio_group {

> +	dev_t			devt;

> +	unsigned int		groupid;


This groupid is returned by the device_group callback you recently added
with a separate (not yet in tree) IOMMU patch.
Is it correct to say that the scope of this ID is the bus the iommu
belongs to (but you use it as if it were global)?
I believe there is nothing right now to ensure the uniqueness of such an
ID across bus types (assuming there will be other bus drivers in the
future besides vfio-pci).
If that's the case, shouldn't the vfio.group_list global list and the __vfio_lookup_dev
routine be changed to account for the bus too?
Oops, I just saw the error message in vfio_group_add_dev about the group ID conflict.
Is that warning related to what I mentioned above?
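
If the IDs really are only unique per bus, I would expect the lookup to
have to match on both, ie something like (untested sketch):

	list_for_each(gpos, &vfio.group_list) {
		struct vfio_group *group;

		group = list_entry(gpos, struct vfio_group, group_next);

		if (group->groupid != groupid || group->bus != dev->bus)
			continue;

		/* walk group->device_list as today */
	}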

> +	struct bus_type		*bus;

> +	struct vfio_iommu	*iommu;

> +	struct list_head	device_list;

> +	struct list_head	iommu_next;

> +	struct list_head	group_next;

> +	int			refcnt;

> +};

> +

> +struct vfio_device {

> +	struct device			*dev;

> +	const struct vfio_device_ops	*ops;

> +	struct vfio_iommu		*iommu;


I wonder if you need to have the 'iommu' field here.
vfio_device.iommu is always set and reset together with
vfio_group.iommu.
Given that a vfio_device instance is always linked to a vfio_group
instance, do we need this duplication? Is this duplication there
because you do not want the double dereference device->group->iommu?

> +	struct vfio_group		*group;

> +	struct list_head		device_next;

> +	bool				attached;

> +	int				refcnt;

> +	void				*device_data;

> +};

> +

> +/*

> + * Helper functions called under vfio.lock

> + */

> +

> +/* Return true if any devices within a group are opened */

> +static bool __vfio_group_devs_inuse(struct vfio_group *group)

> +{

> +	struct list_head *pos;

> +

> +	list_for_each(pos, &group->device_list) {

> +		struct vfio_device *device;

> +

> +		device = list_entry(pos, struct vfio_device, device_next);

> +		if (device->refcnt)

> +			return true;

> +	}

> +	return false;

> +}

> +

> +/* Return true if any of the groups attached to an iommu are opened.

> + * We can only tear apart merged groups when nothing is left open. */

> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos;

> +

> +	list_for_each(pos, &iommu->group_list) {

> +		struct vfio_group *group;

> +

> +		group = list_entry(pos, struct vfio_group, iommu_next);

> +		if (group->refcnt)

> +			return true;

> +	}

> +	return false;

> +}

> +

> +/* An iommu is "in use" if it has a file descriptor open or if any of

> + * the groups assigned to the iommu have devices open. */

> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos;

> +

> +	if (iommu->refcnt)

> +		return true;

> +

> +	list_for_each(pos, &iommu->group_list) {

> +		struct vfio_group *group;

> +

> +		group = list_entry(pos, struct vfio_group, iommu_next);

> +

> +		if (__vfio_group_devs_inuse(group))

> +			return true;

> +	}

> +	return false;

> +}


I looked at how you take care of ref counts ...

This is how the tree of vfio_iommu/vfio_group/vfio_device data
structures is organized (I'll use just iommu/group/dev to make
the graph smaller):

            iommu
           /     \
          /       \ 
    group   ...     group
    /  \           /  \   
   /    \         /    \
dev  ..  dev   dev  ..  dev

This is how you get a file descriptor for the three kind of objects:

- group : open /dev/vfio/xxx for group xxx
- iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
- device: group ioctl VFIO_GROUP_GET_DEVICE_FD
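
Ie, roughly this sequence from userspace (the group number and device
name are just made-up examples):

	int group_fd, iommu_fd, dev_fd;

	group_fd = open("/dev/vfio/26", O_RDWR);

	iommu_fd = ioctl(group_fd, VFIO_GROUP_GET_IOMMU_FD);

	/* matched by the vfio driver's ->match(), eg dev_name() for PCI */
	dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");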

Given the above topology, I would assume that:

(1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
                             b) any of its groups is 'inuse'

(2) a  group is 'inuse' if : a) group refcnt > 0, or
                             b) any of its devices is 'inuse'

(3) a device is 'inuse' if : a) device refcnt > 0

You have coded the 'inuse' logic with these three routines:

    __vfio_iommu_inuse, which implements (1) above

and
    __vfio_iommu_groups_inuse
    __vfio_group_devs_inuse

which are used by __vfio_iommu_inuse.
Why don't you check the group refcnt in __vfio_iommu_groups_inuse?

Would it make sense (and the code more readable) to structure the
nested refcnt/inuse check like this?
(The numbers (1)(2)(3) refer to the three 'inuse' conditions above)

   (1)__vfio_iommu_inuse
   |
   +-> check iommu refcnt
   +-> __vfio_iommu_groups_inuse
       |
       +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
                |
                +-> check group refcnt<--MISSING
                +-> __vfio_group_devs_inuse()
                    |
                    +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
                              |
                              +-> check device refcnt
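
In code, the missing level could be something like this (untested
sketch, reusing the existing __vfio_group_devs_inuse):

/* (2) a group is 'inuse' if its fd is open or any of its devices is */
static bool __vfio_iommu_group_inuse(struct vfio_group *group)
{
        if (group->refcnt)
                return true;

        return __vfio_group_devs_inuse(group);
}

static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
{
        struct list_head *pos;

        list_for_each(pos, &iommu->group_list) {
                struct vfio_group *group;

                group = list_entry(pos, struct vfio_group, iommu_next);
                if (__vfio_iommu_group_inuse(group))
                        return true;
        }
        return false;
}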

> +static void __vfio_group_set_iommu(struct vfio_group *group,

> +				   struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos;

> +

> +	if (group->iommu)

> +		list_del(&group->iommu_next);

> +	if (iommu)

> +		list_add(&group->iommu_next, &iommu->group_list);

> +

> +	group->iommu = iommu;


If you remove the vfio_device.iommu field (as suggested above in a previous
comment), the block below would not be needed anymore.

> +	list_for_each(pos, &group->device_list) {

> +		struct vfio_device *device;

> +

> +		device = list_entry(pos, struct vfio_device, device_next);

> +		device->iommu = iommu;

> +	}

> +}

> +

> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,

> +				    struct vfio_device *device)

> +{

> +	BUG_ON(!iommu->domain && device->attached);

> +

> +	if (!iommu->domain || !device->attached)

> +		return;

> +

> +	iommu_detach_device(iommu->domain, device->dev);

> +	device->attached = false;

> +}

> +

> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,

> +				      struct vfio_group *group)

> +{

> +	struct list_head *pos;

> +

> +	list_for_each(pos, &group->device_list) {

> +		struct vfio_device *device;

> +

> +		device = list_entry(pos, struct vfio_device, device_next);

> +		__vfio_iommu_detach_dev(iommu, device);

> +	}

> +}

> +

> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,

> +				   struct vfio_device *device)

> +{

> +	int ret;

> +

> +	BUG_ON(device->attached);

> +

> +	if (!iommu || !iommu->domain)

> +		return -EINVAL;

> +

> +	ret = iommu_attach_device(iommu->domain, device->dev);

> +	if (!ret)

> +		device->attached = true;

> +

> +	return ret;

> +}

> +

> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,

> +				     struct vfio_group *group)

> +{

> +	struct list_head *pos;

> +

> +	list_for_each(pos, &group->device_list) {

> +		struct vfio_device *device;

> +		int ret;

> +

> +		device = list_entry(pos, struct vfio_device, device_next);

> +		ret = __vfio_iommu_attach_dev(iommu, device);

> +		if (ret) {

> +			__vfio_iommu_detach_group(iommu, group);

> +			return ret;

> +		}

> +	}

> +	return 0;

> +}

> +

> +/* The iommu is viable, ie. ready to be configured, when all the

> devices

> + * for all the groups attached to the iommu are bound to their vfio

> device

> + * drivers (ex. vfio-pci).  This sets the device_data private data

> pointer. */

> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)

> +{

> +	struct list_head *gpos, *dpos;

> +

> +	list_for_each(gpos, &iommu->group_list) {

> +		struct vfio_group *group;

> +		group = list_entry(gpos, struct vfio_group, iommu_next);

> +

> +		list_for_each(dpos, &group->device_list) {

> +			struct vfio_device *device;

> +			device = list_entry(dpos,

> +					    struct vfio_device, device_next);

> +

> +			if (!device->device_data)

> +				return false;

> +		}

> +	}

> +	return true;

> +}

> +

> +static void __vfio_close_iommu(struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos;

> +

> +	if (!iommu->domain)

> +		return;

> +

> +	list_for_each(pos, &iommu->group_list) {

> +		struct vfio_group *group;

> +		group = list_entry(pos, struct vfio_group, iommu_next);

> +

> +		__vfio_iommu_detach_group(iommu, group);

> +	}

> +

> +	vfio_iommu_unmapall(iommu);

> +

> +	iommu_domain_free(iommu->domain);

> +	iommu->domain = NULL;

> +	iommu->mm = NULL;

> +}

> +

> +/* Open the IOMMU.  This gates all access to the iommu or device file

> + * descriptors and sets current->mm as the exclusive user. */


Given the fn vfio_group_open (ie, 1st object, 2nd operation), I would have
called this one __vfio_iommu_open (instead of __vfio_open_iommu).
Is it named __vfio_open_iommu to avoid a conflict with the namespace in vfio_iommu.c?      

> +static int __vfio_open_iommu(struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos;

> +	int ret;

> +

> +	if (!__vfio_iommu_viable(iommu))

> +		return -EBUSY;

> +

> +	if (iommu->domain)

> +		return -EINVAL;

> +

> +	iommu->domain = iommu_domain_alloc(iommu->bus);

> +	if (!iommu->domain)

> +		return -EFAULT;

> +

> +	list_for_each(pos, &iommu->group_list) {

> +		struct vfio_group *group;

> +		group = list_entry(pos, struct vfio_group, iommu_next);

> +

> +		ret = __vfio_iommu_attach_group(iommu, group);

> +		if (ret) {

> +			__vfio_close_iommu(iommu);

> +			return ret;

> +		}

> +	}

> +

> +	if (!allow_unsafe_intrs &&

> +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {

> +		__vfio_close_iommu(iommu);

> +		return -EFAULT;

> +	}

> +

> +	iommu->cache = (iommu_domain_has_cap(iommu->domain,

> +					     IOMMU_CAP_CACHE_COHERENCY) != 0);

> +	iommu->mm = current->mm;

> +

> +	return 0;

> +}

> +

> +/* Actively try to tear down the iommu and merged groups.  If there

> are no

> + * open iommu or device fds, we close the iommu.  If we close the

> iommu and

> + * there are also no open group fds, we can futher dissolve the group

> to

> + * iommu association and free the iommu data structure. */

> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)

> +{

> +

> +	if (__vfio_iommu_inuse(iommu))

> +		return -EBUSY;

> +

> +	__vfio_close_iommu(iommu);

> +

> +	if (!__vfio_iommu_groups_inuse(iommu)) {

> +		struct list_head *pos, *ppos;

> +

> +		list_for_each_safe(pos, ppos, &iommu->group_list) {

> +			struct vfio_group *group;

> +

> +			group = list_entry(pos, struct vfio_group,

> iommu_next);

> +			__vfio_group_set_iommu(group, NULL);

> +		}

> +

> +

> +		kfree(iommu);

> +	}

> +

> +	return 0;

> +}

> +

> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)

> +{

> +	struct list_head *gpos;

> +	unsigned int groupid;

> +

> +	if (iommu_device_group(dev, &groupid))

> +		return NULL;

> +

> +	list_for_each(gpos, &vfio.group_list) {

> +		struct vfio_group *group;

> +		struct list_head *dpos;

> +

> +		group = list_entry(gpos, struct vfio_group, group_next);

> +

> +		if (group->groupid != groupid)

> +			continue;

> +

> +		list_for_each(dpos, &group->device_list) {

> +			struct vfio_device *device;

> +

> +			device = list_entry(dpos,

> +					    struct vfio_device, device_next);

> +

> +			if (device->dev == dev)

> +				return device;

> +		}

> +	}

> +	return NULL;

> +}

> +

> +/* All release paths simply decrement the refcnt, attempt to teardown

> + * the iommu and merged groups, and wakeup anything that might be

> + * waiting if we successfully dissolve anything. */

> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)

> +{

> +	bool wake;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	(*refcnt)--;

> +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);

> +

> +	mutex_unlock(&vfio.lock);

> +

> +	if (wake)

> +		wake_up(&vfio.release_q);

> +

> +	return 0;

> +}

> +

> +/*

> + * Device fops - passthrough to vfio device driver w/ device_data

> + */

> +static int vfio_device_release(struct inode *inode, struct file

> *filep)

> +{

> +	struct vfio_device *device = filep->private_data;

> +

> +	vfio_do_release(&device->refcnt, device->iommu);

> +

> +	device->ops->put(device->device_data);

> +

> +	return 0;

> +}

> +

> +static long vfio_device_unl_ioctl(struct file *filep,

> +				  unsigned int cmd, unsigned long arg)

> +{

> +	struct vfio_device *device = filep->private_data;

> +

> +	return device->ops->ioctl(device->device_data, cmd, arg);

> +}

> +

> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,

> +				size_t count, loff_t *ppos)

> +{

> +	struct vfio_device *device = filep->private_data;

> +

> +	return device->ops->read(device->device_data, buf, count, ppos);

> +}

> +

> +static ssize_t vfio_device_write(struct file *filep, const char __user

> *buf,

> +				 size_t count, loff_t *ppos)

> +{

> +	struct vfio_device *device = filep->private_data;

> +

> +	return device->ops->write(device->device_data, buf, count, ppos);

> +}

> +

> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct

> *vma)

> +{

> +	struct vfio_device *device = filep->private_data;

> +

> +	return device->ops->mmap(device->device_data, vma);

> +}

> +

> +#ifdef CONFIG_COMPAT

> +static long vfio_device_compat_ioctl(struct file *filep,

> +				     unsigned int cmd, unsigned long arg)

> +{

> +	arg = (unsigned long)compat_ptr(arg);

> +	return vfio_device_unl_ioctl(filep, cmd, arg);

> +}

> +#endif	/* CONFIG_COMPAT */

> +

> +const struct file_operations vfio_device_fops = {

> +	.owner		= THIS_MODULE,

> +	.release	= vfio_device_release,

> +	.read		= vfio_device_read,

> +	.write		= vfio_device_write,

> +	.unlocked_ioctl	= vfio_device_unl_ioctl,

> +#ifdef CONFIG_COMPAT

> +	.compat_ioctl	= vfio_device_compat_ioctl,

> +#endif

> +	.mmap		= vfio_device_mmap,

> +};

> +

> +/*

> + * Group fops

> + */

> +static int vfio_group_open(struct inode *inode, struct file *filep)

> +{

> +	struct vfio_group *group;

> +	int ret = 0;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	group = idr_find(&vfio.idr, iminor(inode));

> +

> +	if (!group) {

> +		ret = -ENODEV;

> +		goto out;

> +	}

> +

> +	filep->private_data = group;

> +

> +	if (!group->iommu) {

> +		struct vfio_iommu *iommu;

> +

> +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);

> +		if (!iommu) {

> +			ret = -ENOMEM;

> +			goto out;

> +		}

> +		INIT_LIST_HEAD(&iommu->group_list);

> +		INIT_LIST_HEAD(&iommu->dm_list);

> +		mutex_init(&iommu->dgate);

> +		iommu->bus = group->bus;

> +		__vfio_group_set_iommu(group, iommu);

> +	}

> +	group->refcnt++;

> +

> +out:

> +	mutex_unlock(&vfio.lock);

> +

> +	return ret;

> +}

> +

> +static int vfio_group_release(struct inode *inode, struct file *filep)

> +{

> +	struct vfio_group *group = filep->private_data;

> +

> +	return vfio_do_release(&group->refcnt, group->iommu);

> +}

> +

> +/* Attempt to merge the group pointed to by fd into group.  The merge-

> ee

> + * group must not have an iommu or any devices open because we cannot

> + * maintain that context across the merge.  The merge-er group can be

> + * in use. */

> +static int vfio_group_merge(struct vfio_group *group, int fd)


The documentation in vfio.txt explains clearly the logic implemented by
the merge/unmerge group ioctls.
However, what you are doing is not merging groups, but rather adding/removing
groups to/from iommus (and creating flat lists of groups).
For example, when you do

  merge(A,B)

you actually mean to say "merge B to the list of groups assigned to the
same iommu as group A".
For the same reason, you do not really need to provide the group you want
to unmerge from, which means that instead of

  unmerge(A,B) 

you would just need

  unmerge(B)

I understand the reason why it is not a real merge/unmerge (ie, to keep the
original groups so that you can unmerge later) ... however I just wonder if
it wouldn't be more natural to implement the VFIO_IOMMU_ADD_GROUP/DEL_GROUP
iommu ioctls instead? (the relationships between the data structure would
remain the same)
I guess you already discarded this option for some reasons, right? What was
the reason?
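
Ie, from userspace the difference would be roughly (the ioctl names on
the iommu fd are hypothetical):

	/* today, on a group fd: */
	ioctl(groupA_fd, VFIO_GROUP_MERGE, &groupB_fd);
	ioctl(groupA_fd, VFIO_GROUP_UNMERGE, &groupB_fd);

	/* suggested alternative, on the iommu fd: */
	ioctl(iommu_fd, VFIO_IOMMU_ADD_GROUP, &groupB_fd);
	ioctl(iommu_fd, VFIO_IOMMU_DEL_GROUP, &groupB_fd);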

> +{

> +	struct vfio_group *new;

> +	struct vfio_iommu *old_iommu;

> +	struct file *file;

> +	int ret = 0;

> +	bool opened = false;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	file = fget(fd);

> +	if (!file) {

> +		ret = -EBADF;

> +		goto out_noput;

> +	}

> +

> +	/* Sanity check, is this really our fd? */

> +	if (file->f_op != &vfio_group_fops) {

> +		ret = -EINVAL;

> +		goto out;

> +	}

> +

> +	new = file->private_data;

> +

> +	if (!new || new == group || !new->iommu ||

> +	    new->iommu->domain || new->bus != group->bus) {

> +		ret = -EINVAL;

> +		goto out;

> +	}

> +

> +	/* We need to attach all the devices to each domain separately

> +	 * in order to validate that the capabilities match for both.  */

> +	ret = __vfio_open_iommu(new->iommu);

> +	if (ret)

> +		goto out;

> +

> +	if (!group->iommu->domain) {

> +		ret = __vfio_open_iommu(group->iommu);

> +		if (ret)

> +			goto out;

> +		opened = true;

> +	}

> +

> +	/* If cache coherency doesn't match we'd potentialy need to

> +	 * remap existing iommu mappings in the merge-er domain.

> +	 * Poor return to bother trying to allow this currently. */

> +	if (iommu_domain_has_cap(group->iommu->domain,

> +				 IOMMU_CAP_CACHE_COHERENCY) !=

> +	    iommu_domain_has_cap(new->iommu->domain,

> +				 IOMMU_CAP_CACHE_COHERENCY)) {

> +		__vfio_close_iommu(new->iommu);

> +		if (opened)

> +			__vfio_close_iommu(group->iommu);

> +		ret = -EINVAL;

> +		goto out;

> +	}

> +

> +	/* Close the iommu for the merge-ee and attach all its devices

> +	 * to the merge-er iommu. */

> +	__vfio_close_iommu(new->iommu);

> +

> +	ret = __vfio_iommu_attach_group(group->iommu, new);

> +	if (ret)

> +		goto out;

> +

> +	/* set_iommu unlinks new from the iommu, so save a pointer to it

> */

> +	old_iommu = new->iommu;

> +	__vfio_group_set_iommu(new, group->iommu);

> +	kfree(old_iommu);

> +

> +out:

> +	fput(file);

> +out_noput:

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +

> +/* Unmerge the group pointed to by fd from group. */

> +static int vfio_group_unmerge(struct vfio_group *group, int fd)

> +{

> +	struct vfio_group *new;

> +	struct vfio_iommu *new_iommu;

> +	struct file *file;

> +	int ret = 0;

> +

> +	/* Since the merge-out group is already opened, it needs to

> +	 * have an iommu struct associated with it. */

> +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);

> +	if (!new_iommu)

> +		return -ENOMEM;

> +

> +	INIT_LIST_HEAD(&new_iommu->group_list);

> +	INIT_LIST_HEAD(&new_iommu->dm_list);

> +	mutex_init(&new_iommu->dgate);

> +	new_iommu->bus = group->bus;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	file = fget(fd);

> +	if (!file) {

> +		ret = -EBADF;

> +		goto out_noput;

> +	}

> +

> +	/* Sanity check, is this really our fd? */

> +	if (file->f_op != &vfio_group_fops) {

> +		ret = -EINVAL;

> +		goto out;

> +	}

> +

> +	new = file->private_data;

> +	if (!new || new == group || new->iommu != group->iommu) {

> +		ret = -EINVAL;

> +		goto out;

> +	}

> +

> +	/* We can't merge-out a group with devices still in use. */

> +	if (__vfio_group_devs_inuse(new)) {

> +		ret = -EBUSY;

> +		goto out;

> +	}

> +

> +	__vfio_iommu_detach_group(group->iommu, new);

> +	__vfio_group_set_iommu(new, new_iommu);

> +

> +out:

> +	fput(file);

> +out_noput:

> +	if (ret)

> +		kfree(new_iommu);

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +

> +/* Get a new iommu file descriptor.  This will open the iommu, setting

> + * the current->mm ownership if it's not already set. */

> +static int vfio_group_get_iommu_fd(struct vfio_group *group)

> +{

> +	int ret = 0;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	if (!group->iommu->domain) {

> +		ret = __vfio_open_iommu(group->iommu);

> +		if (ret)

> +			goto out;

> +	}

> +

> +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,

> +			       group->iommu, O_RDWR);

> +	if (ret < 0)

> +		goto out;

> +

> +	group->iommu->refcnt++;

> +out:

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +

> +/* Get a new device file descriptor.  This will open the iommu,

> setting

> + * the current->mm ownership if it's not already set.  It's difficult

> to

> + * specify the requirements for matching a user supplied buffer to a

> + * device, so we use a vfio driver callback to test for a match.  For

> + * PCI, dev_name(dev) is unique, but other drivers may require

> including

> + * a parent device string. */

> +static int vfio_group_get_device_fd(struct vfio_group *group, char

> *buf)

> +{

> +	struct vfio_iommu *iommu = group->iommu;

> +	struct list_head *gpos;

> +	int ret = -ENODEV;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	if (!iommu->domain) {

> +		ret = __vfio_open_iommu(iommu);

> +		if (ret)

> +			goto out;

> +	}

> +

> +	list_for_each(gpos, &iommu->group_list) {

> +		struct list_head *dpos;

> +

> +		group = list_entry(gpos, struct vfio_group, iommu_next);

> +

> +		list_for_each(dpos, &group->device_list) {

> +			struct vfio_device *device;

> +

> +			device = list_entry(dpos,

> +					    struct vfio_device, device_next);

> +

> +			if (device->ops->match(device->dev, buf)) {

> +				struct file *file;

> +

> +				if (device->ops->get(device->device_data)) {

> +					ret = -EFAULT;

> +					goto out;

> +				}

> +

> +				/* We can't use anon_inode_getfd(), like above

> +				 * because we need to modify the f_mode flags

> +				 * directly to allow more than just ioctls */

> +				ret = get_unused_fd();

> +				if (ret < 0) {

> +					device->ops->put(device->device_data);

> +					goto out;

> +				}

> +

> +				file = anon_inode_getfile("[vfio-device]",

> +							  &vfio_device_fops,

> +							  device, O_RDWR);

> +				if (IS_ERR(file)) {

> +					put_unused_fd(ret);

> +					ret = PTR_ERR(file);

> +					device->ops->put(device->device_data);

> +					goto out;

> +				}

> +

> +				/* Todo: add an anon_inode interface to do

> +				 * this.  Appears to be missing by lack of

> +				 * need rather than explicitly prevented.

> +				 * Now there's need. */

> +				file->f_mode |= (FMODE_LSEEK |

> +						 FMODE_PREAD |

> +						 FMODE_PWRITE);

> +

> +				fd_install(ret, file);

> +

> +				device->refcnt++;

> +				goto out;

> +			}

> +		}

> +	}

> +out:

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +

> +static long vfio_group_unl_ioctl(struct file *filep,

> +				 unsigned int cmd, unsigned long arg)

> +{

> +	struct vfio_group *group = filep->private_data;

> +

> +	if (cmd == VFIO_GROUP_GET_FLAGS) {

> +		u64 flags = 0;

> +

> +		mutex_lock(&vfio.lock);

> +		if (__vfio_iommu_viable(group->iommu))

> +			flags |= VFIO_GROUP_FLAGS_VIABLE;

> +		mutex_unlock(&vfio.lock);

> +

> +		if (group->iommu->mm)

> +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;

> +

> +		return put_user(flags, (u64 __user *)arg);

> +	}

> +

> +	/* Below commands are restricted once the mm is set */

> +	if (group->iommu->mm && group->iommu->mm != current->mm)

> +		return -EPERM;

> +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {

> +		int fd;

> +

> +		if (get_user(fd, (int __user *)arg))

> +			return -EFAULT;

> +		if (fd < 0)

> +			return -EINVAL;

> +

> +		if (cmd == VFIO_GROUP_MERGE)

> +			return vfio_group_merge(group, fd);

> +		else

> +			return vfio_group_unmerge(group, fd);

> +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {

> +		return vfio_group_get_iommu_fd(group);

> +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {

> +		char *buf;

> +		int ret;

> +

> +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);

> +		if (IS_ERR(buf))

> +			return PTR_ERR(buf);

> +

> +		ret = vfio_group_get_device_fd(group, buf);

> +		kfree(buf);

> +		return ret;

> +	}

> +

> +	return -ENOSYS;

> +}

> +

> +#ifdef CONFIG_COMPAT

> +static long vfio_group_compat_ioctl(struct file *filep,

> +				    unsigned int cmd, unsigned long arg)

> +{

> +	arg = (unsigned long)compat_ptr(arg);

> +	return vfio_group_unl_ioctl(filep, cmd, arg);

> +}

> +#endif	/* CONFIG_COMPAT */

> +

> +static const struct file_operations vfio_group_fops = {

> +	.owner		= THIS_MODULE,

> +	.open		= vfio_group_open,

> +	.release	= vfio_group_release,

> +	.unlocked_ioctl	= vfio_group_unl_ioctl,

> +#ifdef CONFIG_COMPAT

> +	.compat_ioctl	= vfio_group_compat_ioctl,

> +#endif

> +};

> +

> +/* iommu fd release hook */


Given vfio_device_release and
      vfio_group_release (ie, 1st object, 2nd operation), I was
going to suggest renaming the fn below to vfio_iommu_release, but
then I saw the latter name being already used in vfio_iommu.c ...
a bit confusing but I guess it's ok then.

> +int vfio_release_iommu(struct vfio_iommu *iommu)

> +{

> +	return vfio_do_release(&iommu->refcnt, iommu);

> +}

> +

> +/*

> + * VFIO driver API

> + */

> +

> +/* Add a new device to the vfio framework with associated vfio driver

> + * callbacks.  This is the entry point for vfio drivers to register

> devices. */

> +int vfio_group_add_dev(struct device *dev, const struct

> vfio_device_ops *ops)

> +{

> +	struct list_head *pos;

> +	struct vfio_group *group = NULL;

> +	struct vfio_device *device = NULL;

> +	unsigned int groupid;

> +	int ret = 0;

> +	bool new_group = false;

> +

> +	if (!ops)

> +		return -EINVAL;

> +

> +	if (iommu_device_group(dev, &groupid))

> +		return -ENODEV;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	list_for_each(pos, &vfio.group_list) {

> +		group = list_entry(pos, struct vfio_group, group_next);

> +		if (group->groupid == groupid)

> +			break;

> +		group = NULL;

> +	}

> +

> +	if (!group) {

> +		int minor;

> +

> +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {

> +			ret = -ENOMEM;

> +			goto out;

> +		}

> +

> +		group = kzalloc(sizeof(*group), GFP_KERNEL);

> +		if (!group) {

> +			ret = -ENOMEM;

> +			goto out;

> +		}

> +

> +		group->groupid = groupid;

> +		INIT_LIST_HEAD(&group->device_list);

> +

> +		ret = idr_get_new(&vfio.idr, group, &minor);

> +		if (ret == 0 && minor > MINORMASK) {

> +			idr_remove(&vfio.idr, minor);

> +			kfree(group);

> +			ret = -ENOSPC;

> +			goto out;

> +		}

> +

> +		group->devt = MKDEV(MAJOR(vfio.devt), minor);

> +		device_create(vfio.class, NULL, group->devt,

> +			      group, "%u", groupid);

> +

> +		group->bus = dev->bus;

> +		list_add(&group->group_next, &vfio.group_list);

> +		new_group = true;

> +	} else {

> +		if (group->bus != dev->bus) {

> +			printk(KERN_WARNING

> +			       "Error: IOMMU group ID conflict.  Group ID %u

> "

> +				"on both bus %s and %s\n", groupid,

> +				group->bus->name, dev->bus->name);

> +			ret = -EFAULT;

> +			goto out;

> +		}

> +

> +		list_for_each(pos, &group->device_list) {

> +			device = list_entry(pos,

> +					    struct vfio_device, device_next);

> +			if (device->dev == dev)

> +				break;

> +			device = NULL;

> +		}

> +	}

> +

> +	if (!device) {

> +		if (__vfio_group_devs_inuse(group) ||

> +		    (group->iommu && group->iommu->refcnt)) {

> +			printk(KERN_WARNING

> +			       "Adding device %s to group %u while group is

> already in use!!\n",

> +			       dev_name(dev), group->groupid);

> +			/* XXX How to prevent other drivers from claiming? */


Here we are adding a device (not yet assigned to a vfio bus) to a group
that is already in use.
Given that it would not be acceptable for this device to get assigned
to a non-vfio driver, why not force such an assignment here then?
I am not sure though what the best way to do it would be.
What about something like this:

- when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
  notification it assigns to the device a PCI ID that will make sure
  the vfio-pci's probe routine will be invoked (and no other driver can
  therefore claim the device). That PCI ID would have to be added
  to the vfio_pci_driver's id_table (it would be the exception to the
  "only dynamic IDs" rule). Too hackish?

> +		}

> +

> +		device = kzalloc(sizeof(*device), GFP_KERNEL);

> +		if (!device) {

> +			/* If we just created this group, tear it down */

> +			if (new_group) {

> +				list_del(&group->group_next);

> +				device_destroy(vfio.class, group->devt);

> +				idr_remove(&vfio.idr, MINOR(group->devt));

> +				kfree(group);

> +			}

> +			ret = -ENOMEM;

> +			goto out;

> +		}

> +

> +		list_add(&device->device_next, &group->device_list);

> +		device->dev = dev;

> +		device->ops = ops;

> +		device->iommu = group->iommu; /* NULL if new */


Shouldn't you check the return code of __vfio_iommu_attach_dev?

> +		__vfio_iommu_attach_dev(group->iommu, device);

> +	}

> +out:

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);

> +

> +/* Remove a device from the vfio framework */


This fn below does not return any error code. Ok ...
However, there are a number of error cases that you test, for example
- device that does not belong to any group (according to iommu API)
- device that belongs to a group but that does not appear in the list
  of devices of the vfio_group structure.
Are the above two error checks just paranoia or are those errors actually possible?
If they were possible, shouldn't we generate a warning (most probably
it would be a bug in the code)?
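
Eg, something like (sketch):

	if (WARN_ON(!group))		/* instead of the silent goto out */
		goto out;

	/* and likewise after the device lookup */
	if (WARN_ON(!device))
		goto out;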

> +void vfio_group_del_dev(struct device *dev)

> +{

> +	struct list_head *pos;

> +	struct vfio_group *group = NULL;

> +	struct vfio_device *device = NULL;

> +	unsigned int groupid;

> +

> +	if (iommu_device_group(dev, &groupid))

> +		return;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	list_for_each(pos, &vfio.group_list) {

> +		group = list_entry(pos, struct vfio_group, group_next);

> +		if (group->groupid == groupid)

> +			break;

> +		group = NULL;

> +	}

> +

> +	if (!group)

> +		goto out;

> +

> +	list_for_each(pos, &group->device_list) {

> +		device = list_entry(pos, struct vfio_device, device_next);

> +		if (device->dev == dev)

> +			break;

> +		device = NULL;

> +	}

> +

> +	if (!device)

> +		goto out;

> +

> +	BUG_ON(device->refcnt);

> +

> +	if (device->attached)

> +		__vfio_iommu_detach_dev(group->iommu, device);

> +

> +	list_del(&device->device_next);

> +	kfree(device);

> +

> +	/* If this was the only device in the group, remove the group.

> +	 * Note that we intentionally unmerge empty groups here if the

> +	 * group fd isn't opened. */

> +	if (list_empty(&group->device_list) && group->refcnt == 0) {

> +		struct vfio_iommu *iommu = group->iommu;

> +

> +		if (iommu) {

> +			__vfio_group_set_iommu(group, NULL);

> +			__vfio_try_dissolve_iommu(iommu);

> +		}

> +

> +		device_destroy(vfio.class, group->devt);

> +		idr_remove(&vfio.idr, MINOR(group->devt));

> +		list_del(&group->group_next);

> +		kfree(group);

> +	}

> +out:

> +	mutex_unlock(&vfio.lock);

> +}

> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);

> +

> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this

> + * entry point is used to mark the device usable (viable).  The vfio

> + * device driver associates a private device_data struct with the

> device

> + * here, which will later be return for vfio_device_fops callbacks. */

> +int vfio_bind_dev(struct device *dev, void *device_data)

> +{

> +	struct vfio_device *device;

> +	int ret = -EINVAL;

> +

> +	BUG_ON(!device_data);

> +

> +	mutex_lock(&vfio.lock);

> +

> +	device = __vfio_lookup_dev(dev);

> +

> +	BUG_ON(!device);

> +

> +	ret = dev_set_drvdata(dev, device);

> +	if (!ret)

> +		device->device_data = device_data;

> +

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +EXPORT_SYMBOL_GPL(vfio_bind_dev);

> +

> +/* A device is only removeable if the iommu for the group is not in

> use. */

> +static bool vfio_device_removeable(struct vfio_device *device)

> +{

> +	bool ret = true;

> +

> +	mutex_lock(&vfio.lock);

> +

> +	if (device->iommu && __vfio_iommu_inuse(device->iommu))

> +		ret = false;

> +

> +	mutex_unlock(&vfio.lock);

> +	return ret;

> +}

> +

> +/* Notify vfio that a device is being unbound from the vfio device

> driver

> + * and return the device private device_data pointer.  If the group is

> + * in use, we need to block or take other measures to make it safe for

> + * the device to be removed from the iommu. */

> +void *vfio_unbind_dev(struct device *dev)

> +{

> +	struct vfio_device *device = dev_get_drvdata(dev);

> +	void *device_data;

> +

> +	BUG_ON(!device);

> +

> +again:

> +	if (!vfio_device_removeable(device)) {

> +		/* XXX signal for all devices in group to be removed or

> +		 * resort to killing the process holding the device fds.

> +		 * For now just block waiting for releases to wake us. */

> +		wait_event(vfio.release_q, vfio_device_removeable(device));


Any new idea/proposal on how to handle this situation?
The last one I remember was to leave the soft/hard/etc timeout handling in
userspace and implement it as a sort of policy. Is that one still the most
likely candidate solution to handle this situation?

> +	}

> +

> +	mutex_lock(&vfio.lock);

> +

> +	/* Need to re-check that the device is still removeable under

> lock. */

> +	if (device->iommu && __vfio_iommu_inuse(device->iommu)) {

> +		mutex_unlock(&vfio.lock);

> +		goto again;

> +	}

> +

> +	device_data = device->device_data;

> +

> +	device->device_data = NULL;

> +	dev_set_drvdata(dev, NULL);

> +

> +	mutex_unlock(&vfio.lock);

> +	return device_data;

> +}

> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);

> +

> +/*

> + * Module/class support

> + */

> +static void vfio_class_release(struct kref *kref)

> +{

> +	class_destroy(vfio.class);

> +	vfio.class = NULL;

> +}

> +

> +static char *vfio_devnode(struct device *dev, mode_t *mode)

> +{

> +	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));

> +}

> +

> +static int __init vfio_init(void)

> +{

> +	int ret;

> +

> +	idr_init(&vfio.idr);

> +	mutex_init(&vfio.lock);

> +	INIT_LIST_HEAD(&vfio.group_list);

> +	init_waitqueue_head(&vfio.release_q);

> +

> +	kref_init(&vfio.kref);

> +	vfio.class = class_create(THIS_MODULE, "vfio");

> +	if (IS_ERR(vfio.class)) {

> +		ret = PTR_ERR(vfio.class);

> +		goto err_class;

> +	}

> +

> +	vfio.class->devnode = vfio_devnode;

> +

> +	/* FIXME - how many minors to allocate... all of them! */

> +	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");

> +	if (ret)

> +		goto err_chrdev;

> +

> +	cdev_init(&vfio.cdev, &vfio_group_fops);

> +	ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);

> +	if (ret)

> +		goto err_cdev;

> +

> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");

> +

> +	return 0;

> +

> +err_cdev:

> +	unregister_chrdev_region(vfio.devt, MINORMASK);

> +err_chrdev:

> +	kref_put(&vfio.kref, vfio_class_release);

> +err_class:

> +	return ret;

> +}

> +

> +static void __exit vfio_cleanup(void)

> +{

> +	struct list_head *gpos, *gppos;

> +

> +	list_for_each_safe(gpos, gppos, &vfio.group_list) {
> +		struct vfio_group *group;
> +		struct list_head *dpos, *dppos;
> +
> +		group = list_entry(gpos, struct vfio_group, group_next);
> +
> +		list_for_each_safe(dpos, dppos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +			vfio_group_del_dev(device->dev);
> +		}
> +	}
> +
> +	idr_destroy(&vfio.idr);
> +	cdev_del(&vfio.cdev);
> +	unregister_chrdev_region(vfio.devt, MINORMASK);
> +	kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> +	struct iommu_domain		*domain;
> +	struct bus_type			*bus;
> +	struct mutex			dgate;
> +	struct list_head		dm_list;
> +	struct mm_struct		*mm;
> +	struct list_head		group_list;
> +	int				refcnt;
> +	bool				cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);
> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> +			      const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS		_IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE	(1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
> +#define VFIO_GROUP_MERGE		_IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE		_IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD		_IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 104, char *)
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> +	__u64	len;		/* length of structure */
> +	__u64	vaddr;		/* process virtual addr */
> +	__u64	dmaaddr;	/* desired and/or returned dma address */
> +	__u64	size;		/* size in bytes */
> +	__u64	flags;
> +#define	VFIO_DMA_MAP_FLAG_WRITE		(1 << 0) /* req writeable DMA mem */
> +};
> +
> +#define	VFIO_IOMMU_GET_FLAGS		_IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY	(1 << 0)
> +#define	VFIO_IOMMU_MAP_DMA		_IOWR(';', 106, struct vfio_dma_map)
> +#define	VFIO_IOMMU_UNMAP_DMA		_IOWR(';', 107, struct vfio_dma_map)
> +
> +#define VFIO_DEVICE_GET_FLAGS		_IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI		(1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT		(1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET	(1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS	_IOR(';', 109, int)
> +
> +struct vfio_region_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* region number */
> +	__u64	size;		/* size in bytes of region */
> +	__u64	offset;		/* start offset of region */
> +	__u64	flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP		(1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO		(1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID	(1 << 2)
> +	__u64	phys;		/* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO	_IOWR(';', 110, struct vfio_region_info)
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS	_IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* IRQ number */
> +	__u32	count;		/* number of individual IRQs */
> +	__u32	flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL		(1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO	_IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS	_IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ		_IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD	_IOW(';', 115, int)
> +
> +#define VFIO_DEVICE_RESET		_IO(';', 116)
> +
> +struct vfio_dtpath {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u64	flags;
> +#define VFIO_DTPATH_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ		(1 << 1)
> +	char	*path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH		_IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u32	prop_type;
> +	__u32	prop_index;
> +	__u64	flags;
> +#define VFIO_DTINDEX_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ		(1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX		_IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */
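
For reference, a minimal userspace sketch of the ioctl flow defined above
(the group number, device name, and buffer below are hypothetical, and all
error checking is omitted):

	int group, iommu, device;
	struct vfio_dma_map dmap = { .len = sizeof(dmap) };
	void *buf;

	group = open("/dev/vfio/26", O_RDWR);
	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);

	posix_memalign(&buf, 4096, 4096);	/* page aligned, as required */
	dmap.vaddr   = (__u64)(unsigned long)buf;
	dmap.dmaaddr = 0;			/* desired iova */
	dmap.size    = 4096;
	dmap.flags   = VFIO_DMA_MAP_FLAG_WRITE;
	ioctl(iommu, VFIO_IOMMU_MAP_DMA, &dmap);

	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");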


/Chris
Alex Williamson - Nov. 9, 2011, 6:02 p.m.
On Wed, 2011-11-09 at 02:11 -0600, Christian Benvenuti (benve) wrote:
> I have not gone through the all patch yet, but here are
> my first comments/questions about the code in vfio_main.c
> (and pci/vfio_pci.c).

Thanks!  Comments inline...

> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Thursday, November 03, 2011 1:12 PM
> > To: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;
> > dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Christian
> > Benvenuti (benve); Aaron Fabbri (aafabbri); B08248@freescale.com;
> > B07421@freescale.com; avi@redhat.com; konrad.wilk@oracle.com;
> > kvm@vger.kernel.org; qemu-devel@nongnu.org; iommu@lists.linux-
> > foundation.org; linux-pci@vger.kernel.org
> > Subject: [RFC PATCH] vfio: VFIO Driver core framework
> 
> <snip>
> 
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 0000000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or
> > modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs.h>
> > +#include <linux/idr.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/wait.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION	"0.2"
> > +#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > +#define DRIVER_DESC	"VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
> > +module_param(allow_unsafe_intrs, int, 0);
> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > +        "Allow use of IOMMUs which do not support interrupt
> > remapping");
> > +
> > +static struct vfio {
> > +	dev_t			devt;
> > +	struct cdev		cdev;
> > +	struct list_head	group_list;
> > +	struct mutex		lock;
> > +	struct kref		kref;
> > +	struct class		*class;
> > +	struct idr		idr;
> > +	wait_queue_head_t	release_q;
> > +} vfio;
> > +
> > +static const struct file_operations vfio_group_fops;
> > +extern const struct file_operations vfio_iommu_fops;
> > +
> > +struct vfio_group {
> > +	dev_t			devt;
> > +	unsigned int		groupid;
> 
> This groupid is returned by the device_group callback you recently added
> with a separate (not yet in tree) IOMMU patch.
> Is it correct to say that the scope of this ID is the bus the iommu
> belongs too (but you use it as if it was global)?
> I believe there is nothing right now to ensure the uniqueness of such
> ID across bus types (assuming there will be other bus drivers in the
> future besides vfio-pci).
> If that's the case, the vfio.group_list global list and the __vfio_lookup_dev
> routine should be changed to account for the bus too?
> Ops, I just saw the error msg in vfio_group_add_dev about the group id conflict.
> Is that warning related to what I mentioned above?

Yeah, this is a concern, but I can't think of a system where we would
manifest a collision.  The IOMMU driver is expected to provide unique
groupids for all devices below them, but we could imagine a system that
implements two different bus_types, each with a different IOMMU driver
and we have no coordination between them.  Perhaps since we have
iommu_ops per bus, we should also expose the bus in the vfio group path,
ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This
means userspace would need to do a readlink of the subsystem entry where
it finds the iommu_group to find the vfio group.  Reasonable?
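
For illustration, userspace might resolve the group node under that layout
with something roughly like the sketch below (the device path, group number,
and exact sysfs attributes are assumptions, not part of this patch;
basename() is from <libgen.h>):

	char link[256], path[256];
	ssize_t n;
	unsigned int groupid = 26;	/* hypothetical, from the iommu_group attribute */

	/* the subsystem link ends in the bus name, e.g. ".../bus/pci" */
	n = readlink("/sys/devices/pci0000:00/0000:00:19.0/subsystem",
		     link, sizeof(link) - 1);
	if (n > 0) {
		link[n] = '\0';
		snprintf(path, sizeof(path), "/dev/vfio/%s/%u",
			 basename(link), groupid);	/* -> /dev/vfio/pci/26 */
	}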

> > +	struct bus_type		*bus;
> > +	struct vfio_iommu	*iommu;
> > +	struct list_head	device_list;
> > +	struct list_head	iommu_next;
> > +	struct list_head	group_next;
> > +	int			refcnt;
> > +};
> > +
> > +struct vfio_device {
> > +	struct device			*dev;
> > +	const struct vfio_device_ops	*ops;
> > +	struct vfio_iommu		*iommu;
> 
> I wonder if you need to have the 'iommu' field here.
> vfio_device.iommu is always set and reset together with
> vfio_group.iommu.
> Given that a vfio_device instance is always linked to a vfio_group
> instance, do we need this duplication? Is this duplication there
> because you do not want the double dereference device->group->iommu?

I think that was my initial goal in duplicating the pointer on the
device.  I believe I was also at one point passing a vfio_device around
and needed the pointer.  We seem to be getting along fine w/o that and I
don't see any performance-sensitive paths for getting from the device
to the iommu, so I'll see about removing it.

> > +	struct vfio_group		*group;
> > +	struct list_head		device_next;
> > +	bool				attached;
> > +	int				refcnt;
> > +	void				*device_data;
> > +};
> > +
> > +/*
> > + * Helper functions called under vfio.lock
> > + */
> > +
> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		if (device->refcnt)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +/* Return true if any of the groups attached to an iommu are opened.
> > + * We can only tear apart merged groups when nothing is left open. */
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +		if (group->refcnt)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +/* An iommu is "in use" if it has a file descriptor open or if any of
> > + * the groups assigned to the iommu have devices open. */
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (iommu->refcnt)
> > +		return true;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		if (__vfio_group_devs_inuse(group))
> > +			return true;
> > +	}
> > +	return false;
> > +}
> 
> I looked at how you take care of ref counts ...
> 
> This is how the tree of vfio_iommu/vfio_group/vfio_device data
> Structures is organized (I'll use just iommu/group/dev to make
> the graph smaller):
> 
>             iommu
>            /     \
>           /       \ 
>     group   ...     group
>     /  \           /  \   
>    /    \         /    \
> dev  ..  dev   dev  ..  dev
> 
> This is how you get a file descriptor for the three kind of objects:
> 
> - group : open /dev/vfio/xxx for group xxx
> - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> 
> Given the above topology, I would assume that:
> 
> (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
>                              b) any of its groups is 'inuse'
> 
> (2) a  group is 'inuse' if : a) group refcnt > 0, or
>                              b) any of its devices is 'inuse'
> 
> (3) a device is 'inuse' if : a) device refcnt > 0

(2) is a bit debatable.  I've wrestled with this one for a while.  The
vfio_iommu serves two purposes.  First, it is the object we use for
managing iommu domains, which includes allocating domains and attaching
devices to domains.  Groups objects aren't involved here, they just
manage the set of devices.  The second role is to manage merged groups,
because whether or not groups can be merged is a function of iommu
domain compatibility.

So if we look at "is the iommu in use?" ie. can I destroy the mapping
context, detach devices and free the domain, the reference count on the
group is irrelevant.  The user has to have a device or iommu file
descriptor opened somewhere, across the group or merged group, for that
context to be maintained.  A reasonable requirement, I think.

However, if we ask "is the group in use?" ie. can I not only destroy the
mappings above, but also automatically tear apart merged groups, then I
think we need to look at the group refcnt.

There's also a symmetry factor: the group is a benign entry point to
device access.  It's only when device or iommu access is granted that
the group gains any real power.  Therefore, shouldn't that power also be
removed when those access points are closed?

> You have coded the 'inuse' logic with these three routines:
> 
>     __vfio_iommu_inuse, which implements (1) above
> 
> and
>     __vfio_iommu_groups_inuse

Implements (2.a)

>     __vfio_group_devs_inuse

Implements (2.b)

> which are used by __vfio_iommu_inuse.
> Why don't you check the group refcnt in __vfio_iommu_groups_inuse?

Hopefully explained above, but open for discussion.

> Would it make sense (and the code more readable) to structure the
> nested refcnt/inuse check like this?
> (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
> 
>    (1)__vfio_iommu_inuse
>    |
>    +-> check iommu refcnt
>    +-> __vfio_iommu_groups_inuse
>        |
>        +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
>                 |
>                 +-> check group refcnt<--MISSING
>                 +-> __vfio_group_devs_inuse()
>                     |
>                     +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
>                               |
>                               +-> check device refcnt

We currently do:

   (1)__vfio_iommu_inuse
    |
    +-> check iommu refcnt
    +-> __vfio_group_devs_inuse
        |
        +->LOOP: (2.b)__vfio_group_devs_inuse
                  |
                  +-> LOOP: (3) check device refcnt

If that passes, the iommu context can be dissolved and we follow up
with:

    __vfio_iommu_groups_inuse
    |
    +-> LOOP: (2.a)__vfio_iommu_groups_inuse
               |
               +-> check group refcnt

If that passes, groups can also be unmerged.

Is this right?

> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > +				   struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (group->iommu)
> > +		list_del(&group->iommu_next);
> > +	if (iommu)
> > +		list_add(&group->iommu_next, &iommu->group_list);
> > +
> > +	group->iommu = iommu;
> 
> If you remove the vfio_device.iommu field (as suggested above in a previous
> Comment), the block below would not be needed anymore.

Yep, I'll try removing that and see how it plays out.

> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		device->iommu = iommu;
> > +	}
> > +}
> > +
> > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > +				    struct vfio_device *device)
> > +{
> > +	BUG_ON(!iommu->domain && device->attached);
> > +
> > +	if (!iommu->domain || !device->attached)
> > +		return;
> > +
> > +	iommu_detach_device(iommu->domain, device->dev);
> > +	device->attached = false;
> > +}
> > +
> > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > +				      struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		__vfio_iommu_detach_dev(iommu, device);
> > +	}
> > +}
> > +
> > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > +				   struct vfio_device *device)
> > +{
> > +	int ret;
> > +
> > +	BUG_ON(device->attached);
> > +
> > +	if (!iommu || !iommu->domain)
> > +		return -EINVAL;
> > +
> > +	ret = iommu_attach_device(iommu->domain, device->dev);
> > +	if (!ret)
> > +		device->attached = true;
> > +
> > +	return ret;
> > +}
> > +
> > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> > +				     struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +		int ret;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		ret = __vfio_iommu_attach_dev(iommu, device);
> > +		if (ret) {
> > +			__vfio_iommu_detach_group(iommu, group);
> > +			return ret;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +/* The iommu is viable, ie. ready to be configured, when all the
> > devices
> > + * for all the groups attached to the iommu are bound to their vfio
> > device
> > + * drivers (ex. vfio-pci).  This sets the device_data private data
> > pointer. */
> > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *gpos, *dpos;
> > +
> > +	list_for_each(gpos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (!device->device_data)
> > +				return false;
> > +		}
> > +	}
> > +	return true;
> > +}
> > +
> > +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (!iommu->domain)
> > +		return;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		__vfio_iommu_detach_group(iommu, group);
> > +	}
> > +
> > +	vfio_iommu_unmapall(iommu);
> > +
> > +	iommu_domain_free(iommu->domain);
> > +	iommu->domain = NULL;
> > +	iommu->mm = NULL;
> > +}
> > +
> > +/* Open the IOMMU.  This gates all access to the iommu or device file
> > + * descriptors and sets current->mm as the exclusive user. */
> 
> Given the fn  vfio_group_open (ie, 1st object, 2nd operation), I would have
> called this one __vfio_iommu_open (instead of __vfio_open_iommu).
> Is it named __vfio_open_iommu to avoid a conflict with the namespace in vfio_iommu.c?      

I would have expected that too, I'll look at renaming these.

> > +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +	int ret;
> > +
> > +	if (!__vfio_iommu_viable(iommu))
> > +		return -EBUSY;
> > +
> > +	if (iommu->domain)
> > +		return -EINVAL;
> > +
> > +	iommu->domain = iommu_domain_alloc(iommu->bus);
> > +	if (!iommu->domain)
> > +		return -EFAULT;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		ret = __vfio_iommu_attach_group(iommu, group);
> > +		if (ret) {
> > +			__vfio_close_iommu(iommu);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	if (!allow_unsafe_intrs &&
> > +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > +		__vfio_close_iommu(iommu);
> > +		return -EFAULT;
> > +	}
> > +
> > +	iommu->cache = (iommu_domain_has_cap(iommu->domain,
> > +					     IOMMU_CAP_CACHE_COHERENCY) != 0);
> > +	iommu->mm = current->mm;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Actively try to tear down the iommu and merged groups.  If there
> > are no
> > + * open iommu or device fds, we close the iommu.  If we close the
> > iommu and
> > + * there are also no open group fds, we can futher dissolve the group
> > to
> > + * iommu association and free the iommu data structure. */
> > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> > +{
> > +
> > +	if (__vfio_iommu_inuse(iommu))
> > +		return -EBUSY;
> > +
> > +	__vfio_close_iommu(iommu);
> > +
> > +	if (!__vfio_iommu_groups_inuse(iommu)) {
> > +		struct list_head *pos, *ppos;
> > +
> > +		list_for_each_safe(pos, ppos, &iommu->group_list) {
> > +			struct vfio_group *group;
> > +
> > +			group = list_entry(pos, struct vfio_group,
> > iommu_next);
> > +			__vfio_group_set_iommu(group, NULL);
> > +		}
> > +
> > +
> > +		kfree(iommu);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> > +{
> > +	struct list_head *gpos;
> > +	unsigned int groupid;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return NULL;
> > +
> > +	list_for_each(gpos, &vfio.group_list) {
> > +		struct vfio_group *group;
> > +		struct list_head *dpos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > +		if (group->groupid != groupid)
> > +			continue;
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (device->dev == dev)
> > +				return device;
> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/* All release paths simply decrement the refcnt, attempt to teardown
> > + * the iommu and merged groups, and wakeup anything that might be
> > + * waiting if we successfully dissolve anything. */
> > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> > +{
> > +	bool wake;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	(*refcnt)--;
> > +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> > +
> > +	mutex_unlock(&vfio.lock);
> > +
> > +	if (wake)
> > +		wake_up(&vfio.release_q);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Device fops - passthrough to vfio device driver w/ device_data
> > + */
> > +static int vfio_device_release(struct inode *inode, struct file
> > *filep)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	vfio_do_release(&device->refcnt, device->iommu);
> > +
> > +	device->ops->put(device->device_data);
> > +
> > +	return 0;
> > +}
> > +
> > +static long vfio_device_unl_ioctl(struct file *filep,
> > +				  unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->ioctl(device->device_data, cmd, arg);
> > +}
> > +
> > +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> > +				size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->read(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static ssize_t vfio_device_write(struct file *filep, const char __user
> > *buf,
> > +				 size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->write(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static int vfio_device_mmap(struct file *filep, struct vm_area_struct
> > *vma)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->mmap(device->device_data, vma);
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_device_compat_ioctl(struct file *filep,
> > +				     unsigned int cmd, unsigned long arg)
> > +{
> > +	arg = (unsigned long)compat_ptr(arg);
> > +	return vfio_device_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif	/* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_device_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.release	= vfio_device_release,
> > +	.read		= vfio_device_read,
> > +	.write		= vfio_device_write,
> > +	.unlocked_ioctl	= vfio_device_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= vfio_device_compat_ioctl,
> > +#endif
> > +	.mmap		= vfio_device_mmap,
> > +};
> > +
> > +/*
> > + * Group fops
> > + */
> > +static int vfio_group_open(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_group *group;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	group = idr_find(&vfio.idr, iminor(inode));
> > +
> > +	if (!group) {
> > +		ret = -ENODEV;
> > +		goto out;
> > +	}
> > +
> > +	filep->private_data = group;
> > +
> > +	if (!group->iommu) {
> > +		struct vfio_iommu *iommu;
> > +
> > +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > +		if (!iommu) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		INIT_LIST_HEAD(&iommu->group_list);
> > +		INIT_LIST_HEAD(&iommu->dm_list);
> > +		mutex_init(&iommu->dgate);
> > +		iommu->bus = group->bus;
> > +		__vfio_group_set_iommu(group, iommu);
> > +	}
> > +	group->refcnt++;
> > +
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +
> > +	return ret;
> > +}
> > +
> > +static int vfio_group_release(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_group *group = filep->private_data;
> > +
> > +	return vfio_do_release(&group->refcnt, group->iommu);
> > +}
> > +
> > +/* Attempt to merge the group pointed to by fd into group.  The merge-
> > ee
> > + * group must not have an iommu or any devices open because we cannot
> > + * maintain that context across the merge.  The merge-er group can be
> > + * in use. */
> > +static int vfio_group_merge(struct vfio_group *group, int fd)
> 
> The documentation in vfio.txt explains clearly the logic implemented by
> the merge/unmerge group ioctls.
> However, what you are doing is not merging groups, but rather adding/removing
> groups to/from iommus (and creating flat lists of groups).
> For example, when you do
> 
>   merge(A,B)
> 
> you actually mean to say "merge B to the list of groups assigned to the
> same iommu as group A".

It's actually a little more than that.  After you've merged B into A,
you can close the file descriptor for B and access all of the devices
for the merged group from A.
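
As a rough sketch of that flow from userspace (the group numbers and device
name below are made up):

	int a = open("/dev/vfio/26", O_RDWR);
	int b = open("/dev/vfio/37", O_RDWR);

	ioctl(a, VFIO_GROUP_MERGE, &b);		/* B joins A's iommu context */
	close(b);				/* merged group stays reachable via A */
	ioctl(a, VFIO_GROUP_GET_DEVICE_FD, "0000:03:00.0");	/* device originally in B */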

> For the same reason, you do not really need to provide the group you want
> to unmerge from, which means that instead of
> 
>   unmerge(A,B) 
> 
> you would just need
> 
>   unmerge(B)

Good point, we can avoid the awkward reference via file descriptor for
the unmerge.

> I understand the reason why it is not a real merge/unmerge (ie, to keep the
> original groups so that you can unmerge later)

Right, we still need to have visibility of the groups comprising the
merged group, but the abstraction provided to the user seems to be
deeper than you're thinking.

>  ... however I just wonder if
> it wouldn't be more natural to implement the VFIO_IOMMU_ADD_GROUP/DEL_GROUP
> iommu ioctls instead? (the relationships between the data structure would
> remain the same)
> I guess you already discarded this option for some reasons, right? What was
> the reason?

It's a possibility, I'm not sure it was discussed or really what
advantage it provides.  It seems like we'd logically lose the ability to
access devices from other groups, whether that's good or bad, I don't
know.  I think the notion of "merge" promotes the idea that the groups
are peers and an iommu_add/del feels a bit more hierarchical.

> > +{
> > +	struct vfio_group *new;
> > +	struct vfio_iommu *old_iommu;
> > +	struct file *file;
> > +	int ret = 0;
> > +	bool opened = false;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	file = fget(fd);
> > +	if (!file) {
> > +		ret = -EBADF;
> > +		goto out_noput;
> > +	}
> > +
> > +	/* Sanity check, is this really our fd? */
> > +	if (file->f_op != &vfio_group_fops) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	new = file->private_data;
> > +
> > +	if (!new || new == group || !new->iommu ||
> > +	    new->iommu->domain || new->bus != group->bus) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* We need to attach all the devices to each domain separately
> > +	 * in order to validate that the capabilities match for both.  */
> > +	ret = __vfio_open_iommu(new->iommu);
> > +	if (ret)
> > +		goto out;
> > +
> > +	if (!group->iommu->domain) {
> > +		ret = __vfio_open_iommu(group->iommu);
> > +		if (ret)
> > +			goto out;
> > +		opened = true;
> > +	}
> > +
> > +	/* If cache coherency doesn't match we'd potentialy need to
> > +	 * remap existing iommu mappings in the merge-er domain.
> > +	 * Poor return to bother trying to allow this currently. */
> > +	if (iommu_domain_has_cap(group->iommu->domain,
> > +				 IOMMU_CAP_CACHE_COHERENCY) !=
> > +	    iommu_domain_has_cap(new->iommu->domain,
> > +				 IOMMU_CAP_CACHE_COHERENCY)) {
> > +		__vfio_close_iommu(new->iommu);
> > +		if (opened)
> > +			__vfio_close_iommu(group->iommu);
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* Close the iommu for the merge-ee and attach all its devices
> > +	 * to the merge-er iommu. */
> > +	__vfio_close_iommu(new->iommu);
> > +
> > +	ret = __vfio_iommu_attach_group(group->iommu, new);
> > +	if (ret)
> > +		goto out;
> > +
> > +	/* set_iommu unlinks new from the iommu, so save a pointer to it
> > */
> > +	old_iommu = new->iommu;
> > +	__vfio_group_set_iommu(new, group->iommu);
> > +	kfree(old_iommu);
> > +
> > +out:
> > +	fput(file);
> > +out_noput:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Unmerge the group pointed to by fd from group. */
> > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > +{
> > +	struct vfio_group *new;
> > +	struct vfio_iommu *new_iommu;
> > +	struct file *file;
> > +	int ret = 0;
> > +
> > +	/* Since the merge-out group is already opened, it needs to
> > +	 * have an iommu struct associated with it. */
> > +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > +	if (!new_iommu)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&new_iommu->group_list);
> > +	INIT_LIST_HEAD(&new_iommu->dm_list);
> > +	mutex_init(&new_iommu->dgate);
> > +	new_iommu->bus = group->bus;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	file = fget(fd);
> > +	if (!file) {
> > +		ret = -EBADF;
> > +		goto out_noput;
> > +	}
> > +
> > +	/* Sanity check, is this really our fd? */
> > +	if (file->f_op != &vfio_group_fops) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	new = file->private_data;
> > +	if (!new || new == group || new->iommu != group->iommu) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* We can't merge-out a group with devices still in use. */
> > +	if (__vfio_group_devs_inuse(new)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	__vfio_iommu_detach_group(group->iommu, new);
> > +	__vfio_group_set_iommu(new, new_iommu);
> > +
> > +out:
> > +	fput(file);
> > +out_noput:
> > +	if (ret)
> > +		kfree(new_iommu);
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Get a new iommu file descriptor.  This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. */
> > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (!group->iommu->domain) {
> > +		ret = __vfio_open_iommu(group->iommu);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > +			       group->iommu, O_RDWR);
> > +	if (ret < 0)
> > +		goto out;
> > +
> > +	group->iommu->refcnt++;
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Get a new device file descriptor.  This will open the iommu,
> > setting
> > + * the current->mm ownership if it's not already set.  It's difficult
> > to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match.  For
> > + * PCI, dev_name(dev) is unique, but other drivers may require
> > including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char
> > *buf)
> > +{
> > +	struct vfio_iommu *iommu = group->iommu;
> > +	struct list_head *gpos;
> > +	int ret = -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (!iommu->domain) {
> > +		ret = __vfio_open_iommu(iommu);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	list_for_each(gpos, &iommu->group_list) {
> > +		struct list_head *dpos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (device->ops->match(device->dev, buf)) {
> > +				struct file *file;
> > +
> > +				if (device->ops->get(device->device_data)) {
> > +					ret = -EFAULT;
> > +					goto out;
> > +				}
> > +
> > +				/* We can't use anon_inode_getfd(), like above
> > +				 * because we need to modify the f_mode flags
> > +				 * directly to allow more than just ioctls */
> > +				ret = get_unused_fd();
> > +				if (ret < 0) {
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> > +
> > +				file = anon_inode_getfile("[vfio-device]",
> > +							  &vfio_device_fops,
> > +							  device, O_RDWR);
> > +				if (IS_ERR(file)) {
> > +					put_unused_fd(ret);
> > +					ret = PTR_ERR(file);
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> > +
> > +				/* Todo: add an anon_inode interface to do
> > +				 * this.  Appears to be missing by lack of
> > +				 * need rather than explicitly prevented.
> > +				 * Now there's need. */
> > +				file->f_mode |= (FMODE_LSEEK |
> > +						 FMODE_PREAD |
> > +						 FMODE_PWRITE);
> > +
> > +				fd_install(ret, file);
> > +
> > +				device->refcnt++;
> > +				goto out;
> > +			}
> > +		}
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_group_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_group *group = filep->private_data;
> > +
> > +	if (cmd == VFIO_GROUP_GET_FLAGS) {
> > +		u64 flags = 0;
> > +
> > +		mutex_lock(&vfio.lock);
> > +		if (__vfio_iommu_viable(group->iommu))
> > +			flags |= VFIO_GROUP_FLAGS_VIABLE;
> > +		mutex_unlock(&vfio.lock);
> > +
> > +		if (group->iommu->mm)
> > +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> > +
> > +		return put_user(flags, (u64 __user *)arg);
> > +	}
> > +
> > +	/* Below commands are restricted once the mm is set */
> > +	if (group->iommu->mm && group->iommu->mm != current->mm)
> > +		return -EPERM;
> > +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> > +		int fd;
> > +
> > +		if (get_user(fd, (int __user *)arg))
> > +			return -EFAULT;
> > +		if (fd < 0)
> > +			return -EINVAL;
> > +
> > +		if (cmd == VFIO_GROUP_MERGE)
> > +			return vfio_group_merge(group, fd);
> > +		else
> > +			return vfio_group_unmerge(group, fd);
> > +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> > +		return vfio_group_get_iommu_fd(group);
> > +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> > +		char *buf;
> > +		int ret;
> > +
> > +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> > +		if (IS_ERR(buf))
> > +			return PTR_ERR(buf);
> > +
> > +		ret = vfio_group_get_device_fd(group, buf);
> > +		kfree(buf);
> > +		return ret;
> > +	}
> > +
> > +	return -ENOSYS;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_group_compat_ioctl(struct file *filep,
> > +				    unsigned int cmd, unsigned long arg)
> > +{
> > +	arg = (unsigned long)compat_ptr(arg);
> > +	return vfio_group_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif	/* CONFIG_COMPAT */
> > +
> > +static const struct file_operations vfio_group_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.open		= vfio_group_open,
> > +	.release	= vfio_group_release,
> > +	.unlocked_ioctl	= vfio_group_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= vfio_group_compat_ioctl,
> > +#endif
> > +};
> > +
> > +/* iommu fd release hook */
> 
> Given vfio_device_release and
>       vfio_group_release (ie, 1st object, 2nd operation), I was
> going to suggest renaming the fn below to vfio_iommu_release, but
> then I saw the latter name being already used in vfio_iommu.c ...
> a bit confusing but I guess it's ok then.

Right, this one was definitely because of naming collision.

> > +int vfio_release_iommu(struct vfio_iommu *iommu)
> > +{
> > +	return vfio_do_release(&iommu->refcnt, iommu);
> > +}
> > +
> > +/*
> > + * VFIO driver API
> > + */
> > +
> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks.  This is the entry point for vfio drivers to register
> > devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct
> > vfio_device_ops *ops)
> > +{
> > +	struct list_head *pos;
> > +	struct vfio_group *group = NULL;
> > +	struct vfio_device *device = NULL;
> > +	unsigned int groupid;
> > +	int ret = 0;
> > +	bool new_group = false;
> > +
> > +	if (!ops)
> > +		return -EINVAL;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	list_for_each(pos, &vfio.group_list) {
> > +		group = list_entry(pos, struct vfio_group, group_next);
> > +		if (group->groupid == groupid)
> > +			break;
> > +		group = NULL;
> > +	}
> > +
> > +	if (!group) {
> > +		int minor;
> > +
> > +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +		if (!group) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group->groupid = groupid;
> > +		INIT_LIST_HEAD(&group->device_list);
> > +
> > +		ret = idr_get_new(&vfio.idr, group, &minor);
> > +		if (ret == 0 && minor > MINORMASK) {
> > +			idr_remove(&vfio.idr, minor);
> > +			kfree(group);
> > +			ret = -ENOSPC;
> > +			goto out;
> > +		}
> > +
> > +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > +		device_create(vfio.class, NULL, group->devt,
> > +			      group, "%u", groupid);
> > +
> > +		group->bus = dev->bus;
> > +		list_add(&group->group_next, &vfio.group_list);
> > +		new_group = true;
> > +	} else {
> > +		if (group->bus != dev->bus) {
> > +			printk(KERN_WARNING
> > +			       "Error: IOMMU group ID conflict.  Group ID %u
> > "
> > +				"on both bus %s and %s\n", groupid,
> > +				group->bus->name, dev->bus->name);
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +
> > +		list_for_each(pos, &group->device_list) {
> > +			device = list_entry(pos,
> > +					    struct vfio_device, device_next);
> > +			if (device->dev == dev)
> > +				break;
> > +			device = NULL;
> > +		}
> > +	}
> > +
> > +	if (!device) {
> > +		if (__vfio_group_devs_inuse(group) ||
> > +		    (group->iommu && group->iommu->refcnt)) {
> > +			printk(KERN_WARNING
> > +			       "Adding device %s to group %u while group is
> > already in use!!\n",
> > +			       dev_name(dev), group->groupid);
> > +			/* XXX How to prevent other drivers from claiming? */
> 
> Here we are adding a device (not yet assigned to a vfio bus) to a group
> that is already in use.
> Given that it would not be acceptable for this device to get assigned
> to a non vfio driver, why not forcing such assignment here then?

Exactly, I just don't know the mechanics of how to make that happen and
was hoping for suggestions...

> I am not sure though what the best way to do it would be.
> What about something like this:
> 
> - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
>   notification it assigns to the device a PCI ID that will make sure
>   the vfio-pci's probe routine will be invoked (and no other driver can
>   therefore claim the device). That PCI ID would have to be added
>   to the vfio_pci_driver's id_table (it would be the exception to the
>   "only dynamic IDs" rule). Too hackish?

Presumably some other driver also has the ID in it's id_table, how do we
make sure we win?

> > +		}
> > +
> > +		device = kzalloc(sizeof(*device), GFP_KERNEL);
> > +		if (!device) {
> > +			/* If we just created this group, tear it down */
> > +			if (new_group) {
> > +				list_del(&group->group_next);
> > +				device_destroy(vfio.class, group->devt);
> > +				idr_remove(&vfio.idr, MINOR(group->devt));
> > +				kfree(group);
> > +			}
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		list_add(&device->device_next, &group->device_list);
> > +		device->dev = dev;
> > +		device->ops = ops;
> > +		device->iommu = group->iommu; /* NULL if new */
> 
> Shouldn't you check the return code of __vfio_iommu_attach_dev?

Yep, looks like I did this because the expected use case has a NULL
iommu here, so I need to distinguish that error from an actual
iommu_attach_device() error.

> > +		__vfio_iommu_attach_dev(group->iommu, device);
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> > +
> > +/* Remove a device from the vfio framework */
> 
> This fn below does not return any error code. Ok ...
> However, there are a number of errors case that you test, for example
> - device that does not belong to any group (according to iommu API)
> - device that belongs to a group but that does not appear in the list
>   of devices of the vfio_group structure.
> Are the above two errors checks just paranoia or are those errors actually possible?
> If they were possible, shouldn't we generate a warning (most probably
> it would be a bug in the code)?

They're all vfio-bus driver bugs of some sort, so it's just a matter of
how much we want to scream about them.  I'll comment on each below.

> > +void vfio_group_del_dev(struct device *dev)
> > +{
> > +	struct list_head *pos;
> > +	struct vfio_group *group = NULL;
> > +	struct vfio_device *device = NULL;
> > +	unsigned int groupid;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return;

Here the bus driver is probably just sitting on a notifier list for
their bus_type and a device is getting removed.  Unless we want to
require the bus driver to track everything it's attempted to add and
whether it worked, we can just ignore this.
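
For illustration, I'd expect a bus driver to wire these calls up from its
bus notifier roughly like this (a sketch only; vfio_pci_ops stands in for
whatever ops structure the backend actually registers):

	static int vfio_pci_device_notifier(struct notifier_block *nb,
					    unsigned long action, void *data)
	{
		struct device *dev = data;

		if (action == BUS_NOTIFY_ADD_DEVICE)
			vfio_group_add_dev(dev, &vfio_pci_ops);
		else if (action == BUS_NOTIFY_DEL_DEVICE)
			vfio_group_del_dev(dev);

		return NOTIFY_OK;
	}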

> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	list_for_each(pos, &vfio.group_list) {
> > +		group = list_entry(pos, struct vfio_group, group_next);
> > +		if (group->groupid == groupid)
> > +			break;
> > +		group = NULL;
> > +	}
> > +
> > +	if (!group)
> > +		goto out;

We don't even have a group for the device, we could BUG_ON here.  The
bus driver failed to tell us about something that was then removed.

> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		if (device->dev == dev)
> > +			break;
> > +		device = NULL;
> > +	}
> > +
> > +	if (!device)
> > +		goto out;

Same here.

> > +
> > +	BUG_ON(device->refcnt);
> > +
> > +	if (device->attached)
> > +		__vfio_iommu_detach_dev(group->iommu, device);
> > +
> > +	list_del(&device->device_next);
> > +	kfree(device);
> > +
> > +	/* If this was the only device in the group, remove the group.
> > +	 * Note that we intentionally unmerge empty groups here if the
> > +	 * group fd isn't opened. */
> > +	if (list_empty(&group->device_list) && group->refcnt == 0) {
> > +		struct vfio_iommu *iommu = group->iommu;
> > +
> > +		if (iommu) {
> > +			__vfio_group_set_iommu(group, NULL);
> > +			__vfio_try_dissolve_iommu(iommu);
> > +		}
> > +
> > +		device_destroy(vfio.class, group->devt);
> > +		idr_remove(&vfio.idr, MINOR(group->devt));
> > +		list_del(&group->group_next);
> > +		kfree(group);
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> > +
> > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> > + * entry point is used to mark the device usable (viable).  The vfio
> > + * device driver associates a private device_data struct with the
> > device
> > + * here, which will later be return for vfio_device_fops callbacks. */
> > +int vfio_bind_dev(struct device *dev, void *device_data)
> > +{
> > +	struct vfio_device *device;
> > +	int ret = -EINVAL;
> > +
> > +	BUG_ON(!device_data);
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	device = __vfio_lookup_dev(dev);
> > +
> > +	BUG_ON(!device);
> > +
> > +	ret = dev_set_drvdata(dev, device);
> > +	if (!ret)
> > +		device->device_data = device_data;
> > +
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> > +
> > +/* A device is only removeable if the iommu for the group is not in
> > use. */
> > +static bool vfio_device_removeable(struct vfio_device *device)
> > +{
> > +	bool ret = true;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (device->iommu && __vfio_iommu_inuse(device->iommu))
> > +		ret = false;
> > +
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Notify vfio that a device is being unbound from the vfio device
> > driver
> > + * and return the device private device_data pointer.  If the group is
> > + * in use, we need to block or take other measures to make it safe for
> > + * the device to be removed from the iommu. */
> > +void *vfio_unbind_dev(struct device *dev)
> > +{
> > +	struct vfio_device *device = dev_get_drvdata(dev);
> > +	void *device_data;
> > +
> > +	BUG_ON(!device);
> > +
> > +again:
> > +	if (!vfio_device_removeable(device)) {
> > +		/* XXX signal for all devices in group to be removed or
> > +		 * resort to killing the process holding the device fds.
> > +		 * For now just block waiting for releases to wake us. */
> > +		wait_event(vfio.release_q, vfio_device_removeable(device));
> 
> Any new idea/proposal on how to handle this situation?
> The last one I remember was to leave the soft/hard/etc timeout handling in
> userspace and implement it as a sort of policy. Is that one still the most
> likely candidate solution to handle this situation?

I haven't heard any new proposals.  I think we need the hard timeout
handling in the kernel.  We can't leave it to userspace to decide they
get to keep the device.  We could have this tunable via an ioctl, but I
don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak
it.  I was intending to re-implement the netlink interface to signal the
removal, but expect to get allergic reactions to that.

Thanks for the comments!

Alex
Christian Benvenuti - Nov. 9, 2011, 9:08 p.m.
Comments inline...

> -----Original Message-----

> From: Alex Williamson [mailto:alex.williamson@redhat.com]

> Sent: Wednesday, November 09, 2011 10:03 AM

> To: Christian Benvenuti (benve)

> Cc: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;

> dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Aaron Fabbri

> (aafabbri); B08248@freescale.com; B07421@freescale.com; avi@redhat.com;

> konrad.wilk@oracle.com; kvm@vger.kernel.org; qemu-devel@nongnu.org;

> iommu@lists.linux-foundation.org; linux-pci@vger.kernel.org

> Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework

> 

> On Wed, 2011-11-09 at 02:11 -0600, Christian Benvenuti (benve) wrote:

> > I have not gone through the all patch yet, but here are

> > my first comments/questions about the code in vfio_main.c

> > (and pci/vfio_pci.c).

> 

> Thanks!  Comments inline...

> 

> > > -----Original Message-----

> > > From: Alex Williamson [mailto:alex.williamson@redhat.com]

> > > Sent: Thursday, November 03, 2011 1:12 PM

> > > To: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;

> > > dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Christian

> > > Benvenuti (benve); Aaron Fabbri (aafabbri); B08248@freescale.com;

> > > B07421@freescale.com; avi@redhat.com; konrad.wilk@oracle.com;

> > > kvm@vger.kernel.org; qemu-devel@nongnu.org; iommu@lists.linux-

> > > foundation.org; linux-pci@vger.kernel.org

> > > Subject: [RFC PATCH] vfio: VFIO Driver core framework

> >

> > <snip>

> >

> > > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c

> > > new file mode 100644

> > > index 0000000..6169356

> > > --- /dev/null

> > > +++ b/drivers/vfio/vfio_main.c

> > > @@ -0,0 +1,1151 @@

> > > +/*

> > > + * VFIO framework

> > > + *

> > > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.

> > > + *     Author: Alex Williamson <alex.williamson@redhat.com>

> > > + *

> > > + * This program is free software; you can redistribute it and/or

> > > modify

> > > + * it under the terms of the GNU General Public License version 2

> as

> > > + * published by the Free Software Foundation.

> > > + *

> > > + * Derived from original vfio:

> > > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.

> > > + * Author: Tom Lyon, pugs@cisco.com

> > > + */

> > > +

> > > +#include <linux/cdev.h>

> > > +#include <linux/compat.h>

> > > +#include <linux/device.h>

> > > +#include <linux/file.h>

> > > +#include <linux/anon_inodes.h>

> > > +#include <linux/fs.h>

> > > +#include <linux/idr.h>

> > > +#include <linux/iommu.h>

> > > +#include <linux/mm.h>

> > > +#include <linux/module.h>

> > > +#include <linux/slab.h>

> > > +#include <linux/string.h>

> > > +#include <linux/uaccess.h>

> > > +#include <linux/vfio.h>

> > > +#include <linux/wait.h>

> > > +

> > > +#include "vfio_private.h"

> > > +

> > > +#define DRIVER_VERSION	"0.2"

> > > +#define DRIVER_AUTHOR	"Alex Williamson

> <alex.williamson@redhat.com>"

> > > +#define DRIVER_DESC	"VFIO - User Level meta-driver"

> > > +

> > > +static int allow_unsafe_intrs;

> > > +module_param(allow_unsafe_intrs, int, 0);

> > > +MODULE_PARM_DESC(allow_unsafe_intrs,

> > > +        "Allow use of IOMMUs which do not support interrupt

> > > remapping");

> > > +

> > > +static struct vfio {

> > > +	dev_t			devt;

> > > +	struct cdev		cdev;

> > > +	struct list_head	group_list;

> > > +	struct mutex		lock;

> > > +	struct kref		kref;

> > > +	struct class		*class;

> > > +	struct idr		idr;

> > > +	wait_queue_head_t	release_q;

> > > +} vfio;

> > > +

> > > +static const struct file_operations vfio_group_fops;

> > > +extern const struct file_operations vfio_iommu_fops;

> > > +

> > > +struct vfio_group {

> > > +	dev_t			devt;

> > > +	unsigned int		groupid;

> >

> > This groupid is returned by the device_group callback you recently

> added

> > with a separate (not yet in tree) IOMMU patch.

> > Is it correct to say that the scope of this ID is the bus the iommu

> > belongs too (but you use it as if it was global)?

> > I believe there is nothing right now to ensure the uniqueness of such

> > ID across bus types (assuming there will be other bus drivers in the

> > future besides vfio-pci).

> > If that's the case, the vfio.group_list global list and the

> __vfio_lookup_dev

> > routine should be changed to account for the bus too?

> > Ops, I just saw the error msg in vfio_group_add_dev about the group

> id conflict.

> > Is that warning related to what I mentioned above?

> 

> Yeah, this is a concern, but I can't think of a system where we would

> manifest a collision.  The IOMMU driver is expected to provide unique

> groupids for all devices below them, but we could imagine a system that

> implements two different bus_types, each with a different IOMMU driver

> and we have no coordination between them.  Perhaps since we have

> iommu_ops per bus, we should also expose the bus in the vfio group

> path,

> ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This

> means userspace would need to do a readlink of the subsystem entry

> where

> it finds the iommu_group to find the vfio group.  Reasonable?


Most probably we won't see use cases with multiple buses anytime soon, but
this scheme you proposed (with the per-bus subdir) looks good to me. 

> > > +	struct bus_type		*bus;

> > > +	struct vfio_iommu	*iommu;

> > > +	struct list_head	device_list;

> > > +	struct list_head	iommu_next;

> > > +	struct list_head	group_next;

> > > +	int			refcnt;

> > > +};

> > > +

> > > +struct vfio_device {

> > > +	struct device			*dev;

> > > +	const struct vfio_device_ops	*ops;

> > > +	struct vfio_iommu		*iommu;

> >

> > I wonder if you need to have the 'iommu' field here.

> > vfio_device.iommu is always set and reset together with

> > vfio_group.iommu.

> > Given that a vfio_device instance is always linked to a vfio_group

> > instance, do we need this duplication? Is this duplication there

> > because you do not want the double dereference device->group->iommu?

> 

> I think that was my initial goal in duplicating the pointer on the

> device.  I believe I was also at one point passing a vfio_device around

> and needed the pointer.  We seem to be getting along fine w/o that and

> I

> don't see any performance sensitive paths from getting from the device

> to iommu, so I'll see about removing it.


I guess you can add it back later if there is a need for it.
Right now, since you always init/deinit both at the same time, this would simplify
the code and make it less likely to use an out-of-sync pointer.

> > > +	struct vfio_group		*group;

> > > +	struct list_head		device_next;

> > > +	bool				attached;

> > > +	int				refcnt;

> > > +	void				*device_data;

> > > +};

> > > +

> > > +/*

> > > + * Helper functions called under vfio.lock

> > > + */

> > > +

> > > +/* Return true if any devices within a group are opened */

> > > +static bool __vfio_group_devs_inuse(struct vfio_group *group)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	list_for_each(pos, &group->device_list) {

> > > +		struct vfio_device *device;

> > > +

> > > +		device = list_entry(pos, struct vfio_device, device_next);

> > > +		if (device->refcnt)

> > > +			return true;

> > > +	}

> > > +	return false;

> > > +}

> > > +

> > > +/* Return true if any of the groups attached to an iommu are

> opened.

> > > + * We can only tear apart merged groups when nothing is left open.

> */

> > > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	list_for_each(pos, &iommu->group_list) {

> > > +		struct vfio_group *group;

> > > +

> > > +		group = list_entry(pos, struct vfio_group, iommu_next);

> > > +		if (group->refcnt)

> > > +			return true;

> > > +	}

> > > +	return false;

> > > +}

> > > +

> > > +/* An iommu is "in use" if it has a file descriptor open or if any

> of

> > > + * the groups assigned to the iommu have devices open. */

> > > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	if (iommu->refcnt)

> > > +		return true;

> > > +

> > > +	list_for_each(pos, &iommu->group_list) {

> > > +		struct vfio_group *group;

> > > +

> > > +		group = list_entry(pos, struct vfio_group, iommu_next);

> > > +

> > > +		if (__vfio_group_devs_inuse(group))

> > > +			return true;

> > > +	}

> > > +	return false;

> > > +}

> >

> > I looked at how you take care of ref counts ...

> >

> > This is how the tree of vfio_iommu/vfio_group/vfio_device data

> > Structures is organized (I'll use just iommu/group/dev to make

> > the graph smaller):

> >

> >             iommu

> >            /     \

> >           /       \

> >     group   ...     group

> >     /  \           /  \

> >    /    \         /    \

> > dev  ..  dev   dev  ..  dev

> >

> > This is how you get a file descriptor for the three kind of objects:

> >

> > - group : open /dev/vfio/xxx for group xxx

> > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD

> > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD

> >

> > Given the above topology, I would assume that:

> >

> > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or

> >                              b) any of its groups is 'inuse'

> >

> > (2) a  group is 'inuse' if : a) group refcnt > 0, or

> >                              b) any of its devices is 'inuse'

> >

> > (3) a device is 'inuse' if : a) device refcnt > 0

> 

> (2) is a bit debatable.  I've wrestled with this one for a while.  The

> vfio_iommu serves two purposes.  First, it is the object we use for

> managing iommu domains, which includes allocating domains and attaching

> devices to domains.  Groups objects aren't involved here, they just

> manage the set of devices.  The second role is to manage merged groups,

> because whether or not groups can be merged is a function of iommu

> domain compatibility.

> 

> So if we look at "is the iommu in use?" ie. can I destroy the mapping

> context, detach devices and free the domain, the reference count on the

> group is irrelevant.  The user has to have a device or iommu file

> descriptor opened somewhere, across the group or merged group, for that

> context to be maintained.  A reasonable requirement, I think.


OK, then if you close all devices and the iommu, keeping the group open
would not protect the iommu domain mapping. This means that if you (or
a management application) need to close all devices+iommu and then reopen
the same devices+iommu right away, you may get a failure on the iommu
domain creation (supposing the system runs out of resources).
Is this just a very unlikely scenario?
I guess in this case you would simply have to avoid releasing the iommu
fd, right?

> However, if we ask "is the group in use?" ie. can I not only destroy

> the

> mappings above, but also automatically tear apart merged groups, then I

> think we need to look at the group refcnt.


Correct.

> There's also a symmetry factor, the group is a benign entry point to

> device access.  It's only when device or iommu access is granted that

> the group gains any real power.  Therefore, shouldn't that power also

> be

> removed when those access points are closed?

> 

> > You have coded the 'inuse' logic with these three routines:

> >

> >     __vfio_iommu_inuse, which implements (1) above

> >

> > and

> >     __vfio_iommu_groups_inuse

> 

> Implements (2.a)


Yes, but for all groups at once.

> >     __vfio_group_devs_inuse

> 

> Implements (2.b)


Yes

> > which are used by __vfio_iommu_inuse.

> > Why don't you check the group refcnt in __vfio_iommu_groups_inuse?

> 

> Hopefully explained above, but open for discussion.

> 

> > Would it make sense (and the code more readable) to structure the

> > nested refcnt/inuse check like this?

> > (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)

> >

> >    (1)__vfio_iommu_inuse

> >    |

> >    +-> check iommu refcnt

> >    +-> __vfio_iommu_groups_inuse

> >        |

> >        +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING

> >                 |

> >                 +-> check group refcnt<--MISSING

> >                 +-> __vfio_group_devs_inuse()

> >                     |

> >                     +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING

> >                               |

> >                               +-> check device refcnt

> 

> We currently do:

> 

>    (1)__vfio_iommu_inuse

>     |

>     +-> check iommu refcnt

>     +-> __vfio_group_devs_inuse

>         |

>         +->LOOP: (2.b)__vfio_group_devs_inuse

>                   |

>                   +-> LOOP: (3) check device refcnt

> 

> If that passes, the iommu context can be dissolved and we follow up

> with:

> 

>     __vfio_iommu_groups_inuse

>     |

>     +-> LOOP: (2.a)__vfio_iommu_groups_inuse

>                |

>                +-> check group refcnt

> 

> If that passes, groups can also be umerged.

> 

> Is this right?


Yes, assuming we stick to the "benign" role of groups you
described above.
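
For what it's worth, here is a minimal sketch of the fully nested variant from
the diagram above, just to make the comparison concrete.  The per-group and
per-device helpers are the "MISSING" ones and do not exist in the patch; this
is not what the code does today, since the group refcnt is deliberately left
out of the iommu-level check:

static bool __vfio_group_dev_inuse(struct vfio_device *device)
{
	return device->refcnt > 0;			/* (3) */
}

static bool __vfio_iommu_group_inuse(struct vfio_group *group)
{
	struct vfio_device *device;

	if (group->refcnt)				/* (2.a) */
		return true;

	list_for_each_entry(device, &group->device_list, device_next)
		if (__vfio_group_dev_inuse(device))	/* (2.b) */
			return true;

	return false;
}

static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
{
	struct vfio_group *group;

	if (iommu->refcnt)				/* (1.a) */
		return true;

	list_for_each_entry(group, &iommu->group_list, iommu_next)
		if (__vfio_iommu_group_inuse(group))	/* (1.b) */
			return true;

	return false;
}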

> > > +static void __vfio_group_set_iommu(struct vfio_group *group,

> > > +				   struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	if (group->iommu)

> > > +		list_del(&group->iommu_next);

> > > +	if (iommu)

> > > +		list_add(&group->iommu_next, &iommu->group_list);

> > > +

> > > +	group->iommu = iommu;

> >

> > If you remove the vfio_device.iommu field (as suggested above in a

> previous

> > Comment), the block below would not be needed anymore.

> 

> Yep, I'll try removing that and see how it plays out.

> 

> > > +	list_for_each(pos, &group->device_list) {

> > > +		struct vfio_device *device;

> > > +

> > > +		device = list_entry(pos, struct vfio_device, device_next);

> > > +		device->iommu = iommu;

> > > +	}

> > > +}

> > > +

> > > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,

> > > +				    struct vfio_device *device)

> > > +{

> > > +	BUG_ON(!iommu->domain && device->attached);

> > > +

> > > +	if (!iommu->domain || !device->attached)

> > > +		return;

> > > +

> > > +	iommu_detach_device(iommu->domain, device->dev);

> > > +	device->attached = false;

> > > +}

> > > +

> > > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,

> > > +				      struct vfio_group *group)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	list_for_each(pos, &group->device_list) {

> > > +		struct vfio_device *device;

> > > +

> > > +		device = list_entry(pos, struct vfio_device, device_next);

> > > +		__vfio_iommu_detach_dev(iommu, device);

> > > +	}

> > > +}

> > > +

> > > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,

> > > +				   struct vfio_device *device)

> > > +{

> > > +	int ret;

> > > +

> > > +	BUG_ON(device->attached);

> > > +

> > > +	if (!iommu || !iommu->domain)

> > > +		return -EINVAL;

> > > +

> > > +	ret = iommu_attach_device(iommu->domain, device->dev);

> > > +	if (!ret)

> > > +		device->attached = true;

> > > +

> > > +	return ret;

> > > +}

> > > +

> > > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,

> > > +				     struct vfio_group *group)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	list_for_each(pos, &group->device_list) {

> > > +		struct vfio_device *device;

> > > +		int ret;

> > > +

> > > +		device = list_entry(pos, struct vfio_device, device_next);

> > > +		ret = __vfio_iommu_attach_dev(iommu, device);

> > > +		if (ret) {

> > > +			__vfio_iommu_detach_group(iommu, group);

> > > +			return ret;

> > > +		}

> > > +	}

> > > +	return 0;

> > > +}

> > > +

> > > +/* The iommu is viable, ie. ready to be configured, when all the devices

> > > + * for all the groups attached to the iommu are bound to their vfio device

> > > + * drivers (ex. vfio-pci).  This sets the device_data private data pointer. */

> > > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *gpos, *dpos;

> > > +

> > > +	list_for_each(gpos, &iommu->group_list) {

> > > +		struct vfio_group *group;

> > > +		group = list_entry(gpos, struct vfio_group, iommu_next);

> > > +

> > > +		list_for_each(dpos, &group->device_list) {

> > > +			struct vfio_device *device;

> > > +			device = list_entry(dpos,

> > > +					    struct vfio_device, device_next);

> > > +

> > > +			if (!device->device_data)

> > > +				return false;

> > > +		}

> > > +	}

> > > +	return true;

> > > +}

> > > +

> > > +static void __vfio_close_iommu(struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *pos;

> > > +

> > > +	if (!iommu->domain)

> > > +		return;

> > > +

> > > +	list_for_each(pos, &iommu->group_list) {

> > > +		struct vfio_group *group;

> > > +		group = list_entry(pos, struct vfio_group, iommu_next);

> > > +

> > > +		__vfio_iommu_detach_group(iommu, group);

> > > +	}

> > > +

> > > +	vfio_iommu_unmapall(iommu);

> > > +

> > > +	iommu_domain_free(iommu->domain);

> > > +	iommu->domain = NULL;

> > > +	iommu->mm = NULL;

> > > +}

> > > +

> > > +/* Open the IOMMU.  This gates all access to the iommu or device file

> > > + * descriptors and sets current->mm as the exclusive user. */

> >

> > Given the fn vfio_group_open (ie, 1st object, 2nd operation), I would have

> > called this one __vfio_iommu_open (instead of __vfio_open_iommu).

> > Is it named __vfio_open_iommu to avoid a conflict with the namespace in vfio_iommu.c?

> 

> I would have expected that too, I'll look at renaming these.

> 

> > > +static int __vfio_open_iommu(struct vfio_iommu *iommu)

> > > +{

> > > +	struct list_head *pos;

> > > +	int ret;

> > > +

> > > +	if (!__vfio_iommu_viable(iommu))

> > > +		return -EBUSY;

> > > +

> > > +	if (iommu->domain)

> > > +		return -EINVAL;

> > > +

> > > +	iommu->domain = iommu_domain_alloc(iommu->bus);

> > > +	if (!iommu->domain)

> > > +		return -EFAULT;

> > > +

> > > +	list_for_each(pos, &iommu->group_list) {

> > > +		struct vfio_group *group;

> > > +		group = list_entry(pos, struct vfio_group, iommu_next);

> > > +

> > > +		ret = __vfio_iommu_attach_group(iommu, group);

> > > +		if (ret) {

> > > +			__vfio_close_iommu(iommu);

> > > +			return ret;

> > > +		}

> > > +	}

> > > +

> > > +	if (!allow_unsafe_intrs &&

> > > +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {

> > > +		__vfio_close_iommu(iommu);

> > > +		return -EFAULT;

> > > +	}

> > > +

> > > +	iommu->cache = (iommu_domain_has_cap(iommu->domain,

> > > +					     IOMMU_CAP_CACHE_COHERENCY) != 0);

> > > +	iommu->mm = current->mm;

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +/* Actively try to tear down the iommu and merged groups.  If there are no

> > > + * open iommu or device fds, we close the iommu.  If we close the iommu and

> > > + * there are also no open group fds, we can further dissolve the group to

> > > + * iommu association and free the iommu data structure. */

> > > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)

> > > +{

> > > +

> > > +	if (__vfio_iommu_inuse(iommu))

> > > +		return -EBUSY;

> > > +

> > > +	__vfio_close_iommu(iommu);

> > > +

> > > +	if (!__vfio_iommu_groups_inuse(iommu)) {

> > > +		struct list_head *pos, *ppos;

> > > +

> > > +		list_for_each_safe(pos, ppos, &iommu->group_list) {

> > > +			struct vfio_group *group;

> > > +

> > > +			group = list_entry(pos, struct vfio_group,

> > > iommu_next);

> > > +			__vfio_group_set_iommu(group, NULL);

> > > +		}

> > > +

> > > +

> > > +		kfree(iommu);

> > > +	}

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +static struct vfio_device *__vfio_lookup_dev(struct device *dev)

> > > +{

> > > +	struct list_head *gpos;

> > > +	unsigned int groupid;

> > > +

> > > +	if (iommu_device_group(dev, &groupid))

> > > +		return NULL;

> > > +

> > > +	list_for_each(gpos, &vfio.group_list) {

> > > +		struct vfio_group *group;

> > > +		struct list_head *dpos;

> > > +

> > > +		group = list_entry(gpos, struct vfio_group, group_next);

> > > +

> > > +		if (group->groupid != groupid)

> > > +			continue;

> > > +

> > > +		list_for_each(dpos, &group->device_list) {

> > > +			struct vfio_device *device;

> > > +

> > > +			device = list_entry(dpos,

> > > +					    struct vfio_device, device_next);

> > > +

> > > +			if (device->dev == dev)

> > > +				return device;

> > > +		}

> > > +	}

> > > +	return NULL;

> > > +}

> > > +

> > > +/* All release paths simply decrement the refcnt, attempt to teardown

> > > + * the iommu and merged groups, and wakeup anything that might be

> > > + * waiting if we successfully dissolve anything. */

> > > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)

> > > +{

> > > +	bool wake;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	(*refcnt)--;

> > > +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);

> > > +

> > > +	mutex_unlock(&vfio.lock);

> > > +

> > > +	if (wake)

> > > +		wake_up(&vfio.release_q);

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +/*

> > > + * Device fops - passthrough to vfio device driver w/ device_data

> > > + */

> > > +static int vfio_device_release(struct inode *inode, struct file

> > > *filep)

> > > +{

> > > +	struct vfio_device *device = filep->private_data;

> > > +

> > > +	vfio_do_release(&device->refcnt, device->iommu);

> > > +

> > > +	device->ops->put(device->device_data);

> > > +

> > > +	return 0;

> > > +}

> > > +

> > > +static long vfio_device_unl_ioctl(struct file *filep,

> > > +				  unsigned int cmd, unsigned long arg)

> > > +{

> > > +	struct vfio_device *device = filep->private_data;

> > > +

> > > +	return device->ops->ioctl(device->device_data, cmd, arg);

> > > +}

> > > +

> > > +static ssize_t vfio_device_read(struct file *filep, char __user

> *buf,

> > > +				size_t count, loff_t *ppos)

> > > +{

> > > +	struct vfio_device *device = filep->private_data;

> > > +

> > > +	return device->ops->read(device->device_data, buf, count, ppos);

> > > +}

> > > +

> > > +static ssize_t vfio_device_write(struct file *filep, const char

> __user

> > > *buf,

> > > +				 size_t count, loff_t *ppos)

> > > +{

> > > +	struct vfio_device *device = filep->private_data;

> > > +

> > > +	return device->ops->write(device->device_data, buf, count, ppos);

> > > +}

> > > +

> > > +static int vfio_device_mmap(struct file *filep, struct

> vm_area_struct

> > > *vma)

> > > +{

> > > +	struct vfio_device *device = filep->private_data;

> > > +

> > > +	return device->ops->mmap(device->device_data, vma);

> > > +}

> > > +

> > > +#ifdef CONFIG_COMPAT

> > > +static long vfio_device_compat_ioctl(struct file *filep,

> > > +				     unsigned int cmd, unsigned long arg)

> > > +{

> > > +	arg = (unsigned long)compat_ptr(arg);

> > > +	return vfio_device_unl_ioctl(filep, cmd, arg);

> > > +}

> > > +#endif	/* CONFIG_COMPAT */

> > > +

> > > +const struct file_operations vfio_device_fops = {

> > > +	.owner		= THIS_MODULE,

> > > +	.release	= vfio_device_release,

> > > +	.read		= vfio_device_read,

> > > +	.write		= vfio_device_write,

> > > +	.unlocked_ioctl	= vfio_device_unl_ioctl,

> > > +#ifdef CONFIG_COMPAT

> > > +	.compat_ioctl	= vfio_device_compat_ioctl,

> > > +#endif

> > > +	.mmap		= vfio_device_mmap,

> > > +};

> > > +

> > > +/*

> > > + * Group fops

> > > + */

> > > +static int vfio_group_open(struct inode *inode, struct file

> *filep)

> > > +{

> > > +	struct vfio_group *group;

> > > +	int ret = 0;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	group = idr_find(&vfio.idr, iminor(inode));

> > > +

> > > +	if (!group) {

> > > +		ret = -ENODEV;

> > > +		goto out;

> > > +	}

> > > +

> > > +	filep->private_data = group;

> > > +

> > > +	if (!group->iommu) {

> > > +		struct vfio_iommu *iommu;

> > > +

> > > +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);

> > > +		if (!iommu) {

> > > +			ret = -ENOMEM;

> > > +			goto out;

> > > +		}

> > > +		INIT_LIST_HEAD(&iommu->group_list);

> > > +		INIT_LIST_HEAD(&iommu->dm_list);

> > > +		mutex_init(&iommu->dgate);

> > > +		iommu->bus = group->bus;

> > > +		__vfio_group_set_iommu(group, iommu);

> > > +	}

> > > +	group->refcnt++;

> > > +

> > > +out:

> > > +	mutex_unlock(&vfio.lock);

> > > +

> > > +	return ret;

> > > +}

> > > +

> > > +static int vfio_group_release(struct inode *inode, struct file

> *filep)

> > > +{

> > > +	struct vfio_group *group = filep->private_data;

> > > +

> > > +	return vfio_do_release(&group->refcnt, group->iommu);

> > > +}

> > > +

> > > +/* Attempt to merge the group pointed to by fd into group.  The merge-ee

> > > + * group must not have an iommu or any devices open because we cannot

> > > + * maintain that context across the merge.  The merge-er group can be

> > > + * in use. */

> > > +static int vfio_group_merge(struct vfio_group *group, int fd)

> >

> > The documentation in vfio.txt explains clearly the logic implemented

> by

> > the merge/unmerge group ioctls.

> > However, what you are doing is not merging groups, but rather

> adding/removing

> > groups to/from iommus (and creating flat lists of groups).

> > For example, when you do

> >

> >   merge(A,B)

> >

> > you actually mean to say "merge B to the list of groups assigned to

> the

> > same iommu as group A".

> 

> It's actually a little more than that.  After you've merged B into A,

> you can close the file descriptor for B and access all of the devices

> for the merged group from A.


It is actually more...

Scenario 1:

  create_grp(A)
  create_grp(B)
  ...
  merge_grp(A,B)
  create_grp(C)
  merge_grp(C,B) ... this works, right?

Scenario 2:

  create_grp(A)
  create_grp(B)
  fd_x = get_dev_fd(B,x)
  ...
  merge_grp(A,B)
  create_grp(C)
  merge_grp(A,C)
  fd_x = get_dev_fd(C,x) 

Those two examples seem to suggest to me more of a list abstraction than a merge abstraction.
However, if it fits into the agreed syntax/logic it is ok, as long as we document it
properly.
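
To make the "list" view above concrete, this is roughly the kind of userspace
sequence being discussed, in one valid ordering (a sketch only: group numbers and
the PCI address are made up, error handling is omitted, and it needs the usual
<fcntl.h>, <sys/ioctl.h> and <linux/vfio.h> includes):

	int a = open("/dev/vfio/26", O_RDWR);	/* group A */
	int b = open("/dev/vfio/27", O_RDWR);	/* group B */
	int c = open("/dev/vfio/28", O_RDWR);	/* group C */

	ioctl(a, VFIO_GROUP_MERGE, &b);		/* B joins A's iommu */
	ioctl(a, VFIO_GROUP_MERGE, &c);		/* C joins as well */

	/* device x is now reachable through any member group's fd */
	int fd_x = ioctl(c, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");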

> > For the same reason, you do not really need to provide the group you

> want

> > to unmerge from, which means that instead of

> >

> >   unmerge(A,B)

> >

> > you would just need

> >

> >   unmerge(B)

> 

> Good point, we can avoid the awkward reference via file descriptor for

> the unmerge.

> 

> > I understand the reason why it is not a real merge/unmerge (ie, to

> keep the

> > original groups so that you can unmerge later)

> 

> Right, we still need to have visibility of the groups comprising the

> merged group, but the abstraction provided to the user seems to be

> deeper than you're thinking.

> 

> >  ... however I just wonder if

> > it wouldn't be more natural to implement the

> VFIO_IOMMU_ADD_GROUP/DEL_GROUP

> > iommu ioctls instead? (the relationships between the data structure

> would

> > remain the same)

> > I guess you already discarded this option for some reasons, right?

> What was

> > the reason?

> 

> It's a possibility, I'm not sure it was discussed or really what

> advantage it provides.  It seems like we'd logically lose the ability

> to

> access devices from other groups,


What is the real (immediate) benefit of this capability?

> whether that's good or bad, I don't know.  I think the notion of "merge"

> promotes the idea that the groups

> are peers and an iommu_add/del feels a bit more hierarchical.


I agree. 

> > > +{

> > > +	struct vfio_group *new;

> > > +	struct vfio_iommu *old_iommu;

> > > +	struct file *file;

> > > +	int ret = 0;

> > > +	bool opened = false;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	file = fget(fd);

> > > +	if (!file) {

> > > +		ret = -EBADF;

> > > +		goto out_noput;

> > > +	}

> > > +

> > > +	/* Sanity check, is this really our fd? */

> > > +	if (file->f_op != &vfio_group_fops) {

> > > +		ret = -EINVAL;

> > > +		goto out;

> > > +	}

> > > +

> > > +	new = file->private_data;

> > > +

> > > +	if (!new || new == group || !new->iommu ||

> > > +	    new->iommu->domain || new->bus != group->bus) {

> > > +		ret = -EINVAL;

> > > +		goto out;

> > > +	}

> > > +

> > > +	/* We need to attach all the devices to each domain separately

> > > +	 * in order to validate that the capabilities match for both.  */

> > > +	ret = __vfio_open_iommu(new->iommu);

> > > +	if (ret)

> > > +		goto out;

> > > +

> > > +	if (!group->iommu->domain) {

> > > +		ret = __vfio_open_iommu(group->iommu);

> > > +		if (ret)

> > > +			goto out;

> > > +		opened = true;

> > > +	}

> > > +

> > > +	/* If cache coherency doesn't match we'd potentialy need to

> > > +	 * remap existing iommu mappings in the merge-er domain.

> > > +	 * Poor return to bother trying to allow this currently. */

> > > +	if (iommu_domain_has_cap(group->iommu->domain,

> > > +				 IOMMU_CAP_CACHE_COHERENCY) !=

> > > +	    iommu_domain_has_cap(new->iommu->domain,

> > > +				 IOMMU_CAP_CACHE_COHERENCY)) {

> > > +		__vfio_close_iommu(new->iommu);

> > > +		if (opened)

> > > +			__vfio_close_iommu(group->iommu);

> > > +		ret = -EINVAL;

> > > +		goto out;

> > > +	}

> > > +

> > > +	/* Close the iommu for the merge-ee and attach all its devices

> > > +	 * to the merge-er iommu. */

> > > +	__vfio_close_iommu(new->iommu);

> > > +

> > > +	ret = __vfio_iommu_attach_group(group->iommu, new);

> > > +	if (ret)

> > > +		goto out;

> > > +

> > > +	/* set_iommu unlinks new from the iommu, so save a pointer to it

> > > */

> > > +	old_iommu = new->iommu;

> > > +	__vfio_group_set_iommu(new, group->iommu);

> > > +	kfree(old_iommu);

> > > +

> > > +out:

> > > +	fput(file);

> > > +out_noput:

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +

> > > +/* Unmerge the group pointed to by fd from group. */

> > > +static int vfio_group_unmerge(struct vfio_group *group, int fd)

> > > +{

> > > +	struct vfio_group *new;

> > > +	struct vfio_iommu *new_iommu;

> > > +	struct file *file;

> > > +	int ret = 0;

> > > +

> > > +	/* Since the merge-out group is already opened, it needs to

> > > +	 * have an iommu struct associated with it. */

> > > +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);

> > > +	if (!new_iommu)

> > > +		return -ENOMEM;

> > > +

> > > +	INIT_LIST_HEAD(&new_iommu->group_list);

> > > +	INIT_LIST_HEAD(&new_iommu->dm_list);

> > > +	mutex_init(&new_iommu->dgate);

> > > +	new_iommu->bus = group->bus;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	file = fget(fd);

> > > +	if (!file) {

> > > +		ret = -EBADF;

> > > +		goto out_noput;

> > > +	}

> > > +

> > > +	/* Sanity check, is this really our fd? */

> > > +	if (file->f_op != &vfio_group_fops) {

> > > +		ret = -EINVAL;

> > > +		goto out;

> > > +	}

> > > +

> > > +	new = file->private_data;

> > > +	if (!new || new == group || new->iommu != group->iommu) {

> > > +		ret = -EINVAL;

> > > +		goto out;

> > > +	}

> > > +

> > > +	/* We can't merge-out a group with devices still in use. */

> > > +	if (__vfio_group_devs_inuse(new)) {

> > > +		ret = -EBUSY;

> > > +		goto out;

> > > +	}

> > > +

> > > +	__vfio_iommu_detach_group(group->iommu, new);

> > > +	__vfio_group_set_iommu(new, new_iommu);

> > > +

> > > +out:

> > > +	fput(file);

> > > +out_noput:

> > > +	if (ret)

> > > +		kfree(new_iommu);

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +

> > > +/* Get a new iommu file descriptor.  This will open the iommu,

> setting

> > > + * the current->mm ownership if it's not already set. */

> > > +static int vfio_group_get_iommu_fd(struct vfio_group *group)

> > > +{

> > > +	int ret = 0;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	if (!group->iommu->domain) {

> > > +		ret = __vfio_open_iommu(group->iommu);

> > > +		if (ret)

> > > +			goto out;

> > > +	}

> > > +

> > > +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,

> > > +			       group->iommu, O_RDWR);

> > > +	if (ret < 0)

> > > +		goto out;

> > > +

> > > +	group->iommu->refcnt++;

> > > +out:

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +

> > > +/* Get a new device file descriptor.  This will open the iommu,

> > > setting

> > > + * the current->mm ownership if it's not already set.  It's

> difficult

> > > to

> > > + * specify the requirements for matching a user supplied buffer to

> a

> > > + * device, so we use a vfio driver callback to test for a match.

> For

> > > + * PCI, dev_name(dev) is unique, but other drivers may require

> > > including

> > > + * a parent device string. */

> > > +static int vfio_group_get_device_fd(struct vfio_group *group, char

> > > *buf)

> > > +{

> > > +	struct vfio_iommu *iommu = group->iommu;

> > > +	struct list_head *gpos;

> > > +	int ret = -ENODEV;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	if (!iommu->domain) {

> > > +		ret = __vfio_open_iommu(iommu);

> > > +		if (ret)

> > > +			goto out;

> > > +	}

> > > +

> > > +	list_for_each(gpos, &iommu->group_list) {

> > > +		struct list_head *dpos;

> > > +

> > > +		group = list_entry(gpos, struct vfio_group, iommu_next);

> > > +

> > > +		list_for_each(dpos, &group->device_list) {

> > > +			struct vfio_device *device;

> > > +

> > > +			device = list_entry(dpos,

> > > +					    struct vfio_device, device_next);

> > > +

> > > +			if (device->ops->match(device->dev, buf)) {

> > > +				struct file *file;

> > > +

> > > +				if (device->ops->get(device->device_data)) {

> > > +					ret = -EFAULT;

> > > +					goto out;

> > > +				}

> > > +

> > > +				/* We can't use anon_inode_getfd(), like above

> > > +				 * because we need to modify the f_mode flags

> > > +				 * directly to allow more than just ioctls */

> > > +				ret = get_unused_fd();

> > > +				if (ret < 0) {

> > > +					device->ops->put(device->device_data);

> > > +					goto out;

> > > +				}

> > > +

> > > +				file = anon_inode_getfile("[vfio-device]",

> > > +							  &vfio_device_fops,

> > > +							  device, O_RDWR);

> > > +				if (IS_ERR(file)) {

> > > +					put_unused_fd(ret);

> > > +					ret = PTR_ERR(file);

> > > +					device->ops->put(device->device_data);

> > > +					goto out;

> > > +				}

> > > +

> > > +				/* Todo: add an anon_inode interface to do

> > > +				 * this.  Appears to be missing by lack of

> > > +				 * need rather than explicitly prevented.

> > > +				 * Now there's need. */

> > > +				file->f_mode |= (FMODE_LSEEK |

> > > +						 FMODE_PREAD |

> > > +						 FMODE_PWRITE);

> > > +

> > > +				fd_install(ret, file);

> > > +

> > > +				device->refcnt++;

> > > +				goto out;

> > > +			}

> > > +		}

> > > +	}

> > > +out:

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +

> > > +static long vfio_group_unl_ioctl(struct file *filep,

> > > +				 unsigned int cmd, unsigned long arg)

> > > +{

> > > +	struct vfio_group *group = filep->private_data;

> > > +

> > > +	if (cmd == VFIO_GROUP_GET_FLAGS) {

> > > +		u64 flags = 0;

> > > +

> > > +		mutex_lock(&vfio.lock);

> > > +		if (__vfio_iommu_viable(group->iommu))

> > > +			flags |= VFIO_GROUP_FLAGS_VIABLE;

> > > +		mutex_unlock(&vfio.lock);

> > > +

> > > +		if (group->iommu->mm)

> > > +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;

> > > +

> > > +		return put_user(flags, (u64 __user *)arg);

> > > +	}

> > > +

> > > +	/* Below commands are restricted once the mm is set */

> > > +	if (group->iommu->mm && group->iommu->mm != current->mm)

> > > +		return -EPERM;

> > > +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {

> > > +		int fd;

> > > +

> > > +		if (get_user(fd, (int __user *)arg))

> > > +			return -EFAULT;

> > > +		if (fd < 0)

> > > +			return -EINVAL;

> > > +

> > > +		if (cmd == VFIO_GROUP_MERGE)

> > > +			return vfio_group_merge(group, fd);

> > > +		else

> > > +			return vfio_group_unmerge(group, fd);

> > > +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {

> > > +		return vfio_group_get_iommu_fd(group);

> > > +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {

> > > +		char *buf;

> > > +		int ret;

> > > +

> > > +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);

> > > +		if (IS_ERR(buf))

> > > +			return PTR_ERR(buf);

> > > +

> > > +		ret = vfio_group_get_device_fd(group, buf);

> > > +		kfree(buf);

> > > +		return ret;

> > > +	}

> > > +

> > > +	return -ENOSYS;

> > > +}

> > > +

> > > +#ifdef CONFIG_COMPAT

> > > +static long vfio_group_compat_ioctl(struct file *filep,

> > > +				    unsigned int cmd, unsigned long arg)

> > > +{

> > > +	arg = (unsigned long)compat_ptr(arg);

> > > +	return vfio_group_unl_ioctl(filep, cmd, arg);

> > > +}

> > > +#endif	/* CONFIG_COMPAT */

> > > +

> > > +static const struct file_operations vfio_group_fops = {

> > > +	.owner		= THIS_MODULE,

> > > +	.open		= vfio_group_open,

> > > +	.release	= vfio_group_release,

> > > +	.unlocked_ioctl	= vfio_group_unl_ioctl,

> > > +#ifdef CONFIG_COMPAT

> > > +	.compat_ioctl	= vfio_group_compat_ioctl,

> > > +#endif

> > > +};

> > > +

> > > +/* iommu fd release hook */

> >

> > Given vfio_device_release and

> >       vfio_group_release (ie, 1st object, 2nd operation), I was

> > going to suggest renaming the fn below to vfio_iommu_release, but

> > then I saw the latter name being already used in vfio_iommu.c ...

> > a bit confusing but I guess it's ok then.

> 

> Right, this one was definitely because of naming collision.

> 

> > > +int vfio_release_iommu(struct vfio_iommu *iommu)

> > > +{

> > > +	return vfio_do_release(&iommu->refcnt, iommu);

> > > +}

> > > +

> > > +/*

> > > + * VFIO driver API

> > > + */

> > > +

> > > +/* Add a new device to the vfio framework with associated vfio

> driver

> > > + * callbacks.  This is the entry point for vfio drivers to

> register

> > > devices. */

> > > +int vfio_group_add_dev(struct device *dev, const struct

> > > vfio_device_ops *ops)

> > > +{

> > > +	struct list_head *pos;

> > > +	struct vfio_group *group = NULL;

> > > +	struct vfio_device *device = NULL;

> > > +	unsigned int groupid;

> > > +	int ret = 0;

> > > +	bool new_group = false;

> > > +

> > > +	if (!ops)

> > > +		return -EINVAL;

> > > +

> > > +	if (iommu_device_group(dev, &groupid))

> > > +		return -ENODEV;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	list_for_each(pos, &vfio.group_list) {

> > > +		group = list_entry(pos, struct vfio_group, group_next);

> > > +		if (group->groupid == groupid)

> > > +			break;

> > > +		group = NULL;

> > > +	}

> > > +

> > > +	if (!group) {

> > > +		int minor;

> > > +

> > > +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {

> > > +			ret = -ENOMEM;

> > > +			goto out;

> > > +		}

> > > +

> > > +		group = kzalloc(sizeof(*group), GFP_KERNEL);

> > > +		if (!group) {

> > > +			ret = -ENOMEM;

> > > +			goto out;

> > > +		}

> > > +

> > > +		group->groupid = groupid;

> > > +		INIT_LIST_HEAD(&group->device_list);

> > > +

> > > +		ret = idr_get_new(&vfio.idr, group, &minor);

> > > +		if (ret == 0 && minor > MINORMASK) {

> > > +			idr_remove(&vfio.idr, minor);

> > > +			kfree(group);

> > > +			ret = -ENOSPC;

> > > +			goto out;

> > > +		}

> > > +

> > > +		group->devt = MKDEV(MAJOR(vfio.devt), minor);

> > > +		device_create(vfio.class, NULL, group->devt,

> > > +			      group, "%u", groupid);

> > > +

> > > +		group->bus = dev->bus;

> > > +		list_add(&group->group_next, &vfio.group_list);

> > > +		new_group = true;

> > > +	} else {

> > > +		if (group->bus != dev->bus) {

> > > +			printk(KERN_WARNING

> > > +			       "Error: IOMMU group ID conflict.  Group ID %u

> > > "

> > > +				"on both bus %s and %s\n", groupid,

> > > +				group->bus->name, dev->bus->name);

> > > +			ret = -EFAULT;

> > > +			goto out;

> > > +		}

> > > +

> > > +		list_for_each(pos, &group->device_list) {

> > > +			device = list_entry(pos,

> > > +					    struct vfio_device, device_next);

> > > +			if (device->dev == dev)

> > > +				break;

> > > +			device = NULL;

> > > +		}

> > > +	}

> > > +

> > > +	if (!device) {

> > > +		if (__vfio_group_devs_inuse(group) ||

> > > +		    (group->iommu && group->iommu->refcnt)) {

> > > +			printk(KERN_WARNING

> > > +			       "Adding device %s to group %u while group is

> > > already in use!!\n",

> > > +			       dev_name(dev), group->groupid);

> > > +			/* XXX How to prevent other drivers from claiming? */

> >

> > Here we are adding a device (not yet assigned to a vfio bus) to a

> group

> > that is already in use.

> > Given that it would not be acceptable for this device to get assigned

> > to a non vfio driver, why not forcing such assignment here then?

> 

> Exactly, I just don't know the mechanics of how to make that happen and

> was hoping for suggestions...

> 

> > I am not sure though what the best way to do it would be.

> > What about something like this:

> >

> > - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE

> >   notification it assigns to the device a PCI ID that will make sure

> >   the vfio-pci's probe routine will be invoked (and no other driver

> can

> >   therefore claim the device). That PCI ID would have to be added

> >   to the vfio_pci_driver's id_table (it would be the exception to the

> >   "only dynamic IDs" rule). Too hackish?

> 

> Presumably some other driver also has the ID in it's id_table, how do

> we make sure we win?


By mangling that ID (when processing the BUS_NOTIFY_ADD_DEVICE notification) so that it
matches a 'fake' ID registered in the vfio-pci table (it would be like a
sort of driver redirect/divert). The vfio-pci probe routine would then restore
the original ID (we do not want to confuse userspace). This is hackish, I agree.

What about this:
- When vfio-pci processes the BUS_NOTIFY_ADD_DEVICE notification it can
  pre-initialize the driver pointer (via an API). We would then need to change
  the match/probe PCI mechanism too: for example, the PCI core would have to check
  and honor such pre-driver-initialization when present (and give it higher
  priority than the match callbacks).
  How to do this? For example, when vfio_group_add_dev is invoked, it checks
  whether the device is being added to an already existing group whose
  other devices (well, you would need to check just one of the devices in
  the group) are already assigned to vfio-pci, and in such a case it
  pre-initializes the driver to vfio-pci.

NOTE: By "preinit" I mean "save into the device a reference to a driver before
      the 'match' callbacks".

This would be the timeline:

|
+-> new device gets added to (PCI) bus
|
+-> PCI: send BUS_NOTIFIER_ADD_DEVICE notification
|
+-> VFIO:vfio_pci_device_notifier
|        |
|        +-> BUS_NOTIFIER_ADD_DEVICE: vfio_group_add_dev
|            |
|            +->iommu_device_group(dev,&groupid)
|            +->group = <search groupid in vfio.group_list>
|            +->if (group && group_is_vfio(group))
|            |        <preinit device driver to vfio-pci>
|            ...
|
+-> PCI: xxx
|        |
|        +-> if (!device_driver_is_preinit(dev))
|        |       probe=<search driver's probe callback using 'match'>
|        |   else 
|        |       probe=<get it from preint driver config>
|        |       (+fallback to 'match' if preinit driver disappeared?)
|        |   
|        +-> rc = probe(...)
|        |
|        ...
v
...

Of course, what if multiple drivers decide to preinit the device?

One way to make it cleaner would be to:
- have the PCI layer export an API that allows (for example) the bus
  notification callbacks (like vfio_pci_device_notifier) to preinit a driver
- make such an API reject calls on devices that already have a preinit
  driver.
- make VFIO detect the case where vfio_pci_device_notifier cannot
  preinit the driver (to vfio-pci) for the new device (because it was already
  preinited) and raise an error/warning.

Would this look a bit cleaner?
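
A very rough sketch of what the notifier side might look like, just to make the
proposal concrete.  group_is_vfio() and pci_preinit_driver() are hypothetical
names for the API being discussed, and vfio_pci_ops stands in for the vfio-pci
backend's vfio_device_ops; none of this exists today:

static int vfio_pci_device_notifier(struct notifier_block *nb,
				    unsigned long action, void *data)
{
	struct device *dev = data;

	if (action == BUS_NOTIFY_ADD_DEVICE) {
		vfio_group_add_dev(dev, &vfio_pci_ops);

		/* If the device landed in a group already owned by vfio,
		 * divert it to vfio-pci before any match/probe can run. */
		if (group_is_vfio(dev))
			pci_preinit_driver(to_pci_dev(dev), &vfio_pci_driver);
	}

	return NOTIFY_OK;
}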

> > > +		}

> > > +

> > > +		device = kzalloc(sizeof(*device), GFP_KERNEL);

> > > +		if (!device) {

> > > +			/* If we just created this group, tear it down */

> > > +			if (new_group) {

> > > +				list_del(&group->group_next);

> > > +				device_destroy(vfio.class, group->devt);

> > > +				idr_remove(&vfio.idr, MINOR(group->devt));

> > > +				kfree(group);

> > > +			}

> > > +			ret = -ENOMEM;

> > > +			goto out;

> > > +		}

> > > +

> > > +		list_add(&device->device_next, &group->device_list);

> > > +		device->dev = dev;

> > > +		device->ops = ops;

> > > +		device->iommu = group->iommu; /* NULL if new */

> >

> > Shouldn't you check the return code of __vfio_iommu_attach_dev?

> 

> Yep, looks like I did this because the expected use case has a NULL

> iommu here, so I need to distiguish that error from an actual

> iommu_attach_device() error.

> 

> > > +		__vfio_iommu_attach_dev(group->iommu, device);

> > > +	}

> > > +out:

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +EXPORT_SYMBOL_GPL(vfio_group_add_dev);

> > > +

> > > +/* Remove a device from the vfio framework */

> >

> > This fn below does not return any error code. Ok ...

> > However, there are a number of errors case that you test, for example

> > - device that does not belong to any group (according to iommu API)

> > - device that belongs to a group but that does not appear in the list

> >   of devices of the vfio_group structure.

> > Are the above two errors checks just paranoia or are those errors

> actually possible?

> > If they were possible, shouldn't we generate a warning (most probably

> > it would be a bug in the code)?

> 

> They're all vfio-bus driver bugs of some sort, so it's just a matter of

> how much we want to scream about them.  I'll comments on each below.

> 

> > > +void vfio_group_del_dev(struct device *dev)

> > > +{

> > > +	struct list_head *pos;

> > > +	struct vfio_group *group = NULL;

> > > +	struct vfio_device *device = NULL;

> > > +	unsigned int groupid;

> > > +

> > > +	if (iommu_device_group(dev, &groupid))

> > > +		return;

> 

> Here the bus driver is probably just sitting on a notifier list for

> their bus_type and a device is getting removed.  Unless we want to

> require the bus driver to track everything it's attempted to add and

> whether it worked, we can just ignore this.


OK, I see what you mean. If vfio_group_add_dev fails for some reason we
do not keep track of it. Right?
Would it make sense to add one special group to vfio.group_list (or, better,
a separate field of the vfio structure) whose goal
would be just that: keeping track of those devices that failed to be added
to the VFIO framework (could it help for debugging too?)?

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	list_for_each(pos, &vfio.group_list) {

> > > +		group = list_entry(pos, struct vfio_group, group_next);

> > > +		if (group->groupid == groupid)

> > > +			break;

> > > +		group = NULL;

> > > +	}

> > > +

> > > +	if (!group)

> > > +		goto out;

> 

> We don't even have a group for the device, we could BUG_ON here.  The

> bus driver failed to tell us about something that was then removed.

> 

> > > +

> > > +	list_for_each(pos, &group->device_list) {

> > > +		device = list_entry(pos, struct vfio_device, device_next);

> > > +		if (device->dev == dev)

> > > +			break;

> > > +		device = NULL;

> > > +	}

> > > +

> > > +	if (!device)

> > > +		goto out;

> 

> Same here.

> 

> > > +

> > > +	BUG_ON(device->refcnt);

> > > +

> > > +	if (device->attached)

> > > +		__vfio_iommu_detach_dev(group->iommu, device);

> > > +

> > > +	list_del(&device->device_next);

> > > +	kfree(device);

> > > +

> > > +	/* If this was the only device in the group, remove the group.

> > > +	 * Note that we intentionally unmerge empty groups here if the

> > > +	 * group fd isn't opened. */

> > > +	if (list_empty(&group->device_list) && group->refcnt == 0) {

> > > +		struct vfio_iommu *iommu = group->iommu;

> > > +

> > > +		if (iommu) {

> > > +			__vfio_group_set_iommu(group, NULL);

> > > +			__vfio_try_dissolve_iommu(iommu);

> > > +		}

> > > +

> > > +		device_destroy(vfio.class, group->devt);

> > > +		idr_remove(&vfio.idr, MINOR(group->devt));

> > > +		list_del(&group->group_next);

> > > +		kfree(group);

> > > +	}

> > > +out:

> > > +	mutex_unlock(&vfio.lock);

> > > +}

> > > +EXPORT_SYMBOL_GPL(vfio_group_del_dev);

> > > +

> > > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this

> > > + * entry point is used to mark the device usable (viable).  The vfio

> > > + * device driver associates a private device_data struct with the device

> > > + * here, which will later be returned for vfio_device_fops callbacks. */

> > > +int vfio_bind_dev(struct device *dev, void *device_data)

> > > +{

> > > +	struct vfio_device *device;

> > > +	int ret = -EINVAL;

> > > +

> > > +	BUG_ON(!device_data);

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	device = __vfio_lookup_dev(dev);

> > > +

> > > +	BUG_ON(!device);

> > > +

> > > +	ret = dev_set_drvdata(dev, device);

> > > +	if (!ret)

> > > +		device->device_data = device_data;

> > > +

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +EXPORT_SYMBOL_GPL(vfio_bind_dev);

> > > +

> > > +/* A device is only removeable if the iommu for the group is not in use. */

> > > +static bool vfio_device_removeable(struct vfio_device *device)

> > > +{

> > > +	bool ret = true;

> > > +

> > > +	mutex_lock(&vfio.lock);

> > > +

> > > +	if (device->iommu && __vfio_iommu_inuse(device->iommu))

> > > +		ret = false;

> > > +

> > > +	mutex_unlock(&vfio.lock);

> > > +	return ret;

> > > +}

> > > +

> > > +/* Notify vfio that a device is being unbound from the vfio device driver

> > > + * and return the device private device_data pointer.  If the group is

> > > + * in use, we need to block or take other measures to make it safe for

> > > + * the device to be removed from the iommu. */

> > > +void *vfio_unbind_dev(struct device *dev)

> > > +{

> > > +	struct vfio_device *device = dev_get_drvdata(dev);

> > > +	void *device_data;

> > > +

> > > +	BUG_ON(!device);

> > > +

> > > +again:

> > > +	if (!vfio_device_removeable(device)) {

> > > +		/* XXX signal for all devices in group to be removed or

> > > +		 * resort to killing the process holding the device fds.

> > > +		 * For now just block waiting for releases to wake us. */

> > > +		wait_event(vfio.release_q, vfio_device_removeable(device));

> >

> > Any new idea/proposal on how to handle this situation?

> > The last one I remember was to leave the soft/hard/etc timeout

> handling in

> > userspace and implement it as a sort of policy. Is that one still the

> most

> > likely candidate solution to handle this situation?

> 

> I haven't heard any new proposals.  I think we need the hard timeout

> handling in the kernel.  We can't leave it to userspace to decide they

> get to keep the device.  We could have this tunable via an ioctl, but I

> don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak

> it.  I was intending to re-implement the netlink interface to signal

> the

> removal, but expect to get allergic reactions to that.


(I personally like the async netlink signaling, but I am OK with an ioctl-based
mechanism if it provides the same flexibility)

What would be a reasonable hard timeout?

/Chris
Alex Williamson - Nov. 9, 2011, 11:40 p.m.
On Wed, 2011-11-09 at 15:08 -0600, Christian Benvenuti (benve) wrote:
<snip>
> > > > +
> > > > +struct vfio_group {
> > > > +	dev_t			devt;
> > > > +	unsigned int		groupid;
> > >
> > > This groupid is returned by the device_group callback you recently
> > added
> > > with a separate (not yet in tree) IOMMU patch.
> > > Is it correct to say that the scope of this ID is the bus the iommu
> > > belongs too (but you use it as if it was global)?
> > > I believe there is nothing right now to ensure the uniqueness of such
> > > ID across bus types (assuming there will be other bus drivers in the
> > > future besides vfio-pci).
> > > If that's the case, the vfio.group_list global list and the
> > __vfio_lookup_dev
> > > routine should be changed to account for the bus too?
> > > Ops, I just saw the error msg in vfio_group_add_dev about the group
> > id conflict.
> > > Is that warning related to what I mentioned above?
> > 
> > Yeah, this is a concern, but I can't think of a system where we would
> > manifest a collision.  The IOMMU driver is expected to provide unique
> > groupids for all devices below them, but we could imagine a system that
> > implements two different bus_types, each with a different IOMMU driver
> > and we have no coordination between them.  Perhaps since we have
> > iommu_ops per bus, we should also expose the bus in the vfio group
> > path,
> > ie. /dev/vfio/%s/%u, dev->bus->name, iommu_device_group(dev,..).  This
> > means userspace would need to do a readlink of the subsystem entry
> > where
> > it finds the iommu_group to find the vfio group.  Reasonable?
> 
> Most probably we won't see use cases with multiple buses anytime soon, but
> this scheme you proposed (with the per-bus subdir) looks good to me. 

Ok, I think that's easier than any scheme of trying to organize globally
unique groupids instead of just bus_type unique.  That makes group
objects internally matched by the {groupid, bus} pair.
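
Roughly, userspace would then do something like this to find the group node
(a sketch only, assuming the proposed /dev/vfio/<bus_name>/<groupid> layout;
sysfs_dev_path and groupid are assumed to be already known, and it needs the
usual <stdio.h>, <unistd.h>, <limits.h> and <libgen.h> includes):

	/* <sysfs_dev_path>/subsystem is a symlink to the bus directory,
	 * e.g. ../../../bus/pci, so its basename is the bus name. */
	char link[PATH_MAX], bus[PATH_MAX], path[PATH_MAX];
	ssize_t n;

	snprintf(link, sizeof(link), "%s/subsystem", sysfs_dev_path);
	n = readlink(link, bus, sizeof(bus) - 1);
	bus[n > 0 ? n : 0] = '\0';

	snprintf(path, sizeof(path), "/dev/vfio/%s/%u", basename(bus), groupid);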

<snip>
> > >
> > > I looked at how you take care of ref counts ...
> > >
> > > This is how the tree of vfio_iommu/vfio_group/vfio_device data
> > > Structures is organized (I'll use just iommu/group/dev to make
> > > the graph smaller):
> > >
> > >             iommu
> > >            /     \
> > >           /       \
> > >     group   ...     group
> > >     /  \           /  \
> > >    /    \         /    \
> > > dev  ..  dev   dev  ..  dev
> > >
> > > This is how you get a file descriptor for the three kind of objects:
> > >
> > > - group : open /dev/vfio/xxx for group xxx
> > > - iommu : group ioctl VFIO_GROUP_GET_IOMMU_FD
> > > - device: group ioctl VFIO_GROUP_GET_DEVICE_FD
> > >
> > > Given the above topology, I would assume that:
> > >
> > > (1) an iommu is 'inuse' if : a) iommu refcnt > 0, or
> > >                              b) any of its groups is 'inuse'
> > >
> > > (2) a  group is 'inuse' if : a) group refcnt > 0, or
> > >                              b) any of its devices is 'inuse'
> > >
> > > (3) a device is 'inuse' if : a) device refcnt > 0
> > 
> > (2) is a bit debatable.  I've wrestled with this one for a while.  The
> > vfio_iommu serves two purposes.  First, it is the object we use for
> > managing iommu domains, which includes allocating domains and attaching
> > devices to domains.  Groups objects aren't involved here, they just
> > manage the set of devices.  The second role is to manage merged groups,
> > because whether or not groups can be merged is a function of iommu
> > domain compatibility.
> > 
> > So if we look at "is the iommu in use?" ie. can I destroy the mapping
> > context, detach devices and free the domain, the reference count on the
> > group is irrelevant.  The user has to have a device or iommu file
> > descriptor opened somewhere, across the group or merged group, for that
> > context to be maintained.  A reasonable requirement, I think.
> 
> OK, then if you close all devices and the iommu, keeping the group open
> Would not protect the iommu domain mapping. This means that if you (or
> A management application) need to close all devices+iommu and reopen
> right away again the same devices+iommu you may get a failure on the
> iommu domain creation (supposing the system goes out of resources).
> Is this just a very unlikely scenario? 

Can you think of a use case that would require such?  I can't.

> I guess in this case you would simply have to avoid releasing the iommu
> fd, right?

Right.  We could also debate whether we should drop all iommu mappings
when the iommu refcnt goes to zero.  We don't currently do that, but it
might make sense.

> 
> > However, if we ask "is the group in use?" ie. can I not only destroy
> > the
> > mappings above, but also automatically tear apart merged groups, then I
> > think we need to look at the group refcnt.
> 
> Correct.
> 
> > There's also a symmetry factor, the group is a benign entry point to
> > device access.  It's only when device or iommu access is granted that
> > the group gains any real power.  Therefore, shouldn't that power also
> > be
> > removed when those access points are closed?
> > 
> > > You have coded the 'inuse' logic with these three routines:
> > >
> > >     __vfio_iommu_inuse, which implements (1) above
> > >
> > > and
> > >     __vfio_iommu_groups_inuse
> > 
> > Implements (2.a)
> 
> Yes, but for al groups at once.

Right

> > >     __vfio_group_devs_inuse
> > 
> > Implements (2.b)
> 
> Yes
> 
> > > which are used by __vfio_iommu_inuse.
> > > Why don't you check the group refcnt in __vfio_iommu_groups_inuse?
> > 
> > Hopefully explained above, but open for discussion.
> > 
> > > Would it make sense (and the code more readable) to structure the
> > > nested refcnt/inuse check like this?
> > > (The numbers (1)(2)(3) refer to the three 'inuse' conditions above)
> > >
> > >    (1)__vfio_iommu_inuse
> > >    |
> > >    +-> check iommu refcnt
> > >    +-> __vfio_iommu_groups_inuse
> > >        |
> > >        +->LOOP: (2)__vfio_iommu_group_inuse<--MISSING
> > >                 |
> > >                 +-> check group refcnt<--MISSING
> > >                 +-> __vfio_group_devs_inuse()
> > >                     |
> > >                     +-> LOOP: (3)__vfio_group_dev_inuse<--MISSING
> > >                               |
> > >                               +-> check device refcnt
> > 
> > We currently do:
> > 
> >    (1)__vfio_iommu_inuse
> >     |
> >     +-> check iommu refcnt
> >     +-> __vfio_group_devs_inuse
> >         |
> >         +->LOOP: (2.b)__vfio_group_devs_inuse
> >                   |
> >                   +-> LOOP: (3) check device refcnt
> > 
> > If that passes, the iommu context can be dissolved and we follow up
> > with:
> > 
> >     __vfio_iommu_groups_inuse
> >     |
> >     +-> LOOP: (2.a)__vfio_iommu_groups_inuse
> >                |
> >                +-> check group refcnt
> > 
> > If that passes, groups can also be umerged.
> > 
> > Is this right?
> 
> Yes, assuming we stick to the "benign" role of groups you
> described above.

Ok, no change then.  Thanks for looking at that so closely.

<snip>
> > > > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > >
> > > The documentation in vfio.txt explains clearly the logic implemented
> > by
> > > the merge/unmerge group ioctls.
> > > However, what you are doing is not merging groups, but rather
> > adding/removing
> > > groups to/from iommus (and creating flat lists of groups).
> > > For example, when you do
> > >
> > >   merge(A,B)
> > >
> > > you actually mean to say "merge B to the list of groups assigned to
> > the
> > > same iommu as group A".
> > 
> > It's actually a little more than that.  After you've merged B into A,
> > you can close the file descriptor for B and access all of the devices
> > for the merged group from A.
> 
> It is actually more...
> 
> Scenario 1:
> 
>   create_grp(A)
>   create_grp(B)
>   ...
>   merge_grp(A,B)
>   create_grp(C)
>   merge_grp(C,B) ... this works, right?

No, but merge_grp(B,C) does.  I currently require that the incoming
group has no open device or iommu file descriptors and is a singular
group.  The device/iommu is a hard requirement since we'll be changing
the iommu context and can't leave an attack window.  The singular group
is an implementation detail.  Given the iommu/device requirement, it's
just as easy for userspace to tear apart the group and pass each
individually.

> Scenario 2:
> 
>   create_grp(A)
>   create_grp(B)
>   fd_x = get_dev_fd(B,x)
>   ...
>   merge_grp(A,B)

NAK, this fails the no-open-device test.  Again, merge_grp(B,A) is supported.

>   create_grp(C)
>   merge_grp(A,C)

Yep, this works.

>   fd_x = get_dev_fd(C,x) 

Yep, and if x is the same in both cases, you'll get 2 different file
descriptors backed by the same device.

> Those two examples seems to suggest me more of a list-abstraction than a merge abstraction.
> However, if it fits into the agreed syntax/logic it is ok, as long as we document it
> properly.

Can you suggest documentation changes that would make this more clear?

> > > For the same reason, you do not really need to provide the group you
> > want
> > > to unmerge from, which means that instead of
> > >
> > >   unmerge(A,B)
> > >
> > > you would just need
> > >
> > >   unmerge(B)
> > 
> > Good point, we can avoid the awkward reference via file descriptor for
> > the unmerge.
> > 
> > > I understand the reason why it is not a real merge/unmerge (ie, to
> > keep the
> > > original groups so that you can unmerge later)
> > 
> > Right, we still need to have visibility of the groups comprising the
> > merged group, but the abstraction provided to the user seems to be
> > deeper than you're thinking.
> > 
> > >  ... however I just wonder if
> > > it wouldn't be more natural to implement the
> > VFIO_IOMMU_ADD_GROUP/DEL_GROUP
> > > iommu ioctls instead? (the relationships between the data structure
> > would
> > > remain the same)
> > > I guess you already discarded this option for some reasons, right?
> > What was
> > > the reason?
> > 
> > It's a possibility, I'm not sure it was discussed or really what
> > advantage it provides.  It seems like we'd logically lose the ability
> > to
> > access devices from other groups,
> 
> What is the real (immediate) benefit of this capability?

Mostly convenience, but also promotes the peer idea where merged groups
simply create a "super" group that can access the iommu and all the
devices of the member groups.  On x86 we expect that merging groups will
always succeed and groups will typically have a single device, so a
driver could merge them all together, throw away all the extra group
file descriptors and manage the whole super group via a single group fd.
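
A minimal sketch of that usage, assuming the merges succeed (fds are illustrative):

	ioctl(a, VFIO_GROUP_MERGE, &b);
	close(b);	/* the merged group stays intact; everything remains reachable via 'a' */

	int iommu_fd = ioctl(a, VFIO_GROUP_GET_IOMMU_FD);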

> > whether that's good or bad, I don't know.  I think the notion of "merge"
> > promotes the idea that the groups
> > are peers and an iommu_add/del feels a bit more hierarchical.
> 
> I agree. 
<snip>
> > > > +	if (!device) {
> > > > +		if (__vfio_group_devs_inuse(group) ||
> > > > +		    (group->iommu && group->iommu->refcnt)) {
> > > > +			printk(KERN_WARNING
> > > > +			       "Adding device %s to group %u while group is
> > > > already in use!!\n",
> > > > +			       dev_name(dev), group->groupid);
> > > > +			/* XXX How to prevent other drivers from claiming? */
> > >
> > > Here we are adding a device (not yet assigned to a vfio bus) to a
> > group
> > > that is already in use.
> > > Given that it would not be acceptable for this device to get assigned
> > > to a non vfio driver, why not forcing such assignment here then?
> > 
> > Exactly, I just don't know the mechanics of how to make that happen and
> > was hoping for suggestions...
> > 
> > > I am not sure though what the best way to do it would be.
> > > What about something like this:
> > >
> > > - when the bus vfio-pci processes the BUS_NOTIFY_ADD_DEVICE
> > >   notification it assigns to the device a PCI ID that will make sure
> > >   the vfio-pci's probe routine will be invoked (and no other driver
> > can
> > >   therefore claim the device). That PCI ID would have to be added
> > >   to the vfio_pci_driver's id_table (it would be the exception to the
> > >   "only dynamic IDs" rule). Too hackish?
> > 
> > Presumably some other driver also has the ID in it's id_table, how do
> > we make sure we win?
> 
> By mangling such ID (when processing the BUS_NOTIFY_ADD_DEVICE notification) to
> match against a 'fake' ID registered in the vfio-pci table (it would be like a
> sort of driver redirect/divert). The vfio-pci's probe routine would restore
> the original ID (we do not want to confuse userspace). This is hackish, I agree.
> 
> What about this:
> - When vfio-pci processes the BUS_NOTIFY_ADD_DEVICE notification it can
>   pre-initialize the driver pointer (via an API). We would then need to change
>   the match/probe PCI mechanism too: for example, the PCI core will have to check
>   and honor such pre-driver-initialization when present (and give it higher
>   priority over the match callbacks).
>   How to do this? For example, when vfio_group_add_dev is invoked, it checks
>   whether the device is getting added to an already existent group where
>   the other devices (well, you would need to check just one of the devices in
>   the group) are already assigned to vfio-pci, and in such a case it
>   pre-initialize the driver to vfio-pci.

It's ok to make a group "non-viable"; we only want to intervene if the
iommu is in use (iommu or device refcnt > 0).

> 
> NOTE: By "preinit" I mean "save into the device a reference to a driver before
>       the 'match' callbacks".
> 
> This would be the timeline:
> 
> |
> +-> new device gets added to (PCI) bus
> |
> +-> PCI: send BUS_NOTIFIER_ADD_DEVICE notification
> |
> +-> VFIO:vfio_pci_device_notifier
> |        |
> |        +-> BUS_NOTIFIER_ADD_DEVICE: vfio_group_add_dev
> |            |
> |            +->iommu_device_group(dev,&groupid)
> |            +->group = <search groupid in vfio.group_list>
> |            +->if (group && group_is_vfio(group))
> |            |        <preinit device driver to vfio-pci>
> |            ...
> |
> +-> PCI: xxx
> |        |
> |        +-> if (!device_driver_is_preinit(dev))
> |        |       probe=<search driver's probe callback using 'match'>
> |        |   else 
> |        |       probe=<get it from preint driver config>
> |        |       (+fallback to 'match' if preinit driver disappeared?)
> |        |   
> |        +-> rc = probe(...)
> |        |
> |        ...
> v
> ...
> 
> Of course, what if multiple drivers decide to preinit the device ?

Yep, we'd have to have a policy to BUG_ON if the preinit driver is
already set.

> One way to make it cleaner would be to:
> - have the PCI layer export an API that allows (for example) the bus
>   notification callbacks (like vfio_pci_device_notifier) to preinit a driver
> - make such API reject calls on devices that already have a preinit
>   driver.
> - make VFIO detect the case where vfio_pci_device_notifier can not
>   preinit the driver (to vfio-pci) for the new device (because already
>   preinited) and raise an error/warning.
> 
> Would this look a bit cleaner?

It looks like there might already be infrastructure through which we can set
dev->driver and call the driver probe() function, so maybe we're only in
trouble if dev->driver is already set when we get the bus add
notification.  I just wasn't sure that was entirely kosher.  I'll
have to try that and figure out how to test it; fake hotplug maybe.

<snip>
> > > This fn below does not return any error code. Ok ...
> > > However, there are a number of errors case that you test, for example
> > > - device that does not belong to any group (according to iommu API)
> > > - device that belongs to a group but that does not appear in the list
> > >   of devices of the vfio_group structure.
> > > Are the above two errors checks just paranoia or are those errors
> > actually possible?
> > > If they were possible, shouldn't we generate a warning (most probably
> > > it would be a bug in the code)?
> > 
> > They're all vfio-bus driver bugs of some sort, so it's just a matter of
> > how much we want to scream about them.  I'll comment on each below.
> > 
> > > > +void vfio_group_del_dev(struct device *dev)
> > > > +{
> > > > +	struct list_head *pos;
> > > > +	struct vfio_group *group = NULL;
> > > > +	struct vfio_device *device = NULL;
> > > > +	unsigned int groupid;
> > > > +
> > > > +	if (iommu_device_group(dev, &groupid))
> > > > +		return;
> > 
> > Here the bus driver is probably just sitting on a notifier list for
> > their bus_type and a device is getting removed.  Unless we want to
> > require the bus driver to track everything it's attempted to add and
> > whether it worked, we can just ignore this.
> 
> OK, I see what you mean. If vfio_group_add_dev fails for some reason, we
> do not keep track of it. Right?

The primary thing I'm thinking of here is not vfio_group_add_dev()
failing for "some reason", but specifically failing because the device
doesn't have a groupid, ie. it's not behind an iommu.  In that case it's
just a random device that can't be used by vfio.

> Would it make sense to add one special group to vfio.group_list (or, better,
> on a separate field of the vfio structure) whose goal
> would be just that: keep track of those devices that failed to be added
> to the VFIO framework (can it help for debugging too?)?

For the above case, no, we shouldn't need to track those.  But it does
seem like there's a gap for devices that fail vfio_group_add_dev() for
other reasons.  I don't think we want a special group for them, because
that isolates them from other devices that are potentially in the same
group.  I think instead what we want to do is set a taint flag on the
group.  We can do a BUG_ON if we're not able to allocate a group, then a
WARN_ON if we fail elsewhere and mark the group tainted so it's
effectively never viable.
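
Something like this is what I'm picturing for the add path (the tainted
flag and the helpers below are hypothetical, not in the current patch):

	group = __vfio_find_or_alloc_group(groupid);
	BUG_ON(!group);				/* can't even track the group */

	if (__vfio_group_attach_dev(group, dev)) {
		WARN(1, "vfio: failed to add %s to group %u\n",
		     dev_name(dev), groupid);
		group->tainted = true;		/* group can never become viable */
	}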

<snip>
> > > > +	if (!vfio_device_removeable(device)) {
> > > > +		/* XXX signal for all devices in group to be removed or
> > > > +		 * resort to killing the process holding the device fds.
> > > > +		 * For now just block waiting for releases to wake us. */
> > > > +		wait_event(vfio.release_q, vfio_device_removeable(device));
> > >
> > > Any new idea/proposal on how to handle this situation?
> > > The last one I remember was to leave the soft/hard/etc timeout
> > handling in
> > > userspace and implement it as a sort of policy. Is that one still the
> > most
> > > likely candidate solution to handle this situation?
> > 
> > I haven't heard any new proposals.  I think we need the hard timeout
> > handling in the kernel.  We can't leave it to userspace to decide they
> > get to keep the device.  We could have this tunable via an ioctl, but I
> > don't see how we wouldn't require CAP_SYS_ADMIN (or similar) to tweak
> > it.  I was intending to re-implement the netlink interface to signal
> > the
> > removal, but expect to get allergic reactions to that.
> 
> (I personally like the async netlink signaling, but I am OK with an ioctl based
> mechanism if it provides the same flexibility)
> 
> What would be a reasonable hard timeout?

I think we were looking at 10s of seconds in the old vfio code.  Tough
call though.  Could potentially provide a module_param override so an
admin who trusts their users could set a long/infinite timeout.
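
For concreteness, the override could be as simple as the following
(names and the 30 second default are made up, and what to do when the
timeout actually expires is still the open question):

static unsigned int remove_timeout = 30;	/* seconds, 0 = wait forever */
module_param(remove_timeout, uint, 0644);
MODULE_PARM_DESC(remove_timeout,
		 "Seconds to wait for userspace to release a removed device");
...
	if (!vfio_device_removeable(device)) {
		if (remove_timeout)
			wait_event_timeout(vfio.release_q,
					   vfio_device_removeable(device),
					   remove_timeout * HZ);
		else
			wait_event(vfio.release_q,
				   vfio_device_removeable(device));
	}

Thanks,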

Alex
Christian Benvenuti - Nov. 10, 2011, 12:57 a.m.
Here are few minor comments on vfio_iommu.c ...

> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c

> new file mode 100644

> index 0000000..029dae3

> --- /dev/null

> +++ b/drivers/vfio/vfio_iommu.c

> @@ -0,0 +1,530 @@

> +/*

> + * VFIO: IOMMU DMA mapping support

> + *

> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.

> + *     Author: Alex Williamson <alex.williamson@redhat.com>

> + *

> + * This program is free software; you can redistribute it and/or modify

> + * it under the terms of the GNU General Public License version 2 as

> + * published by the Free Software Foundation.

> + *

> + * Derived from original vfio:

> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.

> + * Author: Tom Lyon, pugs@cisco.com

> + */

> +

> +#include <linux/compat.h>

> +#include <linux/device.h>

> +#include <linux/fs.h>

> +#include <linux/iommu.h>

> +#include <linux/module.h>

> +#include <linux/mm.h>

> +#include <linux/sched.h>

> +#include <linux/slab.h>

> +#include <linux/uaccess.h>

> +#include <linux/vfio.h>

> +#include <linux/workqueue.h>

> +

> +#include "vfio_private.h"


Doesn't the 'dma_'  prefix belong to the generic DMA code?

> +struct dma_map_page {

> +	struct list_head	list;

> +	dma_addr_t		daddr;

> +	unsigned long		vaddr;

> +	int			npage;

> +	int			rdwr;

> +};

> +

> +/*

> + * This code handles mapping and unmapping of user data buffers

> + * into DMA'ble space using the IOMMU

> + */

> +

> +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)

> +

> +struct vwork {

> +	struct mm_struct	*mm;

> +	int			npage;

> +	struct work_struct	work;

> +};

> +

> +/* delayed decrement for locked_vm */

> +static void vfio_lock_acct_bg(struct work_struct *work)

> +{

> +	struct vwork *vwork = container_of(work, struct vwork, work);

> +	struct mm_struct *mm;

> +

> +	mm = vwork->mm;

> +	down_write(&mm->mmap_sem);

> +	mm->locked_vm += vwork->npage;

> +	up_write(&mm->mmap_sem);

> +	mmput(mm);		/* unref mm */

> +	kfree(vwork);

> +}

> +

> +static void vfio_lock_acct(int npage)

> +{

> +	struct vwork *vwork;

> +	struct mm_struct *mm;

> +

> +	if (!current->mm) {

> +		/* process exited */

> +		return;

> +	}

> +	if (down_write_trylock(&current->mm->mmap_sem)) {

> +		current->mm->locked_vm += npage;

> +		up_write(&current->mm->mmap_sem);

> +		return;

> +	}

> +	/*

> +	 * Couldn't get mmap_sem lock, so must setup to decrement

                                                      ^^^^^^^^^

Increment?

> +	 * mm->locked_vm later. If locked_vm were atomic, we wouldn't

> +	 * need this silliness

> +	 */

> +	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);

> +	if (!vwork)

> +		return;

> +	mm = get_task_mm(current);	/* take ref mm */

> +	if (!mm) {

> +		kfree(vwork);

> +		return;

> +	}

> +	INIT_WORK(&vwork->work, vfio_lock_acct_bg);

> +	vwork->mm = mm;

> +	vwork->npage = npage;

> +	schedule_work(&vwork->work);

> +}

> +

> +/* Some mappings aren't backed by a struct page, for example an mmap'd

> + * MMIO range for our own or another device.  These use a different

> + * pfn conversion and shouldn't be tracked as locked pages. */

> +static int is_invalid_reserved_pfn(unsigned long pfn)

> +{

> +	if (pfn_valid(pfn)) {

> +		int reserved;

> +		struct page *tail = pfn_to_page(pfn);

> +		struct page *head = compound_trans_head(tail);

> +		reserved = PageReserved(head);

> +		if (head != tail) {

> +			/* "head" is not a dangling pointer

> +			 * (compound_trans_head takes care of that)

> +			 * but the hugepage may have been split

> +			 * from under us (and we may not hold a

> +			 * reference count on the head page so it can

> +			 * be reused before we run PageReferenced), so

> +			 * we've to check PageTail before returning

> +			 * what we just read.

> +			 */

> +			smp_rmb();

> +			if (PageTail(tail))

> +				return reserved;

> +		}

> +		return PageReserved(tail);

> +	}

> +

> +	return true;

> +}

> +

> +static int put_pfn(unsigned long pfn, int rdwr)

> +{

> +	if (!is_invalid_reserved_pfn(pfn)) {

> +		struct page *page = pfn_to_page(pfn);

> +		if (rdwr)

> +			SetPageDirty(page);

> +		put_page(page);

> +		return 1;

> +	}

> +	return 0;

> +}

> +

> +/* Unmap DMA region */

> +/* dgate must be held */

> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			    int npage, int rdwr)

> +{

> +	int i, unlocked = 0;

> +

> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {

> +		unsigned long pfn;

> +

> +		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;

> +		if (pfn) {

> +			iommu_unmap(iommu->domain, iova, 0);

> +			unlocked += put_pfn(pfn, rdwr);

> +		}

> +	}

> +	return unlocked;

> +}

> +

> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			   unsigned long npage, int rdwr)

> +{

> +	int unlocked;

> +

> +	unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);

> +	vfio_lock_acct(-unlocked);

> +}

> +

> +/* Unmap ALL DMA regions */

> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)

> +{

> +	struct list_head *pos, *pos2;

> +	struct dma_map_page *mlp;

> +

> +	mutex_lock(&iommu->dgate);

> +	list_for_each_safe(pos, pos2, &iommu->dm_list) {

> +		mlp = list_entry(pos, struct dma_map_page, list);

> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);

> +		list_del(&mlp->list);

> +		kfree(mlp);

> +	}

> +	mutex_unlock(&iommu->dgate);

> +}

> +

> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)

> +{

> +	struct page *page[1];

> +	struct vm_area_struct *vma;

> +	int ret = -EFAULT;

> +

> +	if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {

> +		*pfn = page_to_pfn(page[0]);

> +		return 0;

> +	}

> +

> +	down_read(&current->mm->mmap_sem);

> +

> +	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);

> +

> +	if (vma && vma->vm_flags & VM_PFNMAP) {

> +		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;

> +		if (is_invalid_reserved_pfn(*pfn))

> +			ret = 0;

> +	}

> +

> +	up_read(&current->mm->mmap_sem);

> +

> +	return ret;

> +}

> +

> +/* Map DMA region */

> +/* dgate must be held */

> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,

> +			unsigned long vaddr, int npage, int rdwr)

> +{

> +	unsigned long start = iova;

> +	int i, ret, locked = 0, prot = IOMMU_READ;

> +

> +	/* Verify pages are not already mapped */

> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)

> +		if (iommu_iova_to_phys(iommu->domain, iova))

> +			return -EBUSY;

> +

> +	iova = start;

> +

> +	if (rdwr)

> +		prot |= IOMMU_WRITE;

> +	if (iommu->cache)

> +		prot |= IOMMU_CACHE;

> +

> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {

> +		unsigned long pfn = 0;

> +

> +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);

> +		if (ret) {

> +			__vfio_dma_unmap(iommu, start, i, rdwr);

> +			return ret;

> +		}

> +

> +		/* Only add actual locked pages to accounting */

> +		if (!is_invalid_reserved_pfn(pfn))

> +			locked++;

> +

> +		ret = iommu_map(iommu->domain, iova,

> +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);

> +		if (ret) {

> +			/* Back out mappings on error */

> +			put_pfn(pfn, rdwr);

> +			__vfio_dma_unmap(iommu, start, i, rdwr);

> +			return ret;

> +		}

> +	}

> +	vfio_lock_acct(locked);

> +	return 0;

> +}

> +

> +static inline int ranges_overlap(unsigned long start1, size_t size1,

> +				 unsigned long start2, size_t size2)

> +{

> +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);

> +}

> +

> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,

> +					  dma_addr_t start, size_t size)

> +{

> +	struct list_head *pos;

> +	struct dma_map_page *mlp;

> +

> +	list_for_each(pos, &iommu->dm_list) {

> +		mlp = list_entry(pos, struct dma_map_page, list);

> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),

> +				   start, size))

> +			return mlp;

> +	}

> +	return NULL;

> +}

> +

> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +			    size_t size, struct dma_map_page *mlp)

> +{

> +	struct dma_map_page *split;

> +	int npage_lo, npage_hi;

> +

> +	/* Existing dma region is completely covered, unmap all */


This works. However, given how vfio_dma_map_dm implements the merging
logic, I think it is impossible to have

    (start < mlp->daddr &&
     start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))


> +	if (start <= mlp->daddr &&

> +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {

> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);

> +		list_del(&mlp->list);

> +		npage_lo = mlp->npage;

> +		kfree(mlp);

> +		return npage_lo;

> +	}

> +

> +	/* Overlap low address of existing range */


Same as above (ie, '<' is impossible)

> +	if (start <= mlp->daddr) {

> +		size_t overlap;

> +

> +		overlap = start + size - mlp->daddr;

> +		npage_lo = overlap >> PAGE_SHIFT;

> +		npage_hi = mlp->npage - npage_lo;

> +

> +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);

> +		mlp->daddr += overlap;

> +		mlp->vaddr += overlap;

> +		mlp->npage -= npage_lo;

> +		return npage_lo;

> +	}


Same as above (ie, '>' is impossible).

> +	/* Overlap high address of existing range */

> +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {

> +		size_t overlap;

> +

> +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;

> +		npage_hi = overlap >> PAGE_SHIFT;

> +		npage_lo = mlp->npage - npage_hi;

> +

> +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);

> +		mlp->npage -= npage_hi;

> +		return npage_hi;

> +	}

> +

> +	/* Split existing */

> +	npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;

> +	npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;

> +

> +	split = kzalloc(sizeof *split, GFP_KERNEL);

> +	if (!split)

> +		return -ENOMEM;

> +

> +	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);

> +

> +	mlp->npage = npage_lo;

> +

> +	split->npage = npage_hi;

> +	split->daddr = start + size;

> +	split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;

> +	split->rdwr = mlp->rdwr;

> +	list_add(&split->list, &iommu->dm_list);

> +	return size >> PAGE_SHIFT;

> +}

> +

> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)

> +{

> +	int ret = 0;

> +	size_t npage = dmp->size >> PAGE_SHIFT;

> +	struct list_head *pos, *n;

> +

> +	if (dmp->dmaaddr & ~PAGE_MASK)

> +		return -EINVAL;

> +	if (dmp->size & ~PAGE_MASK)

> +		return -EINVAL;

> +

> +	mutex_lock(&iommu->dgate);

> +

> +	list_for_each_safe(pos, n, &iommu->dm_list) {

> +		struct dma_map_page *mlp;

> +

> +		mlp = list_entry(pos, struct dma_map_page, list);

> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),

> +				   dmp->dmaaddr, dmp->size)) {

> +			ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,

> +						      dmp->size, mlp);

> +			if (ret > 0)

> +				npage -= NPAGE_TO_SIZE(ret);

> +			if (ret < 0 || npage == 0)

> +				break;

> +		}

> +	}

> +	mutex_unlock(&iommu->dgate);

> +	return ret > 0 ? 0 : ret;

> +}

> +

> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)

> +{

> +	int npage;

> +	struct dma_map_page *mlp, *mmlp = NULL;

> +	dma_addr_t daddr = dmp->dmaaddr;

> +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;

> +	size_t size = dmp->size;

> +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;

> +

> +	if (vaddr & (PAGE_SIZE-1))

> +		return -EINVAL;

> +	if (daddr & (PAGE_SIZE-1))

> +		return -EINVAL;

> +	if (size & (PAGE_SIZE-1))

> +		return -EINVAL;

> +

> +	npage = size >> PAGE_SHIFT;

> +	if (!npage)

> +		return -EINVAL;

> +

> +	if (!iommu)

> +		return -EINVAL;

> +

> +	mutex_lock(&iommu->dgate);

> +

> +	if (vfio_find_dma(iommu, daddr, size)) {

> +		ret = -EBUSY;

> +		goto out_lock;

> +	}

> +

> +	/* account for locked pages */

> +	locked = current->mm->locked_vm + npage;

> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;

> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {

> +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",

> +			__func__, rlimit(RLIMIT_MEMLOCK));

> +		ret = -ENOMEM;

> +		goto out_lock;

> +	}

> +

> +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);

> +	if (ret)

> +		goto out_lock;

> +

> +	/* Check if we abut a region below */


Is !daddr possible?

> +	if (daddr) {

> +		mlp = vfio_find_dma(iommu, daddr - 1, 1);

> +		if (mlp && mlp->rdwr == rdwr &&

> +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {

> +

> +			mlp->npage += npage;

> +			daddr = mlp->daddr;

> +			vaddr = mlp->vaddr;

> +			npage = mlp->npage;

> +			size = NPAGE_TO_SIZE(npage);

> +

> +			mmlp = mlp;

> +		}

> +	}


Is !(daddr + size) possible?

> +	if (daddr + size) {

> +		mlp = vfio_find_dma(iommu, daddr + size, 1);

> +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {

> +

> +			mlp->npage += npage;

> +			mlp->daddr = daddr;

> +			mlp->vaddr = vaddr;

> +

> +			/* If merged above and below, remove previously

> +			 * merged entry.  New entry covers it.  */

> +			if (mmlp) {

> +				list_del(&mmlp->list);

> +				kfree(mmlp);

> +			}

> +			mmlp = mlp;

> +		}

> +	}

> +

> +	if (!mmlp) {

> +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);

> +		if (!mlp) {

> +			ret = -ENOMEM;

> +			vfio_dma_unmap(iommu, daddr, npage, rdwr);

> +			goto out_lock;

> +		}

> +

> +		mlp->npage = npage;

> +		mlp->daddr = daddr;

> +		mlp->vaddr = vaddr;

> +		mlp->rdwr = rdwr;

> +		list_add(&mlp->list, &iommu->dm_list);

> +	}

> +

> +out_lock:

> +	mutex_unlock(&iommu->dgate);

> +	return ret;

> +}

> +

> +static int vfio_iommu_release(struct inode *inode, struct file *filep)

> +{

> +	struct vfio_iommu *iommu = filep->private_data;

> +

> +	vfio_release_iommu(iommu);

> +	return 0;

> +}

> +

> +static long vfio_iommu_unl_ioctl(struct file *filep,

> +				 unsigned int cmd, unsigned long arg)

> +{

> +	struct vfio_iommu *iommu = filep->private_data;

> +	int ret = -ENOSYS;


Any reason for not using "switch" ?

> +        if (cmd == VFIO_IOMMU_GET_FLAGS) {

> +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;

> +

> +                ret = put_user(flags, (u64 __user *)arg);

> +

> +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {

> +		struct vfio_dma_map dm;

> +

> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))

> +			return -EFAULT;


What does the "_dm" suffix stand for?

> +		ret = vfio_dma_map_dm(iommu, &dm);

> +

> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))

> +			ret = -EFAULT;

> +

> +	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {

> +		struct vfio_dma_map dm;

> +

> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))

> +			return -EFAULT;

> +

> +		ret = vfio_dma_unmap_dm(iommu, &dm);

> +

> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))

> +			ret = -EFAULT;

> +	}

> +	return ret;

> +}

> +

> +#ifdef CONFIG_COMPAT

> +static long vfio_iommu_compat_ioctl(struct file *filep,

> +				    unsigned int cmd, unsigned long arg)

> +{

> +	arg = (unsigned long)compat_ptr(arg);

> +	return vfio_iommu_unl_ioctl(filep, cmd, arg);

> +}

> +#endif	/* CONFIG_COMPAT */

> +

> +const struct file_operations vfio_iommu_fops = {

> +	.owner		= THIS_MODULE,

> +	.release	= vfio_iommu_release,

> +	.unlocked_ioctl	= vfio_iommu_unl_ioctl,

> +#ifdef CONFIG_COMPAT

> +	.compat_ioctl	= vfio_iommu_compat_ioctl,

> +#endif

> +};


/Chris
Konrad Rzeszutek Wilk - Nov. 11, 2011, 5:51 p.m.
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects.  See Documentation/vfio.txt included in
> this patch for user and kernel API description.
> 
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the drvier version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
> 
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends.  I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware.  Also many, many cleanups.
> Check the complete git history for details:
> 
> git://github.com/awilliam/linux-vfio.git vfio-ng
> 
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> 
> This version, along with the supporting VFIO PCI backend can
> be found here:
> 
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
> 
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes.  It's easy enough to set a bit in the
> group flags too indicate such support in the future, so I
> think we can move ahead without it.
> 
> Appreciate any feedback or suggestions.  Thanks,
> 
> Alex
> 
>  Documentation/ioctl/ioctl-number.txt |    1 
>  Documentation/vfio.txt               |  304 +++++++++
>  MAINTAINERS                          |    8 
>  drivers/Kconfig                      |    2 
>  drivers/Makefile                     |    1 
>  drivers/vfio/Kconfig                 |    8 
>  drivers/vfio/Makefile                |    3 
>  drivers/vfio/vfio_iommu.c            |  530 ++++++++++++++++
>  drivers/vfio/vfio_main.c             | 1151 ++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_private.h          |   34 +
>  include/linux/vfio.h                 |  155 +++++
>  11 files changed, 2197 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/vfio.txt
>  create mode 100644 drivers/vfio/Kconfig
>  create mode 100644 drivers/vfio/Makefile
>  create mode 100644 drivers/vfio/vfio_iommu.c
>  create mode 100644 drivers/vfio/vfio_main.c
>  create mode 100644 drivers/vfio/vfio_private.h
>  create mode 100644 include/linux/vfio.h
> 
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 54078ed..59d01e4 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
>  		and kernel/power/user.c
>  '8'	all				SNP8023 advanced NIC card
>  					<mailto:mcr@solidum.com>
> +';'	64-76	linux/vfio.h
>  '@'	00-0F	linux/radeonfb.h	conflict!
>  '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
>  'A'	00-1F	linux/apm_bios.h	conflict!
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..5866896
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,304 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown).  The VFIO driver
> +is an IOMMU/device agnostic framework for exposing direct device
> +access to userspace, in a secure, IOMMU protected environment.  In
> +other words, this allows safe, non-privileged, userspace drivers.
> +
> +Why do we want that?  Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance.  From a device and host perspective, this simply turns
> +the VM into a userspace driver, with the benefits of significantly
> +reduced latency, higher bandwidth, and direct use of bare-metal device
> +drivers[2].

Are there any constraints on running a 32-bit userspace with
a 64-bit kernel and with 32-bit user space drivers?

> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace.  Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.  Previous to VFIO, these drivers needed to
> +go through the full development cycle to become proper upstream driver,
> +be maintained out of tree, or make use of the UIO framework, which
> +has no notion of IOMMU protection, limited interrupt support, and
> +requires root privileges to access things like PCI configuration space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment currently used as well as provide
> +a more secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, IOMMUs, oh my

<chuckles> oh my, eh?

> +-------------------------------------------------------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system.  Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
> +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for
> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process.  Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
> +The VFIO representation of groups is created as devices are added into
> +the framework by a VFIO bus driver.  The vfio-pci module is an example
> +of a bus driver.  This module registers devices along with a set of bus
> +specific callbacks with the VFIO core.  These callbacks provide the
> +interfaces later used for device access.  As each new group is created,
> +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> +character device.
> +
> +In addition to the device enumeration and callbacks, the VFIO bus driver
> +also provides a traditional device driver and is able to bind to devices
> +on it's bus.  When a device is bound to the bus driver it's available to
> +VFIO.  When all the devices within a group are bound to their bus drivers,
> +the group becomes "viable" and a user with sufficient access to the VFIO
> +group chardev can obtain exclusive access to the set of group devices.
> +
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
> +
> +The last two ioctls return new file descriptors for accessing
> +individual devices within the group and programming the IOMMU.  Each of
> +these new file descriptors provide their own set of file interfaces.
> +These ioctls will fail if any of the devices within the group are not
> +bound to their VFIO bus driver.  Additionally, when either of these
> +interfaces are used, the group is then bound to the struct_mm of the
> +caller.  The GET_FLAGS ioctl can be used to view the state of the group.
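
(Purely to illustrate the flow described above -- the group number and
device name below are made up:)

	int group_fd, iommu_fd, dev_fd;
	__u64 flags;

	group_fd = open("/dev/vfio/26", O_RDWR);
	ioctl(group_fd, VFIO_GROUP_GET_FLAGS, &flags);
	if (!(flags & VFIO_GROUP_FLAGS_VIABLE))
		/* some device in the group isn't bound to its vfio bus driver */
		return -1;

	iommu_fd = ioctl(group_fd, VFIO_GROUP_GET_IOMMU_FD);
	dev_fd = ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");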
> +
> +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> +new IOMMU domain is created and all of the devices in the group are
> +attached to it.  This is the only way to ensure full IOMMU isolation
> +of the group, but potentially wastes resources and cycles if the user
> +intends to manage multiple groups with the same set of IOMMU mappings.
> +VFIO therefore provides a group MERGE and UNMERGE interface, which
> +allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
> +arbitrary groups to be merged, so the user should assume merging is
> +opportunistic.  A new group, with no open device or IOMMU file
> +descriptors, can be merged into an existing, in-use, group using the
> +MERGE ioctl.  A merged group can be unmerged using the UNMERGE ioctl
> +once all of the device file descriptors for the group being merged
> +"out" are closed.
> +
> +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> +essentially fungible between group file descriptors (ie. if device A
> +is in group X, and X is merged with Y, a file descriptor for A can be
> +retrieved using GET_DEVICE_FD on Y.  Likewise, GET_IOMMU_FD returns a
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y).  Merged groups can be dissolved either explictly with UNMERGE
> +or automatically when ALL file descriptors for the merged group are
> +closed (all IOMMUs, all devices, all groups).
> +
> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)

Coherency support is not going to be addressed right? What about sync?
Say you need to sync CPU to Device address?

> +
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> +        __u64   len;            /* length of structure */

What is the purpose of the 'len' field? Is it to guard against future
version changes?

> +        __u64   vaddr;          /* process virtual addr */
> +        __u64   dmaaddr;        /* desired and/or returned dma address */
> +        __u64   size;           /* size in bytes */
> +        __u64   flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */
> +};
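
(Again just as an illustration of the intended use -- the addresses are
made up, and everything must be page aligned as the code requires:)

	struct vfio_dma_map dm = {
		.len	 = sizeof(dm),
		.vaddr	 = (__u64)(unsigned long)buf,	/* page-aligned user buffer */
		.dmaaddr = 1024 * 1024,			/* made-up IOVA */
		.size	 = 1024 * 1024,
		.flags	 = VFIO_DMA_MAP_FLAG_WRITE,
	};

	ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);
	/* ... device does DMA ... */
	ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);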
> +
> +Current users of VFIO use relatively static DMA mappings, not requiring
> +high frequency turnover.  As new users are added, it's expected that the

Is there a limit to how many DMA mappings can be created?

> +IOMMU file descriptor will evolve to support new mapping interfaces, this
> +will be reflected in the flags and may present new ioctls and file
> +interfaces.
> +
> +The device GET_FLAGS ioctl is intended to return basic device type and
> +indicate support for optional capabilities.  Flags currently include whether
> +the device is PCI or described by Device Tree, and whether the RESET ioctl
> +is supported:

And reset in terms of PCIe spec is the FLR?

> +
> +#define VFIO_DEVICE_GET_FLAGS           _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI          (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT           (1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET        (1 << 2)
> +
> +The MMIO and IOP resources used by a device are described by regions.

IOP?

> +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> +
> +#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, int)

Don't want __u32?
> +
> +Regions are described by a struct vfio_region_info, which is retrieved by
> +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> +the desired region (0 based index).  Note that devices may implement zero
> +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> +mapping).

Huh?

> +
> +struct vfio_region_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* region number */
> +        __u64   size;           /* size in bytes of region */
> +        __u64   offset;         /* start offset of region */
> +        __u64   flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)

What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?
FLAG_RO is pretty obvious - presumably this is for firmware regions and such.
And PHYS_VALID is if the region is disabled for some reasons? If so
would the name FLAG_DISABLED be better?

> +        __u64   phys;           /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> +available access types and validity of optional fields.  For instance
> +the phys field may only be valid for certain devices types.
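
(A sketch of how a userspace driver would use this, per the description
above -- error checking omitted:)

	struct vfio_region_info reg = { .len = sizeof(reg), .index = 0 };
	__u32 val;
	void *base;

	ioctl(dev_fd, VFIO_DEVICE_GET_REGION_INFO, &reg);

	/* read 4 bytes from the start of region 0 (BAR0 for vfio-pci) */
	pread(dev_fd, &val, sizeof(val), reg.offset);

	/* or map it, if the region allows it */
	if (reg.flags & VFIO_REGION_INFO_FLAG_MMAP)
		base = mmap(NULL, reg.size, PROT_READ | PROT_WRITE,
			    MAP_SHARED, dev_fd, reg.offset);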
> +
> +Interrupts are described using a similar interface.  GET_NUM_IRQS
> +reports the number or IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)

_u32?

> +
> +struct vfio_irq_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* IRQ number */
> +        __u32   count;          /* number of individual IRQs */
> +        __u64   flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> +};
> +
> +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> +type to index mapping).

I am not really sure what that means.

> +
> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs.  This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */

Are eventfds u64 or u32?

Why not just define a structure?
struct vfio_irq_eventfds {
	__u32	index;
	__u32	count;
	__u64	eventfds[0]
};

How do you get an eventfd to feed in here?
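
(Reading the text above literally, the usage would presumably be
something like the following, with eventfd(2) supplying the descriptor
and the int array carrying index/count/fds:)

	int efd = eventfd(0, 0);
	int fds[3] = { 1 /* index */, 1 /* count */, efd };

	ioctl(dev_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, fds);
	/* later: poll()/read() the eventfd to consume interrupts */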

> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)

u32?
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host.  This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system.  After servicing the interrupt,
> +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
> +triggered interrupts implicitly have a count of 1 per index.

So they are enabled automatically? Meaning you don't even have to do
SET_IRQ_EVENTFDS b/c the count is set to 1?

> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ          _IOW(';', 114, int)

So this is for MSI as well? So if I have an index = 1, with count = 4,
doing an unmask IRQ will enable all the MSI events on the chip at once?

I guess there is not much point in enabling/disabling selective MSI
IRQs..
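
(So the service loop for a level-triggered interrupt would presumably
look like this, with efd being the eventfd registered earlier:)

	__u64 cnt;
	int index = 0;				/* level-triggered index */

	read(efd, &cnt, sizeof(cnt));		/* IRQ fired, host masked it */
	/* ... service the device ... */
	ioctl(dev_fd, VFIO_DEVICE_UNMASK_IRQ, &index);	/* allow retrigger */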

> +
> +Level triggered interrupts can also be unmasked using an irqfd.  Use

irqfd or eventfd?

> +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.

So only level triggered? Hmm, how do I know whether the device is
level or edge? Or is it that edge (MSI) can also be unmasked using the
eventfd?

> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD      _IOW(';', 115, int)
> +
> +When supported, as indicated by the device flags, reset the device.
> +
> +#define VFIO_DEVICE_RESET               _IO(';', 116)

Does it disable the 'count'? Err, does it disable the IRQ on the
device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
to set new eventfds? Or does it re-use the eventfds and the device
is enabled after this?


> +
> +Device tree devices also invlude ioctls for further defining the

include

> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> +        __u32   len;            /* length of structure */
> +        __u32   index;

0 based I presume?
> +        __u64   flags;
> +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)

What is region in this context?? Or would this make much more sense
if I knew what Device Tree actually is.

> +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> +        char    *path;

Ah, now I see why you want 'len' here.. But I am still at a loss
why you want that with the other structures.

> +};
> +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> +        __u32   len;            /* length of structure */
> +        __u32   index;
> +        __u32   prop_type;

Is that an enum type? Is this defined somewhere?
> +        __u32   prop_index;

What is the purpose of this field?

> +        __u64   flags;
> +#define VFIO_DTINDEX_FLAGS_REGION       (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ          (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX         _IOWR(';', 118, struct vfio_dtindex)
> +
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding

suspend/resume?

> +
> +When initialized, the bus driver should enumerate the devices on it's
> +bus and call vfio_group_add_dev() for each device.  If the bus supports
> +hotplug, notifiers should be enabled to track devices being added and
> +removed.  vfio_group_del_dev() removes a previously added device from
> +vfio.
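
(Sketching what that looks like for a hypothetical PCI bus driver --
vfio_pci_add_dev() and vfio_pci_nb are assumed helpers wrapping
vfio_group_add_dev()/vfio_group_del_dev():)

static int __init vfio_pci_init(void)
{
	int ret;

	ret = pci_register_driver(&vfio_pci_driver);
	if (ret)
		return ret;

	/* add devices already on the bus, then watch for hotplug */
	bus_for_each_dev(&pci_bus_type, NULL, NULL, vfio_pci_add_dev);
	bus_register_notifier(&pci_bus_type, &vfio_pci_nb);
	return 0;
}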
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:

Huh? So this gets created for _every_ 'struct device' that is added
to the VFIO bus? Is this structure exposed? Or is this an internal one?

> +
> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);
> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +When a device is bound to the bus driver, the bus driver indicates this
> +to vfio using the vfio_bind_dev() interface.  The device_data parameter

Might want to paste the function declaration for it.. b/c I am not sure
where the 'device_data' parameter is on the argument list.

> +is a pointer to an opaque data structure for use only by the bus driver.
> +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> +this data structure back to the bus driver.  When a device is unbound

Oh, so it is on the 'void *'.
> +from the bus driver, the vfio_unbind_dev() interface signals this to
> +vfio.  This function returns the pointer to the device_data structure

That function
> +registered for the device.

I am not really sure what this section's purpose is. Could this be part
of the header file or the code? It does not look to be part of the
ioctl API?
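
(Guessing from the description, the declarations presumably look roughly
like this -- the real ones are in include/linux/vfio.h, which isn't
quoted here:)

extern int vfio_bind_dev(struct device *dev,
			 const struct vfio_device_ops *ops,
			 void *device_data);
extern void *vfio_unbind_dev(struct device *dev);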

> +
> +As noted previously, a group contains one or more devices, so
> +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> +The vfio_device_ops.match callback is used to allow bus drivers to determine
> +the match.  For drivers like vfio-pci, it's a simple match to dev_name(),
> +which is unique in the system due to the PCI bus topology, other bus drivers
> +may need to include parent devices to create a unique match, so this is
> +left as a bus driver interface.
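
(So presumably vfio-pci's match callback is essentially:)

static bool vfio_pci_match(struct device *dev, char *buf)
{
	return !strcmp(dev_name(dev), buf);
}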
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> +initial implementation by Tom Lyon while as Cisco.  We've since outgrown
> +the acronym, but it's catchy.
> +
> +[2] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO.  It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f05f5f6..4bd5aa0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7106,6 +7106,14 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>  
> +VFIO DRIVER
> +M:	Alex Williamson <alex.williamson@redhat.com>
> +L:	kvm@vger.kernel.org

No vfio mailing list? Or a vfio-mailing list? 
> +S:	Maintained
> +F:	Documentation/vfio.txt
> +F:	drivers/vfio/
> +F:	include/linux/vfio.h
> +
>  VIDEOBUF2 FRAMEWORK
>  M:	Pawel Osciak <pawel@osciak.com>
>  M:	Marek Szyprowski <m.szyprowski@samsung.com>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index b5e6f24..e15578b 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
>  
>  source "drivers/uio/Kconfig"
>  
> +source "drivers/vfio/Kconfig"
> +
>  source "drivers/vlynq/Kconfig"
>  
>  source "drivers/virtio/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 1b31421..5f138b5 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM)		+= atm/
>  obj-$(CONFIG_FUSION)		+= message/
>  obj-y				+= firewire/
>  obj-$(CONFIG_UIO)		+= uio/
> +obj-$(CONFIG_VFIO)		+= vfio/
>  obj-y				+= cdrom/
>  obj-y				+= auxdisplay/
>  obj-$(CONFIG_PCCARD)		+= pcmcia/
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> +	tristate "VFIO Non-Privileged userspace driver framework"
> +	depends on IOMMU_API
> +	help
> +	  VFIO provides a framework for secure userspace device drivers.
> +	  See Documentation/vfio.txt for more details.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> new file mode 100644
> index 0000000..088faf1
> --- /dev/null
> +++ b/drivers/vfio/Makefile
> @@ -0,0 +1,3 @@
> +vfio-y := vfio_main.o vfio_iommu.o
> +
> +obj-$(CONFIG_VFIO) := vfio.o
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +
> +#include "vfio_private.h"
> +
> +struct dma_map_page {
> +	struct list_head	list;
> +	dma_addr_t		daddr;
> +	unsigned long		vaddr;
> +	int			npage;
> +	int			rdwr;

rdwr? Is this a flag thing? Could it be made in an enum?
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> +	struct mm_struct	*mm;
> +	int			npage;
> +	struct work_struct	work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> +	struct vwork *vwork = container_of(work, struct vwork, work);
> +	struct mm_struct *mm;
> +
> +	mm = vwork->mm;
> +	down_write(&mm->mmap_sem);
> +	mm->locked_vm += vwork->npage;
> +	up_write(&mm->mmap_sem);
> +	mmput(mm);		/* unref mm */
> +	kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> +	struct vwork *vwork;
> +	struct mm_struct *mm;
> +
> +	if (!current->mm) {
> +		/* process exited */
> +		return;
> +	}
> +	if (down_write_trylock(&current->mm->mmap_sem)) {
> +		current->mm->locked_vm += npage;
> +		up_write(&current->mm->mmap_sem);
> +		return;
> +	}
> +	/*
> +	 * Couldn't get mmap_sem lock, so must setup to decrement
> +	 * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> +	 * need this silliness
> +	 */
> +	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> +	if (!vwork)
> +		return;
> +	mm = get_task_mm(current);	/* take ref mm */
> +	if (!mm) {
> +		kfree(vwork);
> +		return;
> +	}
> +	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> +	vwork->mm = mm;
> +	vwork->npage = npage;
> +	schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device.  These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)

static bool

> +{
> +	if (pfn_valid(pfn)) {
> +		int reserved;
> +		struct page *tail = pfn_to_page(pfn);
> +		struct page *head = compound_trans_head(tail);
> +		reserved = PageReserved(head);

bool reserved = PageReserved(head);


> +		if (head != tail) {
> +			/* "head" is not a dangling pointer
> +			 * (compound_trans_head takes care of that)
> +			 * but the hugepage may have been split
> +			 * from under us (and we may not hold a
> +			 * reference count on the head page so it can
> +			 * be reused before we run PageReferenced), so
> +			 * we've to check PageTail before returning
> +			 * what we just read.
> +			 */
> +			smp_rmb();
> +			if (PageTail(tail))
> +				return reserved;
> +		}
> +		return PageReserved(tail);
> +	}
> +
> +	return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> +	if (!is_invalid_reserved_pfn(pfn)) {
> +		struct page *page = pfn_to_page(pfn);
> +		if (rdwr)
> +			SetPageDirty(page);
> +		put_page(page);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */

dgate?

> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			    int npage, int rdwr)
> +{
> +	int i, unlocked = 0;
> +
> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> +		unsigned long pfn;
> +
> +		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> +		if (pfn) {
> +			iommu_unmap(iommu->domain, iova, 0);

What is the '0' for? Perhaps a comment: /* We only do zero order */

> +			unlocked += put_pfn(pfn, rdwr);
> +		}
> +	}
> +	return unlocked;
> +}
> +
> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			   unsigned long npage, int rdwr)
> +{
> +	int unlocked;
> +
> +	unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> +	vfio_lock_acct(-unlocked);
> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos, *pos2;

pos2 should probably be just called 'tmp'

> +	struct dma_map_page *mlp;

What does 'mlp' stand for?

mlp -> dma_page ?

> +
> +	mutex_lock(&iommu->dgate);
> +	list_for_each_safe(pos, pos2, &iommu->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);

Uh, so if it did not get put_page() we would still try to delete it?
Couldn't that lead to corruption as the 'mlp' is returned to the pool?

Ah wait, the put_page is on the DMA page, so it is OK to
delete the tracking structure. It will be just a leaked page.
> +		list_del(&mlp->list);
> +		kfree(mlp);
> +	}
> +	mutex_unlock(&iommu->dgate);
> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> +{
> +	struct page *page[1];
> +	struct vm_area_struct *vma;
> +	int ret = -EFAULT;
> +
> +	if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> +		*pfn = page_to_pfn(page[0]);
> +		return 0;
> +	}
> +
> +	down_read(&current->mm->mmap_sem);
> +
> +	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> +	if (vma && vma->vm_flags & VM_PFNMAP) {
> +		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +		if (is_invalid_reserved_pfn(*pfn))
> +			ret = 0;

Did you mean to break here?

> +	}
> +
> +	up_read(&current->mm->mmap_sem);
> +
> +	return ret;
> +}
> +
> +/* Map DMA region */
> +/* dgate must be held */
> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> +			unsigned long vaddr, int npage, int rdwr)
> +{
> +	unsigned long start = iova;
> +	int i, ret, locked = 0, prot = IOMMU_READ;
> +
> +	/* Verify pages are not already mapped */

I think a 'that' is missing above.

> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> +		if (iommu_iova_to_phys(iommu->domain, iova))
> +			return -EBUSY;
> +
> +	iova = start;
> +
> +	if (rdwr)
> +		prot |= IOMMU_WRITE;
> +	if (iommu->cache)
> +		prot |= IOMMU_CACHE;
> +
> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> +		unsigned long pfn = 0;
> +
> +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> +		if (ret) {
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +
> +		/* Only add actual locked pages to accounting */
> +		if (!is_invalid_reserved_pfn(pfn))
> +			locked++;
> +
> +		ret = iommu_map(iommu->domain, iova,
> +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);

Put a comment by the 0 saying /* order 0 pages only! */

> +		if (ret) {
> +			/* Back out mappings on error */
> +			put_pfn(pfn, rdwr);
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +	}
> +	vfio_lock_acct(locked);
> +	return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,

Perhaps a bool?

> +				 unsigned long start2, size_t size2)
> +{
> +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> +}
> +
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> +					  dma_addr_t start, size_t size)
> +{
> +	struct list_head *pos;
> +	struct dma_map_page *mlp;
> +
> +	list_for_each(pos, &iommu->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +				   start, size))
> +			return mlp;
> +	}
> +	return NULL;
> +}
> +
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +			    size_t size, struct dma_map_page *mlp)
> +{
> +	struct dma_map_page *split;
> +	int npage_lo, npage_hi;
> +
> +	/* Existing dma region is completely covered, unmap all */
> +	if (start <= mlp->daddr &&
> +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +		list_del(&mlp->list);
> +		npage_lo = mlp->npage;
> +		kfree(mlp);
> +		return npage_lo;
> +	}
> +
> +	/* Overlap low address of existing range */
> +	if (start <= mlp->daddr) {
> +		size_t overlap;
> +
> +		overlap = start + size - mlp->daddr;
> +		npage_lo = overlap >> PAGE_SHIFT;
> +		npage_hi = mlp->npage - npage_lo;
> +
> +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> +		mlp->daddr += overlap;
> +		mlp->vaddr += overlap;
> +		mlp->npage -= npage_lo;
> +		return npage_lo;
> +	}
> +
> +	/* Overlap high address of existing range */
> +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +		size_t overlap;
> +
> +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> +		npage_hi = overlap >> PAGE_SHIFT;
> +		npage_lo = mlp->npage - npage_hi;
> +
> +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> +		mlp->npage -= npage_hi;
> +		return npage_hi;
> +	}
> +
> +	/* Split existing */
> +	npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> +	npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> +	split = kzalloc(sizeof *split, GFP_KERNEL);
> +	if (!split)
> +		return -ENOMEM;
> +
> +	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> +	mlp->npage = npage_lo;
> +
> +	split->npage = npage_hi;
> +	split->daddr = start + size;
> +	split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> +	split->rdwr = mlp->rdwr;
> +	list_add(&split->list, &iommu->dm_list);
> +	return size >> PAGE_SHIFT;
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +	int ret = 0;
> +	size_t npage = dmp->size >> PAGE_SHIFT;
> +	struct list_head *pos, *n;
> +
> +	if (dmp->dmaaddr & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (dmp->size & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->dgate);
> +
> +	list_for_each_safe(pos, n, &iommu->dm_list) {
> +		struct dma_map_page *mlp;
> +
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +				   dmp->dmaaddr, dmp->size)) {
> +			ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> +						      dmp->size, mlp);
> +			if (ret > 0)
> +				npage -= NPAGE_TO_SIZE(ret);
> +			if (ret < 0 || npage == 0)
> +				break;
> +		}
> +	}
> +	mutex_unlock(&iommu->dgate);
> +	return ret > 0 ? 0 : ret;
> +}
> +
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +	int npage;
> +	struct dma_map_page *mlp, *mmlp = NULL;
> +	dma_addr_t daddr = dmp->dmaaddr;
> +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> +	size_t size = dmp->size;
> +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> +
> +	if (vaddr & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (daddr & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (size & (PAGE_SIZE-1))
> +		return -EINVAL;
> +
> +	npage = size >> PAGE_SHIFT;
> +	if (!npage)
> +		return -EINVAL;
> +
> +	if (!iommu)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->dgate);
> +
> +	if (vfio_find_dma(iommu, daddr, size)) {
> +		ret = -EBUSY;
> +		goto out_lock;
> +	}
> +
> +	/* account for locked pages */
> +	locked = current->mm->locked_vm + npage;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> +			__func__, rlimit(RLIMIT_MEMLOCK));
> +		ret = -ENOMEM;
> +		goto out_lock;
> +	}
> +
> +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> +	if (ret)
> +		goto out_lock;
> +
> +	/* Check if we abut a region below */
> +	if (daddr) {
> +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> +		if (mlp && mlp->rdwr == rdwr &&
> +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> +
> +			mlp->npage += npage;
> +			daddr = mlp->daddr;
> +			vaddr = mlp->vaddr;
> +			npage = mlp->npage;
> +			size = NPAGE_TO_SIZE(npage);
> +
> +			mmlp = mlp;
> +		}
> +	}
> +
> +	if (daddr + size) {
> +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> +
> +			mlp->npage += npage;
> +			mlp->daddr = daddr;
> +			mlp->vaddr = vaddr;
> +
> +			/* If merged above and below, remove previously
> +			 * merged entry.  New entry covers it.  */
> +			if (mmlp) {
> +				list_del(&mmlp->list);
> +				kfree(mmlp);
> +			}
> +			mmlp = mlp;
> +		}
> +	}
> +
> +	if (!mmlp) {
> +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> +		if (!mlp) {
> +			ret = -ENOMEM;
> +			vfio_dma_unmap(iommu, daddr, npage, rdwr);
> +			goto out_lock;
> +		}
> +
> +		mlp->npage = npage;
> +		mlp->daddr = daddr;
> +		mlp->vaddr = vaddr;
> +		mlp->rdwr = rdwr;
> +		list_add(&mlp->list, &iommu->dm_list);
> +	}
> +
> +out_lock:
> +	mutex_unlock(&iommu->dgate);
> +	return ret;
> +}
> +
> +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_iommu *iommu = filep->private_data;
> +
> +	vfio_release_iommu(iommu);
> +	return 0;
> +}
> +
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_iommu *iommu = filep->private_data;
> +	int ret = -ENOSYS;
> +
> +        if (cmd == VFIO_IOMMU_GET_FLAGS) {

Something is weird with the tabbing here..

> +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> +                ret = put_user(flags, (u64 __user *)arg);
> +
> +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> +		struct vfio_dma_map dm;
> +
> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> +			return -EFAULT;
> +
> +		ret = vfio_dma_map_dm(iommu, &dm);
> +
> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> +			ret = -EFAULT;
> +
> +	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> +		struct vfio_dma_map dm;
> +
> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> +			return -EFAULT;
> +
> +		ret = vfio_dma_unmap_dm(iommu, &dm);
> +
> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> +			ret = -EFAULT;
> +	}
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_iommu_compat_ioctl(struct file *filep,
> +				    unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_iommu_unl_ioctl(filep, cmd, arg);
> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_iommu_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vfio_iommu_release,
> +	.unlocked_ioctl	= vfio_iommu_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_iommu_compat_ioctl,
> +#endif
> +};
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> new file mode 100644
> index 0000000..6169356
> --- /dev/null
> +++ b/drivers/vfio/vfio_main.c
> @@ -0,0 +1,1151 @@
> +/*
> + * VFIO framework
> + *
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/cdev.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/wait.h>
> +
> +#include "vfio_private.h"
> +
> +#define DRIVER_VERSION	"0.2"
> +#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> +#define DRIVER_DESC	"VFIO - User Level meta-driver"
> +
> +static int allow_unsafe_intrs;

__read_mostly
> +module_param(allow_unsafe_intrs, int, 0);

S_IRUGO ?

> +MODULE_PARM_DESC(allow_unsafe_intrs,
> +        "Allow use of IOMMUs which do not support interrupt remapping");
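
Combining the two suggestions, the parameter would read roughly like
this (a sketch, not the posted patch):

	static int allow_unsafe_intrs __read_mostly;
	module_param(allow_unsafe_intrs, int, S_IRUGO);
	MODULE_PARM_DESC(allow_unsafe_intrs,
		"Allow use of IOMMUs which do not support interrupt remapping");

That keeps the flag out of frequently written cachelines and makes the
current setting readable from sysfs.
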
> +
> +static struct vfio {
> +	dev_t			devt;
> +	struct cdev		cdev;
> +	struct list_head	group_list;
> +	struct mutex		lock;
> +	struct kref		kref;
> +	struct class		*class;
> +	struct idr		idr;
> +	wait_queue_head_t	release_q;
> +} vfio;

You probably want to move this below 'struct vfio_group',
as vfio contains the list of vfio_group entries.
> +
> +static const struct file_operations vfio_group_fops;
> +extern const struct file_operations vfio_iommu_fops;
> +
> +struct vfio_group {
> +	dev_t			devt;
> +	unsigned int		groupid;
> +	struct bus_type		*bus;
> +	struct vfio_iommu	*iommu;
> +	struct list_head	device_list;
> +	struct list_head	iommu_next;
> +	struct list_head	group_next;
> +	int			refcnt;
> +};
> +
> +struct vfio_device {
> +	struct device			*dev;
> +	const struct vfio_device_ops	*ops;
> +	struct vfio_iommu		*iommu;
> +	struct vfio_group		*group;
> +	struct list_head		device_next;
> +	bool				attached;
> +	int				refcnt;
> +	void				*device_data;
> +};

And perhaps move this above vfio_group, as vfio_group
contains a list of these structures.


> +
> +/*
> + * Helper functions called under vfio.lock
> + */
> +
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		if (device->refcnt)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/* Return true if any of the groups attached to an iommu are opened.
> + * We can only tear apart merged groups when nothing is left open. */
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +		if (group->refcnt)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/* An iommu is "in use" if it has a file descriptor open or if any of
> + * the groups assigned to the iommu have devices open. */
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (iommu->refcnt)
> +		return true;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		if (__vfio_group_devs_inuse(group))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> +				   struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (group->iommu)
> +		list_del(&group->iommu_next);
> +	if (iommu)
> +		list_add(&group->iommu_next, &iommu->group_list);
> +
> +	group->iommu = iommu;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		device->iommu = iommu;
> +	}
> +}
> +
> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> +				    struct vfio_device *device)
> +{
> +	BUG_ON(!iommu->domain && device->attached);

Whoa. Heavy hammer there.

Perhaps WARN_ON, since you already check the same condition just below.
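
A sketch of how that would read; the early return just below already
covers the condition, so warning and bailing out seems sufficient:

	WARN_ON(!iommu->domain && device->attached);

	if (!iommu->domain || !device->attached)
		return;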

> +
> +	if (!iommu->domain || !device->attached)
> +		return;
> +
> +	iommu_detach_device(iommu->domain, device->dev);
> +	device->attached = false;
> +}
> +
> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> +				      struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		__vfio_iommu_detach_dev(iommu, device);
> +	}
> +}
> +
> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> +				   struct vfio_device *device)
> +{
> +	int ret;
> +
> +	BUG_ON(device->attached);

How about:

WARN(device->attached, "The engineer who wrote the user-space device driver "
     "is trying to register the device again! Tell him/her to stop, please.\n");

(WARN() rather than WARN_ON(), since only the former takes a message.)

> +
> +	if (!iommu || !iommu->domain)
> +		return -EINVAL;
> +
> +	ret = iommu_attach_device(iommu->domain, device->dev);
> +	if (!ret)
> +		device->attached = true;
> +
> +	return ret;
> +}
> +
> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> +				     struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +		int ret;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		ret = __vfio_iommu_attach_dev(iommu, device);
> +		if (ret) {
> +			__vfio_iommu_detach_group(iommu, group);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/* The iommu is viable, ie. ready to be configured, when all the devices
> + * for all the groups attached to the iommu are bound to their vfio device
> + * drivers (ex. vfio-pci).  This sets the device_data private data pointer. */
> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> +{
> +	struct list_head *gpos, *dpos;
> +
> +	list_for_each(gpos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (!device->device_data)
> +				return false;
> +		}
> +	}
> +	return true;
> +}
> +
> +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (!iommu->domain)
> +		return;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		__vfio_iommu_detach_group(iommu, group);
> +	}
> +
> +	vfio_iommu_unmapall(iommu);
> +
> +	iommu_domain_free(iommu->domain);
> +	iommu->domain = NULL;
> +	iommu->mm = NULL;
> +}
> +
> +/* Open the IOMMU.  This gates all access to the iommu or device file
> + * descriptors and sets current->mm as the exclusive user. */
> +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +	int ret;
> +
> +	if (!__vfio_iommu_viable(iommu))
> +		return -EBUSY;
> +
> +	if (iommu->domain)
> +		return -EINVAL;
> +
> +	iommu->domain = iommu_domain_alloc(iommu->bus);
> +	if (!iommu->domain)
> +		return -EFAULT;

ENOMEM?

> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		ret = __vfio_iommu_attach_group(iommu, group);
> +		if (ret) {
> +			__vfio_close_iommu(iommu);
> +			return ret;
> +		}
> +	}
> +
> +	if (!allow_unsafe_intrs &&
> +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> +		__vfio_close_iommu(iommu);
> +		return -EFAULT;
> +	}
> +
> +	iommu->cache = (iommu_domain_has_cap(iommu->domain,
> +					     IOMMU_CAP_CACHE_COHERENCY) != 0);
> +	iommu->mm = current->mm;
> +
> +	return 0;
> +}
> +
> +/* Actively try to tear down the iommu and merged groups.  If there are no
> + * open iommu or device fds, we close the iommu.  If we close the iommu and
> + * there are also no open group fds, we can futher dissolve the group to
> + * iommu association and free the iommu data structure. */
> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> +{
> +
> +	if (__vfio_iommu_inuse(iommu))
> +		return -EBUSY;
> +
> +	__vfio_close_iommu(iommu);
> +
> +	if (!__vfio_iommu_groups_inuse(iommu)) {
> +		struct list_head *pos, *ppos;
> +
> +		list_for_each_safe(pos, ppos, &iommu->group_list) {
> +			struct vfio_group *group;
> +
> +			group = list_entry(pos, struct vfio_group, iommu_next);
> +			__vfio_group_set_iommu(group, NULL);
> +		}
> +
> +
> +		kfree(iommu);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> +{
> +	struct list_head *gpos;
> +	unsigned int groupid;
> +
> +	if (iommu_device_group(dev, &groupid))

Hmm, where is this defined? v3.2-rc1 does not seem to have it?

> +		return NULL;
> +
> +	list_for_each(gpos, &vfio.group_list) {
> +		struct vfio_group *group;
> +		struct list_head *dpos;
> +
> +		group = list_entry(gpos, struct vfio_group, group_next);
> +
> +		if (group->groupid != groupid)
> +			continue;
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (device->dev == dev)
> +				return device;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +/* All release paths simply decrement the refcnt, attempt to teardown
> + * the iommu and merged groups, and wakeup anything that might be
> + * waiting if we successfully dissolve anything. */
> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> +{
> +	bool wake;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	(*refcnt)--;
> +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> +
> +	mutex_unlock(&vfio.lock);
> +
> +	if (wake)
> +		wake_up(&vfio.release_q);
> +
> +	return 0;
> +}
> +
> +/*
> + * Device fops - passthrough to vfio device driver w/ device_data
> + */
> +static int vfio_device_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	vfio_do_release(&device->refcnt, device->iommu);
> +
> +	device->ops->put(device->device_data);
> +
> +	return 0;
> +}
> +
> +static long vfio_device_unl_ioctl(struct file *filep,
> +				  unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->ioctl(device->device_data, cmd, arg);
> +}
> +
> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->read(device->device_data, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->write(device->device_data, buf, count, ppos);
> +}
> +
> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->mmap(device->device_data, vma);
> +}
> +	
> +#ifdef CONFIG_COMPAT
> +static long vfio_device_compat_ioctl(struct file *filep,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_device_unl_ioctl(filep, cmd, arg);
> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_device_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vfio_device_release,
> +	.read		= vfio_device_read,
> +	.write		= vfio_device_write,
> +	.unlocked_ioctl	= vfio_device_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_device_compat_ioctl,
> +#endif
> +	.mmap		= vfio_device_mmap,
> +};
> +
> +/*
> + * Group fops
> + */
> +static int vfio_group_open(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_group *group;
> +	int ret = 0;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	group = idr_find(&vfio.idr, iminor(inode));
> +
> +	if (!group) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	filep->private_data = group;
> +
> +	if (!group->iommu) {
> +		struct vfio_iommu *iommu;
> +
> +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> +		if (!iommu) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		INIT_LIST_HEAD(&iommu->group_list);
> +		INIT_LIST_HEAD(&iommu->dm_list);
> +		mutex_init(&iommu->dgate);
> +		iommu->bus = group->bus;
> +		__vfio_group_set_iommu(group, iommu);
> +	}
> +	group->refcnt++;
> +
> +out:
> +	mutex_unlock(&vfio.lock);
> +
> +	return ret;
> +}
> +
> +static int vfio_group_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_group *group = filep->private_data;
> +
> +	return vfio_do_release(&group->refcnt, group->iommu);
> +}
> +
> +/* Attempt to merge the group pointed to by fd into group.  The merge-ee
> + * group must not have an iommu or any devices open because we cannot
> + * maintain that context across the merge.  The merge-er group can be
> + * in use. */
> +static int vfio_group_merge(struct vfio_group *group, int fd)
> +{
> +	struct vfio_group *new;
> +	struct vfio_iommu *old_iommu;
> +	struct file *file;
> +	int ret = 0;
> +	bool opened = false;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	file = fget(fd);
> +	if (!file) {
> +		ret = -EBADF;
> +		goto out_noput;
> +	}
> +
> +	/* Sanity check, is this really our fd? */
> +	if (file->f_op != &vfio_group_fops) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	new = file->private_data;
> +
> +	if (!new || new == group || !new->iommu ||
> +	    new->iommu->domain || new->bus != group->bus) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We need to attach all the devices to each domain separately
> +	 * in order to validate that the capabilities match for both.  */
> +	ret = __vfio_open_iommu(new->iommu);
> +	if (ret)
> +		goto out;
> +
> +	if (!group->iommu->domain) {
> +		ret = __vfio_open_iommu(group->iommu);
> +		if (ret)
> +			goto out;
> +		opened = true;
> +	}
> +
> +	/* If cache coherency doesn't match we'd potentialy need to
> +	 * remap existing iommu mappings in the merge-er domain.
> +	 * Poor return to bother trying to allow this currently. */
> +	if (iommu_domain_has_cap(group->iommu->domain,
> +				 IOMMU_CAP_CACHE_COHERENCY) !=
> +	    iommu_domain_has_cap(new->iommu->domain,
> +				 IOMMU_CAP_CACHE_COHERENCY)) {
> +		__vfio_close_iommu(new->iommu);
> +		if (opened)
> +			__vfio_close_iommu(group->iommu);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Close the iommu for the merge-ee and attach all its devices
> +	 * to the merge-er iommu. */
> +	__vfio_close_iommu(new->iommu);
> +
> +	ret = __vfio_iommu_attach_group(group->iommu, new);
> +	if (ret)
> +		goto out;
> +
> +	/* set_iommu unlinks new from the iommu, so save a pointer to it */
> +	old_iommu = new->iommu;
> +	__vfio_group_set_iommu(new, group->iommu);
> +	kfree(old_iommu);
> +
> +out:
> +	fput(file);
> +out_noput:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Unmerge the group pointed to by fd from group. */
> +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> +{
> +	struct vfio_group *new;
> +	struct vfio_iommu *new_iommu;
> +	struct file *file;
> +	int ret = 0;
> +
> +	/* Since the merge-out group is already opened, it needs to
> +	 * have an iommu struct associated with it. */
> +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> +	if (!new_iommu)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&new_iommu->group_list);
> +	INIT_LIST_HEAD(&new_iommu->dm_list);
> +	mutex_init(&new_iommu->dgate);
> +	new_iommu->bus = group->bus;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	file = fget(fd);
> +	if (!file) {
> +		ret = -EBADF;
> +		goto out_noput;
> +	}
> +
> +	/* Sanity check, is this really our fd? */
> +	if (file->f_op != &vfio_group_fops) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	new = file->private_data;
> +	if (!new || new == group || new->iommu != group->iommu) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We can't merge-out a group with devices still in use. */
> +	if (__vfio_group_devs_inuse(new)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	__vfio_iommu_detach_group(group->iommu, new);
> +	__vfio_group_set_iommu(new, new_iommu);
> +
> +out:
> +	fput(file);
> +out_noput:
> +	if (ret)
> +		kfree(new_iommu);
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Get a new iommu file descriptor.  This will open the iommu, setting
> + * the current->mm ownership if it's not already set. */
> +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (!group->iommu->domain) {
> +		ret = __vfio_open_iommu(group->iommu);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> +			       group->iommu, O_RDWR);
> +	if (ret < 0)
> +		goto out;
> +
> +	group->iommu->refcnt++;
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Get a new device file descriptor.  This will open the iommu, setting
> + * the current->mm ownership if it's not already set.  It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match.  For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> +	struct vfio_iommu *iommu = group->iommu;
> +	struct list_head *gpos;
> +	int ret = -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (!iommu->domain) {
> +		ret = __vfio_open_iommu(iommu);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	list_for_each(gpos, &iommu->group_list) {
> +		struct list_head *dpos;
> +
> +		group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (device->ops->match(device->dev, buf)) {
> +				struct file *file;
> +
> +				if (device->ops->get(device->device_data)) {
> +					ret = -EFAULT;
> +					goto out;
> +				}
> +
> +				/* We can't use anon_inode_getfd(), like above
> +				 * because we need to modify the f_mode flags
> +				 * directly to allow more than just ioctls */
> +				ret = get_unused_fd();
> +				if (ret < 0) {
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}
> +
> +				file = anon_inode_getfile("[vfio-device]",
> +							  &vfio_device_fops,
> +							  device, O_RDWR);
> +				if (IS_ERR(file)) {
> +					put_unused_fd(ret);
> +					ret = PTR_ERR(file);
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}
> +
> +				/* Todo: add an anon_inode interface to do
> +				 * this.  Appears to be missing by lack of
> +				 * need rather than explicitly prevented.
> +				 * Now there's need. */
> +				file->f_mode |= (FMODE_LSEEK |
> +						 FMODE_PREAD |
> +						 FMODE_PWRITE);
> +
> +				fd_install(ret, file);
> +
> +				device->refcnt++;
> +				goto out;
> +			}
> +		}
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +static long vfio_group_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_group *group = filep->private_data;
> +
> +	if (cmd == VFIO_GROUP_GET_FLAGS) {
> +		u64 flags = 0;
> +
> +		mutex_lock(&vfio.lock);
> +		if (__vfio_iommu_viable(group->iommu))
> +			flags |= VFIO_GROUP_FLAGS_VIABLE;
> +		mutex_unlock(&vfio.lock);
> +
> +		if (group->iommu->mm)
> +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> +
> +		return put_user(flags, (u64 __user *)arg);
> +	}
> +		
> +	/* Below commands are restricted once the mm is set */
> +	if (group->iommu->mm && group->iommu->mm != current->mm)
> +		return -EPERM;
> +
> +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> +		int fd;
> +		
> +		if (get_user(fd, (int __user *)arg))
> +			return -EFAULT;
> +		if (fd < 0)
> +			return -EINVAL;
> +
> +		if (cmd == VFIO_GROUP_MERGE)
> +			return vfio_group_merge(group, fd);
> +		else
> +			return vfio_group_unmerge(group, fd);
> +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> +		return vfio_group_get_iommu_fd(group);
> +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> +		char *buf;
> +		int ret;
> +
> +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> +		if (IS_ERR(buf))
> +			return PTR_ERR(buf);
> +
> +		ret = vfio_group_get_device_fd(group, buf);
> +		kfree(buf);
> +		return ret;
> +	}
> +
> +	return -ENOSYS;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_group_compat_ioctl(struct file *filep,
> +				    unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_group_unl_ioctl(filep, cmd, arg);
> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +static const struct file_operations vfio_group_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vfio_group_open,
> +	.release	= vfio_group_release,
> +	.unlocked_ioctl	= vfio_group_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_group_compat_ioctl,
> +#endif
> +};
> +
> +/* iommu fd release hook */
> +int vfio_release_iommu(struct vfio_iommu *iommu)
> +{
> +	return vfio_do_release(&iommu->refcnt, iommu);
> +}
> +
> +/*
> + * VFIO driver API
> + */
> +
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks.  This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> +	struct list_head *pos;
> +	struct vfio_group *group = NULL;
> +	struct vfio_device *device = NULL;
> +	unsigned int groupid;
> +	int ret = 0;
> +	bool new_group = false;
> +
> +	if (!ops)
> +		return -EINVAL;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	list_for_each(pos, &vfio.group_list) {
> +		group = list_entry(pos, struct vfio_group, group_next);
> +		if (group->groupid == groupid)
> +			break;
> +		group = NULL;
> +	}
> +
> +	if (!group) {
> +		int minor;
> +
> +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> +		if (!group) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group->groupid = groupid;
> +		INIT_LIST_HEAD(&group->device_list);
> +
> +		ret = idr_get_new(&vfio.idr, group, &minor);
> +		if (ret == 0 && minor > MINORMASK) {
> +			idr_remove(&vfio.idr, minor);
> +			kfree(group);
> +			ret = -ENOSPC;
> +			goto out;
> +		}
> +
> +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> +		device_create(vfio.class, NULL, group->devt,
> +			      group, "%u", groupid);
> +
> +		group->bus = dev->bus;


Oh, so that is how the IOMMU iommu_ops get copied! You might
want to mention that - I was not sure where the 'handoff' was
done to insert a device so that it can use iommu_ops properly.

OK, so the point where we find out whether a device can do IOMMU
is when we try to open it - that is when iommu_domain_alloc() is
called, which can return NULL if iommu_ops is not set.

So what about devices that don't have iommu_ops? Say they
are using SWIOTLB? (Like AMD-Vi sometimes does if the
device is not on its list.)

Can we use iommu_present?
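
If so, a guard along these lines in vfio_group_add_dev() might do it
(a sketch, assuming the iommu_present(struct bus_type *) helper from
the current IOMMU API):

	/* No iommu_ops registered for this bus -> no isolation possible */
	if (!iommu_present(dev->bus))
		return -ENODEV;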

> +		list_add(&group->group_next, &vfio.group_list);
> +		new_group = true;
> +	} else {
> +		if (group->bus != dev->bus) {
> +			printk(KERN_WARNING
> +			       "Error: IOMMU group ID conflict.  Group ID %u "
> +				"on both bus %s and %s\n", groupid,
> +				group->bus->name, dev->bus->name);
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		list_for_each(pos, &group->device_list) {
> +			device = list_entry(pos,
> +					    struct vfio_device, device_next);
> +			if (device->dev == dev)
> +				break;
> +			device = NULL;
> +		}
> +	}
> +
> +	if (!device) {
> +		if (__vfio_group_devs_inuse(group) ||
> +		    (group->iommu && group->iommu->refcnt)) {
> +			printk(KERN_WARNING
> +			       "Adding device %s to group %u while group is already in use!!\n",
> +			       dev_name(dev), group->groupid);
> +			/* XXX How to prevent other drivers from claiming? */
> +		}
> +
> +		device = kzalloc(sizeof(*device), GFP_KERNEL);
> +		if (!device) {
> +			/* If we just created this group, tear it down */
> +			if (new_group) {
> +				list_del(&group->group_next);
> +				device_destroy(vfio.class, group->devt);
> +				idr_remove(&vfio.idr, MINOR(group->devt));
> +				kfree(group);
> +			}
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		list_add(&device->device_next, &group->device_list);
> +		device->dev = dev;
> +		device->ops = ops;
> +		device->iommu = group->iommu; /* NULL if new */
> +		__vfio_iommu_attach_dev(group->iommu, device);
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> +
> +/* Remove a device from the vfio framework */
> +void vfio_group_del_dev(struct device *dev)
> +{
> +	struct list_head *pos;
> +	struct vfio_group *group = NULL;
> +	struct vfio_device *device = NULL;
> +	unsigned int groupid;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	list_for_each(pos, &vfio.group_list) {
> +		group = list_entry(pos, struct vfio_group, group_next);
> +		if (group->groupid == groupid)
> +			break;
> +		group = NULL;
> +	}
> +
> +	if (!group)
> +		goto out;
> +
> +	list_for_each(pos, &group->device_list) {
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		if (device->dev == dev)
> +			break;
> +		device = NULL;
> +	}
> +
> +	if (!device)
> +		goto out;
> +
> +	BUG_ON(device->refcnt);
> +
> +	if (device->attached)
> +		__vfio_iommu_detach_dev(group->iommu, device);
> +
> +	list_del(&device->device_next);
> +	kfree(device);
> +
> +	/* If this was the only device in the group, remove the group.
> +	 * Note that we intentionally unmerge empty groups here if the
> +	 * group fd isn't opened. */
> +	if (list_empty(&group->device_list) && group->refcnt == 0) {
> +		struct vfio_iommu *iommu = group->iommu;
> +
> +		if (iommu) {
> +			__vfio_group_set_iommu(group, NULL);
> +			__vfio_try_dissolve_iommu(iommu);
> +		}
> +
> +		device_destroy(vfio.class, group->devt);
> +		idr_remove(&vfio.idr, MINOR(group->devt));
> +		list_del(&group->group_next);
> +		kfree(group);
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> +
> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> + * entry point is used to mark the device usable (viable).  The vfio
> + * device driver associates a private device_data struct with the device
> + * here, which will later be return for vfio_device_fops callbacks. */
> +int vfio_bind_dev(struct device *dev, void *device_data)
> +{
> +	struct vfio_device *device;
> +	int ret = -EINVAL;
> +
> +	BUG_ON(!device_data);
> +
> +	mutex_lock(&vfio.lock);
> +
> +	device = __vfio_lookup_dev(dev);
> +
> +	BUG_ON(!device);
> +
> +	ret = dev_set_drvdata(dev, device);
> +	if (!ret)
> +		device->device_data = device_data;
> +
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> +
> +/* A device is only removeable if the iommu for the group is not in use. */
> +static bool vfio_device_removeable(struct vfio_device *device)
> +{
> +	bool ret = true;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (device->iommu && __vfio_iommu_inuse(device->iommu))
> +		ret = false;
> +
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Notify vfio that a device is being unbound from the vfio device driver
> + * and return the device private device_data pointer.  If the group is
> + * in use, we need to block or take other measures to make it safe for
> + * the device to be removed from the iommu. */
> +void *vfio_unbind_dev(struct device *dev)
> +{
> +	struct vfio_device *device = dev_get_drvdata(dev);
> +	void *device_data;
> +
> +	BUG_ON(!device);
> +
> +again:
> +	if (!vfio_device_removeable(device)) {
> +		/* XXX signal for all devices in group to be removed or
> +		 * resort to killing the process holding the device fds.
> +		 * For now just block waiting for releases to wake us. */
> +		wait_event(vfio.release_q, vfio_device_removeable(device));
> +	}
> +
> +	mutex_lock(&vfio.lock);
> +
> +	/* Need to re-check that the device is still removeable under lock. */
> +	if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> +		mutex_unlock(&vfio.lock);
> +		goto again;
> +	}
> +
> +	device_data = device->device_data;
> +
> +	device->device_data = NULL;
> +	dev_set_drvdata(dev, NULL);
> +
> +	mutex_unlock(&vfio.lock);
> +	return device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> +
> +/*
> + * Module/class support
> + */
> +static void vfio_class_release(struct kref *kref)
> +{
> +	class_destroy(vfio.class);
> +	vfio.class = NULL;
> +}
> +
> +static char *vfio_devnode(struct device *dev, mode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> +}
> +
> +static int __init vfio_init(void)
> +{
> +	int ret;
> +
> +	idr_init(&vfio.idr);
> +	mutex_init(&vfio.lock);
> +	INIT_LIST_HEAD(&vfio.group_list);
> +	init_waitqueue_head(&vfio.release_q);
> +
> +	kref_init(&vfio.kref);
> +	vfio.class = class_create(THIS_MODULE, "vfio");
> +	if (IS_ERR(vfio.class)) {
> +		ret = PTR_ERR(vfio.class);
> +		goto err_class;
> +	}
> +
> +	vfio.class->devnode = vfio_devnode;
> +
> +	/* FIXME - how many minors to allocate... all of them! */
> +	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> +	if (ret)
> +		goto err_chrdev;
> +
> +	cdev_init(&vfio.cdev, &vfio_group_fops);
> +	ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> +	if (ret)
> +		goto err_cdev;
> +
> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> +	return 0;
> +
> +err_cdev:
> +	unregister_chrdev_region(vfio.devt, MINORMASK);
> +err_chrdev:
> +	kref_put(&vfio.kref, vfio_class_release);
> +err_class:
> +	return ret;
> +}
> +
> +static void __exit vfio_cleanup(void)
> +{
> +	struct list_head *gpos, *gppos;
> +
> +	list_for_each_safe(gpos, gppos, &vfio.group_list) {
> +		struct vfio_group *group;
> +		struct list_head *dpos, *dppos;
> +
> +		group = list_entry(gpos, struct vfio_group, group_next);
> +
> +		list_for_each_safe(dpos, dppos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +			vfio_group_del_dev(device->dev);
> +		}
> +	}
> +
> +	idr_destroy(&vfio.idr);
> +	cdev_del(&vfio.cdev);
> +	unregister_chrdev_region(vfio.devt, MINORMASK);
> +	kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> +	struct iommu_domain		*domain;
> +	struct bus_type			*bus;
> +	struct mutex			dgate;
> +	struct list_head		dm_list;
> +	struct mm_struct		*mm;
> +	struct list_head		group_list;
> +	int				refcnt;
> +	bool				cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);
> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> +			      const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS		_IOR(';', 100, __u64)

> + #define VFIO_GROUP_FLAGS_VIABLE	(1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
> +#define VFIO_GROUP_MERGE		_IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE		_IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD		_IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 104, char *)
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> +	__u64	len;		/* length of structure */
> +	__u64	vaddr;		/* process virtual addr */
> +	__u64	dmaaddr;	/* desired and/or returned dma address */
> +	__u64	size;		/* size in bytes */
> +	__u64	flags;
> +#define	VFIO_DMA_MAP_FLAG_WRITE		(1 << 0) /* req writeable DMA mem */
> +};
> +
> +#define	VFIO_IOMMU_GET_FLAGS		_IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY	(1 << 0)
> +#define	VFIO_IOMMU_MAP_DMA		_IOWR(';', 106, struct vfio_dma_map)
> +#define	VFIO_IOMMU_UNMAP_DMA		_IOWR(';', 107, struct vfio_dma_map)
> +
> +#define VFIO_DEVICE_GET_FLAGS		_IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI		(1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT		(1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET	(1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS	_IOR(';', 109, int)
> +
> +struct vfio_region_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* region number */
> +	__u64	size;		/* size in bytes of region */
> +	__u64	offset;		/* start offset of region */
> +	__u64	flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP		(1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO		(1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID	(1 << 2)
> +	__u64	phys;		/* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO	_IOWR(';', 110, struct vfio_region_info)
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS	_IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* IRQ number */
> +	__u32	count;		/* number of individual IRQs */
> +	__u32	flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL		(1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO	_IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS	_IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ		_IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD	_IOW(';', 115, int)
> +
> +#define VFIO_DEVICE_RESET		_IO(';', 116)
> +
> +struct vfio_dtpath {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u64	flags;
> +#define VFIO_DTPATH_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ		(1 << 1)
> +	char	*path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH		_IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u32	prop_type;
> +	__u32	prop_index;
> +	__u64	flags;
> +#define VFIO_DTINDEX_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ		(1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX		_IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */


So where is the vfio-pci? Is that a separate posting?
Alex Williamson - Nov. 11, 2011, 6:04 p.m.
On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> Here are few minor comments on vfio_iommu.c ...

Sorry, I've been poking sticks at trying to figure out a clean way to
solve the force vfio driver attach problem.

> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
<snip>
> > +
> > +#include "vfio_private.h"
> 
> Doesn't the 'dma_'  prefix belong to the generic DMA code?

Sure, we could make these more vfio-centric.

> > +struct dma_map_page {
> > +	struct list_head	list;
> > +	dma_addr_t		daddr;
> > +	unsigned long		vaddr;
> > +	int			npage;
> > +	int			rdwr;
> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > +	struct mm_struct	*mm;
> > +	int			npage;
> > +	struct work_struct	work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > +	struct vwork *vwork = container_of(work, struct vwork, work);
> > +	struct mm_struct *mm;
> > +
> > +	mm = vwork->mm;
> > +	down_write(&mm->mmap_sem);
> > +	mm->locked_vm += vwork->npage;
> > +	up_write(&mm->mmap_sem);
> > +	mmput(mm);		/* unref mm */
> > +	kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > +	struct vwork *vwork;
> > +	struct mm_struct *mm;
> > +
> > +	if (!current->mm) {
> > +		/* process exited */
> > +		return;
> > +	}
> > +	if (down_write_trylock(&current->mm->mmap_sem)) {
> > +		current->mm->locked_vm += npage;
> > +		up_write(&current->mm->mmap_sem);
> > +		return;
> > +	}
> > +	/*
> > +	 * Couldn't get mmap_sem lock, so must setup to decrement
>                                                       ^^^^^^^^^
> 
> Increment?

Yep

<snip>
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > start,
> > +			    size_t size, struct dma_map_page *mlp)
> > +{
> > +	struct dma_map_page *split;
> > +	int npage_lo, npage_hi;
> > +
> > +	/* Existing dma region is completely covered, unmap all */
> 
> This works. However, given how vfio_dma_map_dm implements the merging
> logic, I think it is impossible to have
> 
>     (start < mlp->daddr &&
>      start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))

It's quite possible.  This allows userspace to create a sparse mapping,
then blow it all away with a single unmap from 0 to ~0.

> > +	if (start <= mlp->daddr &&
> > +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +		list_del(&mlp->list);
> > +		npage_lo = mlp->npage;
> > +		kfree(mlp);
> > +		return npage_lo;
> > +	}
> > +
> > +	/* Overlap low address of existing range */
> 
> Same as above (ie, '<' is impossible)

existing:   |<--- A --->|      |<--- B --->|
unmap:                |<--- C --->|

Maybe not good practice from userspace, but we shouldn't count on
userspace to be well behaved.
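
From userspace that scenario is easy to hit; a sketch using the ioctls
and struct vfio_dma_map from this patch (assumes <sys/ioctl.h> and the
new <linux/vfio.h>, iommu_fd obtained via VFIO_GROUP_GET_IOMMU_FD, and
buf0/buf1 page-aligned buffers):

	struct vfio_dma_map dm = { .len = sizeof dm };

	/* A: IOVA [0x100000, 0x102000) */
	dm.vaddr = (__u64)(unsigned long)buf0;
	dm.dmaaddr = 0x100000;
	dm.size = 0x2000;
	dm.flags = VFIO_DMA_MAP_FLAG_WRITE;
	ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);

	/* B: IOVA [0x300000, 0x302000) */
	dm.vaddr = (__u64)(unsigned long)buf1;
	dm.dmaaddr = 0x300000;
	ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);

	/* C: one unmap covering the top page of A, the gap, and the
	 * bottom page of B */
	dm.dmaaddr = 0x101000;
	dm.size = 0x200000;
	ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);

C overlaps only the high end of A and only the low end of B, so both
partial-overlap paths get exercised even though each map call was well
formed.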

> > +	if (start <= mlp->daddr) {
> > +		size_t overlap;
> > +
> > +		overlap = start + size - mlp->daddr;
> > +		npage_lo = overlap >> PAGE_SHIFT;
> > +		npage_hi = mlp->npage - npage_lo;
> > +
> > +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +		mlp->daddr += overlap;
> > +		mlp->vaddr += overlap;
> > +		mlp->npage -= npage_lo;
> > +		return npage_lo;
> > +	}
> 
> Same as above (ie, '>' is impossible).

Same example as above.

> > +	/* Overlap high address of existing range */
> > +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		size_t overlap;
> > +
> > +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +		npage_hi = overlap >> PAGE_SHIFT;
> > +		npage_lo = mlp->npage - npage_hi;
> > +
> > +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +		mlp->npage -= npage_hi;
> > +		return npage_hi;
> > +	}
<snip>
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map
> > *dmp)
> > +{
> > +	int npage;
> > +	struct dma_map_page *mlp, *mmlp = NULL;
> > +	dma_addr_t daddr = dmp->dmaaddr;
> > +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > +	size_t size = dmp->size;
> > +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > +	if (vaddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (daddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (size & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +
> > +	npage = size >> PAGE_SHIFT;
> > +	if (!npage)
> > +		return -EINVAL;
> > +
> > +	if (!iommu)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +
> > +	if (vfio_find_dma(iommu, daddr, size)) {
> > +		ret = -EBUSY;
> > +		goto out_lock;
> > +	}
> > +
> > +	/* account for locked pages */
> > +	locked = current->mm->locked_vm + npage;
> > +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > +			__func__, rlimit(RLIMIT_MEMLOCK));
> > +		ret = -ENOMEM;
> > +		goto out_lock;
> > +	}
> > +
> > +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > +	if (ret)
> > +		goto out_lock;
> > +
> > +	/* Check if we abut a region below */
> 
> Is !daddr possible?

Sure, an IOVA of 0x0.  There's no region below if we start at zero.

> > +	if (daddr) {
> > +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > +		if (mlp && mlp->rdwr == rdwr &&
> > +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > +			mlp->npage += npage;
> > +			daddr = mlp->daddr;
> > +			vaddr = mlp->vaddr;
> > +			npage = mlp->npage;
> > +			size = NPAGE_TO_SIZE(npage);
> > +
> > +			mmlp = mlp;
> > +		}
> > +	}
> 
> Is !(daddr + size) possible?

Same, there's no region above if this region goes to the top of the
address space, ie. 0xffffffff_fffff000 + 0x1000

Hmm, wonder if I'm missing a check for wrapping.
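
If it is missing, a check this simple at the top of vfio_dma_map_dm()
would probably cover it (sketch only; the daddr + size == 0 case, a
mapping ending exactly at the top of the address space, stays legal
since the merge check below relies on it):

	if (daddr + size < daddr && daddr + size != 0)
		return -EINVAL;	/* IOVA range wraps past the top */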

> > +	if (daddr + size) {
> > +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> > +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size)
> > {
> > +
> > +			mlp->npage += npage;
> > +			mlp->daddr = daddr;
> > +			mlp->vaddr = vaddr;
> > +
> > +			/* If merged above and below, remove previously
> > +			 * merged entry.  New entry covers it.  */
> > +			if (mmlp) {
> > +				list_del(&mmlp->list);
> > +				kfree(mmlp);
> > +			}
> > +			mmlp = mlp;
> > +		}
> > +	}
> > +
> > +	if (!mmlp) {
> > +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > +		if (!mlp) {
> > +			ret = -ENOMEM;
> > +			vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > +			goto out_lock;
> > +		}
> > +
> > +		mlp->npage = npage;
> > +		mlp->daddr = daddr;
> > +		mlp->vaddr = vaddr;
> > +		mlp->rdwr = rdwr;
> > +		list_add(&mlp->list, &iommu->dm_list);
> > +	}
> > +
> > +out_lock:
> > +	mutex_unlock(&iommu->dgate);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_iommu *iommu = filep->private_data;
> > +
> > +	vfio_release_iommu(iommu);
> > +	return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_iommu *iommu = filep->private_data;
> > +	int ret = -ENOSYS;
> 
> Any reason for not using "switch" ?

It got ugly in vfio_main, so I decided to be consistent w/ it in the
driver and use if/else here too.  I don't like the aesthetics of extra
{}s to declare variables within a switch, nor do I like declaring all
the variables for each case for the whole function.  Personal quirk.

> > +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > +                ret = put_user(flags, (u64 __user *)arg);
> > +
> > +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > +		struct vfio_dma_map dm;
> > +
> > +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > +			return -EFAULT;
> 
> What does the "_dm" suffix stand for?

Inherited from Tom, but I figure _dma_map_dm = action(dma map),
object(dm), which is a vfio_Dma_Map.

Thanks,

Alex
Alex Williamson - Nov. 11, 2011, 10:10 p.m.
Thanks Konrad!  Comments inline.

On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > VFIO provides a secure, IOMMU based interface for user space
> > drivers, including device assignment to virtual machines.
> > This provides the base management of IOMMU groups, devices,
> > and IOMMU objects.  See Documentation/vfio.txt included in
> > this patch for user and kernel API description.
> > 
> > Note, this implements the new API discussed at KVM Forum
> > 2011, as represented by the drvier version 0.2.  It's hoped
> > that this provides a modular enough interface to support PCI
> > and non-PCI userspace drivers across various architectures
> > and IOMMU implementations.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> > 
> > Fingers crossed, this is the last RFC for VFIO, but we need
> > the iommu group support before this can go upstream
> > (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> > hoping this helps push that along.
> > 
> > Since the last posting, this version completely modularizes
> > the device backends and better defines the APIs between the
> > core VFIO code and the device backends.  I expect that we
> > might also adopt a modular IOMMU interface as iommu_ops learns
> > about different types of hardware.  Also many, many cleanups.
> > Check the complete git history for details:
> > 
> > git://github.com/awilliam/linux-vfio.git vfio-ng
> > 
> > (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> > 
> > This version, along with the supporting VFIO PCI backend can
> > be found here:
> > 
> > git://github.com/awilliam/linux-vfio.git vfio-next-20111103
> > 
> > I've held off on implementing a kernel->user signaling
> > mechanism for now since the previous netlink version produced
> > too many gag reflexes.  It's easy enough to set a bit in the
> > group flags too indicate such support in the future, so I
> > think we can move ahead without it.
> > 
> > Appreciate any feedback or suggestions.  Thanks,
> > 
> > Alex
> > 
> >  Documentation/ioctl/ioctl-number.txt |    1 
> >  Documentation/vfio.txt               |  304 +++++++++
> >  MAINTAINERS                          |    8 
> >  drivers/Kconfig                      |    2 
> >  drivers/Makefile                     |    1 
> >  drivers/vfio/Kconfig                 |    8 
> >  drivers/vfio/Makefile                |    3 
> >  drivers/vfio/vfio_iommu.c            |  530 ++++++++++++++++
> >  drivers/vfio/vfio_main.c             | 1151 ++++++++++++++++++++++++++++++++++
> >  drivers/vfio/vfio_private.h          |   34 +
> >  include/linux/vfio.h                 |  155 +++++
> >  11 files changed, 2197 insertions(+), 0 deletions(-)
> >  create mode 100644 Documentation/vfio.txt
> >  create mode 100644 drivers/vfio/Kconfig
> >  create mode 100644 drivers/vfio/Makefile
> >  create mode 100644 drivers/vfio/vfio_iommu.c
> >  create mode 100644 drivers/vfio/vfio_main.c
> >  create mode 100644 drivers/vfio/vfio_private.h
> >  create mode 100644 include/linux/vfio.h
> > 
> > diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> > index 54078ed..59d01e4 100644
> > --- a/Documentation/ioctl/ioctl-number.txt
> > +++ b/Documentation/ioctl/ioctl-number.txt
> > @@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
> >  		and kernel/power/user.c
> >  '8'	all				SNP8023 advanced NIC card
> >  					<mailto:mcr@solidum.com>
> > +';'	64-76	linux/vfio.h
> >  '@'	00-0F	linux/radeonfb.h	conflict!
> >  '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
> >  'A'	00-1F	linux/apm_bios.h	conflict!
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > new file mode 100644
> > index 0000000..5866896
> > --- /dev/null
> > +++ b/Documentation/vfio.txt
> > @@ -0,0 +1,304 @@
> > +VFIO - "Virtual Function I/O"[1]
> > +-------------------------------------------------------------------------------
> > +Many modern system now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).  The VFIO driver
> > +is an IOMMU/device agnostic framework for exposing direct device
> > +access to userspace, in a secure, IOMMU protected environment.  In
> > +other words, this allows safe, non-privileged, userspace drivers.
> > +
> > +Why do we want that?  Virtual machines often make use of direct device
> > +access ("device assignment") when configured for the highest possible
> > +I/O performance.  From a device and host perspective, this simply turns
> > +the VM into a userspace driver, with the benefits of significantly
> > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > +drivers[2].
> 
> Are there any constraints of running a 32-bit userspace with
> a 64-bit kernel and with 32-bit user space drivers?

Shouldn't be.  I'll need to do some testing on that, but it was working
on the previous generation of vfio.

> > +
> > +Some applications, particularly in the high performance computing
> > +field, also benefit from low-overhead, direct device access from
> > +userspace.  Examples include network adapters (often non-TCP/IP based)
> > +and compute accelerators.  Previous to VFIO, these drivers needed to
> > +go through the full development cycle to become proper upstream driver,
> > +be maintained out of tree, or make use of the UIO framework, which
> > +has no notion of IOMMU protection, limited interrupt support, and
> > +requires root privileges to access things like PCI configuration space.
> > +
> > +The VFIO driver framework intends to unify these, replacing both the
> > +KVM PCI specific device assignment currently used as well as provide
> > +a more secure, more featureful userspace driver environment than UIO.
> > +
> > +Groups, Devices, IOMMUs, oh my
> 
> <chuckles> oh my, eh?

Anything for a corny chuckle :)

> > +-------------------------------------------------------------------------------
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system.  Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > +devices created by these restictions IOMMU groups (or just "groups" for
> > +this document).
> > +
> > +The IOMMU cannot distiguish transactions between the individual devices
> > +within the group, therefore the group is the basic unit of ownership for
> > +a userspace process.  Because of this, groups are also the primary
> > +interface to both devices and IOMMU domains in VFIO.
> > +
> > +The VFIO representation of groups is created as devices are added into
> > +the framework by a VFIO bus driver.  The vfio-pci module is an example
> > +of a bus driver.  This module registers devices along with a set of bus
> > +specific callbacks with the VFIO core.  These callbacks provide the
> > +interfaces later used for device access.  As each new group is created,
> > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > +character device.
> > +
> > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > +also provides a traditional device driver and is able to bind to devices
> > +on it's bus.  When a device is bound to the bus driver it's available to
> > +VFIO.  When all the devices within a group are bound to their bus drivers,
> > +the group becomes "viable" and a user with sufficient access to the VFIO
> > +group chardev can obtain exclusive access to the set of group devices.
> > +
> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> > +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
> > +
> > +The last two ioctls return new file descriptors for accessing
> > +individual devices within the group and programming the IOMMU.  Each of
> > +these new file descriptors provide their own set of file interfaces.
> > +These ioctls will fail if any of the devices within the group are not
> > +bound to their VFIO bus driver.  Additionally, when either of these
> > +interfaces are used, the group is then bound to the struct_mm of the
> > +caller.  The GET_FLAGS ioctl can be used to view the state of the group.
> > +
> > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > +new IOMMU domain is created and all of the devices in the group are
> > +attached to it.  This is the only way to ensure full IOMMU isolation
> > +of the group, but potentially wastes resources and cycles if the user
> > +intends to manage multiple groups with the same set of IOMMU mappings.
> > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > +allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
> > +arbitrary groups to be merged, so the user should assume merging is
> > +opportunistic.  A new group, with no open device or IOMMU file
> > +descriptors, can be merged into an existing, in-use, group using the
> > +MERGE ioctl.  A merged group can be unmerged using the UNMERGE ioctl
> > +once all of the device file descriptors for the group being merged
> > +"out" are closed.
> > +
> > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > +essentially fungible between group file descriptors (ie. if device A
> > +is in group X, and X is merged with Y, a file descriptor for A can be
> > +retrieved using GET_DEVICE_FD on Y.  Likewise, GET_IOMMU_FD returns a
> > +file descriptor referencing the same internal IOMMU object from either
> > +X or Y).  Merged groups can be dissolved either explictly with UNMERGE
> > +or automatically when ALL file descriptors for the merged group are
> > +closed (all IOMMUs, all devices, all groups).
> > +
> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)
> 
> Coherency support is not going to be addressed right? What about sync?
> Say you need to sync CPU to Device address?

Do we need to expose that to userspace or should the underlying
iommu_ops take care of it?

> > +
> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > +        __u64   len;            /* length of structure */
> 
> What is the purpose of the 'len' field? Is it to guard against future
> version changes?

Yes, David Gibson suggested we include flags & len for all data
structures to help future proof them.

> > +        __u64   vaddr;          /* process virtual addr */
> > +        __u64   dmaaddr;        /* desired and/or returned dma address */
> > +        __u64   size;           /* size in bytes */
> > +        __u64   flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */
> > +};
> > +
> > +Current users of VFIO use relatively static DMA mappings, not requiring
> > +high frequency turnover.  As new users are added, it's expected that the
> 
> Is there a limit to how many DMA mappings can be created?

Not that I'm aware of for the current AMD-Vi/VT-d implementations.  I
suppose iommu_ops would return -ENOSPC if it hit a limit.  I added the
VFIO_IOMMU_FLAGS_MAP_ANY flag above to try to identify that kind of
restriction.
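
For concreteness, here's a minimal, hypothetical userspace sketch of a
single MAP_DMA call using the struct quoted above.  It assumes these
definitions end up exported to userspace (shown here as <linux/vfio.h>)
and that iommu_fd came from VFIO_GROUP_GET_IOMMU_FD; map_buffer() is an
illustrative name, not part of the patch:

#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>		/* assumed exported header */

static int map_buffer(int iommu_fd, size_t size)
{
	struct vfio_dma_map dm;
	/* vaddr and size must be page aligned; mmap gives us that for vaddr */
	void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return -1;

	memset(&dm, 0, sizeof(dm));
	dm.len = sizeof(dm);		/* structure length, future proofing */
	dm.vaddr = (unsigned long)buf;	/* process virtual address */
	dm.dmaaddr = 0;			/* desired IOVA */
	dm.size = size;			/* multiple of PAGE_SIZE */
	dm.flags = VFIO_DMA_MAP_FLAG_WRITE;

	return ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);
}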

> > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > +will be reflected in the flags and may present new ioctls and file
> > +interfaces.
> > +
> > +The device GET_FLAGS ioctl is intended to return basic device type and
> > +indicate support for optional capabilities.  Flags currently include whether
> > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > +is supported:
> 
> And reset in terms of PCIe spec is the FLR?

Yes, just a pass through to pci_reset_function() for the pci vfio bus
driver.

> > +
> > +#define VFIO_DEVICE_GET_FLAGS           _IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI          (1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT           (1 << 1)
> > + #define VFIO_DEVICE_FLAGS_RESET        (1 << 2)
> > +
> > +The MMIO and IOP resources used by a device are described by regions.
> 
> IOP?

I/O port, I'll spell it out.

> > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > +
> > +#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, int)
> 
> Don't want __u32?

It could be; I'm not sure it buys us anything, and it might even restrict
us.  We likely don't need 2^32 regions (famous last words?), so we could
later define <0 to mean something?

> > +
> > +Regions are described by a struct vfio_region_info, which is retrieved by
> > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > +the desired region (0 based index).  Note that devices may implement zero
> > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > +mapping).
> 
> Huh?

PCI has the following static mapping:

enum {
        VFIO_PCI_BAR0_REGION_INDEX,
        VFIO_PCI_BAR1_REGION_INDEX,
        VFIO_PCI_BAR2_REGION_INDEX,
        VFIO_PCI_BAR3_REGION_INDEX,
        VFIO_PCI_BAR4_REGION_INDEX,
        VFIO_PCI_BAR5_REGION_INDEX,
        VFIO_PCI_ROM_REGION_INDEX,
        VFIO_PCI_CONFIG_REGION_INDEX,
        VFIO_PCI_NUM_REGIONS
};

So 8 regions are always reported regardless of whether the device
implements all the BARs and the ROM.  Then we have a fixed bar:index
mapping so we don't have to create a region_info field to describe the
bar number for the index.
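
As an illustration (not part of the patch), a userspace walk over those
fixed indexes might look like the sketch below, assuming the enum above
and the proposed ioctls are visible to userspace; zero-sized entries are
simply skipped:

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* assumed exported header */

static void dump_regions(int device_fd)
{
	int i;

	for (i = 0; i < VFIO_PCI_NUM_REGIONS; i++) {
		struct vfio_region_info info;

		memset(&info, 0, sizeof(info));
		info.len = sizeof(info);
		info.index = i;		/* fixed BAR/ROM/config:index mapping */

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info))
			continue;

		if (!info.size)
			continue;	/* BAR/ROM not implemented */

		printf("region %d: size 0x%llx offset 0x%llx flags 0x%llx\n",
		       i, (unsigned long long)info.size,
		       (unsigned long long)info.offset,
		       (unsigned long long)info.flags);
	}
}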

> > +
> > +struct vfio_region_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* region number */
> > +        __u64   size;           /* size in bytes of region */
> > +        __u64   offset;         /* start offset of region */
> > +        __u64   flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)
> 
> What is FLAG_MMAP? Does it mean: 1) it can be mmaped, or 2) it is mmaped?

Supports mmap

> FLAG_RO is pretty obvious - presumarily this is for firmware regions and such.
> And PHYS_VALID is if the region is disabled for some reasons? If so
> would the name FLAG_DISABLED be better?

No, POWER guys have some need to report the host physical address of the
region, so the flag indicates whether the below field is present and
valid.  I'll clarify these in the docs.

> 
> > +        __u64   phys;           /* physical address of region */
> > +};
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> > +available access types and validity of optional fields.  For instance
> > +the phys field may only be valid for certain devices types.
> > +
> > +Interrupts are described using a similar interface.  GET_NUM_IRQS
> > +reports the number or IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)
> 
> _u32?

Same as above, but I don't have a strong preference.

> > +
> > +struct vfio_irq_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* IRQ number */
> > +        __u32   count;          /* number of individual IRQs */
> > +        __u64   flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> > +};
> > +
> > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > +type to index mapping).
> 
> I am not really sure what that means.

This is so PCI can expose:

enum {
        VFIO_PCI_INTX_IRQ_INDEX,
        VFIO_PCI_MSI_IRQ_INDEX,
        VFIO_PCI_MSIX_IRQ_INDEX,
        VFIO_PCI_NUM_IRQS
};

So like regions it always exposes 3 IRQ indexes where count=0 if the
device doesn't actually support that type of interrupt.  I just want to
spell out that bus drivers have this kind of flexibility.
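
A similar, purely hypothetical probe over the fixed IRQ indexes shows
how count == 0 falls out of GET_IRQ_INFO (again assuming the enum and
ioctls above are exported to userspace):

#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* assumed exported header */

static void probe_irqs(int device_fd)
{
	int i;

	for (i = 0; i < VFIO_PCI_NUM_IRQS; i++) {
		struct vfio_irq_info info;

		memset(&info, 0, sizeof(info));
		info.len = sizeof(info);
		info.index = i;

		if (ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info))
			continue;

		/* count == 0 means the device doesn't support this type */
		printf("irq index %d: count %u flags 0x%llx\n",
		       i, info.count, (unsigned long long)info.flags);
	}
}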

> > +
> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs.  This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> 
> Are eventfds u64 or u32?

int, they're just file descriptors

> Why not just define a structure?
> struct vfio_irq_eventfds {
> 	__u32	index;
> 	__u32	count;
> 	__u64	eventfds[0]
> };

We could do that if preferred.  Hmm, are we then going to need
size/flags?

> How do you get an eventfd to feed in here?

eventfd(2), in qemu event_notifier_init() -> event_notifier_get_fd()
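
To make that concrete, a hypothetical snippet that creates an eventfd
and wires it up for INTx; the int-array layout (arg[0] = index,
arg[1] = count, arg[2..] = eventfds) follows the text above, and
VFIO_PCI_INTX_IRQ_INDEX is assumed to be visible to userspace:

#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* assumed exported header */

static int enable_intx(int device_fd)
{
	int args[3];
	int efd = eventfd(0, EFD_CLOEXEC);

	if (efd < 0)
		return -1;

	args[0] = VFIO_PCI_INTX_IRQ_INDEX;	/* which IRQ index */
	args[1] = 1;				/* one eventfd follows */
	args[2] = efd;				/* signaled on each interrupt */

	return ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, args);
}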

> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)
> 
> u32?

Not here, it's an fd, so should be an int.

> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host.  This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system.  After servicing the interrupt,
> > +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
> > +triggered interrupts implicitly have a count of 1 per index.
> 
> So they are enabled automatically? Meaning you don't even hav to do
> SET_IRQ_EVENTFDS b/c the count is set to 1?

I suppose that should be "no more than 1 per index" (ie. PCI would
report a count of 0 for VFIO_PCI_INTX_IRQ_INDEX if the device doesn't
support INTx).  I think you might be confusing VFIO_DEVICE_GET_IRQ_INFO
which tells how many are available with VFIO_DEVICE_SET_IRQ_EVENTFDS
which does the enabling/disabling.  All interrupts are disabled by
default because userspace needs to give us a way to signal them via
eventfds.  It will be device dependent whether multiple index can be
enabled simultaneously.  Hmm, is that another flag on the irq_info
struct or do we expect drivers to implicitly have that kind of
knowledge?

> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ          _IOW(';', 114, int)
> 
> So this is for MSI as well? So if I've an index = 1, with count = 4,
> and doing unmaks IRQ will chip enable all the MSI event at once?

No, this is only for re-enabling level triggered interrupts as discussed
above.  Edge triggered interrupts like MSI don't need an unmask... we
may want to do something to accelerate the MSI-X table access for
masking specific interrupts, but I figured that would need to be PCI
aware since those are PCI features, and would therefore be some future
extension of the PCI bus driver and exposed via VFIO_DEVICE_GET_FLAGS.
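
So a hypothetical service loop for a level triggered interrupt ends up
looking like this: read the eventfd, service the device, then ask for
the index to be unmasked (error handling trimmed, names illustrative):

#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>		/* assumed exported header */

static void service_level_irq(int device_fd, int efd, int index)
{
	uint64_t count;

	for (;;) {
		/* blocks until the interrupt fires; host masks it for us */
		if (read(efd, &count, sizeof(count)) != sizeof(count))
			break;

		/* ... service the device here; the IRQ stays masked on
		 * the host until we explicitly unmask it ... */

		/* arg[0] = index, per the UNMASK_IRQ definition below */
		if (ioctl(device_fd, VFIO_DEVICE_UNMASK_IRQ, &index))
			break;
	}
}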

> I guess there is not much point in enabling/disabling selective MSI
> IRQs..

Some older OSes are said to make extensive use of masking for MSI, so we
probably want this at some point.  I'm assuming future PCI extension for
now.

> > +
> > +Level triggered interrupts can also be unmasked using an irqfd.  Use
> 
> irqfd or eventfd?

irqfd is an eventfd in reverse.  eventfd = kernel signals userspace via
an fd, irqfd = userspace signals kernel via an fd.

> > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> 
> So only level triggered? Hmm, how do I know whether the device is
> level or edge? Or is that edge (MSI) can also be unmaked using the
> eventfs

Yes, only for level.  Isn't a device going to know what type of
interrupt it uses?  MSI masking is PCI specific, not handled by this.

> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD      _IOW(';', 115, int)
> > +
> > +When supported, as indicated by the device flags, reset the device.
> > +
> > +#define VFIO_DEVICE_RESET               _IO(';', 116)
> 
> Does it disable the 'count'? Err, does it disable the IRQ on the
> device after this and one should call VFIO_DEVICE_SET_IRQ_EVENTFDS
> to set new eventfds? Or does it re-use the eventfds and the device
> is enabled after this?

It doesn't affect the interrupt programming.  Should it?

> > +
> > +Device tree devices also invlude ioctls for further defining the
> 
> include
> 
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;
> 
> 0 based I presume?

Everything else is, so I would assume so.

> > +        __u64   flags;
> > +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)
> 
> What is region in this context?? Or would this make much more sense
> if I knew what Device Tree actually is.

Powerpc guys, any comments?  This was their suggestion.  These are
effectively the first device specific extension, available when
VFIO_DEVICE_FLAGS_DT is set.

> > +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> > +        char    *path;
> 
> Ah, now I see why you want 'len' here.. But I am still at loss
> why you want that with the other structures.

Attempt to future proof and validate input.

> > +};
> > +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;
> > +        __u32   prop_type;
> 
> Is that an enum type? Is this definied somewhere?
> > +        __u32   prop_index;
> 
> What is the purpose of this field?

Need input from powerpc folks here

> > +        __u64   flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION       (1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ          (1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX         _IOWR(';', 118, struct vfio_dtindex)
> > +
> > +
> > +VFIO bus driver API
> > +-------------------------------------------------------------------------------
> > +
> > +Bus drivers, such as PCI, have three jobs:
> > + 1) Add/remove devices from vfio
> > + 2) Provide vfio_device_ops for device access
> > + 3) Device binding and unbinding
> 
> suspend/resume?

In the previous version of vfio, the vfio core signaled suspend/resume
to userspace via netlink, effectively putting userspace on the pm
notifier chain.  I was intending to do the same here.

> > +
> > +When initialized, the bus driver should enumerate the devices on it's
> > +bus and call vfio_group_add_dev() for each device.  If the bus supports
> > +hotplug, notifiers should be enabled to track devices being added and
> > +removed.  vfio_group_del_dev() removes a previously added device from
> > +vfio.
> > +
> > +Adding a device registers a vfio_device_ops function pointer structure
> > +for the device:
> 
> Huh? So this gets created for _every_ 'struct device' that is added
> the VFIO bus? Is this structure exposed? Or is this an internal one?

Every device added creates a struct vfio_device and if necessary a
struct vfio_group.  These are internal, just for managing groups and
devices.

> > +
> > +struct vfio_device_ops {
> > +	bool			(*match)(struct device *, char *);
> > +	int			(*get)(void *);
> > +	void			(*put)(void *);
> > +	ssize_t			(*read)(void *, char __user *,
> > +					size_t, loff_t *);
> > +	ssize_t			(*write)(void *, const char __user *,
> > +					 size_t, loff_t *);
> > +	long			(*ioctl)(void *, unsigned int, unsigned long);
> > +	int			(*mmap)(void *, struct vm_area_struct *);
> > +};
> > +
> > +When a device is bound to the bus driver, the bus driver indicates this
> > +to vfio using the vfio_bind_dev() interface.  The device_data parameter
> 
> Might want to paste the function decleration for it.. b/c I am not sure
> where the 'device_data' parameter is on the argument list.

Ok

> > +is a pointer to an opaque data structure for use only by the bus driver.
> > +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> > +this data structure back to the bus driver.  When a device is unbound
> 
> Oh, so it is on the 'void *'.

Right

> > +from the bus driver, the vfio_unbind_dev() interface signals this to
> > +vfio.  This function returns the pointer to the device_data structure
> 
> That function
> > +registered for the device.
> 
> I am not really sure what this section purpose is? Could this be part
> of the header file or the code? It does not look to be part of the
> ioctl API?

We've passed into the "VFIO bus driver API" section of the document, to
explain the interaction between vfio-core and vfio bus drivers.

> > +
> > +As noted previously, a group contains one or more devices, so
> > +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> > +The vfio_device_ops.match callback is used to allow bus drivers to determine
> > +the match.  For drivers like vfio-pci, it's a simple match to dev_name(),
> > +which is unique in the system due to the PCI bus topology, other bus drivers
> > +may need to include parent devices to create a unique match, so this is
> > +left as a bus driver interface.
> > +
> > +-------------------------------------------------------------------------------
> > +
> > +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> > +initial implementation by Tom Lyon while as Cisco.  We've since outgrown
> > +the acronym, but it's catchy.
> > +
> > +[2] As always there are trade-offs to virtual machine device
> > +assignment that are beyond the scope of VFIO.  It's expected that
> > +future IOMMU technologies will reduce some, but maybe not all, of
> > +these trade-offs.
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index f05f5f6..4bd5aa0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -7106,6 +7106,14 @@ S:	Maintained
> >  F:	Documentation/filesystems/vfat.txt
> >  F:	fs/fat/
> >  
> > +VFIO DRIVER
> > +M:	Alex Williamson <alex.williamson@redhat.com>
> > +L:	kvm@vger.kernel.org
> 
> No vfio mailing list? Or a vfio-mailing list? 

IIRC, Avi had agreed that we could use kvm for now.  I don't know that
vfio will warrant its own list.  If it picks up, sure, we can move it.

> > +S:	Maintained
> > +F:	Documentation/vfio.txt
> > +F:	drivers/vfio/
> > +F:	include/linux/vfio.h
> > +
> >  VIDEOBUF2 FRAMEWORK
> >  M:	Pawel Osciak <pawel@osciak.com>
> >  M:	Marek Szyprowski <m.szyprowski@samsung.com>
> > diff --git a/drivers/Kconfig b/drivers/Kconfig
> > index b5e6f24..e15578b 100644
> > --- a/drivers/Kconfig
> > +++ b/drivers/Kconfig
> > @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
> >  
> >  source "drivers/uio/Kconfig"
> >  
> > +source "drivers/vfio/Kconfig"
> > +
> >  source "drivers/vlynq/Kconfig"
> >  
> >  source "drivers/virtio/Kconfig"
> > diff --git a/drivers/Makefile b/drivers/Makefile
> > index 1b31421..5f138b5 100644
> > --- a/drivers/Makefile
> > +++ b/drivers/Makefile
> > @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM)		+= atm/
> >  obj-$(CONFIG_FUSION)		+= message/
> >  obj-y				+= firewire/
> >  obj-$(CONFIG_UIO)		+= uio/
> > +obj-$(CONFIG_VFIO)		+= vfio/
> >  obj-y				+= cdrom/
> >  obj-y				+= auxdisplay/
> >  obj-$(CONFIG_PCCARD)		+= pcmcia/
> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > new file mode 100644
> > index 0000000..9acb1e7
> > --- /dev/null
> > +++ b/drivers/vfio/Kconfig
> > @@ -0,0 +1,8 @@
> > +menuconfig VFIO
> > +	tristate "VFIO Non-Privileged userspace driver framework"
> > +	depends on IOMMU_API
> > +	help
> > +	  VFIO provides a framework for secure userspace device drivers.
> > +	  See Documentation/vfio.txt for more details.
> > +
> > +	  If you don't know what to do here, say N.
> > diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> > new file mode 100644
> > index 0000000..088faf1
> > --- /dev/null
> > +++ b/drivers/vfio/Makefile
> > @@ -0,0 +1,3 @@
> > +vfio-y := vfio_main.o vfio_iommu.o
> > +
> > +obj-$(CONFIG_VFIO) := vfio.o
> > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > new file mode 100644
> > index 0000000..029dae3
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_iommu.c
> > @@ -0,0 +1,530 @@
> > +/*
> > + * VFIO: IOMMU DMA mapping support
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/fs.h>
> > +#include <linux/iommu.h>
> > +#include <linux/module.h>
> > +#include <linux/mm.h>
> > +#include <linux/sched.h>
> > +#include <linux/slab.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/workqueue.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +struct dma_map_page {
> > +	struct list_head	list;
> > +	dma_addr_t		daddr;
> > +	unsigned long		vaddr;
> > +	int			npage;
> > +	int			rdwr;
> 
> rdwr? Is this a flag thing? Could it be made in an enum?

Or maybe better would just be a bool.

> > +};
> > +
> > +/*
> > + * This code handles mapping and unmapping of user data buffers
> > + * into DMA'ble space using the IOMMU
> > + */
> > +
> > +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> > +
> > +struct vwork {
> > +	struct mm_struct	*mm;
> > +	int			npage;
> > +	struct work_struct	work;
> > +};
> > +
> > +/* delayed decrement for locked_vm */
> > +static void vfio_lock_acct_bg(struct work_struct *work)
> > +{
> > +	struct vwork *vwork = container_of(work, struct vwork, work);
> > +	struct mm_struct *mm;
> > +
> > +	mm = vwork->mm;
> > +	down_write(&mm->mmap_sem);
> > +	mm->locked_vm += vwork->npage;
> > +	up_write(&mm->mmap_sem);
> > +	mmput(mm);		/* unref mm */
> > +	kfree(vwork);
> > +}
> > +
> > +static void vfio_lock_acct(int npage)
> > +{
> > +	struct vwork *vwork;
> > +	struct mm_struct *mm;
> > +
> > +	if (!current->mm) {
> > +		/* process exited */
> > +		return;
> > +	}
> > +	if (down_write_trylock(&current->mm->mmap_sem)) {
> > +		current->mm->locked_vm += npage;
> > +		up_write(&current->mm->mmap_sem);
> > +		return;
> > +	}
> > +	/*
> > +	 * Couldn't get mmap_sem lock, so must setup to decrement
> > +	 * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> > +	 * need this silliness
> > +	 */
> > +	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> > +	if (!vwork)
> > +		return;
> > +	mm = get_task_mm(current);	/* take ref mm */
> > +	if (!mm) {
> > +		kfree(vwork);
> > +		return;
> > +	}
> > +	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> > +	vwork->mm = mm;
> > +	vwork->npage = npage;
> > +	schedule_work(&vwork->work);
> > +}
> > +
> > +/* Some mappings aren't backed by a struct page, for example an mmap'd
> > + * MMIO range for our own or another device.  These use a different
> > + * pfn conversion and shouldn't be tracked as locked pages. */
> > +static int is_invalid_reserved_pfn(unsigned long pfn)
> 
> static bool
> 
> > +{
> > +	if (pfn_valid(pfn)) {
> > +		int reserved;
> > +		struct page *tail = pfn_to_page(pfn);
> > +		struct page *head = compound_trans_head(tail);
> > +		reserved = PageReserved(head);
> 
> bool reserved = PageReserved(head);

Agree on both

> > +		if (head != tail) {
> > +			/* "head" is not a dangling pointer
> > +			 * (compound_trans_head takes care of that)
> > +			 * but the hugepage may have been split
> > +			 * from under us (and we may not hold a
> > +			 * reference count on the head page so it can
> > +			 * be reused before we run PageReferenced), so
> > +			 * we've to check PageTail before returning
> > +			 * what we just read.
> > +			 */
> > +			smp_rmb();
> > +			if (PageTail(tail))
> > +				return reserved;
> > +		}
> > +		return PageReserved(tail);
> > +	}
> > +
> > +	return true;
> > +}
> > +
> > +static int put_pfn(unsigned long pfn, int rdwr)
> > +{
> > +	if (!is_invalid_reserved_pfn(pfn)) {
> > +		struct page *page = pfn_to_page(pfn);
> > +		if (rdwr)
> > +			SetPageDirty(page);
> > +		put_page(page);
> > +		return 1;
> > +	}
> > +	return 0;
> > +}
> > +
> > +/* Unmap DMA region */
> > +/* dgate must be held */
> 
> dgate?

DMA gate, the mutex for iommu operations.  This is a carry-over from old
vfio.  As there's only one mutex on the struct vfio_iommu, I can just
rename it to "lock".

> > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > +			    int npage, int rdwr)
> > +{
> > +	int i, unlocked = 0;
> > +
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > +		unsigned long pfn;
> > +
> > +		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > +		if (pfn) {
> > +			iommu_unmap(iommu->domain, iova, 0);
> 
> What is the '0' for? Perhaps a comment: /* We only do zero order */

Yep.  We'll need to improve this at some point to take advantage of
large iommu pages, but it shouldn't affect the API.  I'll add a comment.

> > +			unlocked += put_pfn(pfn, rdwr);
> > +		}
> > +	}
> > +	return unlocked;
> > +}
> > +
> > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > +			   unsigned long npage, int rdwr)
> > +{
> > +	int unlocked;
> > +
> > +	unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > +	vfio_lock_acct(-unlocked);
> > +}
> > +
> > +/* Unmap ALL DMA regions */
> > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos, *pos2;
> 
> pos2 should probably be just called 'tmp'

ok

> > +	struct dma_map_page *mlp;
> 
> What does 'mlp' stand for?
> 
> mlp -> dma_page ?

Carry-over from the original code; I can guess, but I'm not sure what Tom
was originally thinking.  I think everyone has asked so far, so I'll make
a pass at coming up with names that I can explain.

> > +
> > +	mutex_lock(&iommu->dgate);
> > +	list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> 
> Uh, so if it did not get put_page() we would try to still delete it?
> Couldn't that lead to corruption as the 'mlp' is returned to the poll?
> 
> Ah wait, the put_page is on the DMA page, so it is OK to
> delete the tracking structure. It will be just a leaked page.

I assume you're referencing this chunk:

vfio_dma_unmap
  __vfio_dma_unmap
    ...
        pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
        if (pfn) {
                iommu_unmap(iommu->domain, iova, 0);
                unlocked += put_pfn(pfn, rdwr);
        }

So we skip things that aren't mapped in the iommu, but anything not
mapped should have already been put (failed vfio_dma_map).  If it is
mapped, we put it if we originally got it via get_user_pages_fast.
unlocked would only not get incremented here if it was an mmap'd page
(such as the mmap of an mmio space of another vfio device), via the code
in vaddr_get_pfn (stolen from KVM).

> > +		list_del(&mlp->list);
> > +		kfree(mlp);
> > +	}
> > +	mutex_unlock(&iommu->dgate);
> > +}
> > +
> > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> > +{
> > +	struct page *page[1];
> > +	struct vm_area_struct *vma;
> > +	int ret = -EFAULT;
> > +
> > +	if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> > +		*pfn = page_to_pfn(page[0]);
> > +		return 0;
> > +	}
> > +
> > +	down_read(&current->mm->mmap_sem);
> > +
> > +	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > +
> > +	if (vma && vma->vm_flags & VM_PFNMAP) {
> > +		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +		if (is_invalid_reserved_pfn(*pfn))
> > +			ret = 0;
> 
> Did you mean to break here?

We're in an if block, not a loop.

> > +	}
> > +
> > +	up_read(&current->mm->mmap_sem);
> > +
> > +	return ret;
> > +}
> > +
> > +/* Map DMA region */
> > +/* dgate must be held */
> > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> > +			unsigned long vaddr, int npage, int rdwr)
> > +{
> > +	unsigned long start = iova;
> > +	int i, ret, locked = 0, prot = IOMMU_READ;
> > +
> > +	/* Verify pages are not already mapped */
> 
> I think a 'that' is missing above.

Ok.

> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > +		if (iommu_iova_to_phys(iommu->domain, iova))
> > +			return -EBUSY;
> > +
> > +	iova = start;
> > +
> > +	if (rdwr)
> > +		prot |= IOMMU_WRITE;
> > +	if (iommu->cache)
> > +		prot |= IOMMU_CACHE;
> > +
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > +		unsigned long pfn = 0;
> > +
> > +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > +		if (ret) {
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +
> > +		/* Only add actual locked pages to accounting */
> > +		if (!is_invalid_reserved_pfn(pfn))
> > +			locked++;
> > +
> > +		ret = iommu_map(iommu->domain, iova,
> > +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> 
> Put a comment by the 0 saying /* order 0 pages only! */

Yep

> > +		if (ret) {
> > +			/* Back out mappings on error */
> > +			put_pfn(pfn, rdwr);
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +	}
> > +	vfio_lock_acct(locked);
> > +	return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> 
> Perhaps a bool?

Sure

> > +				 unsigned long start2, size_t size2)
> > +{
> > +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> > +}
> > +
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > +					  dma_addr_t start, size_t size)
> > +{
> > +	struct list_head *pos;
> > +	struct dma_map_page *mlp;
> > +
> > +	list_for_each(pos, &iommu->dm_list) {
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +				   start, size))
> > +			return mlp;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > +			    size_t size, struct dma_map_page *mlp)
> > +{
> > +	struct dma_map_page *split;
> > +	int npage_lo, npage_hi;
> > +
> > +	/* Existing dma region is completely covered, unmap all */
> > +	if (start <= mlp->daddr &&
> > +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +		list_del(&mlp->list);
> > +		npage_lo = mlp->npage;
> > +		kfree(mlp);
> > +		return npage_lo;
> > +	}
> > +
> > +	/* Overlap low address of existing range */
> > +	if (start <= mlp->daddr) {
> > +		size_t overlap;
> > +
> > +		overlap = start + size - mlp->daddr;
> > +		npage_lo = overlap >> PAGE_SHIFT;
> > +		npage_hi = mlp->npage - npage_lo;
> > +
> > +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +		mlp->daddr += overlap;
> > +		mlp->vaddr += overlap;
> > +		mlp->npage -= npage_lo;
> > +		return npage_lo;
> > +	}
> > +
> > +	/* Overlap high address of existing range */
> > +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		size_t overlap;
> > +
> > +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +		npage_hi = overlap >> PAGE_SHIFT;
> > +		npage_lo = mlp->npage - npage_hi;
> > +
> > +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +		mlp->npage -= npage_hi;
> > +		return npage_hi;
> > +	}
> > +
> > +	/* Split existing */
> > +	npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > +	npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > +	split = kzalloc(sizeof *split, GFP_KERNEL);
> > +	if (!split)
> > +		return -ENOMEM;
> > +
> > +	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > +	mlp->npage = npage_lo;
> > +
> > +	split->npage = npage_hi;
> > +	split->daddr = start + size;
> > +	split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > +	split->rdwr = mlp->rdwr;
> > +	list_add(&split->list, &iommu->dm_list);
> > +	return size >> PAGE_SHIFT;
> > +}
> > +
> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +	int ret = 0;
> > +	size_t npage = dmp->size >> PAGE_SHIFT;
> > +	struct list_head *pos, *n;
> > +
> > +	if (dmp->dmaaddr & ~PAGE_MASK)
> > +		return -EINVAL;
> > +	if (dmp->size & ~PAGE_MASK)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +
> > +	list_for_each_safe(pos, n, &iommu->dm_list) {
> > +		struct dma_map_page *mlp;
> > +
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +				   dmp->dmaaddr, dmp->size)) {
> > +			ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > +						      dmp->size, mlp);
> > +			if (ret > 0)
> > +				npage -= NPAGE_TO_SIZE(ret);
> > +			if (ret < 0 || npage == 0)
> > +				break;
> > +		}
> > +	}
> > +	mutex_unlock(&iommu->dgate);
> > +	return ret > 0 ? 0 : ret;
> > +}
> > +
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +	int npage;
> > +	struct dma_map_page *mlp, *mmlp = NULL;
> > +	dma_addr_t daddr = dmp->dmaaddr;
> > +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > +	size_t size = dmp->size;
> > +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > +	if (vaddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (daddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (size & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +
> > +	npage = size >> PAGE_SHIFT;
> > +	if (!npage)
> > +		return -EINVAL;
> > +
> > +	if (!iommu)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +
> > +	if (vfio_find_dma(iommu, daddr, size)) {
> > +		ret = -EBUSY;
> > +		goto out_lock;
> > +	}
> > +
> > +	/* account for locked pages */
> > +	locked = current->mm->locked_vm + npage;
> > +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > +			__func__, rlimit(RLIMIT_MEMLOCK));
> > +		ret = -ENOMEM;
> > +		goto out_lock;
> > +	}
> > +
> > +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > +	if (ret)
> > +		goto out_lock;
> > +
> > +	/* Check if we abut a region below */
> > +	if (daddr) {
> > +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > +		if (mlp && mlp->rdwr == rdwr &&
> > +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > +			mlp->npage += npage;
> > +			daddr = mlp->daddr;
> > +			vaddr = mlp->vaddr;
> > +			npage = mlp->npage;
> > +			size = NPAGE_TO_SIZE(npage);
> > +
> > +			mmlp = mlp;
> > +		}
> > +	}
> > +
> > +	if (daddr + size) {
> > +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> > +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > +
> > +			mlp->npage += npage;
> > +			mlp->daddr = daddr;
> > +			mlp->vaddr = vaddr;
> > +
> > +			/* If merged above and below, remove previously
> > +			 * merged entry.  New entry covers it.  */
> > +			if (mmlp) {
> > +				list_del(&mmlp->list);
> > +				kfree(mmlp);
> > +			}
> > +			mmlp = mlp;
> > +		}
> > +	}
> > +
> > +	if (!mmlp) {
> > +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > +		if (!mlp) {
> > +			ret = -ENOMEM;
> > +			vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > +			goto out_lock;
> > +		}
> > +
> > +		mlp->npage = npage;
> > +		mlp->daddr = daddr;
> > +		mlp->vaddr = vaddr;
> > +		mlp->rdwr = rdwr;
> > +		list_add(&mlp->list, &iommu->dm_list);
> > +	}
> > +
> > +out_lock:
> > +	mutex_unlock(&iommu->dgate);
> > +	return ret;
> > +}
> > +
> > +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_iommu *iommu = filep->private_data;
> > +
> > +	vfio_release_iommu(iommu);
> > +	return 0;
> > +}
> > +
> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_iommu *iommu = filep->private_data;
> > +	int ret = -ENOSYS;
> > +
> > +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> 
> Something is weird with the tabbing here..

Indeed, the joys of switching between kernel and qemu ;)  Fixed.

> > +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > +                ret = put_user(flags, (u64 __user *)arg);
> > +
> > +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > +		struct vfio_dma_map dm;
> > +
> > +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > +			return -EFAULT;
> > +
> > +		ret = vfio_dma_map_dm(iommu, &dm);
> > +
> > +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > +			ret = -EFAULT;
> > +
> > +	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> > +		struct vfio_dma_map dm;
> > +
> > +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > +			return -EFAULT;
> > +
> > +		ret = vfio_dma_unmap_dm(iommu, &dm);
> > +
> > +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> > +			ret = -EFAULT;
> > +	}
> > +	return ret;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_iommu_compat_ioctl(struct file *filep,
> > +				    unsigned int cmd, unsigned long arg)
> > +{
> > +	arg = (unsigned long)compat_ptr(arg);
> > +	return vfio_iommu_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif	/* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_iommu_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.release	= vfio_iommu_release,
> > +	.unlocked_ioctl	= vfio_iommu_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= vfio_iommu_compat_ioctl,
> > +#endif
> > +};
> > diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> > new file mode 100644
> > index 0000000..6169356
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_main.c
> > @@ -0,0 +1,1151 @@
> > +/*
> > + * VFIO framework
> > + *
> > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/cdev.h>
> > +#include <linux/compat.h>
> > +#include <linux/device.h>
> > +#include <linux/file.h>
> > +#include <linux/anon_inodes.h>
> > +#include <linux/fs.h>
> > +#include <linux/idr.h>
> > +#include <linux/iommu.h>
> > +#include <linux/mm.h>
> > +#include <linux/module.h>
> > +#include <linux/slab.h>
> > +#include <linux/string.h>
> > +#include <linux/uaccess.h>
> > +#include <linux/vfio.h>
> > +#include <linux/wait.h>
> > +
> > +#include "vfio_private.h"
> > +
> > +#define DRIVER_VERSION	"0.2"
> > +#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> > +#define DRIVER_DESC	"VFIO - User Level meta-driver"
> > +
> > +static int allow_unsafe_intrs;
> 
> __read_mostly

Ok

> > +module_param(allow_unsafe_intrs, int, 0);
> 
> S_IRUGO ?

I actually intended that to be S_IRUGO | S_IWUSR, just like the kvm
parameter, so it can be toggled at runtime.

> > +MODULE_PARM_DESC(allow_unsafe_intrs,
> > +        "Allow use of IOMMUs which do not support interrupt remapping");
> > +
> > +static struct vfio {
> > +	dev_t			devt;
> > +	struct cdev		cdev;
> > +	struct list_head	group_list;
> > +	struct mutex		lock;
> > +	struct kref		kref;
> > +	struct class		*class;
> > +	struct idr		idr;
> > +	wait_queue_head_t	release_q;
> > +} vfio;
> 
> You probably want to move this below the 'vfio_group'
> as vfio contains the vfio_group.

Only via the group_list.  Are you suggesting for readability or to avoid
forward declarations (which we don't need between these two with current
ordering)?

> > +
> > +static const struct file_operations vfio_group_fops;
> > +extern const struct file_operations vfio_iommu_fops;
> > +
> > +struct vfio_group {
> > +	dev_t			devt;
> > +	unsigned int		groupid;
> > +	struct bus_type		*bus;
> > +	struct vfio_iommu	*iommu;
> > +	struct list_head	device_list;
> > +	struct list_head	iommu_next;
> > +	struct list_head	group_next;
> > +	int			refcnt;
> > +};
> > +
> > +struct vfio_device {
> > +	struct device			*dev;
> > +	const struct vfio_device_ops	*ops;
> > +	struct vfio_iommu		*iommu;
> > +	struct vfio_group		*group;
> > +	struct list_head		device_next;
> > +	bool				attached;
> > +	int				refcnt;
> > +	void				*device_data;
> > +};
> 
> And perhaps move this above vfio_group. As vfio_group
> contains a list of these structures?

These are inter-linked, so chicken and egg.  The current ordering is
more function based than definition based.  struct vfio is the highest
level object, groups are next, iommus and devices are next, but we need
to share iommus with the other file, so that lands in the header.

> > +
> > +/*
> > + * Helper functions called under vfio.lock
> > + */
> > +
> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		if (device->refcnt)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +/* Return true if any of the groups attached to an iommu are opened.
> > + * We can only tear apart merged groups when nothing is left open. */
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +		if (group->refcnt)
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +/* An iommu is "in use" if it has a file descriptor open or if any of
> > + * the groups assigned to the iommu have devices open. */
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (iommu->refcnt)
> > +		return true;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		if (__vfio_group_devs_inuse(group))
> > +			return true;
> > +	}
> > +	return false;
> > +}
> > +
> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > +				   struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (group->iommu)
> > +		list_del(&group->iommu_next);
> > +	if (iommu)
> > +		list_add(&group->iommu_next, &iommu->group_list);
> > +
> > +	group->iommu = iommu;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		device->iommu = iommu;
> > +	}
> > +}
> > +
> > +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> > +				    struct vfio_device *device)
> > +{
> > +	BUG_ON(!iommu->domain && device->attached);
> 
> Whoa. Heavy hammer there.
> 
> Perhaps WARN_ON as you do check it later on.

I think it's warranted, internal consistency is broken if we have a
device that thinks it's attached to an iommu domain that doesn't exist.
It should, of course, never happen and this isn't a performance path.

> > +
> > +	if (!iommu->domain || !device->attached)
> > +		return;
> > +
> > +	iommu_detach_device(iommu->domain, device->dev);
> > +	device->attached = false;
> > +}
> > +
> > +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> > +				      struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		__vfio_iommu_detach_dev(iommu, device);
> > +	}
> > +}
> > +
> > +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> > +				   struct vfio_device *device)
> > +{
> > +	int ret;
> > +
> > +	BUG_ON(device->attached);
> 
> How about:
> 
> WARN_ON(device->attached, "The engineer who wrote the user-space device driver is trying to register
> the device again! Tell him/her to stop please.\n");

I would almost demote this one to a WARN_ON, but userspace isn't in
control of attaching and detaching devices from the iommu.  That's a
side effect of getting the iommu or device file descriptor.  So again,
this is an internal consistency check and it should never happen,
regardless of userspace.

> > +
> > +	if (!iommu || !iommu->domain)
> > +		return -EINVAL;
> > +
> > +	ret = iommu_attach_device(iommu->domain, device->dev);
> > +	if (!ret)
> > +		device->attached = true;
> > +
> > +	return ret;
> > +}
> > +
> > +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> > +				     struct vfio_group *group)
> > +{
> > +	struct list_head *pos;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		struct vfio_device *device;
> > +		int ret;
> > +
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		ret = __vfio_iommu_attach_dev(iommu, device);
> > +		if (ret) {
> > +			__vfio_iommu_detach_group(iommu, group);
> > +			return ret;
> > +		}
> > +	}
> > +	return 0;
> > +}
> > +
> > +/* The iommu is viable, ie. ready to be configured, when all the devices
> > + * for all the groups attached to the iommu are bound to their vfio device
> > + * drivers (ex. vfio-pci).  This sets the device_data private data pointer. */
> > +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *gpos, *dpos;
> > +
> > +	list_for_each(gpos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (!device->device_data)
> > +				return false;
> > +		}
> > +	}
> > +	return true;
> > +}
> > +
> > +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +
> > +	if (!iommu->domain)
> > +		return;
> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		__vfio_iommu_detach_group(iommu, group);
> > +	}
> > +
> > +	vfio_iommu_unmapall(iommu);
> > +
> > +	iommu_domain_free(iommu->domain);
> > +	iommu->domain = NULL;
> > +	iommu->mm = NULL;
> > +}
> > +
> > +/* Open the IOMMU.  This gates all access to the iommu or device file
> > + * descriptors and sets current->mm as the exclusive user. */
> > +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos;
> > +	int ret;
> > +
> > +	if (!__vfio_iommu_viable(iommu))
> > +		return -EBUSY;
> > +
> > +	if (iommu->domain)
> > +		return -EINVAL;
> > +
> > +	iommu->domain = iommu_domain_alloc(iommu->bus);
> > +	if (!iommu->domain)
> > +		return -EFAULT;
> 
> ENOMEM?

Yeah, probably more appropriate.

> > +
> > +	list_for_each(pos, &iommu->group_list) {
> > +		struct vfio_group *group;
> > +		group = list_entry(pos, struct vfio_group, iommu_next);
> > +
> > +		ret = __vfio_iommu_attach_group(iommu, group);
> > +		if (ret) {
> > +			__vfio_close_iommu(iommu);
> > +			return ret;
> > +		}
> > +	}
> > +
> > +	if (!allow_unsafe_intrs &&
> > +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> > +		__vfio_close_iommu(iommu);
> > +		return -EFAULT;
> > +	}
> > +
> > +	iommu->cache = (iommu_domain_has_cap(iommu->domain,
> > +					     IOMMU_CAP_CACHE_COHERENCY) != 0);
> > +	iommu->mm = current->mm;
> > +
> > +	return 0;
> > +}
> > +
> > +/* Actively try to tear down the iommu and merged groups.  If there are no
> > + * open iommu or device fds, we close the iommu.  If we close the iommu and
> > + * there are also no open group fds, we can futher dissolve the group to
> > + * iommu association and free the iommu data structure. */
> > +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> > +{
> > +
> > +	if (__vfio_iommu_inuse(iommu))
> > +		return -EBUSY;
> > +
> > +	__vfio_close_iommu(iommu);
> > +
> > +	if (!__vfio_iommu_groups_inuse(iommu)) {
> > +		struct list_head *pos, *ppos;
> > +
> > +		list_for_each_safe(pos, ppos, &iommu->group_list) {
> > +			struct vfio_group *group;
> > +
> > +			group = list_entry(pos, struct vfio_group, iommu_next);
> > +			__vfio_group_set_iommu(group, NULL);
> > +		}
> > +
> > +
> > +		kfree(iommu);
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> > +{
> > +	struct list_head *gpos;
> > +	unsigned int groupid;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> 
> Hmm, where is this defined? v3.2-rc1 does not seem to have it?

From patch header:

        Fingers crossed, this is the last RFC for VFIO, but we need
        the iommu group support before this can go upstream
        (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
        hoping this helps push that along.

That's the one bit keeping me from doing a non-RFC of the core, besides
fixing all these comments ;)

> > +		return NULL;
> > +
> > +	list_for_each(gpos, &vfio.group_list) {
> > +		struct vfio_group *group;
> > +		struct list_head *dpos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > +		if (group->groupid != groupid)
> > +			continue;
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (device->dev == dev)
> > +				return device;
> > +		}
> > +	}
> > +	return NULL;
> > +}
> > +
> > +/* All release paths simply decrement the refcnt, attempt to teardown
> > + * the iommu and merged groups, and wakeup anything that might be
> > + * waiting if we successfully dissolve anything. */
> > +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> > +{
> > +	bool wake;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	(*refcnt)--;
> > +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> > +
> > +	mutex_unlock(&vfio.lock);
> > +
> > +	if (wake)
> > +		wake_up(&vfio.release_q);
> > +
> > +	return 0;
> > +}
> > +
> > +/*
> > + * Device fops - passthrough to vfio device driver w/ device_data
> > + */
> > +static int vfio_device_release(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	vfio_do_release(&device->refcnt, device->iommu);
> > +
> > +	device->ops->put(device->device_data);
> > +
> > +	return 0;
> > +}
> > +
> > +static long vfio_device_unl_ioctl(struct file *filep,
> > +				  unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->ioctl(device->device_data, cmd, arg);
> > +}
> > +
> > +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> > +				size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->read(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> > +				 size_t count, loff_t *ppos)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->write(device->device_data, buf, count, ppos);
> > +}
> > +
> > +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> > +{
> > +	struct vfio_device *device = filep->private_data;
> > +
> > +	return device->ops->mmap(device->device_data, vma);
> > +}
> > +	
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_device_compat_ioctl(struct file *filep,
> > +				     unsigned int cmd, unsigned long arg)
> > +{
> > +	arg = (unsigned long)compat_ptr(arg);
> > +	return vfio_device_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif	/* CONFIG_COMPAT */
> > +
> > +const struct file_operations vfio_device_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.release	= vfio_device_release,
> > +	.read		= vfio_device_read,
> > +	.write		= vfio_device_write,
> > +	.unlocked_ioctl	= vfio_device_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= vfio_device_compat_ioctl,
> > +#endif
> > +	.mmap		= vfio_device_mmap,
> > +};
> > +
> > +/*
> > + * Group fops
> > + */
> > +static int vfio_group_open(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_group *group;
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	group = idr_find(&vfio.idr, iminor(inode));
> > +
> > +	if (!group) {
> > +		ret = -ENODEV;
> > +		goto out;
> > +	}
> > +
> > +	filep->private_data = group;
> > +
> > +	if (!group->iommu) {
> > +		struct vfio_iommu *iommu;
> > +
> > +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> > +		if (!iommu) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +		INIT_LIST_HEAD(&iommu->group_list);
> > +		INIT_LIST_HEAD(&iommu->dm_list);
> > +		mutex_init(&iommu->dgate);
> > +		iommu->bus = group->bus;
> > +		__vfio_group_set_iommu(group, iommu);
> > +	}
> > +	group->refcnt++;
> > +
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +
> > +	return ret;
> > +}
> > +
> > +static int vfio_group_release(struct inode *inode, struct file *filep)
> > +{
> > +	struct vfio_group *group = filep->private_data;
> > +
> > +	return vfio_do_release(&group->refcnt, group->iommu);
> > +}
> > +
> > +/* Attempt to merge the group pointed to by fd into group.  The merge-ee
> > + * group must not have an iommu or any devices open because we cannot
> > + * maintain that context across the merge.  The merge-er group can be
> > + * in use. */
> > +static int vfio_group_merge(struct vfio_group *group, int fd)
> > +{
> > +	struct vfio_group *new;
> > +	struct vfio_iommu *old_iommu;
> > +	struct file *file;
> > +	int ret = 0;
> > +	bool opened = false;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	file = fget(fd);
> > +	if (!file) {
> > +		ret = -EBADF;
> > +		goto out_noput;
> > +	}
> > +
> > +	/* Sanity check, is this really our fd? */
> > +	if (file->f_op != &vfio_group_fops) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	new = file->private_data;
> > +
> > +	if (!new || new == group || !new->iommu ||
> > +	    new->iommu->domain || new->bus != group->bus) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* We need to attach all the devices to each domain separately
> > +	 * in order to validate that the capabilities match for both.  */
> > +	ret = __vfio_open_iommu(new->iommu);
> > +	if (ret)
> > +		goto out;
> > +
> > +	if (!group->iommu->domain) {
> > +		ret = __vfio_open_iommu(group->iommu);
> > +		if (ret)
> > +			goto out;
> > +		opened = true;
> > +	}
> > +
> > +	/* If cache coherency doesn't match we'd potentialy need to
> > +	 * remap existing iommu mappings in the merge-er domain.
> > +	 * Poor return to bother trying to allow this currently. */
> > +	if (iommu_domain_has_cap(group->iommu->domain,
> > +				 IOMMU_CAP_CACHE_COHERENCY) !=
> > +	    iommu_domain_has_cap(new->iommu->domain,
> > +				 IOMMU_CAP_CACHE_COHERENCY)) {
> > +		__vfio_close_iommu(new->iommu);
> > +		if (opened)
> > +			__vfio_close_iommu(group->iommu);
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* Close the iommu for the merge-ee and attach all its devices
> > +	 * to the merge-er iommu. */
> > +	__vfio_close_iommu(new->iommu);
> > +
> > +	ret = __vfio_iommu_attach_group(group->iommu, new);
> > +	if (ret)
> > +		goto out;
> > +
> > +	/* set_iommu unlinks new from the iommu, so save a pointer to it */
> > +	old_iommu = new->iommu;
> > +	__vfio_group_set_iommu(new, group->iommu);
> > +	kfree(old_iommu);
> > +
> > +out:
> > +	fput(file);
> > +out_noput:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Unmerge the group pointed to by fd from group. */
> > +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> > +{
> > +	struct vfio_group *new;
> > +	struct vfio_iommu *new_iommu;
> > +	struct file *file;
> > +	int ret = 0;
> > +
> > +	/* Since the merge-out group is already opened, it needs to
> > +	 * have an iommu struct associated with it. */
> > +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> > +	if (!new_iommu)
> > +		return -ENOMEM;
> > +
> > +	INIT_LIST_HEAD(&new_iommu->group_list);
> > +	INIT_LIST_HEAD(&new_iommu->dm_list);
> > +	mutex_init(&new_iommu->dgate);
> > +	new_iommu->bus = group->bus;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	file = fget(fd);
> > +	if (!file) {
> > +		ret = -EBADF;
> > +		goto out_noput;
> > +	}
> > +
> > +	/* Sanity check, is this really our fd? */
> > +	if (file->f_op != &vfio_group_fops) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	new = file->private_data;
> > +	if (!new || new == group || new->iommu != group->iommu) {
> > +		ret = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/* We can't merge-out a group with devices still in use. */
> > +	if (__vfio_group_devs_inuse(new)) {
> > +		ret = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	__vfio_iommu_detach_group(group->iommu, new);
> > +	__vfio_group_set_iommu(new, new_iommu);
> > +
> > +out:
> > +	fput(file);
> > +out_noput:
> > +	if (ret)
> > +		kfree(new_iommu);
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Get a new iommu file descriptor.  This will open the iommu, setting
> > + * the current->mm ownership if it's not already set. */
> > +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> > +{
> > +	int ret = 0;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (!group->iommu->domain) {
> > +		ret = __vfio_open_iommu(group->iommu);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> > +			       group->iommu, O_RDWR);
> > +	if (ret < 0)
> > +		goto out;
> > +
> > +	group->iommu->refcnt++;
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Get a new device file descriptor.  This will open the iommu, setting
> > + * the current->mm ownership if it's not already set.  It's difficult to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match.  For
> > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> > +{
> > +	struct vfio_iommu *iommu = group->iommu;
> > +	struct list_head *gpos;
> > +	int ret = -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (!iommu->domain) {
> > +		ret = __vfio_open_iommu(iommu);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	list_for_each(gpos, &iommu->group_list) {
> > +		struct list_head *dpos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (device->ops->match(device->dev, buf)) {
> > +				struct file *file;
> > +
> > +				if (device->ops->get(device->device_data)) {
> > +					ret = -EFAULT;
> > +					goto out;
> > +				}
> > +
> > +				/* We can't use anon_inode_getfd(), like above
> > +				 * because we need to modify the f_mode flags
> > +				 * directly to allow more than just ioctls */
> > +				ret = get_unused_fd();
> > +				if (ret < 0) {
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> > +
> > +				file = anon_inode_getfile("[vfio-device]",
> > +							  &vfio_device_fops,
> > +							  device, O_RDWR);
> > +				if (IS_ERR(file)) {
> > +					put_unused_fd(ret);
> > +					ret = PTR_ERR(file);
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> > +
> > +				/* Todo: add an anon_inode interface to do
> > +				 * this.  Appears to be missing by lack of
> > +				 * need rather than explicitly prevented.
> > +				 * Now there's need. */
> > +				file->f_mode |= (FMODE_LSEEK |
> > +						 FMODE_PREAD |
> > +						 FMODE_PWRITE);
> > +
> > +				fd_install(ret, file);
> > +
> > +				device->refcnt++;
> > +				goto out;
> > +			}
> > +		}
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +static long vfio_group_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_group *group = filep->private_data;
> > +
> > +	if (cmd == VFIO_GROUP_GET_FLAGS) {
> > +		u64 flags = 0;
> > +
> > +		mutex_lock(&vfio.lock);
> > +		if (__vfio_iommu_viable(group->iommu))
> > +			flags |= VFIO_GROUP_FLAGS_VIABLE;
> > +		mutex_unlock(&vfio.lock);
> > +
> > +		if (group->iommu->mm)
> > +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> > +
> > +		return put_user(flags, (u64 __user *)arg);
> > +	}
> > +		
> > +	/* Below commands are restricted once the mm is set */
> > +	if (group->iommu->mm && group->iommu->mm != current->mm)
> > +		return -EPERM;
> > +
> > +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> > +		int fd;
> > +		
> > +		if (get_user(fd, (int __user *)arg))
> > +			return -EFAULT;
> > +		if (fd < 0)
> > +			return -EINVAL;
> > +
> > +		if (cmd == VFIO_GROUP_MERGE)
> > +			return vfio_group_merge(group, fd);
> > +		else
> > +			return vfio_group_unmerge(group, fd);
> > +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> > +		return vfio_group_get_iommu_fd(group);
> > +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> > +		char *buf;
> > +		int ret;
> > +
> > +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> > +		if (IS_ERR(buf))
> > +			return PTR_ERR(buf);
> > +
> > +		ret = vfio_group_get_device_fd(group, buf);
> > +		kfree(buf);
> > +		return ret;
> > +	}
> > +
> > +	return -ENOSYS;
> > +}
> > +
> > +#ifdef CONFIG_COMPAT
> > +static long vfio_group_compat_ioctl(struct file *filep,
> > +				    unsigned int cmd, unsigned long arg)
> > +{
> > +	arg = (unsigned long)compat_ptr(arg);
> > +	return vfio_group_unl_ioctl(filep, cmd, arg);
> > +}
> > +#endif	/* CONFIG_COMPAT */
> > +
> > +static const struct file_operations vfio_group_fops = {
> > +	.owner		= THIS_MODULE,
> > +	.open		= vfio_group_open,
> > +	.release	= vfio_group_release,
> > +	.unlocked_ioctl	= vfio_group_unl_ioctl,
> > +#ifdef CONFIG_COMPAT
> > +	.compat_ioctl	= vfio_group_compat_ioctl,
> > +#endif
> > +};
> > +
> > +/* iommu fd release hook */
> > +int vfio_release_iommu(struct vfio_iommu *iommu)
> > +{
> > +	return vfio_do_release(&iommu->refcnt, iommu);
> > +}
> > +
> > +/*
> > + * VFIO driver API
> > + */
> > +
> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks.  This is the entry point for vfio drivers to register devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> > +{
> > +	struct list_head *pos;
> > +	struct vfio_group *group = NULL;
> > +	struct vfio_device *device = NULL;
> > +	unsigned int groupid;
> > +	int ret = 0;
> > +	bool new_group = false;
> > +
> > +	if (!ops)
> > +		return -EINVAL;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	list_for_each(pos, &vfio.group_list) {
> > +		group = list_entry(pos, struct vfio_group, group_next);
> > +		if (group->groupid == groupid)
> > +			break;
> > +		group = NULL;
> > +	}
> > +
> > +	if (!group) {
> > +		int minor;
> > +
> > +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +		if (!group) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group->groupid = groupid;
> > +		INIT_LIST_HEAD(&group->device_list);
> > +
> > +		ret = idr_get_new(&vfio.idr, group, &minor);
> > +		if (ret == 0 && minor > MINORMASK) {
> > +			idr_remove(&vfio.idr, minor);
> > +			kfree(group);
> > +			ret = -ENOSPC;
> > +			goto out;
> > +		}
> > +
> > +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > +		device_create(vfio.class, NULL, group->devt,
> > +			      group, "%u", groupid);
> > +
> > +		group->bus = dev->bus;
> 
> 
> Oh, so that is how the IOMMU iommu_ops get copied! You might
> want to mention that - I was not sure where the 'handoff' is
> was done to insert a device so that it can do iommu_ops properly.
> 
> Ok, so the time when a device is detected whether it can do
> IOMMU is when we try to open it - as that is when iommu_domain_alloc
> is called which can return NULL if the iommu_ops is not set.
> 
> So what about devices that don't have an iommu_ops? Say they
> are using SWIOTLB? (like the AMD-Vi sometimes does if the
> device is not on its list).
> 
> Can we use iommu_present?

I'm not sure I'm following your revelation ;)  Take a look at the
pointer to iommu_device_group I pasted above, or these:

https://github.com/awilliam/linux-vfio/commit/37dd08c90d149caaed7779d4f38850a8f7ed0fa5
https://github.com/awilliam/linux-vfio/commit/63ca8543533d8130db23d7949133e548c3891c97
https://github.com/awilliam/linux-vfio/commit/8d7d70eb8e714fbf8710848a06f8cab0c741631e

That call includes an iommu_present() check, so if there's no iommu or
the iommu can't provide a groupid, the device is skipped over from vfio
(can't be used).

So the ordering is:

 - bus driver registers device
   - if it has an iommu group, add it to the vfio device/group tracking

 - group gets opened
   - user gets iommu or device fd results in iommu_domain_alloc

Devices without iommu_ops don't get to play in the vfio world.
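
To make that flow concrete, a bus driver ends up using the API roughly
like this (just a sketch -- the foo_* names are placeholders and error
handling is trimmed):

	/* device discovered: register it and its iommu group with vfio */
	ret = vfio_group_add_dev(dev, &foo_vfio_device_ops);

	/* vfio device driver binds: device becomes viable for the group */
	ret = vfio_bind_dev(dev, foo_device_data);

	/* teardown mirrors it; unbind may block while the group is in use */
	foo_device_data = vfio_unbind_dev(dev);
	vfio_group_del_dev(dev);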

> > +		list_add(&group->group_next, &vfio.group_list);
> > +		new_group = true;
> > +	} else {
> > +		if (group->bus != dev->bus) {
> > +			printk(KERN_WARNING
> > +			       "Error: IOMMU group ID conflict.  Group ID %u "
> > +				"on both bus %s and %s\n", groupid,
> > +				group->bus->name, dev->bus->name);
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> > +
> > +		list_for_each(pos, &group->device_list) {
> > +			device = list_entry(pos,
> > +					    struct vfio_device, device_next);
> > +			if (device->dev == dev)
> > +				break;
> > +			device = NULL;
> > +		}
> > +	}
> > +
> > +	if (!device) {
> > +		if (__vfio_group_devs_inuse(group) ||
> > +		    (group->iommu && group->iommu->refcnt)) {
> > +			printk(KERN_WARNING
> > +			       "Adding device %s to group %u while group is already in use!!\n",
> > +			       dev_name(dev), group->groupid);
> > +			/* XXX How to prevent other drivers from claiming? */
> > +		}
> > +
> > +		device = kzalloc(sizeof(*device), GFP_KERNEL);
> > +		if (!device) {
> > +			/* If we just created this group, tear it down */
> > +			if (new_group) {
> > +				list_del(&group->group_next);
> > +				device_destroy(vfio.class, group->devt);
> > +				idr_remove(&vfio.idr, MINOR(group->devt));
> > +				kfree(group);
> > +			}
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		list_add(&device->device_next, &group->device_list);
> > +		device->dev = dev;
> > +		device->ops = ops;
> > +		device->iommu = group->iommu; /* NULL if new */
> > +		__vfio_iommu_attach_dev(group->iommu, device);
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> > +
> > +/* Remove a device from the vfio framework */
> > +void vfio_group_del_dev(struct device *dev)
> > +{
> > +	struct list_head *pos;
> > +	struct vfio_group *group = NULL;
> > +	struct vfio_device *device = NULL;
> > +	unsigned int groupid;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	list_for_each(pos, &vfio.group_list) {
> > +		group = list_entry(pos, struct vfio_group, group_next);
> > +		if (group->groupid == groupid)
> > +			break;
> > +		group = NULL;
> > +	}
> > +
> > +	if (!group)
> > +		goto out;
> > +
> > +	list_for_each(pos, &group->device_list) {
> > +		device = list_entry(pos, struct vfio_device, device_next);
> > +		if (device->dev == dev)
> > +			break;
> > +		device = NULL;
> > +	}
> > +
> > +	if (!device)
> > +		goto out;
> > +
> > +	BUG_ON(device->refcnt);
> > +
> > +	if (device->attached)
> > +		__vfio_iommu_detach_dev(group->iommu, device);
> > +
> > +	list_del(&device->device_next);
> > +	kfree(device);
> > +
> > +	/* If this was the only device in the group, remove the group.
> > +	 * Note that we intentionally unmerge empty groups here if the
> > +	 * group fd isn't opened. */
> > +	if (list_empty(&group->device_list) && group->refcnt == 0) {
> > +		struct vfio_iommu *iommu = group->iommu;
> > +
> > +		if (iommu) {
> > +			__vfio_group_set_iommu(group, NULL);
> > +			__vfio_try_dissolve_iommu(iommu);
> > +		}
> > +
> > +		device_destroy(vfio.class, group->devt);
> > +		idr_remove(&vfio.idr, MINOR(group->devt));
> > +		list_del(&group->group_next);
> > +		kfree(group);
> > +	}
> > +out:
> > +	mutex_unlock(&vfio.lock);
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> > +
> > +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> > + * entry point is used to mark the device usable (viable).  The vfio
> > + * device driver associates a private device_data struct with the device
> > + * here, which will later be returned for vfio_device_fops callbacks. */
> > +int vfio_bind_dev(struct device *dev, void *device_data)
> > +{
> > +	struct vfio_device *device;
> > +	int ret = -EINVAL;
> > +
> > +	BUG_ON(!device_data);
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	device = __vfio_lookup_dev(dev);
> > +
> > +	BUG_ON(!device);
> > +
> > +	ret = dev_set_drvdata(dev, device);
> > +	if (!ret)
> > +		device->device_data = device_data;
> > +
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> > +
> > +/* A device is only removeable if the iommu for the group is not in use. */
> > +static bool vfio_device_removeable(struct vfio_device *device)
> > +{
> > +	bool ret = true;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (device->iommu && __vfio_iommu_inuse(device->iommu))
> > +		ret = false;
> > +
> > +	mutex_unlock(&vfio.lock);
> > +	return ret;
> > +}
> > +
> > +/* Notify vfio that a device is being unbound from the vfio device driver
> > + * and return the device private device_data pointer.  If the group is
> > + * in use, we need to block or take other measures to make it safe for
> > + * the device to be removed from the iommu. */
> > +void *vfio_unbind_dev(struct device *dev)
> > +{
> > +	struct vfio_device *device = dev_get_drvdata(dev);
> > +	void *device_data;
> > +
> > +	BUG_ON(!device);
> > +
> > +again:
> > +	if (!vfio_device_removeable(device)) {
> > +		/* XXX signal for all devices in group to be removed or
> > +		 * resort to killing the process holding the device fds.
> > +		 * For now just block waiting for releases to wake us. */
> > +		wait_event(vfio.release_q, vfio_device_removeable(device));
> > +	}
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	/* Need to re-check that the device is still removeable under lock. */
> > +	if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> > +		mutex_unlock(&vfio.lock);
> > +		goto again;
> > +	}
> > +
> > +	device_data = device->device_data;
> > +
> > +	device->device_data = NULL;
> > +	dev_set_drvdata(dev, NULL);
> > +
> > +	mutex_unlock(&vfio.lock);
> > +	return device_data;
> > +}
> > +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> > +
> > +/*
> > + * Module/class support
> > + */
> > +static void vfio_class_release(struct kref *kref)
> > +{
> > +	class_destroy(vfio.class);
> > +	vfio.class = NULL;
> > +}
> > +
> > +static char *vfio_devnode(struct device *dev, mode_t *mode)
> > +{
> > +	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> > +}
> > +
> > +static int __init vfio_init(void)
> > +{
> > +	int ret;
> > +
> > +	idr_init(&vfio.idr);
> > +	mutex_init(&vfio.lock);
> > +	INIT_LIST_HEAD(&vfio.group_list);
> > +	init_waitqueue_head(&vfio.release_q);
> > +
> > +	kref_init(&vfio.kref);
> > +	vfio.class = class_create(THIS_MODULE, "vfio");
> > +	if (IS_ERR(vfio.class)) {
> > +		ret = PTR_ERR(vfio.class);
> > +		goto err_class;
> > +	}
> > +
> > +	vfio.class->devnode = vfio_devnode;
> > +
> > +	/* FIXME - how many minors to allocate... all of them! */
> > +	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> > +	if (ret)
> > +		goto err_chrdev;
> > +
> > +	cdev_init(&vfio.cdev, &vfio_group_fops);
> > +	ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> > +	if (ret)
> > +		goto err_cdev;
> > +
> > +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> > +
> > +	return 0;
> > +
> > +err_cdev:
> > +	unregister_chrdev_region(vfio.devt, MINORMASK);
> > +err_chrdev:
> > +	kref_put(&vfio.kref, vfio_class_release);
> > +err_class:
> > +	return ret;
> > +}
> > +
> > +static void __exit vfio_cleanup(void)
> > +{
> > +	struct list_head *gpos, *gppos;
> > +
> > +	list_for_each_safe(gpos, gppos, &vfio.group_list) {
> > +		struct vfio_group *group;
> > +		struct list_head *dpos, *dppos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, group_next);
> > +
> > +		list_for_each_safe(dpos, dppos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +			vfio_group_del_dev(device->dev);
> > +		}
> > +	}
> > +
> > +	idr_destroy(&vfio.idr);
> > +	cdev_del(&vfio.cdev);
> > +	unregister_chrdev_region(vfio.devt, MINORMASK);
> > +	kref_put(&vfio.kref, vfio_class_release);
> > +}
> > +
> > +module_init(vfio_init);
> > +module_exit(vfio_cleanup);
> > +
> > +MODULE_VERSION(DRIVER_VERSION);
> > +MODULE_LICENSE("GPL v2");
> > +MODULE_AUTHOR(DRIVER_AUTHOR);
> > +MODULE_DESCRIPTION(DRIVER_DESC);
> > diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> > new file mode 100644
> > index 0000000..350ad67
> > --- /dev/null
> > +++ b/drivers/vfio/vfio_private.h
> > @@ -0,0 +1,34 @@
> > +/*
> > + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> > + *     Author: Alex Williamson <alex.williamson@redhat.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + *
> > + * Derived from original vfio:
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + */
> > +
> > +#include <linux/list.h>
> > +#include <linux/mutex.h>
> > +
> > +#ifndef VFIO_PRIVATE_H
> > +#define VFIO_PRIVATE_H
> > +
> > +struct vfio_iommu {
> > +	struct iommu_domain		*domain;
> > +	struct bus_type			*bus;
> > +	struct mutex			dgate;
> > +	struct list_head		dm_list;
> > +	struct mm_struct		*mm;
> > +	struct list_head		group_list;
> > +	int				refcnt;
> > +	bool				cache;
> > +};
> > +
> > +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> > +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> > +
> > +#endif /* VFIO_PRIVATE_H */
> > diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> > new file mode 100644
> > index 0000000..4269b08
> > --- /dev/null
> > +++ b/include/linux/vfio.h
> > @@ -0,0 +1,155 @@
> > +/*
> > + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> > + * Author: Tom Lyon, pugs@cisco.com
> > + *
> > + * This program is free software; you may redistribute it and/or modify
> > + * it under the terms of the GNU General Public License as published by
> > + * the Free Software Foundation; version 2 of the License.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> > + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> > + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> > + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> > + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> > + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> > + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > + * SOFTWARE.
> > + *
> > + * Portions derived from drivers/uio/uio.c:
> > + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> > + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> > + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> > + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> > + *
> > + * Portions derived from drivers/uio/uio_pci_generic.c:
> > + * Copyright (C) 2009 Red Hat, Inc.
> > + * Author: Michael S. Tsirkin <mst@redhat.com>
> > + */
> > +#include <linux/types.h>
> > +
> > +#ifndef VFIO_H
> > +#define VFIO_H
> > +
> > +#ifdef __KERNEL__
> > +
> > +struct vfio_device_ops {
> > +	bool			(*match)(struct device *, char *);
> > +	int			(*get)(void *);
> > +	void			(*put)(void *);
> > +	ssize_t			(*read)(void *, char __user *,
> > +					size_t, loff_t *);
> > +	ssize_t			(*write)(void *, const char __user *,
> > +					 size_t, loff_t *);
> > +	long			(*ioctl)(void *, unsigned int, unsigned long);
> > +	int			(*mmap)(void *, struct vm_area_struct *);
> > +};
> > +
> > +extern int vfio_group_add_dev(struct device *device,
> > +			      const struct vfio_device_ops *ops);
> > +extern void vfio_group_del_dev(struct device *device);
> > +extern int vfio_bind_dev(struct device *device, void *device_data);
> > +extern void *vfio_unbind_dev(struct device *device);
> > +
> > +#endif /* __KERNEL__ */
> > +
> > +/*
> > + * VFIO driver - allow mapping and use of certain devices
> > + * in unprivileged user processes. (If IOMMU is present)
> > + * Especially useful for Virtual Function parts of SR-IOV devices
> > + */
> > +
> > +
> > +/* Kernel & User level defines for ioctls */
> > +
> > +#define VFIO_GROUP_GET_FLAGS		_IOR(';', 100, __u64)
> 
> > + #define VFIO_GROUP_FLAGS_VIABLE	(1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
> > +#define VFIO_GROUP_MERGE		_IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE		_IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD		_IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 104, char *)
> > +
> > +/*
> > + * Structure for DMA mapping of user buffers
> > + * vaddr, dmaaddr, and size must all be page aligned
> > + */
> > +struct vfio_dma_map {
> > +	__u64	len;		/* length of structure */
> > +	__u64	vaddr;		/* process virtual addr */
> > +	__u64	dmaaddr;	/* desired and/or returned dma address */
> > +	__u64	size;		/* size in bytes */
> > +	__u64	flags;
> > +#define	VFIO_DMA_MAP_FLAG_WRITE		(1 << 0) /* req writeable DMA mem */
> > +};
> > +
> > +#define	VFIO_IOMMU_GET_FLAGS		_IOR(';', 105, __u64)
> > + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY	(1 << 0)
> > +#define	VFIO_IOMMU_MAP_DMA		_IOWR(';', 106, struct vfio_dma_map)
> > +#define	VFIO_IOMMU_UNMAP_DMA		_IOWR(';', 107, struct vfio_dma_map)
> > +
> > +#define VFIO_DEVICE_GET_FLAGS		_IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI		(1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT		(1 << 1)
> > + #define VFIO_DEVICE_FLAGS_RESET	(1 << 2)
> > +#define VFIO_DEVICE_GET_NUM_REGIONS	_IOR(';', 109, int)
> > +
> > +struct vfio_region_info {
> > +	__u32	len;		/* length of structure */
> > +	__u32	index;		/* region number */
> > +	__u64	size;		/* size in bytes of region */
> > +	__u64	offset;		/* start offset of region */
> > +	__u64	flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP		(1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO		(1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID	(1 << 2)
> > +	__u64	phys;		/* physical address of region */
> > +};
> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO	_IOWR(';', 110, struct vfio_region_info)
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS	_IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > +	__u32	len;		/* length of structure */
> > +	__u32	index;		/* IRQ number */
> > +	__u32	count;		/* number of individual IRQs */
> > +	__u32	flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL		(1 << 0)
> > +};
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO	_IOWR(';', 112, struct vfio_irq_info)
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS	_IOW(';', 113, int)
> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ		_IOW(';', 114, int)
> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD	_IOW(';', 115, int)
> > +
> > +#define VFIO_DEVICE_RESET		_IO(';', 116)
> > +
> > +struct vfio_dtpath {
> > +	__u32	len;		/* length of structure */
> > +	__u32	index;
> > +	__u64	flags;
> > +#define VFIO_DTPATH_FLAGS_REGION	(1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ		(1 << 1)
> > +	char	*path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH		_IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > +	__u32	len;		/* length of structure */
> > +	__u32	index;
> > +	__u32	prop_type;
> > +	__u32	prop_index;
> > +	__u64	flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION	(1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ		(1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX		_IOWR(';', 118, struct vfio_dtindex)
> > +
> > +#endif /* VFIO_H */
> 
> 
> So where is the vfio-pci? Is that a seperate posting?

You can find it in the tree pointed to in the patch description:

https://github.com/awilliam/linux-vfio/commit/534725d327e2b7791a229ce72d2ae8a62ee0a4e5

I was hoping to get some consensus around the new core before spending
too much time polishing up the bus driver.  Thanks for the review, it's
very much appreciated!

Alex
Christian Benvenuti - Nov. 11, 2011, 10:22 p.m.
> -----Original Message-----
> From: Alex Williamson [mailto:alex.williamson@redhat.com]
> Sent: Friday, November 11, 2011 10:04 AM
> To: Christian Benvenuti (benve)
> Cc: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;
> dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Aaron Fabbri
> (aafabbri); B08248@freescale.com; B07421@freescale.com; avi@redhat.com;
> konrad.wilk@oracle.com; kvm@vger.kernel.org; qemu-devel@nongnu.org;
> iommu@lists.linux-foundation.org; linux-pci@vger.kernel.org
> Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
>
> On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> > Here are few minor comments on vfio_iommu.c ...
>
> Sorry, I've been poking sticks at trying to figure out a clean way to
> solve the force vfio driver attach problem.

Attach or detach?

> > > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > > new file mode 100644
> > > index 0000000..029dae3
> > > --- /dev/null
> > > +++ b/drivers/vfio/vfio_iommu.c
> <snip>
> > > +
> > > +#include "vfio_private.h"
> >
> > Doesn't the 'dma_'  prefix belong to the generic DMA code?
>
> Sure, we could make these more vfio-centric.

Like vfio_dma_map_page?

>
> > > +struct dma_map_page {
> > > +	struct list_head	list;
> > > +	dma_addr_t		daddr;
> > > +	unsigned long		vaddr;
> > > +	int			npage;
> > > +	int			rdwr;
> > > +};
> > > +
> > > +/*
> > > + * This code handles mapping and unmapping of user data buffers
> > > + * into DMA'ble space using the IOMMU
> > > + */
> > > +
> > > +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> > > +
> > > +struct vwork {
> > > +	struct mm_struct	*mm;
> > > +	int			npage;
> > > +	struct work_struct	work;
> > > +};
> > > +
> > > +/* delayed decrement for locked_vm */
> > > +static void vfio_lock_acct_bg(struct work_struct *work)
> > > +{
> > > +	struct vwork *vwork = container_of(work, struct vwork, work);
> > > +	struct mm_struct *mm;
> > > +
> > > +	mm = vwork->mm;
> > > +	down_write(&mm->mmap_sem);
> > > +	mm->locked_vm += vwork->npage;
> > > +	up_write(&mm->mmap_sem);
> > > +	mmput(mm);		/* unref mm */
> > > +	kfree(vwork);
> > > +}
> > > +
> > > +static void vfio_lock_acct(int npage)
> > > +{
> > > +	struct vwork *vwork;
> > > +	struct mm_struct *mm;
> > > +
> > > +	if (!current->mm) {
> > > +		/* process exited */
> > > +		return;
> > > +	}
> > > +	if (down_write_trylock(&current->mm->mmap_sem)) {
> > > +		current->mm->locked_vm += npage;
> > > +		up_write(&current->mm->mmap_sem);
> > > +		return;
> > > +	}
> > > +	/*
> > > +	 * Couldn't get mmap_sem lock, so must setup to decrement
> >                                                       ^^^^^^^^^
> >
> > Increment?
>
> Yep
>
> <snip>
> > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > > start,
> > > +			    size_t size, struct dma_map_page *mlp)
> > > +{
> > > +	struct dma_map_page *split;
> > > +	int npage_lo, npage_hi;
> > > +
> > > +	/* Existing dma region is completely covered, unmap all */
> >
> > This works. However, given how vfio_dma_map_dm implements the merging
> > logic, I think it is impossible to have
> >
> >     (start < mlp->daddr &&
> >      start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
>
> It's quite possible.  This allows userspace to create a sparse mapping,
> then blow it all away with a single unmap from 0 to ~0.


I would prefer the user to use exact ranges in the unmap operations
because it would make it easier to detect bugs/leaks in the map/unmap
logic used by the callers.
My assumptions are that:

- the user always keeps track of the mappings

- the user either unmaps one specific mapping or 'all of them'.
  The 'all of them' case would also take care of those cases where
  the user does _not_ keep track of mappings and simply uses
  the "unmap from 0 to ~0" each time.

Because of this you could still provide an exact map/unmap logic
and allow such "unmap from 0 to ~0" by making the latter a special
case.
However, if we want to allow any arbitrary/inexact unmap request, then OK.

> > > +	if (start <= mlp->daddr &&
> > > +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > > +		list_del(&mlp->list);
> > > +		npage_lo = mlp->npage;
> > > +		kfree(mlp);
> > > +		return npage_lo;
> > > +	}
> > > +
> > > +	/* Overlap low address of existing range */
> >
> > Same as above (ie, '<' is impossible)
>
> existing:   |<--- A --->|      |<--- B --->|
> unmap:                |<--- C --->|
>
> Maybe not good practice from userspace, but we shouldn't count on
> userspace to be well behaved.
>
> > > +	if (start <= mlp->daddr) {
> > > +		size_t overlap;
> > > +
> > > +		overlap = start + size - mlp->daddr;
> > > +		npage_lo = overlap >> PAGE_SHIFT;
> > > +		npage_hi = mlp->npage - npage_lo;
> > > +
> > > +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > > +		mlp->daddr += overlap;
> > > +		mlp->vaddr += overlap;
> > > +		mlp->npage -= npage_lo;
> > > +		return npage_lo;
> > > +	}
> >
> > Same as above (ie, '>' is impossible).
>
> Same example as above.
>
> > > +	/* Overlap high address of existing range */
> > > +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > > +		size_t overlap;
> > > +
> > > +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > > +		npage_hi = overlap >> PAGE_SHIFT;
> > > +		npage_lo = mlp->npage - npage_hi;
> > > +
> > > +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > > +		mlp->npage -= npage_hi;
> > > +		return npage_hi;
> > > +	}
> <snip>
> > > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map
> > > *dmp)
> > > +{
> > > +	int npage;
> > > +	struct dma_map_page *mlp, *mmlp = NULL;
> > > +	dma_addr_t daddr = dmp->dmaaddr;
> > > +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > > +	size_t size = dmp->size;
> > > +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > > +
> > > +	if (vaddr & (PAGE_SIZE-1))
> > > +		return -EINVAL;
> > > +	if (daddr & (PAGE_SIZE-1))
> > > +		return -EINVAL;
> > > +	if (size & (PAGE_SIZE-1))
> > > +		return -EINVAL;
> > > +
> > > +	npage = size >> PAGE_SHIFT;
> > > +	if (!npage)
> > > +		return -EINVAL;
> > > +
> > > +	if (!iommu)
> > > +		return -EINVAL;
> > > +
> > > +	mutex_lock(&iommu->dgate);
> > > +
> > > +	if (vfio_find_dma(iommu, daddr, size)) {
> > > +		ret = -EBUSY;
> > > +		goto out_lock;
> > > +	}
> > > +
> > > +	/* account for locked pages */
> > > +	locked = current->mm->locked_vm + npage;
> > > +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > > +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > > +			__func__, rlimit(RLIMIT_MEMLOCK));
> > > +		ret = -ENOMEM;
> > > +		goto out_lock;
> > > +	}
> > > +
> > > +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > > +	if (ret)
> > > +		goto out_lock;
> > > +
> > > +	/* Check if we abut a region below */
> >
> > Is !daddr possible?
>
> Sure, an IOVA of 0x0.  There's no region below if we start at zero.
>
> > > +	if (daddr) {
> > > +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > > +		if (mlp && mlp->rdwr == rdwr &&
> > > +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > > +
> > > +			mlp->npage += npage;
> > > +			daddr = mlp->daddr;
> > > +			vaddr = mlp->vaddr;
> > > +			npage = mlp->npage;
> > > +			size = NPAGE_TO_SIZE(npage);
> > > +
> > > +			mmlp = mlp;
> > > +		}
> > > +	}
> >
> > Is !(daddr + size) possible?
>
> Same, there's no region above if this region goes to the top of the
> address space, ie. 0xffffffff_fffff000 + 0x1000
>
> Hmm, wonder if I'm missing a check for wrapping.
>
> > > +	if (daddr + size) {
> > > +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> > > +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size)
> > > {
> > > +
> > > +			mlp->npage += npage;
> > > +			mlp->daddr = daddr;
> > > +			mlp->vaddr = vaddr;
> > > +
> > > +			/* If merged above and below, remove previously
> > > +			 * merged entry.  New entry covers it.  */
> > > +			if (mmlp) {
> > > +				list_del(&mmlp->list);
> > > +				kfree(mmlp);
> > > +			}
> > > +			mmlp = mlp;
> > > +		}
> > > +	}
> > > +
> > > +	if (!mmlp) {
> > > +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> > > +		if (!mlp) {
> > > +			ret = -ENOMEM;
> > > +			vfio_dma_unmap(iommu, daddr, npage, rdwr);
> > > +			goto out_lock;
> > > +		}
> > > +
> > > +		mlp->npage = npage;
> > > +		mlp->daddr = daddr;
> > > +		mlp->vaddr = vaddr;
> > > +		mlp->rdwr = rdwr;
> > > +		list_add(&mlp->list, &iommu->dm_list);
> > > +	}
> > > +
> > > +out_lock:
> > > +	mutex_unlock(&iommu->dgate);
> > > +	return ret;
> > > +}
> > > +
> > > +static int vfio_iommu_release(struct inode *inode, struct file
> *filep)
> > > +{
> > > +	struct vfio_iommu *iommu = filep->private_data;
> > > +
> > > +	vfio_release_iommu(iommu);
> > > +	return 0;
> > > +}
> > > +
> > > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > > +				 unsigned int cmd, unsigned long arg)
> > > +{
> > > +	struct vfio_iommu *iommu = filep->private_data;
> > > +	int ret = -ENOSYS;
> >
> > Any reason for not using "switch" ?
>
> It got ugly in vfio_main, so I decided to be consistent w/ it in the
> driver and use if/else here too.  I don't like the aesthetics of extra
> {}s to declare variables within a switch, nor do I like declaring all
> the variables for each case for the whole function.  Personal quirk.
>
> > > +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > > +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > > +
> > > +                ret = put_user(flags, (u64 __user *)arg);
> > > +
> > > +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > > +		struct vfio_dma_map dm;
> > > +
> > > +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > > +			return -EFAULT;
> >
> > What does the "_dm" suffix stand for?
>
> Inherited from Tom, but I figure _dma_map_dm = action(dma map),
> object(dm), which is a vfio_Dma_Map.


OK. The reason why I asked is that '_dm' does not add anything to 'vfio_dma_map'.

/Chris
Scott Wood - Nov. 12, 2011, 12:14 a.m.
On 11/03/2011 03:12 PM, Alex Williamson wrote:
> +Many modern systems now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown).  

Maybe replace "(technology name unknown)" with "(such as Freescale chips
with PAMU)" or similar?

Or just leave out the parenthetical.

> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)

This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
pointer to char rather than a pointer to an array of char (just as e.g.
VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)

What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
an implementation supposed to add a new flag that describes its
restrictions?

Can we get a way to turn DMA access off and on, short of unmapping
everything, and then mapping it again?

> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> +        __u64   len;            /* length of structure */
> +        __u64   vaddr;          /* process virtual addr */
> +        __u64   dmaaddr;        /* desired and/or returned dma address */
> +        __u64   size;           /* size in bytes */
> +        __u64   flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */
> +};

What are the semantics of "desired and/or returned dma address"?

Are we always supposed to provide a desired address, but it may be
different on return?  Or are there cases where we want to say "give me
whatever you want" or "give me this or fail"?

How much of this needs to be filled out for unmap?

Note that the "length of structure" approach means that ioctl numbers
will change whenever this grows -- perhaps we should avoid encoding the
struct size into these ioctls?

> +struct vfio_region_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* region number */
> +        __u64   size;           /* size in bytes of region */
> +        __u64   offset;         /* start offset of region */
> +        __u64   flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)
> +        __u64   phys;           /* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> +available access types and validity of optional fields.  For instance
> +the phys field may only be valid for certain device types.
> +
> +Interrupts are described using a similar interface.  GET_NUM_IRQS
> +reports the number of IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* IRQ number */
> +        __u32   count;          /* number of individual IRQs */
> +        __u64   flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)

Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
not do this, causing problems if the kernel is 64-bit and thus assumes a
different layout.

> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs.  This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host.  This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system.

It's usually necessary even in the case of responsive userspace, just to
get to the point where userspace can execute (ignoring cases where
userspace runs on one core while the interrupt storms another).

For edge interrupts, will we mask if an interrupt comes in and the
previous interrupt hasn't been read out yet (and then unmask when the
last interrupt gets read out), to isolate us from a rapidly firing
interrupt source that userspace can't keep up with?

> +Device tree devices also include ioctls for further defining the
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> +        __u32   len;            /* length of structure */
> +        __u32   index;
> +        __u64   flags;
> +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> +        char    *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)

Where is length of buffer (and description of associated semantics)?

> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);

const char *?

> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};

When defining an API, please do not omit parameter names.

Should specify what the driver is supposed to do with get/put -- I guess
not try to unbind when the count is nonzero?  Races could still lead the
unbinder to be blocked, but I guess it lets the driver know when it's
likely to succeed.

> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> +	tristate "VFIO Non-Privileged userspace driver framework"
> +	depends on IOMMU_API
> +	help
> +	  VFIO provides a framework for secure userspace device drivers.
> +	  See Documentation/vfio.txt for more details.
> +
> +	  If you don't know what to do here, say N.

Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
would still be useful for devices which don't do DMA, or where we accept
the lack of protection/translation (e.g. we have a customer that wants
to do KVM device assignment on one of our lower-end chips that lacks an
IOMMU).

> +struct dma_map_page {
> +	struct list_head	list;
> +	dma_addr_t		daddr;
> +	unsigned long		vaddr;
> +	int			npage;
> +	int			rdwr;
> +};

npage should be long.

What is "rdwr"?  non-zero for write?  non-zero for read? :-)
is_write would be a better name.

> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> +		unsigned long pfn = 0;
> +
> +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> +		if (ret) {
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +
> +		/* Only add actual locked pages to accounting */
> +		if (!is_invalid_reserved_pfn(pfn))
> +			locked++;
> +
> +		ret = iommu_map(iommu->domain, iova,
> +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> +		if (ret) {
> +			/* Back out mappings on error */
> +			put_pfn(pfn, rdwr);
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +	}

There's no way to hand this stuff to the IOMMU driver in chunks larger
than a page?  That's going to be a problem for our IOMMU, which wants to
deal with large windows.

> +	vfio_lock_acct(locked);
> +	return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
> +				 unsigned long start2, size_t size2)
> +{
> +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> +}

You pass DMA addresses to this, so use dma_addr_t.  unsigned long is not
always large enough.

What if one of the ranges wraps around (including the legitimate
possibility of start + size == 0)?
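
E.g., something end-inclusive along these lines would avoid forming
start + size at all (just a sketch, not tested):

	static bool ranges_overlap(dma_addr_t start1, size_t size1,
				   dma_addr_t start2, size_t size2)
	{
		/* zero-length ranges never overlap anything */
		if (!size1 || !size2)
			return false;
		/* compare inclusive ends so start + size == 0 still works */
		return start1 <= start2 + (size2 - 1) &&
		       start2 <= start1 + (size1 - 1);
	}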

> +static long vfio_iommu_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_iommu *iommu = filep->private_data;
> +	int ret = -ENOSYS;

-ENOIOCTLCMD or -ENOTTY?

> +
> +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> +                ret = put_user(flags, (u64 __user *)arg);
> +
> +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> +		struct vfio_dma_map dm;

Whitespace.

Any reason not to use a switch?

> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
[snip]
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
[snip]
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
[snip]
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> +				   struct vfio_iommu *iommu)

...and so on.

Why all the leading underscores?  Doesn't look like you're trying to
distinguish between this and a more public version with the same name.

> +/* Get a new device file descriptor.  This will open the iommu, setting
> + * the current->mm ownership if it's not already set.  It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match.  For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */
> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> +	struct vfio_iommu *iommu = group->iommu;
> +	struct list_head *gpos;
> +	int ret = -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (!iommu->domain) {
> +		ret = __vfio_open_iommu(iommu);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	list_for_each(gpos, &iommu->group_list) {
> +		struct list_head *dpos;
> +
> +		group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (device->ops->match(device->dev, buf)) {

If there's a match, we're done with the loop -- might as well break out
now rather than indent everything else.

> +				struct file *file;
> +
> +				if (device->ops->get(device->device_data)) {
> +					ret = -EFAULT;
> +					goto out;
> +				}

Why does a failure of get() result in -EFAULT?  -EFAULT is for bad user
addresses.

> +
> +				/* We can't use anon_inode_getfd(), like above
> +				 * because we need to modify the f_mode flags
> +				 * directly to allow more than just ioctls */
> +				ret = get_unused_fd();
> +				if (ret < 0) {
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}
> +
> +				file = anon_inode_getfile("[vfio-device]",
> +							  &vfio_device_fops,
> +							  device, O_RDWR);
> +				if (IS_ERR(file)) {
> +					put_unused_fd(ret);
> +					ret = PTR_ERR(file);
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}

Maybe cleaner with goto-based error management?

> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks.  This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> +	struct list_head *pos;
> +	struct vfio_group *group = NULL;
> +	struct vfio_device *device = NULL;
> +	unsigned int groupid;
> +	int ret = 0;
> +	bool new_group = false;
> +
> +	if (!ops)
> +		return -EINVAL;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	list_for_each(pos, &vfio.group_list) {
> +		group = list_entry(pos, struct vfio_group, group_next);
> +		if (group->groupid == groupid)
> +			break;
> +		group = NULL;
> +	}

Factor this into vfio_dev_to_group() (and likewise for other such lookups)?

> +	if (!group) {
> +		int minor;
> +
> +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> +		if (!group) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group->groupid = groupid;
> +		INIT_LIST_HEAD(&group->device_list);
> +
> +		ret = idr_get_new(&vfio.idr, group, &minor);
> +		if (ret == 0 && minor > MINORMASK) {
> +			idr_remove(&vfio.idr, minor);
> +			kfree(group);
> +			ret = -ENOSPC;
> +			goto out;
> +		}
> +
> +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> +		device_create(vfio.class, NULL, group->devt,
> +			      group, "%u", groupid);
> +
> +		group->bus = dev->bus;
> +		list_add(&group->group_next, &vfio.group_list);

Factor out into vfio_create_group()?

> +		new_group = true;
> +	} else {
> +		if (group->bus != dev->bus) {
> +			printk(KERN_WARNING
> +			       "Error: IOMMU group ID conflict.  Group ID %u "
> +				"on both bus %s and %s\n", groupid,
> +				group->bus->name, dev->bus->name);
> +			ret = -EFAULT;
> +			goto out;
> +		}

It took me a little while to figure out that this was comparing bus
types, not actual bus instances (which would be an inappropriate
restriction). :-P

Still, isn't it what we really care about that it's the same IOMMU
domain?  Couldn't different bus types share an iommu_ops?

And again, -EFAULT isn't the right error.

-Scott
Alex Williamson - Nov. 14, 2011, 8:54 p.m.
On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > +Many modern systems now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).  
> 
> Maybe replace "(technology name unknown)" with "(such as Freescale chips
> with PAMU)" or similar?
> 
> Or just leave out the parenthetical.

I was hoping that comment would lead to an answer.  Thanks for the
info ;)

> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> > +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
> 
> This suggests the argument to VFIO_GROUP_GET_DEVICE_FD is a pointer to a
> pointer to char rather than a pointer to an array of char (just as e.g.
> VFIO_GROUP_MERGE takes a pointer to an int, not just an int).

I believe I was following the UI_SET_PHYS ioctl as an example, which is
defined as a char *.  I'll change to char and verify.
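
For reference, the intended flow from userspace looks something like the
following (the group number, device name, and other_group_fd below are
made-up examples, error handling omitted):

	int group, iommu, device;
	__u64 flags;

	group = open("/dev/vfio/26", O_RDWR);

	ioctl(group, VFIO_GROUP_GET_FLAGS, &flags);
	if (!(flags & VFIO_GROUP_FLAGS_VIABLE))
		return -1;	/* some group members not bound to vfio */

	/* optionally pull another group into the same iommu domain */
	ioctl(group, VFIO_GROUP_MERGE, &other_group_fd);

	iommu = ioctl(group, VFIO_GROUP_GET_IOMMU_FD);
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");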

> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)
> 
> What is the implication if VFIO_IOMMU_FLAGS_MAP_ANY is clear?  Is such
> an implementation supposed to add a new flag that describes its
> restrictions?

If MAP_ANY is clear then I would expect a new flag is set defining a new
mapping paradigm, probably with an ioctl to describe the
restrictions/parameters.  MAP_ANY effectively means there are no
restrictions.

> Can we get a way to turn DMA access off and on, short of unmapping
> everything, and then mapping it again?

iommu_ops doesn't support such an interface, so no, not currently.

> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.
> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > +        __u64   len;            /* length of structure */
> > +        __u64   vaddr;          /* process virtual addr */
> > +        __u64   dmaaddr;        /* desired and/or returned dma address */
> > +        __u64   size;           /* size in bytes */
> > +        __u64   flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */
> > +};
> 
> What are the semantics of "desired and/or returned dma address"?

I believe the original intention was that a user could leave dmaaddr
clear and let the iommu layer provide an iova address.  The iommu api
has since evolved and that mapping scheme really isn't present anymore.
We'll currently fail if we can't map the requested address.  I'll update
the docs to make that be the definition.

> Are we always supposed to provide a desired address, but it may be
> different on return?  Or are there cases where we want to say "give me
> whatever you want" or "give me this or fail"?

Exactly, that's what it used to be, but we don't really implement that
any more.

> How much of this needs to be filled out for unmap?

dmaaddr & size, will update docs.
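
So a map/unmap pair from userspace ends up looking roughly like this
(addresses and sizes are arbitrary page-aligned examples, buf is a page
aligned buffer obtained elsewhere):

	struct vfio_dma_map dm = {
		.len	 = sizeof(dm),
		.vaddr	 = (__u64)(unsigned long)buf,
		.dmaaddr = 0x100000,			/* requested IOVA */
		.size	 = 0x10000,
		.flags	 = VFIO_DMA_MAP_FLAG_WRITE,	/* device may write it */
	};

	ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm);
	/* ... device DMA ... */
	ioctl(iommu_fd, VFIO_IOMMU_UNMAP_DMA, &dm);	/* dmaaddr & size used */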

> Note that the "length of structure" approach means that ioctl numbers
> will change whenever this grows -- perhaps we should avoid encoding the
> struct size into these ioctls?

How so?  What's described here is effectively the base size.  If we
later add feature foo requiring additional fields, we set a flag, change
the size, and tack those fields onto the end.  The kernel side should
balk if the size doesn't match what it expects from the flags it
understands (which I think I probably need to be more strict about).
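
In other words, the check would be roughly (sketch only, stricter than
what the patch does today):

	if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
		return -EFAULT;
	/* only the WRITE flag is understood, so only the base size is valid */
	if (dm.flags & ~VFIO_DMA_MAP_FLAG_WRITE)
		return -EINVAL;
	if (dm.len != sizeof dm)
		return -EINVAL;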

> > +struct vfio_region_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* region number */
> > +        __u64   size;           /* size in bytes of region */
> > +        __u64   offset;         /* start offset of region */
> > +        __u64   flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)
> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)
> > +        __u64   phys;           /* physical address of region */
> > +};

In light of the above, this struct should not include phys.  In fact, I
should probably remove the PHYS_VALID flag as well until we have a bus
driver implementation that actually makes use of it.
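
Independent of phys, the way a region is meant to be consumed through
the offset (per the doc text quoted just below) is simply this userspace
sketch, with region 0 chosen arbitrarily:

	struct vfio_region_info info = { .len = sizeof(info), .index = 0 };
	char buf[16];

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);

	/* region contents are reached through the device fd at info.offset */
	pread(device_fd, buf, sizeof(buf), info.offset);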

> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> > +available access types and validity of optional fields.  For instance
> > +the phys field may only be valid for certain device types.
> > +
> > +Interrupts are described using a similar interface.  GET_NUM_IRQS
> > +reports the number of IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* IRQ number */
> > +        __u32   count;          /* number of individual IRQs */
> > +        __u64   flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> 
> Make sure flags is 64-bit aligned -- some 32-bit ABIs, such as x86, will
> not do this, causing problems if the kernel is 64-bit and thus assumes a
> different layout.

Shoot, I'll push flags up above count to get it aligned.
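
Something like this, presumably (the trailing pad is just one way to
keep the size the same on 32 and 64 bit, not final):

	struct vfio_irq_info {
	        __u32   len;            /* length of structure */
	        __u32   index;          /* IRQ number */
	        __u64   flags;          /* now naturally aligned everywhere */
	        __u32   count;          /* number of individual IRQs */
	        __u32   pad;
	};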

> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs.  This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)
> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host.  This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system.
> 
> It's usually necessary even in the case of responsive userspace, just to
> get to the point where userspace can execute (ignoring cases where
> userspace runs on one core while the interrupt storms another).

Right, I'll try to clarify.

> For edge interrupts, will we mask if an interrupt comes in and the
> previous interrupt hasn't been read out yet (and then unmask when the
> last interrupt gets read out), to isolate us from a rapidly firing
> interrupt source that userspace can't keep up with?

We don't do that currently and I haven't seen a need to.  Seems like
there'd be no API change needed to do that if we want to at some point.

> > +Device tree devices also invlude ioctls for further defining the
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;
> > +        __u64   flags;
> > +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> > +        char    *path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)
> 
> Where is length of buffer (and description of associated semantics)?

I think I should probably take the same approach as the phys field
above, leave it to the dt bus driver to add these ioctls and fields as
I'm almost certain to get it wrong trying to predict what it's going to
need.  Likewise, VFIO_DEVICE_FLAGS_PCI should be defined as part of the
pci bus driver patch, even though it doesn't need any extra
ioctls/fields.

> > +struct vfio_device_ops {
> > +	bool			(*match)(struct device *, char *);
> 
> const char *?

will fix

> > +	int			(*get)(void *);
> > +	void			(*put)(void *);
> > +	ssize_t			(*read)(void *, char __user *,
> > +					size_t, loff_t *);
> > +	ssize_t			(*write)(void *, const char __user *,
> > +					 size_t, loff_t *);
> > +	long			(*ioctl)(void *, unsigned int, unsigned long);
> > +	int			(*mmap)(void *, struct vm_area_struct *);
> > +};
> 
> When defining an API, please do not omit parameter names.

ok

> Should specify what the driver is supposed to do with get/put -- I guess
> not try to unbind when the count is nonzero?  Races could still lead the
> unbinder to be blocked, but I guess it lets the driver know when it's
> likely to succeed.

Right, for the pci bus driver, it's mainly for reference counting,
including the module_get to prevent vfio-pci from being unloaded.  On
the first get for a device, we also do a pci_enable() and pci_disable()
on last put.  I'll try to clarify in the docs.
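
For vfio-pci that looks roughly like the following (sketch only; the
vfio_pci_device structure and field names here are illustrative):

static int vfio_pci_get(void *device_data)
{
	struct vfio_pci_device *vdev = device_data;
	int ret = 0;

	if (!try_module_get(THIS_MODULE))
		return -ENODEV;

	/* enable the device on the first reference */
	if (atomic_inc_return(&vdev->refcnt) == 1) {
		ret = pci_enable_device(vdev->pdev);
		if (ret) {
			atomic_dec(&vdev->refcnt);
			module_put(THIS_MODULE);
		}
	}
	return ret;
}

static void vfio_pci_put(void *device_data)
{
	struct vfio_pci_device *vdev = device_data;

	/* disable the device when the last reference goes away */
	if (atomic_dec_and_test(&vdev->refcnt))
		pci_disable_device(vdev->pdev);

	module_put(THIS_MODULE);
}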

> > diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> > new file mode 100644
> > index 0000000..9acb1e7
> > --- /dev/null
> > +++ b/drivers/vfio/Kconfig
> > @@ -0,0 +1,8 @@
> > +menuconfig VFIO
> > +	tristate "VFIO Non-Privileged userspace driver framework"
> > +	depends on IOMMU_API
> > +	help
> > +	  VFIO provides a framework for secure userspace device drivers.
> > +	  See Documentation/vfio.txt for more details.
> > +
> > +	  If you don't know what to do here, say N.
> 
> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
> would still be useful for devices which don't do DMA, or where we accept
> the lack of protection/translation (e.g. we have a customer that wants
> to do KVM device assignment on one of our lower-end chips that lacks an
> IOMMU).

Ugh.  I'm not really onboard with it given that we're trying to sell
vfio as a secure user space driver interface with iommu-based
protection.  That said, vfio_iommu.c is already its own file, with the
thought that other platforms might need to manage the iommu differently.
Theoretically the IOMMU_API requirement could be tied specifically to
vfio_iommu and another iommu backend added.

> > +struct dma_map_page {
> > +	struct list_head	list;
> > +	dma_addr_t		daddr;
> > +	unsigned long		vaddr;
> > +	int			npage;
> > +	int			rdwr;
> > +};
> 
> npage should be long.

Seems like I went back and forth on that a couple of times; I'll see if I
can remember why I landed on int, or just change it.  Practically, int is "big
enough", but that's not a good answer.

> What is "rdwr"?  non-zero for write?  non-zero for read? :-)
> is_write would be a better name.

Others commented on this too; I'll switch to a bool and rename it so it's
obvious that it means write access is enabled.

> 
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > +		unsigned long pfn = 0;
> > +
> > +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > +		if (ret) {
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +
> > +		/* Only add actual locked pages to accounting */
> > +		if (!is_invalid_reserved_pfn(pfn))
> > +			locked++;
> > +
> > +		ret = iommu_map(iommu->domain, iova,
> > +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > +		if (ret) {
> > +			/* Back out mappings on error */
> > +			put_pfn(pfn, rdwr);
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +	}
> 
> There's no way to hand this stuff to the IOMMU driver in chunks larger
> than a page?  That's going to be a problem for our IOMMU, which wants to
> deal with large windows.

There is, this is just a simple implementation that maps individual
pages.  We "just" need to determine physically contiguous chunks and
mlock them instead of using get_user_pages.  The current implementation
is much like how KVM maps iommu pages, but there shouldn't be any user API
change needed to use larger chunks.  We want this for IOMMU large page
support too.
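
Something along these lines is what I have in mind for finding those
chunks (hypothetical helper, not in this patch; it builds on the
vaddr_get_pfn/put_pfn helpers in vfio_iommu.c):

/* Pin up to npage pages starting at vaddr and return how many are
 * physically contiguous with the first, so the caller can hand the
 * whole run to the iommu as one larger mapping. */
static long vfio_pin_contiguous(unsigned long vaddr, long npage,
				int rdwr, unsigned long *first_pfn)
{
	unsigned long pfn;
	long i;

	if (vaddr_get_pfn(vaddr, rdwr, first_pfn))
		return 0;

	for (i = 1; i < npage; i++) {
		if (vaddr_get_pfn(vaddr + NPAGE_TO_SIZE(i), rdwr, &pfn))
			break;
		if (pfn != *first_pfn + i) {
			put_pfn(pfn, rdwr);
			break;
		}
	}
	return i;	/* pages [0, i) remain pinned and contiguous */
}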

> > +	vfio_lock_acct(locked);
> > +	return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> > +				 unsigned long start2, size_t size2)
> > +{
> > +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> > +}
> 
> You pass DMA addresses to this, so use dma_addr_t.  unsigned long is not
> always large enough.

ok

> What if one of the ranges wraps around (including the legitimate
> possibility of start + size == 0)?

Looks like a bug.
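
Something like this should be overflow-safe (sketch; assumes the callers
already reject size == 0 and ranges whose size runs past the top of the
address space):

static inline int ranges_overlap(dma_addr_t start1, size_t size1,
				 dma_addr_t start2, size_t size2)
{
	/* compare inclusive ends so that start + size == 0 (a range
	 * running to the very top of the address space) works out */
	dma_addr_t last1 = start1 + size1 - 1;
	dma_addr_t last2 = start2 + size2 - 1;

	return start1 <= last2 && start2 <= last1;
}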

> > +static long vfio_iommu_unl_ioctl(struct file *filep,
> > +				 unsigned int cmd, unsigned long arg)
> > +{
> > +	struct vfio_iommu *iommu = filep->private_data;
> > +	int ret = -ENOSYS;
> 
> -ENOIOCTLCMD or -ENOTTY?

ok

> > +
> > +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > +
> > +                ret = put_user(flags, (u64 __user *)arg);
> > +
> > +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > +		struct vfio_dma_map dm;
> 
> Whitespace.

yep, will fix

> Any reason not to use a switch?

Personal preference.  It got ugly using a switch in vfio_main while trying
to keep variable scope to each case, so I followed suit here for consistency.

> > +/* Return true if any devices within a group are opened */
> > +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> [snip]
> > +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> [snip]
> > +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> [snip]
> > +static void __vfio_group_set_iommu(struct vfio_group *group,
> > +				   struct vfio_iommu *iommu)
> 
> ...and so on.
> 
> Why all the leading underscores?  Doesn't look like you're trying to
> distinguish between this and a more public version with the same name.

The __ prefix implies the function must be called with vfio.lock held.
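A lockdep annotation at the top of each such helper would make that
convention self-documenting, e.g.:

	/* __ prefix: caller must hold vfio.lock */
	lockdep_assert_held(&vfio.lock);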

> > +/* Get a new device file descriptor.  This will open the iommu, setting
> > + * the current->mm ownership if it's not already set.  It's difficult to
> > + * specify the requirements for matching a user supplied buffer to a
> > + * device, so we use a vfio driver callback to test for a match.  For
> > + * PCI, dev_name(dev) is unique, but other drivers may require including
> > + * a parent device string. */
> > +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> > +{
> > +	struct vfio_iommu *iommu = group->iommu;
> > +	struct list_head *gpos;
> > +	int ret = -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	if (!iommu->domain) {
> > +		ret = __vfio_open_iommu(iommu);
> > +		if (ret)
> > +			goto out;
> > +	}
> > +
> > +	list_for_each(gpos, &iommu->group_list) {
> > +		struct list_head *dpos;
> > +
> > +		group = list_entry(gpos, struct vfio_group, iommu_next);
> > +
> > +		list_for_each(dpos, &group->device_list) {
> > +			struct vfio_device *device;
> > +
> > +			device = list_entry(dpos,
> > +					    struct vfio_device, device_next);
> > +
> > +			if (device->ops->match(device->dev, buf)) {
> 
> If there's a match, we're done with the loop -- might as well break out
> now rather than indent everything else.

Sure, even just changing the polarity and making this a continue would
help the formatting below.

> > +				struct file *file;
> > +
> > +				if (device->ops->get(device->device_data)) {
> > +					ret = -EFAULT;
> > +					goto out;
> > +				}
> 
> Why does a failure of get() result in -EFAULT?  -EFAULT is for bad user
> addresses.

I'll just return what get() returns.

> > +
> > +				/* We can't use anon_inode_getfd(), like above
> > +				 * because we need to modify the f_mode flags
> > +				 * directly to allow more than just ioctls */
> > +				ret = get_unused_fd();
> > +				if (ret < 0) {
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> > +
> > +				file = anon_inode_getfile("[vfio-device]",
> > +							  &vfio_device_fops,
> > +							  device, O_RDWR);
> > +				if (IS_ERR(file)) {
> > +					put_unused_fd(ret);
> > +					ret = PTR_ERR(file);
> > +					device->ops->put(device->device_data);
> > +					goto out;
> > +				}
> 
> Maybe cleaner with goto-based error management?

I didn't see enough duplication creeping in to try that here.

> > +/* Add a new device to the vfio framework with associated vfio driver
> > + * callbacks.  This is the entry point for vfio drivers to register devices. */
> > +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> > +{
> > +	struct list_head *pos;
> > +	struct vfio_group *group = NULL;
> > +	struct vfio_device *device = NULL;
> > +	unsigned int groupid;
> > +	int ret = 0;
> > +	bool new_group = false;
> > +
> > +	if (!ops)
> > +		return -EINVAL;
> > +
> > +	if (iommu_device_group(dev, &groupid))
> > +		return -ENODEV;
> > +
> > +	mutex_lock(&vfio.lock);
> > +
> > +	list_for_each(pos, &vfio.group_list) {
> > +		group = list_entry(pos, struct vfio_group, group_next);
> > +		if (group->groupid == groupid)
> > +			break;
> > +		group = NULL;
> > +	}
> 
> Factor this into vfio_dev_to_group() (and likewise for other such lookups)?

Yeah, this ends up getting duplicated a few places.
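
Something like this is what I have in mind (sketch; it also folds in the
bus_type check discussed below):

/* caller must hold vfio.lock */
static struct vfio_group *__vfio_dev_to_group(struct device *dev,
					      unsigned int groupid)
{
	struct list_head *pos;

	list_for_each(pos, &vfio.group_list) {
		struct vfio_group *group;

		group = list_entry(pos, struct vfio_group, group_next);
		if (group->groupid == groupid && group->bus == dev->bus)
			return group;
	}
	return NULL;
}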

> > +	if (!group) {
> > +		int minor;
> > +
> > +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> > +		if (!group) {
> > +			ret = -ENOMEM;
> > +			goto out;
> > +		}
> > +
> > +		group->groupid = groupid;
> > +		INIT_LIST_HEAD(&group->device_list);
> > +
> > +		ret = idr_get_new(&vfio.idr, group, &minor);
> > +		if (ret == 0 && minor > MINORMASK) {
> > +			idr_remove(&vfio.idr, minor);
> > +			kfree(group);
> > +			ret = -ENOSPC;
> > +			goto out;
> > +		}
> > +
> > +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> > +		device_create(vfio.class, NULL, group->devt,
> > +			      group, "%u", groupid);
> > +
> > +		group->bus = dev->bus;
> > +		list_add(&group->group_next, &vfio.group_list);
> 
> Factor out into vfio_create_group()?

sounds good

> > +		new_group = true;
> > +	} else {
> > +		if (group->bus != dev->bus) {
> > +			printk(KERN_WARNING
> > +			       "Error: IOMMU group ID conflict.  Group ID %u "
> > +				"on both bus %s and %s\n", groupid,
> > +				group->bus->name, dev->bus->name);
> > +			ret = -EFAULT;
> > +			goto out;
> > +		}
> 
> It took me a little while to figure out that this was comparing bus
> types, not actual bus instances (which would be an inappropriate
> restriction). :-P
> 
> Still, isn't it what we really care about that it's the same IOMMU
> domain?  Couldn't different bus types share an iommu_ops?

Nope, iommu_ops registration is now per bus_type.  Also, Christian
pointed out that groupid is really only guaranteed to be unique per
bus_type, so I've been updating groupid comparisons to compare the
(groupid, bus_type) pair.

> And again, -EFAULT isn't the right error.

Ok.

Thank you very much for the comments,

Alex
Alex Williamson - Nov. 14, 2011, 9:46 p.m.
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote: 
> > > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > > +		unsigned long pfn = 0;
> > > +
> > > +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > > +		if (ret) {
> > > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > > +			return ret;
> > > +		}
> > > +
> > > +		/* Only add actual locked pages to accounting */
> > > +		if (!is_invalid_reserved_pfn(pfn))
> > > +			locked++;
> > > +
> > > +		ret = iommu_map(iommu->domain, iova,
> > > +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > > +		if (ret) {
> > > +			/* Back out mappings on error */
> > > +			put_pfn(pfn, rdwr);
> > > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > > +			return ret;
> > > +		}
> > > +	}
> > 
> > There's no way to hand this stuff to the IOMMU driver in chunks larger
> > than a page?  That's going to be a problem for our IOMMU, which wants to
> > deal with large windows.
> 
> There is, this is just a simple implementation that maps individual
> pages.  We "just" need to determine physically contiguous chunks and
> mlock them instead of using get_user_pages.  The current implementation
> is much like how KVM maps iommu pages, but there shouldn't be any user API
> change needed to use larger chunks.  We want this for IOMMU large page
> support too.

Also, at one point intel-iommu didn't allow sub-ranges to be unmapped;
an unmap of a single page would unmap the entire original mapping that
contained that page.  That made it easier to map each page individually
for the flexibility it provided on unmap.  I need to see if we still
have that restriction.  Thanks,

Alex
Scott Wood - Nov. 14, 2011, 10:26 p.m.
On 11/14/2011 02:54 PM, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>> What are the semantics of "desired and/or returned dma address"?
> 
> I believe the original intention was that a user could leave dmaaddr
> clear and let the iommu layer provide an iova address.  The iommu api
> has since evolved and that mapping scheme really isn't present anymore.
> We'll currently fail if we can't map the requested address.  I'll update
> the docs to make that be the definition.

OK... if there is any desire in the future to have the kernel pick an
address (which could be useful for IOMMUs that don't set
VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
since zero could be a valid address to request (doesn't mean "clear").
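
Something as simple as a new flag bit would do (the name is just
illustrative):

#define VFIO_DMA_MAP_FLAG_IOVA_ANY      (1 << 1) /* kernel picks dmaaddr */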

>> Note that the "length of structure" approach means that ioctl numbers
>> will change whenever this grows -- perhaps we should avoid encoding the
>> struct size into these ioctls?
> 
> How so?  What's described here is effectively the base size.  If we
> later add feature foo requiring additional fields, we set a flag, change
> the size, and tack those fields onto the end.  The kernel side should
> balk if the size doesn't match what it expects from the flags it
> understands (which I think I probably need to be more strict about).

The size of the struct is encoded into the ioctl number via the _IOWR()
macro.  If we want the struct to be growable in the future, we should
leave that out and just use _IO().  Otherwise if the size of the struct
changes, the ioctl number changes.  This is annoying for old userspace
plus new kernel (have to add compat entries to the switch), and broken
for old kernel plus new userspace.
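
For example (the _V1/_V2 names and the v2 struct are just for
illustration):

/* sizeof() the argument type is baked into the number by _IOWR() */
#define VFIO_IOMMU_MAP_DMA_V1  _IOWR(';', 106, struct vfio_dma_map)    /* 40-byte payload */

struct vfio_dma_map_v2 { struct vfio_dma_map base; __u64 extra; };
#define VFIO_IOMMU_MAP_DMA_V2  _IOWR(';', 106, struct vfio_dma_map_v2) /* 48 bytes -> different number */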

>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
>> would still be useful for devices which don't do DMA, or where we accept
>> the lack of protection/translation (e.g. we have a customer that wants
>> to do KVM device assignment on one of our lower-end chips that lacks an
>> IOMMU).
> 
> Ugh.  I'm not really onboard with it given that we're trying to sell
> vfio as a secure user space driver interface with iommu-based
> protection.

That's its main use case, but it doesn't make much sense to duplicate
the non-iommu-related bits for other use cases.

This applies at runtime too, some devices don't do DMA at all (and thus
may not be part of an IOMMU group, even if there is an IOMMU present for
other devices -- could be considered a standalone group of one device,
with a null IOMMU backend).  Support for such devices can wait, but it's
good to keep the possibility in mind.

-Scott
Alexander Graf - Nov. 14, 2011, 10:48 p.m.
On 11/14/2011 at 23:26, Scott Wood <scottwood@freescale.com> wrote:

> On 11/14/2011 02:54 PM, Alex Williamson wrote:
>> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
>>> What are the semantics of "desired and/or returned dma address"?
>> 
>> I believe the original intention was that a user could leave dmaaddr
>> clear and let the iommu layer provide an iova address.  The iommu api
>> has since evolved and that mapping scheme really isn't present anymore.
>> We'll currently fail if we can map the requested address.  I'll update
>> the docs to make that be the definition.
> 
> OK... if there is any desire in the future to have the kernel pick an
> address (which could be useful for IOMMUs that don't set
> VFIO_IOMMU_FLAGS_MAP_ANY), there should be an explicit flag for this,
> since zero could be a valid address to request (doesn't mean "clear").
> 
>>> Note that the "length of structure" approach means that ioctl numbers
>>> will change whenever this grows -- perhaps we should avoid encoding the
>>> struct size into these ioctls?
>> 
>> How so?  What's described here is effectively the base size.  If we
>> later add feature foo requiring additional fields, we set a flag, change
>> the size, and tack those fields onto the end.  The kernel side should
>> balk if the size doesn't match what it expects from the flags it
>> understands (which I think I probably need to be more strict about).
> 
> The size of the struct is encoded into the ioctl number via the _IOWR()
> macro.  If we want the struct to be growable in the future, we should
> leave that out and just use _IO().  Otherwise if the size of the struct
> changes, the ioctl number changes.  This is annoying for old userspace
> plus new kernel (have to add compat entries to the switch), and broken
> for old kernel plus new userspace.

Avi wanted to write up a patch for this to allow ioctls with arbitrary size, for exactly this purpose.

> 
>>> Can we limit the IOMMU_API dependency to the IOMMU parts of VFIO?  It
>>> would still be useful for devices which don't do DMA, or where we accept
>>> the lack of protection/translation (e.g. we have a customer that wants
>>> to do KVM device assignment on one of our lower-end chips that lacks an
>>> IOMMU).
>> 
>> Ugh.  I'm not really onboard with it given that we're trying to sell
>> vfio as a secure user space driver interface with iommu-based
>> protection.
> 
> That's its main use case, but it doesn't make much sense to duplicate
> the non-iommu-related bits for other use cases.
> 
> This applies at runtime too, some devices don't do DMA at all (and thus
> may not be part of an IOMMU group, even if there is an IOMMU present for
> other devices -- could be considered a standalone group of one device,
> with a null IOMMU backend).  Support for such devices can wait, but it's
> good to keep the possibility in mind.

I agree. Potentially backing a device with a nop iommu also makes testing easier.

Alex

>
Alex Williamson - Nov. 14, 2011, 10:59 p.m.
On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
> > -----Original Message-----
> > From: Alex Williamson [mailto:alex.williamson@redhat.com]
> > Sent: Friday, November 11, 2011 10:04 AM
> > To: Christian Benvenuti (benve)
> > Cc: chrisw@sous-sol.org; aik@au1.ibm.com; pmac@au1.ibm.com;
> > dwg@au1.ibm.com; joerg.roedel@amd.com; agraf@suse.de; Aaron Fabbri
> > (aafabbri); B08248@freescale.com; B07421@freescale.com; avi@redhat.com;
> > konrad.wilk@oracle.com; kvm@vger.kernel.org; qemu-devel@nongnu.org;
> > iommu@lists.linux-foundation.org; linux-pci@vger.kernel.org
> > Subject: RE: [RFC PATCH] vfio: VFIO Driver core framework
> > 
> > On Wed, 2011-11-09 at 18:57 -0600, Christian Benvenuti (benve) wrote:
> > > Here are few minor comments on vfio_iommu.c ...
> > 
> > Sorry, I've been poking sticks at trying to figure out a clean way to
> > solve the force vfio driver attach problem.
> 
Attach or detach?

Attach.  For the case when a new device appears that belongs to a group
that's already in use.  I'll probably add a claim() operation to the
vfio_device_ops that tells the driver to grab it.  I was hoping for pci
this would just add it to the dynamic ids, but that hits device lock
problems.
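
i.e. roughly this addition to vfio_device_ops (sketch):

	/* hypothetical: force the bus driver to bind a newly discovered
	 * device whose group is already opened by a user */
	int	(*claim)(struct device *dev);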

> > > > diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> > > > new file mode 100644
> > > > index 0000000..029dae3
> > > > --- /dev/null
> > > > +++ b/drivers/vfio/vfio_iommu.c
> > <snip>
> > > > +
> > > > +#include "vfio_private.h"
> > >
> > > Doesn't the 'dma_'  prefix belong to the generic DMA code?
> > 
Sure, we could make these more vfio-centric.
> 
> Like vfio_dma_map_page?

Something like that, though _page doesn't seem appropriate as it tracks
a region.

> > 
> > > > +struct dma_map_page {
> > > > +	struct list_head	list;
> > > > +	dma_addr_t		daddr;
> > > > +	unsigned long		vaddr;
> > > > +	int			npage;
> > > > +	int			rdwr;
> > > > +};
> > > > +
> > > > +/*
> > > > + * This code handles mapping and unmapping of user data buffers
> > > > + * into DMA'ble space using the IOMMU
> > > > + */
> > > > +
> > > > +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> > > > +
> > > > +struct vwork {
> > > > +	struct mm_struct	*mm;
> > > > +	int			npage;
> > > > +	struct work_struct	work;
> > > > +};
> > > > +
> > > > +/* delayed decrement for locked_vm */
> > > > +static void vfio_lock_acct_bg(struct work_struct *work)
> > > > +{
> > > > +	struct vwork *vwork = container_of(work, struct vwork, work);
> > > > +	struct mm_struct *mm;
> > > > +
> > > > +	mm = vwork->mm;
> > > > +	down_write(&mm->mmap_sem);
> > > > +	mm->locked_vm += vwork->npage;
> > > > +	up_write(&mm->mmap_sem);
> > > > +	mmput(mm);		/* unref mm */
> > > > +	kfree(vwork);
> > > > +}
> > > > +
> > > > +static void vfio_lock_acct(int npage)
> > > > +{
> > > > +	struct vwork *vwork;
> > > > +	struct mm_struct *mm;
> > > > +
> > > > +	if (!current->mm) {
> > > > +		/* process exited */
> > > > +		return;
> > > > +	}
> > > > +	if (down_write_trylock(&current->mm->mmap_sem)) {
> > > > +		current->mm->locked_vm += npage;
> > > > +		up_write(&current->mm->mmap_sem);
> > > > +		return;
> > > > +	}
> > > > +	/*
> > > > +	 * Couldn't get mmap_sem lock, so must setup to decrement
> > >                                                       ^^^^^^^^^
> > >
> > > Increment?
> > 
> > Yep

Actually, side note, this is increment/decrement depending on the sign
of the parameter.  So "update" may be more appropriate.  I think Tom
originally used increment in one place and decrement in another to show
its dual use.

> > <snip>
> > > > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t
> > > > start,
> > > > +			    size_t size, struct dma_map_page *mlp)
> > > > +{
> > > > +	struct dma_map_page *split;
> > > > +	int npage_lo, npage_hi;
> > > > +
> > > > +	/* Existing dma region is completely covered, unmap all */
> > >
> > > This works. However, given how vfio_dma_map_dm implements the merging
> > > logic, I think it is impossible to have
> > >
> > >     (start < mlp->daddr &&
> > >      start + size > mlp->daddr + NPAGE_TO_SIZE(mlp->npage))
> > 
> > It's quite possible.  This allows userspace to create a sparse mapping,
> > then blow it all away with a single unmap from 0 to ~0.
> 
> I would prefer the user to use exact ranges in the unmap operations
> because it would make it easier to detect bugs/leaks in the map/unmap
> logic used by the callers.
> My assumptions are that:
> 
> - the user always keeps track of the mappings

My qemu code plays a little on the loose side here, acting as a
passthrough for the internal memory client.  But even there, worst case
would probably be trying to unmap a non-existent entry, not unmapping a
sparse range.

> - the user either unmaps one specific mapping or 'all of them'.
>   The 'all of them' case would also take care of those cases where
>   the user does _not_ keep track of mappings and simply uses
>   the "unmap from 0 to ~0" each time.
> 
> Because of this you could still provide an exact map/unmap logic
> and allow such "unmap from 0 to ~0" by making the latter a special
> case.
> However, if we want to allow any arbitrary/inexact unmap request, then OK.

I can't think of any good reasons we shouldn't be more strict.  I think
it was primarily just convenient to hit all the corner cases since we
merge all the requests together for tracking and need to be able to
split them back apart.  It does feel a little awkward to have a 0/~0
special case though, but I don't think it's worth adding another ioctl
to handle it.
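
The ioctl handler would then look something like this (sketch;
vfio_dma_unmap_all() and vfio_find_dma() are hypothetical helpers):

	/* strict unmap: exact match required, 0/~0 is the only wildcard */
	if (dm.dmaaddr == 0 && dm.size == (__u64)~0)
		return vfio_dma_unmap_all(iommu);

	mlp = vfio_find_dma(iommu, dm.dmaaddr);
	if (!mlp || mlp->daddr != dm.dmaaddr ||
	    NPAGE_TO_SIZE(mlp->npage) != dm.size)
		return -EINVAL;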

<snip>
> > > > +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> > > > +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> > > > +
> > > > +                ret = put_user(flags, (u64 __user *)arg);
> > > > +
> > > > +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> > > > +		struct vfio_dma_map dm;
> > > > +
> > > > +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> > > > +			return -EFAULT;
> > >
> > > What does the "_dm" suffix stand for?
> > 
> > Inherited from Tom, but I figure _dma_map_dm = action(dma map),
> > object(dm), which is a vfio_Dma_Map.
> 
> OK. The reason why I asked is that '_dm' does not add anything to 'vfio_dma_map'.

Yep.  Thanks,

Alex
David Gibson - Nov. 15, 2011, midnight
On Fri, Nov 11, 2011 at 03:10:56PM -0700, Alex Williamson wrote:
> Thanks Konrad!  Comments inline.
> On Fri, 2011-11-11 at 12:51 -0500, Konrad Rzeszutek Wilk wrote:
> > On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
[snip]
> > > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > > +
> > > +#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, int)
> > 
> > Don't want __u32?
> 
It could be; I'm not sure it buys us anything, and it might even restrict us.
We likely don't need 2^32 regions (famous last words?), so we could
later define <0 to mean something?

As a rule, it's best to use explicit fixed width types for all ioctl()
arguments, to avoid compat hell for 32-bit userland on 64-bit kernel
setups.
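
e.g.:

/* fixed-width types keep the layout identical for 32-bit and 64-bit userland */
#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, __u32)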

[snip]
> > > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > > +type to index mapping).
> > 
> > I am not really sure what that means.
> 
> This is so PCI can expose:
> 
> enum {
>         VFIO_PCI_INTX_IRQ_INDEX,
>         VFIO_PCI_MSI_IRQ_INDEX,
>         VFIO_PCI_MSIX_IRQ_INDEX,
>         VFIO_PCI_NUM_IRQS
> };
> 
> So like regions it always exposes 3 IRQ indexes where count=0 if the
> device doesn't actually support that type of interrupt.  I just want to
> spell out that bus drivers have this kind of flexibility.

I knew what you were aiming for, so I could see what you meant here,
but I don't think the doco is very clearly expressed at all.
David Gibson - Nov. 15, 2011, 12:05 a.m.
On Mon, Nov 14, 2011 at 03:59:00PM -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 16:22 -0600, Christian Benvenuti (benve) wrote:
[snip]

> > - the user either unmaps one specific mapping or 'all of them'.
> >   The 'all of them' case would also take care of those cases where
> >   the user does _not_ keep track of mappings and simply uses
> >   the "unmap from 0 to ~0" each time.
> > 
> > Because of this you could still provide an exact map/unmap logic
> > and allow such "unmap from 0 to ~0" by making the latter a special
> > case.
> > However, if we want to allow any arbitrary/inexact unmap request, then OK.
> 
> I can't think of any good reasons we shouldn't be more strict.  I think
> it was primarily just convenient to hit all the corner cases since we
> merge all the requests together for tracking and need to be able to
> split them back apart.  It does feel a little awkward to have a 0/~0
> special case though, but I don't think it's worth adding another ioctl
> to handle it.

Being strict, or at least enforcing strictness, requires that the
infrastructure track all the maps, so that the unmaps can be
matching.  This is not a natural thing with the data structures you
want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
is a simple 1-level pagetable.  One pointer with a couple of
permission bits per IOMMU page.  Handling oddly overlapping operations
on that data structure is natural, enforcing strict matching of maps
and unmaps is not and would require extra information to be stored by
vfio.  On POWER, the IOMMU operations often *are* a hot path, so
manipulating those structures would have a real cost, too.
Benjamin Herrenschmidt - Nov. 15, 2011, 12:49 a.m.
On Tue, 2011-11-15 at 11:05 +1100, David Gibson wrote:
> Being strict, or at least enforcing strictness, requires that the
> infrastructure track all the maps, so that the unmaps can be
> matching.  This is not a natural thing with the data structures you
> want for all IOMMUs.  For example on POWER, the IOMMU (aka TCE table)
> is a simple 1-level pagetable.  One pointer with a couple of
> permission bits per IOMMU page.  Handling oddly overlapping operations
> on that data structure is natural, enforcing strict matching of maps
> and unmaps is not and would require extra information to be stored by
> vfio.  On POWER, the IOMMU operations often *are* a hot path, so
> manipulating those structures would have a real cost, too. 

In fact they are a very hot path even. There's no way we can afford the
cost of tracking per page mapping/unmapping (other than bumping the page
count on a page that's currently mapped or via some debug-only feature).

Cheers,
Ben.
Alex Williamson - Nov. 15, 2011, 2:29 a.m.
On Mon, 2011-11-14 at 13:54 -0700, Alex Williamson wrote:
> On Fri, 2011-11-11 at 18:14 -0600, Scott Wood wrote:
> > On 11/03/2011 03:12 PM, Alex Williamson wrote:
> > > +	int			(*get)(void *);
> > > +	void			(*put)(void *);
> > > +	ssize_t			(*read)(void *, char __user *,
> > > +					size_t, loff_t *);
> > > +	ssize_t			(*write)(void *, const char __user *,
> > > +					 size_t, loff_t *);
> > > +	long			(*ioctl)(void *, unsigned int, unsigned long);
> > > +	int			(*mmap)(void *, struct vm_area_struct *);
> > > +};
> > 
> > When defining an API, please do not omit parameter names.
> 
> ok
> 
> > Should specify what the driver is supposed to do with get/put -- I guess
> > not try to unbind when the count is nonzero?  Races could still lead the
> > unbinder to be blocked, but I guess it lets the driver know when it's
> > likely to succeed.
> 
> Right, for the pci bus driver, it's mainly for reference counting,
> including the module_get to prevent vfio-pci from being unloaded.  On
> the first get for a device, we also do a pci_enable() and pci_disable()
> on last put.  I'll try to clarify in the docs.

Looking at these again, I should just rename them to open/release.  That
matches the points when they're called.  I suspect I started with just
reference counting and it grew to more of a full blown open/release.
Thanks,

Alex
David Gibson - Nov. 15, 2011, 6:34 a.m.
On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> VFIO provides a secure, IOMMU based interface for user space
> drivers, including device assignment to virtual machines.
> This provides the base management of IOMMU groups, devices,
> and IOMMU objects.  See Documentation/vfio.txt included in
> this patch for user and kernel API description.
> 
> Note, this implements the new API discussed at KVM Forum
> 2011, as represented by the drvier version 0.2.  It's hoped
> that this provides a modular enough interface to support PCI
> and non-PCI userspace drivers across various architectures
> and IOMMU implementations.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
> 
> Fingers crossed, this is the last RFC for VFIO, but we need
> the iommu group support before this can go upstream
> (http://lkml.indiana.edu/hypermail/linux/kernel/1110.2/02303.html),
> hoping this helps push that along.
> 
> Since the last posting, this version completely modularizes
> the device backends and better defines the APIs between the
> core VFIO code and the device backends.  I expect that we
> might also adopt a modular IOMMU interface as iommu_ops learns
> about different types of hardware.  Also many, many cleanups.
> Check the complete git history for details:
> 
> git://github.com/awilliam/linux-vfio.git vfio-ng
> 
> (matching qemu tree: git://github.com/awilliam/qemu-vfio.git)
> 
> This version, along with the supporting VFIO PCI backend can
> be found here:
> 
> git://github.com/awilliam/linux-vfio.git vfio-next-20111103
> 
> I've held off on implementing a kernel->user signaling
> mechanism for now since the previous netlink version produced
> too many gag reflexes.  It's easy enough to set a bit in the
> group flags too indicate such support in the future, so I
> think we can move ahead without it.
> 
> Appreciate any feedback or suggestions.  Thanks,
> 
> Alex
> 
>  Documentation/ioctl/ioctl-number.txt |    1 
>  Documentation/vfio.txt               |  304 +++++++++
>  MAINTAINERS                          |    8 
>  drivers/Kconfig                      |    2 
>  drivers/Makefile                     |    1 
>  drivers/vfio/Kconfig                 |    8 
>  drivers/vfio/Makefile                |    3 
>  drivers/vfio/vfio_iommu.c            |  530 ++++++++++++++++
>  drivers/vfio/vfio_main.c             | 1151 ++++++++++++++++++++++++++++++++++
>  drivers/vfio/vfio_private.h          |   34 +
>  include/linux/vfio.h                 |  155 +++++
>  11 files changed, 2197 insertions(+), 0 deletions(-)
>  create mode 100644 Documentation/vfio.txt
>  create mode 100644 drivers/vfio/Kconfig
>  create mode 100644 drivers/vfio/Makefile
>  create mode 100644 drivers/vfio/vfio_iommu.c
>  create mode 100644 drivers/vfio/vfio_main.c
>  create mode 100644 drivers/vfio/vfio_private.h
>  create mode 100644 include/linux/vfio.h
> 
> diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
> index 54078ed..59d01e4 100644
> --- a/Documentation/ioctl/ioctl-number.txt
> +++ b/Documentation/ioctl/ioctl-number.txt
> @@ -88,6 +88,7 @@ Code  Seq#(hex)	Include File		Comments
>  		and kernel/power/user.c
>  '8'	all				SNP8023 advanced NIC card
>  					<mailto:mcr@solidum.com>
> +';'	64-76	linux/vfio.h
>  '@'	00-0F	linux/radeonfb.h	conflict!
>  '@'	00-0F	drivers/video/aty/aty128fb.c	conflict!
>  'A'	00-1F	linux/apm_bios.h	conflict!
> diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> new file mode 100644
> index 0000000..5866896
> --- /dev/null
> +++ b/Documentation/vfio.txt
> @@ -0,0 +1,304 @@
> +VFIO - "Virtual Function I/O"[1]
> +-------------------------------------------------------------------------------
> +Many modern system now provide DMA and interrupt remapping facilities
> +to help ensure I/O devices behave within the boundaries they've been
> +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> +well as POWER systems with Partitionable Endpoints (PEs) and even
> +embedded powerpc systems (technology name unknown).  The VFIO driver
> +is an IOMMU/device agnostic framework for exposing direct device
> +access to userspace, in a secure, IOMMU protected environment.  In
> +other words, this allows safe, non-privileged, userspace drivers.

It's perhaps worth emphasising that "safe" depends on the hardware
being sufficiently well behaved.  BenH, I know, thinks there are a
*lot* of cards that, e.g. have debug registers that allow a backdoor
to their own config space via MMIO, which would bypass vfio's
filtering of config space access.  And that's before we even get into
the varying degrees of completeness in the isolation provided by
different IOMMUs.

> +Why do we want that?  Virtual machines often make use of direct device
> +access ("device assignment") when configured for the highest possible
> +I/O performance.  From a device and host perspective, this simply turns
> +the VM into a userspace driver, with the benefits of significantly
> +reduced latency, higher bandwidth, and direct use of bare-metal device
> +drivers[2].
> +
> +Some applications, particularly in the high performance computing
> +field, also benefit from low-overhead, direct device access from
> +userspace.  Examples include network adapters (often non-TCP/IP based)
> +and compute accelerators.  Previous to VFIO, these drivers needed to

s/Previous/Prior/  although that may be a .us vs .au usage thing.

> +go through the full development cycle to become proper upstream driver,
> +be maintained out of tree, or make use of the UIO framework, which
> +has no notion of IOMMU protection, limited interrupt support, and
> +requires root privileges to access things like PCI configuration space.
> +
> +The VFIO driver framework intends to unify these, replacing both the
> +KVM PCI specific device assignment currently used as well as provide
> +a more secure, more featureful userspace driver environment than UIO.
> +
> +Groups, Devices, IOMMUs, oh my
> +-------------------------------------------------------------------------------
> +
> +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> +can't always distinguish transactions from each individual device in
> +the system.  Sometimes this is because of the IOMMU design, such as with
> +PEs, other times it's caused by the I/O topology, for instance a
> +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> +devices created by these restictions IOMMU groups (or just "groups" for
> +this document).
> +
> +The IOMMU cannot distiguish transactions between the individual devices
> +within the group, therefore the group is the basic unit of ownership for
> +a userspace process.  Because of this, groups are also the primary
> +interface to both devices and IOMMU domains in VFIO.
> +
> +The VFIO representation of groups is created as devices are added into
> +the framework by a VFIO bus driver.  The vfio-pci module is an example
> +of a bus driver.  This module registers devices along with a set of bus
> +specific callbacks with the VFIO core.  These callbacks provide the
> +interfaces later used for device access.  As each new group is created,
> +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> +character device.

Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
bus driver is per bus type, not per bus instance.   But grouping
constraints could be per bus instance, if you have a couple of
different models of PCI host bridge with IOMMUs of different
capabilities built in, for example.

> +In addition to the device enumeration and callbacks, the VFIO bus driver
> +also provides a traditional device driver and is able to bind to devices
> +on it's bus.  When a device is bound to the bus driver it's available to
> +VFIO.  When all the devices within a group are bound to their bus drivers,
> +the group becomes "viable" and a user with sufficient access to the VFIO
> +group chardev can obtain exclusive access to the set of group devices.
> +
> +As documented in linux/vfio.h, several ioctls are provided on the
> +group chardev:
> +
> +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
> +
> +The last two ioctls return new file descriptors for accessing
> +individual devices within the group and programming the IOMMU.  Each of
> +these new file descriptors provide their own set of file interfaces.
> +These ioctls will fail if any of the devices within the group are not
> +bound to their VFIO bus driver.  Additionally, when either of these
> +interfaces are used, the group is then bound to the struct_mm of the
> +caller.  The GET_FLAGS ioctl can be used to view the state of the group.
> +
> +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> +new IOMMU domain is created and all of the devices in the group are
> +attached to it.  This is the only way to ensure full IOMMU isolation
> +of the group, but potentially wastes resources and cycles if the user
> +intends to manage multiple groups with the same set of IOMMU mappings.
> +VFIO therefore provides a group MERGE and UNMERGE interface, which
> +allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
> +arbitrary groups to be merged, so the user should assume merging is
> +opportunistic.

I do not think "opportunistic" means what you think it means..

>  A new group, with no open device or IOMMU file
> +descriptors, can be merged into an existing, in-use, group using the
> +MERGE ioctl.  A merged group can be unmerged using the UNMERGE ioctl
> +once all of the device file descriptors for the group being merged
> +"out" are closed.
> +
> +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> +essentially fungible between group file descriptors (ie. if device A

IDNT "fungible" MWYTIM, either.

> +is in group X, and X is merged with Y, a file descriptor for A can be
> +retrieved using GET_DEVICE_FD on Y.  Likewise, GET_IOMMU_FD returns a
> +file descriptor referencing the same internal IOMMU object from either
> +X or Y).  Merged groups can be dissolved either explictly with UNMERGE
> +or automatically when ALL file descriptors for the merged group are
> +closed (all IOMMUs, all devices, all groups).

Blech.  I'm really not liking this merge/unmerge API as it stands,
it's horribly confusing.  At the very least, we need some better
terminology.  We need some term for the metagroups; supergroups; iommu
domains or-at-least-they-will-be-once-we-open-the-iommu or
whathaveyous.

The first confusing thing about this interface is that each open group
handle actually refers to two different things; the original group you
opened and the metagroup it's a part of.  For the GET_IOMMU_FD and
GET_DEVICE_FD operations, you're using the metagroup and two "merged"
group handles are interchangeable.  For other MERGE and especially
UNMERGE operations, it matters which is the original group.

The semantics of "merge" and "unmerge" under those names are really
non-obvious.  Merge kind of has to merge two whole metagroups, but
it's unclear if unmerge reverses one merge, or just takes out one
(atom) group.  These operations need better names, at least.

Then it's unclear what order you can do various operations, and which
order you can open and close various things.  You can kind of figure
it out but it takes far more thinking than it should.


So at the _very_ least, we need to invent new terminology and find a
much better way of describing this API's semantics.  I still think an
entirely different interface, where metagroups are created from
outside with a lifetime that's not tied to an fd would be a better
idea.



Now, you specify that you can't use a group as the second argument of
a merge if it already has an open iommu, but it's not clear from the
doc if you can merge things into a group with an open iommu.  Banning
this would make life simpler, because the IOMMU's effective
capabilities may change if you add more devices to the domain.  That's
yet another non-obvious constraint in the interface ordering, though.

> +The IOMMU file descriptor provides this set of ioctls:
> +
> +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)
> +
> +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> +We currently only support IOMMU domains that are able to map any
> +virtual address to any IOVA.  This is indicated by the MAP_ANY flag.

So.  I tend to think of an IOMMU mapping IOVAs to memory pages, rather
than memory pages to IOVAs.  The IOMMU itself, of course maps to
physical addresses, and the meaning of "virtual address" in this
context is not really clear.  I think you would be better off saying
the IOMMU can map any IOVA to any memory page.  From a hardware POV
that means any physical address, but of course for a VFIO user a page
is specified by its process virtual address.

I think we need to pin exactly what "MAP_ANY" means down better.  Now,
VFIO is pretty much a lost cause if you can't map any normal process
memory page into the IOMMU, so I think the only thing that is really
covered is IOVAs.  But saying "can map any IOVA" is not clear, because
if you can't map it, it's not a (valid) IOVA.  Better to say that
IOVAs can be any 64-bit value, which I think is what you really mean
here.

Of course, since POWER is a platform where this is *not* true, I'd
prefer to have something giving the range of valid IOVAs in the core
to start with.

> +
> +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> +and unmapping IOVAs to process virtual addresses:
> +
> +struct vfio_dma_map {
> +        __u64   len;            /* length of structure */

Thanks for adding these structure length fields.  But I think they
should be called something other than 'len', which is likely to be
confused with size (or some other length that's actually related to
the operation's parameters).  Better to call it 'structlen' or
'argslen' or something.

> +        __u64   vaddr;          /* process virtual addr */
> +        __u64   dmaaddr;        /* desired and/or returned dma address */
> +        __u64   size;           /* size in bytes */
> +        __u64   flags;
> +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */

Make it independent READ and WRITE flags from the start.  Not all
combinations will be be valid on all hardware, but that way we have
the possibilities covered without having to use strange encodings
later.
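
i.e. something like:

#define VFIO_DMA_MAP_FLAG_READ          (1 << 0)        /* device may read */
#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 1)        /* device may write */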

> +};
> +
> +Current users of VFIO use relatively static DMA mappings, not requiring
> +high frequency turnover.  As new users are added, it's expected that the
> +IOMMU file descriptor will evolve to support new mapping interfaces, this
> +will be reflected in the flags and may present new ioctls and file
> +interfaces.
> +
> +The device GET_FLAGS ioctl is intended to return basic device type and
> +indicate support for optional capabilities.  Flags currently include whether
> +the device is PCI or described by Device Tree, and whether the RESET ioctl
> +is supported:
> +
> +#define VFIO_DEVICE_GET_FLAGS           _IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI          (1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT           (1 << 1)

TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
an initial infrastructure patch, though we should certainly be
discussing it as an add-on patch.

> + #define VFIO_DEVICE_FLAGS_RESET        (1 << 2)
> +
> +The MMIO and IOP resources used by a device are described by regions.
> +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> +
> +#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, int)
> +
> +Regions are described by a struct vfio_region_info, which is retrieved by
> +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> +the desired region (0 based index).  Note that devices may implement zero
> +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> +mapping).

So, I think you're saying that a zero-sized region is used to encode a
NOP region, that is, to basically put a "no region here" in between
valid region indices.  You should spell that out.

[Incidentally, any chance you could borrow one of RH's tech writers
for this?  I'm afraid you seem to lack the knack for clear and easily
read documentation]

> +struct vfio_region_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* region number */
> +        __u64   size;           /* size in bytes of region */
> +        __u64   offset;         /* start offset of region */
> +        __u64   flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)

Again having separate read and write bits from the start will save
strange encodings later.

> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)
> +        __u64   phys;           /* physical address of region */
> +};

I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
space for PCI.  If you added that having a NONE type might be a
clearer way of encoding a non-region than just having size==0.

> +
> +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> +
> +The offset indicates the offset into the device file descriptor which
> +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> +available access types and validity of optional fields.  For instance
> +the phys field may only be valid for certain devices types.
> +
> +Interrupts are described using a similar interface.  GET_NUM_IRQS
> +reports the number or IRQ indexes for the device.
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +        __u32   len;            /* length of structure */
> +        __u32   index;          /* IRQ number */
> +        __u32   count;          /* number of individual IRQs */

Is there a reason for allowing irqs in batches like this, rather than
having each MSI be reflected by a separate irq_info?

> +        __u64   flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> +};
> +
> +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> +type to index mapping).

I know what you mean, but you need a clearer way to express it.

> +Information about each index can be retrieved using the GET_IRQ_INFO
> +ioctl, used much like GET_REGION_INFO.
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> +
> +Individual indexes can describe single or sets of IRQs.  This provides the
> +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> +
> +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> +as shown below, are used to pass the IRQ info index, the number of eventfds,
> +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)
> +
> +When a level triggered interrupt is signaled, the interrupt is masked
> +on the host.  This prevents an unresponsive userspace driver from
> +continuing to interrupt the host system.  After servicing the interrupt,
> +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
> +triggered interrupts implicitly have a count of 1 per index.

This is a silly restriction.  Even PCI devices can have up to 4 LSIs
on a function in theory, though no-one ever does.  Embedded devices
can and do have multiple level interrupts.

> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ          _IOW(';', 114, int)
> +
> +Level triggered interrupts can also be unmasked using an irqfd.  Use
> +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD      _IOW(';', 115, int)
> +
> +When supported, as indicated by the device flags, reset the device.
> +
> +#define VFIO_DEVICE_RESET               _IO(';', 116)
> +
> +Device tree devices also invlude ioctls for further defining the
> +device tree properties of the device:
> +
> +struct vfio_dtpath {
> +        __u32   len;            /* length of structure */
> +        __u32   index;
> +        __u64   flags;
> +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> +        char    *path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> +        __u32   len;            /* length of structure */
> +        __u32   index;
> +        __u32   prop_type;
> +        __u32   prop_index;
> +        __u64   flags;
> +#define VFIO_DTINDEX_FLAGS_REGION       (1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ          (1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX         _IOWR(';', 118, struct vfio_dtindex)
> +
> +
> +VFIO bus driver API
> +-------------------------------------------------------------------------------
> +
> +Bus drivers, such as PCI, have three jobs:
> + 1) Add/remove devices from vfio
> + 2) Provide vfio_device_ops for device access
> + 3) Device binding and unbinding
> +
> +When initialized, the bus driver should enumerate the devices on it's

s/it's/its/

> +bus and call vfio_group_add_dev() for each device.  If the bus supports
> +hotplug, notifiers should be enabled to track devices being added and
> +removed.  vfio_group_del_dev() removes a previously added device from
> +vfio.
> +
> +Adding a device registers a vfio_device_ops function pointer structure
> +for the device:
> +
> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);
> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +When a device is bound to the bus driver, the bus driver indicates this
> +to vfio using the vfio_bind_dev() interface.  The device_data parameter
> +is a pointer to an opaque data structure for use only by the bus driver.
> +The get, put, read, write, ioctl, and mmap vfio_device_ops all pass
> +this data structure back to the bus driver.  When a device is unbound
> +from the bus driver, the vfio_unbind_dev() interface signals this to
> +vfio.  This function returns the pointer to the device_data structure
> +registered for the device.
> +
> +As noted previously, a group contains one or more devices, so
> +GROUP_GET_DEVICE_FD needs to identify the specific device being requested.
> +The vfio_device_ops.match callback is used to allow bus drivers to determine
> +the match.  For drivers like vfio-pci, it's a simple match to dev_name(),
> +which is unique in the system due to the PCI bus topology, other bus drivers
> +may need to include parent devices to create a unique match, so this is
> +left as a bus driver interface.
> +
> +-------------------------------------------------------------------------------
> +
> +[1] VFIO was originally an acronym for "Virtual Function I/O" in it's
> +initial implementation by Tom Lyon while as Cisco.  We've since outgrown
> +the acronym, but it's catchy.
> +
> +[2] As always there are trade-offs to virtual machine device
> +assignment that are beyond the scope of VFIO.  It's expected that
> +future IOMMU technologies will reduce some, but maybe not all, of
> +these trade-offs.
> diff --git a/MAINTAINERS b/MAINTAINERS
> index f05f5f6..4bd5aa0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -7106,6 +7106,14 @@ S:	Maintained
>  F:	Documentation/filesystems/vfat.txt
>  F:	fs/fat/
>  
> +VFIO DRIVER
> +M:	Alex Williamson <alex.williamson@redhat.com>
> +L:	kvm@vger.kernel.org
> +S:	Maintained
> +F:	Documentation/vfio.txt
> +F:	drivers/vfio/
> +F:	include/linux/vfio.h
> +
>  VIDEOBUF2 FRAMEWORK
>  M:	Pawel Osciak <pawel@osciak.com>
>  M:	Marek Szyprowski <m.szyprowski@samsung.com>
> diff --git a/drivers/Kconfig b/drivers/Kconfig
> index b5e6f24..e15578b 100644
> --- a/drivers/Kconfig
> +++ b/drivers/Kconfig
> @@ -112,6 +112,8 @@ source "drivers/auxdisplay/Kconfig"
>  
>  source "drivers/uio/Kconfig"
>  
> +source "drivers/vfio/Kconfig"
> +
>  source "drivers/vlynq/Kconfig"
>  
>  source "drivers/virtio/Kconfig"
> diff --git a/drivers/Makefile b/drivers/Makefile
> index 1b31421..5f138b5 100644
> --- a/drivers/Makefile
> +++ b/drivers/Makefile
> @@ -58,6 +58,7 @@ obj-$(CONFIG_ATM)		+= atm/
>  obj-$(CONFIG_FUSION)		+= message/
>  obj-y				+= firewire/
>  obj-$(CONFIG_UIO)		+= uio/
> +obj-$(CONFIG_VFIO)		+= vfio/
>  obj-y				+= cdrom/
>  obj-y				+= auxdisplay/
>  obj-$(CONFIG_PCCARD)		+= pcmcia/
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> new file mode 100644
> index 0000000..9acb1e7
> --- /dev/null
> +++ b/drivers/vfio/Kconfig
> @@ -0,0 +1,8 @@
> +menuconfig VFIO
> +	tristate "VFIO Non-Privileged userspace driver framework"
> +	depends on IOMMU_API
> +	help
> +	  VFIO provides a framework for secure userspace device drivers.
> +	  See Documentation/vfio.txt for more details.
> +
> +	  If you don't know what to do here, say N.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> new file mode 100644
> index 0000000..088faf1
> --- /dev/null
> +++ b/drivers/vfio/Makefile
> @@ -0,0 +1,3 @@
> +vfio-y := vfio_main.o vfio_iommu.o
> +
> +obj-$(CONFIG_VFIO) := vfio.o
> diff --git a/drivers/vfio/vfio_iommu.c b/drivers/vfio/vfio_iommu.c
> new file mode 100644
> index 0000000..029dae3
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu.c
> @@ -0,0 +1,530 @@
> +/*
> + * VFIO: IOMMU DMA mapping support
> + *
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/fs.h>
> +#include <linux/iommu.h>
> +#include <linux/module.h>
> +#include <linux/mm.h>
> +#include <linux/sched.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/workqueue.h>
> +
> +#include "vfio_private.h"
> +
> +struct dma_map_page {
> +	struct list_head	list;
> +	dma_addr_t		daddr;
> +	unsigned long		vaddr;
> +	int			npage;
> +	int			rdwr;
> +};
> +
> +/*
> + * This code handles mapping and unmapping of user data buffers
> + * into DMA'ble space using the IOMMU
> + */
> +
> +#define NPAGE_TO_SIZE(npage)	((size_t)(npage) << PAGE_SHIFT)
> +
> +struct vwork {
> +	struct mm_struct	*mm;
> +	int			npage;
> +	struct work_struct	work;
> +};
> +
> +/* delayed decrement for locked_vm */
> +static void vfio_lock_acct_bg(struct work_struct *work)
> +{
> +	struct vwork *vwork = container_of(work, struct vwork, work);
> +	struct mm_struct *mm;
> +
> +	mm = vwork->mm;
> +	down_write(&mm->mmap_sem);
> +	mm->locked_vm += vwork->npage;
> +	up_write(&mm->mmap_sem);
> +	mmput(mm);		/* unref mm */
> +	kfree(vwork);
> +}
> +
> +static void vfio_lock_acct(int npage)
> +{
> +	struct vwork *vwork;
> +	struct mm_struct *mm;
> +
> +	if (!current->mm) {
> +		/* process exited */
> +		return;
> +	}
> +	if (down_write_trylock(&current->mm->mmap_sem)) {
> +		current->mm->locked_vm += npage;
> +		up_write(&current->mm->mmap_sem);
> +		return;
> +	}
> +	/*
> +	 * Couldn't get mmap_sem lock, so must setup to decrement
> +	 * mm->locked_vm later. If locked_vm were atomic, we wouldn't
> +	 * need this silliness
> +	 */
> +	vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> +	if (!vwork)
> +		return;
> +	mm = get_task_mm(current);	/* take ref mm */
> +	if (!mm) {
> +		kfree(vwork);
> +		return;
> +	}
> +	INIT_WORK(&vwork->work, vfio_lock_acct_bg);
> +	vwork->mm = mm;
> +	vwork->npage = npage;
> +	schedule_work(&vwork->work);
> +}
> +
> +/* Some mappings aren't backed by a struct page, for example an mmap'd
> + * MMIO range for our own or another device.  These use a different
> + * pfn conversion and shouldn't be tracked as locked pages. */
> +static int is_invalid_reserved_pfn(unsigned long pfn)
> +{
> +	if (pfn_valid(pfn)) {
> +		int reserved;
> +		struct page *tail = pfn_to_page(pfn);
> +		struct page *head = compound_trans_head(tail);
> +		reserved = PageReserved(head);
> +		if (head != tail) {
> +			/* "head" is not a dangling pointer
> +			 * (compound_trans_head takes care of that)
> +			 * but the hugepage may have been split
> +			 * from under us (and we may not hold a
> +			 * reference count on the head page so it can
> +			 * be reused before we run PageReferenced), so
> +			 * we've to check PageTail before returning
> +			 * what we just read.
> +			 */
> +			smp_rmb();
> +			if (PageTail(tail))
> +				return reserved;
> +		}
> +		return PageReserved(tail);
> +	}
> +
> +	return true;
> +}
> +
> +static int put_pfn(unsigned long pfn, int rdwr)
> +{
> +	if (!is_invalid_reserved_pfn(pfn)) {
> +		struct page *page = pfn_to_page(pfn);
> +		if (rdwr)
> +			SetPageDirty(page);
> +		put_page(page);
> +		return 1;
> +	}
> +	return 0;
> +}
> +
> +/* Unmap DMA region */
> +/* dgate must be held */
> +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			    int npage, int rdwr)

Use of "read" and "write" in DMA can often be confusing, since it's
not always clear if you're talking from the perspective of the CPU or
the device (_writing_ data to a device will usually involve it doing
DMA _reads_ from memory).  It's often best to express things as DMA
direction, 'to device', and 'from device' instead.
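
E.g. the helpers could take the generic direction enum rather than an
rdwr int (sketch only):

	static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
				    int npage, enum dma_data_direction dir);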

> +{
> +	int i, unlocked = 0;
> +
> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> +		unsigned long pfn;
> +
> +		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> +		if (pfn) {
> +			iommu_unmap(iommu->domain, iova, 0);
> +			unlocked += put_pfn(pfn, rdwr);
> +		}
> +	}
> +	return unlocked;
> +}
> +
> +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> +			   unsigned long npage, int rdwr)
> +{
> +	int unlocked;
> +
> +	unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> +	vfio_lock_acct(-unlocked);

Have you checked that your accounting will work out if the user maps
the same memory page to multiple IOVAs?

> +}
> +
> +/* Unmap ALL DMA regions */
> +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos, *pos2;
> +	struct dma_map_page *mlp;
> +
> +	mutex_lock(&iommu->dgate);
> +	list_for_each_safe(pos, pos2, &iommu->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +		list_del(&mlp->list);
> +		kfree(mlp);
> +	}
> +	mutex_unlock(&iommu->dgate);

Ouch, no good at all.  Keeping track of every DMA map is no good on
POWER or other systems where IOMMU operations are a hot path.  I think
you'll need an iommu specific hook for this instead, which uses
whatever data structures are natural for the IOMMU.  For example a
1-level pagetable, like we use on POWER will just zero every entry.
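
Roughly the shape of the idea, as a sketch only (the ops struct and
names below are invented here, not part of this patch):

	struct vfio_iommu_ops {
		int	(*map)(struct vfio_iommu *iommu, dma_addr_t iova,
			       unsigned long vaddr, size_t size, int prot);
		int	(*unmap)(struct vfio_iommu *iommu, dma_addr_t iova,
				 size_t size);
		void	(*unmap_all)(struct vfio_iommu *iommu);
	};

A POWER backend could then implement unmap_all() by simply clearing
its single-level table rather than walking a list of map operations.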

> +}
> +
> +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> +{
> +	struct page *page[1];
> +	struct vm_area_struct *vma;
> +	int ret = -EFAULT;
> +
> +	if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> +		*pfn = page_to_pfn(page[0]);
> +		return 0;
> +	}
> +
> +	down_read(&current->mm->mmap_sem);
> +
> +	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> +
> +	if (vma && vma->vm_flags & VM_PFNMAP) {
> +		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> +		if (is_invalid_reserved_pfn(*pfn))
> +			ret = 0;
> +	}

It's kind of nasty that you take gup_fast(), already designed to grab
pointers for multiple user pages, then just use it one page at a time,
even for a big map.

> +	up_read(&current->mm->mmap_sem);
> +
> +	return ret;
> +}
> +
> +/* Map DMA region */
> +/* dgate must be held */
> +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> +			unsigned long vaddr, int npage, int rdwr)

iova should be a dma_addr_t.  Bus address size need not match virtual
address size, and may not fit in an unsigned long.
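
i.e. something like:

	static int vfio_dma_map(struct vfio_iommu *iommu, dma_addr_t iova,
				unsigned long vaddr, int npage, int rdwr);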

> +{
> +	unsigned long start = iova;
> +	int i, ret, locked = 0, prot = IOMMU_READ;
> +
> +	/* Verify pages are not already mapped */
> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> +		if (iommu_iova_to_phys(iommu->domain, iova))
> +			return -EBUSY;
> +
> +	iova = start;
> +
> +	if (rdwr)
> +		prot |= IOMMU_WRITE;
> +	if (iommu->cache)
> +		prot |= IOMMU_CACHE;
> +
> +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> +		unsigned long pfn = 0;
> +
> +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> +		if (ret) {
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +
> +		/* Only add actual locked pages to accounting */
> +		if (!is_invalid_reserved_pfn(pfn))
> +			locked++;
> +
> +		ret = iommu_map(iommu->domain, iova,
> +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> +		if (ret) {
> +			/* Back out mappings on error */
> +			put_pfn(pfn, rdwr);
> +			__vfio_dma_unmap(iommu, start, i, rdwr);
> +			return ret;
> +		}
> +	}
> +	vfio_lock_acct(locked);
> +	return 0;
> +}
> +
> +static inline int ranges_overlap(unsigned long start1, size_t size1,
> +				 unsigned long start2, size_t size2)
> +{
> +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);

Needs overflow safety.
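
For instance, the check can be rearranged so neither addition can wrap
(untested sketch):

	static inline int ranges_overlap(unsigned long start1, size_t size1,
					 unsigned long start2, size_t size2)
	{
		if (!size1 || !size2)
			return 0;
		if (start1 < start2)
			return (start2 - start1) < size1;
		return (start1 - start2) < size2;
	}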

> +}
> +
> +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> +					  dma_addr_t start, size_t size)
> +{
> +	struct list_head *pos;
> +	struct dma_map_page *mlp;
> +
> +	list_for_each(pos, &iommu->dm_list) {
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +				   start, size))
> +			return mlp;
> +	}
> +	return NULL;
> +}

Again, keeping track of each dma map operation is no good for
performance.

> +
> +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> +			    size_t size, struct dma_map_page *mlp)
> +{
> +	struct dma_map_page *split;
> +	int npage_lo, npage_hi;
> +
> +	/* Existing dma region is completely covered, unmap all */
> +	if (start <= mlp->daddr &&
> +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> +		list_del(&mlp->list);
> +		npage_lo = mlp->npage;
> +		kfree(mlp);
> +		return npage_lo;
> +	}
> +
> +	/* Overlap low address of existing range */
> +	if (start <= mlp->daddr) {
> +		size_t overlap;
> +
> +		overlap = start + size - mlp->daddr;
> +		npage_lo = overlap >> PAGE_SHIFT;
> +		npage_hi = mlp->npage - npage_lo;
> +
> +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> +		mlp->daddr += overlap;
> +		mlp->vaddr += overlap;
> +		mlp->npage -= npage_lo;
> +		return npage_lo;
> +	}
> +
> +	/* Overlap high address of existing range */
> +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> +		size_t overlap;
> +
> +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> +		npage_hi = overlap >> PAGE_SHIFT;
> +		npage_lo = mlp->npage - npage_hi;
> +
> +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> +		mlp->npage -= npage_hi;
> +		return npage_hi;
> +	}
> +
> +	/* Split existing */
> +	npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> +	npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> +
> +	split = kzalloc(sizeof *split, GFP_KERNEL);
> +	if (!split)
> +		return -ENOMEM;
> +
> +	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> +
> +	mlp->npage = npage_lo;
> +
> +	split->npage = npage_hi;
> +	split->daddr = start + size;
> +	split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> +	split->rdwr = mlp->rdwr;
> +	list_add(&split->list, &iommu->dm_list);
> +	return size >> PAGE_SHIFT;
> +}
> +
> +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +	int ret = 0;
> +	size_t npage = dmp->size >> PAGE_SHIFT;
> +	struct list_head *pos, *n;
> +
> +	if (dmp->dmaaddr & ~PAGE_MASK)
> +		return -EINVAL;
> +	if (dmp->size & ~PAGE_MASK)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->dgate);
> +
> +	list_for_each_safe(pos, n, &iommu->dm_list) {
> +		struct dma_map_page *mlp;
> +
> +		mlp = list_entry(pos, struct dma_map_page, list);
> +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> +				   dmp->dmaaddr, dmp->size)) {
> +			ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> +						      dmp->size, mlp);
> +			if (ret > 0)
> +				npage -= NPAGE_TO_SIZE(ret);
> +			if (ret < 0 || npage == 0)
> +				break;
> +		}
> +	}
> +	mutex_unlock(&iommu->dgate);
> +	return ret > 0 ? 0 : ret;
> +}
> +
> +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> +{
> +	int npage;
> +	struct dma_map_page *mlp, *mmlp = NULL;
> +	dma_addr_t daddr = dmp->dmaaddr;
> +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> +	size_t size = dmp->size;
> +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> +
> +	if (vaddr & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (daddr & (PAGE_SIZE-1))
> +		return -EINVAL;
> +	if (size & (PAGE_SIZE-1))
> +		return -EINVAL;
> +
> +	npage = size >> PAGE_SHIFT;
> +	if (!npage)
> +		return -EINVAL;
> +
> +	if (!iommu)
> +		return -EINVAL;
> +
> +	mutex_lock(&iommu->dgate);
> +
> +	if (vfio_find_dma(iommu, daddr, size)) {
> +		ret = -EBUSY;
> +		goto out_lock;
> +	}
> +
> +	/* account for locked pages */
> +	locked = current->mm->locked_vm + npage;
> +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> +			__func__, rlimit(RLIMIT_MEMLOCK));
> +		ret = -ENOMEM;
> +		goto out_lock;
> +	}
> +
> +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> +	if (ret)
> +		goto out_lock;
> +
> +	/* Check if we abut a region below */
> +	if (daddr) {
> +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> +		if (mlp && mlp->rdwr == rdwr &&
> +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> +
> +			mlp->npage += npage;
> +			daddr = mlp->daddr;
> +			vaddr = mlp->vaddr;
> +			npage = mlp->npage;
> +			size = NPAGE_TO_SIZE(npage);
> +
> +			mmlp = mlp;
> +		}
> +	}
> +
> +	if (daddr + size) {
> +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> +
> +			mlp->npage += npage;
> +			mlp->daddr = daddr;
> +			mlp->vaddr = vaddr;
> +
> +			/* If merged above and below, remove previously
> +			 * merged entry.  New entry covers it.  */
> +			if (mmlp) {
> +				list_del(&mmlp->list);
> +				kfree(mmlp);
> +			}
> +			mmlp = mlp;
> +		}
> +	}
> +
> +	if (!mmlp) {
> +		mlp = kzalloc(sizeof *mlp, GFP_KERNEL);
> +		if (!mlp) {
> +			ret = -ENOMEM;
> +			vfio_dma_unmap(iommu, daddr, npage, rdwr);
> +			goto out_lock;
> +		}
> +
> +		mlp->npage = npage;
> +		mlp->daddr = daddr;
> +		mlp->vaddr = vaddr;
> +		mlp->rdwr = rdwr;
> +		list_add(&mlp->list, &iommu->dm_list);
> +	}
> +
> +out_lock:
> +	mutex_unlock(&iommu->dgate);
> +	return ret;
> +}

This whole tracking infrastructure is way too complex to impose on
every IOMMU.  We absolutely don't want to do all this when just
updating a 1-level pagetable.

> +static int vfio_iommu_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_iommu *iommu = filep->private_data;
> +
> +	vfio_release_iommu(iommu);
> +	return 0;
> +}
> +
> +static long vfio_iommu_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_iommu *iommu = filep->private_data;
> +	int ret = -ENOSYS;
> +
> +        if (cmd == VFIO_IOMMU_GET_FLAGS) {
> +                u64 flags = VFIO_IOMMU_FLAGS_MAP_ANY;
> +
> +                ret = put_user(flags, (u64 __user *)arg);

Um.. flags surely have to come from the IOMMU driver.

> +        } else if (cmd == VFIO_IOMMU_MAP_DMA) {
> +		struct vfio_dma_map dm;
> +
> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> +			return -EFAULT;
> +
> +		ret = vfio_dma_map_dm(iommu, &dm);
> +
> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> +			ret = -EFAULT;
> +
> +	} else if (cmd == VFIO_IOMMU_UNMAP_DMA) {
> +		struct vfio_dma_map dm;
> +
> +		if (copy_from_user(&dm, (void __user *)arg, sizeof dm))
> +			return -EFAULT;
> +
> +		ret = vfio_dma_unmap_dm(iommu, &dm);
> +
> +		if (!ret && copy_to_user((void __user *)arg, &dm, sizeof dm))
> +			ret = -EFAULT;
> +	}
> +	return ret;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_iommu_compat_ioctl(struct file *filep,
> +				    unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_iommu_unl_ioctl(filep, cmd, arg);

Um, this only works if the structures are exactly compatible between
32-bit and 64-bit ABIs.  I don't think that is always true.
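
E.g. struct vfio_dtpath carries a bare user pointer, so a 32-bit
caller on a 64-bit kernel would need an explicit conversion along
these lines (sketch; the *32 struct name is made up here):

	struct vfio_dtpath32 {
		__u32		len;
		__u32		index;
		__u64		flags;
		compat_uptr_t	path;
	};

	/* in the compat path */
	dtp.path = compat_ptr(dtp32.path);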

> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_iommu_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vfio_iommu_release,
> +	.unlocked_ioctl	= vfio_iommu_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_iommu_compat_ioctl,
> +#endif
> +};
> diff --git a/drivers/vfio/vfio_main.c b/drivers/vfio/vfio_main.c
> new file mode 100644
> index 0000000..6169356
> --- /dev/null
> +++ b/drivers/vfio/vfio_main.c
> @@ -0,0 +1,1151 @@
> +/*
> + * VFIO framework
> + *
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/cdev.h>
> +#include <linux/compat.h>
> +#include <linux/device.h>
> +#include <linux/file.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/fs.h>
> +#include <linux/idr.h>
> +#include <linux/iommu.h>
> +#include <linux/mm.h>
> +#include <linux/module.h>
> +#include <linux/slab.h>
> +#include <linux/string.h>
> +#include <linux/uaccess.h>
> +#include <linux/vfio.h>
> +#include <linux/wait.h>
> +
> +#include "vfio_private.h"
> +
> +#define DRIVER_VERSION	"0.2"
> +#define DRIVER_AUTHOR	"Alex Williamson <alex.williamson@redhat.com>"
> +#define DRIVER_DESC	"VFIO - User Level meta-driver"
> +
> +static int allow_unsafe_intrs;
> +module_param(allow_unsafe_intrs, int, 0);
> +MODULE_PARM_DESC(allow_unsafe_intrs,
> +        "Allow use of IOMMUs which do not support interrupt remapping");

This should not be a global option, but part of the AMD/Intel IOMMU
specific code.  In general it's a question of how strict the IOMMU
driver is about isolation when it determines what the groups are, and
only the IOMMU driver can know what the possibilities are for its
class of hardware.

> +
> +static struct vfio {
> +	dev_t			devt;
> +	struct cdev		cdev;
> +	struct list_head	group_list;
> +	struct mutex		lock;
> +	struct kref		kref;
> +	struct class		*class;
> +	struct idr		idr;
> +	wait_queue_head_t	release_q;
> +} vfio;
> +
> +static const struct file_operations vfio_group_fops;
> +extern const struct file_operations vfio_iommu_fops;
> +
> +struct vfio_group {
> +	dev_t			devt;
> +	unsigned int		groupid;
> +	struct bus_type		*bus;
> +	struct vfio_iommu	*iommu;
> +	struct list_head	device_list;
> +	struct list_head	iommu_next;
> +	struct list_head	group_next;
> +	int			refcnt;
> +};
> +
> +struct vfio_device {
> +	struct device			*dev;
> +	const struct vfio_device_ops	*ops;
> +	struct vfio_iommu		*iommu;
> +	struct vfio_group		*group;
> +	struct list_head		device_next;
> +	bool				attached;
> +	int				refcnt;
> +	void				*device_data;
> +};
> +
> +/*
> + * Helper functions called under vfio.lock
> + */
> +
> +/* Return true if any devices within a group are opened */
> +static bool __vfio_group_devs_inuse(struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		if (device->refcnt)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/* Return true if any of the groups attached to an iommu are opened.
> + * We can only tear apart merged groups when nothing is left open. */
> +static bool __vfio_iommu_groups_inuse(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +		if (group->refcnt)
> +			return true;
> +	}
> +	return false;
> +}
> +
> +/* An iommu is "in use" if it has a file descriptor open or if any of
> + * the groups assigned to the iommu have devices open. */
> +static bool __vfio_iommu_inuse(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (iommu->refcnt)
> +		return true;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		if (__vfio_group_devs_inuse(group))
> +			return true;
> +	}
> +	return false;
> +}
> +
> +static void __vfio_group_set_iommu(struct vfio_group *group,
> +				   struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (group->iommu)
> +		list_del(&group->iommu_next);
> +	if (iommu)
> +		list_add(&group->iommu_next, &iommu->group_list);
> +
> +	group->iommu = iommu;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		device->iommu = iommu;
> +	}
> +}
> +
> +static void __vfio_iommu_detach_dev(struct vfio_iommu *iommu,
> +				    struct vfio_device *device)
> +{
> +	BUG_ON(!iommu->domain && device->attached);
> +
> +	if (!iommu->domain || !device->attached)
> +		return;
> +
> +	iommu_detach_device(iommu->domain, device->dev);
> +	device->attached = false;
> +}
> +
> +static void __vfio_iommu_detach_group(struct vfio_iommu *iommu,
> +				      struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		__vfio_iommu_detach_dev(iommu, device);
> +	}
> +}
> +
> +static int __vfio_iommu_attach_dev(struct vfio_iommu *iommu,
> +				   struct vfio_device *device)
> +{
> +	int ret;
> +
> +	BUG_ON(device->attached);
> +
> +	if (!iommu || !iommu->domain)
> +		return -EINVAL;
> +
> +	ret = iommu_attach_device(iommu->domain, device->dev);
> +	if (!ret)
> +		device->attached = true;
> +
> +	return ret;
> +}
> +
> +static int __vfio_iommu_attach_group(struct vfio_iommu *iommu,
> +				     struct vfio_group *group)
> +{
> +	struct list_head *pos;
> +
> +	list_for_each(pos, &group->device_list) {
> +		struct vfio_device *device;
> +		int ret;
> +
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		ret = __vfio_iommu_attach_dev(iommu, device);
> +		if (ret) {
> +			__vfio_iommu_detach_group(iommu, group);
> +			return ret;
> +		}
> +	}
> +	return 0;
> +}
> +
> +/* The iommu is viable, ie. ready to be configured, when all the devices
> + * for all the groups attached to the iommu are bound to their vfio device
> + * drivers (ex. vfio-pci).  This sets the device_data private data pointer. */
> +static bool __vfio_iommu_viable(struct vfio_iommu *iommu)
> +{
> +	struct list_head *gpos, *dpos;
> +
> +	list_for_each(gpos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (!device->device_data)
> +				return false;
> +		}
> +	}
> +	return true;
> +}
> +
> +static void __vfio_close_iommu(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +
> +	if (!iommu->domain)
> +		return;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		__vfio_iommu_detach_group(iommu, group);
> +	}
> +
> +	vfio_iommu_unmapall(iommu);
> +
> +	iommu_domain_free(iommu->domain);
> +	iommu->domain = NULL;
> +	iommu->mm = NULL;
> +}
> +
> +/* Open the IOMMU.  This gates all access to the iommu or device file
> + * descriptors and sets current->mm as the exclusive user. */
> +static int __vfio_open_iommu(struct vfio_iommu *iommu)
> +{
> +	struct list_head *pos;
> +	int ret;
> +
> +	if (!__vfio_iommu_viable(iommu))
> +		return -EBUSY;
> +
> +	if (iommu->domain)
> +		return -EINVAL;
> +
> +	iommu->domain = iommu_domain_alloc(iommu->bus);
> +	if (!iommu->domain)
> +		return -EFAULT;
> +
> +	list_for_each(pos, &iommu->group_list) {
> +		struct vfio_group *group;
> +		group = list_entry(pos, struct vfio_group, iommu_next);
> +
> +		ret = __vfio_iommu_attach_group(iommu, group);
> +		if (ret) {
> +			__vfio_close_iommu(iommu);
> +			return ret;
> +		}
> +	}
> +
> +	if (!allow_unsafe_intrs &&
> +	    !iommu_domain_has_cap(iommu->domain, IOMMU_CAP_INTR_REMAP)) {
> +		__vfio_close_iommu(iommu);
> +		return -EFAULT;
> +	}
> +
> +	iommu->cache = (iommu_domain_has_cap(iommu->domain,
> +					     IOMMU_CAP_CACHE_COHERENCY) != 0);
> +	iommu->mm = current->mm;
> +
> +	return 0;
> +}
> +
> +/* Actively try to tear down the iommu and merged groups.  If there are no
> + * open iommu or device fds, we close the iommu.  If we close the iommu and
> + * there are also no open group fds, we can further dissolve the group to
> + * iommu association and free the iommu data structure. */
> +static int __vfio_try_dissolve_iommu(struct vfio_iommu *iommu)
> +{
> +
> +	if (__vfio_iommu_inuse(iommu))
> +		return -EBUSY;
> +
> +	__vfio_close_iommu(iommu);
> +
> +	if (!__vfio_iommu_groups_inuse(iommu)) {
> +		struct list_head *pos, *ppos;
> +
> +		list_for_each_safe(pos, ppos, &iommu->group_list) {
> +			struct vfio_group *group;
> +
> +			group = list_entry(pos, struct vfio_group, iommu_next);
> +			__vfio_group_set_iommu(group, NULL);
> +		}
> +
> +
> +		kfree(iommu);
> +	}
> +
> +	return 0;
> +}
> +
> +static struct vfio_device *__vfio_lookup_dev(struct device *dev)
> +{
> +	struct list_head *gpos;
> +	unsigned int groupid;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return NULL;
> +
> +	list_for_each(gpos, &vfio.group_list) {
> +		struct vfio_group *group;
> +		struct list_head *dpos;
> +
> +		group = list_entry(gpos, struct vfio_group, group_next);
> +
> +		if (group->groupid != groupid)
> +			continue;
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (device->dev == dev)
> +				return device;
> +		}
> +	}
> +	return NULL;
> +}
> +
> +/* All release paths simply decrement the refcnt, attempt to teardown
> + * the iommu and merged groups, and wakeup anything that might be
> + * waiting if we successfully dissolve anything. */
> +static int vfio_do_release(int *refcnt, struct vfio_iommu *iommu)
> +{
> +	bool wake;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	(*refcnt)--;
> +	wake = (__vfio_try_dissolve_iommu(iommu) == 0);
> +
> +	mutex_unlock(&vfio.lock);
> +
> +	if (wake)
> +		wake_up(&vfio.release_q);
> +
> +	return 0;
> +}
> +
> +/*
> + * Device fops - passthrough to vfio device driver w/ device_data
> + */
> +static int vfio_device_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	vfio_do_release(&device->refcnt, device->iommu);
> +
> +	device->ops->put(device->device_data);
> +
> +	return 0;
> +}
> +
> +static long vfio_device_unl_ioctl(struct file *filep,
> +				  unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->ioctl(device->device_data, cmd, arg);
> +}
> +
> +static ssize_t vfio_device_read(struct file *filep, char __user *buf,
> +				size_t count, loff_t *ppos)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->read(device->device_data, buf, count, ppos);
> +}
> +
> +static ssize_t vfio_device_write(struct file *filep, const char __user *buf,
> +				 size_t count, loff_t *ppos)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->write(device->device_data, buf, count, ppos);
> +}
> +
> +static int vfio_device_mmap(struct file *filep, struct vm_area_struct *vma)
> +{
> +	struct vfio_device *device = filep->private_data;
> +
> +	return device->ops->mmap(device->device_data, vma);
> +}
> +	
> +#ifdef CONFIG_COMPAT
> +static long vfio_device_compat_ioctl(struct file *filep,
> +				     unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_device_unl_ioctl(filep, cmd, arg);
> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +const struct file_operations vfio_device_fops = {
> +	.owner		= THIS_MODULE,
> +	.release	= vfio_device_release,
> +	.read		= vfio_device_read,
> +	.write		= vfio_device_write,
> +	.unlocked_ioctl	= vfio_device_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_device_compat_ioctl,
> +#endif
> +	.mmap		= vfio_device_mmap,
> +};
> +
> +/*
> + * Group fops
> + */
> +static int vfio_group_open(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_group *group;
> +	int ret = 0;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	group = idr_find(&vfio.idr, iminor(inode));
> +
> +	if (!group) {
> +		ret = -ENODEV;
> +		goto out;
> +	}
> +
> +	filep->private_data = group;
> +
> +	if (!group->iommu) {
> +		struct vfio_iommu *iommu;
> +
> +		iommu = kzalloc(sizeof(*iommu), GFP_KERNEL);
> +		if (!iommu) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +		INIT_LIST_HEAD(&iommu->group_list);
> +		INIT_LIST_HEAD(&iommu->dm_list);
> +		mutex_init(&iommu->dgate);
> +		iommu->bus = group->bus;
> +		__vfio_group_set_iommu(group, iommu);
> +	}
> +	group->refcnt++;
> +
> +out:
> +	mutex_unlock(&vfio.lock);
> +
> +	return ret;
> +}
> +
> +static int vfio_group_release(struct inode *inode, struct file *filep)
> +{
> +	struct vfio_group *group = filep->private_data;
> +
> +	return vfio_do_release(&group->refcnt, group->iommu);
> +}
> +
> +/* Attempt to merge the group pointed to by fd into group.  The merge-ee
> + * group must not have an iommu or any devices open because we cannot
> + * maintain that context across the merge.  The merge-er group can be
> + * in use. */

Yeah, so merge-er group in use still has its problems, because it
could affect what the IOMMU is capable of.

> +static int vfio_group_merge(struct vfio_group *group, int fd)
> +{
> +	struct vfio_group *new;
> +	struct vfio_iommu *old_iommu;
> +	struct file *file;
> +	int ret = 0;
> +	bool opened = false;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	file = fget(fd);
> +	if (!file) {
> +		ret = -EBADF;
> +		goto out_noput;
> +	}
> +
> +	/* Sanity check, is this really our fd? */
> +	if (file->f_op != &vfio_group_fops) {

This should be a WARN_ON or BUG_ON rather than just an error return, surely.

> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	new = file->private_data;
> +
> +	if (!new || new == group || !new->iommu ||
> +	    new->iommu->domain || new->bus != group->bus) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We need to attach all the devices to each domain separately
> +	 * in order to validate that the capabilities match for both.  */
> +	ret = __vfio_open_iommu(new->iommu);
> +	if (ret)
> +		goto out;
> +
> +	if (!group->iommu->domain) {
> +		ret = __vfio_open_iommu(group->iommu);
> +		if (ret)
> +			goto out;
> +		opened = true;
> +	}
> +
> +	/* If cache coherency doesn't match we'd potentialy need to
> +	 * remap existing iommu mappings in the merge-er domain.
> +	 * Poor return to bother trying to allow this currently. */
> +	if (iommu_domain_has_cap(group->iommu->domain,
> +				 IOMMU_CAP_CACHE_COHERENCY) !=
> +	    iommu_domain_has_cap(new->iommu->domain,
> +				 IOMMU_CAP_CACHE_COHERENCY)) {
> +		__vfio_close_iommu(new->iommu);
> +		if (opened)
> +			__vfio_close_iommu(group->iommu);
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* Close the iommu for the merge-ee and attach all its devices
> +	 * to the merge-er iommu. */
> +	__vfio_close_iommu(new->iommu);
> +
> +	ret = __vfio_iommu_attach_group(group->iommu, new);
> +	if (ret)
> +		goto out;
> +
> +	/* set_iommu unlinks new from the iommu, so save a pointer to it */
> +	old_iommu = new->iommu;
> +	__vfio_group_set_iommu(new, group->iommu);
> +	kfree(old_iommu);
> +
> +out:
> +	fput(file);
> +out_noput:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Unmerge the group pointed to by fd from group. */
> +static int vfio_group_unmerge(struct vfio_group *group, int fd)
> +{
> +	struct vfio_group *new;
> +	struct vfio_iommu *new_iommu;
> +	struct file *file;
> +	int ret = 0;
> +
> +	/* Since the merge-out group is already opened, it needs to
> +	 * have an iommu struct associated with it. */
> +	new_iommu = kzalloc(sizeof(*new_iommu), GFP_KERNEL);
> +	if (!new_iommu)
> +		return -ENOMEM;
> +
> +	INIT_LIST_HEAD(&new_iommu->group_list);
> +	INIT_LIST_HEAD(&new_iommu->dm_list);
> +	mutex_init(&new_iommu->dgate);
> +	new_iommu->bus = group->bus;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	file = fget(fd);
> +	if (!file) {
> +		ret = -EBADF;
> +		goto out_noput;
> +	}
> +
> +	/* Sanity check, is this really our fd? */
> +	if (file->f_op != &vfio_group_fops) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	new = file->private_data;
> +	if (!new || new == group || new->iommu != group->iommu) {
> +		ret = -EINVAL;
> +		goto out;
> +	}
> +
> +	/* We can't merge-out a group with devices still in use. */
> +	if (__vfio_group_devs_inuse(new)) {
> +		ret = -EBUSY;
> +		goto out;
> +	}
> +
> +	__vfio_iommu_detach_group(group->iommu, new);
> +	__vfio_group_set_iommu(new, new_iommu);
> +
> +out:
> +	fput(file);
> +out_noput:
> +	if (ret)
> +		kfree(new_iommu);
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Get a new iommu file descriptor.  This will open the iommu, setting
> + * the current->mm ownership if it's not already set. */

I know I've had this explained to me several times before, but I've
forgotten again.  Why do we need to wire the iommu to an mm?

> +static int vfio_group_get_iommu_fd(struct vfio_group *group)
> +{
> +	int ret = 0;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (!group->iommu->domain) {
> +		ret = __vfio_open_iommu(group->iommu);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	ret = anon_inode_getfd("[vfio-iommu]", &vfio_iommu_fops,
> +			       group->iommu, O_RDWR);
> +	if (ret < 0)
> +		goto out;
> +
> +	group->iommu->refcnt++;
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Get a new device file descriptor.  This will open the iommu, setting
> + * the current->mm ownership if it's not already set.  It's difficult to
> + * specify the requirements for matching a user supplied buffer to a
> + * device, so we use a vfio driver callback to test for a match.  For
> + * PCI, dev_name(dev) is unique, but other drivers may require including
> + * a parent device string. */

At some point we probably want an interface to enumerate the devices
too, but that can probably wait.

> +static int vfio_group_get_device_fd(struct vfio_group *group, char *buf)
> +{
> +	struct vfio_iommu *iommu = group->iommu;
> +	struct list_head *gpos;
> +	int ret = -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (!iommu->domain) {
> +		ret = __vfio_open_iommu(iommu);
> +		if (ret)
> +			goto out;
> +	}
> +
> +	list_for_each(gpos, &iommu->group_list) {
> +		struct list_head *dpos;
> +
> +		group = list_entry(gpos, struct vfio_group, iommu_next);
> +
> +		list_for_each(dpos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +
> +			if (device->ops->match(device->dev, buf)) {
> +				struct file *file;
> +
> +				if (device->ops->get(device->device_data)) {
> +					ret = -EFAULT;
> +					goto out;
> +				}
> +
> +				/* We can't use anon_inode_getfd(), like above
> +				 * because we need to modify the f_mode flags
> +				 * directly to allow more than just ioctls */
> +				ret = get_unused_fd();
> +				if (ret < 0) {
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}
> +
> +				file = anon_inode_getfile("[vfio-device]",
> +							  &vfio_device_fops,
> +							  device, O_RDWR);
> +				if (IS_ERR(file)) {
> +					put_unused_fd(ret);
> +					ret = PTR_ERR(file);
> +					device->ops->put(device->device_data);
> +					goto out;
> +				}
> +
> +				/* Todo: add an anon_inode interface to do
> +				 * this.  Appears to be missing by lack of
> +				 * need rather than explicitly prevented.
> +				 * Now there's need. */
> +				file->f_mode |= (FMODE_LSEEK |
> +						 FMODE_PREAD |
> +						 FMODE_PWRITE);
> +
> +				fd_install(ret, file);
> +
> +				device->refcnt++;
> +				goto out;
> +			}
> +		}
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +static long vfio_group_unl_ioctl(struct file *filep,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct vfio_group *group = filep->private_data;
> +
> +	if (cmd == VFIO_GROUP_GET_FLAGS) {
> +		u64 flags = 0;
> +
> +		mutex_lock(&vfio.lock);
> +		if (__vfio_iommu_viable(group->iommu))
> +			flags |= VFIO_GROUP_FLAGS_VIABLE;
> +		mutex_unlock(&vfio.lock);
> +
> +		if (group->iommu->mm)
> +			flags |= VFIO_GROUP_FLAGS_MM_LOCKED;
> +
> +		return put_user(flags, (u64 __user *)arg);
> +	}
> +		
> +	/* Below commands are restricted once the mm is set */
> +	if (group->iommu->mm && group->iommu->mm != current->mm)
> +		return -EPERM;
> +
> +	if (cmd == VFIO_GROUP_MERGE || cmd == VFIO_GROUP_UNMERGE) {
> +		int fd;
> +		
> +		if (get_user(fd, (int __user *)arg))
> +			return -EFAULT;
> +		if (fd < 0)
> +			return -EINVAL;
> +
> +		if (cmd == VFIO_GROUP_MERGE)
> +			return vfio_group_merge(group, fd);
> +		else
> +			return vfio_group_unmerge(group, fd);
> +	} else if (cmd == VFIO_GROUP_GET_IOMMU_FD) {
> +		return vfio_group_get_iommu_fd(group);
> +	} else if (cmd == VFIO_GROUP_GET_DEVICE_FD) {
> +		char *buf;
> +		int ret;
> +
> +		buf = strndup_user((const char __user *)arg, PAGE_SIZE);
> +		if (IS_ERR(buf))
> +			return PTR_ERR(buf);
> +
> +		ret = vfio_group_get_device_fd(group, buf);
> +		kfree(buf);
> +		return ret;
> +	}
> +
> +	return -ENOSYS;
> +}
> +
> +#ifdef CONFIG_COMPAT
> +static long vfio_group_compat_ioctl(struct file *filep,
> +				    unsigned int cmd, unsigned long arg)
> +{
> +	arg = (unsigned long)compat_ptr(arg);
> +	return vfio_group_unl_ioctl(filep, cmd, arg);
> +}
> +#endif	/* CONFIG_COMPAT */
> +
> +static const struct file_operations vfio_group_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vfio_group_open,
> +	.release	= vfio_group_release,
> +	.unlocked_ioctl	= vfio_group_unl_ioctl,
> +#ifdef CONFIG_COMPAT
> +	.compat_ioctl	= vfio_group_compat_ioctl,
> +#endif
> +};
> +
> +/* iommu fd release hook */
> +int vfio_release_iommu(struct vfio_iommu *iommu)
> +{
> +	return vfio_do_release(&iommu->refcnt, iommu);
> +}
> +
> +/*
> + * VFIO driver API
> + */
> +
> +/* Add a new device to the vfio framework with associated vfio driver
> + * callbacks.  This is the entry point for vfio drivers to register devices. */
> +int vfio_group_add_dev(struct device *dev, const struct vfio_device_ops *ops)
> +{
> +	struct list_head *pos;
> +	struct vfio_group *group = NULL;
> +	struct vfio_device *device = NULL;
> +	unsigned int groupid;
> +	int ret = 0;
> +	bool new_group = false;
> +
> +	if (!ops)
> +		return -EINVAL;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return -ENODEV;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	list_for_each(pos, &vfio.group_list) {
> +		group = list_entry(pos, struct vfio_group, group_next);
> +		if (group->groupid == groupid)
> +			break;
> +		group = NULL;
> +	}
> +
> +	if (!group) {
> +		int minor;
> +
> +		if (unlikely(idr_pre_get(&vfio.idr, GFP_KERNEL) == 0)) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group = kzalloc(sizeof(*group), GFP_KERNEL);
> +		if (!group) {
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		group->groupid = groupid;
> +		INIT_LIST_HEAD(&group->device_list);
> +
> +		ret = idr_get_new(&vfio.idr, group, &minor);
> +		if (ret == 0 && minor > MINORMASK) {
> +			idr_remove(&vfio.idr, minor);
> +			kfree(group);
> +			ret = -ENOSPC;
> +			goto out;
> +		}
> +
> +		group->devt = MKDEV(MAJOR(vfio.devt), minor);
> +		device_create(vfio.class, NULL, group->devt,
> +			      group, "%u", groupid);
> +
> +		group->bus = dev->bus;
> +		list_add(&group->group_next, &vfio.group_list);
> +		new_group = true;
> +	} else {
> +		if (group->bus != dev->bus) {
> +			printk(KERN_WARNING
> +			       "Error: IOMMU group ID conflict.  Group ID %u "
> +				"on both bus %s and %s\n", groupid,
> +				group->bus->name, dev->bus->name);
> +			ret = -EFAULT;
> +			goto out;
> +		}
> +
> +		list_for_each(pos, &group->device_list) {
> +			device = list_entry(pos,
> +					    struct vfio_device, device_next);
> +			if (device->dev == dev)
> +				break;
> +			device = NULL;
> +		}
> +	}
> +
> +	if (!device) {
> +		if (__vfio_group_devs_inuse(group) ||
> +		    (group->iommu && group->iommu->refcnt)) {
> +			printk(KERN_WARNING
> +			       "Adding device %s to group %u while group is already in use!!\n",
> +			       dev_name(dev), group->groupid);
> +			/* XXX How to prevent other drivers from claiming? */
> +		}
> +
> +		device = kzalloc(sizeof(*device), GFP_KERNEL);
> +		if (!device) {
> +			/* If we just created this group, tear it down */
> +			if (new_group) {
> +				list_del(&group->group_next);
> +				device_destroy(vfio.class, group->devt);
> +				idr_remove(&vfio.idr, MINOR(group->devt));
> +				kfree(group);
> +			}
> +			ret = -ENOMEM;
> +			goto out;
> +		}
> +
> +		list_add(&device->device_next, &group->device_list);
> +		device->dev = dev;
> +		device->ops = ops;
> +		device->iommu = group->iommu; /* NULL if new */
> +		__vfio_iommu_attach_dev(group->iommu, device);
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_add_dev);
> +
> +/* Remove a device from the vfio framework */
> +void vfio_group_del_dev(struct device *dev)
> +{
> +	struct list_head *pos;
> +	struct vfio_group *group = NULL;
> +	struct vfio_device *device = NULL;
> +	unsigned int groupid;
> +
> +	if (iommu_device_group(dev, &groupid))
> +		return;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	list_for_each(pos, &vfio.group_list) {
> +		group = list_entry(pos, struct vfio_group, group_next);
> +		if (group->groupid == groupid)
> +			break;
> +		group = NULL;
> +	}
> +
> +	if (!group)
> +		goto out;
> +
> +	list_for_each(pos, &group->device_list) {
> +		device = list_entry(pos, struct vfio_device, device_next);
> +		if (device->dev == dev)
> +			break;
> +		device = NULL;
> +	}
> +
> +	if (!device)
> +		goto out;
> +
> +	BUG_ON(device->refcnt);
> +
> +	if (device->attached)
> +		__vfio_iommu_detach_dev(group->iommu, device);
> +
> +	list_del(&device->device_next);
> +	kfree(device);
> +
> +	/* If this was the only device in the group, remove the group.
> +	 * Note that we intentionally unmerge empty groups here if the
> +	 * group fd isn't opened. */
> +	if (list_empty(&group->device_list) && group->refcnt == 0) {
> +		struct vfio_iommu *iommu = group->iommu;
> +
> +		if (iommu) {
> +			__vfio_group_set_iommu(group, NULL);
> +			__vfio_try_dissolve_iommu(iommu);
> +		}
> +
> +		device_destroy(vfio.class, group->devt);
> +		idr_remove(&vfio.idr, MINOR(group->devt));
> +		list_del(&group->group_next);
> +		kfree(group);
> +	}
> +out:
> +	mutex_unlock(&vfio.lock);
> +}
> +EXPORT_SYMBOL_GPL(vfio_group_del_dev);
> +
> +/* When a device is bound to a vfio device driver (ex. vfio-pci), this
> + * entry point is used to mark the device usable (viable).  The vfio
> + * device driver associates a private device_data struct with the device
> + * here, which will later be returned for vfio_device_fops callbacks. */
> +int vfio_bind_dev(struct device *dev, void *device_data)
> +{
> +	struct vfio_device *device;
> +	int ret = -EINVAL;
> +
> +	BUG_ON(!device_data);
> +
> +	mutex_lock(&vfio.lock);
> +
> +	device = __vfio_lookup_dev(dev);
> +
> +	BUG_ON(!device);
> +
> +	ret = dev_set_drvdata(dev, device);
> +	if (!ret)
> +		device->device_data = device_data;
> +
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +EXPORT_SYMBOL_GPL(vfio_bind_dev);
> +
> +/* A device is only removeable if the iommu for the group is not in use. */
> +static bool vfio_device_removeable(struct vfio_device *device)
> +{
> +	bool ret = true;
> +
> +	mutex_lock(&vfio.lock);
> +
> +	if (device->iommu && __vfio_iommu_inuse(device->iommu))
> +		ret = false;
> +
> +	mutex_unlock(&vfio.lock);
> +	return ret;
> +}
> +
> +/* Notify vfio that a device is being unbound from the vfio device driver
> + * and return the device private device_data pointer.  If the group is
> + * in use, we need to block or take other measures to make it safe for
> + * the device to be removed from the iommu. */
> +void *vfio_unbind_dev(struct device *dev)
> +{
> +	struct vfio_device *device = dev_get_drvdata(dev);
> +	void *device_data;
> +
> +	BUG_ON(!device);
> +
> +again:
> +	if (!vfio_device_removeable(device)) {
> +		/* XXX signal for all devices in group to be removed or
> +		 * resort to killing the process holding the device fds.
> +		 * For now just block waiting for releases to wake us. */
> +		wait_event(vfio.release_q, vfio_device_removeable(device));
> +	}
> +
> +	mutex_lock(&vfio.lock);
> +
> +	/* Need to re-check that the device is still removeable under lock. */
> +	if (device->iommu && __vfio_iommu_inuse(device->iommu)) {
> +		mutex_unlock(&vfio.lock);
> +		goto again;
> +	}
> +
> +	device_data = device->device_data;
> +
> +	device->device_data = NULL;
> +	dev_set_drvdata(dev, NULL);
> +
> +	mutex_unlock(&vfio.lock);
> +	return device_data;
> +}
> +EXPORT_SYMBOL_GPL(vfio_unbind_dev);
> +
> +/*
> + * Module/class support
> + */
> +static void vfio_class_release(struct kref *kref)
> +{
> +	class_destroy(vfio.class);
> +	vfio.class = NULL;
> +}
> +
> +static char *vfio_devnode(struct device *dev, mode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vfio/%s", dev_name(dev));
> +}
> +
> +static int __init vfio_init(void)
> +{
> +	int ret;
> +
> +	idr_init(&vfio.idr);
> +	mutex_init(&vfio.lock);
> +	INIT_LIST_HEAD(&vfio.group_list);
> +	init_waitqueue_head(&vfio.release_q);
> +
> +	kref_init(&vfio.kref);
> +	vfio.class = class_create(THIS_MODULE, "vfio");
> +	if (IS_ERR(vfio.class)) {
> +		ret = PTR_ERR(vfio.class);
> +		goto err_class;
> +	}
> +
> +	vfio.class->devnode = vfio_devnode;
> +
> +	/* FIXME - how many minors to allocate... all of them! */
> +	ret = alloc_chrdev_region(&vfio.devt, 0, MINORMASK, "vfio");
> +	if (ret)
> +		goto err_chrdev;
> +
> +	cdev_init(&vfio.cdev, &vfio_group_fops);
> +	ret = cdev_add(&vfio.cdev, vfio.devt, MINORMASK);
> +	if (ret)
> +		goto err_cdev;
> +
> +	pr_info(DRIVER_DESC " version: " DRIVER_VERSION "\n");
> +
> +	return 0;
> +
> +err_cdev:
> +	unregister_chrdev_region(vfio.devt, MINORMASK);
> +err_chrdev:
> +	kref_put(&vfio.kref, vfio_class_release);
> +err_class:
> +	return ret;
> +}
> +
> +static void __exit vfio_cleanup(void)
> +{
> +	struct list_head *gpos, *gppos;
> +
> +	list_for_each_safe(gpos, gppos, &vfio.group_list) {
> +		struct vfio_group *group;
> +		struct list_head *dpos, *dppos;
> +
> +		group = list_entry(gpos, struct vfio_group, group_next);
> +
> +		list_for_each_safe(dpos, dppos, &group->device_list) {
> +			struct vfio_device *device;
> +
> +			device = list_entry(dpos,
> +					    struct vfio_device, device_next);
> +			vfio_group_del_dev(device->dev);
> +		}
> +	}
> +
> +	idr_destroy(&vfio.idr);
> +	cdev_del(&vfio.cdev);
> +	unregister_chrdev_region(vfio.devt, MINORMASK);
> +	kref_put(&vfio.kref, vfio_class_release);
> +}
> +
> +module_init(vfio_init);
> +module_exit(vfio_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> diff --git a/drivers/vfio/vfio_private.h b/drivers/vfio/vfio_private.h
> new file mode 100644
> index 0000000..350ad67
> --- /dev/null
> +++ b/drivers/vfio/vfio_private.h
> @@ -0,0 +1,34 @@
> +/*
> + * Copyright (C) 2011 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio:
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + */
> +
> +#include <linux/list.h>
> +#include <linux/mutex.h>
> +
> +#ifndef VFIO_PRIVATE_H
> +#define VFIO_PRIVATE_H
> +
> +struct vfio_iommu {
> +	struct iommu_domain		*domain;
> +	struct bus_type			*bus;
> +	struct mutex			dgate;
> +	struct list_head		dm_list;
> +	struct mm_struct		*mm;
> +	struct list_head		group_list;
> +	int				refcnt;
> +	bool				cache;
> +};
> +
> +extern int vfio_release_iommu(struct vfio_iommu *iommu);
> +extern void vfio_iommu_unmapall(struct vfio_iommu *iommu);
> +
> +#endif /* VFIO_PRIVATE_H */
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> new file mode 100644
> index 0000000..4269b08
> --- /dev/null
> +++ b/include/linux/vfio.h
> @@ -0,0 +1,155 @@
> +/*
> + * Copyright 2010 Cisco Systems, Inc.  All rights reserved.
> + * Author: Tom Lyon, pugs@cisco.com
> + *
> + * This program is free software; you may redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; version 2 of the License.
> + *
> + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
> + * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
> + * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
> + * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
> + * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
> + * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
> + * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> + * SOFTWARE.
> + *
> + * Portions derived from drivers/uio/uio.c:
> + * Copyright(C) 2005, Benedikt Spranger <b.spranger@linutronix.de>
> + * Copyright(C) 2005, Thomas Gleixner <tglx@linutronix.de>
> + * Copyright(C) 2006, Hans J. Koch <hjk@linutronix.de>
> + * Copyright(C) 2006, Greg Kroah-Hartman <greg@kroah.com>
> + *
> + * Portions derived from drivers/uio/uio_pci_generic.c:
> + * Copyright (C) 2009 Red Hat, Inc.
> + * Author: Michael S. Tsirkin <mst@redhat.com>
> + */
> +#include <linux/types.h>
> +
> +#ifndef VFIO_H
> +#define VFIO_H
> +
> +#ifdef __KERNEL__
> +
> +struct vfio_device_ops {
> +	bool			(*match)(struct device *, char *);
> +	int			(*get)(void *);
> +	void			(*put)(void *);
> +	ssize_t			(*read)(void *, char __user *,
> +					size_t, loff_t *);
> +	ssize_t			(*write)(void *, const char __user *,
> +					 size_t, loff_t *);
> +	long			(*ioctl)(void *, unsigned int, unsigned long);
> +	int			(*mmap)(void *, struct vm_area_struct *);
> +};
> +
> +extern int vfio_group_add_dev(struct device *device,
> +			      const struct vfio_device_ops *ops);
> +extern void vfio_group_del_dev(struct device *device);
> +extern int vfio_bind_dev(struct device *device, void *device_data);
> +extern void *vfio_unbind_dev(struct device *device);
> +
> +#endif /* __KERNEL__ */
> +
> +/*
> + * VFIO driver - allow mapping and use of certain devices
> + * in unprivileged user processes. (If IOMMU is present)
> + * Especially useful for Virtual Function parts of SR-IOV devices
> + */
> +
> +
> +/* Kernel & User level defines for ioctls */
> +
> +#define VFIO_GROUP_GET_FLAGS		_IOR(';', 100, __u64)
> + #define VFIO_GROUP_FLAGS_VIABLE	(1 << 0)
> + #define VFIO_GROUP_FLAGS_MM_LOCKED	(1 << 1)
> +#define VFIO_GROUP_MERGE		_IOW(';', 101, int)
> +#define VFIO_GROUP_UNMERGE		_IOW(';', 102, int)
> +#define VFIO_GROUP_GET_IOMMU_FD		_IO(';', 103)
> +#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 104, char *)
> +
> +/*
> + * Structure for DMA mapping of user buffers
> + * vaddr, dmaaddr, and size must all be page aligned
> + */
> +struct vfio_dma_map {
> +	__u64	len;		/* length of structure */
> +	__u64	vaddr;		/* process virtual addr */
> +	__u64	dmaaddr;	/* desired and/or returned dma address */
> +	__u64	size;		/* size in bytes */
> +	__u64	flags;
> +#define	VFIO_DMA_MAP_FLAG_WRITE		(1 << 0) /* req writeable DMA mem */
> +};
> +
> +#define	VFIO_IOMMU_GET_FLAGS		_IOR(';', 105, __u64)
> + /* Does the IOMMU support mapping any IOVA to any virtual address? */
> + #define VFIO_IOMMU_FLAGS_MAP_ANY	(1 << 0)
> +#define	VFIO_IOMMU_MAP_DMA		_IOWR(';', 106, struct vfio_dma_map)
> +#define	VFIO_IOMMU_UNMAP_DMA		_IOWR(';', 107, struct vfio_dma_map)
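
From the user side this ends up looking something like the following
(sketch only, error handling omitted; iommu_fd comes from
VFIO_GROUP_GET_IOMMU_FD):

	struct vfio_dma_map dm = {
		.len	 = sizeof(dm),
		.vaddr	 = (__u64)(uintptr_t)buf,	/* page aligned buffer */
		.dmaaddr = 0x100000,			/* requested IOVA */
		.size	 = 0x10000,
		.flags	 = VFIO_DMA_MAP_FLAG_WRITE,
	};

	if (ioctl(iommu_fd, VFIO_IOMMU_MAP_DMA, &dm))
		perror("VFIO_IOMMU_MAP_DMA");
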
> +
> +#define VFIO_DEVICE_GET_FLAGS		_IOR(';', 108, __u64)
> + #define VFIO_DEVICE_FLAGS_PCI		(1 << 0)
> + #define VFIO_DEVICE_FLAGS_DT		(1 << 1)
> + #define VFIO_DEVICE_FLAGS_RESET	(1 << 2)
> +#define VFIO_DEVICE_GET_NUM_REGIONS	_IOR(';', 109, int)
> +
> +struct vfio_region_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* region number */
> +	__u64	size;		/* size in bytes of region */
> +	__u64	offset;		/* start offset of region */
> +	__u64	flags;
> +#define VFIO_REGION_INFO_FLAG_MMAP		(1 << 0)
> +#define VFIO_REGION_INFO_FLAG_RO		(1 << 1)
> +#define VFIO_REGION_INFO_FLAG_PHYS_VALID	(1 << 2)
> +	__u64	phys;		/* physical address of region */
> +};
> +
> +#define VFIO_DEVICE_GET_REGION_INFO	_IOWR(';', 110, struct vfio_region_info)
> +
> +#define VFIO_DEVICE_GET_NUM_IRQS	_IOR(';', 111, int)
> +
> +struct vfio_irq_info {
> +	__u32	len;		/* length of structure */
> +	__u32	index;		/* IRQ number */
> +	__u32	count;		/* number of individual IRQs */
> +	__u32	flags;
> +#define VFIO_IRQ_INFO_FLAG_LEVEL		(1 << 0)
> +};
> +
> +#define VFIO_DEVICE_GET_IRQ_INFO	_IOWR(';', 112, struct vfio_irq_info)
> +
> +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> +#define VFIO_DEVICE_SET_IRQ_EVENTFDS	_IOW(';', 113, int)
> +
> +/* Unmask IRQ index, arg[0] = index */
> +#define VFIO_DEVICE_UNMASK_IRQ		_IOW(';', 114, int)
> +
> +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD	_IOW(';', 115, int)
> +
> +#define VFIO_DEVICE_RESET		_IO(';', 116)
> +
> +struct vfio_dtpath {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u64	flags;
> +#define VFIO_DTPATH_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTPATH_FLAGS_IRQ		(1 << 1)
> +	char	*path;
> +};
> +#define VFIO_DEVICE_GET_DTPATH		_IOWR(';', 117, struct vfio_dtpath)
> +
> +struct vfio_dtindex {
> +	__u32	len;		/* length of structure */
> +	__u32	index;
> +	__u32	prop_type;
> +	__u32	prop_index;
> +	__u64	flags;
> +#define VFIO_DTINDEX_FLAGS_REGION	(1 << 0)
> +#define VFIO_DTINDEX_FLAGS_IRQ		(1 << 1)
> +};
> +#define VFIO_DEVICE_GET_DTINDEX		_IOWR(';', 118, struct vfio_dtindex)
> +
> +#endif /* VFIO_H */
>
Alex Williamson - Nov. 15, 2011, 6:01 p.m.
On Tue, 2011-11-15 at 17:34 +1100, David Gibson wrote:
> On Thu, Nov 03, 2011 at 02:12:24PM -0600, Alex Williamson wrote:
> > diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
> > new file mode 100644
> > index 0000000..5866896
> > --- /dev/null
> > +++ b/Documentation/vfio.txt
> > @@ -0,0 +1,304 @@
> > +VFIO - "Virtual Function I/O"[1]
> > +-------------------------------------------------------------------------------
> > +Many modern systems now provide DMA and interrupt remapping facilities
> > +to help ensure I/O devices behave within the boundaries they've been
> > +allotted.  This includes x86 hardware with AMD-Vi and Intel VT-d as
> > +well as POWER systems with Partitionable Endpoints (PEs) and even
> > +embedded powerpc systems (technology name unknown).  The VFIO driver
> > +is an IOMMU/device agnostic framework for exposing direct device
> > +access to userspace, in a secure, IOMMU protected environment.  In
> > +other words, this allows safe, non-privileged, userspace drivers.
> 
> It's perhaps worth emphasisng that "safe" depends on the hardware
> being sufficiently well behaved.  BenH, I know, thinks there are a
> *lot* of cards that, e.g. have debug registers that allow a backdoor
> to their own config space via MMIO, which would bypass vfio's
> filtering of config space access.  And that's before we even get into
> the varying degrees of completeness in the isolation provided by
> different IOMMUs.

Fair enough.  I know Tom had emphasized "well behaved" in the original
doc.  Virtual functions are probably the best indicator of a well behaved
device.

> > +Why do we want that?  Virtual machines often make use of direct device
> > +access ("device assignment") when configured for the highest possible
> > +I/O performance.  From a device and host perspective, this simply turns
> > +the VM into a userspace driver, with the benefits of significantly
> > +reduced latency, higher bandwidth, and direct use of bare-metal device
> > +drivers[2].
> > +
> > +Some applications, particularly in the high performance computing
> > +field, also benefit from low-overhead, direct device access from
> > +userspace.  Examples include network adapters (often non-TCP/IP based)
> > +and compute accelerators.  Previous to VFIO, these drivers needed to
> 
> s/Previous/Prior/  although that may be a .us vs .au usage thing.

Same difference, AFAICT.

> > +go through the full development cycle to become a proper upstream driver,
> > +be maintained out of tree, or make use of the UIO framework, which
> > +has no notion of IOMMU protection, limited interrupt support, and
> > +requires root privileges to access things like PCI configuration space.
> > +
> > +The VFIO driver framework intends to unify these, replacing both the
> > +KVM PCI specific device assignment currently used as well as provide
> > +a more secure, more featureful userspace driver environment than UIO.
> > +
> > +Groups, Devices, IOMMUs, oh my
> > +-------------------------------------------------------------------------------
> > +
> > +A fundamental component of VFIO is the notion of IOMMU groups.  IOMMUs
> > +can't always distinguish transactions from each individual device in
> > +the system.  Sometimes this is because of the IOMMU design, such as with
> > +PEs, other times it's caused by the I/O topology, for instance a
> > +PCIe-to-PCI bridge masking all devices behind it.  We call the sets of
> > +devices created by these restrictions IOMMU groups (or just "groups" for
> > +this document).
> > +
> > +The IOMMU cannot distinguish transactions between the individual devices
> > +within the group, therefore the group is the basic unit of ownership for
> > +a userspace process.  Because of this, groups are also the primary
> > +interface to both devices and IOMMU domains in VFIO.
> > +
> > +The VFIO representation of groups is created as devices are added into
> > +the framework by a VFIO bus driver.  The vfio-pci module is an example
> > +of a bus driver.  This module registers devices along with a set of bus
> > +specific callbacks with the VFIO core.  These callbacks provide the
> > +interfaces later used for device access.  As each new group is created,
> > +as determined by iommu_device_group(), VFIO creates a /dev/vfio/$GROUP
> > +character device.
> 
> Ok.. so, the fact that it's called "vfio-pci" suggests that the VFIO
> bus driver is per bus type, not per bus instance.   But grouping
> constraints could be per bus instance, if you have a couple of
> different models of PCI host bridge with IOMMUs of different
> capabilities built in, for example.

Yes, vfio-pci manages devices on the pci_bus_type; per type, not per bus
instance.  IOMMUs also register drivers per bus type, not per bus
instance.  The IOMMU driver is free to impose any constraints it wants.

> > +In addition to the device enumeration and callbacks, the VFIO bus driver
> > +also provides a traditional device driver and is able to bind to devices
> > +on it's bus.  When a device is bound to the bus driver it's available to
> > +VFIO.  When all the devices within a group are bound to their bus drivers,
> > +the group becomes "viable" and a user with sufficient access to the VFIO
> > +group chardev can obtain exclusive access to the set of group devices.
> > +
> > +As documented in linux/vfio.h, several ioctls are provided on the
> > +group chardev:
> > +
> > +#define VFIO_GROUP_GET_FLAGS            _IOR(';', 100, __u64)
> > + #define VFIO_GROUP_FLAGS_VIABLE        (1 << 0)
> > + #define VFIO_GROUP_FLAGS_MM_LOCKED     (1 << 1)
> > +#define VFIO_GROUP_MERGE                _IOW(';', 101, int)
> > +#define VFIO_GROUP_UNMERGE              _IOW(';', 102, int)
> > +#define VFIO_GROUP_GET_IOMMU_FD         _IO(';', 103)
> > +#define VFIO_GROUP_GET_DEVICE_FD        _IOW(';', 104, char *)
> > +
> > +The last two ioctls return new file descriptors for accessing
> > +individual devices within the group and programming the IOMMU.  Each of
> > +these new file descriptors provide their own set of file interfaces.
> > +These ioctls will fail if any of the devices within the group are not
> > +bound to their VFIO bus driver.  Additionally, when either of these
> > +interfaces are used, the group is then bound to the struct_mm of the
> > +caller.  The GET_FLAGS ioctl can be used to view the state of the group.
> > +
> > +When either the GET_IOMMU_FD or GET_DEVICE_FD ioctls are invoked, a
> > +new IOMMU domain is created and all of the devices in the group are
> > +attached to it.  This is the only way to ensure full IOMMU isolation
> > +of the group, but potentially wastes resources and cycles if the user
> > +intends to manage multiple groups with the same set of IOMMU mappings.
> > +VFIO therefore provides a group MERGE and UNMERGE interface, which
> > +allows multiple groups to share an IOMMU domain.  Not all IOMMUs allow
> > +arbitrary groups to be merged, so the user should assume merging is
> > +opportunistic.
> 
> I do not think "opportunistic" means what you think it means..
> 
> >  A new group, with no open device or IOMMU file
> > +descriptors, can be merged into an existing, in-use, group using the
> > +MERGE ioctl.  A merged group can be unmerged using the UNMERGE ioctl
> > +once all of the device file descriptors for the group being merged
> > +"out" are closed.
> > +
> > +When groups are merged, the GET_IOMMU_FD and GET_DEVICE_FD ioctls are
> > +essentially fungible between group file descriptors (ie. if device
> > A
> 
> IDNT "fungible" MWYTIM, either.

Hmm, feel free to suggest.  Maybe we're hitting .us vs .au connotation.

> > +is in group X, and X is merged with Y, a file descriptor for A can be
> > +retrieved using GET_DEVICE_FD on Y.  Likewise, GET_IOMMU_FD returns a
> > +file descriptor referencing the same internal IOMMU object from either
> > +X or Y).  Merged groups can be dissolved either explictly with UNMERGE
> > +or automatically when ALL file descriptors for the merged group are
> > +closed (all IOMMUs, all devices, all groups).
> 
> Blech.  I'm really not liking this merge/unmerge API as it stands,
> it's horribly confusing.  At the very least, we need some better
> terminology.  We need some term for the metagroups; supergroups; iommu
> domains or-at-least-they-will-be-once-we-open-the-iommu or
> whathaveyous.
> 
> The first confusing thing about this interface is that each open group
> handle actually refers to two different things; the original group you
> opened and the metagroup it's a part of.  For the GET_IOMMU_FD and
> GET_DEVICE_FD operations, you're using the metagroup and two "merged"
> group handles are interchangeable.

Fungible, even ;)

> For other MERGE and especially
> UNMERGE operations, it matters which is the original group.

If I stick two LEGO blocks together, I need to identify the individual
block I want to remove to pull them back apart...

> The semantics of "merge" and "unmerge" under those names are really
> non-obvious.  Merge kind of has to merge two whole metagroups, but
> it's unclear if unmerge reverses one merge, or just takes out one
> (atom) group.  These operations need better names, at least.

Christian suggested a change to UNMERGE so that we no longer need to
specify a group to unmerge "from".  This makes it more like a list
implementation, except there's no defined list_head.  Any member of the
list can pull in a new entry, and calling UNMERGE on any member extracts
that member.
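
For illustration only, a rough userspace sequence under the current
semantics might look something like the sketch below (group numbers are
made up, error handling is omitted, and I'm assuming the group fd is
passed as a pointer to int, matching the int-array convention the other
ioctls use):

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/vfio.h>

	int main(void)
	{
		int x = open("/dev/vfio/26", O_RDWR);	/* in-use group */
		int y = open("/dev/vfio/27", O_RDWR);	/* no open device/iommu fds */

		ioctl(x, VFIO_GROUP_MERGE, &y);		/* X and Y now share a domain */
		/* GET_IOMMU_FD/GET_DEVICE_FD are interchangeable on x and y here */
		ioctl(x, VFIO_GROUP_UNMERGE, &y);	/* pull Y back out again */
		return 0;
	}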

> Then it's unclear what order you can do various operations, and which
> order you can open and close various things.  You can kind of figure
> it out but it takes far more thinking than it should.
> 
> 
> So at the _very_ least, we need to invent new terminology and find a
> much better way of describing this API's semantics.  I still think an
> entirely different interface, where metagroups are created from
> outside with a lifetime that's not tied to an fd would be a better
> idea.

As we've discussed previously, configfs provides part of this, but has
no ioctl support.  It doesn't make sense to me to go play with groups in
configfs, but then still interact with them via a char dev.  It also
splits the ownership model and makes it harder to enforce who gets to
interact with the devices vs who gets to manipulate groups.  The current
model really isn't that complicated, imho.  As always, feel free to
suggest specific models.  If you have a specific terminology other than
MERGE, please suggest.

> Now, you specify that you can't use a group as the second argument of
> a merge if it already has an open iommu, but it's not clear from the
> doc if you can merge things into a group with an open iommu.

From above:

        A new group, with no open device or IOMMU file descriptors, can
        be merged into an existing, in-use, group using the MERGE ioctl.
                                 ^^^^^^

> Banning
> this would make life simpler, because the IOMMU's effective
> capabilities may change if you add more devices to the domain.  That's
> yet another non-obvious constraint in the interface ordering, though.

Banning this would prevent using merged groups with hotplug, which I
consider to be a primary use case.

> > +The IOMMU file descriptor provides this set of ioctls:
> > +
> > +#define VFIO_IOMMU_GET_FLAGS            _IOR(';', 105, __u64)
> > + #define VFIO_IOMMU_FLAGS_MAP_ANY       (1 << 0)
> > +#define VFIO_IOMMU_MAP_DMA              _IOWR(';', 106, struct vfio_dma_map)
> > +#define VFIO_IOMMU_UNMAP_DMA            _IOWR(';', 107, struct vfio_dma_map)
> > +
> > +The GET_FLAGS ioctl returns basic information about the IOMMU domain.
> > +We currently only support IOMMU domains that are able to map any
> > +virtual address to any IOVA.  This is indicated by the MAP_ANY
> > flag.
> 
> So.  I tend to think of an IOMMU mapping IOVAs to memory pages, rather
> than memory pages to IOVAs.  

I do too, not sure why I wrote it that way, will fix.

> The IOMMU itself, of course maps to
> physical addresses, and the meaning of "virtual address" in this
> context is not really clear.  I think you would be better off saying
> the IOMMU can map any IOVA to any memory page.  From a hardware POV
> that means any physical address, but of course for a VFIO user a page
> is specified by its process virtual address.

Will fix.

> I think we need to pin exactly what "MAP_ANY" means down better.  Now,
> VFIO is pretty much a lost cause if you can't map any normal process
> memory page into the IOMMU, so I think the only thing that is really
> covered is IOVAs.  But saying "can map any IOVA" is not clear, because
> if you can't map it, it's not a (valid) IOVA.  Better to say that
> IOVAs can be any 64-bit value, which I think is what you really mean
> here.

ok

> Of course, since POWER is a platform where this is *not* true, I'd
> prefer to have something giving the range of valid IOVAs in the core
> to start with.

Since iommu_ops does not yet have any concept of this (nudge, nudge), I
figured this would be added later.  A possible implementation would be
that such an iommu would not set MAP_ANY, would add a new flag for
MAP_RANGE, and provide a new VFIO_IOMMU_GET_RANGE_INFO ioctl to describe
it.  I'm guaranteed to get it wrong if I try to predict all your needs.
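
Purely as a strawman, and definitely not something in this patch, such an
interface might look like (names and the ioctl number are placeholders):

	#define VFIO_IOMMU_FLAGS_MAP_RANGE	(1 << 1)

	struct vfio_iommu_range_info {
		__u64	len;		/* length of structure */
		__u64	iova_start;	/* first valid IOVA */
		__u64	iova_size;	/* size of the valid IOVA window */
		__u64	flags;
	};
	#define VFIO_IOMMU_GET_RANGE_INFO	_IOWR(';', 119, struct vfio_iommu_range_info)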

> > +
> > +The (UN)MAP_DMA commands make use of struct vfio_dma_map for mapping
> > +and unmapping IOVAs to process virtual addresses:
> > +
> > +struct vfio_dma_map {
> > +        __u64   len;            /* length of structure */
> 
> Thanks for adding these structure length fields.  But I think they
> should be called something other than 'len', which is likely to be
> confused with size (or some other length that's actually related to
> the operation's parameters).  Better to call it 'structlen' or
> 'argslen' or something.

Ok.  As Scott noted, I've failed to implement these in a way that
actually allows extension, but I'll work on it.
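
One direction I'm considering (sketch only, not what the current patch
does, and assuming we're inside the ioctl handler with the user pointer
in arg) is to copy only the fields the kernel knows about and let the
user-supplied length decide whether any newer fields are present:

	struct vfio_dma_map dm;
	u32 minsz = offsetof(struct vfio_dma_map, flags) + sizeof(dm.flags);

	if (copy_from_user(&dm, (void __user *)arg, minsz))
		return -EFAULT;
	if (dm.len < minsz)
		return -EINVAL;	/* must cover at least today's fields */
	/* fields added later are only consulted when dm.len says they exist */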

> > +        __u64   vaddr;          /* process virtual addr */
> > +        __u64   dmaaddr;        /* desired and/or returned dma address */
> > +        __u64   size;           /* size in bytes */
> > +        __u64   flags;
> > +#define VFIO_DMA_MAP_FLAG_WRITE         (1 << 0) /* req writeable DMA mem */
> 
> Make it independent READ and WRITE flags from the start.  Not all
> combinations will be be valid on all hardware, but that way we have
> the possibilities covered without having to use strange encodings
> later.

Ok.
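
Something like this, then (sketch; the bit assignments are just
placeholders):

	#define VFIO_DMA_MAP_FLAG_READ	(1 << 0)	/* device may read from mapping */
	#define VFIO_DMA_MAP_FLAG_WRITE	(1 << 1)	/* device may write to mapping */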

> > +};
> > +
> > +Current users of VFIO use relatively static DMA mappings, not requiring
> > +high frequency turnover.  As new users are added, it's expected that the
> > +IOMMU file descriptor will evolve to support new mapping interfaces, this
> > +will be reflected in the flags and may present new ioctls and file
> > +interfaces.
> > +
> > +The device GET_FLAGS ioctl is intended to return basic device type and
> > +indicate support for optional capabilities.  Flags currently include whether
> > +the device is PCI or described by Device Tree, and whether the RESET ioctl
> > +is supported:
> > +
> > +#define VFIO_DEVICE_GET_FLAGS           _IOR(';', 108, __u64)
> > + #define VFIO_DEVICE_FLAGS_PCI          (1 << 0)
> > + #define VFIO_DEVICE_FLAGS_DT           (1 << 1)
> 
> TBH, I don't think the VFIO for DT stuff is mature enough yet to be in
> an initial infrastructure patch, though we should certainly be
> discussing it as an add-on patch.

I agree for DT, and PCI should be added with vfio-pci, not the initial
core.

> > + #define VFIO_DEVICE_FLAGS_RESET        (1 << 2)
> > +
> > +The MMIO and IOP resources used by a device are described by regions.
> > +The GET_NUM_REGIONS ioctl tells us how many regions the device supports:
> > +
> > +#define VFIO_DEVICE_GET_NUM_REGIONS     _IOR(';', 109, int)
> > +
> > +Regions are described by a struct vfio_region_info, which is retrieved by
> > +using the GET_REGION_INFO ioctl with vfio_region_info.index field set to
> > +the desired region (0 based index).  Note that devices may implement zero
> > +sized regions (vfio-pci does this to provide a 1:1 BAR to region index
> > +mapping).
> 
> So, I think you're saying that a zero-sized region is used to encode a
> NOP region, that is, to basically put a "no region here" in between
> valid region indices.  You should spell that out.

Ok.
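
For example (sketch only, assuming the fixed BAR-to-index mapping vfio-pci
intends, a device with BAR2 unimplemented, and device_fd coming from
GET_DEVICE_FD):

	struct vfio_region_info info = { .len = sizeof(info), .index = 2 };

	ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &info);
	if (!info.size) {
		/* "no region here"; skip this index and move on to index 3 */
	}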

> [Incidentally, any chance you could borrow one of RH's tech writers
> for this?  I'm afraid you seem to lack the knack for clear and easily
> read documentation]

Thanks for the encouragement :-\  It's no wonder there isn't more
content in Documentation.

> > +struct vfio_region_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* region number */
> > +        __u64   size;           /* size in bytes of region */
> > +        __u64   offset;         /* start offset of region */
> > +        __u64   flags;
> > +#define VFIO_REGION_INFO_FLAG_MMAP              (1 << 0)
> > +#define VFIO_REGION_INFO_FLAG_RO                (1 << 1)
> 
> Again having separate read and write bits from the start will save
> strange encodings later.

Seems highly unlikely, but we have bits to waste...

> > +#define VFIO_REGION_INFO_FLAG_PHYS_VALID        (1 << 2)
> > +        __u64   phys;           /* physical address of region */
> > +};
> 
> I notice there is no field for "type" e.g. MMIO vs. PIO vs. config
> space for PCI.  If you added that having a NONE type might be a
> clearer way of encoding a non-region than just having size==0.

I thought there was some resistance to including MMIO and PIO bits in
the flags.  If that resistance has passed, I can add them, but PCI can
determine this through config space (and vfio-pci exposes config space at
a fixed index).  Having a region w/ size == 0 and the MMIO and PIO flags
unset seems a little redundant if that's the only reason for having them.
A NONE flag doesn't make sense to me.  Config space isn't NONE, but
neither is it MMIO nor PIO; and someone would probably be offended about
even mentioning PIO in the specification.

> > +
> > +#define VFIO_DEVICE_GET_REGION_INFO     _IOWR(';', 110, struct vfio_region_info)
> > +
> > +The offset indicates the offset into the device file descriptor which
> > +accesses the given range (for read/write/mmap/seek).  Flags indicate the
> > +available access types and validity of optional fields.  For instance
> > +the phys field may only be valid for certain devices types.
> > +
> > +Interrupts are described using a similar interface.  GET_NUM_IRQS
> > +reports the number of IRQ indexes for the device.
> > +
> > +#define VFIO_DEVICE_GET_NUM_IRQS        _IOR(';', 111, int)
> > +
> > +struct vfio_irq_info {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;          /* IRQ number */
> > +        __u32   count;          /* number of individual IRQs */
> 
> Is there a reason for allowing irqs in batches like this, rather than
> having each MSI be reflected by a separate irq_info?

Yes, bus drivers like vfio-pci can define index 1 as the MSI info
structure and index 2 as MSI-X.  There's really no need to expose 57
individual MSI interrupts and try to map them to the correct device
specific MSI type if they can only logically be enabled in two distinct
groups.  Bus drivers with individually controllable MSI vectors are free
to expose them separately.  I assume device tree paths would help
associate an index to a specific interrupt.
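
As a sketch of how a user might walk this (device_fd from GET_DEVICE_FD,
no error handling; the index assignments in the comment are only what I'd
expect vfio-pci to do, not something the core defines):

	int i, num = 0;

	ioctl(device_fd, VFIO_DEVICE_GET_NUM_IRQS, &num);

	for (i = 0; i < num; i++) {
		struct vfio_irq_info info = { .len = sizeof(info), .index = i };

		ioctl(device_fd, VFIO_DEVICE_GET_IRQ_INFO, &info);
		/* e.g. for vfio-pci: index 0 = INTx (count 0 or 1),
		 * index 1 = MSI, index 2 = MSI-X, count = vectors available */
	}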

> > +        __u64   flags;
> > +#define VFIO_IRQ_INFO_FLAG_LEVEL                (1 << 0)
> > +};
> > +
> > +Again, zero count entries are allowed (vfio-pci uses a static interrupt
> > +type to index mapping).
> 
> I know what you mean, but you need a clearer way to express it.

I'll work on it.

> > +Information about each index can be retrieved using the GET_IRQ_INFO
> > +ioctl, used much like GET_REGION_INFO.
> > +
> > +#define VFIO_DEVICE_GET_IRQ_INFO        _IOWR(';', 112, struct vfio_irq_info)
> > +
> > +Individual indexes can describe single or sets of IRQs.  This provides the
> > +flexibility to describe PCI INTx, MSI, and MSI-X using a single interface.
> > +
> > +All VFIO interrupts are signaled to userspace via eventfds.  Integer arrays,
> > +as shown below, are used to pass the IRQ info index, the number of eventfds,
> > +and each eventfd to be signaled.  Using a count of 0 disables the interrupt.
> > +
> > +/* Set IRQ eventfds, arg[0] = index, arg[1] = count, arg[2-n] = eventfds */
> > +#define VFIO_DEVICE_SET_IRQ_EVENTFDS    _IOW(';', 113, int)
> > +
> > +When a level triggered interrupt is signaled, the interrupt is masked
> > +on the host.  This prevents an unresponsive userspace driver from
> > +continuing to interrupt the host system.  After servicing the interrupt,
> > +UNMASK_IRQ is used to allow the interrupt to retrigger.  Note that level
> > +triggered interrupts implicitly have a count of 1 per index.
> 
> This is a silly restriction.  Even PCI devices can have up to 4 LSIs
> on a function in theory, though no-one ever does.  Embedded devices
> can and do have multiple level interrupts.

Per the PCI spec, an individual PCI function can only ever have, at
most, a single INTx line.  A multi-function *device* can have up to 4
INTx lines, but what we're exposing here is a struct device, ie. a PCI
function.

Other devices could certainly have multiple level interrupts, and if
grouping them as we do with MSI on PCI makes sense, please let me know.
I just didn't see the value in making the unmask operations handle
sub-indexes if it's not needed.
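
To make the flow concrete, here's a sketch of hooking up a single level
triggered interrupt from userspace (index 0 is only an example, device_fd
comes from GET_DEVICE_FD, the obvious headers are assumed, and error
handling is omitted):

	int args[3], efd = eventfd(0, 0);
	uint64_t count;

	args[0] = 0;	/* IRQ index */
	args[1] = 1;	/* one eventfd follows */
	args[2] = efd;
	ioctl(device_fd, VFIO_DEVICE_SET_IRQ_EVENTFDS, args);

	read(efd, &count, sizeof(count));	/* interrupt fired, now masked */
	/* ... service the device ... */
	ioctl(device_fd, VFIO_DEVICE_UNMASK_IRQ, args);	/* args[0] = index */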

> > +
> > +/* Unmask IRQ index, arg[0] = index */
> > +#define VFIO_DEVICE_UNMASK_IRQ          _IOW(';', 114, int)
> > +
> > +Level triggered interrupts can also be unmasked using an irqfd.  Use
> > +SET_UNMASK_IRQ_EVENTFD to set the file descriptor for this.
> > +
> > +/* Set unmask eventfd, arg[0] = index, arg[1] = eventfd */
> > +#define VFIO_DEVICE_SET_UNMASK_IRQ_EVENTFD      _IOW(';', 115, int)
> > +
> > +When supported, as indicated by the device flags, reset the device.
> > +
> > +#define VFIO_DEVICE_RESET               _IO(';', 116)
> > +
> > +Device tree devices also include ioctls for further defining the
> > +device tree properties of the device:
> > +
> > +struct vfio_dtpath {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;
> > +        __u64   flags;
> > +#define VFIO_DTPATH_FLAGS_REGION        (1 << 0)
> > +#define VFIO_DTPATH_FLAGS_IRQ           (1 << 1)
> > +        char    *path;
> > +};
> > +#define VFIO_DEVICE_GET_DTPATH          _IOWR(';', 117, struct vfio_dtpath)
> > +
> > +struct vfio_dtindex {
> > +        __u32   len;            /* length of structure */
> > +        __u32   index;
> > +        __u32   prop_type;
> > +        __u32   prop_index;
> > +        __u64   flags;
> > +#define VFIO_DTINDEX_FLAGS_REGION       (1 << 0)
> > +#define VFIO_DTINDEX_FLAGS_IRQ          (1 << 1)
> > +};
> > +#define VFIO_DEVICE_GET_DTINDEX         _IOWR(';', 118, struct vfio_dtindex)
> > +
> > +
> > +VFIO bus driver API
> > +-------------------------------------------------------------------------------
> > +
> > +Bus drivers, such as PCI, have three jobs:
> > + 1) Add/remove devices from vfio
> > + 2) Provide vfio_device_ops for device access
> > + 3) Device binding and unbinding
> > +
> > +When initialized, the bus driver should enumerate the devices on it's
> 
> s/it's/its/

Noted.

<snip>
> > +/* Unmap DMA region */
> > +/* dgate must be held */
> > +static int __vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > +			    int npage, int rdwr)
> 
> Use of "read" and "write" in DMA can often be confusing, since it's
> not always clear if you're talking from the perspective of the CPU or
> the device (_writing_ data to a device will usually involve it doing
> DMA _reads_ from memory).  It's often best to express things as DMA
> direction, 'to device', and 'from device' instead.

Good point.

> > +{
> > +	int i, unlocked = 0;
> > +
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE) {
> > +		unsigned long pfn;
> > +
> > +		pfn = iommu_iova_to_phys(iommu->domain, iova) >> PAGE_SHIFT;
> > +		if (pfn) {
> > +			iommu_unmap(iommu->domain, iova, 0);
> > +			unlocked += put_pfn(pfn, rdwr);
> > +		}
> > +	}
> > +	return unlocked;
> > +}
> > +
> > +static void vfio_dma_unmap(struct vfio_iommu *iommu, unsigned long iova,
> > +			   unsigned long npage, int rdwr)
> > +{
> > +	int unlocked;
> > +
> > +	unlocked = __vfio_dma_unmap(iommu, iova, npage, rdwr);
> > +	vfio_lock_acct(-unlocked);
> 
> Have you checked that your accounting will work out if the user maps
> the same memory page to multiple IOVAs?

Hmm, it probably doesn't.  We potentially over-penalize the user process
here.

> > +}
> > +
> > +/* Unmap ALL DMA regions */
> > +void vfio_iommu_unmapall(struct vfio_iommu *iommu)
> > +{
> > +	struct list_head *pos, *pos2;
> > +	struct dma_map_page *mlp;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +	list_for_each_safe(pos, pos2, &iommu->dm_list) {
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +		list_del(&mlp->list);
> > +		kfree(mlp);
> > +	}
> > +	mutex_unlock(&iommu->dgate);
> 
> Ouch, no good at all.  Keeping track of every DMA map is no good on
> POWER or other systems where IOMMU operations are a hot path.  I think
> you'll need an iommu specific hook for this instead, which uses
> whatever data structures are natural for the IOMMU.  For example a
> 1-level pagetable, like we use on POWER will just zero every entry.

It's already been noted in the docs that current users have relatively
static mappings and a performance interface is TBD for dynamically
backing streaming DMA.  The current vfio_iommu exposes iommu_ops, POWER
will need to come up with something to expose instead.

> > +}
> > +
> > +static int vaddr_get_pfn(unsigned long vaddr, int rdwr, unsigned long *pfn)
> > +{
> > +	struct page *page[1];
> > +	struct vm_area_struct *vma;
> > +	int ret = -EFAULT;
> > +
> > +	if (get_user_pages_fast(vaddr, 1, rdwr, page) == 1) {
> > +		*pfn = page_to_pfn(page[0]);
> > +		return 0;
> > +	}
> > +
> > +	down_read(&current->mm->mmap_sem);
> > +
> > +	vma = find_vma_intersection(current->mm, vaddr, vaddr + 1);
> > +
> > +	if (vma && vma->vm_flags & VM_PFNMAP) {
> > +		*pfn = ((vaddr - vma->vm_start) >> PAGE_SHIFT) + vma->vm_pgoff;
> > +		if (is_invalid_reserved_pfn(*pfn))
> > +			ret = 0;
> > +	}
> 
> It's kind of nasty that you take gup_fast(), already designed to grab
> pointers for multiple user pages, then just use it one page at a time,
> even for a big map.

Yep, this needs work, but shouldn't really change the API.
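
For the record, the sort of thing I have in mind (untested sketch with a
hypothetical helper, not in this patch):

	static long vaddr_get_pfns(unsigned long vaddr, long npage, int rdwr,
				   unsigned long *pfns)
	{
		struct page *pages[64];
		long done = 0;

		while (done < npage) {
			int i, chunk = min_t(long, npage - done, 64);
			int got = get_user_pages_fast(vaddr + done * PAGE_SIZE,
						      chunk, rdwr, pages);

			if (got <= 0)
				break;
			for (i = 0; i < got; i++)
				pfns[done + i] = page_to_pfn(pages[i]);
			done += got;
		}
		return done;	/* caller handles a short count */
	}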

> > +	up_read(&current->mm->mmap_sem);
> > +
> > +	return ret;
> > +}
> > +
> > +/* Map DMA region */
> > +/* dgate must be held */
> > +static int vfio_dma_map(struct vfio_iommu *iommu, unsigned long iova,
> > +			unsigned long vaddr, int npage, int rdwr)
> 
> iova should be a dma_addr_t.  Bus address size need not match virtual
> address size, and may not fit in an unsigned long.

ok.

> > +{
> > +	unsigned long start = iova;
> > +	int i, ret, locked = 0, prot = IOMMU_READ;
> > +
> > +	/* Verify pages are not already mapped */
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE)
> > +		if (iommu_iova_to_phys(iommu->domain, iova))
> > +			return -EBUSY;
> > +
> > +	iova = start;
> > +
> > +	if (rdwr)
> > +		prot |= IOMMU_WRITE;
> > +	if (iommu->cache)
> > +		prot |= IOMMU_CACHE;
> > +
> > +	for (i = 0; i < npage; i++, iova += PAGE_SIZE, vaddr += PAGE_SIZE) {
> > +		unsigned long pfn = 0;
> > +
> > +		ret = vaddr_get_pfn(vaddr, rdwr, &pfn);
> > +		if (ret) {
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +
> > +		/* Only add actual locked pages to accounting */
> > +		if (!is_invalid_reserved_pfn(pfn))
> > +			locked++;
> > +
> > +		ret = iommu_map(iommu->domain, iova,
> > +				(phys_addr_t)pfn << PAGE_SHIFT, 0, prot);
> > +		if (ret) {
> > +			/* Back out mappings on error */
> > +			put_pfn(pfn, rdwr);
> > +			__vfio_dma_unmap(iommu, start, i, rdwr);
> > +			return ret;
> > +		}
> > +	}
> > +	vfio_lock_acct(locked);
> > +	return 0;
> > +}
> > +
> > +static inline int ranges_overlap(unsigned long start1, size_t size1,
> > +				 unsigned long start2, size_t size2)
> > +{
> > +	return !(start1 + size1 <= start2 || start2 + size2 <= start1);
> 
> Needs overflow safety.

Yep.
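
Something along these lines, perhaps (sketch; compares distances rather
than sums so nothing can wrap):

	static inline int ranges_overlap(unsigned long start1, size_t size1,
					 unsigned long start2, size_t size2)
	{
		if (!size1 || !size2)
			return 0;
		if (start1 <= start2)
			return start2 - start1 < size1;
		return start1 - start2 < size2;
	}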

> > +}
> > +
> > +static struct dma_map_page *vfio_find_dma(struct vfio_iommu *iommu,
> > +					  dma_addr_t start, size_t size)
> > +{
> > +	struct list_head *pos;
> > +	struct dma_map_page *mlp;
> > +
> > +	list_for_each(pos, &iommu->dm_list) {
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +				   start, size))
> > +			return mlp;
> > +	}
> > +	return NULL;
> > +}
> 
> Again, keeping track of each dma map operation is no good for
> performance.

This is not the performance interface you're looking for.

> > +
> > +int vfio_remove_dma_overlap(struct vfio_iommu *iommu, dma_addr_t start,
> > +			    size_t size, struct dma_map_page *mlp)
> > +{
> > +	struct dma_map_page *split;
> > +	int npage_lo, npage_hi;
> > +
> > +	/* Existing dma region is completely covered, unmap all */
> > +	if (start <= mlp->daddr &&
> > +	    start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		vfio_dma_unmap(iommu, mlp->daddr, mlp->npage, mlp->rdwr);
> > +		list_del(&mlp->list);
> > +		npage_lo = mlp->npage;
> > +		kfree(mlp);
> > +		return npage_lo;
> > +	}
> > +
> > +	/* Overlap low address of existing range */
> > +	if (start <= mlp->daddr) {
> > +		size_t overlap;
> > +
> > +		overlap = start + size - mlp->daddr;
> > +		npage_lo = overlap >> PAGE_SHIFT;
> > +		npage_hi = mlp->npage - npage_lo;
> > +
> > +		vfio_dma_unmap(iommu, mlp->daddr, npage_lo, mlp->rdwr);
> > +		mlp->daddr += overlap;
> > +		mlp->vaddr += overlap;
> > +		mlp->npage -= npage_lo;
> > +		return npage_lo;
> > +	}
> > +
> > +	/* Overlap high address of existing range */
> > +	if (start + size >= mlp->daddr + NPAGE_TO_SIZE(mlp->npage)) {
> > +		size_t overlap;
> > +
> > +		overlap = mlp->daddr + NPAGE_TO_SIZE(mlp->npage) - start;
> > +		npage_hi = overlap >> PAGE_SHIFT;
> > +		npage_lo = mlp->npage - npage_hi;
> > +
> > +		vfio_dma_unmap(iommu, start, npage_hi, mlp->rdwr);
> > +		mlp->npage -= npage_hi;
> > +		return npage_hi;
> > +	}
> > +
> > +	/* Split existing */
> > +	npage_lo = (start - mlp->daddr) >> PAGE_SHIFT;
> > +	npage_hi = mlp->npage - (size >> PAGE_SHIFT) - npage_lo;
> > +
> > +	split = kzalloc(sizeof *split, GFP_KERNEL);
> > +	if (!split)
> > +		return -ENOMEM;
> > +
> > +	vfio_dma_unmap(iommu, start, size >> PAGE_SHIFT, mlp->rdwr);
> > +
> > +	mlp->npage = npage_lo;
> > +
> > +	split->npage = npage_hi;
> > +	split->daddr = start + size;
> > +	split->vaddr = mlp->vaddr + NPAGE_TO_SIZE(npage_lo) + size;
> > +	split->rdwr = mlp->rdwr;
> > +	list_add(&split->list, &iommu->dm_list);
> > +	return size >> PAGE_SHIFT;
> > +}
> > +
> > +int vfio_dma_unmap_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +	int ret = 0;
> > +	size_t npage = dmp->size >> PAGE_SHIFT;
> > +	struct list_head *pos, *n;
> > +
> > +	if (dmp->dmaaddr & ~PAGE_MASK)
> > +		return -EINVAL;
> > +	if (dmp->size & ~PAGE_MASK)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +
> > +	list_for_each_safe(pos, n, &iommu->dm_list) {
> > +		struct dma_map_page *mlp;
> > +
> > +		mlp = list_entry(pos, struct dma_map_page, list);
> > +		if (ranges_overlap(mlp->daddr, NPAGE_TO_SIZE(mlp->npage),
> > +				   dmp->dmaaddr, dmp->size)) {
> > +			ret = vfio_remove_dma_overlap(iommu, dmp->dmaaddr,
> > +						      dmp->size, mlp);
> > +			if (ret > 0)
> > +				npage -= NPAGE_TO_SIZE(ret);
> > +			if (ret < 0 || npage == 0)
> > +				break;
> > +		}
> > +	}
> > +	mutex_unlock(&iommu->dgate);
> > +	return ret > 0 ? 0 : ret;
> > +}
> > +
> > +int vfio_dma_map_dm(struct vfio_iommu *iommu, struct vfio_dma_map *dmp)
> > +{
> > +	int npage;
> > +	struct dma_map_page *mlp, *mmlp = NULL;
> > +	dma_addr_t daddr = dmp->dmaaddr;
> > +	unsigned long locked, lock_limit, vaddr = dmp->vaddr;
> > +	size_t size = dmp->size;
> > +	int ret = 0, rdwr = dmp->flags & VFIO_DMA_MAP_FLAG_WRITE;
> > +
> > +	if (vaddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (daddr & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +	if (size & (PAGE_SIZE-1))
> > +		return -EINVAL;
> > +
> > +	npage = size >> PAGE_SHIFT;
> > +	if (!npage)
> > +		return -EINVAL;
> > +
> > +	if (!iommu)
> > +		return -EINVAL;
> > +
> > +	mutex_lock(&iommu->dgate);
> > +
> > +	if (vfio_find_dma(iommu, daddr, size)) {
> > +		ret = -EBUSY;
> > +		goto out_lock;
> > +	}
> > +
> > +	/* account for locked pages */
> > +	locked = current->mm->locked_vm + npage;
> > +	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > +	if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> > +		printk(KERN_WARNING "%s: RLIMIT_MEMLOCK (%ld) exceeded\n",
> > +			__func__, rlimit(RLIMIT_MEMLOCK));
> > +		ret = -ENOMEM;
> > +		goto out_lock;
> > +	}
> > +
> > +	ret = vfio_dma_map(iommu, daddr, vaddr, npage, rdwr);
> > +	if (ret)
> > +		goto out_lock;
> > +
> > +	/* Check if we abut a region below */
> > +	if (daddr) {
> > +		mlp = vfio_find_dma(iommu, daddr - 1, 1);
> > +		if (mlp && mlp->rdwr == rdwr &&
> > +		    mlp->vaddr + NPAGE_TO_SIZE(mlp->npage) == vaddr) {
> > +
> > +			mlp->npage += npage;
> > +			daddr = mlp->daddr;
> > +			vaddr = mlp->vaddr;
> > +			npage = mlp->npage;
> > +			size = NPAGE_TO_SIZE(npage);
> > +
> > +			mmlp = mlp;
> > +		}
> > +	}
> > +
> > +	if (daddr + size) {
> > +		mlp = vfio_find_dma(iommu, daddr + size, 1);
> > +		if (mlp && mlp->rdwr == rdwr && mlp->vaddr == vaddr + size) {
> > +
> > +			mlp->npage += npage;
> > +			mlp->daddr = daddr;
> > +			mlp->vaddr = vaddr;
> > +
> > +			/* If merged above and below, remove previously
> > +			 * merged entry.  New entry covers it.  */
> > +			if (mmlp) {
> > +				list_del(&mmlp->list);
> > +				kfree(mmlp);
>