
[02/13] iommu: IOMMU Groups

Message ID 20120511225541.30496.34751.stgit@bling.home
State New

Commit Message

Alex Williamson May 11, 2012, 10:55 p.m. UTC
IOMMU device groups are currently a rather vague associative notion
with assembly required by the user or user level driver provider to
do anything useful.  This patch intends to grow the IOMMU group concept
into something a bit more consumable.

To do this, we first create an object representing the group, struct
iommu_group.  This structure is allocated (iommu_group_alloc) and
filled (iommu_group_add_device) by the iommu driver.  The iommu driver
is free to add devices to the group using its own set of policies.
This allows inclusion of devices based on physical hardware or topology
limitations of the platform, as well as soft requirements, such as
multi-function trust levels or peer-to-peer protection of the
interconnects.  Each device may only belong to a single iommu group,
which is linked from struct device.iommu_group.  IOMMU groups are
maintained using kobject reference counting, allowing for automatic
removal of empty, unreferenced groups.  It is the responsibility of
the iommu driver to remove devices from the group
(iommu_group_remove_device).
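
As a rough userspace model of that lifecycle (hypothetical names; this is not the kernel code -- struct grp stands in for struct iommu_group, and a plain counter stands in for the kobject refcount): allocation holds one reference, each added device holds one, and the group frees itself once empty and unreferenced.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace model, not the kernel implementation. */
struct grp_device {
	void *dev;
	struct grp_device *next;
};

struct grp {
	int refcount;               /* kobject refcount stand-in */
	struct grp_device *devices; /* group member list */
};

struct grp *grp_alloc(void)
{
	struct grp *g = calloc(1, sizeof(*g));
	if (g)
		g->refcount = 1;    /* allocation holds one reference */
	return g;
}

/* Adding a device takes a group reference, as iommu_group_add_device does. */
int grp_add_device(struct grp *g, void *dev)
{
	struct grp_device *d = malloc(sizeof(*d));
	if (!d)
		return -1;
	d->dev = dev;
	d->next = g->devices;
	g->devices = d;
	g->refcount++;
	return 0;
}

/* Dropping the last reference frees the group, like kobject_put. */
void grp_put(struct grp *g)
{
	if (--g->refcount == 0)
		free(g);
}

/* Removing a device drops its reference, so an empty, unreferenced
 * group goes away automatically. */
void grp_remove_device(struct grp *g, void *dev)
{
	struct grp_device **p = &g->devices;
	while (*p) {
		if ((*p)->dev == dev) {
			struct grp_device *d = *p;
			*p = d->next;
			free(d);
			grp_put(g);
			return;
		}
		p = &(*p)->next;
	}
}
```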

IOMMU groups also include a userspace representation in sysfs under
/sys/kernel/iommu_groups.  When allocated, each group is given a
dynamically assigned ID (int).  The ID is managed by the core IOMMU group
code to support multiple heterogeneous iommu drivers, which could
potentially collide in group naming/numbering.  This also keeps group
IDs to small, easily managed values.  A directory is created under
/sys/kernel/iommu_groups for each group.  A further subdirectory named
"devices" contains links to each device within the group.  The iommu_group
file in the device's sysfs directory, which formerly contained a group
number when read, is now a link to the iommu group.  Example:

$ ls -l /sys/kernel/iommu_groups/26/devices/
total 0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1

$ ls -l /sys/kernel/iommu_groups/26/devices/*/iommu_group
[truncating perms/owner/timestamp]
/sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
					../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
					../../../../kernel/iommu_groups/26
/sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
					../../../../kernel/iommu_groups/26

Groups also include several exported functions for use by user level
driver providers, for example VFIO.  These include:

iommu_group_get(): Acquires a reference to a group from a device
iommu_group_put(): Releases reference
iommu_group_for_each_dev(): Iterates over group devices using callback
iommu_group_[un]register_notifier(): Allows notification of device add
        and remove operations relevant to the group
iommu_group_id(): Returns the group number
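
The notifier side of this interface can be modeled in userspace as a simple callback chain (hypothetical names and event codes; the kernel uses a blocking_notifier_head on the group): consumers such as VFIO register a function and are called back on every device add or remove event.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of the group notifier chain. */
enum { EV_ADD_DEVICE = 1, EV_DEL_DEVICE = 2 };

#define MAX_NOTIFIERS 8

typedef void (*notifier_fn)(int event, void *dev, void *opaque);

struct notifier_chain {
	notifier_fn fn[MAX_NOTIFIERS];
	void *opaque[MAX_NOTIFIERS];
	int n;
};

/* Register a listener, in the spirit of iommu_group_register_notifier(). */
static int chain_register(struct notifier_chain *c, notifier_fn fn,
			  void *opaque)
{
	if (c->n >= MAX_NOTIFIERS)
		return -1;
	c->fn[c->n] = fn;
	c->opaque[c->n] = opaque;
	c->n++;
	return 0;
}

/* Fire an event to every registered listener, like
 * blocking_notifier_call_chain() on group->notifier. */
static void chain_call(struct notifier_chain *c, int event, void *dev)
{
	for (int i = 0; i < c->n; i++)
		c->fn[i](event, dev, c->opaque[i]);
}
```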

This patch also extends the IOMMU API to allow attaching groups to
domains.  This is currently a simple wrapper for iterating through
devices within a group, but it's expected that the IOMMU API may
eventually make groups a more integral part of domains.
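
The shape of that wrapper can be sketched in userspace (hypothetical names; the real code iterates via iommu_group_for_each_dev): attach each member device in turn and stop on the first failure.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of attaching a group to a domain. */
struct dev { int id; };

static int attach_dev(void *domain, struct dev *d)
{
	(void)domain;
	return d->id < 0 ? -1 : 0; /* pretend negative ids fail to attach */
}

/* Mirrors iommu_group_for_each_dev(): run a callback on every member,
 * bailing out on the first non-zero return. */
static int for_each_dev(struct dev *devs, size_t n, void *data,
			int (*fn)(void *data, struct dev *d))
{
	for (size_t i = 0; i < n; i++) {
		int ret = fn(data, &devs[i]);
		if (ret)
			return ret;
	}
	return 0;
}

static int attach_cb(void *domain, struct dev *d)
{
	return attach_dev(domain, d);
}

/* The group attach is a thin wrapper over the iterator. */
static int attach_group(void *domain, struct dev *devs, size_t n)
{
	return for_each_dev(devs, n, domain, attach_cb);
}
```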

Groups intentionally do not try to manage group ownership.  A user
level driver provider must independently acquire ownership for each
device within a group before making use of the group as a whole.
This may change in the future if group usage becomes more pervasive
across both DMA and IOMMU ops.

Groups intentionally do not provide a mechanism for driver locking
or otherwise manipulating driver matching/probing of devices within
the group.  Such interfaces are generic to devices and beyond the
scope of IOMMU groups.  If implemented, user level providers have
ready access via iommu_group_for_each_dev and group notifiers.

Groups currently provide no storage for iommu context, but some kind
of iommu_group_get/set_iommudata() interface is likely if groups
become more pervasive in the dma layers.

iommu_device_group() is removed here as it has no users.  The
replacement is:

	group = iommu_group_get(dev);
	id = iommu_group_id(group);
	iommu_group_put(group);

AMD-Vi & Intel VT-d support re-added in following patches.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/iommu/amd_iommu.c   |   21 --
 drivers/iommu/intel-iommu.c |   49 -----
 drivers/iommu/iommu.c       |  449 ++++++++++++++++++++++++++++++++++++++++---
 include/linux/iommu.h       |   84 ++++++++
 4 files changed, 499 insertions(+), 104 deletions(-)

Comments

David Gibson May 14, 2012, 1:16 a.m. UTC | #1
On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> IOMMU device groups are currently a rather vague associative notion
> with assembly required by the user or user level driver provider to
> do anything useful.  This patch intends to grow the IOMMU group concept
> into something a bit more consumable.
> 
> To do this, we first create an object representing the group, struct
> iommu_group.  This structure is allocated (iommu_group_alloc) and
> filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> is free to add devices to the group using it's own set of policies.
> This allows inclusion of devices based on physical hardware or topology
> limitations of the platform, as well as soft requirements, such as
> multi-function trust levels or peer-to-peer protection of the
> interconnects.  Each device may only belong to a single iommu group,
> which is linked from struct device.iommu_group.  IOMMU groups are
> maintained using kobject reference counting, allowing for automatic
> removal of empty, unreferenced groups.  It is the responsibility of
> the iommu driver to remove devices from the group
> (iommu_group_remove_device).
> 
> IOMMU groups also include a userspace representation in sysfs under
> /sys/kernel/iommu_groups.  When allocated, each group is given a
> dynamically assign ID (int).  The ID is managed by the core IOMMU group
> code to support multiple heterogeneous iommu drivers, which could
> potentially collide in group naming/numbering.  This also keeps group
> IDs to small, easily managed values.  A directory is created under
> /sys/kernel/iommu_groups for each group.  A further subdirectory named
> "devices" contains links to each device within the group.  The iommu_group
> file in the device's sysfs directory, which formerly contained a group
> number when read, is now a link to the iommu group.  Example:
> 
> $ ls -l /sys/kernel/iommu_groups/26/devices/
> total 0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> 
> $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> [truncating perms/owner/timestamp]
> /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> 					../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> 					../../../../kernel/iommu_groups/26
> 
> Groups also include several exported functions for use by user level
> driver providers, for example VFIO.  These include:
> 
> iommu_group_get(): Acquires a reference to a group from a device
> iommu_group_put(): Releases reference
> iommu_group_for_each_dev(): Iterates over group devices using callback
> iommu_group_[un]register_notifier(): Allows notification of device add
>         and remove operations relevant to the group
> iommu_group_id(): Return the group number
> 
> This patch also extends the IOMMU API to allow attaching groups to
> domains.  This is currently a simple wrapper for iterating through
> devices within a group, but it's expected that the IOMMU API may
> eventually make groups a more integral part of domains.
> 
> Groups intentionally do not try to manage group ownership.  A user
> level driver provider must independently acquire ownership for each
> device within a group before making use of the group as a whole.
> This may change in the future if group usage becomes more pervasive
> across both DMA and IOMMU ops.
> 
> Groups intentionally do not provide a mechanism for driver locking
> or otherwise manipulating driver matching/probing of devices within
> the group.  Such interfaces are generic to devices and beyond the
> scope of IOMMU groups.  If implemented, user level providers have
> ready access via iommu_group_for_each_dev and group notifiers.
> 
> Groups currently provide no storage for iommu context, but some kind
> of iommu_group_get/set_iommudata() interface is likely if groups
> become more pervasive in the dma layers.
> 
> iommu_device_group() is removed here as it has no users.  The
> replacement is:
> 
> 	group = iommu_group_get(dev);
> 	id = iommu_group_id(group);
> 	iommu_group_put(group);

Looks reasonable to me, with a few nits, noted below.

[snip]
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 2198b2d..f75004e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -26,60 +26,404 @@
>  #include <linux/slab.h>
>  #include <linux/errno.h>
>  #include <linux/iommu.h>
> +#include <linux/idr.h>
> +#include <linux/notifier.h>
> +
> +static struct kset *iommu_group_kset;
> +static struct ida iommu_group_ida;
> +static struct mutex iommu_group_mutex;
> +
> +struct iommu_group {
> +	struct kobject kobj;
> +	struct kobject *devices_kobj;
> +	struct list_head devices;
> +	struct mutex mutex;
> +	struct blocking_notifier_head notifier;
> +	int id;

I think you should add some sort of name string to the group as well
(supplied by the iommu driver creating the group).  That would make it
easier to connect the arbitrary assigned IDs to any platform-standard
naming convention for these things.

[snip]
> +/**
> + * iommu_group_add_device - add a device to an iommu group
> + * @group: the group into which to add the device (reference should be held)
> + * @dev: the device
> + *
> + * This function is called by an iommu driver to add a device into a
> + * group.  Adding a device increments the group reference count.
> + */
> +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> +{
> +	int ret;
> +	struct iommu_device *device;
> +
> +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> +	if (!device)
> +		return -ENOMEM;
> +
> +	device->dev = dev;
> +
> +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> +	if (ret) {
> +		kfree(device);
> +		return ret;
> +	}
> +
> +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> +				kobject_name(&dev->kobj));
> +	if (ret) {
> +		sysfs_remove_link(&dev->kobj, "iommu_group");
> +		kfree(device);
> +		return ret;
> +	}

So, it's not clear that the kobject_name() here has to be unique
across all devices in the group.  It might be better to use an
arbitrary index here instead of a name to avoid that problem.

[snip]
> +/**
> + * iommu_group_remove_device - remove a device from it's current group
> + * @dev: device to be removed
> + *
> + * This function is called by an iommu driver to remove the device from
> + * it's current group.  This decrements the iommu group reference count.
> + */
> +void iommu_group_remove_device(struct device *dev)
> +{
> +	struct iommu_group *group = dev->iommu_group;
> +	struct iommu_device *device;
> +
> +	/* Pre-notify listeners that a device is being removed. */
> +	blocking_notifier_call_chain(&group->notifier,
> +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> +
> +	mutex_lock(&group->mutex);
> +	list_for_each_entry(device, &group->devices, list) {
> +		if (device->dev == dev) {
> +			list_del(&device->list);
> +			kfree(device);
> +			break;
> +		}
> +	}
> +	mutex_unlock(&group->mutex);
> +
> +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> +	sysfs_remove_link(&dev->kobj, "iommu_group");
> +
> +	dev->iommu_group = NULL;

I suspect the dev -> group pointer should be cleared first, under the
group lock, but I'm not certain about that.

[snip]
> +/**
> + * iommu_group_for_each_dev - iterate over each device in the group
> + * @group: the group
> + * @data: caller opaque data to be passed to callback function
> + * @fn: caller supplied callback function
> + *
> + * This function is called by group users to iterate over group devices.
> + * Callers should hold a reference count to the group during
> callback.

Probably also worth noting in this doco that the group lock will be
held across the callback.

[snip]
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +			      unsigned long action, void *data)
>  {
>  	struct device *dev = data;
> +	struct iommu_ops *ops = dev->bus->iommu_ops;
> +	struct iommu_group *group;
> +	unsigned long group_action = 0;
> +
> +	/*
> +	 * ADD/DEL call into iommu driver ops if provided, which may
> +	 * result in ADD/DEL notifiers to group->notifier
> +	 */
> +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> +		if (ops->add_device)
> +			return ops->add_device(dev);
> +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> +		if (ops->remove_device && dev->iommu_group) {
> +			ops->remove_device(dev);
> +			return 0;
> +		}
> +	}

So, there's still the question of how to assign grouping for devices
on a subordinate bus behind a bridge which is iommu managed.  The
iommu driver for the top-level bus can't know about all possible
subordinate bus types, but the subordinate devices will (in most
cases, anyway) be iommu translated as if originating with the bus
bridge.
Alex Williamson May 14, 2012, 5:11 p.m. UTC | #2
On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > IOMMU device groups are currently a rather vague associative notion
> > with assembly required by the user or user level driver provider to
> > do anything useful.  This patch intends to grow the IOMMU group concept
> > into something a bit more consumable.
> > 
> > To do this, we first create an object representing the group, struct
> > iommu_group.  This structure is allocated (iommu_group_alloc) and
> > filled (iommu_group_add_device) by the iommu driver.  The iommu driver
> > is free to add devices to the group using it's own set of policies.
> > This allows inclusion of devices based on physical hardware or topology
> > limitations of the platform, as well as soft requirements, such as
> > multi-function trust levels or peer-to-peer protection of the
> > interconnects.  Each device may only belong to a single iommu group,
> > which is linked from struct device.iommu_group.  IOMMU groups are
> > maintained using kobject reference counting, allowing for automatic
> > removal of empty, unreferenced groups.  It is the responsibility of
> > the iommu driver to remove devices from the group
> > (iommu_group_remove_device).
> > 
> > IOMMU groups also include a userspace representation in sysfs under
> > /sys/kernel/iommu_groups.  When allocated, each group is given a
> > dynamically assign ID (int).  The ID is managed by the core IOMMU group
> > code to support multiple heterogeneous iommu drivers, which could
> > potentially collide in group naming/numbering.  This also keeps group
> > IDs to small, easily managed values.  A directory is created under
> > /sys/kernel/iommu_groups for each group.  A further subdirectory named
> > "devices" contains links to each device within the group.  The iommu_group
> > file in the device's sysfs directory, which formerly contained a group
> > number when read, is now a link to the iommu group.  Example:
> > 
> > $ ls -l /sys/kernel/iommu_groups/26/devices/
> > total 0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:00:1e.0 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.0 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
> > lrwxrwxrwx. 1 root root 0 Apr 17 12:57 0000:06:0d.1 ->
> > 		../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
> > 
> > $ ls -l  /sys/kernel/iommu_groups/26/devices/*/iommu_group
> > [truncating perms/owner/timestamp]
> > /sys/kernel/iommu_groups/26/devices/0000:00:1e.0/iommu_group ->
> > 					../../../kernel/iommu_groups/26
> > /sys/kernel/iommu_groups/26/devices/0000:06:0d.0/iommu_group ->
> > 					../../../../kernel/iommu_groups/26
> > /sys/kernel/iommu_groups/26/devices/0000:06:0d.1/iommu_group ->
> > 					../../../../kernel/iommu_groups/26
> > 
> > Groups also include several exported functions for use by user level
> > driver providers, for example VFIO.  These include:
> > 
> > iommu_group_get(): Acquires a reference to a group from a device
> > iommu_group_put(): Releases reference
> > iommu_group_for_each_dev(): Iterates over group devices using callback
> > iommu_group_[un]register_notifier(): Allows notification of device add
> >         and remove operations relevant to the group
> > iommu_group_id(): Return the group number
> > 
> > This patch also extends the IOMMU API to allow attaching groups to
> > domains.  This is currently a simple wrapper for iterating through
> > devices within a group, but it's expected that the IOMMU API may
> > eventually make groups a more integral part of domains.
> > 
> > Groups intentionally do not try to manage group ownership.  A user
> > level driver provider must independently acquire ownership for each
> > device within a group before making use of the group as a whole.
> > This may change in the future if group usage becomes more pervasive
> > across both DMA and IOMMU ops.
> > 
> > Groups intentionally do not provide a mechanism for driver locking
> > or otherwise manipulating driver matching/probing of devices within
> > the group.  Such interfaces are generic to devices and beyond the
> > scope of IOMMU groups.  If implemented, user level providers have
> > ready access via iommu_group_for_each_dev and group notifiers.
> > 
> > Groups currently provide no storage for iommu context, but some kind
> > of iommu_group_get/set_iommudata() interface is likely if groups
> > become more pervasive in the dma layers.
> > 
> > iommu_device_group() is removed here as it has no users.  The
> > replacement is:
> > 
> > 	group = iommu_group_get(dev);
> > 	id = iommu_group_id(group);
> > 	iommu_group_put(group);
> 
> Looks reasonable to me, with a few nits, noted below.
> 
> [snip]
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 2198b2d..f75004e 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -26,60 +26,404 @@
> >  #include <linux/slab.h>
> >  #include <linux/errno.h>
> >  #include <linux/iommu.h>
> > +#include <linux/idr.h>
> > +#include <linux/notifier.h>
> > +
> > +static struct kset *iommu_group_kset;
> > +static struct ida iommu_group_ida;
> > +static struct mutex iommu_group_mutex;
> > +
> > +struct iommu_group {
> > +	struct kobject kobj;
> > +	struct kobject *devices_kobj;
> > +	struct list_head devices;
> > +	struct mutex mutex;
> > +	struct blocking_notifier_head notifier;
> > +	int id;
> 
> I think you should add some sort of name string to the group as well
> (supplied by the iommu driver creating the group).  That would make it
> easier to connect the arbitrary assigned IDs to any platform-standard
> naming convention for these things.

When would the name be used and how is it exposed?

> [snip]
> > +/**
> > + * iommu_group_add_device - add a device to an iommu group
> > + * @group: the group into which to add the device (reference should be held)
> > + * @dev: the device
> > + *
> > + * This function is called by an iommu driver to add a device into a
> > + * group.  Adding a device increments the group reference count.
> > + */
> > +int iommu_group_add_device(struct iommu_group *group, struct device *dev)
> > +{
> > +	int ret;
> > +	struct iommu_device *device;
> > +
> > +	device = kzalloc(sizeof(*device), GFP_KERNEL);
> > +	if (!device)
> > +		return -ENOMEM;
> > +
> > +	device->dev = dev;
> > +
> > +	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
> > +	if (ret) {
> > +		kfree(device);
> > +		return ret;
> > +	}
> > +
> > +	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
> > +				kobject_name(&dev->kobj));
> > +	if (ret) {
> > +		sysfs_remove_link(&dev->kobj, "iommu_group");
> > +		kfree(device);
> > +		return ret;
> > +	}
> 
> So, it's not clear that the kobject_name() here has to be unique
> across all devices in the group.  It might be better to use an
> arbitrary index here instead of a name to avoid that problem.

Hmm, that loses useful convenience when they are unique, such as on PCI.
I'll look and see if sysfs_create_link will fail on duplicate names and
see about adding some kind of instance to it.
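
That suffixing strategy can be sketched in userspace C (hypothetical names; name_taken() stands in for sysfs_create_link_nowarn() failing with -EEXIST): try the plain device name first, and on a duplicate append a ".N" index until the link name is unique.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical userspace sketch of the collision handling described
 * above, not the kernel code. */
static int name_taken(const char *taken[], int ntaken, const char *name)
{
	for (int i = 0; i < ntaken; i++)
		if (strcmp(taken[i], name) == 0)
			return 1;
	return 0;
}

/* Writes the chosen unique link name into buf (of size len). */
static void pick_link_name(const char *taken[], int ntaken,
			   const char *want, char *buf, size_t len)
{
	snprintf(buf, len, "%s", want);
	for (int i = 1; name_taken(taken, ntaken, buf); i++)
		snprintf(buf, len, "%s.%d", want, i);
}
```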

> [snip]
> > +/**
> > + * iommu_group_remove_device - remove a device from it's current group
> > + * @dev: device to be removed
> > + *
> > + * This function is called by an iommu driver to remove the device from
> > + * it's current group.  This decrements the iommu group reference count.
> > + */
> > +void iommu_group_remove_device(struct device *dev)
> > +{
> > +	struct iommu_group *group = dev->iommu_group;
> > +	struct iommu_device *device;
> > +
> > +	/* Pre-notify listeners that a device is being removed. */
> > +	blocking_notifier_call_chain(&group->notifier,
> > +				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
> > +
> > +	mutex_lock(&group->mutex);
> > +	list_for_each_entry(device, &group->devices, list) {
> > +		if (device->dev == dev) {
> > +			list_del(&device->list);
> > +			kfree(device);
> > +			break;
> > +		}
> > +	}
> > +	mutex_unlock(&group->mutex);
> > +
> > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > +
> > +	dev->iommu_group = NULL;
> 
> I suspect the dev -> group pointer should be cleared first, under the
> group lock, but I'm not certain about that.

group->mutex is protecting the group's device list.  I think my
assumption is that when a device is being removed, there should be no
references to it for anyone to race with iommu_group_get(dev), but I'm
not sure how valid that is.

> [snip]
> > +/**
> > + * iommu_group_for_each_dev - iterate over each device in the group
> > + * @group: the group
> > + * @data: caller opaque data to be passed to callback function
> > + * @fn: caller supplied callback function
> > + *
> > + * This function is called by group users to iterate over group devices.
> > + * Callers should hold a reference count to the group during
> > callback.
> 
> Probably also worth noting in this doco that the group lock will be
> held across the callback.

Yes; calling iommu_group_remove_device through this would be a bad idea.

> [snip]
> > +static int iommu_bus_notifier(struct notifier_block *nb,
> > +			      unsigned long action, void *data)
> >  {
> >  	struct device *dev = data;
> > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > +	struct iommu_group *group;
> > +	unsigned long group_action = 0;
> > +
> > +	/*
> > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > +	 * result in ADD/DEL notifiers to group->notifier
> > +	 */
> > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > +		if (ops->add_device)
> > +			return ops->add_device(dev);
> > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > +		if (ops->remove_device && dev->iommu_group) {
> > +			ops->remove_device(dev);
> > +			return 0;
> > +		}
> > +	}
> 
> So, there's still the question of how to assign grouping for devices
> on a subordinate bus behind a bridge which is iommu managed.  The
> iommu driver for the top-level bus can't know about all possible
> subordinate bus types, but the subordinate devices will (in most
> cases, anyway) be iommu translated as if originating with the bus
> bridge.

Not just any bridge, there has to be a different bus_type on the
subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
would trigger this.  In general, I don't know how to handle it and I'm
open to suggestions.  Perhaps we need to register notifiers for every
bus_type and create notifiers for new bus_types.  If we can catch the
ADD_DEVICE for it, we can walk up to the first parent bus that supports
an iommu_ops and add the device there.  But then we need some kind of
foreign bus support for the group since the iommu driver won't know what
to do with the device if we call add_device with it.  Other ideas?
Thanks,

Alex
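
The parent walk suggested above can be modeled in userspace (all names here are stand-ins; struct fake_dev and struct fake_bus are not kernel types): climb the parent chain from the new device until some ancestor's bus has iommu_ops, and inherit that ancestor's group.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model of walking up to the first
 * iommu-capable parent bus. */
struct fake_bus { const void *iommu_ops; };

struct fake_dev {
	struct fake_dev *parent;
	struct fake_bus *bus;
	int group_id;            /* -1 = no group */
};

/* Returns the group of the first iommu-capable ancestor, or -1 if the
 * device sits under no iommu-managed bus. */
static int inherit_group(struct fake_dev *dev)
{
	for (struct fake_dev *p = dev->parent; p; p = p->parent)
		if (p->bus && p->bus->iommu_ops)
			return p->group_id;
	return -1;
}
```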
David Gibson May 15, 2012, 2:03 a.m. UTC | #3
On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
[snip]
> > > +struct iommu_group {
> > > +	struct kobject kobj;
> > > +	struct kobject *devices_kobj;
> > > +	struct list_head devices;
> > > +	struct mutex mutex;
> > > +	struct blocking_notifier_head notifier;
> > > +	int id;
> > 
> > I think you should add some sort of name string to the group as well
> > (supplied by the iommu driver creating the group).  That would make it
> > easier to connect the arbitrary assigned IDs to any platform-standard
> > naming convention for these things.
> 
> When would the name be used and how is it exposed?

I'm thinking of this basically as a debugging aid.  So I'd expect it
to appear in a 'name' (or 'description') sysfs property on the group,
and in printk messages regarding the group.

[snip]
> > So, it's not clear that the kobject_name() here has to be unique
> > across all devices in the group.  It might be better to use an
> > arbitrary index here instead of a name to avoid that problem.
> 
> Hmm, that loses useful convenience when they are unique, such as on PCI.
> I'll look and see if sysfs_create_link will fail on duplicate names and
> see about adding some kind of instance to it.

Ok.  Is the name necessarily unique even for PCI, if the group crosses
multiple domains?

[snip]
> > > +	mutex_lock(&group->mutex);
> > > +	list_for_each_entry(device, &group->devices, list) {
> > > +		if (device->dev == dev) {
> > > +			list_del(&device->list);
> > > +			kfree(device);
> > > +			break;
> > > +		}
> > > +	}
> > > +	mutex_unlock(&group->mutex);
> > > +
> > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > +
> > > +	dev->iommu_group = NULL;
> > 
> > I suspect the dev -> group pointer should be cleared first, under the
> > group lock, but I'm not certain about that.
> 
> group->mutex is protecting the group's device list.  I think my
> assumption is that when a device is being removed, there should be no
> references to it for anyone to race with iommu_group_get(dev), but I'm
> not sure how valid that is.

What I'm concerned about here is someone grabbing the device by
non-group-related means, grabbing a pointer to its group and that
racing with remove_device().  It would then end up with a group
pointer it thinks is right for the device, when the group no longer
thinks it owns the device.

Doing it under the lock is so that on the other side, group aware code
doesn't traverse the group member list and grab a reference to a
device which no longer points back to the group.

> > [snip]
> > > +/**
> > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > + * @group: the group
> > > + * @data: caller opaque data to be passed to callback function
> > > + * @fn: caller supplied callback function
> > > + *
> > > + * This function is called by group users to iterate over group devices.
> > > + * Callers should hold a reference count to the group during
> > > callback.
> > 
> > Probably also worth noting in this doco that the group lock will be
> > held across the callback.
> 
> Yes; calling iommu_group_remove_device through this would be a bad idea.

Or anything which blocks.

> > [snip]
> > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > +			      unsigned long action, void *data)
> > >  {
> > >  	struct device *dev = data;
> > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > +	struct iommu_group *group;
> > > +	unsigned long group_action = 0;
> > > +
> > > +	/*
> > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > +	 * result in ADD/DEL notifiers to group->notifier
> > > +	 */
> > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > +		if (ops->add_device)
> > > +			return ops->add_device(dev);
> > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > +		if (ops->remove_device && dev->iommu_group) {
> > > +			ops->remove_device(dev);
> > > +			return 0;
> > > +		}
> > > +	}
> > 
> > So, there's still the question of how to assign grouping for devices
> > on a subordinate bus behind a bridge which is iommu managed.  The
> > iommu driver for the top-level bus can't know about all possible
> > subordinate bus types, but the subordinate devices will (in most
> > cases, anyway) be iommu translated as if originating with the bus
> > bridge.
> 
> Not just any bridge, there has to be a different bus_type on the
> subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> would trigger this.

Right, although ISA-under-PCI is a bit of a special case anyway.  I
think PCI to Firewire/IEEE1394 would also have this issue, as would
SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
virtual struct devices where the PCI driver is structured as a wrapper
around a "vanilla" device driver, a pattern used in a number of
drivers for chips with both PCI and non PCI variants.

>  In general, I don't know how to handle it and I'm
> open to suggestions.  Perhaps we need to register notifiers for every
> bus_type and create notifiers for new bus_types.

I think we do need to do something like that and basically ensure that
if a new device is not explicitly claimed by an IOMMU driver, it
inherits its group from its parent bridge.

>  If we can catch the
> ADD_DEVICE for it, we can walk up to the first parent bus that supports
> an iommu_ops and add the device there.  But then we need some kind of
> foreign bus support for the group since the iommu driver won't know what
> to do with the device if we call add_device with it.  Other ideas?

Hrm, yeah.  We may need a distinction between primary group members
(i.e. explicitly claimed by the IOMMU driver) and subordinate group
members (devices which are descendents of a primary group member).
Alex Williamson May 15, 2012, 6:34 a.m. UTC | #4
On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> [snip]
> > > > +struct iommu_group {
> > > > +	struct kobject kobj;
> > > > +	struct kobject *devices_kobj;
> > > > +	struct list_head devices;
> > > > +	struct mutex mutex;
> > > > +	struct blocking_notifier_head notifier;
> > > > +	int id;
> > > 
> > > I think you should add some sort of name string to the group as well
> > > (supplied by the iommu driver creating the group).  That would make it
> > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > naming convention for these things.
> > 
> > When would the name be used and how is it exposed?
> 
> I'm thinking of this basically as a debugging aid.  So I'd expect it
> to appear in a 'name' (or 'description') sysfs property on the group,
> and in printk messages regarding the group.

Ok, so long as it's only descriptive/debugging I don't have a problem
adding something like that.

> [snip]
> > > So, it's not clear that the kobject_name() here has to be unique
> > > across all devices in the group.  It might be better to use an
> > > arbitrary index here instead of a name to avoid that problem.
> > 
> > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > I'll look and see if sysfs_create_link will fail on duplicate names and
> > see about adding some kind of instance to it.
> 
> Ok.  Is the name necessarily unique even for PCI, if the group crosses
> multiple domains?

Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
just use sysfs_create_link_nowarn and add a .# index when we have a name
collision.

> [snip]
> > > > +	mutex_lock(&group->mutex);
> > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > +		if (device->dev == dev) {
> > > > +			list_del(&device->list);
> > > > +			kfree(device);
> > > > +			break;
> > > > +		}
> > > > +	}
> > > > +	mutex_unlock(&group->mutex);
> > > > +
> > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > +
> > > > +	dev->iommu_group = NULL;
> > > 
> > > I suspect the dev -> group pointer should be cleared first, under the
> > > group lock, but I'm not certain about that.
> > 
> > group->mutex is protecting the group's device list.  I think my
> > assumption is that when a device is being removed, there should be no
> > references to it for anyone to race with iommu_group_get(dev), but I'm
> > not sure how valid that is.
> 
> What I'm concerned about here is someone grabbing the device by
> non-group-related means, grabbing a pointer to its group and that
> racing with remove_device().  It would then end up with a group
> pointer it thinks is right for the device, when the group no longer
> thinks it owns the device.
> 
> Doing it under the lock is so that on the other side, group aware code
> doesn't traverse the group member list and grab a reference to a
> device which no longer points back to the group.

Our for_each function does grab the lock, as you noticed below, so
removing it from the list under lock prevents that path.  Where it gets
fuzzy is if someone can call iommu_group_get(dev) to get a group
reference in this gap.  Whether we clear the iommu_group pointer under
lock or not doesn't matter for that path since it doesn't retrieve it
under lock.  The assumption there is that the caller is going to have a
reference to the device and therefore the device is not being removed.
The asynchronous locking and reference counting are by far the hardest
parts of iommu_groups and the vfio core, so I appreciate any hard
analysis of that.

> > > [snip]
> > > > +/**
> > > > + * iommu_group_for_each_dev - iterate over each device in the group
> > > > + * @group: the group
> > > > + * @data: caller opaque data to be passed to callback function
> > > > + * @fn: caller supplied callback function
> > > > + *
> > > > + * This function is called by group users to iterate over group devices.
> > > > + * Callers should hold a reference count to the group during
> > > > callback.
> > > 
> > > Probably also worth noting in this doco that the group lock will be
> > > held across the callback.
> > 
> > Yes; calling iommu_group_remove_device through this would be a bad idea.
> 
> Or anything which blocks.
> 
> > > [snip]
> > > > +static int iommu_bus_notifier(struct notifier_block *nb,
> > > > +			      unsigned long action, void *data)
> > > >  {
> > > >  	struct device *dev = data;
> > > > +	struct iommu_ops *ops = dev->bus->iommu_ops;
> > > > +	struct iommu_group *group;
> > > > +	unsigned long group_action = 0;
> > > > +
> > > > +	/*
> > > > +	 * ADD/DEL call into iommu driver ops if provided, which may
> > > > +	 * result in ADD/DEL notifiers to group->notifier
> > > > +	 */
> > > > +	if (action == BUS_NOTIFY_ADD_DEVICE) {
> > > > +		if (ops->add_device)
> > > > +			return ops->add_device(dev);
> > > > +	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
> > > > +		if (ops->remove_device && dev->iommu_group) {
> > > > +			ops->remove_device(dev);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > 
> > > So, there's still the question of how to assign grouping for devices
> > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > iommu driver for the top-level bus can't know about all possible
> > > subordinate bus types, but the subordinate devices will (in most
> > > cases, anyway) be iommu translated as if originating with the bus
> > > bridge.
> > 
> > Not just any bridge, there has to be a different bus_type on the
> > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > would trigger this.
> 
> Right, although ISA-under-PCI is a bit of a special case anyway.  I
> think PCI to Firewire/IEEE1394 would also have this issue, as would
> SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> virtual struct devices where the PCI driver is structured as a wrapper
> around a "vanilla" device driver, a pattern used in a number of
> drivers for chips with both PCI and non PCI variants.

Sorry, I jumped into reliving this issue without remembering how I
decided to rationalize it for IOMMU groups.  Let's step through it.
Given DeviceA that's a member of GroupA and potentially sources a
subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
From a VFIO perspective, GroupA isn't usable so long as DeviceA is
claimed by a non-VFIO driver.  That same non-VFIO driver is the one
causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
goes away and we can freely give GroupA to userspace.  I believe this is
always true; there are no subordinate buses to devices that meet the
"viable" driver requirements of VFIO.  I don't see any problems with the
fact that userspace can then re-source A_bus_type and find DeviceA'.
That's what should happen.  If we want to assign just DeviceA' to
userspace, well, it has no IOMMU group of its own, so clearly it's not
assignable on its own.

For the more general IOMMU group case, I'm still struggling to figure
out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
don't think it's unreasonable that map_page would discover there's no
iommu_ops on dev->bus (A_bus_type) and step up to dev->bus->self to find
both iommu_group on DeviceA and iommu_ops on DeviceA->bus.  Is there a
practical reason why DeviceA' would need to be listed as a member of
GroupA, or is it just an optimization?  I know we had a number of
discussions about these types of devices for isolation groups, but I
think that trying to strictly represent these types of devices was also
one of the downfalls of the isolation proposal.

This did make me think of one other generic quirk we might need.
There's some funkiness with USB that makes me think that it's
effectively a shared bus between 1.x and 2.0 controllers.  So assigning
the 1.x and 2.0 controllers to different groups potentially allows a
fast and a slow path to the same set of devices.  Is this true?  If so,
we probably need to quirk OHCI/UHCI and EHCI functions together when
they're on the same PCI device.  I think the PCI class code is
sufficient to do this.  Thanks,

Alex
David Gibson May 17, 2012, 3:29 a.m. UTC | #5
On Tue, May 15, 2012 at 12:34:03AM -0600, Alex Williamson wrote:
> On Tue, 2012-05-15 at 12:03 +1000, David Gibson wrote:
> > On Mon, May 14, 2012 at 11:11:42AM -0600, Alex Williamson wrote:
> > > On Mon, 2012-05-14 at 11:16 +1000, David Gibson wrote:
> > > > On Fri, May 11, 2012 at 04:55:41PM -0600, Alex Williamson wrote:
> > [snip]
> > > > > +struct iommu_group {
> > > > > +	struct kobject kobj;
> > > > > +	struct kobject *devices_kobj;
> > > > > +	struct list_head devices;
> > > > > +	struct mutex mutex;
> > > > > +	struct blocking_notifier_head notifier;
> > > > > +	int id;
> > > > 
> > > > I think you should add some sort of name string to the group as well
> > > > (supplied by the iommu driver creating the group).  That would make it
> > > > easier to connect the arbitrary assigned IDs to any platform-standard
> > > > naming convention for these things.
> > > 
> > > When would the name be used and how is it exposed?
> > 
> > I'm thinking of this basically as a debugging aid.  So I'd expect it
> > to appear in a 'name' (or 'description') sysfs property on the group,
> > and in printk messages regarding the group.
> 
> Ok, so long as it's only descriptive/debugging I don't have a problem
> adding something like that.
> 
> > [snip]
> > > > So, it's not clear that the kobject_name() here has to be unique
> > > > across all devices in the group.  It might be better to use an
> > > > arbitrary index here instead of a name to avoid that problem.
> > > 
> > > Hmm, that loses useful convenience when they are unique, such as on PCI.
> > > I'll look and see if sysfs_create_link will fail on duplicate names and
> > > see about adding some kind of instance to it.
> > 
> > Ok.  Is the name necessarily unique even for PCI, if the group crosses
> > multiple domains?
> 
> Yes, it includes the domain in the dddd:bb:dd.f form.  I've found I can
> just use sysfs_create_link_nowarn and add a .# index when we have a name
> collision.

Ok, that sounds good.

> > [snip]
> > > > > +	mutex_lock(&group->mutex);
> > > > > +	list_for_each_entry(device, &group->devices, list) {
> > > > > +		if (device->dev == dev) {
> > > > > +			list_del(&device->list);
> > > > > +			kfree(device);
> > > > > +			break;
> > > > > +		}
> > > > > +	}
> > > > > +	mutex_unlock(&group->mutex);
> > > > > +
> > > > > +	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
> > > > > +	sysfs_remove_link(&dev->kobj, "iommu_group");
> > > > > +
> > > > > +	dev->iommu_group = NULL;
> > > > 
> > > > I suspect the dev -> group pointer should be cleared first, under the
> > > > group lock, but I'm not certain about that.
> > > 
> > > group->mutex is protecting the group's device list.  I think my
> > > assumption is that when a device is being removed, there should be no
> > > references to it for anyone to race with iommu_group_get(dev), but I'm
> > > not sure how valid that is.
> > 
> > What I'm concerned about here is someone grabbing the device by
> > non-group-related means, grabbing a pointer to its group and that
> > racing with remove_device().  It would then end up with a group
> > pointer it thinks is right for the device, when the group no longer
> > thinks it owns the device.
> > 
> > Doing it under the lock is so that on the other side, group aware code
> > doesn't traverse the group member list and grab a reference to a
> > device which no longer points back to the group.
> 
> Our for_each function does grab the lock, as you noticed below, so
> removing it from the list under lock prevents that path.  Where it gets
> fuzzy is if someone can call iommu_group_get(dev) to get a group
> reference in this gap.

Right, that's what I'm concerned about.

>  Whether we clear the iommu_group pointer under
> lock or not doesn't matter for that path since it doesn't retrieve it
> under lock.  The assumption there is that the caller is going to have a
> reference to the device and therefore the device is not being
> removed.

Hrm.  I guess that works, assuming that removing a device from the
system is the only thing that will cause a device to be removed from
the group.  How confident are we of that?

> The asynchronous locking and reference counting are by far the hardest
> parts of iommu_groups and the vfio core, so I appreciate any hard
> analysis of that.

Well, I've given my recommendation on this small aspect.

[snip]
> > > > So, there's still the question of how to assign grouping for devices
> > > > on a subordinate bus behind a bridge which is iommu managed.  The
> > > > iommu driver for the top-level bus can't know about all possible
> > > > subordinate bus types, but the subordinate devices will (in most
> > > > cases, anyway) be iommu translated as if originating with the bus
> > > > bridge.
> > > 
> > > Not just any bridge, there has to be a different bus_type on the
> > > subordinate side.  Things like PCI-to-PCI work as is, but a PCI-to-ISA
> > > would trigger this.
> > 
> > Right, although ISA-under-PCI is a bit of a special case anyway.  I
> > think PCI to Firewire/IEEE1394 would also have this issue, as would
> > SoC-bus-to-PCI for a SoC which had an IOMMU at the SoC bus level.  And
> > virtual struct devices where the PCI driver is structured as a wrapper
> > around a "vanilla" device driver, a pattern used in a number of
> > drivers for chips with both PCI and non PCI variants.
> 
> Sorry, I jumped into reliving this issue without remembering how I
> decided to rationalize it for IOMMU groups.  Let's step through it.
> Given DeviceA that's a member of GroupA and potentially sources a
> subordinate bus (A_bus_type) exposing DeviceA', what are the issues?
> From a VFIO perspective, GroupA isn't usable so long as DeviceA is
> claimed by a non-VFIO driver.  That same non-VFIO driver is the one
> causing DeviceA to source A_bus_type, so remove the driver and DeviceA'
> goes away and we can freely give GroupA to userspace.  I believe this is
> always true; there are no subordinate buses to devices that meet the
> "viable" driver requirements of VFIO.  I don't see any problems with the
> fact that userspace can then re-source A_bus_type and find DeviceA'.
> That's what should happen.  If we want to assign just DeviceA' to
> userspace, well, it has no IOMMU group of its own, so clearly it's not
> assignable on its own.

Ah, right, so the assumption is that a bridge will only expose its
subordinate bus when it has an active kernel driver.  Yeah, I think
that works at least for the current use cases.

> For the more general IOMMU group case, I'm still struggling to figure
> out why this is an issue.  If we were to do dma_ops via IOMMU groups, I
> don't think it's unreasonable that map_page would discover there's no
> iommu_ops on dev->bus (A_bus_type) and step up to dev->bus->self to find
> both iommu_group on DeviceA and iommu_ops on DeviceA->bus.  Is there a
> practical reason why DeviceA' would need to be listed as a member of
> GroupA, or is it just an optimization?  I know we had a number of
> discussions about these types of devices for isolation groups, but I
> think that trying to strictly represent these types of devices was also
> one of the downfalls of the isolation proposal.

Yeah, walking up the tree to find a group is certainly a possibility
as well.  Ok, so I'm happy enough only to track primary group members
for now.  I think we should bear in mind the possibility we might need
to track secondary members in future though (either as an optimization
or as something we need for a use case we haven't pinned down yet).

> This did make me think of one other generic quirk we might need.
> There's some funkiness with USB that makes me think that it's
> effectively a shared bus between 1.x and 2.0 controllers.  So assigning
> the 1.x and 2.0 controllers to different groups potentially allows a
> fast and a slow path to the same set of devices.  Is this true?  If so,
> we probably need to quirk OHCI/UHCI and EHCI functions together when
> they're on the same PCI device.  I think the PCI class code is
> sufficient to do this.  Thanks,

Ah, maybe.  I don't know enough about USB conventions to say for sure.
Patch

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index a5bee8e..32c00cd 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -3193,26 +3193,6 @@  static int amd_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
-static int amd_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
-	struct iommu_dev_data *dev_data = dev->archdata.iommu;
-	struct pci_dev *pdev = to_pci_dev(dev);
-	u16 devid;
-
-	if (!dev_data)
-		return -ENODEV;
-
-	if (pdev->is_virtfn || !iommu_group_mf)
-		devid = dev_data->devid;
-	else
-		devid = calc_devid(pdev->bus->number,
-				   PCI_DEVFN(PCI_SLOT(pdev->devfn), 0));
-
-	*groupid = amd_iommu_alias_table[devid];
-
-	return 0;
-}
-
 static struct iommu_ops amd_iommu_ops = {
 	.domain_init = amd_iommu_domain_init,
 	.domain_destroy = amd_iommu_domain_destroy,
@@ -3222,7 +3202,6 @@  static struct iommu_ops amd_iommu_ops = {
 	.unmap = amd_iommu_unmap,
 	.iova_to_phys = amd_iommu_iova_to_phys,
 	.domain_has_cap = amd_iommu_domain_has_cap,
-	.device_group = amd_iommu_device_group,
 	.pgsize_bitmap	= AMD_IOMMU_PGSIZES,
 };
 
diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
index f93d5ac..d4a0ff7 100644
--- a/drivers/iommu/intel-iommu.c
+++ b/drivers/iommu/intel-iommu.c
@@ -4087,54 +4087,6 @@  static int intel_iommu_domain_has_cap(struct iommu_domain *domain,
 	return 0;
 }
 
-/*
- * Group numbers are arbitrary.  Device with the same group number
- * indicate the iommu cannot differentiate between them.  To avoid
- * tracking used groups we just use the seg|bus|devfn of the lowest
- * level we're able to differentiate devices
- */
-static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
-{
-	struct pci_dev *pdev = to_pci_dev(dev);
-	struct pci_dev *bridge;
-	union {
-		struct {
-			u8 devfn;
-			u8 bus;
-			u16 segment;
-		} pci;
-		u32 group;
-	} id;
-
-	if (iommu_no_mapping(dev))
-		return -ENODEV;
-
-	id.pci.segment = pci_domain_nr(pdev->bus);
-	id.pci.bus = pdev->bus->number;
-	id.pci.devfn = pdev->devfn;
-
-	if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn))
-		return -ENODEV;
-
-	bridge = pci_find_upstream_pcie_bridge(pdev);
-	if (bridge) {
-		if (pci_is_pcie(bridge)) {
-			id.pci.bus = bridge->subordinate->number;
-			id.pci.devfn = 0;
-		} else {
-			id.pci.bus = bridge->bus->number;
-			id.pci.devfn = bridge->devfn;
-		}
-	}
-
-	if (!pdev->is_virtfn && iommu_group_mf)
-		id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0);
-
-	*groupid = id.group;
-
-	return 0;
-}
-
 static struct iommu_ops intel_iommu_ops = {
 	.domain_init	= intel_iommu_domain_init,
 	.domain_destroy = intel_iommu_domain_destroy,
@@ -4144,7 +4096,6 @@  static struct iommu_ops intel_iommu_ops = {
 	.unmap		= intel_iommu_unmap,
 	.iova_to_phys	= intel_iommu_iova_to_phys,
 	.domain_has_cap = intel_iommu_domain_has_cap,
-	.device_group	= intel_iommu_device_group,
 	.pgsize_bitmap	= INTEL_IOMMU_PGSIZES,
 };
 
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 2198b2d..f75004e 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -26,60 +26,404 @@ 
 #include <linux/slab.h>
 #include <linux/errno.h>
 #include <linux/iommu.h>
+#include <linux/idr.h>
+#include <linux/notifier.h>
+
+static struct kset *iommu_group_kset;
+static struct ida iommu_group_ida;
+static struct mutex iommu_group_mutex;
+
+struct iommu_group {
+	struct kobject kobj;
+	struct kobject *devices_kobj;
+	struct list_head devices;
+	struct mutex mutex;
+	struct blocking_notifier_head notifier;
+	int id;
+};
+
+struct iommu_device {
+	struct list_head list;
+	struct device *dev;
+};
+
+struct iommu_group_attribute {
+	struct attribute attr;
+	ssize_t (*show)(struct iommu_group *group, char *buf);
+	ssize_t (*store)(struct iommu_group *group,
+			 const char *buf, size_t count);
+};
 
-static ssize_t show_iommu_group(struct device *dev,
-				struct device_attribute *attr, char *buf)
+#define to_iommu_group_attr(_attr)	\
+	container_of(_attr, struct iommu_group_attribute, attr)
+#define to_iommu_group(_kobj)		\
+	container_of(_kobj, struct iommu_group, kobj)
+
+static ssize_t iommu_group_attr_show(struct kobject *kobj,
+				     struct attribute *__attr, char *buf)
 {
-	unsigned int groupid;
+	struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+	struct iommu_group *group = to_iommu_group(kobj);
+	ssize_t ret = -EIO;
 
-	if (iommu_device_group(dev, &groupid))
-		return 0;
+	if (attr->show)
+		ret = attr->show(group, buf);
+	return ret;
+}
 
-	return sprintf(buf, "%u", groupid);
+static ssize_t iommu_group_attr_store(struct kobject *kobj,
+				      struct attribute *__attr,
+				      const char *buf, size_t count)
+{
+	struct iommu_group_attribute *attr = to_iommu_group_attr(__attr);
+	struct iommu_group *group = to_iommu_group(kobj);
+	ssize_t ret = -EIO;
+
+	if (attr->store)
+		ret = attr->store(group, buf, count);
+	return ret;
 }
-static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL);
 
-static int add_iommu_group(struct device *dev, void *data)
+static void iommu_group_release(struct kobject *kobj)
 {
-	unsigned int groupid;
+	struct iommu_group *group = to_iommu_group(kobj);
 
-	if (iommu_device_group(dev, &groupid) == 0)
-		return device_create_file(dev, &dev_attr_iommu_group);
+	mutex_lock(&iommu_group_mutex);
+	ida_remove(&iommu_group_ida, group->id);
+	mutex_unlock(&iommu_group_mutex);
 
-	return 0;
+	kfree(group);
 }
 
-static int remove_iommu_group(struct device *dev)
+static const struct sysfs_ops iommu_group_sysfs_ops = {
+	.show = iommu_group_attr_show,
+	.store = iommu_group_attr_store,
+};
+
+static struct kobj_type iommu_group_ktype = {
+	.sysfs_ops = &iommu_group_sysfs_ops,
+	.release = iommu_group_release,
+};
+
+/**
+ * iommu_group_alloc - Allocate a new group
+ *
+ * This function is called by an iommu driver to allocate a new iommu
+ * group.  The iommu group represents the minimum granularity of the iommu.
+ * Upon successful return, the caller holds a reference to the supplied
+ * group in order to hold the group until devices are added.  Use
+ * iommu_group_put() to release this extra reference count, allowing the
+ * group to be automatically reclaimed once it has no devices or external
+ * references.
+ */
+struct iommu_group *iommu_group_alloc(void)
 {
-	unsigned int groupid;
+	struct iommu_group *group;
+	int ret;
+
+	group = kzalloc(sizeof(*group), GFP_KERNEL);
+	if (!group)
+		return ERR_PTR(-ENOMEM);
+
+	group->kobj.kset = iommu_group_kset;
+	mutex_init(&group->mutex);
+	INIT_LIST_HEAD(&group->devices);
+	BLOCKING_INIT_NOTIFIER_HEAD(&group->notifier);
+
+	mutex_lock(&iommu_group_mutex);
+
+again:
+	if (unlikely(0 == ida_pre_get(&iommu_group_ida, GFP_KERNEL))) {
+		kfree(group);
+		mutex_unlock(&iommu_group_mutex);
+		return ERR_PTR(-ENOMEM);
+	}
+
+	if (-EAGAIN == ida_get_new(&iommu_group_ida, &group->id))
+		goto again;
 
-	if (iommu_device_group(dev, &groupid) == 0)
-		device_remove_file(dev, &dev_attr_iommu_group);
+	mutex_unlock(&iommu_group_mutex);
+
+	ret = kobject_init_and_add(&group->kobj, &iommu_group_ktype,
+				   NULL, "%d", group->id);
+	if (ret) {
+		mutex_lock(&iommu_group_mutex);
+		ida_remove(&iommu_group_ida, group->id);
+		mutex_unlock(&iommu_group_mutex);
+		kfree(group);
+		return ERR_PTR(ret);
+	}
+
+	group->devices_kobj = kobject_create_and_add("devices", &group->kobj);
+	if (!group->devices_kobj) {
+		kobject_put(&group->kobj); /* triggers .release & free */
+		return ERR_PTR(-ENOMEM);
+	}
+
+	/*
+	 * The devices_kobj holds a reference on the group kobject, so
+	 * as long as that exists so will the group.  We can therefore
+	 * use the devices_kobj for reference counting.
+	 */
+	kobject_put(&group->kobj);
+
+	return group;
+}
 
+/**
+ * iommu_group_add_device - add a device to an iommu group
+ * @group: the group into which to add the device (reference should be held)
+ * @dev: the device
+ *
+ * This function is called by an iommu driver to add a device into a
+ * group.  Adding a device increments the group reference count.
+ */
+int iommu_group_add_device(struct iommu_group *group, struct device *dev)
+{
+	int ret;
+	struct iommu_device *device;
+
+	device = kzalloc(sizeof(*device), GFP_KERNEL);
+	if (!device)
+		return -ENOMEM;
+
+	device->dev = dev;
+
+	ret = sysfs_create_link(&dev->kobj, &group->kobj, "iommu_group");
+	if (ret) {
+		kfree(device);
+		return ret;
+	}
+
+	ret = sysfs_create_link(group->devices_kobj, &dev->kobj,
+				kobject_name(&dev->kobj));
+	if (ret) {
+		sysfs_remove_link(&dev->kobj, "iommu_group");
+		kfree(device);
+		return ret;
+	}
+
+	kobject_get(group->devices_kobj);
+
+	dev->iommu_group = group;
+
+	mutex_lock(&group->mutex);
+	list_add_tail(&device->list, &group->devices);
+	mutex_unlock(&group->mutex);
+
+	/* Notify any listeners about change to group. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_ADD_DEVICE, dev);
 	return 0;
 }
 
-static int iommu_device_notifier(struct notifier_block *nb,
-				 unsigned long action, void *data)
+/**
+ * iommu_group_remove_device - remove a device from its current group
+ * @dev: device to be removed
+ *
+ * This function is called by an iommu driver to remove the device from
+ * its current group.  This decrements the iommu group reference count.
+ */
+void iommu_group_remove_device(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+	struct iommu_device *device;
+
+	/* Pre-notify listeners that a device is being removed. */
+	blocking_notifier_call_chain(&group->notifier,
+				     IOMMU_GROUP_NOTIFY_DEL_DEVICE, dev);
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		if (device->dev == dev) {
+			list_del(&device->list);
+			kfree(device);
+			break;
+		}
+	}
+	mutex_unlock(&group->mutex);
+
+	sysfs_remove_link(group->devices_kobj, kobject_name(&dev->kobj));
+	sysfs_remove_link(&dev->kobj, "iommu_group");
+
+	dev->iommu_group = NULL;
+	kobject_put(group->devices_kobj);
+}
+
+/**
+ * iommu_group_for_each_dev - iterate over each device in the group
+ * @group: the group
+ * @data: caller opaque data to be passed to callback function
+ * @fn: caller supplied callback function
+ *
+ * This function is called by group users to iterate over group devices.
+ * Callers should hold a reference count to the group during callback.
+ */
+int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+			     int (*fn)(struct device *, void *))
+{
+	struct iommu_device *device;
+	int ret = 0;
+
+	mutex_lock(&group->mutex);
+	list_for_each_entry(device, &group->devices, list) {
+		ret = fn(device->dev, data);
+		if (ret)
+			break;
+	}
+	mutex_unlock(&group->mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(iommu_group_for_each_dev);
+
+/**
+ * iommu_group_get - Return the group for a device and increment reference
+ * @dev: get the group that this device belongs to
+ *
+ * This function is called by iommu drivers and users to get the group
+ * for the specified device.  If found, the group is returned and the group
+ * reference is incremented, else NULL.
+ */
+struct iommu_group *iommu_group_get(struct device *dev)
+{
+	struct iommu_group *group = dev->iommu_group;
+
+	if (group)
+		kobject_get(group->devices_kobj);
+
+	return group;
+}
+EXPORT_SYMBOL_GPL(iommu_group_get);
+
+/**
+ * iommu_group_put - Decrement group reference
+ * @group: the group to use
+ *
+ * This function is called by iommu drivers and users to release the
+ * iommu group.  Once the reference count is zero, the group is released.
+ */
+void iommu_group_put(struct iommu_group *group)
+{
+	if (group)
+		kobject_put(group->devices_kobj);
+}
+EXPORT_SYMBOL_GPL(iommu_group_put);
+
+/**
+ * iommu_group_register_notifier - Register a notifier for group changes
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * This function allows iommu group users to track changes in a group.
+ * See include/linux/iommu.h for actions sent via this notifier.  Caller
+ * should hold a reference to the group throughout notifier registration.
+ */
+int iommu_group_register_notifier(struct iommu_group *group,
+				  struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_register_notifier);
+
+/**
+ * iommu_group_unregister_notifier - Unregister a notifier
+ * @group: the group to watch
+ * @nb: notifier block to signal
+ *
+ * Unregister a previously registered group notifier block.
+ */
+int iommu_group_unregister_notifier(struct iommu_group *group,
+				    struct notifier_block *nb)
+{
+	return blocking_notifier_chain_unregister(&group->notifier, nb);
+}
+EXPORT_SYMBOL_GPL(iommu_group_unregister_notifier);
+
+/**
+ * iommu_group_id - Return ID for a group
+ * @group: the group to ID
+ *
+ * Return the unique ID for the group matching the sysfs group number.
+ */
+int iommu_group_id(struct iommu_group *group)
+{
+	return group->id;
+}
+EXPORT_SYMBOL_GPL(iommu_group_id);
+
+static int add_iommu_group(struct device *dev, void *data)
+{
+	struct iommu_ops *ops = data;
+
+	if (!ops->add_device)
+		return -ENODEV;
+
+	WARN_ON(dev->iommu_group);
+
+	return ops->add_device(dev);
+}
+
+static int iommu_bus_notifier(struct notifier_block *nb,
+			      unsigned long action, void *data)
 {
 	struct device *dev = data;
+	struct iommu_ops *ops = dev->bus->iommu_ops;
+	struct iommu_group *group;
+	unsigned long group_action = 0;
+
+	/*
+	 * ADD/DEL call into iommu driver ops if provided, which may
+	 * result in ADD/DEL notifiers to group->notifier
+	 */
+	if (action == BUS_NOTIFY_ADD_DEVICE) {
+		if (ops->add_device)
+			return ops->add_device(dev);
+	} else if (action == BUS_NOTIFY_DEL_DEVICE) {
+		if (ops->remove_device && dev->iommu_group) {
+			ops->remove_device(dev);
+			return 0;
+		}
+	}
 
-	if (action == BUS_NOTIFY_ADD_DEVICE)
-		return add_iommu_group(dev, NULL);
-	else if (action == BUS_NOTIFY_DEL_DEVICE)
-		return remove_iommu_group(dev);
+	/*
+	 * Remaining BUS_NOTIFYs get filtered and republished to the
+	 * group, if anyone is listening
+	 */
+	group = iommu_group_get(dev);
+	if (!group)
+		return 0;
 
+	switch (action) {
+	case BUS_NOTIFY_BIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BIND_DRIVER;
+		break;
+	case BUS_NOTIFY_BOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_BOUND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBIND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBIND_DRIVER;
+		break;
+	case BUS_NOTIFY_UNBOUND_DRIVER:
+		group_action = IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER;
+		break;
+	}
+
+	if (group_action)
+		blocking_notifier_call_chain(&group->notifier,
+					     group_action, dev);
+
+	iommu_group_put(group);
 	return 0;
 }
 
-static struct notifier_block iommu_device_nb = {
-	.notifier_call = iommu_device_notifier,
+static struct notifier_block iommu_bus_nb = {
+	.notifier_call = iommu_bus_notifier,
 };
 
 static void iommu_bus_init(struct bus_type *bus, struct iommu_ops *ops)
 {
-	bus_register_notifier(bus, &iommu_device_nb);
-	bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
+	bus_register_notifier(bus, &iommu_bus_nb);
+	bus_for_each_dev(bus, NULL, ops, add_iommu_group);
 }
 
 /**
@@ -189,6 +533,45 @@  void iommu_detach_device(struct iommu_domain *domain, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_detach_device);
 
+/*
+ * IOMMU groups are really the natural working unit of the IOMMU, but
+ * the IOMMU API works on domains and devices.  Bridge that gap by
+ * iterating over the devices in a group.  Ideally we'd have a single
+ * device which represents the requestor ID of the group, but we also
+ * allow IOMMU drivers to create policy defined minimum sets, where
+ * the physical hardware may be able to distinguish members, but we
+ * wish to group them at a higher level (ex. untrusted multi-function
+ * PCI devices).  Thus we attach each device.
+ */
+static int iommu_group_do_attach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	return iommu_attach_device(domain, dev);
+}
+
+int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	return iommu_group_for_each_dev(group, domain,
+					iommu_group_do_attach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_attach_group);
+
+static int iommu_group_do_detach_device(struct device *dev, void *data)
+{
+	struct iommu_domain *domain = data;
+
+	iommu_detach_device(domain, dev);
+
+	return 0;
+}
+
+void iommu_detach_group(struct iommu_domain *domain, struct iommu_group *group)
+{
+	iommu_group_for_each_dev(group, domain, iommu_group_do_detach_device);
+}
+EXPORT_SYMBOL_GPL(iommu_detach_group);
+
 phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain,
 			       unsigned long iova)
 {
@@ -333,11 +716,15 @@  size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
 }
 EXPORT_SYMBOL_GPL(iommu_unmap);
 
-int iommu_device_group(struct device *dev, unsigned int *groupid)
+static int __init iommu_init(void)
 {
-	if (iommu_present(dev->bus) && dev->bus->iommu_ops->device_group)
-		return dev->bus->iommu_ops->device_group(dev, groupid);
+	iommu_group_kset = kset_create_and_add("iommu_groups",
+					       NULL, kernel_kobj);
+	ida_init(&iommu_group_ida);
+	mutex_init(&iommu_group_mutex);
 
-	return -ENODEV;
+	BUG_ON(!iommu_group_kset);
+
+	return 0;
 }
-EXPORT_SYMBOL_GPL(iommu_device_group);
+subsys_initcall(iommu_init);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..92cf3c2 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@ 
 #define IOMMU_CACHE	(4) /* DMA cache coherency */
 
 struct iommu_ops;
+struct iommu_group;
 struct bus_type;
 struct device;
 struct iommu_domain;
@@ -59,6 +60,8 @@  struct iommu_domain {
  * @iova_to_phys: translate iova to physical address
  * @domain_has_cap: domain capabilities query
  * @commit: commit iommu domain
+ * @add_device: add device to iommu grouping
+ * @remove_device: remove device from iommu grouping
  * @pgsize_bitmap: bitmap of supported page sizes
  */
 struct iommu_ops {
@@ -74,10 +77,18 @@  struct iommu_ops {
 				    unsigned long iova);
 	int (*domain_has_cap)(struct iommu_domain *domain,
 			      unsigned long cap);
-	int (*device_group)(struct device *dev, unsigned int *groupid);
+	int (*add_device)(struct device *dev);
+	void (*remove_device)(struct device *dev);
 	unsigned long pgsize_bitmap;
 };
 
+#define IOMMU_GROUP_NOTIFY_ADD_DEVICE		1 /* Device added */
+#define IOMMU_GROUP_NOTIFY_DEL_DEVICE		2 /* Pre Device removed */
+#define IOMMU_GROUP_NOTIFY_BIND_DRIVER		3 /* Pre Driver bind */
+#define IOMMU_GROUP_NOTIFY_BOUND_DRIVER		4 /* Post Driver bind */
+#define IOMMU_GROUP_NOTIFY_UNBIND_DRIVER	5 /* Pre Driver unbind */
+#define IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER	6 /* Post Driver unbind */
+
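A group notifier consumer dispatches on the six events above, typically to account for host driver binding within the group. The following is a hedged standalone model, not kernel code; the bookkeeping it performs (counting bound devices) is the sort a VFIO-like consumer might do, and the callback name is hypothetical:

```c
#include <assert.h>

/* Event values copied from the patch. */
#define IOMMU_GROUP_NOTIFY_ADD_DEVICE		1
#define IOMMU_GROUP_NOTIFY_DEL_DEVICE		2
#define IOMMU_GROUP_NOTIFY_BIND_DRIVER		3
#define IOMMU_GROUP_NOTIFY_BOUND_DRIVER		4
#define IOMMU_GROUP_NOTIFY_UNBIND_DRIVER	5
#define IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER	6

/* Number of group member devices currently bound to a host driver. */
static int bound_count;

/* Hypothetical notifier callback: act only on the post-bind and
 * post-unbind events; the "pre" events carry no state change yet. */
int group_notifier(unsigned long action)
{
	switch (action) {
	case IOMMU_GROUP_NOTIFY_BOUND_DRIVER:
		bound_count++;
		break;
	case IOMMU_GROUP_NOTIFY_UNBOUND_DRIVER:
		bound_count--;
		break;
	default:
		break;
	}
	return 0;
}
```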
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -96,7 +107,24 @@  extern int iommu_domain_has_cap(struct iommu_domain *domain,
 				unsigned long cap);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
 					iommu_fault_handler_t handler);
-extern int iommu_device_group(struct device *dev, unsigned int *groupid);
+
+extern int iommu_attach_group(struct iommu_domain *domain,
+			      struct iommu_group *group);
+extern void iommu_detach_group(struct iommu_domain *domain,
+			       struct iommu_group *group);
+extern struct iommu_group *iommu_group_alloc(void);
+extern int iommu_group_add_device(struct iommu_group *group,
+				  struct device *dev);
+extern void iommu_group_remove_device(struct device *dev);
+extern int iommu_group_for_each_dev(struct iommu_group *group, void *data,
+				    int (*fn)(struct device *, void *));
+extern struct iommu_group *iommu_group_get(struct device *dev);
+extern void iommu_group_put(struct iommu_group *group);
+extern int iommu_group_register_notifier(struct iommu_group *group,
+					 struct notifier_block *nb);
+extern int iommu_group_unregister_notifier(struct iommu_group *group,
+					   struct notifier_block *nb);
+extern int iommu_group_id(struct iommu_group *group);
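The iommu_group_get()/iommu_group_put() pair above follows the kobject reference-counting lifecycle described in the commit message: the group is released when the last reference is dropped. A minimal standalone model of that contract (plain refcounting in place of kobjects, with illustrative names):

```c
#include <assert.h>
#include <stdlib.h>

struct iommu_group {
	int refcount;
	int id;
};

static int groups_freed;	/* test hook: counts released groups */

/* Allocation returns a group with one reference held by the caller,
 * analogous to iommu_group_alloc(). */
struct iommu_group *group_alloc(int id)
{
	struct iommu_group *g = malloc(sizeof(*g));

	g->refcount = 1;
	g->id = id;
	return g;
}

/* Take an additional reference, like iommu_group_get(). */
struct iommu_group *group_get(struct iommu_group *g)
{
	g->refcount++;
	return g;
}

/* Drop a reference; the last put releases the group, like the
 * kobject release triggered by iommu_group_put(). */
void group_put(struct iommu_group *g)
{
	if (--g->refcount == 0) {
		groups_freed++;
		free(g);
	}
}
```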
 
 /**
  * report_iommu_fault() - report about an IOMMU fault to the IOMMU framework
@@ -140,6 +168,7 @@  static inline int report_iommu_fault(struct iommu_domain *domain,
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
+struct iommu_group {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {
@@ -195,11 +224,60 @@  static inline void iommu_set_fault_handler(struct iommu_domain *domain,
 {
 }
 
-static inline int iommu_device_group(struct device *dev, unsigned int *groupid)
+static inline int iommu_attach_group(struct iommu_domain *domain,
+				     struct iommu_group *group)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_detach_group(struct iommu_domain *domain,
+				      struct iommu_group *group)
+{
+}
+
+static inline struct iommu_group *iommu_group_alloc(void)
+{
+	return ERR_PTR(-ENODEV);
+}
+
+static inline int iommu_group_add_device(struct iommu_group *group,
+					 struct device *dev)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_group_remove_device(struct device *dev)
+{
+}
+
+static inline int iommu_group_for_each_dev(struct iommu_group *group,
+					   void *data,
+					   int (*fn)(struct device *, void *))
 {
 	return -ENODEV;
 }
 
+static inline struct iommu_group *iommu_group_get(struct device *dev)
+{
+	return NULL;
+}
+
+static inline void iommu_group_put(struct iommu_group *group)
+{
+}
+
+static inline int iommu_group_register_notifier(struct iommu_group *group,
+						struct notifier_block *nb)
+{
+	return -ENODEV;
+}
+
+static inline int iommu_group_unregister_notifier(struct iommu_group *group,
+						  struct notifier_block *nb)
+{
+	return 0;
+}
+
+static inline int iommu_group_id(struct iommu_group *group)
+{
+	return -ENODEV;
+}
 #endif /* CONFIG_IOMMU_API */
 
 #endif /* __LINUX_IOMMU_H */