Patchwork [RFC,1/3] iommu: Introduce iommu_group

Submitter Alex Williamson
Date April 2, 2012, 9:14 p.m.
Message ID <20120402211440.28977.74440.stgit@bling.home>
Permalink /patch/150248/
State New
Headers show

Comments

Alex Williamson - April 2, 2012, 9:14 p.m.
IOMMUs often do not have visibility of individual devices in the
system.  Due to IOMMU design, bus topology, or device quirks, we
can often only identify groups of devices.  Examples include
Intel VT-d & AMD-Vi which often have function level visibility
compared to POWER partitionable endpoints which have bridge level
granularity.  PCIe-to-PCI bridges also often cloud the IOMMU
visibility, as the IOMMU cannot distinguish devices behind the bridge.
Devices can also sometimes hurt themselves by initiating DMA using
the wrong source ID on a multifunction PCI device.

IOMMU groups are meant to help solve these problems and hopefully
become the working unit of the IOMMU API.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 include/linux/device.h |    2 ++
 include/linux/iommu.h  |    5 +++++
 2 files changed, 7 insertions(+), 0 deletions(-)
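To illustrate the bridge-masking problem the commit message describes, here is a hypothetical sketch in plain C (not kernel code, and not part of this patch): behind a PCIe-to-PCI bridge, transactions carry the bridge's requester ID, so every device behind the bridge can only be grouped together.

```c
#include <stdint.h>

/* Hypothetical sketch: a PCI requester ID (RID) is the bus number in
 * the high byte and the device/function number in the low byte. */
static uint16_t rid(uint8_t bus, uint8_t devfn)
{
	return ((uint16_t)bus << 8) | devfn;
}

/* The IOMMU groups devices by the RID it actually sees.  Behind a
 * PCIe-to-PCI bridge, DMA carries the bridge's RID, so every device
 * behind the bridge collapses into a single group key. */
static uint16_t group_key(uint8_t bus, uint8_t devfn,
			  int behind_pci_bridge,
			  uint8_t bridge_bus, uint8_t bridge_devfn)
{
	if (behind_pci_bridge)
		return rid(bridge_bus, bridge_devfn);
	return rid(bus, devfn);
}
```

Two devices behind the same bridge get the same key, and therefore must land in the same iommu_group.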
David Gibson - April 18, 2012, 9:58 a.m.
On Mon, Apr 02, 2012 at 03:14:40PM -0600, Alex Williamson wrote:
> IOMMUs often do not have visibility of individual devices in the
> system.  Due to IOMMU design, bus topology, or device quirks, we
> can often only identify groups of devices.  Examples include
> Intel VT-d & AMD-Vi which often have function level visibility
> compared to POWER partitionable endpoints which have bridge level
> granularity.

That's a significant oversimplification of the situation on POWER,
although it doesn't really matter in this context.  On older (i.e. pre
PCI-E) hardware, PEs have either host bridge (i.e. domain)
granularity, or, IIUC, in some cases p2p bridge granularity, using
special p2p bridges, since that's the only real way to do iommu
differentiation without the PCI-E requestor IDs.  This isn't as coarse
as it seems in practice, because the hardware is usually built with a
bridge per physical PCI slot.

On newer PCI-E hardware, the PE granularity is basically a firmware
decision, and can go down to function level.  I believe pHyp puts the
granularity at the bridge level.  Our non-virtualized Linux "firmware"
currently does put it at the function level, but Ben is thinking about
changing that to bridge level: again, because of the hardware design
that isn't as coarse as it seems, and at this level we can guarantee
isolation in hardware to a degree that's not possible at the function
level.

>  PCIe-to-PCI bridges also often cloud the IOMMU
> visibility, as the IOMMU cannot distinguish devices behind the bridge.
> Devices can also sometimes hurt themselves by initiating DMA using
> the wrong source ID on a multifunction PCI device.
> 
> IOMMU groups are meant to help solve these problems and hopefully
> become the working unit of the IOMMU API.

So far, so simple.  No objections here.  I am trying to work out what
the real difference in approach is between this series and either your
or my earlier isolation group series.  AFAICT it's just that this
approach is explicitly only about IOMMU identity, ignoring (here) any
other factors which might affect isolation.  Or am I missing
something?
Alex Williamson - April 18, 2012, 8:07 p.m.
On Wed, 2012-04-18 at 19:58 +1000, David Gibson wrote:
> On Mon, Apr 02, 2012 at 03:14:40PM -0600, Alex Williamson wrote:
> > IOMMUs often do not have visibility of individual devices in the
> > system.  Due to IOMMU design, bus topology, or device quirks, we
> > can often only identify groups of devices.  Examples include
> > Intel VT-d & AMD-Vi which often have function level visibility
> > compared to POWER partitionable endpoints which have bridge level
> > granularity.
> 
> That's a significant oversimplification of the situation on POWER,
> although it doesn't really matter in this context.  On older (i.e. pre
> PCI-E) hardware, PEs have either host bridge (i.e. domain)
> granularity, or, IIUC, in some cases p2p bridge granularity, using
> special p2p bridges, since that's the only real way to do iommu
> differentiation without the PCI-E requestor IDs.  This isn't as coarse
> as it seems in practice, because the hardware is usually built with a
> bridge per physical PCI slot.
> 
> On newer PCI-E hardware, the PE granularity is basically a firmware
> decision, and can go down to function level.  I believe pHyp puts the
> granularity at the bridge level.  Our non-virtualized Linux "firmware"
> currently does put it at the function level, but Ben is thinking about
> changing that to bridge level: again, because of the hardware design
> that isn't as coarse as it seems, and at this level we can guarantee
> isolation in hardware to a degree that's not possible at the function
> level.

Ok, thanks for the clarification.  This should support either model and
it will be up to the iommu driver to fill the groups with the right
devices.
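The split Alex describes, where the platform's iommu driver decides the granularity, could look roughly like the following sketch; FUNCTION_LEVEL and BRIDGE_LEVEL here are illustrative stand-ins for a VT-d-style and a pHyp-style policy, not names from the patch:

```c
#include <stdint.h>

/* Illustrative only: two granularity policies an iommu driver might
 * apply when deciding which devices share a group. */
enum granularity { FUNCTION_LEVEL, BRIDGE_LEVEL };

/* The same device lands in a different group depending on the
 * platform-defined policy. */
static uint32_t pick_group(enum granularity g, uint8_t bus, uint8_t devfn)
{
	if (g == FUNCTION_LEVEL)
		return ((uint32_t)bus << 8) | devfn; /* one group per function */
	return bus;                                  /* one group per bridge/bus */
}
```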

> >  PCIe-to-PCI bridges also often cloud the IOMMU
> > visibility, as the IOMMU cannot distinguish devices behind the bridge.
> > Devices can also sometimes hurt themselves by initiating DMA using
> > the wrong source ID on a multifunction PCI device.
> > 
> > IOMMU groups are meant to help solve these problems and hopefully
> > become the working unit of the IOMMU API.
> 
> So far, so simple.  No objections here.  I am trying to work out what
> the real difference in approach is between this series and either your
> or my earlier isolation group series.  AFAICT it's just that this
> approach is explicitly only about IOMMU identity, ignoring (here) any
> other factors which might affect isolation.  Or am I missing
> something?

Yes, they are very similar and actually also similar to how VFIO manages
groups.  It's easy to start some kind of group structure, the hard part
is in the details and particularly where to stop.  My attempt to figure
out where isolation groups stop went quite poorly, ending up with a
ridiculously complicated mess of hierarchical groups that got worse as I
tried to fill in the gaps.

With iommu groups I try to take a step back and simplify.  I initially
had a goal of describing only the minimum iommu granularity sets; this
is where the dma_dev idea came from.  But the iommu granularity doesn't
really guarantee all that much or allow any ability for the iommu driver
to add additional policies that would be useful for userspace drivers
(ex. multi-function devices and peer-to-peer isolation).  So again I've
had to allow that a group might have multiple visible requestor IDs
within the group.  This time, though, I'm trying to disallow hierarchies,
which means that even kernel (dma_ops) usage of groups is restricted to
a single, platform-defined level of isolation.  I'm also trying to stay
out of the business of providing a group management interface.  I only
want to describe groups.  Things like stopping driver probes should be
device-level problems.  In effect, this level should not provide
enforcement and ownership; something like VFIO will do that.  So the
differences are subtle, but important.  Thanks,

Alex
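A hypothetical sketch of the flat, describe-only model outlined above: a device belongs to at most one group, groups do not nest, and the structure carries no enforcement state (that is left to something like VFIO).  All names here are illustrative; the patch itself only introduces an empty struct iommu_group.

```c
#include <stddef.h>

struct grp;			/* flat by design: no parent pointer */

struct dev {
	int id;
	struct grp *group;	/* at most one group per device */
};

struct grp {
	int id;
	int ndev;
};

/* Describe membership only; refuse re-grouping so hierarchies cannot
 * form.  Ownership and driver-probe blocking would live elsewhere. */
static int grp_add(struct grp *g, struct dev *d)
{
	if (d->group)
		return -1;
	d->group = g;
	g->ndev++;
	return 0;
}
```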

Patch

diff --git a/include/linux/device.h b/include/linux/device.h
index b63fb39..6acab1c 100644
--- a/include/linux/device.h
+++ b/include/linux/device.h
@@ -35,6 +35,7 @@  struct subsys_private;
 struct bus_type;
 struct device_node;
 struct iommu_ops;
+struct iommu_group;
 
 struct bus_attribute {
 	struct attribute	attr;
@@ -683,6 +684,7 @@  struct device {
 	const struct attribute_group **groups;	/* optional groups */
 
 	void	(*release)(struct device *dev);
+	struct iommu_group	*iommu_group;
 };
 
 /* Get the wakeup routines, which depend on struct device */
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index d937580..2ee375c 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -26,6 +26,7 @@ 
 #define IOMMU_CACHE	(4) /* DMA cache coherency */
 
 struct iommu_ops;
+struct iommu_group;
 struct bus_type;
 struct device;
 struct iommu_domain;
@@ -78,6 +79,9 @@  struct iommu_ops {
 	unsigned long pgsize_bitmap;
 };
 
+struct iommu_group {
+};
+
 extern int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops);
 extern bool iommu_present(struct bus_type *bus);
 extern struct iommu_domain *iommu_domain_alloc(struct bus_type *bus);
@@ -140,6 +144,7 @@  static inline int report_iommu_fault(struct iommu_domain *domain,
 #else /* CONFIG_IOMMU_API */
 
 struct iommu_ops {};
+struct iommu_group {};
 
 static inline bool iommu_present(struct bus_type *bus)
 {