Patchwork [05/13] pci: New pci_acs_enabled()

login
register
mail settings
Submitter Alex Williamson
Date May 11, 2012, 10:56 p.m.
Message ID <20120511225602.30496.80438.stgit@bling.home>
Download mbox | patch
Permalink /patch/158658/
State New
Headers show

Comments

Alex Williamson - May 11, 2012, 10:56 p.m.
In a PCIe environment, transactions aren't always required to
reach the root bus before being re-routed.  Peer-to-peer DMA
may actually not be seen by the IOMMU in these cases.  For
IOMMU groups, we want to provide IOMMU drivers a way to detect
these restrictions.  Provided with a PCI device, pci_acs_enabled
returns the furthest downstream device with a complete PCI ACS
chain.  This information can then be used in grouping to create
fully isolated groups.  ACS chain logic extracted from libvirt.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---

 drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
 include/linux/pci.h |    1 +
 2 files changed, 44 insertions(+), 0 deletions(-)
Bjorn Helgaas - May 14, 2012, 10:02 p.m.
On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> In a PCIe environment, transactions aren't always required to
> reach the root bus before being re-routed.  Peer-to-peer DMA
> may actually not be seen by the IOMMU in these cases.  For
> IOMMU groups, we want to provide IOMMU drivers a way to detect
> these restrictions.  Provided with a PCI device, pci_acs_enabled
> returns the furthest downstream device with a complete PCI ACS
> chain.  This information can then be used in grouping to create
> fully isolated groups.  ACS chain logic extracted from libvirt.

The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.

I'm not sure what "a complete PCI ACS chain" means.

The function starts from "dev" and searches *upstream*, so I'm
guessing it returns the root of a subtree that must be contained in a
group.

> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>
>  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
>  include/linux/pci.h |    1 +
>  2 files changed, 44 insertions(+), 0 deletions(-)
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 111569c..d7f05ce 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
>        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>  }
>
> +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
> +                                        PCI_ACS_CR | PCI_ACS_UF)
> +
> +/**
> + * pci_acs_enabled - test ACS support in downstream chain
> + * @dev: starting PCI device
> + *
> + * Returns the furthest downstream device with an unbroken ACS chain.  If
> + * ACS is enabled throughout the chain, the returned device is the same as
> + * the one passed in.
> + */
> +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
> +{
> +       struct pci_dev *acs_dev;
> +       int pos;
> +       u16 ctrl;
> +
> +       if (!pci_is_root_bus(dev->bus))
> +               acs_dev = pci_acs_enabled(dev->bus->self);
> +       else
> +               return dev;
> +
> +       /* If the chain is already broken, pass on the device */
> +       if (acs_dev != dev->bus->self)
> +               return acs_dev;
> +
> +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
> +               return dev;
> +
> +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
> +               return dev;
> +
> +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
> +       if (!pos)
> +               return acs_dev;
> +
> +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
> +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
> +               return acs_dev;
> +
> +       return dev;
> +}
> +
>  /**
>  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
>  * @dev: the PCI device
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 9910b5c..dc25da3 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
>  }
>
>  void pci_request_acs(void);
> +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
>
>
>  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
>
Alex Williamson - May 14, 2012, 10:49 p.m.
On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > In a PCIe environment, transactions aren't always required to
> > reach the root bus before being re-routed.  Peer-to-peer DMA
> > may actually not be seen by the IOMMU in these cases.  For
> > IOMMU groups, we want to provide IOMMU drivers a way to detect
> > these restrictions.  Provided with a PCI device, pci_acs_enabled
> > returns the furthest downstream device with a complete PCI ACS
> > chain.  This information can then be used in grouping to create
> > fully isolated groups.  ACS chain logic extracted from libvirt.
> 
> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.

Right, maybe this should be:

struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);

> I'm not sure what "a complete PCI ACS chain" means.
> 
> The function starts from "dev" and searches *upstream*, so I'm
> guessing it returns the root of a subtree that must be contained in a
> group.

Any intermediate switch between an endpoint and the root bus can
redirect a dma access without iommu translation, so we're looking for
the furthest upstream device for which acs is enabled all the way up to
the root bus.  I'll fix the function name and comments/commit log if
that makes it sufficiently clear.  Thanks,

Alex

> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >
> >  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
> >  include/linux/pci.h |    1 +
> >  2 files changed, 44 insertions(+), 0 deletions(-)
> >
> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> > index 111569c..d7f05ce 100644
> > --- a/drivers/pci/pci.c
> > +++ b/drivers/pci/pci.c
> > @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
> >        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
> >  }
> >
> > +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
> > +                                        PCI_ACS_CR | PCI_ACS_UF)
> > +
> > +/**
> > + * pci_acs_enabled - test ACS support in downstream chain
> > + * @dev: starting PCI device
> > + *
> > + * Returns the furthest downstream device with an unbroken ACS chain.  If
> > + * ACS is enabled throughout the chain, the returned device is the same as
> > + * the one passed in.
> > + */
> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
> > +{
> > +       struct pci_dev *acs_dev;
> > +       int pos;
> > +       u16 ctrl;
> > +
> > +       if (!pci_is_root_bus(dev->bus))
> > +               acs_dev = pci_acs_enabled(dev->bus->self);
> > +       else
> > +               return dev;
> > +
> > +       /* If the chain is already broken, pass on the device */
> > +       if (acs_dev != dev->bus->self)
> > +               return acs_dev;
> > +
> > +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
> > +               return dev;
> > +
> > +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
> > +               return dev;
> > +
> > +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
> > +       if (!pos)
> > +               return acs_dev;
> > +
> > +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
> > +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
> > +               return acs_dev;
> > +
> > +       return dev;
> > +}
> > +
> >  /**
> >  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
> >  * @dev: the PCI device
> > diff --git a/include/linux/pci.h b/include/linux/pci.h
> > index 9910b5c..dc25da3 100644
> > --- a/include/linux/pci.h
> > +++ b/include/linux/pci.h
> > @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
> >  }
> >
> >  void pci_request_acs(void);
> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
> >
> >
> >  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
> >
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas - May 15, 2012, 7:56 p.m.
On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
<alex.williamson@redhat.com> wrote:
> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>> <alex.williamson@redhat.com> wrote:
>> > In a PCIe environment, transactions aren't always required to
>> > reach the root bus before being re-routed.  Peer-to-peer DMA
>> > may actually not be seen by the IOMMU in these cases.  For
>> > IOMMU groups, we want to provide IOMMU drivers a way to detect
>> > these restrictions.  Provided with a PCI device, pci_acs_enabled
>> > returns the furthest downstream device with a complete PCI ACS
>> > chain.  This information can then be used in grouping to create
>> > fully isolated groups.  ACS chain logic extracted from libvirt.
>>
>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>
> Right, maybe this should be:
>
> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>
>> I'm not sure what "a complete PCI ACS chain" means.
>>
>> The function starts from "dev" and searches *upstream*, so I'm
>> guessing it returns the root of a subtree that must be contained in a
>> group.
>
> Any intermediate switch between an endpoint and the root bus can
> redirect a dma access without iommu translation,

Is this "redirection" just the normal PCI bridge forwarding that
allows peer-to-peer transactions, i.e., the rule (from P2P bridge
spec, rev 1.2, sec 4.1) that the bridge apertures define address
ranges that are forwarded from primary to secondary interface, and the
inverse ranges are forwarded from secondary to primary?  For example,
here:

                   ^
                   |
          +--------+-------+
          |                |
   +------+-----+    +-----++-----+
   | Downstream |    | Downstream |
   |    Port    |    |    Port    |
   |   06:05.0  |    |   06:06.0  |
   +------+-----+    +------+-----+
          |                 |
     +----v----+       +----v----+
     | Endpoint|       | Endpoint|
     | 07:00.0 |       | 08:00.0 |
     +---------+       +---------+

that rule is all that's needed for a transaction from 07:00.0 to be
forwarded from upstream to the internal switch bus 06, then claimed by
06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
nothing specific to PCIe.

I don't understand ACS very well, but it looks like it basically
provides ways to prevent that peer-to-peer forwarding, so transactions
would be sent upstream toward the root (and specifically, the IOMMU)
instead of being directly claimed by 06:06.0.

> so we're looking for
> the furthest upstream device for which acs is enabled all the way up to
> the root bus.

Correct me if this is wrong: To force device A's DMAs to be processed
by an IOMMU, ACS must be enabled on the root port and every downstream
port along the path to A.

If so, I think you're trying to find out the closest upstream device X
such that everything leading to X has ACS enabled.  Every device below
X can DMA freely to other devices below X, so they would all have to
be in the same isolated group.

I tried to work through some examples to develop some intuition about this:

                                |
        +------------+----------+----------------------+
        |            |                                 |
        |            |
+----------------|-------------------------------+
   +----v----+  +----v----+           |          +-----v----+
                |
   | 00:00.0 |  | 00:01.0 |           |          | 00:02.0  |
                |
   |   PCI   |  | PCIe-to |           |          | Upstream |
                |
   +---------+  |   PCI   |           |          +-----+----+
                |
                +----+----+           |                |
                |
                     |                |
+---------+------+----------------+       |
              +------+------+         |      |                |
        |       |
              |             |         | +----v-----+     +----v-----+
   +----v-----+ |
         +----v----+   +----v----+    | | 02:00.0  |     | 02:01.0  |
   | 02:02.0  | |
         | 01:00.0 |   | 01:01.0 |    | |Downstream|     |Downstream|
   |Downstream| |
         |   PCI   |   |   PCI   |    | |  w/o ACS |     |  w/ ACS  |
   |  w/ ACS  | |
         +---------+   +---------+    | +-----+----+     +----+-----+
   +----+-----+ |

+-------|---------------|----------------|-------+
                                              |               |                |
                                         +----v----+     +----v----+
   +----v----+
                                         | 03:00.0 |     | 04:00.0 |
   | 05:00.0 |
                                         |  PCIe   |     |  PCIe   |
   |  PCIe   |
                                         +---------+     | w/o ACS |
   |  w/ ACS |
                                                         +---------+
   +---------+

pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
if 00:00.0 is PCIe or if RP has ACS?))
pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
PCIe; seems wrong)
pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
doesn't have ACS)
pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
a bridge; seems wrong if 04:00 is a multi-function device)
pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)

But it didn't really help.  I still can't develop a mental picture of
what this function does.

>> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> > ---
>> >
>> >  drivers/pci/pci.c   |   43 +++++++++++++++++++++++++++++++++++++++++++
>> >  include/linux/pci.h |    1 +
>> >  2 files changed, 44 insertions(+), 0 deletions(-)
>> >
>> > diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> > index 111569c..d7f05ce 100644
>> > --- a/drivers/pci/pci.c
>> > +++ b/drivers/pci/pci.c
>> > @@ -2358,6 +2358,49 @@ void pci_enable_acs(struct pci_dev *dev)
>> >        pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
>> >  }
>> >
>> > +#define PCI_EXT_CAP_ACS_ENABLED                (PCI_ACS_SV | PCI_ACS_RR | \
>> > +                                        PCI_ACS_CR | PCI_ACS_UF)
>> > +
>> > +/**
>> > + * pci_acs_enabled - test ACS support in downstream chain
>> > + * @dev: starting PCI device
>> > + *
>> > + * Returns the furthest downstream device with an unbroken ACS chain.  If
>> > + * ACS is enabled throughout the chain, the returned device is the same as
>> > + * the one passed in.
>> > + */
>> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
>> > +{
>> > +       struct pci_dev *acs_dev;
>> > +       int pos;
>> > +       u16 ctrl;
>> > +
>> > +       if (!pci_is_root_bus(dev->bus))
>> > +               acs_dev = pci_acs_enabled(dev->bus->self);
>> > +       else
>> > +               return dev;
>> > +
>> > +       /* If the chain is already broken, pass on the device */
>> > +       if (acs_dev != dev->bus->self)
>> > +               return acs_dev;
>> > +
>> > +       if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
>> > +               return dev;
>> > +
>> > +       if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
>> > +               return dev;
>> > +
>> > +       pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
>> > +       if (!pos)
>> > +               return acs_dev;
>> > +
>> > +       pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
>> > +       if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
>> > +               return acs_dev;
>> > +
>> > +       return dev;
>> > +}
>> > +
>> >  /**
>> >  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
>> >  * @dev: the PCI device
>> > diff --git a/include/linux/pci.h b/include/linux/pci.h
>> > index 9910b5c..dc25da3 100644
>> > --- a/include/linux/pci.h
>> > +++ b/include/linux/pci.h
>> > @@ -1586,6 +1586,7 @@ static inline bool pci_is_pcie(struct pci_dev *dev)
>> >  }
>> >
>> >  void pci_request_acs(void);
>> > +struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
>> >
>> >
>> >  #define PCI_VPD_LRDT                   0x80    /* Large Resource Data Type */
>> >
>> --
>> To unsubscribe from this list: send the line "unsubscribe kvm" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
Bjorn Helgaas - May 15, 2012, 8:05 p.m.
> I tried to work through some examples to develop some intuition about this:

Sorry, gmail inserted line breaks that ruined this picture.  Here's a
URL for it:

http://www.asciiflow.com/#3736558963405980039
Alex Williamson - May 15, 2012, 9:09 p.m.
On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> <alex.williamson@redhat.com> wrote:
> > On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >> <alex.williamson@redhat.com> wrote:
> >> > In a PCIe environment, transactions aren't always required to
> >> > reach the root bus before being re-routed.  Peer-to-peer DMA
> >> > may actually not be seen by the IOMMU in these cases.  For
> >> > IOMMU groups, we want to provide IOMMU drivers a way to detect
> >> > these restrictions.  Provided with a PCI device, pci_acs_enabled
> >> > returns the furthest downstream device with a complete PCI ACS
> >> > chain.  This information can then be used in grouping to create
> >> > fully isolated groups.  ACS chain logic extracted from libvirt.
> >>
> >> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >
> > Right, maybe this should be:
> >
> > struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >
> >> I'm not sure what "a complete PCI ACS chain" means.
> >>
> >> The function starts from "dev" and searches *upstream*, so I'm
> >> guessing it returns the root of a subtree that must be contained in a
> >> group.
> >
> > Any intermediate switch between an endpoint and the root bus can
> > redirect a dma access without iommu translation,
> 
> Is this "redirection" just the normal PCI bridge forwarding that
> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> ranges that are forwarded from primary to secondary interface, and the
> inverse ranges are forwarded from secondary to primary?  For example,
> here:
> 
>                    ^
>                    |
>           +--------+-------+
>           |                |
>    +------+-----+    +-----++-----+
>    | Downstream |    | Downstream |
>    |    Port    |    |    Port    |
>    |   06:05.0  |    |   06:06.0  |
>    +------+-----+    +------+-----+
>           |                 |
>      +----v----+       +----v----+
>      | Endpoint|       | Endpoint|
>      | 07:00.0 |       | 08:00.0 |
>      +---------+       +---------+
> 
> that rule is all that's needed for a transaction from 07:00.0 to be
> forwarded from upstream to the internal switch bus 06, then claimed by
> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> nothing specific to PCIe.

Right, I think the main PCI difference is the point-to-point nature of
PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
devices talking to each other, but on PCIe the transaction makes a
U-turn at some point and heads out another downstream port.  ACS allows
us to prevent that from happening.

> I don't understand ACS very well, but it looks like it basically
> provides ways to prevent that peer-to-peer forwarding, so transactions
> would be sent upstream toward the root (and specifically, the IOMMU)
> instead of being directly claimed by 06:06.0.

Yep, that's my meager understanding as well.

> > so we're looking for
> > the furthest upstream device for which acs is enabled all the way up to
> > the root bus.
> 
> Correct me if this is wrong: To force device A's DMAs to be processed
> by an IOMMU, ACS must be enabled on the root port and every downstream
> port along the path to A.

Yes, modulo this comment in libvirt source:

    /* if we have no parent, and this is the root bus, ACS doesn't come
     * into play since devices on the root bus can't P2P without going
     * through the root IOMMU.
     */

So we assume that a redirect at the point of the iommu will factor in
iommu translation.

> If so, I think you're trying to find out the closest upstream device X
> such that everything leading to X has ACS enabled.  Every device below
> X can DMA freely to other devices below X, so they would all have to
> be in the same isolated group.

Yes

> I tried to work through some examples to develop some intuition about this:

(inserting fixed url)
> http://www.asciiflow.com/#3736558963405980039

> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> if 00:00.0 is PCIe or if RP has ACS?))

Hmm, the latter is the assumption above.  For the former, I think
libvirt was probably assuming that PCI devices must have a PCIe device
upstream from them because x86 doesn't have assignment friendly IOMMUs
except on PCIe.  I'll need to work on making that more generic.

> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> PCIe; seems wrong)

Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
input devices, so this was passing for me.  I'll need to incorporate
that generically.

> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> doesn't have ACS)

Yeah, let me validate the libvirt assumption.  I see ACS on my root
port, so maybe they're just assuming it's always enabled or that the
precedence favors IOMMU translation.  I'm also starting to think that we
might want "from" and "to" struct pci_dev parameters to make it more
flexible where the iommu lives in the system.

> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> a bridge; seems wrong if 04:00 is a multi-function device)

AIUI, ACS is not an endpoint property, so this is what should happen.  I
don't think multifunction plays a role other than how much do we trust
the implementation to not allow back channels between functions (the
answer should probably be not at all).

> pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
> pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)
> 
> But it didn't really help.  I still can't develop a mental picture of
> what this function does.

It helped me :)  These are good examples, I'll work on fixing it for
them.  Thanks,

Alex
Don Dutile - May 16, 2012, 1:29 p.m.
On 05/15/2012 05:09 PM, Alex Williamson wrote:
> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>> <alex.williamson@redhat.com>  wrote:
>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>  wrote:
>>>>> In a PCIe environment, transactions aren't always required to
>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>> chain.  This information can then be used in grouping to create
>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>
>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>
>>> Right, maybe this should be:
>>>
>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>
+1; there is a global in the PCI code, pci_acs_enable,
and a function pci_enable_acs(), which the above name certainly
confuses.  I recommend  pci_find_top_acs_bridge()
would be most descriptive.

>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>
>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>> guessing it returns the root of a subtree that must be contained in a
>>>> group.
>>>
>>> Any intermediate switch between an endpoint and the root bus can
>>> redirect a dma access without iommu translation,
>>
>> Is this "redirection" just the normal PCI bridge forwarding that
>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>> ranges that are forwarded from primary to secondary interface, and the
>> inverse ranges are forwarded from secondary to primary?  For example,
>> here:
>>
>>                     ^
>>                     |
>>            +--------+-------+
>>            |                |
>>     +------+-----+    +-----++-----+
>>     | Downstream |    | Downstream |
>>     |    Port    |    |    Port    |
>>     |   06:05.0  |    |   06:06.0  |
>>     +------+-----+    +------+-----+
>>            |                 |
>>       +----v----+       +----v----+
>>       | Endpoint|       | Endpoint|
>>       | 07:00.0 |       | 08:00.0 |
>>       +---------+       +---------+
>>
>> that rule is all that's needed for a transaction from 07:00.0 to be
>> forwarded from upstream to the internal switch bus 06, then claimed by
>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>> nothing specific to PCIe.
>
> Right, I think the main PCI difference is the point-to-point nature of
> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> devices talking to each other, but on PCIe the transaction makes a
> U-turn at some point and heads out another downstream port.  ACS allows
> us to prevent that from happening.
>
detail: PCIe up/downstream routing is really done by an internal switch;
         ACS forces the legacy, PCI base-limit address routing and *forces*
         the switch to always route the transaction from a downstream port
         to the upstream port.

>> I don't understand ACS very well, but it looks like it basically
>> provides ways to prevent that peer-to-peer forwarding, so transactions
>> would be sent upstream toward the root (and specifically, the IOMMU)
>> instead of being directly claimed by 06:06.0.
>
> Yep, that's my meager understanding as well.
>
+1

>>> so we're looking for
>>> the furthest upstream device for which acs is enabled all the way up to
>>> the root bus.
>>
>> Correct me if this is wrong: To force device A's DMAs to be processed
>> by an IOMMU, ACS must be enabled on the root port and every downstream
>> port along the path to A.
>
> Yes, modulo this comment in libvirt source:
>
>      /* if we have no parent, and this is the root bus, ACS doesn't come
>       * into play since devices on the root bus can't P2P without going
>       * through the root IOMMU.
>       */
>
Correct. PCIe spec says roots must support ACS. I believe all the
root bridges that have an IOMMU have ACS wired in/on.

> So we assume that a redirect at the point of the iommu will factor in
> iommu translation.
>
>> If so, I think you're trying to find out the closest upstream device X
>> such that everything leading to X has ACS enabled.  Every device below
>> X can DMA freely to other devices below X, so they would all have to
>> be in the same isolated group.
>
> Yes
>
>> I tried to work through some examples to develop some intuition about this:
>
> (inserting fixed url)
>> http://www.asciiflow.com/#3736558963405980039
>
>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>> if 00:00.0 is PCIe or if RP has ACS?))
>
> Hmm, the latter is the assumption above.  For the former, I think
> libvirt was probably assuming that PCI devices must have a PCIe device
> upstream from them because x86 doesn't have assignment friendly IOMMUs
> except on PCIe.  I'll need to work on making that more generic.
>
>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>> PCIe; seems wrong)
>
> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> input devices, so this was passing for me.  I'll need to incorporate
> that generically.
>
>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>> doesn't have ACS)
>
> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> port, so maybe they're just assuming it's always enabled or that the
> precedence favors IOMMU translation.  I'm also starting to think that we
> might want "from" and "to" struct pci_dev parameters to make it more
> flexible where the iommu lives in the system.
>
see comment above wrt root ports that have IOMMUs in them.

>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>> a bridge; seems wrong if 04:00 is a multi-function device)
>
> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> don't think multifunction plays a role other than how much do we trust
> the implementation to not allow back channels between functions (the
> answer should probably be not at all).
>
correct. ACS is a *bridge* property.
The unknown wrt multifunction devices is that such devices *could* be implemented
by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
btwn the functions within a device.
Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
force ACS.  So, one has to ask the hw vendors if such a hidden device exists
in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
and allow parent bridge re-route it back down if peer-to-peer is desired.
Debate exists whether multifunction devices are 'secure' b/c of this unknown.
Maybe a PCIe (min., SRIOV) spec change is needed in this area to
determine this status about a device (via pci cfg/cap space).

>> pci_acs_enabled(02:02.0) = 02:02.0 (acs_dev = 00:02.0, 02:02.0 has ACS enabled)
>> pci_acs_enabled(05:00.0) = 05:00.0 (acs_dev = 02:02.0, 05:00.0 is not a bridge)
>>
>> But it didn't really help.  I still can't develop a mental picture of
>> what this function does.
>
> It helped me :)  These are good examples, I'll work on fixing it for
> them.  Thanks,
>
> Alex
>
>
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu
Alex Williamson - May 16, 2012, 4:21 p.m.
On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> > On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >> <alex.williamson@redhat.com>  wrote:
> >>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>> <alex.williamson@redhat.com>  wrote:
> >>>>> In a PCIe environment, transactions aren't always required to
> >>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>> chain.  This information can then be used in grouping to create
> >>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>
> >>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>
> >>> Right, maybe this should be:
> >>>
> >>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>
> +1; there is a global in the PCI code, pci_acs_enable,
> and a function pci_enable_acs(), which the above name certainly
> confuses.  I recommend  pci_find_top_acs_bridge()
> would be most descriptive.

Yep, the new API I'm working with is:

bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
bool pci_acs_path_enabled(struct pci_dev *start,
                          struct pci_dev *end, u16 acs_flags);

> >>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>
> >>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>> guessing it returns the root of a subtree that must be contained in a
> >>>> group.
> >>>
> >>> Any intermediate switch between an endpoint and the root bus can
> >>> redirect a dma access without iommu translation,
> >>
> >> Is this "redirection" just the normal PCI bridge forwarding that
> >> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >> ranges that are forwarded from primary to secondary interface, and the
> >> inverse ranges are forwarded from secondary to primary?  For example,
> >> here:
> >>
> >>                     ^
> >>                     |
> >>            +--------+-------+
> >>            |                |
> >>     +------+-----+    +-----++-----+
> >>     | Downstream |    | Downstream |
> >>     |    Port    |    |    Port    |
> >>     |   06:05.0  |    |   06:06.0  |
> >>     +------+-----+    +------+-----+
> >>            |                 |
> >>       +----v----+       +----v----+
> >>       | Endpoint|       | Endpoint|
> >>       | 07:00.0 |       | 08:00.0 |
> >>       +---------+       +---------+
> >>
> >> that rule is all that's needed for a transaction from 07:00.0 to be
> >> forwarded from upstream to the internal switch bus 06, then claimed by
> >> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >> nothing specific to PCIe.
> >
> > Right, I think the main PCI difference is the point-to-point nature of
> > PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> > devices talking to each other, but on PCIe the transaction makes a
> > U-turn at some point and heads out another downstream port.  ACS allows
> > us to prevent that from happening.
> >
> detail: PCIe up/downstream routing is really done by an internal switch;
>          ACS forces the legacy, PCI base-limit address routing and *forces*
>          the switch to always route the transaction from a downstream port
>          to the upstream port.
> 
> >> I don't understand ACS very well, but it looks like it basically
> >> provides ways to prevent that peer-to-peer forwarding, so transactions
> >> would be sent upstream toward the root (and specifically, the IOMMU)
> >> instead of being directly claimed by 06:06.0.
> >
> > Yep, that's my meager understanding as well.
> >
> +1
> 
> >>> so we're looking for
> >>> the furthest upstream device for which acs is enabled all the way up to
> >>> the root bus.
> >>
> >> Correct me if this is wrong: To force device A's DMAs to be processed
> >> by an IOMMU, ACS must be enabled on the root port and every downstream
> >> port along the path to A.
> >
> > Yes, modulo this comment in libvirt source:
> >
> >      /* if we have no parent, and this is the root bus, ACS doesn't come
> >       * into play since devices on the root bus can't P2P without going
> >       * through the root IOMMU.
> >       */
> >
> Correct. PCIe spec says roots must support ACS. I believe all the
> root bridges that have an IOMMU have ACS wired in/on.

Would you mind looking for the paragraph that says this?  I'd rather
code this into the iommu driver callers than core PCI code if this is
just a platform standard.

> > So we assume that a redirect at the point of the iommu will factor in
> > iommu translation.
> >
> >> If so, I think you're trying to find out the closest upstream device X
> >> such that everything leading to X has ACS enabled.  Every device below
> >> X can DMA freely to other devices below X, so they would all have to
> >> be in the same isolated group.
> >
> > Yes
> >
> >> I tried to work through some examples to develop some intuition about this:
> >
> > (inserting fixed url)
> >> http://www.asciiflow.com/#3736558963405980039
> >
> >> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >> if 00:00.0 is PCIe or if RP has ACS?))
> >
> > Hmm, the latter is the assumption above.  For the former, I think
> > libvirt was probably assuming that PCI devices must have a PCIe device
> > upstream from them because x86 doesn't have assignment friendly IOMMUs
> > except on PCIe.  I'll need to work on making that more generic.
> >
> >> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >> PCIe; seems wrong)
> >
> > Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> > input devices, so this was passing for me.  I'll need to incorporate
> > that generically.
> >
> >> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >> doesn't have ACS)
> >
> > Yeah, let me validate the libvirt assumption.  I see ACS on my root
> > port, so maybe they're just assuming it's always enabled or that the
> > precedence favors IOMMU translation.  I'm also starting to think that we
> > might want "from" and "to" struct pci_dev parameters to make it more
> > flexible where the iommu lives in the system.
> >
> see comment above wrt root ports that have IOMMUs in them.

Except it really seems to be platform convention where the IOMMU lives.
The DMAR for VT-d describes which devices and hierarchies a DRHD is used
for and from that we can make assumptions about where it physically
lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
function on the root bus.  For now I'm just allowing
pci_acs_path_enabled to take NULL for and end, which means "up to the
root bus".

> 
> >> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >> a bridge; seems wrong if 04:00 is a multi-function device)
> >
> > AIUI, ACS is not an endpoint property, so this is what should happen.  I
> > don't think multifunction plays a role other than how much do we trust
> > the implementation to not allow back channels between functions (the
> > answer should probably be not at all).
> >
> correct. ACS is a *bridge* property.
> The unknown wrt multifunction devices is that such devices *could* be implemented
> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> btwn the functions within a device.
> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> and allow parent bridge re-route it back down if peer-to-peer is desired.
> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> determine this status about a device (via pci cfg/cap space).

Well, there is actually a section of the ACS part of the spec
identifying valid flags for multifunction devices.  Secretly I'd like to
use this as justification for blacklisting all multifunction devices
that don't explicitly support ACS, but that makes for pretty course
granularity.  For instance, all these devices end up in a group:

   +-14.0  ATI Technologies Inc SBx00 SMBus Controller
   +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
   +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
   +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller

  00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)

And these in another:

   +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
   +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
   +-15.2-[08]--
   +-15.3-[09]--

  00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
  00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
  00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
  00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)

Am I misinterpreting the spec or is this the correct, if strict,
interpretation?

Alex
Don Dutile - May 18, 2012, 11 p.m.
On 05/18/2012 06:02 PM, Alex Williamson wrote:
> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>> <alex.williamson@redhat.com>   wrote:
>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>> <alex.williamson@redhat.com>   wrote:
>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>
>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>
>>>>> Right, maybe this should be:
>>>>>
>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>
>> +1; there is a global in the PCI code, pci_acs_enable,
>> and a function pci_enable_acs(), which the above name certainly
>> confuses.  I recommend  pci_find_top_acs_bridge()
>> would be most descriptive.
Finally, with my email filters fixed, I can see this email... :)

>
> Yep, the new API I'm working with is:
>
> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> bool pci_acs_path_enabled(struct pci_dev *start,
>                            struct pci_dev *end, u16 acs_flags);
>
ok.

>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>
>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>> group.
>>>>>
>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>> redirect a dma access without iommu translation,
>>>>
>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>> ranges that are forwarded from primary to secondary interface, and the
>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>> here:
>>>>
>>>>                      ^
>>>>                      |
>>>>             +--------+-------+
>>>>             |                |
>>>>      +------+-----+    +-----++-----+
>>>>      | Downstream |    | Downstream |
>>>>      |    Port    |    |    Port    |
>>>>      |   06:05.0  |    |   06:06.0  |
>>>>      +------+-----+    +------+-----+
>>>>             |                 |
>>>>        +----v----+       +----v----+
>>>>        | Endpoint|       | Endpoint|
>>>>        | 07:00.0 |       | 08:00.0 |
>>>>        +---------+       +---------+
>>>>
>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>> nothing specific to PCIe.
>>>
>>> Right, I think the main PCI difference is the point-to-point nature of
>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>> devices talking to each other, but on PCIe the transaction makes a
>>> U-turn at some point and heads out another downstream port.  ACS allows
>>> us to prevent that from happening.
>>>
>> detail: PCIe up/downstream routing is really done by an internal switch;
>>           ACS forces the legacy, PCI base-limit address routing and *forces*
>>           the switch to always route the transaction from a downstream port
>>           to the upstream port.
>>
>>>> I don't understand ACS very well, but it looks like it basically
>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>> instead of being directly claimed by 06:06.0.
>>>
>>> Yep, that's my meager understanding as well.
>>>
>> +1
>>
>>>>> so we're looking for
>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>> the root bus.
>>>>
>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>> port along the path to A.
>>>
>>> Yes, modulo this comment in libvirt source:
>>>
>>>       /* if we have no parent, and this is the root bus, ACS doesn't come
>>>        * into play since devices on the root bus can't P2P without going
>>>        * through the root IOMMU.
>>>        */
>>>
>> Correct. PCIe spec says roots must support ACS. I believe all the
>> root bridges that have an IOMMU have ACS wired in/on.
>
> Would you mind looking for the paragraph that says this?  I'd rather
> code this into the iommu driver callers than core PCI code if this is
> just a platform standard.
>
In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
ACS upstream fwding: Must be implemented by Root Ports if the RC supports
                      Redirected Request Validation;
-- which means, if a Root port allows a peer-to-peer transaction to another
    one of its ports, then it has to support ACS.

So, this means that:
(a) if a Root complex with multiple ports can't do peer-to-peer to another port,
     ACS isn't needed
(b) if a Root complex w/multiple ports can do peer-to-peer to another port,
     it must have ACS capability if it does...
and since the linux code turns on ACS for all ports with an ACS cap,
it degenerates (in Linux) that all Root ports are implementing the
end functionality of ACS==on, all traffic goes up to IOMMU in RC.
  
>>> So we assume that a redirect at the point of the iommu will factor in
>>> iommu translation.
>>>
>>>> If so, I think you're trying to find out the closest upstream device X
>>>> such that everything leading to X has ACS enabled.  Every device below
>>>> X can DMA freely to other devices below X, so they would all have to
>>>> be in the same isolated group.
>>>
>>> Yes
>>>
>>>> I tried to work through some examples to develop some intuition about this:
>>>
>>> (inserting fixed url)
>>>> http://www.asciiflow.com/#3736558963405980039
>>>
>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>
>>> Hmm, the latter is the assumption above.  For the former, I think
>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>> except on PCIe.  I'll need to work on making that more generic.
>>>
>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>> PCIe; seems wrong)
>>>
>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>> input devices, so this was passing for me.  I'll need to incorporate
>>> that generically.
>>>
>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>> doesn't have ACS)
>>>
>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>> port, so maybe they're just assuming it's always enabled or that the
>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>> might want "from" and "to" struct pci_dev parameters to make it more
>>> flexible where the iommu lives in the system.
>>>
>> see comment above wrt root ports that have IOMMUs in them.
>
> Except it really seems to be platform convention where the IOMMU lives.
> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> for and from that we can make assumptions about where it physically
> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> function on the root bus.  For now I'm just allowing
> pci_acs_path_enabled to take NULL for and end, which means "up to the
> root bus".
>
ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
For AMD-Vi, I thought the same held true, ATM, but I have to dig through
yet-another spec (or ask Joerg to check-in to this thread & provide the details).

>>
>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>
>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>> don't think multifunction plays a role other than how much do we trust
>>> the implementation to not allow back channels between functions (the
>>> answer should probably be not at all).
>>>
>> correct. ACS is a *bridge* property.
>> The unknown wrt multifunction devices is that such devices *could* be implemented
>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>> btwn the functions within a device.
>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>> and allow parent bridge re-route it back down if peer-to-peer is desired.
>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>> determine this status about a device (via pci cfg/cap space).
>
> Well, there is actually a section of the ACS part of the spec
> identifying valid flags for multifunction devices.  Secretly I'd like to
> use this as justification for blacklisting all multifunction devices
> that don't explicitly support ACS, but that makes for pretty course
> granularity.  For instance, all these devices end up in a group:
>
>     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>
>    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>
> And these in another:
>
>     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>     +-15.2-[08]--
>     +-15.3-[09]--
>
>    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>
> Am I misinterpreting the spec or is this the correct, if strict,
> interpretation?
>
> Alex
>
Well, in digging into the ACS-support in Root port question above,
I just found out about this ACS support status for multifunctions too.
I need more time to read/digest, but a quick read says MFDs should have
an ACS cap with relevant RO-status & control bits to direct/control
peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
peer-to-peer can happen, and thus, the above have to be in the same iommu group.
Unfortunately, I think a large lot of MFDs don't have ACS caps,
and don't/can't do peer-to-peer, so the above is heavy-handed,
albeit to spec.
Maybe we need a (large?) pci-quirk for the list of existing
MFDs that don't have ACS caps that would enable the above devices
to be in separate groups.
On the flip side, it solves some of the quirks for MFDs that
use the wrong BDF in their src-id dma packets! :) -- they default
to the same group now...
Alex Williamson - May 19, 2012, 2:47 a.m.
On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
> On 05/18/2012 06:02 PM, Alex Williamson wrote:
> > On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> >> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> >>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >>>> <alex.williamson@redhat.com>   wrote:
> >>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>>>> <alex.williamson@redhat.com>   wrote:
> >>>>>>> In a PCIe environment, transactions aren't always required to
> >>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>>>> chain.  This information can then be used in grouping to create
> >>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>>>
> >>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>>>
> >>>>> Right, maybe this should be:
> >>>>>
> >>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>>>
> >> +1; there is a global in the PCI code, pci_acs_enable,
> >> and a function pci_enable_acs(), which the above name certainly
> >> confuses.  I recommend  pci_find_top_acs_bridge()
> >> would be most descriptive.
> Finally, with my email filters fixed, I can see this email... :)

Welcome back ;)

> > Yep, the new API I'm working with is:
> >
> > bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> > bool pci_acs_path_enabled(struct pci_dev *start,
> >                            struct pci_dev *end, u16 acs_flags);
> >
> ok.
> 
> >>>>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>>>
> >>>>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>>>> guessing it returns the root of a subtree that must be contained in a
> >>>>>> group.
> >>>>>
> >>>>> Any intermediate switch between an endpoint and the root bus can
> >>>>> redirect a dma access without iommu translation,
> >>>>
> >>>> Is this "redirection" just the normal PCI bridge forwarding that
> >>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >>>> ranges that are forwarded from primary to secondary interface, and the
> >>>> inverse ranges are forwarded from secondary to primary?  For example,
> >>>> here:
> >>>>
> >>>>                      ^
> >>>>                      |
> >>>>             +--------+-------+
> >>>>             |                |
> >>>>      +------+-----+    +-----++-----+
> >>>>      | Downstream |    | Downstream |
> >>>>      |    Port    |    |    Port    |
> >>>>      |   06:05.0  |    |   06:06.0  |
> >>>>      +------+-----+    +------+-----+
> >>>>             |                 |
> >>>>        +----v----+       +----v----+
> >>>>        | Endpoint|       | Endpoint|
> >>>>        | 07:00.0 |       | 08:00.0 |
> >>>>        +---------+       +---------+
> >>>>
> >>>> that rule is all that's needed for a transaction from 07:00.0 to be
> >>>> forwarded from upstream to the internal switch bus 06, then claimed by
> >>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >>>> nothing specific to PCIe.
> >>>
> >>> Right, I think the main PCI difference is the point-to-point nature of
> >>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> >>> devices talking to each other, but on PCIe the transaction makes a
> >>> U-turn at some point and heads out another downstream port.  ACS allows
> >>> us to prevent that from happening.
> >>>
> >> detail: PCIe up/downstream routing is really done by an internal switch;
> >>           ACS forces the legacy, PCI base-limit address routing and *forces*
> >>           the switch to always route the transaction from a downstream port
> >>           to the upstream port.
> >>
> >>>> I don't understand ACS very well, but it looks like it basically
> >>>> provides ways to prevent that peer-to-peer forwarding, so transactions
> >>>> would be sent upstream toward the root (and specifically, the IOMMU)
> >>>> instead of being directly claimed by 06:06.0.
> >>>
> >>> Yep, that's my meager understanding as well.
> >>>
> >> +1
> >>
> >>>>> so we're looking for
> >>>>> the furthest upstream device for which acs is enabled all the way up to
> >>>>> the root bus.
> >>>>
> >>>> Correct me if this is wrong: To force device A's DMAs to be processed
> >>>> by an IOMMU, ACS must be enabled on the root port and every downstream
> >>>> port along the path to A.
> >>>
> >>> Yes, modulo this comment in libvirt source:
> >>>
> >>>       /* if we have no parent, and this is the root bus, ACS doesn't come
> >>>        * into play since devices on the root bus can't P2P without going
> >>>        * through the root IOMMU.
> >>>        */
> >>>
> >> Correct. PCIe spec says roots must support ACS. I believe all the
> >> root bridges that have an IOMMU have ACS wired in/on.
> >
> > Would you mind looking for the paragraph that says this?  I'd rather
> > code this into the iommu driver callers than core PCI code if this is
> > just a platform standard.
> >
> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>                       Redirected Request Validation;

Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.

> -- which means, if a Root port allows a peer-to-peer transaction to another
>     one of its ports, then it has to support ACS.

I don't get that at all from 6.12.1.1, especially given the first
sentence of that section:

        This section applies to Root Ports and Downstream Switch Ports
        that implement an ACS Extended Capability structure.
        

> So, this means that:
> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>      ACS isn't needed
> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>      it must have ACS capability if it does...
> and since the linux code turns on ACS for all ports with an ACS cap,
> it degenerates (in Linux) that all Root ports are implementing the
> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
>   
> >>> So we assume that a redirect at the point of the iommu will factor in
> >>> iommu translation.
> >>>
> >>>> If so, I think you're trying to find out the closest upstream device X
> >>>> such that everything leading to X has ACS enabled.  Every device below
> >>>> X can DMA freely to other devices below X, so they would all have to
> >>>> be in the same isolated group.
> >>>
> >>> Yes
> >>>
> >>>> I tried to work through some examples to develop some intuition about this:
> >>>
> >>> (inserting fixed url)
> >>>> http://www.asciiflow.com/#3736558963405980039
> >>>
> >>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >>>> if 00:00.0 is PCIe or if RP has ACS?))
> >>>
> >>> Hmm, the latter is the assumption above.  For the former, I think
> >>> libvirt was probably assuming that PCI devices must have a PCIe device
> >>> upstream from them because x86 doesn't have assignment friendly IOMMUs
> >>> except on PCIe.  I'll need to work on making that more generic.
> >>>
> >>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >>>> PCIe; seems wrong)
> >>>
> >>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> >>> input devices, so this was passing for me.  I'll need to incorporate
> >>> that generically.
> >>>
> >>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >>>> doesn't have ACS)
> >>>
> >>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> >>> port, so maybe they're just assuming it's always enabled or that the
> >>> precedence favors IOMMU translation.  I'm also starting to think that we
> >>> might want "from" and "to" struct pci_dev parameters to make it more
> >>> flexible where the iommu lives in the system.
> >>>
> >> see comment above wrt root ports that have IOMMUs in them.
> >
> > Except it really seems to be platform convention where the IOMMU lives.
> > The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> > for and from that we can make assumptions about where it physically
> > lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> > function on the root bus.  For now I'm just allowing
> > pci_acs_path_enabled to take NULL for and end, which means "up to the
> > root bus".
> >
> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
> yet-another spec (or ask Joerg to check-in to this thread & provide the details).

But are we programming to convention or spec?  And I'm still confused
about why we assume the root port isn't susceptible to redirection
before IOMMU translation.  One of the benefits of the
pci_acs_path_enable() API is that it pushes convention out to the IOMMU
driver.  So it's intel-iommu.c's problem whether to test for ACS support
to the RC or to a given level (and for that matter know whether IOMMU
translation takes precedence over redirection in the RC).

> >>
> >>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >>>> a bridge; seems wrong if 04:00 is a multi-function device)
> >>>
> >>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> >>> don't think multifunction plays a role other than how much do we trust
> >>> the implementation to not allow back channels between functions (the
> >>> answer should probably be not at all).
> >>>
> >> correct. ACS is a *bridge* property.
> >> The unknown wrt multifunction devices is that such devices *could* be implemented
> >> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> >> btwn the functions within a device.
> >> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> >> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> >> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> >> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> >> and allow parent bridge re-route it back down if peer-to-peer is desired.
> >> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> >> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> >> determine this status about a device (via pci cfg/cap space).
> >
> > Well, there is actually a section of the ACS part of the spec
> > identifying valid flags for multifunction devices.  Secretly I'd like to
> > use this as justification for blacklisting all multifunction devices
> > that don't explicitly support ACS, but that makes for pretty course
> > granularity.  For instance, all these devices end up in a group:
> >
> >     +-14.0  ATI Technologies Inc SBx00 SMBus Controller
> >     +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
> >     +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
> >     +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
> >
> >    00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> >
> > And these in another:
> >
> >     +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
> >     +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
> >     +-15.2-[08]--
> >     +-15.3-[09]--
> >
> >    00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
> >    00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
> >    00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
> >    00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
> >
> > Am I misinterpreting the spec or is this the correct, if strict,
> > interpretation?
> >
> > Alex
> >
> Well, in digging into the ACS-support in Root port question above,
> I just found out about this ACS support status for multifunctions too.
> I need more time to read/digest, but a quick read says MFDs should have
> an ACS cap with relevant RO-status & control bits to direct/control
> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
> Unfortunately, I think a large lot of MFDs don't have ACS caps,
> and don't/can't do peer-to-peer, so the above is heavy-handed,
> albeit to spec.
> Maybe we need a (large?) pci-quirk for the list of existing
> MFDs that don't have ACS caps that would enable the above devices
> to be in separate groups.
> On the flip side, it solves some of the quirks for MFDs that
> use the wrong BDF in their src-id dma packets! :) -- they default
> to the same group now...

Yep, sounds like you might agree with my patch, it's heavy handed, but
seems to adhere to the spec.  That probably just means we need an option
to allow a more lenient interpretation, that maybe we don't have to
support.  Thanks,

Alex
Don Dutile - May 21, 2012, 1:31 p.m.
On 05/18/2012 10:47 PM, Alex Williamson wrote:
> On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
>> On 05/18/2012 06:02 PM, Alex Williamson wrote:
>>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>>>> <alex.williamson@redhat.com>    wrote:
>>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>>>> <alex.williamson@redhat.com>    wrote:
>>>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>>>
>>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>>>
>>>>>>> Right, maybe this should be:
>>>>>>>
>>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>>>
>>>> +1; there is a global in the PCI code, pci_acs_enable,
>>>> and a function pci_enable_acs(), which the above name certainly
>>>> confuses.  I recommend  pci_find_top_acs_bridge()
>>>> would be most descriptive.
>> Finally, with my email filters fixed, I can see this email... :)
>
> Welcome back ;)
>
Indeed... and I recvd 3 copies of this reply,
so the pendulum has flipped the other direction... ;-)

>>> Yep, the new API I'm working with is:
>>>
>>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
>>> bool pci_acs_path_enabled(struct pci_dev *start,
>>>                             struct pci_dev *end, u16 acs_flags);
>>>
>> ok.
>>
>>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>>>
>>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>>>> group.
>>>>>>>
>>>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>>>> redirect a dma access without iommu translation,
>>>>>>
>>>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>>>> ranges that are forwarded from primary to secondary interface, and the
>>>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>>>> here:
>>>>>>
>>>>>>                       ^
>>>>>>                       |
>>>>>>              +--------+-------+
>>>>>>              |                |
>>>>>>       +------+-----+    +-----++-----+
>>>>>>       | Downstream |    | Downstream |
>>>>>>       |    Port    |    |    Port    |
>>>>>>       |   06:05.0  |    |   06:06.0  |
>>>>>>       +------+-----+    +------+-----+
>>>>>>              |                 |
>>>>>>         +----v----+       +----v----+
>>>>>>         | Endpoint|       | Endpoint|
>>>>>>         | 07:00.0 |       | 08:00.0 |
>>>>>>         +---------+       +---------+
>>>>>>
>>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>>>> nothing specific to PCIe.
>>>>>
>>>>> Right, I think the main PCI difference is the point-to-point nature of
>>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>>>> devices talking to each other, but on PCIe the transaction makes a
>>>>> U-turn at some point and heads out another downstream port.  ACS allows
>>>>> us to prevent that from happening.
>>>>>
>>>> detail: PCIe up/downstream routing is really done by an internal switch;
>>>>            ACS forces the legacy, PCI base-limit address routing and *forces*
>>>>            the switch to always route the transaction from a downstream port
>>>>            to the upstream port.
>>>>
>>>>>> I don't understand ACS very well, but it looks like it basically
>>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>>>> instead of being directly claimed by 06:06.0.
>>>>>
>>>>> Yep, that's my meager understanding as well.
>>>>>
>>>> +1
>>>>
>>>>>>> so we're looking for
>>>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>>>> the root bus.
>>>>>>
>>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>>>> port along the path to A.
>>>>>
>>>>> Yes, modulo this comment in libvirt source:
>>>>>
>>>>>        /* if we have no parent, and this is the root bus, ACS doesn't come
>>>>>         * into play since devices on the root bus can't P2P without going
>>>>>         * through the root IOMMU.
>>>>>         */
>>>>>
>>>> Correct. PCIe spec says roots must support ACS. I believe all the
>>>> root bridges that have an IOMMU have ACS wired in/on.
>>>
>>> Would you mind looking for the paragraph that says this?  I'd rather
>>> code this into the iommu driver callers than core PCI code if this is
>>> just a platform standard.
>>>
>> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
>> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>>                        Redirected Request Validation;
>
> Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
>
UF?

>> -- which means, if a Root port allows a peer-to-peer transaction to another
>>      one of its ports, then it has to support ACS.
>
> I don't get that at all from 6.12.1.1, especially given the first
> sentence of that section:
>
>          This section applies to Root Ports and Downstream Switch Ports
>          that implement an ACS Extended Capability structure.
>
>
hmmm, well I did.  The root port section is different than Downstream ports
as well.  downstream ports *must* support peer-xfers due to positive decoding
of base/limit addresses, and ACS is optional in downstream ports.
Peer-to-peer btwn root ports is optional.
so, I don't get what you don't get... ;-)
.. but, I understand how the spec can be read/interpreted differently,
    given it's clarity (did lawyers write it?!?!!), so I could be interpreting
    incorrectly.

>> So, this means that:
>> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>>       ACS isn't needed
>> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>>       it must have ACS capability if it does...
>> and since the linux code turns on ACS for all ports with an ACS cap,
>> it degenerates (in Linux) that all Root ports are implementing the
>> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
>>
I thought I explained how I interpreted the root-port part of ACS above,
so maybe you can tell me how you think my interpretation is incorrect.

>>>>> So we assume that a redirect at the point of the iommu will factor in
>>>>> iommu translation.
>>>>>
>>>>>> If so, I think you're trying to find out the closest upstream device X
>>>>>> such that everything leading to X has ACS enabled.  Every device below
>>>>>> X can DMA freely to other devices below X, so they would all have to
>>>>>> be in the same isolated group.
>>>>>
>>>>> Yes
>>>>>
>>>>>> I tried to work through some examples to develop some intuition about this:
>>>>>
>>>>> (inserting fixed url)
>>>>>> http://www.asciiflow.com/#3736558963405980039
>>>>>
>>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>>>
>>>>> Hmm, the latter is the assumption above.  For the former, I think
>>>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>>>> except on PCIe.  I'll need to work on making that more generic.
>>>>>
>>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>>>> PCIe; seems wrong)
>>>>>
>>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>>>> input devices, so this was passing for me.  I'll need to incorporate
>>>>> that generically.
>>>>>
>>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>>>> doesn't have ACS)
>>>>>
>>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>>>> port, so maybe they're just assuming it's always enabled or that the
>>>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>>>> might want "from" and "to" struct pci_dev parameters to make it more
>>>>> flexible where the iommu lives in the system.
>>>>>
>>>> see comment above wrt root ports that have IOMMUs in them.
>>>
>>> Except it really seems to be platform convention where the IOMMU lives.
>>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
>>> for and from that we can make assumptions about where it physically
>>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
>>> function on the root bus.  For now I'm just allowing
>>> pci_acs_path_enabled to take NULL for and end, which means "up to the
>>> root bus".
>>>
>> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
>> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
>> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
>> yet-another spec (or ask Joerg to check-in to this thread&  provide the details).
>
> But are we programming to convention or spec?  And I'm still confused
> about why we assume the root port isn't susceptible to redirection
> before IOMMU translation.  One of the benefits of the
> pci_acs_path_enable() API is that it pushes convention out to the IOMMU
> driver.  So it's intel-iommu.c's problem whether to test for ACS support
> to the RC or to a given level (and for that matter know whether IOMMU
> translation takes precedence over redirection in the RC).
>
Agreed. the best design here would be for the intel-iommu.c to test for ACS
wrt the root port (not to be confused with root complex) since Intel-IOMMUs
are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to a PCI bus,
so maybe it can be simplified in that case.... but it may be best to mimic
the iommu-set-acs-for-root-port attribute the same manner.

>>>>
>>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>>>
>>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>>>> don't think multifunction plays a role other than how much do we trust
>>>>> the implementation to not allow back channels between functions (the
>>>>> answer should probably be not at all).
>>>>>
>>>> correct. ACS is a *bridge* property.
>>>> The unknown wrt multifunction devices is that such devices *could* be implemented
>>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>>>> btwn the functions within a device.
>>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>>>> and allow parent bridge re-route it back down if peer-to-peer is desired.
>>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>>>> determine this status about a device (via pci cfg/cap space).
>>>
>>> Well, there is actually a section of the ACS part of the spec
>>> identifying valid flags for multifunction devices.  Secretly I'd like to
>>> use this as justification for blacklisting all multifunction devices
>>> that don't explicitly support ACS, but that makes for pretty course
>>> granularity.  For instance, all these devices end up in a group:
>>>
>>>      +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>>>      +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>>>      +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>>>      +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>>>
>>>     00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>>>
>>> And these in another:
>>>
>>>      +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>>>      +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>>>      +-15.2-[08]--
>>>      +-15.3-[09]--
>>>
>>>     00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>>>     00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>>>     00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>>>     00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>>>
>>> Am I misinterpreting the spec or is this the correct, if strict,
>>> interpretation?
>>>
>>> Alex
>>>
>> Well, in digging into the ACS-support in Root port question above,
>> I just found out about this ACS support status for multifunctions too.
>> I need more time to read/digest, but a quick read says MFDs should have
>> an ACS cap with relevant RO-status&  control bits to direct/control
>> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
>> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
>> Unfortunately, I think a large lot of MFDs don't have ACS caps,
>> and don't/can't do peer-to-peer, so the above is heavy-handed,
>> albeit to spec.
>> Maybe we need a (large?) pci-quirk for the list of existing
>> MFDs that don't have ACS caps that would enable the above devices
>> to be in separate groups.
>> On the flip side, it solves some of the quirks for MFDs that
>> use the wrong BDF in their src-id dma packets! :) -- they default
>> to the same group now...
>
> Yep, sounds like you might agree with my patch, it's heavy handed, but
Yes, but we should design-in a quirk check list for MFDs,
b/c we already know some will fail when they should pass this check,
b/c the hw was made post ACS, or the vendors didn't see the benefit
of ACS reporting (even if the funcitonality was enabling bit-settings
that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
dev-assignment to virt guests, etc.).

> seems to adhere to the spec.  That probably just means we need an option
> to allow a more lenient interpretation, that maybe we don't have to
> support.  Thanks,
Right.  As I said, a hook to do quirk-level additions from the get-go
would speed this expected need/addition.

>
> Alex
>
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Alex Williamson - May 21, 2012, 2:59 p.m.
On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:
> On 05/18/2012 10:47 PM, Alex Williamson wrote:
> > On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
> >> On 05/18/2012 06:02 PM, Alex Williamson wrote:
> >>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
> >>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
> >>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
> >>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
> >>>>>> <alex.williamson@redhat.com>    wrote:
> >>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
> >>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
> >>>>>>>> <alex.williamson@redhat.com>    wrote:
> >>>>>>>>> In a PCIe environment, transactions aren't always required to
> >>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
> >>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
> >>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
> >>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
> >>>>>>>>> returns the furthest downstream device with a complete PCI ACS
> >>>>>>>>> chain.  This information can then be used in grouping to create
> >>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
> >>>>>>>>
> >>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
> >>>>>>>
> >>>>>>> Right, maybe this should be:
> >>>>>>>
> >>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
> >>>>>>>
> >>>> +1; there is a global in the PCI code, pci_acs_enable,
> >>>> and a function pci_enable_acs(), which the above name certainly
> >>>> confuses.  I recommend  pci_find_top_acs_bridge()
> >>>> would be most descriptive.
> >> Finally, with my email filters fixed, I can see this email... :)
> >
> > Welcome back ;)
> >
> Indeed... and I recvd 3 copies of this reply,
> so the pendulum has flipped the other direction... ;-)
> 
> >>> Yep, the new API I'm working with is:
> >>>
> >>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
> >>> bool pci_acs_path_enabled(struct pci_dev *start,
> >>>                             struct pci_dev *end, u16 acs_flags);
> >>>
> >> ok.
> >>
> >>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
> >>>>>>>>
> >>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
> >>>>>>>> guessing it returns the root of a subtree that must be contained in a
> >>>>>>>> group.
> >>>>>>>
> >>>>>>> Any intermediate switch between an endpoint and the root bus can
> >>>>>>> redirect a dma access without iommu translation,
> >>>>>>
> >>>>>> Is this "redirection" just the normal PCI bridge forwarding that
> >>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
> >>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
> >>>>>> ranges that are forwarded from primary to secondary interface, and the
> >>>>>> inverse ranges are forwarded from secondary to primary?  For example,
> >>>>>> here:
> >>>>>>
> >>>>>>                       ^
> >>>>>>                       |
> >>>>>>              +--------+-------+
> >>>>>>              |                |
> >>>>>>       +------+-----+    +-----++-----+
> >>>>>>       | Downstream |    | Downstream |
> >>>>>>       |    Port    |    |    Port    |
> >>>>>>       |   06:05.0  |    |   06:06.0  |
> >>>>>>       +------+-----+    +------+-----+
> >>>>>>              |                 |
> >>>>>>         +----v----+       +----v----+
> >>>>>>         | Endpoint|       | Endpoint|
> >>>>>>         | 07:00.0 |       | 08:00.0 |
> >>>>>>         +---------+       +---------+
> >>>>>>
> >>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
> >>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
> >>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
> >>>>>> nothing specific to PCIe.
> >>>>>
> >>>>> Right, I think the main PCI difference is the point-to-point nature of
> >>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
> >>>>> devices talking to each other, but on PCIe the transaction makes a
> >>>>> U-turn at some point and heads out another downstream port.  ACS allows
> >>>>> us to prevent that from happening.
> >>>>>
> >>>> detail: PCIe up/downstream routing is really done by an internal switch;
> >>>>            ACS forces the legacy, PCI base-limit address routing and *forces*
> >>>>            the switch to always route the transaction from a downstream port
> >>>>            to the upstream port.
> >>>>
> >>>>>> I don't understand ACS very well, but it looks like it basically
> >>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
> >>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
> >>>>>> instead of being directly claimed by 06:06.0.
> >>>>>
> >>>>> Yep, that's my meager understanding as well.
> >>>>>
> >>>> +1
> >>>>
> >>>>>>> so we're looking for
> >>>>>>> the furthest upstream device for which acs is enabled all the way up to
> >>>>>>> the root bus.
> >>>>>>
> >>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
> >>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
> >>>>>> port along the path to A.
> >>>>>
> >>>>> Yes, modulo this comment in libvirt source:
> >>>>>
> >>>>>        /* if we have no parent, and this is the root bus, ACS doesn't come
> >>>>>         * into play since devices on the root bus can't P2P without going
> >>>>>         * through the root IOMMU.
> >>>>>         */
> >>>>>
> >>>> Correct. PCIe spec says roots must support ACS. I believe all the
> >>>> root bridges that have an IOMMU have ACS wired in/on.
> >>>
> >>> Would you mind looking for the paragraph that says this?  I'd rather
> >>> code this into the iommu driver callers than core PCI code if this is
> >>> just a platform standard.
> >>>
> >> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
> >> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
> >>                        Redirected Request Validation;
> >
> > Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
> >
> UF?

Upstream Forwarding.  So basically that spec quote is saying that if the
RC supports one aspect of ACS then it must support this other.  But that
doesn't say to me that it must support ACS to begin with.

> >> -- which means, if a Root port allows a peer-to-peer transaction to another
> >>      one of its ports, then it has to support ACS.
> >
> > I don't get that at all from 6.12.1.1, especially given the first
> > sentence of that section:
> >
> >          This section applies to Root Ports and Downstream Switch Ports
> >          that implement an ACS Extended Capability structure.
> >
> >
> hmmm, well I did.  The root port section is different than Downstream ports
> as well.  downstream ports *must* support peer-xfers due to positive decoding
> of base/limit addresses, and ACS is optional in downstream ports.
> Peer-to-peer btwn root ports is optional.
> so, I don't get what you don't get... ;-)
> .. but, I understand how the spec can be read/interpreted differently,
>     given it's clarity (did lawyers write it?!?!!), so I could be interpreting
>     incorrectly.
> 
> >> So, this means that:
> >> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
> >>       ACS isn't needed
> >> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
> >>       it must have ACS capability if it does...
> >> and since the linux code turns on ACS for all ports with an ACS cap,
> >> it degenerates (in Linux) that all Root ports are implementing the
> >> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
> >>
> I thought I explained how I interpreted the root-port part of ACS above,
> so maybe you can tell me how you think my interpretation is incorrect.

I don't know that it's incorrect, but I don't see how you're making the
leap you are.  I think the version I have now of the patch leaves this
nicely to the iommu drivers.  It would make sense, not that hardware
always makes sense, that anything with an iommu is going to hardwire
translations through the iommu when enabled or they couldn't reasonable
do any kind of peer-to-peer with iommu.

> >>>>> So we assume that a redirect at the point of the iommu will factor in
> >>>>> iommu translation.
> >>>>>
> >>>>>> If so, I think you're trying to find out the closest upstream device X
> >>>>>> such that everything leading to X has ACS enabled.  Every device below
> >>>>>> X can DMA freely to other devices below X, so they would all have to
> >>>>>> be in the same isolated group.
> >>>>>
> >>>>> Yes
> >>>>>
> >>>>>> I tried to work through some examples to develop some intuition about this:
> >>>>>
> >>>>> (inserting fixed url)
> >>>>>> http://www.asciiflow.com/#3736558963405980039
> >>>>>
> >>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
> >>>>>> if 00:00.0 is PCIe or if RP has ACS?))
> >>>>>
> >>>>> Hmm, the latter is the assumption above.  For the former, I think
> >>>>> libvirt was probably assuming that PCI devices must have a PCIe device
> >>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
> >>>>> except on PCIe.  I'll need to work on making that more generic.
> >>>>>
> >>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
> >>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
> >>>>>> PCIe; seems wrong)
> >>>>>
> >>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
> >>>>> input devices, so this was passing for me.  I'll need to incorporate
> >>>>> that generically.
> >>>>>
> >>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
> >>>>>> doesn't have ACS)
> >>>>>
> >>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
> >>>>> port, so maybe they're just assuming it's always enabled or that the
> >>>>> precedence favors IOMMU translation.  I'm also starting to think that we
> >>>>> might want "from" and "to" struct pci_dev parameters to make it more
> >>>>> flexible where the iommu lives in the system.
> >>>>>
> >>>> see comment above wrt root ports that have IOMMUs in them.
> >>>
> >>> Except it really seems to be platform convention where the IOMMU lives.
> >>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
> >>> for and from that we can make assumptions about where it physically
> >>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
> >>> function on the root bus.  For now I'm just allowing
> >>> pci_acs_path_enabled to take NULL for and end, which means "up to the
> >>> root bus".
> >>>
> >> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
> >> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
> >> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
> >> yet-another spec (or ask Joerg to check-in to this thread&  provide the details).
> >
> > But are we programming to convention or spec?  And I'm still confused
> > about why we assume the root port isn't susceptible to redirection
> > before IOMMU translation.  One of the benefits of the
> > pci_acs_path_enable() API is that it pushes convention out to the IOMMU
> > driver.  So it's intel-iommu.c's problem whether to test for ACS support
> > to the RC or to a given level (and for that matter know whether IOMMU
> > translation takes precedence over redirection in the RC).
> >
> Agreed. the best design here would be for the intel-iommu.c to test for ACS
> wrt the root port (not to be confused with root complex) since Intel-IOMMUs
> are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to a PCI bus,
> so maybe it can be simplified in that case.... but it may be best to mimic
> the iommu-set-acs-for-root-port attribute the same manner.
> 
> >>>>
> >>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
> >>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
> >>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
> >>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
> >>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
> >>>>>
> >>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
> >>>>> don't think multifunction plays a role other than how much do we trust
> >>>>> the implementation to not allow back channels between functions (the
> >>>>> answer should probably be not at all).
> >>>>>
> >>>> correct. ACS is a *bridge* property.
> >>>> The unknown wrt multifunction devices is that such devices *could* be implemented
> >>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
> >>>> btwn the functions within a device.
> >>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
> >>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
> >>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
> >>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
> >>>> and allow parent bridge re-route it back down if peer-to-peer is desired.
> >>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
> >>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
> >>>> determine this status about a device (via pci cfg/cap space).
> >>>
> >>> Well, there is actually a section of the ACS part of the spec
> >>> identifying valid flags for multifunction devices.  Secretly I'd like to
> >>> use this as justification for blacklisting all multifunction devices
> >>> that don't explicitly support ACS, but that makes for pretty course
> >>> granularity.  For instance, all these devices end up in a group:
> >>>
> >>>      +-14.0  ATI Technologies Inc SBx00 SMBus Controller
> >>>      +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
> >>>      +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
> >>>      +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
> >>>
> >>>     00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
> >>>
> >>> And these in another:
> >>>
> >>>      +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
> >>>      +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
> >>>      +-15.2-[08]--
> >>>      +-15.3-[09]--
> >>>
> >>>     00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
> >>>     00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
> >>>     00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
> >>>     00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
> >>>
> >>> Am I misinterpreting the spec or is this the correct, if strict,
> >>> interpretation?
> >>>
> >>> Alex
> >>>
> >> Well, in digging into the ACS-support in Root port question above,
> >> I just found out about this ACS support status for multifunctions too.
> >> I need more time to read/digest, but a quick read says MFDs should have
> >> an ACS cap with relevant RO-status&  control bits to direct/control
> >> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
> >> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
> >> Unfortunately, I think a large lot of MFDs don't have ACS caps,
> >> and don't/can't do peer-to-peer, so the above is heavy-handed,
> >> albeit to spec.
> >> Maybe we need a (large?) pci-quirk for the list of existing
> >> MFDs that don't have ACS caps that would enable the above devices
> >> to be in separate groups.
> >> On the flip side, it solves some of the quirks for MFDs that
> >> use the wrong BDF in their src-id dma packets! :) -- they default
> >> to the same group now...
> >
> > Yep, sounds like you might agree with my patch, it's heavy handed, but
> Yes, but we should design-in a quirk check list for MFDs,
> b/c we already know some will fail when they should pass this check,
> b/c the hw was made post ACS, or the vendors didn't see the benefit
> of ACS reporting (even if the funcitonality was enabling bit-settings
> that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
> dev-assignment to virt guests, etc.).
> 
> > seems to adhere to the spec.  That probably just means we need an option
> > to allow a more lenient interpretation, that maybe we don't have to
> > support.  Thanks,
> Right.  As I said, a hook to do quirk-level additions from the get-go
> would speed this expected need/addition.

Ok, a quirk should be easy there.  Thanks,

Alex
Don Dutile - May 21, 2012, 6:14 p.m.
On 05/21/2012 10:59 AM, Alex Williamson wrote:
> On Mon, 2012-05-21 at 09:31 -0400, Don Dutile wrote:
>> On 05/18/2012 10:47 PM, Alex Williamson wrote:
>>> On Fri, 2012-05-18 at 19:00 -0400, Don Dutile wrote:
>>>> On 05/18/2012 06:02 PM, Alex Williamson wrote:
>>>>> On Wed, 2012-05-16 at 09:29 -0400, Don Dutile wrote:
>>>>>> On 05/15/2012 05:09 PM, Alex Williamson wrote:
>>>>>>> On Tue, 2012-05-15 at 13:56 -0600, Bjorn Helgaas wrote:
>>>>>>>> On Mon, May 14, 2012 at 4:49 PM, Alex Williamson
>>>>>>>> <alex.williamson@redhat.com>     wrote:
>>>>>>>>> On Mon, 2012-05-14 at 16:02 -0600, Bjorn Helgaas wrote:
>>>>>>>>>> On Fri, May 11, 2012 at 4:56 PM, Alex Williamson
>>>>>>>>>> <alex.williamson@redhat.com>     wrote:
>>>>>>>>>>> In a PCIe environment, transactions aren't always required to
>>>>>>>>>>> reach the root bus before being re-routed.  Peer-to-peer DMA
>>>>>>>>>>> may actually not be seen by the IOMMU in these cases.  For
>>>>>>>>>>> IOMMU groups, we want to provide IOMMU drivers a way to detect
>>>>>>>>>>> these restrictions.  Provided with a PCI device, pci_acs_enabled
>>>>>>>>>>> returns the furthest downstream device with a complete PCI ACS
>>>>>>>>>>> chain.  This information can then be used in grouping to create
>>>>>>>>>>> fully isolated groups.  ACS chain logic extracted from libvirt.
>>>>>>>>>>
>>>>>>>>>> The name "pci_acs_enabled()" sounds like it returns a boolean, but it doesn't.
>>>>>>>>>
>>>>>>>>> Right, maybe this should be:
>>>>>>>>>
>>>>>>>>> struct pci_dev *pci_find_upstream_acs(struct pci_dev *pdev);
>>>>>>>>>
>>>>>> +1; there is a global in the PCI code, pci_acs_enable,
>>>>>> and a function pci_enable_acs(), which the above name certainly
>>>>>> confuses.  I recommend  pci_find_top_acs_bridge()
>>>>>> would be most descriptive.
>>>> Finally, with my email filters fixed, I can see this email... :)
>>>
>>> Welcome back ;)
>>>
>> Indeed... and I recvd 3 copies of this reply,
>> so the pendulum has flipped the other direction... ;-)
>>
>>>>> Yep, the new API I'm working with is:
>>>>>
>>>>> bool pci_acs_enabled(struct pci_dev *pdev, u16 acs_flags);
>>>>> bool pci_acs_path_enabled(struct pci_dev *start,
>>>>>                              struct pci_dev *end, u16 acs_flags);
>>>>>
>>>> ok.
>>>>
>>>>>>>>>> I'm not sure what "a complete PCI ACS chain" means.
>>>>>>>>>>
>>>>>>>>>> The function starts from "dev" and searches *upstream*, so I'm
>>>>>>>>>> guessing it returns the root of a subtree that must be contained in a
>>>>>>>>>> group.
>>>>>>>>>
>>>>>>>>> Any intermediate switch between an endpoint and the root bus can
>>>>>>>>> redirect a dma access without iommu translation,
>>>>>>>>
>>>>>>>> Is this "redirection" just the normal PCI bridge forwarding that
>>>>>>>> allows peer-to-peer transactions, i.e., the rule (from P2P bridge
>>>>>>>> spec, rev 1.2, sec 4.1) that the bridge apertures define address
>>>>>>>> ranges that are forwarded from primary to secondary interface, and the
>>>>>>>> inverse ranges are forwarded from secondary to primary?  For example,
>>>>>>>> here:
>>>>>>>>
>>>>>>>>                        ^
>>>>>>>>                        |
>>>>>>>>               +--------+-------+
>>>>>>>>               |                |
>>>>>>>>        +------+-----+    +-----++-----+
>>>>>>>>        | Downstream |    | Downstream |
>>>>>>>>        |    Port    |    |    Port    |
>>>>>>>>        |   06:05.0  |    |   06:06.0  |
>>>>>>>>        +------+-----+    +------+-----+
>>>>>>>>               |                 |
>>>>>>>>          +----v----+       +----v----+
>>>>>>>>          | Endpoint|       | Endpoint|
>>>>>>>>          | 07:00.0 |       | 08:00.0 |
>>>>>>>>          +---------+       +---------+
>>>>>>>>
>>>>>>>> that rule is all that's needed for a transaction from 07:00.0 to be
>>>>>>>> forwarded from upstream to the internal switch bus 06, then claimed by
>>>>>>>> 06:06.0 and forwarded downstream to 08:00.0.  This is plain old PCI,
>>>>>>>> nothing specific to PCIe.
>>>>>>>
>>>>>>> Right, I think the main PCI difference is the point-to-point nature of
>>>>>>> PCIe vs legacy PCI bus.  On a legacy PCI bus there's no way to prevent
>>>>>>> devices talking to each other, but on PCIe the transaction makes a
>>>>>>> U-turn at some point and heads out another downstream port.  ACS allows
>>>>>>> us to prevent that from happening.
>>>>>>>
>>>>>> detail: PCIe up/downstream routing is really done by an internal switch;
>>>>>>             ACS forces the legacy, PCI base-limit address routing and *forces*
>>>>>>             the switch to always route the transaction from a downstream port
>>>>>>             to the upstream port.
>>>>>>
>>>>>>>> I don't understand ACS very well, but it looks like it basically
>>>>>>>> provides ways to prevent that peer-to-peer forwarding, so transactions
>>>>>>>> would be sent upstream toward the root (and specifically, the IOMMU)
>>>>>>>> instead of being directly claimed by 06:06.0.
>>>>>>>
>>>>>>> Yep, that's my meager understanding as well.
>>>>>>>
>>>>>> +1
>>>>>>
>>>>>>>>> so we're looking for
>>>>>>>>> the furthest upstream device for which acs is enabled all the way up to
>>>>>>>>> the root bus.
>>>>>>>>
>>>>>>>> Correct me if this is wrong: To force device A's DMAs to be processed
>>>>>>>> by an IOMMU, ACS must be enabled on the root port and every downstream
>>>>>>>> port along the path to A.
>>>>>>>
>>>>>>> Yes, modulo this comment in libvirt source:
>>>>>>>
>>>>>>>         /* if we have no parent, and this is the root bus, ACS doesn't come
>>>>>>>          * into play since devices on the root bus can't P2P without going
>>>>>>>          * through the root IOMMU.
>>>>>>>          */
>>>>>>>
>>>>>> Correct. PCIe spec says roots must support ACS. I believe all the
>>>>>> root bridges that have an IOMMU have ACS wired in/on.
>>>>>
>>>>> Would you mind looking for the paragraph that says this?  I'd rather
>>>>> code this into the iommu driver callers than core PCI code if this is
>>>>> just a platform standard.
>>>>>
>>>> In section 6.12.1.1 of PCIe Base spec, rev 3.0, it states:
>>>> ACS upstream fwding: Must be implemented by Root Ports if the RC supports
>>>>                         Redirected Request Validation;
>>>
>>> Huh?  (If support ACS.RR then must support ACS.UF) != must support ACS.
>>>
>> UF?
>
> Upstream Forwarding.  So basically that spec quote is saying that if the
> RC supports one aspect of ACS then it must support this other.  But that
> doesn't say to me that it must support ACS to begin with.
>
It says if RC supports peer-to-peer btwn root ports, the root ports must
support ACS.... at least that's how I understood it.
but, as we've seen, there's spec, then there's reality...

>>>> -- which means, if a Root port allows a peer-to-peer transaction to another
>>>>       one of its ports, then it has to support ACS.
>>>
>>> I don't get that at all from 6.12.1.1, especially given the first
>>> sentence of that section:
>>>
>>>           This section applies to Root Ports and Downstream Switch Ports
>>>           that implement an ACS Extended Capability structure.
>>>
>>>
>> hmmm, well I did.  The root port section is different than Downstream ports
>> as well.  downstream ports *must* support peer-xfers due to positive decoding
>> of base/limit addresses, and ACS is optional in downstream ports.
>> Peer-to-peer btwn root ports is optional.
>> so, I don't get what you don't get... ;-)
>> .. but, I understand how the spec can be read/interpreted differently,
>>      given it's clarity (did lawyers write it?!?!!), so I could be interpreting
>>      incorrectly.
>>
>>>> So, this means that:
>>>> (a) if a Root complex with multiple ports can't do peer-to-peer to another port,
>>>>        ACS isn't needed
>>>> (b) if a Root complex w/multiple ports can do peer-to-peer to another port,
>>>>        it must have ACS capability if it does...
>>>> and since the linux code turns on ACS for all ports with an ACS cap,
>>>> it degenerates (in Linux) that all Root ports are implementing the
>>>> end functionality of ACS==on, all traffic goes up to IOMMU in RC.
>>>>
>> I thought I explained how I interpreted the root-port part of ACS above,
>> so maybe you can tell me how you think my interpretation is incorrect.
>
> I don't know that it's incorrect, but I don't see how you're making the
> leap you are.  I think the version I have now of the patch leaves this
> nicely to the iommu drivers.  It would make sense, not that hardware
> always makes sense, that anything with an iommu is going to hardwire
> translations through the iommu when enabled or they couldn't reasonable
> do any kind of peer-to-peer with iommu.
>
agreed. the version you have looks good,
and avoids/handles potential ACS cap idiosyncracies in RC's.

>>>>>>> So we assume that a redirect at the point of the iommu will factor in
>>>>>>> iommu translation.
>>>>>>>
>>>>>>>> If so, I think you're trying to find out the closest upstream device X
>>>>>>>> such that everything leading to X has ACS enabled.  Every device below
>>>>>>>> X can DMA freely to other devices below X, so they would all have to
>>>>>>>> be in the same isolated group.
>>>>>>>
>>>>>>> Yes
>>>>>>>
>>>>>>>> I tried to work through some examples to develop some intuition about this:
>>>>>>>
>>>>>>> (inserting fixed url)
>>>>>>>> http://www.asciiflow.com/#3736558963405980039
>>>>>>>
>>>>>>>> pci_acs_enabled(00:00.0) = 00:00.0 (on root bus (but doesn't it matter
>>>>>>>> if 00:00.0 is PCIe or if RP has ACS?))
>>>>>>>
>>>>>>> Hmm, the latter is the assumption above.  For the former, I think
>>>>>>> libvirt was probably assuming that PCI devices must have a PCIe device
>>>>>>> upstream from them because x86 doesn't have assignment friendly IOMMUs
>>>>>>> except on PCIe.  I'll need to work on making that more generic.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:01.0) = 00:01.0 (on root bus)
>>>>>>>> pci_acs_enabled(01:00.0) = 01:00.0 (acs_dev = 00:01.0, 01:00.0 is not
>>>>>>>> PCIe; seems wrong)
>>>>>>>
>>>>>>> Oops, I'm calling pci_find_upstream_pcie_bridge() first on any of my
>>>>>>> input devices, so this was passing for me.  I'll need to incorporate
>>>>>>> that generically.
>>>>>>>
>>>>>>>> pci_acs_enabled(00:02.0) = 00:02.0 (on root bus; seems wrong if RP
>>>>>>>> doesn't have ACS)
>>>>>>>
>>>>>>> Yeah, let me validate the libvirt assumption.  I see ACS on my root
>>>>>>> port, so maybe they're just assuming it's always enabled or that the
>>>>>>> precedence favors IOMMU translation.  I'm also starting to think that we
>>>>>>> might want "from" and "to" struct pci_dev parameters to make it more
>>>>>>> flexible where the iommu lives in the system.
>>>>>>>
>>>>>> see comment above wrt root ports that have IOMMUs in them.
>>>>>
>>>>> Except it really seems to be platform convention where the IOMMU lives.
>>>>> The DMAR for VT-d describes which devices and hierarchies a DRHD is used
>>>>> for and from that we can make assumptions about where it physically
>>>>> lives.  AMD-Vi exposes a PCI device as the IOMMU, but it's just a peer
>>>>> function on the root bus.  For now I'm just allowing
>>>>> pci_acs_path_enabled to take NULL for and end, which means "up to the
>>>>> root bus".
>>>>>
>>>> ATM, VT-d IOMMUs are only in the RCs, so, ACS at each downstream port
>>>> in a tree would/should return 'true' to the acs_enabled check when it gets to a Root port.
>>>> For AMD-Vi, I thought the same held true, ATM, but I have to dig through
>>>> yet-another spec (or ask Joerg to check-in to this thread&   provide the details).
>>>
>>> But are we programming to convention or spec?  And I'm still confused
>>> about why we assume the root port isn't susceptible to redirection
>>> before IOMMU translation.  One of the benefits of the
>>> pci_acs_path_enable() API is that it pushes convention out to the IOMMU
>>> driver.  So it's intel-iommu.c's problem whether to test for ACS support
>>> to the RC or to a given level (and for that matter know whether IOMMU
>>> translation takes precedence over redirection in the RC).
>>>
>> Agreed. the best design here would be for the intel-iommu.c to test for ACS
>> wrt the root port (not to be confused with root complex) since Intel-IOMMUs
>> are not 'attached' to a PCI bus.  Now, AMD's IOMMUs are attached to a PCI bus,
>> so maybe it can be simplified in that case.... but it may be best to mimic
>> the iommu-set-acs-for-root-port attribute the same manner.
>>
>>>>>>
>>>>>>>> pci_acs_enabled(02:00.0) = 00:02.0 (acs_dev = 00:02.0, 02:00.0 has no ACS cap)
>>>>>>>> pci_acs_enabled(03:00.0) = 00:02.0 (acs_dev = 00:02.0)
>>>>>>>> pci_acs_enabled(02:01.0) = 02:01.0 (acs_dev = 00:02.0, 02:01.0 has ACS enabled)
>>>>>>>> pci_acs_enabled(04:00.0) = 04:00.0 (acs_dev = 02:01.0, 04:00.0 is not
>>>>>>>> a bridge; seems wrong if 04:00 is a multi-function device)
>>>>>>>
>>>>>>> AIUI, ACS is not an endpoint property, so this is what should happen.  I
>>>>>>> don't think multifunction plays a role other than how much do we trust
>>>>>>> the implementation to not allow back channels between functions (the
>>>>>>> answer should probably be not at all).
>>>>>>>
>>>>>> correct. ACS is a *bridge* property.
>>>>>> The unknown wrt multifunction devices is that such devices *could* be implemented
>>>>>> by a hidden (not responding to PCI cfg accesses from downstream port) PCI bridge
>>>>>> btwn the functions within a device.
>>>>>> Such a bridge could allow peer-to-peer xactions and there is no way for OS's to
>>>>>> force ACS.  So, one has to ask the hw vendors if such a hidden device exists
>>>>>> in the implementation, and whether peer-to-peer is enabled/allowed -- a hidden PCI
>>>>>> bridge/PCIe-switch could just be hardwired to push all IO to upstream port,
>>>>>> and allow parent bridge re-route it back down if peer-to-peer is desired.
>>>>>> Debate exists whether multifunction devices are 'secure' b/c of this unknown.
>>>>>> Maybe a PCIe (min., SRIOV) spec change is needed in this area to
>>>>>> determine this status about a device (via pci cfg/cap space).
>>>>>
>>>>> Well, there is actually a section of the ACS part of the spec
>>>>> identifying valid flags for multifunction devices.  Secretly I'd like to
>>>>> use this as justification for blacklisting all multifunction devices
>>>>> that don't explicitly support ACS, but that makes for pretty course
>>>>> granularity.  For instance, all these devices end up in a group:
>>>>>
>>>>>       +-14.0  ATI Technologies Inc SBx00 SMBus Controller
>>>>>       +-14.2  ATI Technologies Inc SBx00 Azalia (Intel HDA)
>>>>>       +-14.3  ATI Technologies Inc SB7x0/SB8x0/SB9x0 LPC host controller
>>>>>       +-14.4-[05]----0e.0  VIA Technologies, Inc. VT6306/7/8 [Fire II(M)] IEEE 1394 OHCI Controller
>>>>>
>>>>>      00:14.4 PCI bridge: ATI Technologies Inc SBx00 PCI to PCI Bridge (rev 40)
>>>>>
>>>>> And these in another:
>>>>>
>>>>>       +-15.0-[06]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168B PCI Express Gigabit Ethernet controller
>>>>>       +-15.1-[07]----00.0  Etron Technology, Inc. EJ168 USB 3.0 Host Controller
>>>>>       +-15.2-[08]--
>>>>>       +-15.3-[09]--
>>>>>
>>>>>      00:15.0 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 0)
>>>>>      00:15.1 PCI bridge: ATI Technologies Inc SB700/SB800/SB900 PCI to PCI bridge (PCIE port 1)
>>>>>      00:15.2 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 2)
>>>>>      00:15.3 PCI bridge: ATI Technologies Inc SB900 PCI to PCI bridge (PCIE port 3)
>>>>>
>>>>> Am I misinterpreting the spec or is this the correct, if strict,
>>>>> interpretation?
>>>>>
>>>>> Alex
>>>>>
>>>> Well, in digging into the ACS-support in Root port question above,
>>>> I just found out about this ACS support status for multifunctions too.
>>>> I need more time to read/digest, but a quick read says MFDs should have
>>>> an ACS cap with relevant RO-status&   control bits to direct/control
>>>> peer-to-peer and ACS-upstream.  Lack of an ACS struct implies(?)
>>>> peer-to-peer can happen, and thus, the above have to be in the same iommu group.
>>>> Unfortunately, I think a large lot of MFDs don't have ACS caps,
>>>> and don't/can't do peer-to-peer, so the above is heavy-handed,
>>>> albeit to spec.
>>>> Maybe we need a (large?) pci-quirk for the list of existing
>>>> MFDs that don't have ACS caps that would enable the above devices
>>>> to be in separate groups.
>>>> On the flip side, it solves some of the quirks for MFDs that
>>>> use the wrong BDF in their src-id dma packets! :) -- they default
>>>> to the same group now...
>>>
>>> Yep, sounds like you might agree with my patch, it's heavy handed, but
>> Yes, but we should design-in a quirk check list for MFDs,
>> b/c we already know some will fail when they should pass this check,
>> b/c the hw was made post ACS, or the vendors didn't see the benefit
>> of ACS reporting (even if the funcitonality was enabling bit-settings
>> that did nothing hw-wise), b/c they didn't understand the use case (VFIO,
>> dev-assignment to virt guests, etc.).
>>
>>> seems to adhere to the spec.  That probably just means we need an option
>>> to allow a more lenient interpretation, that maybe we don't have to
>>> support.  Thanks,
>> Right.  As I said, a hook to do quirk-level additions from the get-go
>> would speed this expected need/addition.
>
> Ok, a quirk should be easy there.  Thanks,
>
> Alex
>

Patch

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 111569c..d7f05ce 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2358,6 +2358,49 @@  void pci_enable_acs(struct pci_dev *dev)
 	pci_write_config_word(dev, pos + PCI_ACS_CTRL, ctrl);
 }
 
+#define PCI_EXT_CAP_ACS_ENABLED		(PCI_ACS_SV | PCI_ACS_RR | \
+					 PCI_ACS_CR | PCI_ACS_UF)
+
+/**
+ * pci_acs_enabled - test ACS support in downstream chain
+ * @dev: starting PCI device
+ *
+ * Returns the furthest downstream device with an unbroken ACS chain.  If
+ * ACS is enabled throughout the chain, the returned device is the same as
+ * the one passed in.
+ */
+struct pci_dev *pci_acs_enabled(struct pci_dev *dev)
+{
+	struct pci_dev *acs_dev;
+	int pos;
+	u16 ctrl;
+
+	if (!pci_is_root_bus(dev->bus))
+		acs_dev = pci_acs_enabled(dev->bus->self);
+	else
+		return dev;
+
+	/* If the chain is already broken, pass on the device */
+	if (acs_dev != dev->bus->self)
+		return acs_dev;
+
+	if (!pci_is_pcie(dev) || (dev->class >> 8) != PCI_CLASS_BRIDGE_PCI)
+		return dev;
+
+	if (dev->pcie_type != PCI_EXP_TYPE_DOWNSTREAM)
+		return dev;
+
+	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ACS);
+	if (!pos)
+		return acs_dev;
+
+	pci_read_config_word(dev, pos + PCI_ACS_CTRL, &ctrl);
+	if ((ctrl & PCI_EXT_CAP_ACS_ENABLED) != PCI_EXT_CAP_ACS_ENABLED)
+		return acs_dev;
+
+	return dev;
+}
+
 /**
  * pci_swizzle_interrupt_pin - swizzle INTx for device behind bridge
  * @dev: the PCI device
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 9910b5c..dc25da3 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1586,6 +1586,7 @@  static inline bool pci_is_pcie(struct pci_dev *dev)
 }
 
 void pci_request_acs(void);
+struct pci_dev *pci_acs_enabled(struct pci_dev *dev);
 
 
 #define PCI_VPD_LRDT			0x80	/* Large Resource Data Type */