diff mbox

[v4,3/6] PCI: Add support for multiple DMA aliases

Message ID 20160224194406.7585.17447.stgit@bhelgaas-glaptop2.roam.corp.google.com
State Superseded
Headers show

Commit Message

Bjorn Helgaas Feb. 24, 2016, 7:44 p.m. UTC
From: Jacek Lawrynowicz <jacek.lawrynowicz@intel.com>

<Insert changelog here>
---
 drivers/iommu/iommu.c |   17 ++++++++++-------
 drivers/pci/pci.c     |   11 +++++++++--
 drivers/pci/probe.c   |    1 +
 drivers/pci/search.c  |   14 +++++++++-----
 include/linux/pci.h   |    4 +---
 5 files changed, 30 insertions(+), 17 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bjorn Helgaas Feb. 25, 2016, 2:38 p.m. UTC | #1
On Wed, Feb 24, 2016 at 01:44:06PM -0600, Bjorn Helgaas wrote:
> From: Jacek Lawrynowicz <jacek.lawrynowicz@intel.com>
> 
> <Insert changelog here>

(Sorry, I should have copied this changelog in the patch; I copied
this manually from your v3 posting):

> This patch solves IOMMU support issues with PCIe non-transparent bridges
> that use Requester ID look-up tables (LUT), e.g. PEX8733. Before exiting
> the bridge, packet's RID is rewritten according to LUT programmed by
> a driver. Modified packets are then passed to a destination bus and
> processed upstream. The problem is that such packets seem to come from
> non-existent nodes that are hidden behind NTB and are not discoverable
> by a destination node, so IOMMU discards them. Adding DMA alias for a
> given LUT entry allows IOMMU to create a proper mapping that enables
> inter-node communication.

A specific example here would help me understand.  Here's how I
understand this (correct me if I'm wrong): We're talking about a DMA
packet being forwarded upstream from an NTB.  The NTB uses the LUT to
rewrite the RID in the DMA packet.  The new RID from the LUT is
unknown to the IOMMU, so it discards the DMA packet.

> The current DMA alias implementation supports only single alias, so it's
> not possible to connect more than two nodes when IOMMU is enabled. This
> implementation enables all possible aliases on a given bus (256) that
> are stored in a bitset. Alias devfn is directly translated to a bit
> number. The bitset is not allocated for devices that have no need for
> DMA aliases.

I think "two nodes" is referring to two PCIe devices on the other side
of the NTB.  You want DMA packets from those devices to have different
RIDs so the IOMMU can distinguish them.

The LUT entries basically create aliases of the NTB (one alias for
each device beyond the NTB).  Your quirk uses pci_add_dma_alias(), and
the aliases are all on the same bus as the NTB itself.

The quirk adds PCI_DEVFN(0x10, 0x0), PCI_DEVFN(0x11, 0x0), and
PCI_DEVFN(0x12, 0x0).  Shouldn't there be some connection between this
and the LUT programming?  I assume the LUT is programmed to correspond
to those aliases.  Does this mean you're limited to three devices
beyond the NTB?

> ---
>  drivers/iommu/iommu.c |   17 ++++++++++-------
>  drivers/pci/pci.c     |   11 +++++++++--
>  drivers/pci/probe.c   |    1 +
>  drivers/pci/search.c  |   14 +++++++++-----
>  include/linux/pci.h   |    4 +---
>  5 files changed, 30 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 0e3b009..a214e19 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -659,9 +659,15 @@ static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
>  	return NULL;
>  }
>  
> +static bool dma_alias_is_enabled(struct pci_dev *dev, u8 devfn)
> +{
> +	return dev->dma_alias_mask &&
> +	       test_bit(devfn, dev->dma_alias_mask);
> +}
> +
>  /*
> - * Look for aliases to or from the given device for exisiting groups.  The
> - * dma_alias_devfn only supports aliases on the same bus, therefore the search
> + * Look for aliases to or from the given device for existing groups. DMA
> + * aliases are only supported on the same bus, therefore the search

I'm trying to reconcile this statement that "DMA aliases are only
supported on the same bus" (which was there even before this patch)
with the fact that pci_for_each_dma_alias() does not have that
limitation.

>   * space is quite small (especially since we're really only looking at pcie
>   * device, and therefore only expect multiple slots on the root complex or
>   * downstream switch ports).  It's conceivable though that a pair of
> @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
>  			continue;
>  
>  		/* We alias them or they alias us */
> -		if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> -		     pdev->dma_alias_devfn == tmp->devfn) ||
> -		    ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> -		     tmp->dma_alias_devfn == pdev->devfn)) {
> -
> +		if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> +		    dma_alias_is_enabled(tmp, pdev->devfn)) {
>  			group = get_pci_alias_group(tmp, devfns);

We basically have this:

  for_each_pci_dev(tmp) {
    if (<pdev and tmp are DMA aliases>)
      group = get_pci_alias_group();
      ...
  }

The DMA alias stuff relies on PCI internals, so it doesn't doesn't
seem quite right to use things like PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and
dma_alias_devfn here in the IOMMU code.  

I'm trying to figure out why we don't do something like the following
instead:

  callback(struct pci_dev *pdev, u16 alias, void *opaque)
  {
    struct iommu_group *group;

    group = get_pci_alias_group();
    if (group)
      return group;

    return 0;
  }

  pci_for_each_dma_alias(pdev, callback, ...);

Is the existing code some sort of optimization, e.g., checking
PCI_DEV_FLAGS_DMA_ALIAS_DEVFN is cheaper than using
pci_for_each_dma_alias()?

It seems like this won't work for some very unlikely but theoretically
possible topologies, e.g.,

  PCIe Root Complex/IOMMU
    PCIe switch A
      PCIe to conventional PCI bridge
        PCI to PCIe Root Complex
	  PCIe NTB

Here, I think the IOMMU will only see RIDs from PCIe switch A, but the
current code only looks at DMA aliases that are on the same bus as the
PCIe NTB.  Wouldn't using pci_for_each_dma_alias() handle this
correctly?

>  			if (group) {
>  				pci_dev_put(tmp);
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jacek Lawrynowicz Feb. 25, 2016, 3:41 p.m. UTC | #2
> -----Original Message-----
> From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> Sent: Thursday, February 25, 2016 3:39 PM
> To: Bjorn Helgaas <bhelgaas@google.com>
> Cc: Lawrynowicz, Jacek <jacek.lawrynowicz@intel.com>; linux-
> pci@vger.kernel.org; Alex Williamson <alex.williamson@redhat.com>; Joerg
> Roedel <jroedel@suse.de>; David Woodhouse <dwmw2@infradead.org>;
> iommu@lists.linux-foundation.org
> Subject: Re: [PATCH v4 3/6] PCI: Add support for multiple DMA aliases
> 
> On Wed, Feb 24, 2016 at 01:44:06PM -0600, Bjorn Helgaas wrote:
> > From: Jacek Lawrynowicz <jacek.lawrynowicz@intel.com>
> >
> > <Insert changelog here>
> 
> (Sorry, I should have copied this changelog in the patch; I copied
> this manually from your v3 posting):
> 
> > This patch solves IOMMU support issues with PCIe non-transparent bridges
> > that use Requester ID look-up tables (LUT), e.g. PEX8733. Before exiting
> > the bridge, packet's RID is rewritten according to LUT programmed by
> > a driver. Modified packets are then passed to a destination bus and
> > processed upstream. The problem is that such packets seem to come from
> > non-existent nodes that are hidden behind NTB and are not discoverable
> > by a destination node, so IOMMU discards them. Adding DMA alias for a
> > given LUT entry allows IOMMU to create a proper mapping that enables
> > inter-node communication.
> 
> A specific example here would help me understand.  Here's how I
> understand this (correct me if I'm wrong): We're talking about a DMA
> packet being forwarded upstream from an NTB.  The NTB uses the LUT to
> rewrite the RID in the DMA packet.  The new RID from the LUT is
> unknown to the IOMMU, so it discards the DMA packet.

Yes, this is exactly the problem.

> > The current DMA alias implementation supports only single alias, so it's
> > not possible to connect more than two nodes when IOMMU is enabled. This
> > implementation enables all possible aliases on a given bus (256) that
> > are stored in a bitset. Alias devfn is directly translated to a bit
> > number. The bitset is not allocated for devices that have no need for
> > DMA aliases.
> 
> I think "two nodes" is referring to two PCIe devices on the other side
> of the NTB.  You want DMA packets from those devices to have different
> RIDs so the IOMMU can distinguish them.

Right.

> The LUT entries basically create aliases of the NTB (one alias for
> each device beyond the NTB).  Your quirk uses pci_add_dma_alias(), and
> the aliases are all on the same bus as the NTB itself.
> 
> The quirk adds PCI_DEVFN(0x10, 0x0), PCI_DEVFN(0x11, 0x0), and
> PCI_DEVFN(0x12, 0x0).  Shouldn't there be some connection between this
> and the LUT programming?  I assume the LUT is programmed to correspond
> to those aliases.  Does this mean you're limited to three devices
> beyond the NTB?

Yes, there is an indirect connection between LUT table and devfns used in the
quirk.
Dev part is an offset in the LUT table and function is taken from the device
behind the NTB.
So the driver can only change the dev part by using different LUT offsets.
We don't plan to modify this quirk. The number of PCIe devices beyond single
x200 card NTB will not change.
Two are used by x200 CPU (host bridge & root port) and one is used by x200 DMA
engine.
I'm not sure introducing some dependencies to make sure the offsets are set
correctly is really worth it.

So regarding the improvements in the patch description, you want me to update
and repost it?

BTW I posted x200 DMA driver (the client for this change) on DMA list:
https://lkml.org/lkml/2016/2/9/287
I'm working on integrating review comments and hope to get it included in 4.6.

Regards,
Jacek 

> > ---
> >  drivers/iommu/iommu.c |   17 ++++++++++-------
> >  drivers/pci/pci.c     |   11 +++++++++--
> >  drivers/pci/probe.c   |    1 +
> >  drivers/pci/search.c  |   14 +++++++++-----
> >  include/linux/pci.h   |    4 +---
> >  5 files changed, 30 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > index 0e3b009..a214e19 100644
> > --- a/drivers/iommu/iommu.c
> > +++ b/drivers/iommu/iommu.c
> > @@ -659,9 +659,15 @@ static struct iommu_group
> *get_pci_function_alias_group(struct pci_dev *pdev,
> >  	return NULL;
> >  }
> >
> > +static bool dma_alias_is_enabled(struct pci_dev *dev, u8 devfn)
> > +{
> > +	return dev->dma_alias_mask &&
> > +	       test_bit(devfn, dev->dma_alias_mask);
> > +}
> > +
> >  /*
> > - * Look for aliases to or from the given device for exisiting groups.  The
> > - * dma_alias_devfn only supports aliases on the same bus, therefore the
> search
> > + * Look for aliases to or from the given device for existing groups. DMA
> > + * aliases are only supported on the same bus, therefore the search
> 
> I'm trying to reconcile this statement that "DMA aliases are only
> supported on the same bus" (which was there even before this patch)
> with the fact that pci_for_each_dma_alias() does not have that
> limitation.
> 
> >   * space is quite small (especially since we're really only looking at pcie
> >   * device, and therefore only expect multiple slots on the root complex or
> >   * downstream switch ports).  It's conceivable though that a pair of
> > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct
> pci_dev *pdev,
> >  			continue;
> >
> >  		/* We alias them or they alias us */
> > -		if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN)
> &&
> > -		     pdev->dma_alias_devfn == tmp->devfn) ||
> > -		    ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > -		     tmp->dma_alias_devfn == pdev->devfn)) {
> > -
> > +		if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > +		    dma_alias_is_enabled(tmp, pdev->devfn)) {
> >  			group = get_pci_alias_group(tmp, devfns);
> 
> We basically have this:
> 
>   for_each_pci_dev(tmp) {
>     if (<pdev and tmp are DMA aliases>)
>       group = get_pci_alias_group();
>       ...
>   }
> 
> The DMA alias stuff relies on PCI internals, so it doesn't doesn't
> seem quite right to use things like PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and
> dma_alias_devfn here in the IOMMU code.
> 
> I'm trying to figure out why we don't do something like the following
> instead:
> 
>   callback(struct pci_dev *pdev, u16 alias, void *opaque)
>   {
>     struct iommu_group *group;
> 
>     group = get_pci_alias_group();
>     if (group)
>       return group;
> 
>     return 0;
>   }
> 
>   pci_for_each_dma_alias(pdev, callback, ...);
> 
> Is the existing code some sort of optimization, e.g., checking
> PCI_DEV_FLAGS_DMA_ALIAS_DEVFN is cheaper than using
> pci_for_each_dma_alias()?
> 
> It seems like this won't work for some very unlikely but theoretically
> possible topologies, e.g.,
> 
>   PCIe Root Complex/IOMMU
>     PCIe switch A
>       PCIe to conventional PCI bridge
>         PCI to PCIe Root Complex
> 	  PCIe NTB
> 
> Here, I think the IOMMU will only see RIDs from PCIe switch A, but the
> current code only looks at DMA aliases that are on the same bus as the
> PCIe NTB.  Wouldn't using pci_for_each_dma_alias() handle this
> correctly?
> 
> >  			if (group) {
> >  				pci_dev_put(tmp);
Bjorn Helgaas Feb. 29, 2016, 10:44 p.m. UTC | #3
On Thu, Feb 25, 2016 at 03:41:51PM +0000, Lawrynowicz, Jacek wrote:
> > -----Original Message-----
> > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > Sent: Thursday, February 25, 2016 3:39 PM
> > To: Bjorn Helgaas <bhelgaas@google.com>
> > Cc: Lawrynowicz, Jacek <jacek.lawrynowicz@intel.com>; linux-
> > pci@vger.kernel.org; Alex Williamson <alex.williamson@redhat.com>; Joerg
> > Roedel <jroedel@suse.de>; David Woodhouse <dwmw2@infradead.org>;
> > iommu@lists.linux-foundation.org
> > Subject: Re: [PATCH v4 3/6] PCI: Add support for multiple DMA aliases
> > 
> > On Wed, Feb 24, 2016 at 01:44:06PM -0600, Bjorn Helgaas wrote:
> > > From: Jacek Lawrynowicz <jacek.lawrynowicz@intel.com>
> > >
> > > <Insert changelog here>
> > 
> > (Sorry, I should have copied this changelog in the patch; I copied
> > this manually from your v3 posting):
> > 
> > > This patch solves IOMMU support issues with PCIe non-transparent bridges
> > > that use Requester ID look-up tables (LUT), e.g. PEX8733. Before exiting
> > > the bridge, packet's RID is rewritten according to LUT programmed by
> > > a driver. Modified packets are then passed to a destination bus and
> > > processed upstream. The problem is that such packets seem to come from
> > > non-existent nodes that are hidden behind NTB and are not discoverable
> > > by a destination node, so IOMMU discards them. Adding DMA alias for a
> > > given LUT entry allows IOMMU to create a proper mapping that enables
> > > inter-node communication.
> > 
> > A specific example here would help me understand.  Here's how I
> > understand this (correct me if I'm wrong): We're talking about a DMA
> > packet being forwarded upstream from an NTB.  The NTB uses the LUT to
> > rewrite the RID in the DMA packet.  The new RID from the LUT is
> > unknown to the IOMMU, so it discards the DMA packet.
> 
> Yes, this is exactly the problem.
> 
> > > The current DMA alias implementation supports only single alias, so it's
> > > not possible to connect more than two nodes when IOMMU is enabled. This
> > > implementation enables all possible aliases on a given bus (256) that
> > > are stored in a bitset. Alias devfn is directly translated to a bit
> > > number. The bitset is not allocated for devices that have no need for
> > > DMA aliases.
> > 
> > I think "two nodes" is referring to two PCIe devices on the other side
> > of the NTB.  You want DMA packets from those devices to have different
> > RIDs so the IOMMU can distinguish them.
> 
> Right.
> 
> > The LUT entries basically create aliases of the NTB (one alias for
> > each device beyond the NTB).  Your quirk uses pci_add_dma_alias(), and
> > the aliases are all on the same bus as the NTB itself.
> > 
> > The quirk adds PCI_DEVFN(0x10, 0x0), PCI_DEVFN(0x11, 0x0), and
> > PCI_DEVFN(0x12, 0x0).  Shouldn't there be some connection between this
> > and the LUT programming?  I assume the LUT is programmed to correspond
> > to those aliases.  Does this mean you're limited to three devices
> > beyond the NTB?
> 
> Yes, there is an indirect connection between LUT table and devfns used in the
> quirk.
> Dev part is an offset in the LUT table and function is taken from the device
> behind the NTB.
> So the driver can only change the dev part by using different LUT offsets.
> We don't plan to modify this quirk. The number of PCIe devices beyond single
> x200 card NTB will not change.
> Two are used by x200 CPU (host bridge & root port) and one is used by x200 DMA
> engine.
> I'm not sure introducing some dependencies to make sure the offsets are set
> correctly is really worth it.

I'd like at least a comment that points to the specific x200 code that
must coordinate with this.

> So regarding the improvements in the patch description, you want me to update
> and repost it?

Yes, please.

> BTW I posted x200 DMA driver (the client for this change) on DMA list:
> https://lkml.org/lkml/2016/2/9/287
> I'm working on integrating review comments and hope to get it included in 4.6.

What about my questions on the code itself, below?

> > > ---
> > >  drivers/iommu/iommu.c |   17 ++++++++++-------
> > >  drivers/pci/pci.c     |   11 +++++++++--
> > >  drivers/pci/probe.c   |    1 +
> > >  drivers/pci/search.c  |   14 +++++++++-----
> > >  include/linux/pci.h   |    4 +---
> > >  5 files changed, 30 insertions(+), 17 deletions(-)
> > >
> > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > index 0e3b009..a214e19 100644
> > > --- a/drivers/iommu/iommu.c
> > > +++ b/drivers/iommu/iommu.c
> > > @@ -659,9 +659,15 @@ static struct iommu_group
> > *get_pci_function_alias_group(struct pci_dev *pdev,
> > >  	return NULL;
> > >  }
> > >
> > > +static bool dma_alias_is_enabled(struct pci_dev *dev, u8 devfn)
> > > +{
> > > +	return dev->dma_alias_mask &&
> > > +	       test_bit(devfn, dev->dma_alias_mask);
> > > +}
> > > +
> > >  /*
> > > - * Look for aliases to or from the given device for exisiting groups.  The
> > > - * dma_alias_devfn only supports aliases on the same bus, therefore the
> > search
> > > + * Look for aliases to or from the given device for existing groups. DMA
> > > + * aliases are only supported on the same bus, therefore the search
> > 
> > I'm trying to reconcile this statement that "DMA aliases are only
> > supported on the same bus" (which was there even before this patch)
> > with the fact that pci_for_each_dma_alias() does not have that
> > limitation.
> > 
> > >   * space is quite small (especially since we're really only looking at pcie
> > >   * device, and therefore only expect multiple slots on the root complex or
> > >   * downstream switch ports).  It's conceivable though that a pair of
> > > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct
> > pci_dev *pdev,
> > >  			continue;
> > >
> > >  		/* We alias them or they alias us */
> > > -		if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN)
> > &&
> > > -		     pdev->dma_alias_devfn == tmp->devfn) ||
> > > -		    ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > -		     tmp->dma_alias_devfn == pdev->devfn)) {
> > > -
> > > +		if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > > +		    dma_alias_is_enabled(tmp, pdev->devfn)) {
> > >  			group = get_pci_alias_group(tmp, devfns);
> > 
> > We basically have this:
> > 
> >   for_each_pci_dev(tmp) {
> >     if (<pdev and tmp are DMA aliases>)
> >       group = get_pci_alias_group();
> >       ...
> >   }
> > 
> > The DMA alias stuff relies on PCI internals, so it doesn't doesn't
> > seem quite right to use things like PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and
> > dma_alias_devfn here in the IOMMU code.
> > 
> > I'm trying to figure out why we don't do something like the following
> > instead:
> > 
> >   callback(struct pci_dev *pdev, u16 alias, void *opaque)
> >   {
> >     struct iommu_group *group;
> > 
> >     group = get_pci_alias_group();
> >     if (group)
> >       return group;
> > 
> >     return 0;
> >   }
> > 
> >   pci_for_each_dma_alias(pdev, callback, ...);
> > 
> > Is the existing code some sort of optimization, e.g., checking
> > PCI_DEV_FLAGS_DMA_ALIAS_DEVFN is cheaper than using
> > pci_for_each_dma_alias()?
> > 
> > It seems like this won't work for some very unlikely but theoretically
> > possible topologies, e.g.,
> > 
> >   PCIe Root Complex/IOMMU
> >     PCIe switch A
> >       PCIe to conventional PCI bridge
> >         PCI to PCIe Root Complex
> > 	  PCIe NTB
> > 
> > Here, I think the IOMMU will only see RIDs from PCIe switch A, but the
> > current code only looks at DMA aliases that are on the same bus as the
> > PCIe NTB.  Wouldn't using pci_for_each_dma_alias() handle this
> > correctly?
> > 
> > >  			if (group) {
> > >  				pci_dev_put(tmp);


--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jacek Lawrynowicz March 1, 2016, 4:57 p.m. UTC | #4
On Mon, Feb 29, 2016 at 04:44:17PM -0600, Bjorn Helgaas wrote:
> On Thu, Feb 25, 2016 at 03:41:51PM +0000, Lawrynowicz, Jacek wrote:
> > > -----Original Message-----
> > > From: Bjorn Helgaas [mailto:helgaas@kernel.org]
> > > Sent: Thursday, February 25, 2016 3:39 PM
> > > To: Bjorn Helgaas <bhelgaas@google.com>
> > > Cc: Lawrynowicz, Jacek <jacek.lawrynowicz@intel.com>; linux-
> > > pci@vger.kernel.org; Alex Williamson <alex.williamson@redhat.com>; Joerg
> > > Roedel <jroedel@suse.de>; David Woodhouse <dwmw2@infradead.org>;
> > > iommu@lists.linux-foundation.org
> > > Subject: Re: [PATCH v4 3/6] PCI: Add support for multiple DMA aliases
> > > 
> > > On Wed, Feb 24, 2016 at 01:44:06PM -0600, Bjorn Helgaas wrote:
> > > > From: Jacek Lawrynowicz <jacek.lawrynowicz@intel.com>
> 
> What about my questions on the code itself, below?
> 

Sorry but I'm not able to answer it.
Maybe Joerg or David could help here?

> > > > ---
> > > >  drivers/iommu/iommu.c |   17 ++++++++++-------
> > > >  drivers/pci/pci.c     |   11 +++++++++--
> > > >  drivers/pci/probe.c   |    1 +
> > > >  drivers/pci/search.c  |   14 +++++++++-----
> > > >  include/linux/pci.h   |    4 +---
> > > >  5 files changed, 30 insertions(+), 17 deletions(-)
> > > >
> > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> > > > index 0e3b009..a214e19 100644
> > > > --- a/drivers/iommu/iommu.c
> > > > +++ b/drivers/iommu/iommu.c
> > > > @@ -659,9 +659,15 @@ static struct iommu_group
> > > *get_pci_function_alias_group(struct pci_dev *pdev,
> > > >  	return NULL;
> > > >  }
> > > >
> > > > +static bool dma_alias_is_enabled(struct pci_dev *dev, u8 devfn)
> > > > +{
> > > > +	return dev->dma_alias_mask &&
> > > > +	       test_bit(devfn, dev->dma_alias_mask);
> > > > +}
> > > > +
> > > >  /*
> > > > - * Look for aliases to or from the given device for exisiting groups.  The
> > > > - * dma_alias_devfn only supports aliases on the same bus, therefore the
> > > search
> > > > + * Look for aliases to or from the given device for existing groups. DMA
> > > > + * aliases are only supported on the same bus, therefore the search
> > > 
> > > I'm trying to reconcile this statement that "DMA aliases are only
> > > supported on the same bus" (which was there even before this patch)
> > > with the fact that pci_for_each_dma_alias() does not have that
> > > limitation.
> > > 
> > > >   * space is quite small (especially since we're really only looking at pcie
> > > >   * device, and therefore only expect multiple slots on the root complex or
> > > >   * downstream switch ports).  It's conceivable though that a pair of
> > > > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct
> > > pci_dev *pdev,
> > > >  			continue;
> > > >
> > > >  		/* We alias them or they alias us */
> > > > -		if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN)
> > > &&
> > > > -		     pdev->dma_alias_devfn == tmp->devfn) ||
> > > > -		    ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > > -		     tmp->dma_alias_devfn == pdev->devfn)) {
> > > > -
> > > > +		if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > > > +		    dma_alias_is_enabled(tmp, pdev->devfn)) {
> > > >  			group = get_pci_alias_group(tmp, devfns);
> > > 
> > > We basically have this:
> > > 
> > >   for_each_pci_dev(tmp) {
> > >     if (<pdev and tmp are DMA aliases>)
> > >       group = get_pci_alias_group();
> > >       ...
> > >   }
> > > 
> > > The DMA alias stuff relies on PCI internals, so it doesn't doesn't
> > > seem quite right to use things like PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and
> > > dma_alias_devfn here in the IOMMU code.
> > > 
> > > I'm trying to figure out why we don't do something like the following
> > > instead:
> > > 
> > >   callback(struct pci_dev *pdev, u16 alias, void *opaque)
> > >   {
> > >     struct iommu_group *group;
> > > 
> > >     group = get_pci_alias_group();
> > >     if (group)
> > >       return group;
> > > 
> > >     return 0;
> > >   }
> > > 
> > >   pci_for_each_dma_alias(pdev, callback, ...);
> > > 
> > > Is the existing code some sort of optimization, e.g., checking
> > > PCI_DEV_FLAGS_DMA_ALIAS_DEVFN is cheaper than using
> > > pci_for_each_dma_alias()?
> > > 
> > > It seems like this won't work for some very unlikely but theoretically
> > > possible topologies, e.g.,
> > > 
> > >   PCIe Root Complex/IOMMU
> > >     PCIe switch A
> > >       PCIe to conventional PCI bridge
> > >         PCI to PCIe Root Complex
> > > 	  PCIe NTB
> > > 
> > > Here, I think the IOMMU will only see RIDs from PCIe switch A, but the
> > > current code only looks at DMA aliases that are on the same bus as the
> > > PCIe NTB.  Wouldn't using pci_for_each_dma_alias() handle this
> > > correctly?
> > > 
> > > >  			if (group) {
> > > >  				pci_dev_put(tmp);
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse March 14, 2016, 10:43 p.m. UTC | #5
On Thu, 2016-02-25 at 08:38 -0600, Bjorn Helgaas wrote:
> 
> >  /*
> > - * Look for aliases to or from the given device for exisiting groups.  The
> > - * dma_alias_devfn only supports aliases on the same bus, therefore the search
> > + * Look for aliases to or from the given device for existing groups. DMA
> > + * aliases are only supported on the same bus, therefore the search
> 
> I'm trying to reconcile this statement that "DMA aliases are only
> supported on the same bus" (which was there even before this patch)
> with the fact that pci_for_each_dma_alias() does not have that
> limitation.

Doesn't it? You can still only set a DMA alias on the same bus with
pci_add_dma_alias(), can't you?

> >   * space is quite small (especially since we're really only looking at pcie
> >   * device, and therefore only expect multiple slots on the root complex or
> >   * downstream switch ports).  It's conceivable though that a pair of
> > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> >                       continue;
> >  
> >               /* We alias them or they alias us */
> > -             if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > -                  pdev->dma_alias_devfn == tmp->devfn) ||
> > -                 ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > -                  tmp->dma_alias_devfn == pdev->devfn)) {
> > -
> > +             if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > +                 dma_alias_is_enabled(tmp, pdev->devfn)) {
> >                       group = get_pci_alias_group(tmp, devfns);
> 
> We basically have this:
> 
>   for_each_pci_dev(tmp) {
>     if ()
>       group = get_pci_alias_group();
>       ...
>   }

Strictly, that's:

 for_each_pci_dev(tmp) {
   if (pdev is an alias of tmp || tmp is an alias of pdev)
     group = get_pci_alias_group();
   ...
 }

> The DMA alias stuff relies on PCI internals, so it doesn't doesn't
> seem quite right to use things like PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and
> dma_alias_devfn here in the IOMMU code.  
>
> I'm trying to figure out why we don't do something like the following
> instead:
> 
>   callback(struct pci_dev *pdev, u16 alias, void *opaque)
>   {
>     struct iommu_group *group;
> 
>     group = get_pci_alias_group();
>     if (group)
>       return group;
> 
>     return 0;
>   }
> 
>   pci_for_each_dma_alias(pdev, callback, ...);

And this would be equivalent to

 for_each_pci_dev(tmp) {
   if (tmp is an alias of pdev)
     group = get_pci_alias_group();
   ...
 }

The "is an alias of" property is not commutative. Perhaps it should be.
But that's hard because in some cases the alias doesn't even *exist* as
a real PCI device. It's just that you appear to get DMA transactions
from a given source-id.

-- 
dwmw2
Bjorn Helgaas March 16, 2016, 12:48 a.m. UTC | #6
On Mon, Mar 14, 2016 at 10:43:40PM +0000, David Woodhouse wrote:
> On Thu, 2016-02-25 at 08:38 -0600, Bjorn Helgaas wrote:
> > 
> > >  /*
> > > - * Look for aliases to or from the given device for exisiting groups.  The
> > > - * dma_alias_devfn only supports aliases on the same bus, therefore the search
> > > + * Look for aliases to or from the given device for existing groups. DMA
> > > + * aliases are only supported on the same bus, therefore the search
> > 
> > I'm trying to reconcile this statement that "DMA aliases are only
> > supported on the same bus" (which was there even before this patch)
> > with the fact that pci_for_each_dma_alias() does not have that
> > limitation.
> 
> Doesn't it? You can still only set a DMA alias on the same bus with
> pci_add_dma_alias(), can't you?

I guess it's true that PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and the proposed
pci_add_dma_alias() only add aliases on the same bus.  I was thinking
about a scenario like this:

  00:00.0 PCIe-to-PCI bridge to [bus 01]
  01:01.0 conventional PCI device

where I think 01:00.0 is a DMA alias for 01:01.0 because the bridge
takes ownership of DMA transactions from 01:01.0 and assigns a
Requester ID of 01:00.0 (secondary bus number, device 0, function 0).

> > >   * space is quite small (especially since we're really only looking at pcie
> > >   * device, and therefore only expect multiple slots on the root complex or
> > >   * downstream switch ports).  It's conceivable though that a pair of
> > > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> > >                       continue;
> > >  
> > >               /* We alias them or they alias us */
> > > -             if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > -                  pdev->dma_alias_devfn == tmp->devfn) ||
> > > -                 ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > -                  tmp->dma_alias_devfn == pdev->devfn)) {
> > > -
> > > +             if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > > +                 dma_alias_is_enabled(tmp, pdev->devfn)) {
> > >                       group = get_pci_alias_group(tmp, devfns);
> > 
> > We basically have this:
> > 
> >   for_each_pci_dev(tmp) {
> >     if ()
> >       group = get_pci_alias_group();
> >       ...
> >   }
> 
> Strictly, that's:
> 
>  for_each_pci_dev(tmp) {
>    if (pdev is an alias of tmp || tmp is an alias of pdev)
>      group = get_pci_alias_group();
>    ...
>  }

OK.  

> > I'm trying to figure out why we don't do something like the following
> > instead:
> > 
> >   callback(struct pci_dev *pdev, u16 alias, void *opaque)
> >   {
> >     struct iommu_group *group;
> > 
> >     group = get_pci_alias_group();
> >     if (group)
> >       return group;
> > 
> >     return 0;
> >   }
> > 
> >   pci_for_each_dma_alias(pdev, callback, ...);
> 
> And this would be equivalent to
> 
>  for_each_pci_dev(tmp) {
>    if (tmp is an alias of pdev)
>      group = get_pci_alias_group();
>    ...
>  }
> 
> The "is an alias of" property is not commutative. Perhaps it should be.
> But that's hard because in some cases the alias doesn't even *exist* as
> a real PCI device. It's just that you appear to get DMA transactions
> from a given source-id.

Right.  In my example above, 01:00.0 is not a PCI device; it's only a
Requester ID that is fabricated by the bridge when it forwards DMA
transactions upstream.

I think I'm confused because I don't really understand IOMMU groups.

Let me explain what I think they are and you can correct me when I go
wrong.  The iommu_group_alloc() comment says "The IOMMU group
represents the minimum granularity of the IOMMU."  So I suppose the
IOMMU cannot distinguish between devices in a group.  All the devices
in the group use the same set of DMA mappings.  Granting device A DMA
access to a buffer grants the same access to all other members of A's
IOMMU group.

That would mean my question was fundamentally backwards.  In
get_pci_alias_group(A), we're not trying to figure out what all the
aliases of A are, which is what pci_for_each_dma_alias() does.

Instead, we're trying to figure out which IOMMU group A belongs to.
But I still don't quite understand how aliases fit into this.  Let's
go back to my example and assume we've already put 00:00.0 and 01:01.0
in IOMMU groups:

  00:00.0 PCIe-to-PCI bridge to [bus 01]     # in IOMMU group G0
  01:01.0 conventional PCI device            # in IOMMU group G1

I assume these devices are in different IOMMU groups because if the
bridge generated its own DMA, it would use Requester ID 00:00.0, which
is distinct from the 01:00.0 it would use when forwarding DMAs from
its secondary side.

What happens when we add 01:02.0?  I think 01:01.0 and 01:02.0 should
both end up in IOMMU group G1 because the IOMMU will see only
Requester ID 01:00.0, so it can't distinguish them.

When we add 01:02.0, the ops->add_device() ... ops->device_group()
path calls pci_device_group(01:02.0):

  pci_device_group(01:02.0)
    pci_for_each_dma_alias(01:02.0, get_pci_alias_or_group)
      get_pci_alias_or_group(01:02.0, 01:02.0)   # callback
        return 0           # 01:02.0 group not set yet
      get_pci_alias_or_group(00:00.0, 01:00.0)   # callback
        return 1           # 00:00.0 is in G0

It seems like we'll assign 01:02.0 to group G0, when I think it should
be in G1.  Where did I go wrong?

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas April 8, 2016, 4:06 p.m. UTC | #7
On Tue, Mar 15, 2016 at 07:48:17PM -0500, Bjorn Helgaas wrote:
> On Mon, Mar 14, 2016 at 10:43:40PM +0000, David Woodhouse wrote:
> > On Thu, 2016-02-25 at 08:38 -0600, Bjorn Helgaas wrote:
> > > 
> > > >  /*
> > > > - * Look for aliases to or from the given device for exisiting groups.  The
> > > > - * dma_alias_devfn only supports aliases on the same bus, therefore the search
> > > > + * Look for aliases to or from the given device for existing groups. DMA
> > > > + * aliases are only supported on the same bus, therefore the search
> > > 
> > > I'm trying to reconcile this statement that "DMA aliases are only
> > > supported on the same bus" (which was there even before this patch)
> > > with the fact that pci_for_each_dma_alias() does not have that
> > > limitation.
> > 
> > Doesn't it? You can still only set a DMA alias on the same bus with
> > pci_add_dma_alias(), can't you?
> 
> I guess it's true that PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and the proposed
> pci_add_dma_alias() only add aliases on the same bus.  I was thinking
> about a scenario like this:
> 
>   00:00.0 PCIe-to-PCI bridge to [bus 01]
>   01:01.0 conventional PCI device
> 
> where I think 01:00.0 is a DMA alias for 01:01.0 because the bridge
> takes ownership of DMA transactions from 01:01.0 and assigns a
> Requester ID of 01:00.0 (secondary bus number, device 0, function 0).
> 
> > > >   * space is quite small (especially since we're really only looking at pcie
> > > >   * device, and therefore only expect multiple slots on the root complex or
> > > >   * downstream switch ports).  It's conceivable though that a pair of
> > > > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> > > >                       continue;
> > > >  
> > > >               /* We alias them or they alias us */
> > > > -             if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > > -                  pdev->dma_alias_devfn == tmp->devfn) ||
> > > > -                 ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > > -                  tmp->dma_alias_devfn == pdev->devfn)) {
> > > > -
> > > > +             if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > > > +                 dma_alias_is_enabled(tmp, pdev->devfn)) {
> > > >                       group = get_pci_alias_group(tmp, devfns);
> > > 
> > > We basically have this:
> > > 
> > >   for_each_pci_dev(tmp) {
> > >     if ()
> > >       group = get_pci_alias_group();
> > >       ...
> > >   }
> > 
> > Strictly, that's:
> > 
> >  for_each_pci_dev(tmp) {
> >    if (pdev is an alias of tmp || tmp is an alias of pdev)
> >      group = get_pci_alias_group();
> >    ...
> >  }
> 
> OK.  
> 
> > > I'm trying to figure out why we don't do something like the following
> > > instead:
> > > 
> > >   callback(struct pci_dev *pdev, u16 alias, void *opaque)
> > >   {
> > >     struct iommu_group *group;
> > > 
> > >     group = get_pci_alias_group();
> > >     if (group)
> > >       return group;
> > > 
> > >     return 0;
> > >   }
> > > 
> > >   pci_for_each_dma_alias(pdev, callback, ...);
> > 
> > And this would be equivalent to
> > 
> >  for_each_pci_dev(tmp) {
> >    if (tmp is an alias of pdev)
> >      group = get_pci_alias_group();
> >    ...
> >  }
> > 
> > The "is an alias of" property is not commutative. Perhaps it should be.
> > But that's hard because in some cases the alias doesn't even *exist* as
> > a real PCI device. It's just that you appear to get DMA transactions
> > from a given source-id.
> 
> Right.  In my example above, 01:00.0 is not a PCI device; it's only a
> Requester ID that is fabricated by the bridge when it forwards DMA
> transactions upstream.
> 
> I think I'm confused because I don't really understand IOMMU groups.
> 
> Let me explain what I think they are and you can correct me when I go
> wrong.  The iommu_group_alloc() comment says "The IOMMU group
> represents the minimum granularity of the IOMMU."  So I suppose the
> IOMMU cannot distinguish between devices in a group.  All the devices
> in the group use the same set of DMA mappings.  Granting device A DMA
> access to a buffer grants the same access to all other members of A's
> IOMMU group.
> 
> That would mean my question was fundamentally backwards.  In
> get_pci_alias_group(A), we're not trying to figure out what all the
> aliases of A are, which is what pci_for_each_dma_alias() does.
> 
> Instead, we're trying to figure out which IOMMU group A belongs to.
> But I still don't quite understand how aliases fit into this.  Let's
> go back to my example and assume we've already put 00:00.0 and 01:01.0
> in IOMMU groups:
> 
>   00:00.0 PCIe-to-PCI bridge to [bus 01]     # in IOMMU group G0
>   01:01.0 conventional PCI device            # in IOMMU group G1
> 
> I assume these devices are in different IOMMU groups because if the
> bridge generated its own DMA, it would use Requester ID 00:00.0, which
> is distinct from the 01:00.0 it would use when forwarding DMAs from
> its secondary side.
> 
> What happens when we add 01:02.0?  I think 01:01.0 and 01:02.0 should
> both end up in IOMMU group G1 because the IOMMU will see only
> Requester ID 01:00.0, so it can't distinguish them.
> 
> When we add 01:02.0, the ops->add_device() ... ops->device_group()
> path calls pci_device_group(01:02.0):
> 
>   pci_device_group(01:02.0)
>     pci_for_each_dma_alias(01:02.0, get_pci_alias_or_group)
>       get_pci_alias_or_group(01:02.0, 01:02.0)   # callback
>         return 0           # 01:02.0 group not set yet
>       get_pci_alias_or_group(00:00.0, 01:00.0)   # callback
>         return 1           # 00:00.0 is in G0
> 
> It seems like we'll assign 01:02.0 to group G0, when I think it should
> be in G1.  Where did I go wrong?

Ping?
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Woodhouse April 8, 2016, 4:09 p.m. UTC | #8
On Fri, 2016-04-08 at 11:06 -0500, Bjorn Helgaas wrote:
> > I think I'm confused because I don't really understand IOMMU
> > groups.
,,,

> Ping?

I think this ended up being more of a Alex question...
Alex Williamson April 8, 2016, 5:31 p.m. UTC | #9
On Fri, 8 Apr 2016 11:06:32 -0500
Bjorn Helgaas <helgaas@kernel.org> wrote:

> On Tue, Mar 15, 2016 at 07:48:17PM -0500, Bjorn Helgaas wrote:
> > On Mon, Mar 14, 2016 at 10:43:40PM +0000, David Woodhouse wrote:  
> > > On Thu, 2016-02-25 at 08:38 -0600, Bjorn Helgaas wrote:  
> > > >   
> > > > >  /*
> > > > > - * Look for aliases to or from the given device for exisiting groups.  The
> > > > > - * dma_alias_devfn only supports aliases on the same bus, therefore the search
> > > > > + * Look for aliases to or from the given device for existing groups. DMA
> > > > > + * aliases are only supported on the same bus, therefore the search  
> > > > 
> > > > I'm trying to reconcile this statement that "DMA aliases are only
> > > > supported on the same bus" (which was there even before this patch)
> > > > with the fact that pci_for_each_dma_alias() does not have that
> > > > limitation.  
> > > 
> > > Doesn't it? You can still only set a DMA alias on the same bus with
> > > pci_add_dma_alias(), can't you?  
> > 
> > I guess it's true that PCI_DEV_FLAGS_DMA_ALIAS_DEVFN and the proposed
> > pci_add_dma_alias() only add aliases on the same bus.  I was thinking
> > about a scenario like this:
> > 
> >   00:00.0 PCIe-to-PCI bridge to [bus 01]
> >   01:01.0 conventional PCI device
> > 
> > where I think 01:00.0 is a DMA alias for 01:01.0 because the bridge
> > takes ownership of DMA transactions from 01:01.0 and assigns a
> > Requester ID of 01:00.0 (secondary bus number, device 0, function 0).

This is true, but this is a topology alias, not a quirk.
pci_for_each_dma_alias() already handles this case.  It would trigger
the callback function for any direct alias of the conventional device
as well as the bridge itself.  Obviously we don't end up with any
quirks for conventional devices because the aliases are masked behind
the bridge anyway.
 
> > > > >   * space is quite small (especially since we're really only looking at pcie
> > > > >   * device, and therefore only expect multiple slots on the root complex or
> > > > >   * downstream switch ports).  It's conceivable though that a pair of
> > > > > @@ -686,11 +692,8 @@ static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
> > > > >                       continue;
> > > > >  
> > > > >               /* We alias them or they alias us */
> > > > > -             if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > > > -                  pdev->dma_alias_devfn == tmp->devfn) ||
> > > > > -                 ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
> > > > > -                  tmp->dma_alias_devfn == pdev->devfn)) {
> > > > > -
> > > > > +             if (dma_alias_is_enabled(pdev, tmp->devfn) ||
> > > > > +                 dma_alias_is_enabled(tmp, pdev->devfn)) {
> > > > >                       group = get_pci_alias_group(tmp, devfns);  
> > > > 
> > > > We basically have this:
> > > > 
> > > >   for_each_pci_dev(tmp) {
> > > >     if ()
> > > >       group = get_pci_alias_group();
> > > >       ...
> > > >   }  
> > > 
> > > Strictly, that's:
> > > 
> > >  for_each_pci_dev(tmp) {
> > >    if (pdev is an alias of tmp || tmp is an alias of pdev)
> > >      group = get_pci_alias_group();
> > >    ...
> > >  }  
> > 
> > OK.  
> >   
> > > > I'm trying to figure out why we don't do something like the following
> > > > instead:
> > > > 
> > > >   callback(struct pci_dev *pdev, u16 alias, void *opaque)
> > > >   {
> > > >     struct iommu_group *group;
> > > > 
> > > >     group = get_pci_alias_group();
> > > >     if (group)
> > > >       return group;
> > > > 
> > > >     return 0;
> > > >   }
> > > > 
> > > >   pci_for_each_dma_alias(pdev, callback, ...);  
> > > 
> > > And this would be equivalent to
> > > 
> > >  for_each_pci_dev(tmp) {
> > >    if (tmp is an alias of pdev)
> > >      group = get_pci_alias_group();
> > >    ...
> > >  }
> > > 
> > > The "is an alias of" property is not commutative. Perhaps it should be.
> > > But that's hard because in some cases the alias doesn't even *exist* as
> > > a real PCI device. It's just that you appear to get DMA transactions
> > > from a given source-id.  
> > 
> > Right.  In my example above, 01:00.0 is not a PCI device; it's only a
> > Requester ID that is fabricated by the bridge when it forwards DMA
> > transactions upstream.
> > 
> > I think I'm confused because I don't really understand IOMMU groups.
> > 
> > Let me explain what I think they are and you can correct me when I go
> > wrong.  The iommu_group_alloc() comment says "The IOMMU group
> > represents the minimum granularity of the IOMMU."  So I suppose the
> > IOMMU cannot distinguish between devices in a group.  All the devices
> > in the group use the same set of DMA mappings.  Granting device A DMA
> > access to a buffer grants the same access to all other members of A's
> > IOMMU group.
> > 
> > That would mean my question was fundamentally backwards.  In
> > get_pci_alias_group(A), we're not trying to figure out what all the
> > aliases of A are, which is what pci_for_each_dma_alias() does.
> > 
> > Instead, we're trying to figure out which IOMMU group A belongs to.
> > But I still don't quite understand how aliases fit into this.  Let's
> > go back to my example and assume we've already put 00:00.0 and 01:01.0
> > in IOMMU groups:
> > 
> >   00:00.0 PCIe-to-PCI bridge to [bus 01]     # in IOMMU group G0
> >   01:01.0 conventional PCI device            # in IOMMU group G1
> > 
> > I assume these devices are in different IOMMU groups because if the
> > bridge generated its own DMA, it would use Requester ID 00:00.0, which
> > is distinct from the 01:00.0 it would use when forwarding DMAs from
> > its secondary side.

The actual requester ID of the bridge depends on quirks as well, see
PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS, so it might either be the bridge itself
or the subordinate bus address.  However, trace how these groups would
be created, the bridge device is discovered first, so we call
pci_device_group() for it.  since it's on the root bus and it's the 0th
function, there will be no existing groups for it to alias, so it gets
a new group.  Then we discover 01:01.0, pci_device_group() calls
pci_for_each_dma_alias(), which hits the bridge, regardless of which
alias is used.  So 01:01.0 ends up in the same iommu group as the
bridge.

From the purist standpoint of iommu groups, your expectations are
probably correct, but we also include within IOMMU groups DMA isolation
between devices.  So if two devices can DMA between each other without
it necessarily passing through the iommu, the devices are grouped
together.  The more important cases of this are ACS isolation on PCI-e
devices, but it still sort of applies to this conventional PCI example
here.
 
> > What happens when we add 01:02.0?  I think 01:01.0 and 01:02.0
> > should both end up in IOMMU group G1 because the IOMMU will see only
> > Requester ID 01:00.0, so it can't distinguish them.
> > 
> > When we add 01:02.0, the ops->add_device() ... ops->device_group()
> > path calls pci_device_group(01:02.0):
> > 
> >   pci_device_group(01:02.0)
> >     pci_for_each_dma_alias(01:02.0, get_pci_alias_or_group)
> >       get_pci_alias_or_group(01:02.0, 01:02.0)   # callback
> >         return 0           # 01:02.0 group not set yet
> >       get_pci_alias_or_group(00:00.0, 01:00.0)   # callback
> >         return 1           # 00:00.0 is in G0
> > 
> > It seems like we'll assign 01:02.0 to group G0, when I think it
> > should be in G1.  Where did I go wrong?  

In fact they're all part of G0, so you went wrong assuming the initial
grouping rather than applying the same rules and tracing how 01:01.0 is
added.  It's not an ideal representation, it tends to be more
conservative than necessary in some cases, but keeping the group
attached to devices has advantages in being able to search and find
points like this where we don't necessarily have access to the group
behind the bridge without tying it to the bridge itself.  Thanks,

Alex
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 0e3b009..a214e19 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -659,9 +659,15 @@  static struct iommu_group *get_pci_function_alias_group(struct pci_dev *pdev,
 	return NULL;
 }
 
+static bool dma_alias_is_enabled(struct pci_dev *dev, u8 devfn)
+{
+	return dev->dma_alias_mask &&
+	       test_bit(devfn, dev->dma_alias_mask);
+}
+
 /*
- * Look for aliases to or from the given device for exisiting groups.  The
- * dma_alias_devfn only supports aliases on the same bus, therefore the search
+ * Look for aliases to or from the given device for existing groups. DMA
+ * aliases are only supported on the same bus, therefore the search
  * space is quite small (especially since we're really only looking at pcie
  * device, and therefore only expect multiple slots on the root complex or
  * downstream switch ports).  It's conceivable though that a pair of
@@ -686,11 +692,8 @@  static struct iommu_group *get_pci_alias_group(struct pci_dev *pdev,
 			continue;
 
 		/* We alias them or they alias us */
-		if (((pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
-		     pdev->dma_alias_devfn == tmp->devfn) ||
-		    ((tmp->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN) &&
-		     tmp->dma_alias_devfn == pdev->devfn)) {
-
+		if (dma_alias_is_enabled(pdev, tmp->devfn) ||
+		    dma_alias_is_enabled(tmp, pdev->devfn)) {
 			group = get_pci_alias_group(tmp, devfns);
 			if (group) {
 				pci_dev_put(tmp);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8b0a637..eb4dd49 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -4578,8 +4578,15 @@  int pci_set_vga_state(struct pci_dev *dev, bool decode,
  */
 void pci_add_dma_alias(struct pci_dev *dev, u8 devfn)
 {
-	dev->dma_alias_devfn = PCI_DEVFN(PCI_SLOT(dev->devfn), 0);
-	dev->dev_flags |= PCI_DEV_FLAGS_DMA_ALIAS_DEVFN;
+	if (!dev->dma_alias_mask)
+		dev->dma_alias_mask = kcalloc(BITS_TO_LONGS(U8_MAX),
+					      sizeof(long), GFP_KERNEL);
+	if (!dev->dma_alias_mask) {
+		dev_warn(&dev->dev, "Unable to allocate DMA alias mask\n");
+		return;
+	}
+
+	set_bit(devfn, dev->dma_alias_mask);
 	dev_info(&dev->dev, "Enabling fixed DMA alias to %02x.%d\n",
 		 PCI_SLOT(devfn), PCI_FUNC(devfn));
 }
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 6d7ab9b..23fc397 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1503,6 +1503,7 @@  static void pci_release_dev(struct device *dev)
 	pcibios_release_device(pci_dev);
 	pci_bus_put(pci_dev->bus);
 	kfree(pci_dev->driver_override);
+	kfree(pci_dev->dma_alias_mask);
 	kfree(pci_dev);
 }
 
diff --git a/drivers/pci/search.c b/drivers/pci/search.c
index a20ce7d..33e0f03 100644
--- a/drivers/pci/search.c
+++ b/drivers/pci/search.c
@@ -40,11 +40,15 @@  int pci_for_each_dma_alias(struct pci_dev *pdev,
 	 * If the device is broken and uses an alias requester ID for
 	 * DMA, iterate over that too.
 	 */
-	if (unlikely(pdev->dev_flags & PCI_DEV_FLAGS_DMA_ALIAS_DEVFN)) {
-		ret = fn(pdev, PCI_DEVID(pdev->bus->number,
-					 pdev->dma_alias_devfn), data);
-		if (ret)
-			return ret;
+	if (unlikely(pdev->dma_alias_mask)) {
+		u8 devfn;
+
+		for_each_set_bit(devfn, pdev->dma_alias_mask, U8_MAX) {
+			ret = fn(pdev, PCI_DEVID(pdev->bus->number, devfn),
+				 data);
+			if (ret)
+				return ret;
+		}
 	}
 
 	for (bus = pdev->bus; !pci_is_root_bus(bus); bus = bus->parent) {
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 27df4a6..d9e0c84 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -172,8 +172,6 @@  enum pci_dev_flags {
 	PCI_DEV_FLAGS_ASSIGNED = (__force pci_dev_flags_t) (1 << 2),
 	/* Flag for quirk use to store if quirk-specific ACS is enabled */
 	PCI_DEV_FLAGS_ACS_ENABLED_QUIRK = (__force pci_dev_flags_t) (1 << 3),
-	/* Flag to indicate the device uses dma_alias_devfn */
-	PCI_DEV_FLAGS_DMA_ALIAS_DEVFN = (__force pci_dev_flags_t) (1 << 4),
 	/* Use a PCIe-to-PCI bridge alias even if !pci_is_pcie */
 	PCI_DEV_FLAG_PCIE_BRIDGE_ALIAS = (__force pci_dev_flags_t) (1 << 5),
 	/* Do not use bus resets for device */
@@ -279,7 +277,7 @@  struct pci_dev {
 	u8		rom_base_reg;	/* which config register controls the ROM */
 	u8		pin;		/* which interrupt pin this device uses */
 	u16		pcie_flags_reg;	/* cached PCIe Capabilities Register */
-	u8		dma_alias_devfn;/* devfn of DMA alias, if any */
+	unsigned long	*dma_alias_mask;/* mask of enabled devfn aliases */
 
 	struct pci_driver *driver;	/* which driver has allocated this device */
 	u64		dma_mask;	/* Mask of the bits of bus address this