diff mbox

[v4,2/2] PCI: quirks: Fix ThunderX2 dma alias handling

Message ID 1491225304-3559-3-git-send-email-jnair@caviumnetworks.com
State Changes Requested
Headers show

Commit Message

Jayachandran C April 3, 2017, 1:15 p.m. UTC
The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
topology is slightly unusual. For a multi-node system, it looks like:

[node level PCI bridges - one per node]
    [SoC PCI devices with MSI-X but no IOMMU]
    [PCI-PCIe "glue" bridges - upto 14, one per real port below]
        [PCIe real root ports associated with IOMMU and GICv3 ITS]
            [External PCI devices connected to PCIe links]

The top two levels of bridges should have introduced aliases since they
are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
In the case of external PCIe devices, the "real" root ports are connected
to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
alias. The SoC PCI devices are directly connected to the GIC ITS, so the
node level bridges do not introduce an alias either.

To handle this quirk, we mark the real PCIe root ports and node level
PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
pci_for_each_dma_alias() works correctly for external PCIe devices and
SoC PCI devices.

For the current revision of Cavium ThunderX2, the VendorID and Device ID
are from Broadcom Vulcan (14e4:90XX).

Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
---
 drivers/pci/quirks.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

Comments

Robin Murphy April 3, 2017, 3:07 p.m. UTC | #1
On 03/04/17 14:15, Jayachandran C wrote:
> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> topology is slightly unusual. For a multi-node system, it looks like:
> 
> [node level PCI bridges - one per node]
>     [SoC PCI devices with MSI-X but no IOMMU]
>     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
>         [PCIe real root ports associated with IOMMU and GICv3 ITS]
>             [External PCI devices connected to PCIe links]

Since it's not entirely obvious, what does the actual DT - or IORT if
you must ;) - topology for this look like? I can't help thinking that
either it's inaccurate, or that this is going to expose a shortcoming in
pci_dma_configure() which breaks things - unless I'm missing something,
isn't find_pci_root_bus() going to go all the way up to the top-level
glue bridge and pick up the wrong firmware node (if any) for the
appropriate DMA properties?

Robin.

> The top two levels of bridges should have introduced aliases since they
> are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> In the case of external PCIe devices, the "real" root ports are connected
> to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> node level bridges do not introduce an alias either.
> 
> To handle this quirk, we mark the real PCIe root ports and node level
> PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> pci_for_each_dma_alias() works correctly for external PCIe devices and
> SoC PCI devices.
> 
> For the current revision of Cavium ThunderX2, the VendorID and Device ID
> are from Broadcom Vulcan (14e4:90XX).
> 
> Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
> ---
>  drivers/pci/quirks.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6736836..564a84a 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
>  
>  /*
> + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
> + * associated not at the root bus, but at a bridge below. This quirk flag
> + * will ensure that the aliases are identified correctly.
> + */
> +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
> +				quirk_bridge_cavm_thrx2_pcie_root);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
> +				quirk_bridge_cavm_thrx2_pcie_root);
> +
> +/*
>   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
>   * class code.  Fix it.
>   */
>
Jayachandran C April 4, 2017, 11:50 a.m. UTC | #2
On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote:
> On 03/04/17 14:15, Jayachandran C wrote:
> > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > topology is slightly unusual. For a multi-node system, it looks like:
> > 
> > [node level PCI bridges - one per node]
> >     [SoC PCI devices with MSI-X but no IOMMU]
> >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> >             [External PCI devices connected to PCIe links]
> 
> Since it's not entirely obvious, what does the actual DT - or IORT if
> you must ;) - topology for this look like? I can't help thinking that
> either it's inaccurate, or that this is going to expose a shortcoming in
> pci_dma_configure() which breaks things - unless I'm missing something,
> isn't find_pci_root_bus() going to go all the way up to the top-level
> glue bridge and pick up the wrong firmware node (if any) for the
> appropriate DMA properties?

I will try to describe the ACPI interface:

There is just one ECAM area, a single bus range and one set of memory
windows for the whole system - so there is just one entry in DSDT for
the PCI controller. This entry also corresponds to the PCI RC node in
IORT. DMA is coherent and supports 64 bits system-wide, the attributes
(in DSDT and IORT) reflect this.

lspci on the system looks like this:
-[0000:00]-+-00.0-[01-1e]--+-04.0  14e4:9026
           |               +-04.1  14e4:9026
           |               +-05.0  14e4:9027
           |               +-05.1  14e4:9027
           |               +-0a.0-[02-03]----00.0-[03]--
           |               +-0a.1-[04-05]----00.0-[05]--
           |           [...etc...]
           |               +-0b.0-[12-14]----00.0-[13-14]--+-00.0  8086:1583
           |               |                               \-00.1  8086:1583
           |           [...etc...]
           |               \-0b.5-[1d-1e]----00.0-[1e]--
           \-00.1-[1f-3b]--+-04.0  14e4:9026
                           +-04.1  14e4:9026
                           +-05.0  14e4:9027
                           +-05.1  14e4:9027
                           +-0a.0-[20-21]----00.0-[21]--
                       [...etc...]

The devices here are:
 - 00:00.0 and 00:00.1 are the node (socket) level bridges
 - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB
 - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges
 - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. 
Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU.

The IORT is built by the firmware based on its PCI enumeration. The IORT
will have multiple entries under the PCI RC node:
 - one entry per node to map the SoC devices directly to ITS for MSI-X,
   since the SoC devices are not attached to any SMMU.
 - An entry per "real" PCIe port to map RIDs under it to the corresponding
   SMMU.
The SMMU nodes will have an entry to map its RID ranges to the node ITS.

The IORT spec supports this configuration, and the corresponding code is
already upstream, so the only sticking point right now is
pci_for_each_dma_alias().

JC.
Robin Murphy April 4, 2017, 2:28 p.m. UTC | #3
On 04/04/17 12:50, Jayachandran C wrote:
> On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote:
>> On 03/04/17 14:15, Jayachandran C wrote:
>>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
>>> topology is slightly unusual. For a multi-node system, it looks like:
>>>
>>> [node level PCI bridges - one per node]
>>>     [SoC PCI devices with MSI-X but no IOMMU]
>>>     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
>>>         [PCIe real root ports associated with IOMMU and GICv3 ITS]
>>>             [External PCI devices connected to PCIe links]
>>
>> Since it's not entirely obvious, what does the actual DT - or IORT if
>> you must ;) - topology for this look like? I can't help thinking that
>> either it's inaccurate, or that this is going to expose a shortcoming in
>> pci_dma_configure() which breaks things - unless I'm missing something,
>> isn't find_pci_root_bus() going to go all the way up to the top-level
>> glue bridge and pick up the wrong firmware node (if any) for the
>> appropriate DMA properties?
> 
> I will try to describe the ACPI interface:
> 
> There is just one ECAM area, a single bus range and one set of memory
> windows for the whole system - so there is just one entry in DSDT for
> the PCI controller. This entry also corresponds to the PCI RC node in
> IORT. DMA is coherent and supports 64 bits system-wide, the attributes
> (in DSDT and IORT) reflect this.
> 
> lspci on the system looks like this:
> -[0000:00]-+-00.0-[01-1e]--+-04.0  14e4:9026
>            |               +-04.1  14e4:9026
>            |               +-05.0  14e4:9027
>            |               +-05.1  14e4:9027
>            |               +-0a.0-[02-03]----00.0-[03]--
>            |               +-0a.1-[04-05]----00.0-[05]--
>            |           [...etc...]
>            |               +-0b.0-[12-14]----00.0-[13-14]--+-00.0  8086:1583
>            |               |                               \-00.1  8086:1583
>            |           [...etc...]
>            |               \-0b.5-[1d-1e]----00.0-[1e]--
>            \-00.1-[1f-3b]--+-04.0  14e4:9026
>                            +-04.1  14e4:9026
>                            +-05.0  14e4:9027
>                            +-05.1  14e4:9027
>                            +-0a.0-[20-21]----00.0-[21]--
>                        [...etc...]
> 
> The devices here are:
>  - 00:00.0 and 00:00.1 are the node (socket) level bridges
>  - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB
>  - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges
>  - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. 
> Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU.
> 
> The IORT is built by the firmware based on its PCI enumeration. The IORT
> will have multiple entries under the PCI RC node:
>  - one entry per node to map the SoC devices directly to ITS for MSI-X,
>    since the SoC devices are not attached to any SMMU.
>  - An entry per "real" PCIe port to map RIDs under it to the corresponding
>    SMMU.
> The SMMU nodes will have an entry to map its RID ranges to the node ITS.
> 
> The IORT spec supports this configuration, and the corresponding code is
> already upstream, so the only sticking point right now is
> pci_for_each_dma_alias().

Thanks, that helps a lot. The "single global ECAM space" idea was
eluding me, but in that context it all makes much more sense - I'm
assuming the two quirked device IDs correspond to the 00:00.[01] devices
and the [02-1e]:00.0 ones.

So (at the risk of Jon mooing at me), I guess the DT description would
be a single node looking something like:

pcie {
	reg = [global ECAM space for segment 0000];

	msi-map = <0x0100 &its0 0x0100 0x1d00>,
		  <0x1f00 &its1 0x1f00 0x1d00>;
	iommu-map = <0x0200 &smmu0 0x0200 0x1c00>,
		    <0x2000 &smmu0 0x2000 0x1c00>;
};

(note to self: which incidentally also means of_pci_map_rid() probably
wants fixing to not treat gaps in the map as an error)

With only one node like that, rather than having the whole first 3
levels of bridges described, the "stop at the appropriate node in the
callback" approach does become even more impractical in all cases. So,
for $TITLE, based on the above understanding:

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

Cheers,
Robin.

> 
> JC.
>
Jayachandran C April 10, 2017, 11:38 a.m. UTC | #4
[Moving Bjorn back to to: ]

On Tue, Apr 04, 2017 at 03:28:26PM +0100, Robin Murphy wrote:
> On 04/04/17 12:50, Jayachandran C wrote:
> > On Mon, Apr 03, 2017 at 04:07:53PM +0100, Robin Murphy wrote:
> >> On 03/04/17 14:15, Jayachandran C wrote:
> >>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> >>> topology is slightly unusual. For a multi-node system, it looks like:
> >>>
> >>> [node level PCI bridges - one per node]
> >>>     [SoC PCI devices with MSI-X but no IOMMU]
> >>>     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> >>>         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> >>>             [External PCI devices connected to PCIe links]
> >>
> >> Since it's not entirely obvious, what does the actual DT - or IORT if
> >> you must ;) - topology for this look like? I can't help thinking that
> >> either it's inaccurate, or that this is going to expose a shortcoming in
> >> pci_dma_configure() which breaks things - unless I'm missing something,
> >> isn't find_pci_root_bus() going to go all the way up to the top-level
> >> glue bridge and pick up the wrong firmware node (if any) for the
> >> appropriate DMA properties?
> > 
> > I will try to describe the ACPI interface:
> > 
> > There is just one ECAM area, a single bus range and one set of memory
> > windows for the whole system - so there is just one entry in DSDT for
> > the PCI controller. This entry also corresponds to the PCI RC node in
> > IORT. DMA is coherent and supports 64 bits system-wide, the attributes
> > (in DSDT and IORT) reflect this.
> > 
> > lspci on the system looks like this:
> > -[0000:00]-+-00.0-[01-1e]--+-04.0  14e4:9026
> >            |               +-04.1  14e4:9026
> >            |               +-05.0  14e4:9027
> >            |               +-05.1  14e4:9027
> >            |               +-0a.0-[02-03]----00.0-[03]--
> >            |               +-0a.1-[04-05]----00.0-[05]--
> >            |           [...etc...]
> >            |               +-0b.0-[12-14]----00.0-[13-14]--+-00.0  8086:1583
> >            |               |                               \-00.1  8086:1583
> >            |           [...etc...]
> >            |               \-0b.5-[1d-1e]----00.0-[1e]--
> >            \-00.1-[1f-3b]--+-04.0  14e4:9026
> >                            +-04.1  14e4:9026
> >                            +-05.0  14e4:9027
> >                            +-05.1  14e4:9027
> >                            +-0a.0-[20-21]----00.0-[21]--
> >                        [...etc...]
> > 
> > The devices here are:
> >  - 00:00.0 and 00:00.1 are the node (socket) level bridges
> >  - 01:[45].x and 1f:[45].x are SoC PCI devices like SATA and USB
> >  - 01:[ab].x and 1f:[ab].x are the PCI-PCIe "reverse"/glue bridges
> >  - 02:00.0 etc are the "real" PCIe ports connected to external PCIe cards. 
> > Each node has a GIC ITS, and a group of 4 PCIe ports have an SMMU.
> > 
> > The IORT is built by the firmware based on its PCI enumeration. The IORT
> > will have multiple entries under the PCI RC node:
> >  - one entry per node to map the SoC devices directly to ITS for MSI-X,
> >    since the SoC devices are not attached to any SMMU.
> >  - An entry per "real" PCIe port to map RIDs under it to the corresponding
> >    SMMU.
> > The SMMU nodes will have an entry to map its RID ranges to the node ITS.
> > 
> > The IORT spec supports this configuration, and the corresponding code is
> > already upstream, so the only sticking point right now is
> > pci_for_each_dma_alias().
> 
> Thanks, that helps a lot. The "single global ECAM space" idea was
> eluding me, but in that context it all makes much more sense - I'm
> assuming the two quirked device IDs correspond to the 00:00.[01] devices
> and the [02-1e]:00.0 ones.
> 
> So (at the risk of Jon mooing at me), I guess the DT description would
> be a single node looking something like:
> 
> pcie {
> 	reg = [global ECAM space for segment 0000];
> 
> 	msi-map = <0x0100 &its0 0x0100 0x1d00>,
> 		  <0x1f00 &its1 0x1f00 0x1d00>;
> 	iommu-map = <0x0200 &smmu0 0x0200 0x1c00>,
> 		    <0x2000 &smmu0 0x2000 0x1c00>;
> };
> 
> (note to self: which incidentally also means of_pci_map_rid() probably
> wants fixing to not treat gaps in the map as an error)
> 
> With only one node like that, rather than having the whole first 3
> levels of bridges described, the "stop at the appropriate node in the
> callback" approach does become even more impractical in all cases. So,
> for $TITLE, based on the above understanding:
> 
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>

Hi Bjorn,

This seems to be the reasonable way to add support for the quirk. 
Would really appreciate feedback from you.

Thanks,
JC.
Bjorn Helgaas April 11, 2017, 1:28 a.m. UTC | #5
Hi Jayachandran,

On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> topology is slightly unusual. For a multi-node system, it looks like:
> 
> [node level PCI bridges - one per node]
>     [SoC PCI devices with MSI-X but no IOMMU]
>     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
>         [PCIe real root ports associated with IOMMU and GICv3 ITS]
>             [External PCI devices connected to PCIe links]
> 
> The top two levels of bridges should have introduced aliases since they
> are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> In the case of external PCIe devices, the "real" root ports are connected
> to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> node level bridges do not introduce an alias either.
> 
> To handle this quirk, we mark the real PCIe root ports and node level
> PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> pci_for_each_dma_alias() works correctly for external PCIe devices and
> SoC PCI devices.
> 
> For the current revision of Cavium ThunderX2, the VendorID and Device ID
> are from Broadcom Vulcan (14e4:90XX).

Can you supply some text here about why we want to apply this patch?
E.g., does it avoid making unnecessary IOMMU mappings, improve
performance, avoid a crash, etc?

> Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
> ---
>  drivers/pci/quirks.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index 6736836..564a84a 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
>  
>  /*
> + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
> + * associated not at the root bus, but at a bridge below. This quirk flag
> + * will ensure that the aliases are identified correctly.
> + */
> +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
> +{
> +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
> +}
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
> +				quirk_bridge_cavm_thrx2_pcie_root);
> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
> +				quirk_bridge_cavm_thrx2_pcie_root);
> +
> +/*
>   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
>   * class code.  Fix it.
>   */
> -- 
> 2.7.4
> 
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Jayachandran C April 11, 2017, 7:10 a.m. UTC | #6
On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> Hi Jayachandran,
> 
> On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > topology is slightly unusual. For a multi-node system, it looks like:
> > 
> > [node level PCI bridges - one per node]
> >     [SoC PCI devices with MSI-X but no IOMMU]
> >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> >             [External PCI devices connected to PCIe links]
> > 
> > The top two levels of bridges should have introduced aliases since they
> > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > In the case of external PCIe devices, the "real" root ports are connected
> > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > node level bridges do not introduce an alias either.
> > 
> > To handle this quirk, we mark the real PCIe root ports and node level
> > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > SoC PCI devices.
> > 
> > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > are from Broadcom Vulcan (14e4:90XX).
> 
> Can you supply some text here about why we want to apply this patch?
> E.g., does it avoid making unnecessary IOMMU mappings, improve
> performance, avoid a crash, etc?

If this is for the commit message, I hope the following is ok:

"With this change, both MSI-X and IO virtualization work correctly on
Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
devices, and the IOMMU groups are setup correctly."

I can send out a new patch if needed.

The on chip SATA and USB use MSI-X, so this is needed for basic
functionality of the platform.

> 
> > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
> > ---
> >  drivers/pci/quirks.c | 14 ++++++++++++++
> >  1 file changed, 14 insertions(+)
> > 
> > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > index 6736836..564a84a 100644
> > --- a/drivers/pci/quirks.c
> > +++ b/drivers/pci/quirks.c
> > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
> >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
> >  
> >  /*
> > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
> > + * associated not at the root bus, but at a bridge below. This quirk flag
> > + * will ensure that the aliases are identified correctly.
> > + */
> > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
> > +{
> > +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
> > +}
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
> > +				quirk_bridge_cavm_thrx2_pcie_root);
> > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
> > +				quirk_bridge_cavm_thrx2_pcie_root);
> > +
> > +/*
> >   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
> >   * class code.  Fix it.
> >   */

Thanks,
JC.
Bjorn Helgaas April 11, 2017, 1:41 p.m. UTC | #7
[+cc Joerg]

On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > Hi Jayachandran,
> > 
> > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > topology is slightly unusual. For a multi-node system, it looks like:
> > > 
> > > [node level PCI bridges - one per node]
> > >     [SoC PCI devices with MSI-X but no IOMMU]
> > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > >             [External PCI devices connected to PCIe links]
> > > 
> > > The top two levels of bridges should have introduced aliases since they
> > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > In the case of external PCIe devices, the "real" root ports are connected
> > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > node level bridges do not introduce an alias either.
> > > 
> > > To handle this quirk, we mark the real PCIe root ports and node level
> > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > SoC PCI devices.
> > > 
> > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > are from Broadcom Vulcan (14e4:90XX).
> > 
> > Can you supply some text here about why we want to apply this patch?
> > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > performance, avoid a crash, etc?
> 
> If this is for the commit message, I hope the following is ok:
> 
> "With this change, both MSI-X and IO virtualization work correctly on
> Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> devices, and the IOMMU groups are setup correctly."

This doesn't get at what the actual problem is.  I'm hoping for
something like "without this change, we set up an IOMMU mapping for
requestor ID X, but device DMA uses requestor ID Y because ...., which
results in an IOMMU fault"

I've been puzzling over the fact that most of the callers of
pci_for_each_dma_alias() don't seem to use it correctly.  For Intel
IOMMUs, domain_context_mapping() uses it to add a mapping for every
possible alias.  But most of the other callers only look at the last
alias and ignore all the others.  That might work most of the time,
but:

  - There's no guarantee that pci_for_each_dma_alias() iterates in any
    particular order, so relying on the current order is fragile,

  - The pci_add_dma_alias() interface allows an arbitrary number of
    aliases (as long as they're all on the same bus), and some devices
    do use more than one, e.g., quirk_dma_func0_alias(),
    quirk_mic_x200_dma_alias(),

  - pci_for_each_dma_alias() translates the rules in the PCIe to
    PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into
    aliases.  I think it's important to pay attention to *every*
    possible alias, not just the last one.

I suspect the reason this patch makes a difference is because the
current pci_for_each_dma_alias() believes one of those top-level
bridges is an alias, and the iterator produces it last, so that's the
one you map.  The IOMMU is attached lower down, so that top-level
bridge is not in fact an alias, but since you only look at the *last*
one, you don't map the correct aliases from lower down in the tree.

Stopping the iterator earlier happens to make the last alias be one of
the correct ones, but it doesn't solve the problems of quirked devices
that can use multiple requester IDs, and it doesn't solve the problem
of PCIe-to-PCI bridges that optionally take ownership of transactions.

> I can send out a new patch if needed.
> 
> The on chip SATA and USB use MSI-X, so this is needed for basic
> functionality of the platform.

No need for a new patch; I can integrate something into the changelog.

> > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
> > > ---
> > >  drivers/pci/quirks.c | 14 ++++++++++++++
> > >  1 file changed, 14 insertions(+)
> > > 
> > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > index 6736836..564a84a 100644
> > > --- a/drivers/pci/quirks.c
> > > +++ b/drivers/pci/quirks.c
> > > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
> > >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
> > >  
> > >  /*
> > > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
> > > + * associated not at the root bus, but at a bridge below. This quirk flag
> > > + * will ensure that the aliases are identified correctly.
> > > + */
> > > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
> > > +{
> > > +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
> > > +}
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
> > > +				quirk_bridge_cavm_thrx2_pcie_root);
> > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
> > > +				quirk_bridge_cavm_thrx2_pcie_root);
> > > +
> > > +/*
> > >   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
> > >   * class code.  Fix it.
> > >   */
> 
> Thanks,
> JC.
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Jayachandran C April 11, 2017, 3:27 p.m. UTC | #8
On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> [+cc Joerg]
> 
> On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > Hi Jayachandran,
> > > 
> > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > 
> > > > [node level PCI bridges - one per node]
> > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > >             [External PCI devices connected to PCIe links]
> > > > 
> > > > The top two levels of bridges should have introduced aliases since they
> > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > node level bridges do not introduce an alias either.
> > > > 
> > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > SoC PCI devices.
> > > > 
> > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > are from Broadcom Vulcan (14e4:90XX).
> > > 
> > > Can you supply some text here about why we want to apply this patch?
> > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > performance, avoid a crash, etc?
> > 
> > If this is for the commit message, I hope the following is ok:
> > 
> > "With this change, both MSI-X and IO virtualization work correctly on
> > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > devices, and the IOMMU groups are setup correctly."
> 
> This doesn't get at what the actual problem is.  I'm hoping for
> something like "without this change, we set up an IOMMU mapping for
> requestor ID X, but device DMA uses requestor ID Y because ...., which
> results in an IOMMU fault"

Ok. I hope this would be better:

"Without this change, the last alias seen while traversing the PCI
hierarchy will be used as the RID to generate the device ID for ITS
and stream ID for SMMU. This in turn causes the MSI-X generated by the
device to fail since the ITS expects to have translation tables based
on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
device DMA also fails when SMMU is enabled due to incorrect value in
SMMU translation tables"

> I've been puzzling over the fact that most of the callers of
> pci_for_each_dma_alias() don't seem to use it correctly.  For Intel
> IOMMUs, domain_context_mapping() uses it to add a mapping for every
> possible alias.  But most of the other callers only look at the last
> alias and ignore all the others.  That might work most of the time,
> but:
> 
>   - There's no guarantee that pci_for_each_dma_alias() iterates in any
>     particular order, so relying on the current order is fragile,
> 
>   - The pci_add_dma_alias() interface allows an arbitrary number of
>     aliases (as long as they're all on the same bus), and some devices
>     do use more than one, e.g., quirk_dma_func0_alias(),
>     quirk_mic_x200_dma_alias(),
> 
>   - pci_for_each_dma_alias() translates the rules in the PCIe to
>     PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into
>     aliases.  I think it's important to pay attention to *every*
>     possible alias, not just the last one.

pci_for_each_dma_alias() is used by the ARM code to find the RID
(Requester ID), and this is taken as the last alias as seen from the
PCI controller (RC). The RID is then used to program the Device ID
of the GIC ITS (ARM generic interrupt controller's interrupt translation
service) for MSI-X (and similarly to program Stream ID of the SMMU).

The translation from RID to Device ID or stream ID is provided by the
IORT ACPI table[1] or by the a {iommu,msi}-{map,mask} [2] property in
the device tree.

Taking the last alias maybe reasonable since the mapping is from
(PCI RC, RID) to (SMMU, streamID) or (GIC ITS, deviceID) and we are
looking for a single the RID for a device as seen from the controller.

> I suspect the reason this patch makes a difference is because the
> current pci_for_each_dma_alias() believes one of those top-level
> bridges is an alias, and the iterator produces it last, so that's the
> one you map.  The IOMMU is attached lower down, so that top-level
> bridge is not in fact an alias, but since you only look at the *last*
> one, you don't map the correct aliases from lower down in the tree.

Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which
means that a PCI RC can multiple SMMUs, each handling a subset of RIDs.

In the case of Cavium ThunderX2, the RID which we should see on the RC
- if we follow the standard and factor in the aliasing introduced by the
PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or
ITS).

But, if we stop the traversal at the point where SMMU (or ITS) is
attached, we will get the correct RID as seen by these.

> Stopping the iterator earlier happens to make the last alias be one of
> the correct ones, but it doesn't solve the problems of quirked devices
> that can use multiple requester IDs, and it doesn't solve the problem
> of PCIe-to-PCI bridges that optionally take ownership of transactions.
 
If these happen below the point where the SMMU is attached, we will
consider the last alias introduced, which should be ok. If they are
above, the alias introduced is not relevant.  Devices with multiple
aliases is not handled anywhere in ARM code, so I don't think we should
consider that here.

> > I can send out a new patch if needed.
> > 
> > The on chip SATA and USB use MSI-X, so this is needed for basic
> > functionality of the platform.
> 
> No need for a new patch; I can integrate something into the changelog.
> 
> > > > Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
> > > > ---
> > > >  drivers/pci/quirks.c | 14 ++++++++++++++
> > > >  1 file changed, 14 insertions(+)
> > > > 
> > > > diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> > > > index 6736836..564a84a 100644
> > > > --- a/drivers/pci/quirks.c
> > > > +++ b/drivers/pci/quirks.c
> > > > @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
> > > >  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
> > > >  
> > > >  /*
> > > > + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
> > > > + * associated not at the root bus, but at a bridge below. This quirk flag
> > > > + * will ensure that the aliases are identified correctly.
> > > > + */
> > > > +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
> > > > +{
> > > > +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
> > > > +}
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
> > > > +				quirk_bridge_cavm_thrx2_pcie_root);
> > > > +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
> > > > +				quirk_bridge_cavm_thrx2_pcie_root);
> > > > +
> > > > +/*
> > > >   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
> > > >   * class code.  Fix it.
> > > >   */

Thanks,
JC.
[1] http://infocenter.arm.com/help/topic/com.arm.doc.den0049b/DEN0049B_IO_Remapping_Table.pdf
[2] https://www.kernel.org/doc/Documentation/devicetree/bindings/pci/pci-iommu.txt
Robin Murphy April 11, 2017, 3:34 p.m. UTC | #9
On 11/04/17 14:41, Bjorn Helgaas wrote:
> [+cc Joerg]
> 
> On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
>> On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
>>> Hi Jayachandran,
>>>
>>> On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
>>>> The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
>>>> topology is slightly unusual. For a multi-node system, it looks like:
>>>>
>>>> [node level PCI bridges - one per node]
>>>>     [SoC PCI devices with MSI-X but no IOMMU]
>>>>     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
>>>>         [PCIe real root ports associated with IOMMU and GICv3 ITS]
>>>>             [External PCI devices connected to PCIe links]
>>>>
>>>> The top two levels of bridges should have introduced aliases since they
>>>> are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
>>>> In the case of external PCIe devices, the "real" root ports are connected
>>>> to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
>>>> alias. The SoC PCI devices are directly connected to the GIC ITS, so the
>>>> node level bridges do not introduce an alias either.
>>>>
>>>> To handle this quirk, we mark the real PCIe root ports and node level
>>>> PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
>>>> pci_for_each_dma_alias() works correctly for external PCIe devices and
>>>> SoC PCI devices.
>>>>
>>>> For the current revision of Cavium ThunderX2, the VendorID and Device ID
>>>> are from Broadcom Vulcan (14e4:90XX).
>>>
>>> Can you supply some text here about why we want to apply this patch?
>>> E.g., does it avoid making unnecessary IOMMU mappings, improve
>>> performance, avoid a crash, etc?
>>
>> If this is for the commit message, I hope the following is ok:
>>
>> "With this change, both MSI-X and IO virtualization work correctly on
>> Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
>> configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
>> devices, and the IOMMU groups are setup correctly."
> 
> This doesn't get at what the actual problem is.  I'm hoping for
> something like "without this change, we set up an IOMMU mapping for
> requestor ID X, but device DMA uses requestor ID Y because ...., which
> results in an IOMMU fault"

The fact that certain bits of code only consider the last-seen alias RID
does mean we will install all mappings for a RID the IOMMU will never
see, causing subsequent attempt to access them (via the real RID) to
fault. Even with that fixed, considering the past-the-IOMMU-connection
bridges will cause IOMMU group assignment to believe the IOMMU can't
distinguish them and thus prevent individual devices being assignable to
different VMs.

> I've been puzzling over the fact that most of the callers of
> pci_for_each_dma_alias() don't seem to use it correctly.  For Intel
> IOMMUs, domain_context_mapping() uses it to add a mapping for every
> possible alias.  But most of the other callers only look at the last
> alias and ignore all the others.

As it happens, I've just been looking into this and reaching the same
conclusion (not least because a fair few of of the incorrect uses have
my fingerprints all over them). The of_iommu_configure() and
corresponding iort_iommu_configure() uses turn out to already be broken
WRT DMA phantom functions vs. MSIs, but pretty straightforward to fix
(and I now have one of the offending Marvell SATA cards to test things
with). The one in its_pci_msi_prepare() turns out to be totally
backwards, as it actually wants to iterate through every device which
may alias with the given device (any suggestions for the neatest way to
do that are most welcome).

The other uses in pci_msi_*() are a little trickier, as in general we're
assuming every device is associated with just a single ID (certainly for
ARM GICv3 this assumption runs quite strongly through the architecture)
- Marc and I have chucked some ideas around, but in the short term, it
might be easier to just not even pretend to support MSIs from behind
aliasing bridges for the DT/IORT cases. Either way, those are also
broken for the phantom function case (and the tentative fix I've written
will need rethinking in light of this discussion, oh well).

>  That might work most of the time,
> but:
> 
>   - There's no guarantee that pci_for_each_dma_alias() iterates in any
>     particular order, so relying on the current order is fragile,
> 
>   - The pci_add_dma_alias() interface allows an arbitrary number of
>     aliases (as long as they're all on the same bus), and some devices
>     do use more than one, e.g., quirk_dma_func0_alias(),
>     quirk_mic_x200_dma_alias(),
> 
>   - pci_for_each_dma_alias() translates the rules in the PCIe to
>     PCI/PCI-X Bridge spec, r1.0, sec 2.3, about taking ownership into
>     aliases.  I think it's important to pay attention to *every*
>     possible alias, not just the last one.

Ha, that exact page is still open on my desktop since the "oh crap!"
moment last Friday :)

If I've interpreted that spec correctly, and it definitely is the case
that a bridge may alias or not on a per-transaction basis, then that
does end up making matters simpler; I can remove all the attempts to
skip any IDs that the IOMMU is guaranteed never to see due to aliasing,
if that set is in fact empty.

> I suspect the reason this patch makes a difference is because the
> current pci_for_each_dma_alias() believes one of those top-level
> bridges is an alias, and the iterator produces it last, so that's the
> one you map.  The IOMMU is attached lower down, so that top-level
> bridge is not in fact an alias, but since you only look at the *last*
> one, you don't map the correct aliases from lower down in the tree.
> 
> Stopping the iterator earlier happens to make the last alias be one of
> the correct ones, but it doesn't solve the problems of quirked devices
> that can use multiple requester IDs, and it doesn't solve the problem
> of PCIe-to-PCI bridges that optionally take ownership of transactions.

Yes, that's pretty much the state of things, other than also solving the
legitimate problem of get_pci_alias_group() going too far and inferring
a false lack of isolation.

Robin.

>> I can send out a new patch if needed.
>>
>> The on chip SATA and USB use MSI-X, so this is needed for basic
>> functionality of the platform.
> 
> No need for a new patch; I can integrate something into the changelog.
> 
>>>> Signed-off-by: Jayachandran C <jnair@caviumnetworks.com>
>>>> ---
>>>>  drivers/pci/quirks.c | 14 ++++++++++++++
>>>>  1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
>>>> index 6736836..564a84a 100644
>>>> --- a/drivers/pci/quirks.c
>>>> +++ b/drivers/pci/quirks.c
>>>> @@ -3958,6 +3958,20 @@ DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
>>>>  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
>>>>  
>>>>  /*
>>>> + * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
>>>> + * associated not at the root bus, but at a bridge below. This quirk flag
>>>> + * will ensure that the aliases are identified correctly.
>>>> + */
>>>> +static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
>>>> +{
>>>> +	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
>>>> +}
>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
>>>> +				quirk_bridge_cavm_thrx2_pcie_root);
>>>> +DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
>>>> +				quirk_bridge_cavm_thrx2_pcie_root);
>>>> +
>>>> +/*
>>>>   * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
>>>>   * class code.  Fix it.
>>>>   */
>>
>> Thanks,
>> JC.
>>
>> _______________________________________________
>> linux-arm-kernel mailing list
>> linux-arm-kernel@lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
Jon Masters April 11, 2017, 3:43 p.m. UTC | #10
On 04/11/2017 11:27 AM, Jayachandran C wrote:
> On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:

>> I suspect the reason this patch makes a difference is because the
>> current pci_for_each_dma_alias() believes one of those top-level
>> bridges is an alias, and the iterator produces it last, so that's the
>> one you map.  The IOMMU is attached lower down, so that top-level
>> bridge is not in fact an alias, but since you only look at the *last*
>> one, you don't map the correct aliases from lower down in the tree.
> 
> Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which
> means that a PCI RC can multiple SMMUs, each handling a subset of RIDs.
> 
> In the case of Cavium ThunderX2, the RID which we should see on the RC
> - if we follow the standard and factor in the aliasing introduced by the
> PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or
> ITS).
> 
> But, if we stop the traversal at the point where SMMU (or ITS) is
> attached, we will get the correct RID as seen by these.

Side note that I am trying to get various specifications clarified to
promote more of a familiar alternative architecture (x86) approach in
the future in which these aren't at different levels in the topology.
But to do that requires integrated Root Complex IP with bells/whistles.

Jon.
Bjorn Helgaas April 12, 2017, 4:21 p.m. UTC | #11
On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > [+cc Joerg]
> > 
> > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > Hi Jayachandran,
> > > > 
> > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > 
> > > > > [node level PCI bridges - one per node]
> > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > >             [External PCI devices connected to PCIe links]
> > > > > 
> > > > > The top two levels of bridges should have introduced aliases since they
> > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > node level bridges do not introduce an alias either.
> > > > > 
> > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > SoC PCI devices.
> > > > > 
> > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > 
> > > > Can you supply some text here about why we want to apply this patch?
> > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > performance, avoid a crash, etc?
> > > 
> > > If this is for the commit message, I hope the following is ok:
> > > 
> > > "With this change, both MSI-X and IO virtualization work correctly on
> > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > devices, and the IOMMU groups are setup correctly."
> > 
> > This doesn't get at what the actual problem is.  I'm hoping for
> > something like "without this change, we set up an IOMMU mapping for
> > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > results in an IOMMU fault"
> 
> Ok. I hope this would be better:
> 
> "Without this change, the last alias seen while traversing the PCI
> hierarchy will be used as the RID to generate the device ID for ITS
> and stream ID for SMMU. This in turn causes the MSI-X generated by the
> device to fail since the ITS expects to have translation tables based
> on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> device DMA also fails when SMMU is enabled due to incorrect value in
> SMMU translation tables"

This description is true, but I don't think it addresses the real
problem.  I think the real problem is that your IOMMU code doesn't
handle aliases correctly, and by ignoring these invalid aliases, we
happen to map an alias that works for the builtin devices.  But that's
only because we got lucky (those devices use a single RID and they're
not behind bridges that optionally take ownership).

It would make sense to me if we fixed the IOMMU code to map *all* the
aliases, which should be enough to make your devices work.  If we then
wanted to apply a patch like this on top, it would be simply an
optimization that avoids unnecessary IOMMU mappings.

> > I suspect the reason this patch makes a difference is because the
> > current pci_for_each_dma_alias() believes one of those top-level
> > bridges is an alias, and the iterator produces it last, so that's the
> > one you map.  The IOMMU is attached lower down, so that top-level
> > bridge is not in fact an alias, but since you only look at the *last*
> > one, you don't map the correct aliases from lower down in the tree.
> 
> Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which
> means that a PCI RC can multiple SMMUs, each handling a subset of RIDs.
> 
> In the case of Cavium ThunderX2, the RID which we should see on the RC
> - if we follow the standard and factor in the aliasing introduced by the
> PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or
> ITS).
> 
> But, if we stop the traversal at the point where SMMU (or ITS) is
> attached, we will get the correct RID as seen by these.

There is a single "correct RID" for your builtin SATA and USB, but in
general there is no single RID.

> > Stopping the iterator earlier happens to make the last alias be one of
> > the correct ones, but it doesn't solve the problems of quirked devices
> > that can use multiple requester IDs, and it doesn't solve the problem
> > of PCIe-to-PCI bridges that optionally take ownership of transactions.
>  
> If these happen below the point where the SMMU is attached, we will
> consider the last alias introduced, which should be ok. If they are
> above, the alias introduced is not relevant.  Devices with multiple
> aliases is not handled anywhere in ARM code, so I don't think we should
> consider that here.

I think we *should* consider it here.  The multiple alias situation is
generic PCIe, independent of ARM.  If you want to support arbitrary
PCIe plugin devices, you have to handle it.  I think any device with a
quirk that calls pci_add_dma_alias() will currently fail on your
system.  And I think devices behind a bridge that optionally takes
ownership of DMA transactions will also fail.

Bjorn
Jayachandran C April 12, 2017, 6:10 p.m. UTC | #12
On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote:
> On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > > [+cc Joerg]
> > > 
> > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > > Hi Jayachandran,
> > > > > 
> > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > > 
> > > > > > [node level PCI bridges - one per node]
> > > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > > >             [External PCI devices connected to PCIe links]
> > > > > > 
> > > > > > The top two levels of bridges should have introduced aliases since they
> > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > > node level bridges do not introduce an alias either.
> > > > > > 
> > > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > > SoC PCI devices.
> > > > > > 
> > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > > 
> > > > > Can you supply some text here about why we want to apply this patch?
> > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > > performance, avoid a crash, etc?
> > > > 
> > > > If this is for the commit message, I hope the following is ok:
> > > > 
> > > > "With this change, both MSI-X and IO virtualization work correctly on
> > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > > devices, and the IOMMU groups are setup correctly."
> > > 
> > > This doesn't get at what the actual problem is.  I'm hoping for
> > > something like "without this change, we set up an IOMMU mapping for
> > > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > > results in an IOMMU fault"
> > 
> > Ok. I hope this would be better:
> > 
> > "Without this change, the last alias seen while traversing the PCI
> > hierarchy will be used as the RID to generate the device ID for ITS
> > and stream ID for SMMU. This in turn causes the MSI-X generated by the
> > device to fail since the ITS expects to have translation tables based
> > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> > device DMA also fails when SMMU is enabled due to incorrect value in
> > SMMU translation tables"
> 
> This description is true, but I don't think it addresses the real
> problem.  I think the real problem is that your IOMMU code doesn't
> handle aliases correctly, and by ignoring these invalid aliases, we
> happen to map an alias that works for the builtin devices.  But that's
> only because we got lucky (those devices use a single RID and they're
> not behind bridges that optionally take ownership).
> 
> It would make sense to me if we fixed the IOMMU code to map *all* the
> aliases, which should be enough to make your devices work.  If we then
> wanted to apply a patch like this on top, it would be simply an
> optimization that avoids unnecessary IOMMU mappings.

The issue that the IOMMU code does not handle valid aliases is
unrelated to what I am trying to fix. The quirk is to make sure
that invalid aliases are not seen on ThunderX2 while doing
pci_for_each_dma_alias().

The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point
where the SMMU (or ITS) is attached, i.e. at the bridge marked with
PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look
for aliases above that point.

The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since
the on-chip devices are directly connected to the ITS (they do not
use SMMU).

The top two levels of bridges are not real PCI bridges but just
PCI bridge-like things that were added to tie the whole hierarchy
together for configuration and enumeration. They do not handle
PCI/PCIe transactions in the traditional sense.

I think my problem description is still not correct, maybe:
"The SMMU (and ITS) expects to device tables to use the RID seen
at the bridge they are associated with. Currently the
pci_for_each_dma_alias() code traverses beyond this point and
generates incorrect aliases due to the PCI and PCI/PCIe bridges
above. This causes MSI-X interrupts and device DMA to fail since
the SMMU and ITS tables to be setup with incorrect IDs"

> > > I suspect the reason this patch makes a difference is because the
> > > current pci_for_each_dma_alias() believes one of those top-level
> > > bridges is an alias, and the iterator produces it last, so that's the
> > > one you map.  The IOMMU is attached lower down, so that top-level
> > > bridge is not in fact an alias, but since you only look at the *last*
> > > one, you don't map the correct aliases from lower down in the tree.
> > 
> > Exactly. The IORT spec allows a range of RIDs to map to an SMMU, which
> > means that a PCI RC can multiple SMMUs, each handling a subset of RIDs.
> > 
> > In the case of Cavium ThunderX2, the RID which we should see on the RC
> > - if we follow the standard and factor in the aliasing introduced by the
> > PCI bridge and the PCI/PCIe bridge - is not the RID seen by the SMMU (or
> > ITS).
> > 
> > But, if we stop the traversal at the point where SMMU (or ITS) is
> > attached, we will get the correct RID as seen by these.
> 
> There is a single "correct RID" for your builtin SATA and USB, but in
> general there is no single RID.

In case of external PCIe devices too, the RID (or aliases) seen to the
point where SMMU or ITS is connected (i.e, the bridge marked with flag
PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT) are correct.
 
> > > Stopping the iterator earlier happens to make the last alias be one of
> > > the correct ones, but it doesn't solve the problems of quirked devices
> > > that can use multiple requester IDs, and it doesn't solve the problem
> > > of PCIe-to-PCI bridges that optionally take ownership of transactions.
> >  
> > If these happen below the point where the SMMU is attached, we will
> > consider the last alias introduced, which should be ok. If they are
> > above, the alias introduced is not relevant.  Devices with multiple
> > aliases is not handled anywhere in ARM code, so I don't think we should
> > consider that here.
> 
> I think we *should* consider it here.  The multiple alias situation is
> generic PCIe, independent of ARM.  If you want to support arbitrary
> PCIe plugin devices, you have to handle it.  I think any device with a
> quirk that calls pci_add_dma_alias() will currently fail on your
> system.  And I think devices behind a bridge that optionally takes
> ownership of DMA transactions will also fail.
 
The issue that IOMMU code does not handle all aliases has to be addressed
separately (and Robin seems to be looking at this from his response).

And even when that is addressed, this quirk is still needed on ThunderX2,
as I pointed out above.

Hope I am still not missing the point, and thanks for your patience here.
JC.
Bjorn Helgaas April 12, 2017, 7:11 p.m. UTC | #13
On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote:
> On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote:
> > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > > > [+cc Joerg]
> > > > 
> > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > > > Hi Jayachandran,
> > > > > > 
> > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > > > 
> > > > > > > [node level PCI bridges - one per node]
> > > > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > > > >             [External PCI devices connected to PCIe links]
> > > > > > > 
> > > > > > > The top two levels of bridges should have introduced aliases since they
> > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > > > node level bridges do not introduce an alias either.
> > > > > > > 
> > > > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > > > SoC PCI devices.
> > > > > > > 
> > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > > > 
> > > > > > Can you supply some text here about why we want to apply this patch?
> > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > > > performance, avoid a crash, etc?
> > > > > 
> > > > > If this is for the commit message, I hope the following is ok:
> > > > > 
> > > > > "With this change, both MSI-X and IO virtualization work correctly on
> > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > > > devices, and the IOMMU groups are setup correctly."
> > > > 
> > > > This doesn't get at what the actual problem is.  I'm hoping for
> > > > something like "without this change, we set up an IOMMU mapping for
> > > > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > > > results in an IOMMU fault"
> > > 
> > > Ok. I hope this would be better:
> > > 
> > > "Without this change, the last alias seen while traversing the PCI
> > > hierarchy will be used as the RID to generate the device ID for ITS
> > > and stream ID for SMMU. This in turn causes the MSI-X generated by the
> > > device to fail since the ITS expects to have translation tables based
> > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> > > device DMA also fails when SMMU is enabled due to incorrect value in
> > > SMMU translation tables"
> > 
> > This description is true, but I don't think it addresses the real
> > problem.  I think the real problem is that your IOMMU code doesn't
> > handle aliases correctly, and by ignoring these invalid aliases, we
> > happen to map an alias that works for the builtin devices.  But that's
> > only because we got lucky (those devices use a single RID and they're
> > not behind bridges that optionally take ownership).
> > 
> > It would make sense to me if we fixed the IOMMU code to map *all* the
> > aliases, which should be enough to make your devices work.  If we then
> > wanted to apply a patch like this on top, it would be simply an
> > optimization that avoids unnecessary IOMMU mappings.
> 
> The issue that the IOMMU code does not handle valid aliases is
> unrelated to what I am trying to fix. The quirk is to make sure
> that invalid aliases are not seen on ThunderX2 while doing
> pci_for_each_dma_alias().
> 
> The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point
> where the SMMU (or ITS) is attached, i.e. at the bridge marked with
> PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look
> for aliases above that point.
> 
> The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since
> the on-chip devices are directly connected to the ITS (they do not
> use SMMU).
> 
> The top two levels of bridges are not real PCI bridges but just
> PCI bridge-like things that were added to tie the whole hierarchy
> together for configuration and enumeration. They do not handle
> PCI/PCIe transactions in the traditional sense.
> 
> I think my problem description is still not correct, maybe:
> "The SMMU (and ITS) expects to device tables to use the RID seen
> at the bridge they are associated with. Currently the
> pci_for_each_dma_alias() code traverses beyond this point and
> generates incorrect aliases due to the PCI and PCI/PCIe bridges
> above. This causes MSI-X interrupts and device DMA to fail since
> the SMMU and ITS tables to be setup with incorrect IDs"

I haven't tried to figure out the MSI-X piece of this, but let me try
to come up with a concrete DMA example.  Assume this topology:

  00:00.0 bridge to [bus 01-1e]
  01:0a.0 bridge to [bus 04-05]
  04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT)
  05:00.0 endpoint

Assume 05:00.0 generates a DMA request.  Assume the top two bridges
are such that pci_for_each_dma_alias() includes them as well, so it
iterates through 05:00.0, 01:0a.0, and 00:00.0.

When the driver for 05:00.0 makes a DMA mapping, the current code
apparently makes an IOMMU mapping for requester ID 00:00.0 because
that's the last alias.  Obviously this doesn't work because the IOMMU
at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0.

With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an
IOMMU mapping for requester ID 05:00.0, which will work fine.  I think
it would *also* work fine if we made IOMMU mappings for 05:00.0,
01:0a.0, and 00:0.0.  The last two are unnecessary, but probably not
harmful.

Now assume 05:00 is a multi-function device that has a DMA alias
quirk, e.g., see quirk_dma_func0_alias().  It has another function:

  05:00.3 endpoint

DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0.
The driver makes a DMA mapping, pci_for_each_dma_alias() iterates
through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only
00:00.0, which again doesn't work.

With this quirk, we create a single mapping for 05:00.0.  That will
work sometimes, but the device may also generate DMA with a requester
ID of 05:00.3, and that won't work.

If your on-chip device is, e.g., 01:04.0, pci_for_each_dma_alias()
probably iterates through 01:04.0, 00:00.0.  Today we make an IOMMU
mapping for 00:00.0, which doesn't work.  With this quirk, we'll
ignore 00:00.0 and make a mapping for 01:04.0, which does work.  But I
think if you made the IOMMU code add mappings for each of the aliases,
i.e., for both 01:04.0 and 00:00.0, that device *would* work even
without this quirk.

Bjorn
Jayachandran C April 12, 2017, 8:41 p.m. UTC | #14
On Wed, Apr 12, 2017 at 02:11:38PM -0500, Bjorn Helgaas wrote:
> On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote:
> > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > > > > [+cc Joerg]
> > > > > 
> > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > > > > Hi Jayachandran,
> > > > > > > 
> > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > > > > 
> > > > > > > > [node level PCI bridges - one per node]
> > > > > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > > > > >             [External PCI devices connected to PCIe links]
> > > > > > > > 
> > > > > > > > The top two levels of bridges should have introduced aliases since they
> > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > > > > node level bridges do not introduce an alias either.
> > > > > > > > 
> > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > > > > SoC PCI devices.
> > > > > > > > 
> > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > > > > 
> > > > > > > Can you supply some text here about why we want to apply this patch?
> > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > > > > performance, avoid a crash, etc?
> > > > > > 
> > > > > > If this is for the commit message, I hope the following is ok:
> > > > > > 
> > > > > > "With this change, both MSI-X and IO virtualization work correctly on
> > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > > > > devices, and the IOMMU groups are setup correctly."
> > > > > 
> > > > > This doesn't get at what the actual problem is.  I'm hoping for
> > > > > something like "without this change, we set up an IOMMU mapping for
> > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > > > > results in an IOMMU fault"
> > > > 
> > > > Ok. I hope this would be better:
> > > > 
> > > > "Without this change, the last alias seen while traversing the PCI
> > > > hierarchy will be used as the RID to generate the device ID for ITS
> > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the
> > > > device to fail since the ITS expects to have translation tables based
> > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> > > > device DMA also fails when SMMU is enabled due to incorrect value in
> > > > SMMU translation tables"
> > > 
> > > This description is true, but I don't think it addresses the real
> > > problem.  I think the real problem is that your IOMMU code doesn't
> > > handle aliases correctly, and by ignoring these invalid aliases, we
> > > happen to map an alias that works for the builtin devices.  But that's
> > > only because we got lucky (those devices use a single RID and they're
> > > not behind bridges that optionally take ownership).
> > > 
> > > It would make sense to me if we fixed the IOMMU code to map *all* the
> > > aliases, which should be enough to make your devices work.  If we then
> > > wanted to apply a patch like this on top, it would be simply an
> > > optimization that avoids unnecessary IOMMU mappings.
> > 
> > The issue that the IOMMU code does not handle valid aliases is
> > unrelated to what I am trying to fix. The quirk is to make sure
> > that invalid aliases are not seen on ThunderX2 while doing
> > pci_for_each_dma_alias().
> > 
> > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point
> > where the SMMU (or ITS) is attached, i.e. at the bridge marked with
> > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look
> > for aliases above that point.
> > 
> > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since
> > the on-chip devices are directly connected to the ITS (they do not
> > use SMMU).
> > 
> > The top two levels of bridges are not real PCI bridges but just
> > PCI bridge-like things that were added to tie the whole hierarchy
> > together for configuration and enumeration. They do not handle
> > PCI/PCIe transactions in the traditional sense.
> > 
> > I think my problem description is still not correct, maybe:
> > "The SMMU (and ITS) expects to device tables to use the RID seen
> > at the bridge they are associated with. Currently the
> > pci_for_each_dma_alias() code traverses beyond this point and
> > generates incorrect aliases due to the PCI and PCI/PCIe bridges
> > above. This causes MSI-X interrupts and device DMA to fail since
> > the SMMU and ITS tables to be setup with incorrect IDs"
> 
> I haven't tried to figure out the MSI-X piece of this, but let me try
> to come up with a concrete DMA example.  Assume this topology:
> 
>   00:00.0 bridge to [bus 01-1e]
>   01:0a.0 bridge to [bus 04-05]
>   04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT)
>   05:00.0 endpoint
> 
> Assume 05:00.0 generates a DMA request.  Assume the top two bridges
> are such that pci_for_each_dma_alias() includes them as well, so it
> iterates through 05:00.0, 01:0a.0, and 00:00.0.
> 
> When the driver for 05:00.0 makes a DMA mapping, the current code
> apparently makes an IOMMU mapping for requester ID 00:00.0 because
> that's the last alias.  Obviously this doesn't work because the IOMMU
> at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0.
> 
> With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an
> IOMMU mapping for requester ID 05:00.0, which will work fine.  I think
> it would *also* work fine if we made IOMMU mappings for 05:00.0,
> 01:0a.0, and 00:0.0.  The last two are unnecessary, but probably not
> harmful.
 
Ok. The last two are not harmful but still incorrect. The bridges
0:0.0 and 1:a.0 are outside the path with the PCI transactions takes
(they go from 4:0.0 to the SMMU).

> Now assume 05:00 is a multi-function device that has a DMA alias
> quirk, e.g., see quirk_dma_func0_alias().  It has another function:
> 
>   05:00.3 endpoint
> 
> DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0.
> The driver makes a DMA mapping, pci_for_each_dma_alias() iterates
> through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only
> 00:00.0, which again doesn't work.
> 
> With this quirk, we create a single mapping for 05:00.0.  That will
> work sometimes, but the device may also generate DMA with a requester
> ID of 05:00.3, and that won't work.

Note that here, pci_for_each_dma_alias() works correctly with the quirk
and incorrectly without the quirk. With the quirk, the callback function
is involved on 05:00.3 and 05:00.0 as expected. Without the quirk the
call back is invoked on 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, which
includes aliases that are not valid.

The idea behind the quirk is to get pci_for_each_dma_alias work correctly
on ThunderX2, and invoke the function only on the valid aliases. With the
quirk, the code using it - like the SMMU code - gets the correct aliases
to process, and do not have to deal with or filter out incorrect aliases.

I don't think it would be safe to assume that all callers of
pci_for_each_dma_alias will be ok to handle invalid aliases. Even in
the IOMMU case, the case which calculates stream IDs is one usage, the
usage for iommu groups also did not deal with invalid aliases correctly
when I tried it. Then there are the MSI cases too, I had done some work
on trying to filter out invalid aliases in the relevant code[1], but
in my opinion, getting pci_for_each_dma_alias() do the right thing is
the correct solution.

> If your on-chip device is, e.g., 01:04.0, pci_for_each_dma_alias()
> probably iterates through 01:04.0, 00:00.0.  Today we make an IOMMU
> mapping for 00:00.0, which doesn't work.  With this quirk, we'll
> ignore 00:00.0 and make a mapping for 01:04.0, which does work.  But I
> think if you made the IOMMU code add mappings for each of the aliases,
> i.e., for both 01:04.0 and 00:00.0, that device *would* work even
> without this quirk.

There is no IOMMU for the on-chip devices, but the MSI-X interrupts are
looked up in a table based on DeviceID(RID) by the ARM interrupt
controller. The issue here is similar.

JC.

[1]https://www.spinics.net/lists/arm-kernel/msg573744.html
Bjorn Helgaas April 12, 2017, 11:18 p.m. UTC | #15
On Wed, Apr 12, 2017 at 08:41:20PM +0000, Jayachandran C wrote:
> On Wed, Apr 12, 2017 at 02:11:38PM -0500, Bjorn Helgaas wrote:
> > On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote:
> > > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote:
> > > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> > > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > > > > > [+cc Joerg]
> > > > > > 
> > > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > > > > > Hi Jayachandran,
> > > > > > > > 
> > > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > > > > > 
> > > > > > > > > [node level PCI bridges - one per node]
> > > > > > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > > > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > > > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > > > > > >             [External PCI devices connected to PCIe links]
> > > > > > > > > 
> > > > > > > > > The top two levels of bridges should have introduced aliases since they
> > > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > > > > > node level bridges do not introduce an alias either.
> > > > > > > > > 
> > > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > > > > > SoC PCI devices.
> > > > > > > > > 
> > > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > > > > > 
> > > > > > > > Can you supply some text here about why we want to apply this patch?
> > > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > > > > > performance, avoid a crash, etc?
> > > > > > > 
> > > > > > > If this is for the commit message, I hope the following is ok:
> > > > > > > 
> > > > > > > "With this change, both MSI-X and IO virtualization work correctly on
> > > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > > > > > devices, and the IOMMU groups are setup correctly."
> > > > > > 
> > > > > > This doesn't get at what the actual problem is.  I'm hoping for
> > > > > > something like "without this change, we set up an IOMMU mapping for
> > > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > > > > > results in an IOMMU fault"
> > > > > 
> > > > > Ok. I hope this would be better:
> > > > > 
> > > > > "Without this change, the last alias seen while traversing the PCI
> > > > > hierarchy will be used as the RID to generate the device ID for ITS
> > > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the
> > > > > device to fail since the ITS expects to have translation tables based
> > > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> > > > > device DMA also fails when SMMU is enabled due to incorrect value in
> > > > > SMMU translation tables"
> > > > 
> > > > This description is true, but I don't think it addresses the real
> > > > problem.  I think the real problem is that your IOMMU code doesn't
> > > > handle aliases correctly, and by ignoring these invalid aliases, we
> > > > happen to map an alias that works for the builtin devices.  But that's
> > > > only because we got lucky (those devices use a single RID and they're
> > > > not behind bridges that optionally take ownership).
> > > > 
> > > > It would make sense to me if we fixed the IOMMU code to map *all* the
> > > > aliases, which should be enough to make your devices work.  If we then
> > > > wanted to apply a patch like this on top, it would be simply an
> > > > optimization that avoids unnecessary IOMMU mappings.
> > > 
> > > The issue that the IOMMU code does not handle valid aliases is
> > > unrelated to what I am trying to fix. The quirk is to make sure
> > > that invalid aliases are not seen on ThunderX2 while doing
> > > pci_for_each_dma_alias().
> > > 
> > > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point
> > > where the SMMU (or ITS) is attached, i.e. at the bridge marked with
> > > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look
> > > for aliases above that point.
> > > 
> > > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since
> > > the on-chip devices are directly connected to the ITS (they do not
> > > use SMMU).
> > > 
> > > The top two levels of bridges are not real PCI bridges but just
> > > PCI bridge-like things that were added to tie the whole hierarchy
> > > together for configuration and enumeration. They do not handle
> > > PCI/PCIe transactions in the traditional sense.
> > > 
> > > I think my problem description is still not correct, maybe:
> > > "The SMMU (and ITS) expects to device tables to use the RID seen
> > > at the bridge they are associated with. Currently the
> > > pci_for_each_dma_alias() code traverses beyond this point and
> > > generates incorrect aliases due to the PCI and PCI/PCIe bridges
> > > above. This causes MSI-X interrupts and device DMA to fail since
> > > the SMMU and ITS tables to be setup with incorrect IDs"
> > 
> > I haven't tried to figure out the MSI-X piece of this, but let me try
> > to come up with a concrete DMA example.  Assume this topology:
> > 
> >   00:00.0 bridge to [bus 01-1e]
> >   01:0a.0 bridge to [bus 04-05]
> >   04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT)
> >   05:00.0 endpoint
> > 
> > Assume 05:00.0 generates a DMA request.  Assume the top two bridges
> > are such that pci_for_each_dma_alias() includes them as well, so it
> > iterates through 05:00.0, 01:0a.0, and 00:00.0.
> > 
> > When the driver for 05:00.0 makes a DMA mapping, the current code
> > apparently makes an IOMMU mapping for requester ID 00:00.0 because
> > that's the last alias.  Obviously this doesn't work because the IOMMU
> > at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0.
> > 
> > With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an
> > IOMMU mapping for requester ID 05:00.0, which will work fine.  I think
> > it would *also* work fine if we made IOMMU mappings for 05:00.0,
> > 01:0a.0, and 00:0.0.  The last two are unnecessary, but probably not
> > harmful.
>  
> Ok. The last two are not harmful but still incorrect. The bridges
> 0:0.0 and 1:a.0 are outside the path with the PCI transactions takes
> (they go from 4:0.0 to the SMMU).
> 
> > Now assume 05:00 is a multi-function device that has a DMA alias
> > quirk, e.g., see quirk_dma_func0_alias().  It has another function:
> > 
> >   05:00.3 endpoint
> > 
> > DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0.
> > The driver makes a DMA mapping, pci_for_each_dma_alias() iterates
> > through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only
> > 00:00.0, which again doesn't work.
> > 
> > With this quirk, we create a single mapping for 05:00.0.  That will
> > work sometimes, but the device may also generate DMA with a requester
> > ID of 05:00.3, and that won't work.
> 
> Note that here, pci_for_each_dma_alias() works correctly with the quirk
> and incorrectly without the quirk. With the quirk, the callback function
> is involved on 05:00.3 and 05:00.0 as expected.

With the quirk, pci_for_each_dma_alias() works correctly and iterates
through 05:00.3 and 05:00.0.  But the IOMMU code only pay attention to
the *last* alias, i.e., 05:00.0.  So this does not work.

> Without the quirk the
> call back is invoked on 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, which
> includes aliases that are not valid.
> 
> The idea behind the quirk is to get pci_for_each_dma_alias work correctly
> on ThunderX2, and invoke the function only on the valid aliases. With the
> quirk, the code using it - like the SMMU code - gets the correct aliases
> to process, and do not have to deal with or filter out incorrect aliases.
> 
> I don't think it would be safe to assume that all callers of
> pci_for_each_dma_alias will be ok to handle invalid aliases. Even in
> the IOMMU case, the case which calculates stream IDs is one usage, the
> usage for iommu groups also did not deal with invalid aliases correctly
> when I tried it. Then there are the MSI cases too, I had done some work
> on trying to filter out invalid aliases in the relevant code[1], but
> in my opinion, getting pci_for_each_dma_alias() do the right thing is
> the correct solution.

I agree, we should fix this and it sounds like it's more than just an
optimization.  But I don't think the quirk is a complete fix, and the
changelog needs a short sketch of this discussion to make it clear
that this happens to fix some DMA faults, but others remain because
the current IOMMU code only maps one of the valid aliases.

So I think the only thing we need to move forward on this is a revised
changelog.  I propose something like this:

  On Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the
  PCI topology is slightly unusual.  For a multi-node system, it looks
  like:

    00:00.0 [?? type] bridge to [bus 01-1e]
    01:0a.0 [?? type] bridge to [bus 04-05]
    04:00.0 [?? type] bridge to [bus 05] (ThunderX2 XLATE_ROOT)
    05:00.0 endpoint

  pci_for_each_dma_alias() assumes IOMMU translation is done at the
  root of the PCI hierarchy.  It generates 05:00.0 and <BB:DD.F> as
  DMA aliases for 05:00.0 because bus <BB> is a non-PCIe bus that
  doesn't carry the Requester ID.

  Because the ThunderX2 IOMMU is at 04:00.0, <BB:DD.F> is never a valid
  Requester ID.  This quirk stops alias generation at the XLATE_ROOT
  bridge so we won't generate <BB:DD.F>.

  The current IOMMU code only maps the last alias (this is a separate
  bug in itself).  Prior to this quirk, we only created IOMMU mappings
  for the invalid Requester ID <BB:DD.F>, which never matched any DMA
  transactions.
  
  With this quirk, we create IOMMU mappings for a valid Requester ID,
  which fixes devices with no aliases but leaves devices with aliases
  still broken.

I don't know the details of what type of bridges those are
(conventional PCI?  PCIe PCI-to-PCIe? etc?) and exactly what RIDs are
involved.  It'd be nice if you could fill those in.

Bjorn
Jon Masters April 13, 2017, 6:43 a.m. UTC | #16
On 04/04/2017 10:28 AM, Robin Murphy wrote:

> So (at the risk of Jon mooing at me)

moooooooooooo
diff mbox

Patch

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index 6736836..564a84a 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -3958,6 +3958,20 @@  DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2260, quirk_mic_x200_dma_alias);
 DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_INTEL, 0x2264, quirk_mic_x200_dma_alias);
 
 /*
+ * The IOMMU and interrupt controller on Broadcom Vulcan/Cavium ThunderX2 are
+ * associated not at the root bus, but at a bridge below. This quirk flag
+ * will ensure that the aliases are identified correctly.
+ */
+static void quirk_bridge_cavm_thrx2_pcie_root(struct pci_dev *pdev)
+{
+	pdev->dev_flags |= PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT;
+}
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9000,
+				quirk_bridge_cavm_thrx2_pcie_root);
+DECLARE_PCI_FIXUP_HEADER(PCI_VENDOR_ID_BROADCOM, 0x9084,
+				quirk_bridge_cavm_thrx2_pcie_root);
+
+/*
  * Intersil/Techwell TW686[4589]-based video capture cards have an empty (zero)
  * class code.  Fix it.
  */