diff mbox

[24/32] pci: PCIe driver for Marvell Armada 370/XP systems

Message ID 1360686546-24277-25-git-send-email-thomas.petazzoni@free-electrons.com
State Not Applicable
Headers show

Commit Message

Thomas Petazzoni Feb. 12, 2013, 4:28 p.m. UTC
This driver implements the support for the PCIe interfaces on the
Marvell Armada 370/XP ARM SoCs. In the future, it might be extended to
cover earlier families of Marvell SoCs, such as Dove, Orion and
Kirkwood.

The driver implements the hw_pci operations needed by the core ARM PCI
code to setup PCI devices and get their corresponding IRQs, and the
pci_ops operations that are used by the PCI core to read/write the
configuration space of PCI devices.

Since the PCIe interfaces of Marvell SoCs are completely separate and
not linked together in a bus, this driver sets up an emulated PCI host
bridge, with one PCI-to-PCI bridge as child for each hardware PCIe
interface.

In addition, this driver enumerates the different PCIe slots, and for
those having a device plugged in, it sets up the necessary address
decoding windows, using the new armada_370_xp_alloc_pcie_window()
function from mach-mvebu/addr-map.c.

Signed-off-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
---
 .../devicetree/bindings/pci/mvebu-pci.txt          |  174 +++++
 drivers/pci/host/Kconfig                           |    6 +
 drivers/pci/host/Makefile                          |    4 +
 drivers/pci/host/pci-mvebu.c                       |  688 ++++++++++++++++++++
 4 files changed, 872 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/pci/mvebu-pci.txt
 create mode 100644 drivers/pci/host/Makefile
 create mode 100644 drivers/pci/host/pci-mvebu.c

Comments

Arnd Bergmann Feb. 12, 2013, 6:30 p.m. UTC | #1
On Tuesday 12 February 2013, Thomas Petazzoni wrote:
> diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
> new file mode 100644
> index 0000000..3ad563f
> --- /dev/null
> +++ b/drivers/pci/host/Makefile
> @@ -0,0 +1,4 @@
> +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> +CFLAGS_pci-mvebu.o += \
> +	-I$(srctree)/arch/arm/plat-orion/include \
> +	-I$(srctree)/arch/arm/mach-mvebu/include

This does not seem like a good idea to me. We should not include
architecture specific directories from a driver directory.

What are the header files you need here? 

> +/*
> + * This product ID is registered by Marvell, and used when the Marvell
> + * SoC is not the root complex, but an endpoint on the PCIe bus. It is
> + * therefore safe to re-use this PCI ID for our emulated PCI-to-PCI
> + * bridge.
> + */
> +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 0x7846

Just a side note: What happens if you have two of these systems and connect
them over PCIe, putting one of them into host mode and the other into
endpoint mode?

> +static void mvebu_pcie_setup_io_window(struct mvebu_pcie_port *port,
> +				       int enable)
> +{
> +	unsigned long iobase, iolimit;
> +
> +	if (port->bridge.iolimit < port->bridge.iobase)
> +		return;
> +
> +	iolimit = 0xFFF | ((port->bridge.iolimit & 0xF0) << 8) |
> +		(port->bridge.iolimitupper << 16);
> +	iobase = ((port->bridge.iobase & 0xF0) << 8) |
> +		(port->bridge.iobaseupper << 16);

I don't understand this code with the masks and shifts. Could you
add a comment here for readers like me?

> +
> +/*
> + * Initialize the configuration space of the PCI-to-PCI bridge
> + * associated with the given PCIe interface.
> + */
> +static void mvebu_sw_pci_bridge_init(struct mvebu_pcie_port *port)
> +{

As mentioned, I'm still skeptical of the sw_pci_bridge approach,
so I'm not commenting on the details of your implementations
(they seem fine on a first look though)

> +	/* Get the I/O and memory ranges from DT */
> +	while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
> +		if (resource_type(&res) == IORESOURCE_IO) {
> +			memcpy(&pcie->io, &res, sizeof(res));
> +			memcpy(&pcie->realio, &res, sizeof(res));
> +			pcie->io.name = "I/O";
> +			pcie->realio.start &= 0xFFFFF;
> +			pcie->realio.end   &= 0xFFFFF;
> +		}

The bit masking seems fishy here. What exactly are you doing,
does this just assume you have a 1MB window at most?

Maybe something like

	pcie->realio.start = 0;
	pcie->realio.end = pcie->io.end - pcie->io.start;

I suppose you also need to fix up pcie->io to be in IORESOURCE_MEM
space instead of IORESOURCE_IO, or fix the of_pci_process_ranges
function to return it in a different way.

> +static int mvebu_pcie_init(void)
> +{
> +	return platform_driver_probe(&mvebu_pcie_driver,
> +				     mvebu_pcie_probe);
> +}
> +
> +subsys_initcall(mvebu_pcie_init);

You don't have to do it, but I wonder if this could be a module
with unload support instead.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni Feb. 12, 2013, 7:22 p.m. UTC | #2
Dear Arnd Bergmann,

On Tue, 12 Feb 2013 18:30:11 +0000, Arnd Bergmann wrote:
> On Tuesday 12 February 2013, Thomas Petazzoni wrote:
> > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
> > new file mode 100644
> > index 0000000..3ad563f
> > --- /dev/null
> > +++ b/drivers/pci/host/Makefile
> > @@ -0,0 +1,4 @@
> > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> > +CFLAGS_pci-mvebu.o += \
> > +	-I$(srctree)/arch/arm/plat-orion/include \
> > +	-I$(srctree)/arch/arm/mach-mvebu/include
> 
> This does not seem like a good idea to me. We should not include
> architecture specific directories from a driver directory.
> 
> What are the header files you need here? 

From the patch itself:

+#include <plat/pcie.h>
+#include <mach/addr-map.h>

<plat/pcie.h> is needed for a few PCIe functions shared with earlier
families of Marvell SoC. My plan is that once this PCI driver gets
accepted, I work on migrating the earlier Marvell SoC families to using
this PCI driver, and therefore those functions would ultimately move in
the driver in drivers/pci/host/, which would remove the <plat/pcie.h>.

The <mach/addr-map.h> is here to access the address decoding windows
allocation/free API. And for this, there is no other long term plan
than having an API provided by the platform code in arch/arm/, and used
by drivers. Some other drivers may have to use this API as well in the
future.

I think that completely preventing <mach/> and <plat/> includes from
drivers is not possible. Some sub-architectures will also have some
bizarre mechanism to handle (in our case the address decoding windows),
for which there is no kernel-wide API and kernel-wide subsystem to
handle it. In such cases, a sub-architecture specific solution is
really the only reasonable way, and in this case, we have to include
the sub-architecture headers.

Note that I have been careful to use CFLAGS_pci-mvebu.o, so that those
include paths only apply to *this* driver. I added a separate dummy
driver in drivers/pci/host/, and verified that those include paths are
not used when building this other driver. So those special CFLAGS are
still compatible with the multiplatform kernel.

> > +/*
> > + * This product ID is registered by Marvell, and used when the
> > Marvell
> > + * SoC is not the root complex, but an endpoint on the PCIe bus.
> > It is
> > + * therefore safe to re-use this PCI ID for our emulated PCI-to-PCI
> > + * bridge.
> > + */
> > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 0x7846
> 
> Just a side note: What happens if you have two of these systems and
> connect them over PCIe, putting one of them into host mode and the
> other into endpoint mode?

I am not a PCI expert, but I don't think it would cause issues. Maybe
Jason Gunthorpe can comment on this, as he originally suggested to
re-use this PCI ID.

> > +static void mvebu_pcie_setup_io_window(struct mvebu_pcie_port
> > *port,
> > +				       int enable)
> > +{
> > +	unsigned long iobase, iolimit;
> > +
> > +	if (port->bridge.iolimit < port->bridge.iobase)
> > +		return;
> > +
> > +	iolimit = 0xFFF | ((port->bridge.iolimit & 0xF0) << 8) |
> > +		(port->bridge.iolimitupper << 16);
> > +	iobase = ((port->bridge.iobase & 0xF0) << 8) |
> > +		(port->bridge.iobaseupper << 16);
> 
> I don't understand this code with the masks and shifts. Could you
> add a comment here for readers like me?

Sure, will do.

It basically comes from the PCI-to-PCI bridge specification, which
explains how the I/O address and I/O limit is split into two 16 bits
registers, with those bizarre shifts and hardcoded values. I'll put a
reference to the relevant section of the PCI-to-PCI bridge
specification here.

> > +
> > +/*
> > + * Initialize the configuration space of the PCI-to-PCI bridge
> > + * associated with the given PCIe interface.
> > + */
> > +static void mvebu_sw_pci_bridge_init(struct mvebu_pcie_port *port)
> > +{
> 
> As mentioned, I'm still skeptical of the sw_pci_bridge approach,
> so I'm not commenting on the details of your implementations
> (they seem fine on a first look though)

Yes, I understood your were still skeptical. But as I've mentioned in
other e-mails, I still haven't seen any other serious alternate
proposal that takes into account the need for dynamic assignment of
addresses.

> > +	/* Get the I/O and memory ranges from DT */
> > +	while ((range = of_pci_process_ranges(np, &res, range)) !=
> > NULL) {
> > +		if (resource_type(&res) == IORESOURCE_IO) {
> > +			memcpy(&pcie->io, &res, sizeof(res));
> > +			memcpy(&pcie->realio, &res, sizeof(res));
> > +			pcie->io.name = "I/O";
> > +			pcie->realio.start &= 0xFFFFF;
> > +			pcie->realio.end   &= 0xFFFFF;
> > +		}
> 
> The bit masking seems fishy here. What exactly are you doing,
> does this just assume you have a 1MB window at most?

Basically, I have two resources for the I/O:

 * One described in the DT, from 0xC0000000 to 0xC00FFFFF which will be
   used to create the address decoding windows for the I/O regions of
   the different PCIe interfaces. The PCI I/O virtual address 0xffe00000
   will be mapped to those physical addresses. Those address decoding
   windows are configured with the special "remap" mechanism that
   ensures that if an access is made at 0xC0000000 + offset, it will
   appear on the PCI bus as an I/O access at address "offset".

 * One covering the low addresses 0x0 -> 0xFFFFF (pcie->realio), which
   is used to tell the Linux PCI subsystem from which address range it
   should assign I/O addresses.

> Maybe something like
> 
> 	pcie->realio.start = 0;
> 	pcie->realio.end = pcie->io.end - pcie->io.start;

Indeed, that would result in the same values. If you find it clearer,
I'm fine with it.

> I suppose you also need to fix up pcie->io to be in IORESOURCE_MEM
> space instead of IORESOURCE_IO, or fix the of_pci_process_ranges
> function to return it in a different way.

Ok.

> > +static int mvebu_pcie_init(void)
> > +{
> > +	return platform_driver_probe(&mvebu_pcie_driver,
> > +				     mvebu_pcie_probe);
> > +}
> > +
> > +subsys_initcall(mvebu_pcie_init);
> 
> You don't have to do it, but I wonder if this could be a module
> with unload support instead.

This has already been discussed in the review of PATCHv2. Please see
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-January/145580.html.

Basically, doing a module_init() initialization fails, because the XHCI
USB quirks are executed before we have the chance to create the address
decoding windows, which crashes the kernel at boot time (and we have
one platform where an USB 3.0 XHCI controller sits on the PCIe bus).
Bjorn Helgaas has acknowledged the problem in
http://lists.infradead.org/pipermail/linux-arm-kernel/2013-February/148292.html:

"""
This is not really a problem in your code; it's a generic PCI core
problem.  pci_scan_root_bus() does everything including creating the
root bus, scanning it, and adding the devices we find.  At the point
where we add a device (pci_bus_add_device()), it should be ready for a
driver to claim it -- all resource assignment should already be done.

I don't think it's completely trivial to fix this in the PCI core yet
(but we're moving in that direction) because we have some boot-time
ordering issues, e.g., x86 scans the root buses before we know about
the address space consumed by ACPI devices, so we can't just assign
the resources when we scan the bus.
"""

Best regards,

Thomas
Jason Gunthorpe Feb. 12, 2013, 7:49 p.m. UTC | #3
On Tue, Feb 12, 2013 at 08:22:52PM +0100, Thomas Petazzoni wrote:

> > > +/*
> > > + * This product ID is registered by Marvell, and used when the
> > > Marvell
> > > + * SoC is not the root complex, but an endpoint on the PCIe bus.
> > > It is
> > > + * therefore safe to re-use this PCI ID for our emulated PCI-to-PCI
> > > + * bridge.
> > > + */
> > > +#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 0x7846
> > 
> > Just a side note: What happens if you have two of these systems and
> > connect them over PCIe, putting one of them into host mode and the
> > other into endpoint mode?
> 
> I am not a PCI expert, but I don't think it would cause issues. Maybe
> Jason Gunthorpe can comment on this, as he originally suggested to
> re-use this PCI ID.

The answer is a bit complex..

- Yes the Marvell SOC, or related, can be used as an end point, and
  you could pair two SOCs together over PCI

- When in end point mode the SOC *probably* should have its ID
  reassigned. The ID is supposed to reflect the function of the
  device, not the HW vendor. In end point mode the ARM code controls
  how the device appears on PCIe.

  So, if someone made an add-in card based around one of these Marvell
  SOCs then it still won't expose this ID to the kernel.

- Even if the ID is reused there is a PCI-PCI bridge version and a
  end device version, so drivers could still tell them apart by class
  matching.

FWIW, I would fetch the ID from the HW, as different SOC varients may
vary the ID, though that is minor.

Basically, there is a theoretical problem here, but it is trivially
solvable in two ways by whomever decides to actually do this.

> > I suppose you also need to fix up pcie->io to be in IORESOURCE_MEM
> > space instead of IORESOURCE_IO, or fix the of_pci_process_ranges
> > function to return it in a different way.
> 
> Ok.

What does /proc/iomem say with your driver, are all the memory regions
properly requested?

FWIW, the orion mbus window driver should also get an update to
request memory regions, and IMHO, some of that code is better placed
there than in the PCI driver. Specifically, I think it should:

- Request regions for the DDR ('System RAM') and internal regs
- Request regions for every mbus window that was programmed in,
  orion_addr_map_cfg - ideally updated to decode the target specifier.
- Request and unrequest regions for the PCI driver's dynamic requests.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Feb. 12, 2013, 10:35 p.m. UTC | #4
On Tue, Feb 12, 2013 at 05:28:58PM +0100, Thomas Petazzoni wrote:
> +++ b/Documentation/devicetree/bindings/pci/mvebu-pci.txt
[..]
> +pcie-controller {
> +	compatible = "marvell,armada-370-xp-pcie";
> +	status = "disabled";
> +
> +	#address-cells = <3>;
> +	#size-cells = <2>;
> +
> +	bus-range = <0x00 0xff>;
> +
> +	ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
> +	          0x00004800 0 0xd0042000 0xd0042000 0 0x00002000   /* port 2.0 registers */
> +	          0x00001000 0 0xd0044000 0xd0044000 0 0x00002000   /* port 0.1 registers */
> +	          0x00001800 0 0xd0048000 0xd0048000 0 0x00002000   /* port 0.2 registers */
> +	          0x00002000 0 0xd004C000 0xd004C000 0 0x00002000   /* port 0.3 registers */
> +		  0x00002800 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
> +	          0x00005000 0 0xd0082000 0xd0082000 0 0x00002000   /* port 3.0 registers */
> +		  0x00003000 0 0xd0084000 0xd0084000 0 0x00002000   /* port 1.1 registers */
> +		  0x00003800 0 0xd0088000 0xd0088000 0 0x00002000   /* port 1.2 registers */
> +		  0x00004000 0 0xd008C000 0xd008C000 0 0x00002000   /* port 1.3 registers */
> +		  0x81000000 0 0	  0xc0000000 0 0x00100000   /* downstream I/O */
> +		  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */
> +
> +	#interrupt-cells = <1>;
> +	interrupt-map-mask = <0xf800 0 0 1>;
> +	interrupt-map = <0x0800 0 0 1 &mpic 58
> +		         0x1000 0 0 1 &mpic 59
> +			 0x1800 0 0 1 &mpic 60
> +			 0x2000 0 0 1 &mpic 61
> +			 0x2800 0 0 1 &mpic 62
> +		         0x3000 0 0 1 &mpic 63
> +			 0x3800 0 0 1 &mpic 64
> +			 0x4000 0 0 1 &mpic 65
> +			 0x4800 0 0 1 &mpic 99
> +			 0x5000 0 0 1 &mpic 103>;

This is out of date now?

> +	pcie@0,0 {
> +		device_type = "pciex";
> +		reg = <0x0800 0 0xd0040000 0 0x2000>;

It would be great to get this sorted as per my prior comments.. Maybe
like this is easy?

pcie-controller {
 compatible = "marvell,armada-370-xp-pcie";

 // Index by marvell,pcie-port ?
 regs = <0xd0040000 0x00002000
         0xd0080000 0x00002000>;

 ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
           0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */

 pcie@0,0 {
      device_type = "pci";
      reg = <0x0800 0 0 0>; // 00:01.0  (????)
      marvell,pcie-port = <0>;
 };
}

It is abusive to map the device internal per-port registers through
'0x00000800 0 0xd0040000' and 'reg' - that is not really the intent of
the OF spec.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Feb. 12, 2013, 10:59 p.m. UTC | #5
On Tuesday 12 February 2013, Thomas Petazzoni wrote:
> On Tue, 12 Feb 2013 18:30:11 +0000, Arnd Bergmann wrote:
> > On Tuesday 12 February 2013, Thomas Petazzoni wrote:
> > > diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
> > > new file mode 100644
> > > index 0000000..3ad563f
> > > --- /dev/null
> > > +++ b/drivers/pci/host/Makefile
> > > @@ -0,0 +1,4 @@
> > > +obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
> > > +CFLAGS_pci-mvebu.o += \
> > > +	-I$(srctree)/arch/arm/plat-orion/include \
> > > +	-I$(srctree)/arch/arm/mach-mvebu/include
> > 
> > This does not seem like a good idea to me. We should not include
> > architecture specific directories from a driver directory.
> > 
> > What are the header files you need here? 
> 
> From the patch itself:
> 
> +#include <plat/pcie.h>
> +#include <mach/addr-map.h>
> 
> <plat/pcie.h> is needed for a few PCIe functions shared with earlier
> families of Marvell SoC. My plan is that once this PCI driver gets
> accepted, I work on migrating the earlier Marvell SoC families to using
> this PCI driver, and therefore those functions would ultimately move in
> the driver in drivers/pci/host/, which would remove the <plat/pcie.h>.

Hmm, although it is a bit unusual, I would actually propose duplicating
that code for now, and merging a copy into the new driver right away.
This gets rid of the header file dependency and lets you just delete
the old file when the other platforms are converted.

> The <mach/addr-map.h> is here to access the address decoding windows
> allocation/free API. And for this, there is no other long term plan
> than having an API provided by the platform code in arch/arm/, and used
> by drivers. Some other drivers may have to use this API as well in the
> future.

This is harder to do, but I'm sure we can find a solution. At least
the addr-map.c code has no other dependencies than the plat/addr-map.h
header, so we are fairly free to move it elsehwere.

I can think of several approaches that I'd prefer over your approach,
although I have to admit that none of them makes me really happy:

a) arch/arm/common/mv-addr-map.c and arch/arm/include/asm/mach/mv-addr-map.h
I assume that Russell will object to this one, but I let him speak for
himself

b) drivers/bus/<something>
This would make a lot more sense if we followed the scheme I explained
in my discussion to Jason Gunthorpe, where we basically treat this
bus as a parent node in the device tree for anything that can get remapped.
Without that change, it feels a little misplaced

c) drivers/misc/
I usually object to anything put in here and would certainly prefer the
previous one over this.

> Note that I have been careful to use CFLAGS_pci-mvebu.o, so that those
> include paths only apply to *this* driver. I added a separate dummy
> driver in drivers/pci/host/, and verified that those include paths are
> not used when building this other driver. So those special CFLAGS are
> still compatible with the multiplatform kernel.

I understand this, but I think it would be better not to set a precedent
for doing this.

> 
> > > +static void mvebu_pcie_setup_io_window(struct mvebu_pcie_port
> > > *port,
> > > +				       int enable)
> > > +{
> > > +	unsigned long iobase, iolimit;
> > > +
> > > +	if (port->bridge.iolimit < port->bridge.iobase)
> > > +		return;
> > > +
> > > +	iolimit = 0xFFF | ((port->bridge.iolimit & 0xF0) << 8) |
> > > +		(port->bridge.iolimitupper << 16);
> > > +	iobase = ((port->bridge.iobase & 0xF0) << 8) |
> > > +		(port->bridge.iobaseupper << 16);
> > 
> > I don't understand this code with the masks and shifts. Could you
> > add a comment here for readers like me?
> 
> Sure, will do.
> 
> It basically comes from the PCI-to-PCI bridge specification, which
> explains how the I/O address and I/O limit is split into two 16 bits
> registers, with those bizarre shifts and hardcoded values. I'll put a
> reference to the relevant section of the PCI-to-PCI bridge
> specification here.

Ok, I see. Thanks for the explanation.
 
> > > +	/* Get the I/O and memory ranges from DT */
> > > +	while ((range = of_pci_process_ranges(np, &res, range)) !=
> > > NULL) {
> > > +		if (resource_type(&res) == IORESOURCE_IO) {
> > > +			memcpy(&pcie->io, &res, sizeof(res));
> > > +			memcpy(&pcie->realio, &res, sizeof(res));
> > > +			pcie->io.name = "I/O";
> > > +			pcie->realio.start &= 0xFFFFF;
> > > +			pcie->realio.end   &= 0xFFFFF;
> > > +		}
> > 
> > The bit masking seems fishy here. What exactly are you doing,
> > does this just assume you have a 1MB window at most?
> 
> Basically, I have two resources for the I/O:
> 
>  * One described in the DT, from 0xC0000000 to 0xC00FFFFF which will be
>    used to create the address decoding windows for the I/O regions of
>    the different PCIe interfaces. The PCI I/O virtual address 0xffe00000
>    will be mapped to those physical addresses. Those address decoding
>    windows are configured with the special "remap" mechanism that
>    ensures that if an access is made at 0xC0000000 + offset, it will
>    appear on the PCI bus as an I/O access at address "offset".

Right, I got this from reading the code. Unfortunately the of_pci_process_ranges
functions from your earlier patch makes this a little confusing as it
marks this resource as IORESOURCE_IO when in reality it is the IORESOURCE_MEM
resource that describes where in MMIO space the I/O window is.

>  * One covering the low addresses 0x0 -> 0xFFFFF (pcie->realio), which
>    is used to tell the Linux PCI subsystem from which address range it
>    should assign I/O addresses.
> 
> > Maybe something like
> > 
> > 	pcie->realio.start = 0;
> > 	pcie->realio.end = pcie->io.end - pcie->io.start;
> 
> Indeed, that would result in the same values. If you find it clearer,
> I'm fine with it.

Ideally we would read that from the ranges property as well, since it
does not necessarily start at zero, although any other value would
be a bit silly.

I definitely prefer the version I suggested over your version. On second
thought I would make this

	pcie->realio.start = PCIBIOS_MIN_IO;
	pcie->realio.end = min(pcie->io.end - pcie->io.start, IO_SPACE_LIMIT);

to ensure that we are inside of the limits.

> 
> > > +static int mvebu_pcie_init(void)
> > > +{
> > > +	return platform_driver_probe(&mvebu_pcie_driver,
> > > +				     mvebu_pcie_probe);
> > > +}
> > > +
> > > +subsys_initcall(mvebu_pcie_init);
> > 
> > You don't have to do it, but I wonder if this could be a module
> > with unload support instead.
> 
> This has already been discussed in the review of PATCHv2. Please see
> http://lists.infradead.org/pipermail/linux-arm-kernel/2013-January/145580.html.

Ok.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Feb. 13, 2013, 12:41 a.m. UTC | #6
On Tue, Feb 12, 2013 at 10:59:53PM +0000, Arnd Bergmann wrote:
 
> b) drivers/bus/<something>
> This would make a lot more sense if we followed the scheme I explained
> in my discussion to Jason Gunthorpe, where we basically treat this
> bus as a parent node in the device tree for anything that can get remapped.
> Without that change, it feels a little misplaced

Also, FWIW, recall this related discussion and other possible DT
binding:

http://www.spinics.net/lists/arm-kernel/msg219992.html

drivers/bus/orion-mbus.c feels like the right option, but when I
looked at it, getting the DT binding, and full dynamicness setup
seemed like it would be best done after non-DT support was purged, and
that is somewhat contigent on getting the irqchip and timer stuff
sorted (see my first attempt at that):

https://patchwork.kernel.org/patch/1852011/

Guess it depends where you want to draw the line on cleanups before
something can be accepted..

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni Feb. 13, 2013, 8:23 a.m. UTC | #7
Dear Arnd Bergmann,

On Tue, 12 Feb 2013 22:59:53 +0000, Arnd Bergmann wrote:

> > <plat/pcie.h> is needed for a few PCIe functions shared with earlier
> > families of Marvell SoC. My plan is that once this PCI driver gets
> > accepted, I work on migrating the earlier Marvell SoC families to using
> > this PCI driver, and therefore those functions would ultimately move in
> > the driver in drivers/pci/host/, which would remove the <plat/pcie.h>.
> 
> Hmm, although it is a bit unusual, I would actually propose duplicating
> that code for now, and merging a copy into the new driver right away.
> This gets rid of the header file dependency and lets you just delete
> the old file when the other platforms are converted.

Hum, why not, but I would definitely prefer to wait for the conversion
of older platforms instead of duplicating this code. But if you feel
like it's the right solution, I'll do it.

> > The <mach/addr-map.h> is here to access the address decoding windows
> > allocation/free API. And for this, there is no other long term plan
> > than having an API provided by the platform code in arch/arm/, and used
> > by drivers. Some other drivers may have to use this API as well in the
> > future.
> 
> This is harder to do, but I'm sure we can find a solution. At least
> the addr-map.c code has no other dependencies than the plat/addr-map.h
> header, so we are fairly free to move it elsehwere.

The arch/arm/mach-mvebu/addr-map.c depends on
arch/arm/plat-orion/addr-map.c, so any change on this will affect
mach-kirkwood, mach-orion5x, mach-dove and mach-mv78xx0. As you can
see, we have to take into account the existing code, and I don't think
it's realistic to have a perfect solution immediately.

This address decoding code will continue to change for two reasons:

 *) We are going to work on NOR/NAND support for mach-mvebu, and that
    will also involve more interaction with the address decoding code.

 *) As we are moving the earlier Marvell SoC families (mach-kirkwood,
    mach-orion5x, mach-dove, mach-mv78xx0) to the Device Tree, this
    address decoding code will also evolve.

mach-mvebu is not a standalone new platform like mach-highbank for
example, it relies on a lot of existing code from ealier platforms. It
is nice to share existing code, but it also means that cleanups and
refactoring take a lot more time.

I think we need to leave time for all these platforms to gradually
converge and cleanup their infrastructure. It's not going to happen
overnight.

> I can think of several approaches that I'd prefer over your approach,
> although I have to admit that none of them makes me really happy:
> 
> a) arch/arm/common/mv-addr-map.c and arch/arm/include/asm/mach/mv-addr-map.h
> I assume that Russell will object to this one, but I let him speak for
> himself
> 
> b) drivers/bus/<something>
> This would make a lot more sense if we followed the scheme I explained
> in my discussion to Jason Gunthorpe, where we basically treat this
> bus as a parent node in the device tree for anything that can get remapped.
> Without that change, it feels a little misplaced
> 
> c) drivers/misc/
> I usually object to anything put in here and would certainly prefer the
> previous one over this.

All of those approaches require a huge amount of work to convert the
existing SoC families. Certainly, it will be done as time goes, and as
older SoC families are converted to the DT and gradually converge
inside mach-mvebu/, but if we have to wait for all of this to happen to
get the PCIe support merged, it's not going to happen before several
months (the PCIe stuff was originally posted on December, 7th,
initially with the hope of targeting 3.9, the review and rework has
taken a long time, so I'm now targeting 3.10 for the PCIe stuff, but
I'd prefer not to have to postpone this to 3.11 and even 3.12 due to
the heavy dependencies on address decoding rework).

> > > I don't understand this code with the masks and shifts. Could you
> > > add a comment here for readers like me?
> > 
> > Sure, will do.
> > 
> > It basically comes from the PCI-to-PCI bridge specification, which
> > explains how the I/O address and I/O limit is split into two 16 bits
> > registers, with those bizarre shifts and hardcoded values. I'll put a
> > reference to the relevant section of the PCI-to-PCI bridge
> > specification here.
> 
> Ok, I see. Thanks for the explanation.

I'll add a comment anyway, because it's true that it's a bit of magic
going on here.

> > > > +	/* Get the I/O and memory ranges from DT */
> > > > +	while ((range = of_pci_process_ranges(np, &res, range)) !=
> > > > NULL) {
> > > > +		if (resource_type(&res) == IORESOURCE_IO) {
> > > > +			memcpy(&pcie->io, &res, sizeof(res));
> > > > +			memcpy(&pcie->realio, &res, sizeof(res));
> > > > +			pcie->io.name = "I/O";
> > > > +			pcie->realio.start &= 0xFFFFF;
> > > > +			pcie->realio.end   &= 0xFFFFF;
> > > > +		}
> > > 
> > > The bit masking seems fishy here. What exactly are you doing,
> > > does this just assume you have a 1MB window at most?
> > 
> > Basically, I have two resources for the I/O:
> > 
> >  * One described in the DT, from 0xC0000000 to 0xC00FFFFF which will be
> >    used to create the address decoding windows for the I/O regions of
> >    the different PCIe interfaces. The PCI I/O virtual address 0xffe00000
> >    will be mapped to those physical addresses. Those address decoding
> >    windows are configured with the special "remap" mechanism that
> >    ensures that if an access is made at 0xC0000000 + offset, it will
> >    appear on the PCI bus as an I/O access at address "offset".
> 
> Right, I got this from reading the code. Unfortunately the of_pci_process_ranges
> functions from your earlier patch makes this a little confusing as it
> marks this resource as IORESOURCE_IO when in reality it is the IORESOURCE_MEM
> resource that describes where in MMIO space the I/O window is.

Indeed. So maybe I should mark this resource as being IORESOURCE_MEM
in the DT.

> >  * One covering the low addresses 0x0 -> 0xFFFFF (pcie->realio), which
> >    is used to tell the Linux PCI subsystem from which address range it
> >    should assign I/O addresses.
> > 
> > > Maybe something like
> > > 
> > > 	pcie->realio.start = 0;
> > > 	pcie->realio.end = pcie->io.end - pcie->io.start;
> > 
> > Indeed, that would result in the same values. If you find it clearer,
> > I'm fine with it.
> 
> Ideally we would read that from the ranges property as well, since it
> does not necessarily start at zero, although any other value would
> be a bit silly.
> 
> I definitely prefer the version I suggested over your version. On second
> thought I would make this
> 
> 	pcie->realio.start = PCIBIOS_MIN_IO;
> 	pcie->realio.end = min(pcie->io.end - pcie->io.start, IO_SPACE_LIMIT);
> 
> to ensure that we are inside of the limits.

Ok, thanks!

Thomas
Thomas Petazzoni Feb. 13, 2013, 8:57 a.m. UTC | #8
Dear Jason Gunthorpe,

On Tue, 12 Feb 2013 15:35:11 -0700, Jason Gunthorpe wrote:

> This is out of date now?

Yes, will fix.

> > +	pcie@0,0 {
> > +		device_type = "pciex";
> > +		reg = <0x0800 0 0xd0040000 0 0x2000>;
> 
> It would be great to get this sorted as per my prior comments.. Maybe
> like this is easy?
> 
> pcie-controller {
>  compatible = "marvell,armada-370-xp-pcie";
> 
>  // Index by marvell,pcie-port ?
>  regs = <0xd0040000 0x00002000
>          0xd0080000 0x00002000>;
> 
>  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
>            0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */
> 
>  pcie@0,0 {
>       device_type = "pci";
>       reg = <0x0800 0 0 0>; // 00:01.0  (????)
>       marvell,pcie-port = <0>;
>  };
> }
> 
> It is abusive to map the device internal per-port registers through
> '0x00000800 0 0xd0040000' and 'reg' - that is not really the intent of
> the OF spec.

I am not sure to understand how this would work. Given a pcie@X,Y node,
how would I find the address of the internal registers (i.e the ones at
0xd0040000, 0xd0080000) ?

You seem to propose a global regs = <...> property under
pcie-controller, but indexing using marvell,pcie-port cannot work. PCIe
interfaces are identified by two values (port,lane), so we have 0.0,
0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 2.0 and 3.0 on MV78460. I really
would like to avoid having bizarre computations to find which entry in
this big regs = <...> array correspond to a given PCIe interface.

Could you give a more detailed example, matching the PCIe DT data of
the MV78460, which has many PCIe interfaces ?

Thanks,

Thomas
Arnd Bergmann Feb. 13, 2013, 9:18 a.m. UTC | #9
On Wednesday 13 February 2013, Jason Gunthorpe wrote:
> , Arnd Bergmann wrote:
>  
> > b) drivers/bus/<something>
> > This would make a lot more sense if we followed the scheme I explained
> > in my discussion to Jason Gunthorpe, where we basically treat this
> > bus as a parent node in the device tree for anything that can get remapped.
> > Without that change, it feels a little misplaced
> 
> Also, FWIW, recall this related discussion and other possible DT
> binding:
> 
> http://www.spinics.net/lists/arm-kernel/msg219992.html
> 
> drivers/bus/orion-mbus.c feels like the right option, but when I
> looked at it, getting the DT binding, and full dynamicness setup
> seemed like it would be best done after non-DT support was purged, and
> that is somewhat contigent on getting the irqchip and timer stuff
> sorted (see my first attempt at that):
> 
> https://patchwork.kernel.org/patch/1852011/
> 
> Guess it depends where you want to draw the line on cleanups before
> something can be accepted..

I guess as long as we agree on where we are headed with the
address translation, it's ok to just move the code now and
change the code later. I would like to be strict about the
include path stuff though.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Feb. 13, 2013, 9:29 a.m. UTC | #10
On Wednesday 13 February 2013, Thomas Petazzoni wrote:
> Dear Arnd Bergmann,
> 
> On Tue, 12 Feb 2013 22:59:53 +0000, Arnd Bergmann wrote:
> 
> > > <plat/pcie.h> is needed for a few PCIe functions shared with earlier
> > > families of Marvell SoC. My plan is that once this PCI driver gets
> > > accepted, I work on migrating the earlier Marvell SoC families to using
> > > this PCI driver, and therefore those functions would ultimately move in
> > > the driver in drivers/pci/host/, which would remove the <plat/pcie.h>.
> > 
> > Hmm, although it is a bit unusual, I would actually propose duplicating
> > that code for now, and merging a copy into the new driver right away.
> > This gets rid of the header file dependency and lets you just delete
> > the old file when the other platforms are converted.
> 
> Hum, why not, but I would definitely prefer to wait for the conversion
> of older platforms instead of duplicating this code. But if you feel
> like it's the right solution, I'll do it.

It's not something we do a lot, but in this case, I think it's better
if it lets us avoid adding the platform specific include path, which I
really want to avoid here.

> > > The <mach/addr-map.h> is here to access the address decoding windows
> > > allocation/free API. And for this, there is no other long term plan
> > > than having an API provided by the platform code in arch/arm/, and used
> > > by drivers. Some other drivers may have to use this API as well in the
> > > future.
> > 
> > This is harder to do, but I'm sure we can find a solution. At least
> > the addr-map.c code has no other dependencies than the plat/addr-map.h
> > header, so we are fairly free to move it elsehwere.
> 
> The arch/arm/mach-mvebu/addr-map.c depends on
> arch/arm/plat-orion/addr-map.c, so any change on this will affect
> mach-kirkwood, mach-orion5x, mach-dove and mach-mv78xx0. As you can
> see, we have to take into account the existing code, and I don't think
> it's realistic to have a perfect solution immediately.

Yes, I realize this. I was thinking we would move all at least the
file from plat-orion, and the header file. I don't care much whether
we also move the platform specific setup from mach-*/addr-map.c,
it works either way.
 
> This address decoding code will continue to change for two reasons:
> 
>  *) We are going to work on NOR/NAND support for mach-mvebu, and that
>     will also involve more interaction with the address decoding code.
> 
>  *) As we are moving the earlier Marvell SoC families (mach-kirkwood,
>     mach-orion5x, mach-dove, mach-mv78xx0) to the Device Tree, this
>     address decoding code will also evolve.
> 
> mach-mvebu is not a standalone new platform like mach-highbank for
> example, it relies on a lot of existing code from ealier platforms. It
> is nice to share existing code, but it also means that cleanups and
> refactoring take a lot more time.
> 
> I think we need to leave time for all these platforms to gradually
> converge and cleanup their infrastructure. It's not going to happen
> overnight.

We don't need to do a complete overhaul of that code right now, but
if we agree on a place where it can go, then I think we should
move it now as just one extra patch to get rid of the header
dependency. In the worst case, moving just the header file to
include/linux would work, too.

> > > 
> > >  * One described in the DT, from 0xC0000000 to 0xC00FFFFF which will be
> > >    used to create the address decoding windows for the I/O regions of
> > >    the different PCIe interfaces. The PCI I/O virtual address 0xffe00000
> > >    will be mapped to those physical addresses. Those address decoding
> > >    windows are configured with the special "remap" mechanism that
> > >    ensures that if an access is made at 0xC0000000 + offset, it will
> > >    appear on the PCI bus as an I/O access at address "offset".
> > 
> > Right, I got this from reading the code. Unfortunately the of_pci_process_ranges
> > functions from your earlier patch makes this a little confusing as it
> > marks this resource as IORESOURCE_IO when in reality it is the IORESOURCE_MEM
> > resource that describes where in MMIO space the I/O window is.
> 
> Indeed. So maybe I should mark this resource as being IORESOURCE_MEM
> in the DT.

The DT seems fine here, just the code that interprets it is a little
unusual. Maybe you can change the calling convention of that function
to pass the type of resource you want as an argument?


	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni Feb. 13, 2013, 9:31 a.m. UTC | #11
Dear Arnd Bergmann,

On Wed, 13 Feb 2013 09:18:56 +0000, Arnd Bergmann wrote:

> > Also, FWIW, recall this related discussion and other possible DT
> > binding:
> > 
> > http://www.spinics.net/lists/arm-kernel/msg219992.html
> > 
> > drivers/bus/orion-mbus.c feels like the right option, but when I
> > looked at it, getting the DT binding, and full dynamicness setup
> > seemed like it would be best done after non-DT support was purged, and
> > that is somewhat contigent on getting the irqchip and timer stuff
> > sorted (see my first attempt at that):
> > 
> > https://patchwork.kernel.org/patch/1852011/
> > 
> > Guess it depends where you want to draw the line on cleanups before
> > something can be accepted..
> 
> I guess as long as we agree on where we are headed with the
> address translation, it's ok to just move the code now and
> change the code later. I would like to be strict about the
> include path stuff though.

Ok, so I could move the existing address decoding code into say
drivers/bus/orion-mbus.c. Would that be ok for you?

Thanks,

Thomas
Thomas Petazzoni Feb. 13, 2013, 9:40 a.m. UTC | #12
Dear Arnd Bergmann,

On Wed, 13 Feb 2013 09:29:21 +0000, Arnd Bergmann wrote:

> > Hum, why not, but I would definitely prefer to wait for the conversion
> > of older platforms instead of duplicating this code. But if you feel
> > like it's the right solution, I'll do it.
> 
> It's not something we do a lot, but in this case, I think it's better
> if it lets us avoid adding the platform specific include path, which I
> really want to avoid here.

Ok.

> > The arch/arm/mach-mvebu/addr-map.c depends on
> > arch/arm/plat-orion/addr-map.c, so any change on this will affect
> > mach-kirkwood, mach-orion5x, mach-dove and mach-mv78xx0. As you can
> > see, we have to take into account the existing code, and I don't think
> > it's realistic to have a perfect solution immediately.
> 
> Yes, I realize this. I was thinking we would move all at least the
> file from plat-orion, and the header file. I don't care much whether
> we also move the platform specific setup from mach-*/addr-map.c,
> it works either way.

Ok.


> We don't need to do a complete overhaul of that code right now, but
> if we agree on a place where it can go, then I think we should
> move it now as just one extra patch to get rid of the header
> dependency. In the worst case, moving just the header file to
> include/linux would work, too.

I'll try to cook something for this.

> > Indeed. So maybe I should mark this resource as being IORESOURCE_MEM
> > in the DT.
> 
> The DT seems fine here, just the code that interprets it is a little
> unusual. Maybe you can change the calling convention of that function
> to pass the type of resource you want as an argument?

Erm? The type of the resource is encoded into the DT:

+				  0x81000000 0 0	  0xc0000000 0 0x00010000   /* downstream I/O */
+				  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */

From http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge:

     phys.hi cell: npt000ss bbbbbbbb dddddfff rrrrrrrr
    phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
    phys.low cell: llllllll llllllll llllllll llllllll 

 ss: space code

    00: configuration space
    01: I/O space
    10: 32 bit memory space
    11: 64 bit memory space 

So the 0x81 at the beginning of the first line means I/O space, the
0x82 at the beginning of the second line means 32 bits memory space.
The of_pci_process_ranges() function simply decodes those informations
and fills the struct resource it returns with the appropriate resource
type.

So passing the resource type as argument to of_pci_process_ranges()
doesn't make much sense to me.

Best regards,

Thomas
Arnd Bergmann Feb. 13, 2013, 10:23 a.m. UTC | #13
On Wednesday 13 February 2013, Thomas Petazzoni wrote:
> > I guess as long as we agree on where we are headed with the
> > address translation, it's ok to just move the code now and
> > change the code later. I would like to be strict about the
> > include path stuff though.
> 
> Ok, so I could move the existing address decoding code into say
> drivers/bus/orion-mbus.c. Would that be ok for you?

Sure, sounds good to me.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Feb. 13, 2013, 10:37 a.m. UTC | #14
On Wednesday 13 February 2013, Thomas Petazzoni wrote:
> > > Indeed. So maybe I should mark this resource as being IORESOURCE_MEM
> > > in the DT.
> > 
> > The DT seems fine here, just the code that interprets it is a little
> > unusual. Maybe you can change the calling convention of that function
> > to pass the type of resource you want as an argument?
> 
> Erm? The type of the resource is encoded into the DT:
> 
> +                                 0x81000000 0 0          0xc0000000 0 0x00010000   /* downstream I/O */
> +                                 0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */
> 
> From http://devicetree.org/Device_Tree_Usage#PCI_Host_Bridge:
> 
>      phys.hi cell: npt000ss bbbbbbbb dddddfff rrrrrrrr
>     phys.mid cell: hhhhhhhh hhhhhhhh hhhhhhhh hhhhhhhh
>     phys.low cell: llllllll llllllll llllllll llllllll 
> 
>  ss: space code
> 
>     00: configuration space
>     01: I/O space
>     10: 32 bit memory space
>     11: 64 bit memory space 
> 
> So the 0x81 at the beginning of the first line means I/O space, the
> 0x82 at the beginning of the second line means 32 bits memory space.
> The of_pci_process_ranges() function simply decodes those informations
> and fills the struct resource it returns with the appropriate resource
> type.

Yes, of course. And the ranges property tells you how to turn the first
address space into the second address space. So the above property
defines that the PCI bus I/O space range from 0 to 0x10000 gets converted
into the host MMIO range 0xc0000000 to 0xc0010000 on the host, and
the PCI bus memory space range from 0 to 0x08000000 gets converted to
the host MMIO range 0xc1000000 to 0xc9000000.

The output of your of_pci_process_ranges() function is the host MMIO
range, not the range in the bus address space, so it has to be
IORESOURCE_MEM.

Note that the memory range you have given in the example is a little
unconventional. Usually we try to have an identity mapping between
PCI bus memory space addresses and host MMIO space, which would
result in a zero sys->mem_offset value in addition to the sys->io_offset
value that is already zero.

Is there a limitation with the way that your address translation
windows are set up? What I would hope you could do is a memory window
like 

          <0x82000000 0 0xc1000000   0xc1000000 0 0x08000000>; /* non-prefetchable memory */

that translates the host MMIO range 0xc1000000-0xc9000000 into  bus
address 0xc1000000-0xc9000000 instead of bus address 0x00000000-0x08000000.
Would it help if you switch the addresses around, so that the physical
location of the memory space window is naturally aligned at
0xc0000000-0xc8000000 and the I/O space window is at 0xc8000000-0xc8100000?

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Feb. 13, 2013, 6:04 p.m. UTC | #15
On Wed, Feb 13, 2013 at 09:57:01AM +0100, Thomas Petazzoni wrote:

> > > +	pcie@0,0 {
> > > +		device_type = "pciex";
> > > +		reg = <0x0800 0 0xd0040000 0 0x2000>;
> > 
> > It would be great to get this sorted as per my prior comments.. Maybe
> > like this is easy?
> > 
> > pcie-controller {
> >  compatible = "marvell,armada-370-xp-pcie";
> > 
> >  // Index by marvell,pcie-port ?
> >  regs = <0xd0040000 0x00002000
> >          0xd0080000 0x00002000>;
> > 
> >  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
> >            0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */
> > 
> >  pcie@0,0 {
> >       device_type = "pci";
> >       reg = <0x0800 0 0 0>; // 00:01.0  (????)
> >       marvell,pcie-port = <0>;
> >  };
> > }
> > 
> > It is abusive to map the device internal per-port registers through
> > '0x00000800 0 0xd0040000' and 'reg' - that is not really the intent of
> > the OF spec.
> 
> I am not sure to understand how this would work. Given a pcie@X,Y node,
> how would I find the address of the internal registers (i.e the ones at
> 0xd0040000, 0xd0080000) ?

> You seem to propose a global regs = <...> property under
> pcie-controller, but indexing using marvell,pcie-port cannot work. PCIe
> interfaces are identified by two values (port,lane), so we have 0.0,
> 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 2.0 and 3.0 on MV78460. I really
> would like to avoid having bizarre computations to find which entry in
> this big regs = <...> array correspond to a given PCIe interface.

Well, using the 'global' regs is an easy way to avoid the complexity with
address translation and 'assigned-addresses'. So some way to translate
between link # and regs index is necessary. How about using the PCI
device number? Device number 1 is regs entry 0, 2 is regs entry 1, etc.

The code flow would be then to iterate over DT children, extract the
bus/dev/function from 'reg', ensure bus ==0, and index the global regs
by device-1.

The other option is to try to make assigned-addresses work, Thierry
had some comments on that, it is a bit convoluted too, IMHO.

Also note setting the device_type = "pci";

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Bergmann Feb. 13, 2013, 7:33 p.m. UTC | #16
On Wednesday 13 February 2013, Jason Gunthorpe wrote:
> > You seem to propose a global regs = <...> property under
> > pcie-controller, but indexing using marvell,pcie-port cannot work. PCIe
> > interfaces are identified by two values (port,lane), so we have 0.0,
> > 0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 1.3, 2.0 and 3.0 on MV78460. I really
> > would like to avoid having bizarre computations to find which entry in
> > this big regs = <...> array correspond to a given PCIe interface.
> 
> Well, using the 'global' regs is an easy way to avoid the complexity with
> address translation and 'assigned-addresses'. So some way to translate
> between link # and regs index is necessary. How about using the PCI
> device number? Device number 1 is regs entry 0, 2 is regs entry 1, etc.

Or you could use the "reg-names" property.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Bjorn Helgaas Feb. 15, 2013, 12:36 a.m. UTC | #17
On Tue, Feb 12, 2013 at 9:28 AM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:

> --- a/drivers/pci/host/Kconfig
> +++ b/drivers/pci/host/Kconfig
> @@ -1,4 +1,10 @@
>  menu "PCI host controller drivers"
>         depends on PCI
>
> +config PCI_MVEBU
> +       bool "Marvell EBU PCIe controller"
> +       depends on ARCH_MVEBU
> +       select PCI_SW_HOST_BRIDGE
> +       select PCI_SW_PCI_PCI_BRIDGE

I think PCI_SW_HOST_BRIDGE and PCI_SW_PCI_PCI_BRIDGE are obsolete and
can be removed, right?

> --- /dev/null
> +++ b/drivers/pci/host/pci-mvebu.c
> ...
> +static void mvebu_pcie_setup_io_window(struct mvebu_pcie_port *port,
> +                                      int enable)
> +{
> +       unsigned long iobase, iolimit;
> +
> +       if (port->bridge.iolimit < port->bridge.iobase)
> +               return;
> +
> +       iolimit = 0xFFF | ((port->bridge.iolimit & 0xF0) << 8) |
> +               (port->bridge.iolimitupper << 16);
> +       iobase = ((port->bridge.iobase & 0xF0) << 8) |
> +               (port->bridge.iobaseupper << 16);
> +
> +       if (enable) {
> +               unsigned long physbase = port->pcie->io.start + iobase;
> +               armada_370_xp_alloc_pcie_window(port->port, port->lane,
> +                                               physbase, iobase,
> +                                               iolimit-iobase,
> +                                               IORESOURCE_IO);
> +               pci_ioremap_io(iobase, physbase);
> +       }
> +       else
> +               armada_370_xp_free_pcie_window(iobase);
> +}
> +
> +static void mvebu_pcie_setup_mem_window(struct mvebu_pcie_port *port,
> +                                       int enable)
> +{
> +       unsigned long membase, memlimit;
> +
> +       if (port->bridge.memlimit < port->bridge.membase)
> +               return;
> +
> +       membase = ((port->bridge.membase & 0xFFF0) << 16);
> +       memlimit = ((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF;
> +
> +       if (enable)
> +               armada_370_xp_alloc_pcie_window(port->port, port->lane,
> +                                               membase, ORION_ADDR_MAP_NO_REMAP,
> +                                               memlimit-membase,
> +                                               IORESOURCE_MEM);
> +       else
> +               armada_370_xp_free_pcie_window(membase);
> +}
> +
> +static void mvebu_handle_pcie_command(struct mvebu_pcie_port *port, u16 old,
> +                                     u16 new)
> +{
> +       /* Enabling an I/O window ? */
> +       if (!(old & PCI_COMMAND_IO) && (new & PCI_COMMAND_IO))
> +               mvebu_pcie_setup_io_window(port, 1);
> +
> +       /* Disabling an I/O window ? */
> +       if ((old & PCI_COMMAND_IO) && !(new & PCI_COMMAND_IO))
> +               mvebu_pcie_setup_io_window(port, 0);
> +
> +       /* Enabling a memory window ? */
> +       if (!(old & PCI_COMMAND_MEMORY) && (new & PCI_COMMAND_MEMORY))
> +               mvebu_pcie_setup_mem_window(port, 1);
> +
> +       /* Disabling a memory window ? */
> +       if ((old & PCI_COMMAND_MEMORY) && !(new & PCI_COMMAND_MEMORY))
> +               mvebu_pcie_setup_mem_window(port, 0);
> +}
> ...
> +/* Write to the PCI-to-PCI bridge configuration space */
> +static int mvebu_sw_pci_bridge_write(struct mvebu_pcie_port *port,
> +                                    unsigned int where, int size, u32 value)
> +{
> +       struct mvebu_sw_pci_bridge *bridge = &port->bridge;
> +       u32 mask, reg;
> +       int err;
> +
> +       if (size == 4)
> +               mask = 0x0;
> +       else if (size == 2)
> +               mask = ~(0xffff << ((where & 3) * 8));
> +       else if (size == 1)
> +               mask = ~(0xff << ((where & 3) * 8));
> +       else
> +               return PCIBIOS_BAD_REGISTER_NUMBER;
> +
> +       err = mvebu_sw_pci_bridge_read(port, where & ~3, 4, &reg);
> +       if (err)
> +               return err;
> +
> +       value = (reg & mask) | value << ((where & 3) * 8);
> +
> +       switch (where & ~3) {
> +       case PCI_COMMAND:
> +               mvebu_handle_pcie_command(port, bridge->command,
> +                                         value & 0xffff);
> +               bridge->command = value & 0xffff;
> +               bridge->status = value >> 16;
> +               break;
> +
> +       case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_1:
> +               bridge->bar[((where & ~3) - PCI_BASE_ADDRESS_0) / 4] = value;
> +               break;
> +
> +       case PCI_IO_BASE:
> +               /*
> +                * We also keep bit 1 set, it is a read-only bit that
> +                * indicates we support 32 bits addressing for the
> +                * I/O
> +                */
> +               bridge->iobase = (value & 0xff) | PCI_IO_RANGE_TYPE_32;
> +               bridge->iolimit = ((value >> 8) & 0xff) | PCI_IO_RANGE_TYPE_32;
> +               bridge->secondary_status = value >> 16;
> +               break;
> +
> +       case PCI_MEMORY_BASE:
> +               bridge->membase = value & 0xffff;
> +               bridge->memlimit = value >> 16;
> +               break;
> +
> +       case PCI_PREF_MEMORY_BASE:
> +               bridge->prefmembase = value & 0xffff;
> +               bridge->prefmemlimit = value >> 16;
> +               break;
> +
> +       case PCI_PREF_BASE_UPPER32:
> +               bridge->prefbaseupper = value;
> +               break;
> +
> +       case PCI_PREF_LIMIT_UPPER32:
> +               bridge->preflimitupper = value;
> +               break;

You're relying on a subsequent PCI_COMMAND write to set PCI_COMMAND_IO
and/or PCI_COMMAND_MEMORY, and you program the bridge windows at that
time.  It might be a good idea if the PCI core did clear those bits
while updating the windows, but I'm not sure we do.  In any case,
delaying the update is a difference from a standard P2P bridge that
could cause issues later.

> +       case PCI_IO_BASE_UPPER16:
> +               bridge->iobaseupper = value & 0xffff;
> +               bridge->iolimitupper = value >> 16;
> +               break;
> +
> +       case PCI_PRIMARY_BUS:
> +               bridge->primary_bus             = value & 0xff;
> +               bridge->secondary_bus           = (value >> 8) & 0xff;
> +               bridge->subordinate_bus         = (value >> 16) & 0xff;
> +               bridge->secondary_latency_timer = (value >> 24) & 0xff;
> +               orion_pcie_set_local_bus_nr(port->base, bridge->secondary_bus);
> +               break;
> +
> +       default:
> +               break;
> +       }
> +
> +       return PCIBIOS_SUCCESSFUL;
> +}

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni Feb. 15, 2013, 5:06 a.m. UTC | #18
Dear Bjorn Helgaas,

On Thu, 14 Feb 2013 17:36:39 -0700, Bjorn Helgaas wrote:

> > +config PCI_MVEBU
> > +       bool "Marvell EBU PCIe controller"
> > +       depends on ARCH_MVEBU
> > +       select PCI_SW_HOST_BRIDGE
> > +       select PCI_SW_PCI_PCI_BRIDGE
> 
> I think PCI_SW_HOST_BRIDGE and PCI_SW_PCI_PCI_BRIDGE are obsolete and
> can be removed, right?

Indeed, will fix.


> > +       case PCI_PREF_LIMIT_UPPER32:
> > +               bridge->preflimitupper = value;
> > +               break;
> 
> You're relying on a subsequent PCI_COMMAND write to set PCI_COMMAND_IO
> and/or PCI_COMMAND_MEMORY, and you program the bridge windows at that
> time.

Correct.

>  It might be a good idea if the PCI core did clear those bits
> while updating the windows, but I'm not sure we do.  In any case,
> delaying the update is a difference from a standard P2P bridge that
> could cause issues later.

When would you want the window assignment to occur? Directly when the
registers containing the base/limit are being written to? There are
multiple registers for base/limit, so I'll have to figure out a way to
validate when the base/limit informations are valid.

Thanks,

Thomas
Bjorn Helgaas Feb. 15, 2013, 4:26 p.m. UTC | #19
On Thu, Feb 14, 2013 at 10:06 PM, Thomas Petazzoni
<thomas.petazzoni@free-electrons.com> wrote:

>> > +       case PCI_PREF_LIMIT_UPPER32:
>> > +               bridge->preflimitupper = value;
>> > +               break;
>>
>> You're relying on a subsequent PCI_COMMAND write to set PCI_COMMAND_IO
>> and/or PCI_COMMAND_MEMORY, and you program the bridge windows at that
>> time.
>
> Correct.
>
>>  It might be a good idea if the PCI core did clear those bits
>> while updating the windows, but I'm not sure we do.  In any case,
>> delaying the update is a difference from a standard P2P bridge that
>> could cause issues later.
>
> When would you want the window assignment to occur? Directly when the
> registers containing the base/limit are being written to? There are
> multiple registers for base/limit, so I'll have to figure out a way to
> validate when the base/limit informations are valid.

For a standard bridge, the window change occurs immediately upon
writing the base/limit, so that would be the best time.

Here's the scenario I have in mind: We have a bridge X leading to
several devices A and B.  Drivers are bound to A and B, and they are
operating.  A hotplug event requires us to expand or shrink one of X's
windows, but the new window is still sufficient to operate A and B.
The question is how we do the window update.  If we turn off
PCI_COMMAND_MEMORY for the update, then we have to coordinate with the
drivers for A and B.  But I think it is possible to do some updates
safely even with the window decoding enabled, without any effect on A
and B, and these are the ones that would cause trouble with your
current scheme.

I don't think the PCI core does any updates like this today, but it's
something we might like to do in the future.

Bjorn
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe Feb. 15, 2013, 4:44 p.m. UTC | #20
On Fri, Feb 15, 2013 at 09:26:09AM -0700, Bjorn Helgaas wrote:
> On Thu, Feb 14, 2013 at 10:06 PM, Thomas Petazzoni
> <thomas.petazzoni@free-electrons.com> wrote:
> 
> >> > +       case PCI_PREF_LIMIT_UPPER32:
> >> > +               bridge->preflimitupper = value;
> >> > +               break;
> >>
> >> You're relying on a subsequent PCI_COMMAND write to set PCI_COMMAND_IO
> >> and/or PCI_COMMAND_MEMORY, and you program the bridge windows at that
> >> time.
> >
> > Correct.
> >
> >>  It might be a good idea if the PCI core did clear those bits
> >> while updating the windows, but I'm not sure we do.  In any case,
> >> delaying the update is a difference from a standard P2P bridge that
> >> could cause issues later.
> >
> > When would you want the window assignment to occur? Directly when the
> > registers containing the base/limit are being written to? There are
> > multiple registers for base/limit, so I'll have to figure out a way to
> > validate when the base/limit informations are valid.
> 
> For a standard bridge, the window change occurs immediately upon
> writing the base/limit, so that would be the best time.

The standard says a window where limit <= base is invalid/disabled.

So the write mechanism from the PCI core should be

base = 0xFFFFFFFF
limit = true value
base = true value

In the driver the window should be synchronized after any base/limit
update where base < limit. The initial values (ignoring fixed bits)
should be

base = 0xFFFFFFFF
limit = 0

It would be optimal to use the resource framework to ensure the region
is available, but that is something for the mbus driver to take care
of, not PCI-E...

> drivers for A and B.  But I think it is possible to do some updates
> safely even with the window decoding enabled, without any effect on A
> and B, and these are the ones that would cause trouble with your
> current scheme.

A resize update should just write the true values in any order, the
bridge window will expand without interrupting existing traffic.

However the core code must ensure that at no time do two bridges
decode the same window. That could crash the machine..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni March 6, 2013, 9:50 a.m. UTC | #21
Dear Arnd Bergmann,

On Wed, 13 Feb 2013 10:37:02 +0000, Arnd Bergmann wrote:

> Yes, of course. And the ranges property tells you how to turn the
> first address space into the second address space. So the above
> property defines that the PCI bus I/O space range from 0 to 0x10000
> gets converted into the host MMIO range 0xc0000000 to 0xc0010000 on
> the host, and the PCI bus memory space range from 0 to 0x08000000
> gets converted to the host MMIO range 0xc1000000 to 0xc9000000.
> 
> The output of your of_pci_process_ranges() function is the host MMIO
> range, not the range in the bus address space, so it has to be
> IORESOURCE_MEM.

I am sorry, but I don't get how this can work. My code currently relies
on the DT encoding one I/O resource and one MEM resource to find which
is one to use.

Right now, in the ranges property of my DT, I have:

          0x81000000 0 0          0xc0000000 0 0x00010000   /* downstream I/O */
          0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */

And the code parsing this does:

        /* Get the I/O and memory ranges from DT */
        while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
                if (resource_type(&res) == IORESOURCE_IO) {
                        memcpy(&pcie->io, &res, sizeof(res));
                        memcpy(&pcie->realio, &res, sizeof(res));
                        pcie->io.name = "I/O";
                        pcie->realio.start &= 0xFFFFF;
                        pcie->realio.end   &= 0xFFFFF;
                }
                if (resource_type(&res) == IORESOURCE_MEM) {
                        memcpy(&pcie->mem, &res, sizeof(res));
                        pcie->mem.name = "MEM";
                }
        }

As you can see, it relies on one of the two ranges being for I/O
(address starts with 0x81), and another range being for MEM (address
starts with 0x82).

If I make both ranges start with 0x82, how can my code differentiate
between the range used for I/O and the range used for MEM?

Thanks,

Thomas
Thomas Petazzoni March 6, 2013, 9:54 a.m. UTC | #22
Dear Jason Gunthorpe,

On Tue, 12 Feb 2013 15:35:11 -0700, Jason Gunthorpe wrote:

> > +	pcie@0,0 {
> > +		device_type = "pciex";
> > +		reg = <0x0800 0 0xd0040000 0 0x2000>;
> 
> It would be great to get this sorted as per my prior comments.. Maybe
> like this is easy?
> 
> pcie-controller {
>  compatible = "marvell,armada-370-xp-pcie";
> 
>  // Index by marvell,pcie-port ?
>  regs = <0xd0040000 0x00002000
>          0xd0080000 0x00002000>;
> 
>  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
>            0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */
> 
>  pcie@0,0 {
>       device_type = "pci";
>       reg = <0x0800 0 0 0>; // 00:01.0  (????)
>       marvell,pcie-port = <0>;
>  };
> }
> 
> It is abusive to map the device internal per-port registers through
> '0x00000800 0 0xd0040000' and 'reg' - that is not really the intent of
> the OF spec.

The Device Tree would really look odd. We have one register range for
each PCIe interface, but instead of nicely putting them inside the
pcie@X,Y subnodes, we have a global regs = <..> property at the
pcie-controller level? I can do that if you want, but it really sounds
like the standard PCI DT bindings are horrible. Those register ranges
are *per* PCIe interface, so any logical person would expect them
inside the pcie@X,Y node...

But ok, if that's the way things should be, so be it.

Best regards,

Thomas
Arnd Bergmann March 6, 2013, 10:43 a.m. UTC | #23
On Wednesday 06 March 2013, Thomas Petazzoni wrote:
> Dear Arnd Bergmann,
> 
> On Wed, 13 Feb 2013 10:37:02 +0000, Arnd Bergmann wrote:
> 
> > Yes, of course. And the ranges property tells you how to turn the
> > first address space into the second address space. So the above
> > property defines that the PCI bus I/O space range from 0 to 0x10000
> > gets converted into the host MMIO range 0xc0000000 to 0xc0010000 on
> > the host, and the PCI bus memory space range from 0 to 0x08000000
> > gets converted to the host MMIO range 0xc1000000 to 0xc9000000.
> > 
> > The output of your of_pci_process_ranges() function is the host MMIO
> > range, not the range in the bus address space, so it has to be
> > IORESOURCE_MEM.
> 
> I am sorry, but I don't get how this can work. My code currently relies
> on the DT encoding one I/O resource and one MEM resource to find which
> is one to use.
> 
> Right now, in the ranges property of my DT, I have:
> 
>           0x81000000 0 0          0xc0000000 0 0x00010000   /* downstream I/O */
>           0x82000000 0 0          0xc1000000 0 0x08000000>; /* non-prefetchable memory */

This looks correct, yes, although as I suggested earlier it would be more
conventional to use

          0x81000000 0 0          0xc0000000 0 0x00010000   /* downstream I/O */
          0x82000000 0 0xc1000000 0xc1000000 0 0x08000000>; /* non-prefetchable memory */


> And the code parsing this does:
> 
>         /* Get the I/O and memory ranges from DT */
>         while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
>                 if (resource_type(&res) == IORESOURCE_IO) {
>                         memcpy(&pcie->io, &res, sizeof(res));
>                         memcpy(&pcie->realio, &res, sizeof(res));
>                         pcie->io.name = "I/O";
>                         pcie->realio.start &= 0xFFFFF;
>                         pcie->realio.end   &= 0xFFFFF;
>                 }
>                 if (resource_type(&res) == IORESOURCE_MEM) {
>                         memcpy(&pcie->mem, &res, sizeof(res));
>                         pcie->mem.name = "MEM";
>                 }
>         }
> 
> As you can see, it relies on one of the two ranges being for I/O
> (address starts with 0x81), and another range being for MEM (address
> starts with 0x82).
> 
> If I make both ranges start with 0x82, how can my code differentiate
> between the range used for I/O and the range used for MEM?

No, I mean the parser is wrong ;-)

I think the proper solution would be to change the parser to return both
the host side resource, and another resource as the bus view. In the example
above, the resources you get should be:

io_host  = { .flags = IORESOURCE_MEM, .start = 0xc0000000, .end = 0xc000ffff };
io_bus   = { .flags = IORESOURCE_IO,  .start = 0,          .end =     0xffff };
mem_host = { .flags = IORESOURCE_MEM, .start = 0xc1000000, .end = 0xc9ffffff };
mem_bus  = { .flags = IORESOURCE_MEM, .start = 0,          .end =  0x7ffffff };

indicating that the PCI I/O space range 0-0xffff is mapped to the host physical
address space at 0xc0000000-0xc000ffff, and that the PCI memory space range 
0-0x7ffffff is mapped to host physical address space 0xc1000000-0xc9ffffff.

You really need all four resources to set up the mapping. For I/O space, you
should do

	pci_ioremap_io(io_bus->start, io_host->start);

and for the memory space, you need both to set up the correct mem_offset:

	pci_add_resource_offset(&sys->resources, mem_host, mem_host->start - mem_bus->start);

If you do as I suggested above and set the memory window to an identity
map at bus address 0xc1000000, the second argument to pci_add_resource_offset
becomes zero, indicating that any access to the window results in a bus
access at the same address.

	Arnd
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 6, 2013, 12:11 p.m. UTC | #24
On Wed, Mar 06, 2013 at 10:54:41AM +0100, Thomas Petazzoni wrote:
> Dear Jason Gunthorpe,
> 
> On Tue, 12 Feb 2013 15:35:11 -0700, Jason Gunthorpe wrote:
> 
> > > +	pcie@0,0 {
> > > +		device_type = "pciex";
> > > +		reg = <0x0800 0 0xd0040000 0 0x2000>;
> > 
> > It would be great to get this sorted as per my prior comments.. Maybe
> > like this is easy?
> > 
> > pcie-controller {
> >  compatible = "marvell,armada-370-xp-pcie";
> > 
> >  // Index by marvell,pcie-port ?
> >  regs = <0xd0040000 0x00002000
> >          0xd0080000 0x00002000>;
> > 
> >  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
> >            0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */
> > 
> >  pcie@0,0 {
> >       device_type = "pci";
> >       reg = <0x0800 0 0 0>; // 00:01.0  (????)
> >       marvell,pcie-port = <0>;
> >  };
> > }
> > 
> > It is abusive to map the device internal per-port registers through
> > '0x00000800 0 0xd0040000' and 'reg' - that is not really the intent of
> > the OF spec.
> 
> The Device Tree would really look odd. We have one register range for
> each PCIe interface, but instead of nicely putting them inside the
> pcie@X,Y subnodes, we have a global regs = <..> property at the
> pcie-controller level? I can do that if you want, but it really sounds
> like the standard PCI DT bindings are horrible. Those register ranges
> are *per* PCIe interface, so any logical person would expect them
> inside the pcie@X,Y node...

I agree here. It seems that back at the time when the PCI DT bindings
were devised no hardware existed with these properties. So I think that
maybe it needs to be revised to take into account some of these new
requirements. Just adding a bunch of properties to work around
shortcomings in the specification doesn't seem very desirable to me.

That said, on Tegra there's an additional argument in favour of this
because the addresses mapped through the ranges/reg properties are
actually used to access the device's configuration space, which the
'ss' field in the first cell specifies. Quoting the DTS:

	pcie-controller {
		...
		ranges = <0x00000800 0 0x80000000 0x80000000 0 0x00001000   /* port 0 registers */
			  0x00001000 0 0x80001000 0x80001000 0 0x00001000   /* port 1 registers */
			  0x81000000 0 0          0x82000000 0 0x00010000   /* downstream I/O */
			  0x82000000 0 0          0xa0000000 0 0x10000000   /* non-prefetchable memory */
			  0xc2000000 0 0          0xb0000000 0 0x10000000>; /* prefetchable memory */
		...
		pci@1,0 {
			...
			reg = <0x000800 0 0x80000000 0 0x1000>;
			...
		};

		pci@2,0 {
			...
			reg = <0x001000 0 0x80001000 0 0x1000>;
			...
		};
	};

So we're actually mapping the configuration space of pci@1,0 and pci@2,0
to the 0x80000000 and 0x80001000 addresses of the CPU physical address
space. According to the PCI DT specification this isn't allowed, but it
works on Linux and quite frankly I don't see why it shouldn't be
allowed.

Thierry
Jason Gunthorpe March 6, 2013, 6:09 p.m. UTC | #25
On Wed, Mar 06, 2013 at 01:11:19PM +0100, Thierry Reding wrote:

> I agree here. It seems that back at the time when the PCI DT bindings
> were devised no hardware existed with these properties. So I think that
> maybe it needs to be revised to take into account some of these new
> requirements. Just adding a bunch of properties to work around
> shortcomings in the specification doesn't seem very desirable to me.

Well, all the PCI hosts should use the same mechanism, and it should
make sense in the context of the rest of the spec. So I think some
approval/comment on this specific issue by Grant/Rob/Arnd is important
at this point.

The way the spec is written, 'reg' in a 'device_type = pci' context is
the device's config address - it is like 'reg' in a CPU block, or I2C
bus - the value has nothing to do with CPU mappable memory.

> space. According to the PCI DT specification this isn't allowed, but it
> works on Linux and quite frankly I don't see why it shouldn't be
> allowed.

AFAIK, this only works because you are not using 'device_type = pci'.

IMHO, 'device_type = pci' is not optional, so you need to come up with
something that works under those conditions, and it changes how the OF
core handles 'reg' and the like.

As I said before, the two best options to me seem to be:

pcie-controller {
 compatible = "marvell,armada-370-xp-pcie";

  // Index by marvell,pcie-port/reg-names/etc
  regs = <0xd0040000 0x00002000
          0xd0080000 0x00002000>;

  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
            0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */

  pcie@0,0 {
       device_type = "pci";
       reg = <0x0800 0 0 0>; // 00:01.0  (????)
       marvell,pcie-port = <0>;
  };
}

or:

pcie-controller {
  compatible = "marvell,armada-370-xp-pcie";

  ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
            0x82000000 0 0  0xc1000000  0 0x08000000   /* non-prefetchable memory */
            // Mapping for per-port host controller registers
            0x82000000 0x10000000 0xd0040000  0xd0040000  0 0x2000>;

  pcie@0,0 {
       device_type = "pci";
       reg = <0x0800 0 0 0>; // 00:01.0  (????)
       marvell,pcie-port = <0>;
       
       assigned-addresses = <0 0 0  0 0 // BAR 0 of the bridge, unused
       			     0 0 0  0 0 // BAR 1 of the bridge, unused
                             0x82000000 0x10000000 0xd0040000  0 0x2000>; // Extended

  };
}

Both have various problems, but I think I prefer the first one as it
doesn't conflate the contoller registers and host apertures in a
single ranges..

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 7, 2013, 8:08 a.m. UTC | #26
On Wed, Mar 06, 2013 at 11:09:46AM -0700, Jason Gunthorpe wrote:
> On Wed, Mar 06, 2013 at 01:11:19PM +0100, Thierry Reding wrote:
> 
> > I agree here. It seems that back at the time when the PCI DT bindings
> > were devised no hardware existed with these properties. So I think that
> > maybe it needs to be revised to take into account some of these new
> > requirements. Just adding a bunch of properties to work around
> > shortcomings in the specification doesn't seem very desirable to me.
> 
> Well, all the PCI hosts should use the same mechanism, and it should
> make sense in the context of the rest of the spec. So I think some
> approval/comment on this specific issue by Grant/Rob/Arnd is important
> at this point.
> 
> The way the spec is written, 'reg' in a 'device_type = pci' context is
> the device's config address - it is like 'reg' in a CPU block, or I2C
> bus - the value has nothing to do with CPU mappable memory.
> 
> > space. According to the PCI DT specification this isn't allowed, but it
> > works on Linux and quite frankly I don't see why it shouldn't be
> > allowed.
> 
> AFAIK, this only works because you are not using 'device_type = pci'.
> 
> IMHO, 'device_type = pci' is not optional, so you need to come up with
> something that works under those conditions, and it changes how the OF
> core handles 'reg' and the like.

Actually I do use device_type = "pciex" for the pci@1,0 and pci@2,0
nodes which was suggested back when I first posted this series. Yet the
address translation still works properly. And it should since they are
translated to the parent bus which is used to access the configuration
space window for each root port.

Looking at the code it looks indeed like the OF core only matches on
"pci" (and "vci"), but not "pciex". Perhaps I should change both nodes
to device_type = "pci" and see what the results are.

> As I said before, the two best options to me seem to be:
> 
> pcie-controller {
>  compatible = "marvell,armada-370-xp-pcie";
> 
>   // Index by marvell,pcie-port/reg-names/etc
>   regs = <0xd0040000 0x00002000
>           0xd0080000 0x00002000>;
> 
>   ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
>             0x82000000 0 0  0xc1000000  0 0x08000000>; /* non-prefetchable memory */
> 
>   pcie@0,0 {
>        device_type = "pci";
>        reg = <0x0800 0 0 0>; // 00:01.0  (????)
>        marvell,pcie-port = <0>;
>   };
> }
> 
> or:
> 
> pcie-controller {
>   compatible = "marvell,armada-370-xp-pcie";
> 
>   ranges = <0x81000000 0 0  0xc0000000  0 0x00010000   /* downstream I/O */
>             0x82000000 0 0  0xc1000000  0 0x08000000   /* non-prefetchable memory */
>             // Mapping for per-port host controller registers
>             0x82000000 0x10000000 0xd0040000  0xd0040000  0 0x2000>;
> 
>   pcie@0,0 {
>        device_type = "pci";
>        reg = <0x0800 0 0 0>; // 00:01.0  (????)
>        marvell,pcie-port = <0>;
>        
>        assigned-addresses = <0 0 0  0 0 // BAR 0 of the bridge, unused
>        			     0 0 0  0 0 // BAR 1 of the bridge, unused
>                              0x82000000 0x10000000 0xd0040000  0 0x2000>; // Extended
> 
>   };
> }
> 
> Both have various problems, but I think I prefer the first one as it
> doesn't conflate the contoller registers and host apertures in a
> single ranges..

I think a better alternative would be (and this matches what Thomas has
said elsewhere) to use something like the first alternative but move the
regs property into the pcie@0,X nodes. That would save us from having to
index a property in the parent. At least from a DT point of view I find
that to be a more consistent representation.

However that would probably not work out-of-the-box with the current OF
core because of_address_to_resource() won't know how to find the new
property. The assigned-addresses alternative seems to be the only one
that would work on the current OF core and achieve proper address
translation. But it doesn't seem like a good solution either since it
repurposes the meaning of the property and therefore isn't any better
than encoding the same information in the reg property.

Thierry
Jason Gunthorpe March 7, 2013, 5:49 p.m. UTC | #27
On Thu, Mar 07, 2013 at 09:08:32AM +0100, Thierry Reding wrote:

> Actually I do use device_type = "pciex" for the pci@1,0 and pci@2,0
> nodes which was suggested back when I first posted this series. Yet the
> address translation still works properly. And it should since they are
> translated to the parent bus which is used to access the configuration
> space window for each root port.

Right - pciex != pci - as you note:

> Looking at the code it looks indeed like the OF core only matches on
> "pci" (and "vci"), but not "pciex". Perhaps I should change both nodes
> to device_type = "pci" and see what the results are.

The string 'pci' engages the special behavior, pciex isn't a valid
keyword.

This is why I keep saying these bindings need to use 'device_type = "pci"' :)

> > Both have various problems, but I think I prefer the first one as it
> > doesn't conflate the contoller registers and host apertures in a
> > single ranges..
> 
> I think a better alternative would be (and this matches what Thomas has
> said elsewhere) to use something like the first alternative but move the
> regs property into the pcie@0,X nodes. That would save us from having to
> index a property in the parent. At least from a DT point of view I find
> that to be a more consistent representation.

You are thinking a new property 'host-controller-regs' or the like?

> However that would probably not work out-of-the-box with the current OF
> core because of_address_to_resource() won't know how to find the new
> property. The assigned-addresses alternative seems to be the only one
> that would work on the current OF core and achieve proper address
> translation. But it doesn't seem like a good solution either since it
> repurposes the meaning of the property and therefore isn't any better
> than encoding the same information in the reg property.

Except that reg is specifically not supposed to be handling CPU bus
addresses and assigned-addresses is.

Adding a hidden non-standard BAR to model the hidden non-standard
memory region associated with the bridge is a stretch, but it is not a
very big stretch...

In any event, regs should not be used, and something needs to be
decided!

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 7, 2013, 7:48 p.m. UTC | #28
On Thu, Mar 07, 2013 at 10:49:55AM -0700, Jason Gunthorpe wrote:
> On Thu, Mar 07, 2013 at 09:08:32AM +0100, Thierry Reding wrote:
[...]
> > > Both have various problems, but I think I prefer the first one as it
> > > doesn't conflate the contoller registers and host apertures in a
> > > single ranges..
> > 
> > I think a better alternative would be (and this matches what Thomas has
> > said elsewhere) to use something like the first alternative but move the
> > regs property into the pcie@0,X nodes. That would save us from having to
> > index a property in the parent. At least from a DT point of view I find
> > that to be a more consistent representation.
> 
> You are thinking a new property 'host-controller-regs' or the like?

Well, something shorter would be nice, but that's the general idea, yes.
As I mentioned before, for Tegra these registers aren't actually any
controller specific registers but rather a window to access the PCI
configuration space for the root ports.

Going via the FPCI mapping of the configuration space doesn't generate
transactions on bus 0, which is why there are extra windows for each
root port. That's why I was arguing about reusing the reg property in
the first place because it actually represents the configuration space.

But this is likely to be unique to Tegra and Marvell hardware doesn't
behave this way, but instead needs the registers for a different
purpose. Something like host-controller-regs might be more appropriate.

> > However that would probably not work out-of-the-box with the current OF
> > core because of_address_to_resource() won't know how to find the new
> > property. The assigned-addresses alternative seems to be the only one
> > that would work on the current OF core and achieve proper address
> > translation. But it doesn't seem like a good solution either since it
> > repurposes the meaning of the property and therefore isn't any better
> > than encoding the same information in the reg property.
> 
> Except that reg is specifically not supposed to be handling CPU bus
> addresses and assigned-addresses is.
> 
> Adding a hidden non-standard BAR to model the hidden non-standard
> memory region associated with the bridge is a stretch, but it is not a
> very big stretch...
> 
> In any event, regs should not be used, and something needs to be
> decided!

I don't think assigned-addresses is a good fit either. The PCI binding
document is equally specific about it as it is about the reg property.
So in my opinion a separate property would be a better choice. The only
big obstacle is that it needs to be somehow hooked up with the OF core
so that proper address translation can be performed.

One possible solution that wouldn't be too hard to implement is to
provide a new function (say of_get_named_address()) similar to
of_get_address() which doesn't get the name of the register property
from the struct of_bus but from a parameter and call that function from
another new function similar to of_address_to_resource() that also gets
the property name from a parameter. I can't think of a better name for
the latter than of_named_address_to_resource(), which is rather long.

Thierry
Jason Gunthorpe March 7, 2013, 8:02 p.m. UTC | #29
On Thu, Mar 07, 2013 at 08:48:30PM +0100, Thierry Reding wrote:
> On Thu, Mar 07, 2013 at 10:49:55AM -0700, Jason Gunthorpe wrote:
> > On Thu, Mar 07, 2013 at 09:08:32AM +0100, Thierry Reding wrote:
> [...]
> > > > Both have various problems, but I think I prefer the first one as it
> > > > doesn't conflate the contoller registers and host apertures in a
> > > > single ranges..
> > > 
> > > I think a better alternative would be (and this matches what Thomas has
> > > said elsewhere) to use something like the first alternative but move the
> > > regs property into the pcie@0,X nodes. That would save us from having to
> > > index a property in the parent. At least from a DT point of view I find
> > > that to be a more consistent representation.
> > 
> > You are thinking a new property 'host-controller-regs' or the like?
> 
> Well, something shorter would be nice, but that's the general idea, yes.
> As I mentioned before, for Tegra these registers aren't actually any
> controller specific registers but rather a window to access the PCI
> configuration space for the root ports.

Yes, I understand - but in this DT model configuration space access is
a host controller function, not a PCI-device function. Anyhow I was
also thinking that by the choice of the name it could do translation
from the host-controller scope, not from the bridge scope - so the
extra elements in ranges could be avoided as well. Hence the name..

> I don't think assigned-addresses is a good fit either. The PCI binding
> document is equally specific about it as it is about the reg property.
> So in my opinion a separate property would be a better choice. The only
> big obstacle is that it needs to be somehow hooked up with the OF core
> so that proper address translation can be performed.

Yes, agreed.

My suggestion is to get the OF experts like GKL/Rob H/etc to weigh in
on a preferred approach to this problem with the goal of standardizing
across all PCI host drivers. Seems like there are 2 main options
(outside regs + regnames/etc or new 'regs' in the bridge) and 1 hacky
one (assigned addresses)

> One possible solution that wouldn't be too hard to implement is to
> provide a new function (say of_get_named_address()) similar to
> of_get_address() which doesn't get the name of the register property
> from the struct of_bus but from a parameter and call that function from
> another new function similar to of_address_to_resource() that also gets
> the property name from a parameter. I can't think of a better name for
> the latter than of_named_address_to_resource(), which is rather long.

Seems like a reasonable API, maybe pass in a be32*/length pointer
instead of a name to be more flexible?

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 7, 2013, 8:47 p.m. UTC | #30
On Thu, Mar 07, 2013 at 01:02:35PM -0700, Jason Gunthorpe wrote:
> On Thu, Mar 07, 2013 at 08:48:30PM +0100, Thierry Reding wrote:
> > On Thu, Mar 07, 2013 at 10:49:55AM -0700, Jason Gunthorpe wrote:
> > > On Thu, Mar 07, 2013 at 09:08:32AM +0100, Thierry Reding wrote:
> > [...]
> > > > > Both have various problems, but I think I prefer the first one as it
> > > > > doesn't conflate the contoller registers and host apertures in a
> > > > > single ranges..
> > > > 
> > > > I think a better alternative would be (and this matches what Thomas has
> > > > said elsewhere) to use something like the first alternative but move the
> > > > regs property into the pcie@0,X nodes. That would save us from having to
> > > > index a property in the parent. At least from a DT point of view I find
> > > > that to be a more consistent representation.
> > > 
> > > You are thinking a new property 'host-controller-regs' or the like?
> > 
> > Well, something shorter would be nice, but that's the general idea, yes.
> > As I mentioned before, for Tegra these registers aren't actually any
> > controller specific registers but rather a window to access the PCI
> > configuration space for the root ports.
> 
> Yes, I understand - but in this DT model configuration space access is
> a host controller function, not a PCI-device function. Anyhow I was
> also thinking that by the choice of the name it could do translation
> from the host-controller scope, not from the bridge scope - so the
> extra elements in ranges could be avoided as well. Hence the name..
> 
> > I don't think assigned-addresses is a good fit either. The PCI binding
> > document is equally specific about it as it is about the reg property.
> > So in my opinion a separate property would be a better choice. The only
> > big obstacle is that it needs to be somehow hooked up with the OF core
> > so that proper address translation can be performed.
> 
> Yes, agreed.
> 
> My suggestion is to get the OF experts like GKL/Rob H/etc to weigh in
> on a preferred approach to this problem with the goal of standardizing
> across all PCI host drivers. Seems like there are 2 main options
> (outside regs + regnames/etc or new 'regs' in the bridge) and 1 hacky
> one (assigned addresses)

Arnd is already Cc'ed on this thread, adding Grant, Rob and the
devicetree-discuss ML.

In a nutshell (since some of the context isn't quoted anymore) the
problem that we're trying to solve is that some of the embedded SoCs
require per-root-port registers for configuration. The PCI DT
specification doesn't make any provisions for this. A few alternatives
have been discussed so far:

	1) Use a "regs" property outside of the root port nodes with
	   some mechanism to index into them from within the root port
	   nodes. Conceptually somewhat like this:

		pcie-controller {
			...
			regs = <0x80000000 0x00001000
			        0x80001000 0x00001000>;

			pci@0,1 {
				...
				port-index = <0>;
			};

			pci@0,2 {
				...
				port-index = <1>;
			};
		};

	2) Use a "regs" property inside of the root part nodes, along
	   the following lines:

		pcie-controller {
			...
			pci@0,1 {
				...
				reg = <0x00000800 0 0 0 0>;

				regs = <0x80000000 0x00001000>;
			};

			pci@0,2 {
				...
				reg = <0x00001000 0 0 0 0>;

				regs = <0x80001000 0x00001000>;
			};
		};

	3) Repurpose the "assigned-addresses" property to achieve the
	   same. This should work out-of-the-box but isn't a good fit
	   because it conflicts with the OF PCI specification which
	   defines this property to contain the addresses assigned to
	   the base address registers.

Options 1 and 2 above require changes to the OF core to allow proper
address translation, but the changes shouldn't be very big.

> > One possible solution that wouldn't be too hard to implement is to
> > provide a new function (say of_get_named_address()) similar to
> > of_get_address() which doesn't get the name of the register property
> > from the struct of_bus but from a parameter and call that function from
> > another new function similar to of_address_to_resource() that also gets
> > the property name from a parameter. I can't think of a better name for
> > the latter than of_named_address_to_resource(), which is rather long.
> 
> Seems like a reasonable API, maybe pass in a be32*/length pointer
> instead of a name to be more flexible?

There's already __of_address_to_resource() which takes a be32 * and size
but I thought it might be easier to wrap that to make it easier on the
drivers to use the API.

Thierry
Rob Herring March 8, 2013, 12:05 a.m. UTC | #31
On 03/07/2013 02:47 PM, Thierry Reding wrote:
> On Thu, Mar 07, 2013 at 01:02:35PM -0700, Jason Gunthorpe wrote:
>> On Thu, Mar 07, 2013 at 08:48:30PM +0100, Thierry Reding wrote:
>>> On Thu, Mar 07, 2013 at 10:49:55AM -0700, Jason Gunthorpe wrote:
>>>> On Thu, Mar 07, 2013 at 09:08:32AM +0100, Thierry Reding wrote:
>>> [...]
>>>>>> Both have various problems, but I think I prefer the first one as it
>>>>>> doesn't conflate the contoller registers and host apertures in a
>>>>>> single ranges..
>>>>>
>>>>> I think a better alternative would be (and this matches what Thomas has
>>>>> said elsewhere) to use something like the first alternative but move the
>>>>> regs property into the pcie@0,X nodes. That would save us from having to
>>>>> index a property in the parent. At least from a DT point of view I find
>>>>> that to be a more consistent representation.
>>>>
>>>> You are thinking a new property 'host-controller-regs' or the like?
>>>
>>> Well, something shorter would be nice, but that's the general idea, yes.
>>> As I mentioned before, for Tegra these registers aren't actually any
>>> controller specific registers but rather a window to access the PCI
>>> configuration space for the root ports.
>>
>> Yes, I understand - but in this DT model configuration space access is
>> a host controller function, not a PCI-device function. Anyhow I was
>> also thinking that by the choice of the name it could do translation
>> from the host-controller scope, not from the bridge scope - so the
>> extra elements in ranges could be avoided as well. Hence the name..
>>
>>> I don't think assigned-addresses is a good fit either. The PCI binding
>>> document is equally specific about it as it is about the reg property.
>>> So in my opinion a separate property would be a better choice. The only
>>> big obstacle is that it needs to be somehow hooked up with the OF core
>>> so that proper address translation can be performed.
>>
>> Yes, agreed.
>>
>> My suggestion is to get the OF experts like GKL/Rob H/etc to weigh in
>> on a preferred approach to this problem with the goal of standardizing
>> across all PCI host drivers. Seems like there are 2 main options
>> (outside regs + regnames/etc or new 'regs' in the bridge) and 1 hacky
>> one (assigned addresses)
> 
> Arnd is already Cc'ed on this thread, adding Grant, Rob and the
> devicetree-discuss ML.
> 
> In a nutshell (since some of the context isn't quoted anymore) the
> problem that we're trying to solve is that some of the embedded SoCs
> require per-root-port registers for configuration. The PCI DT
> specification doesn't make any provisions for this. A few alternatives
> have been discussed so far:

I'm not sure I follow. This is different than the host controller
registers? Why would this not just be multiple entries in the reg property?

Rob

> 
> 	1) Use a "regs" property outside of the root port nodes with
> 	   some mechanism to index into them from within the root port
> 	   nodes. Conceptually somewhat like this:
> 
> 		pcie-controller {
> 			...
> 			regs = <0x80000000 0x00001000
> 			        0x80001000 0x00001000>;
> 
> 			pci@0,1 {
> 				...
> 				port-index = <0>;
> 			};
> 
> 			pci@0,2 {
> 				...
> 				port-index = <1>;
> 			};
> 		};
> 
> 	2) Use a "regs" property inside of the root part nodes, along
> 	   the following lines:
> 
> 		pcie-controller {
> 			...
> 			pci@0,1 {
> 				...
> 				reg = <0x00000800 0 0 0 0>;
> 
> 				regs = <0x80000000 0x00001000>;
> 			};
> 
> 			pci@0,2 {
> 				...
> 				reg = <0x00001000 0 0 0 0>;
> 
> 				regs = <0x80001000 0x00001000>;
> 			};
> 		};
> 
> 	3) Repurpose the "assigned-addresses" property to achieve the
> 	   same. This should work out-of-the-box but isn't a good fit
> 	   because it conflicts with the OF PCI specification which
> 	   defines this property to contain the addresses assigned to
> 	   the base address registers.
> 
> Options 1 and 2 above require changes to the OF core to allow proper
> address translation, but the changes shouldn't be very big.
> 
>>> One possible solution that wouldn't be too hard to implement is to
>>> provide a new function (say of_get_named_address()) similar to
>>> of_get_address() which doesn't get the name of the register property
>>> from the struct of_bus but from a parameter and call that function from
>>> another new function similar to of_address_to_resource() that also gets
>>> the property name from a parameter. I can't think of a better name for
>>> the latter than of_named_address_to_resource(), which is rather long.
>>
>> Seems like a reasonable API, maybe pass in a be32*/length pointer
>> instead of a name to be more flexible?
> 
> There's already __of_address_to_resource() which takes a be32 * and size
> but I thought it might be easier to wrap that to make it easier on the
> drivers to use the API.
> 
> Thierry
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 8, 2013, 7:14 a.m. UTC | #32
On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
> On 03/07/2013 02:47 PM, Thierry Reding wrote:
[...]
> > In a nutshell (since some of the context isn't quoted anymore) the
> > problem that we're trying to solve is that some of the embedded SoCs
> > require per-root-port registers for configuration. The PCI DT
> > specification doesn't make any provisions for this. A few alternatives
> > have been discussed so far:
> 
> I'm not sure I follow. This is different than the host controller
> registers? Why would this not just be multiple entries in the reg property?

Well the register regions are per root-port. On Tegra20 there's 2 of
them, Tegra30 has 3 and if I understand correctly Marvell can have up to
10 (!). Adding all of them to the reg property of the host controller
could work but it needs some way to match the reg entry to the root port
similar to option 1 below.

Adding a property in the root port nodes seems like a cleaner and more
accurate representation of the hardware to me, but if that's not
acceptable perhaps we need to bite the bullet and add the code to look
the registers up from the parent's reg property.

Thierry

> > 	1) Use a "regs" property outside of the root port nodes with
> > 	   some mechanism to index into them from within the root port
> > 	   nodes. Conceptually somewhat like this:
> > 
> > 		pcie-controller {
> > 			...
> > 			regs = <0x80000000 0x00001000
> > 			        0x80001000 0x00001000>;
> > 
> > 			pci@0,1 {
> > 				...
> > 				port-index = <0>;
> > 			};
> > 
> > 			pci@0,2 {
> > 				...
> > 				port-index = <1>;
> > 			};
> > 		};
> > 
> > 	2) Use a "regs" property inside of the root part nodes, along
> > 	   the following lines:
> > 
> > 		pcie-controller {
> > 			...
> > 			pci@0,1 {
> > 				...
> > 				reg = <0x00000800 0 0 0 0>;
> > 
> > 				regs = <0x80000000 0x00001000>;
> > 			};
> > 
> > 			pci@0,2 {
> > 				...
> > 				reg = <0x00001000 0 0 0 0>;
> > 
> > 				regs = <0x80001000 0x00001000>;
> > 			};
> > 		};
> > 
> > 	3) Repurpose the "assigned-addresses" property to achieve the
> > 	   same. This should work out-of-the-box but isn't a good fit
> > 	   because it conflicts with the OF PCI specification which
> > 	   defines this property to contain the addresses assigned to
> > 	   the base address registers.
> > 
> > Options 1 and 2 above require changes to the OF core to allow proper
> > address translation, but the changes shouldn't be very big.
Jason Gunthorpe March 8, 2013, 4:52 p.m. UTC | #33
On Fri, Mar 08, 2013 at 08:14:44AM +0100, Thierry Reding wrote:
> On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
> > On 03/07/2013 02:47 PM, Thierry Reding wrote:
> [...]
> > > In a nutshell (since some of the context isn't quoted anymore) the
> > > problem that we're trying to solve is that some of the embedded SoCs
> > > require per-root-port registers for configuration. The PCI DT
> > > specification doesn't make any provisions for this. A few alternatives
> > > have been discussed so far:
> > 
> > I'm not sure I follow. This is different than the host controller
> > registers? Why would this not just be multiple entries in the reg property?
> 
> Well the register regions are per root-port. On Tegra20 there's 2 of
> them, Tegra30 has 3 and if I understand correctly Marvell can have up to
> 10 (!). Adding all of them to the reg property of the host controller
> could work but it needs some way to match the reg entry to the root port
> similar to option 1 below.

Thomas just posted an implementation like this, please see:

http://www.spinics.net/lists/arm-kernel/msg228749.html

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 8, 2013, 7:12 p.m. UTC | #34
On Fri, Mar 08, 2013 at 09:52:28AM -0700, Jason Gunthorpe wrote:
> On Fri, Mar 08, 2013 at 08:14:44AM +0100, Thierry Reding wrote:
> > On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
> > > On 03/07/2013 02:47 PM, Thierry Reding wrote:
> > [...]
> > > > In a nutshell (since some of the context isn't quoted anymore) the
> > > > problem that we're trying to solve is that some of the embedded SoCs
> > > > require per-root-port registers for configuration. The PCI DT
> > > > specification doesn't make any provisions for this. A few alternatives
> > > > have been discussed so far:
> > > 
> > > I'm not sure I follow. This is different than the host controller
> > > registers? Why would this not just be multiple entries in the reg property?
> > 
> > Well the register regions are per root-port. On Tegra20 there's 2 of
> > them, Tegra30 has 3 and if I understand correctly Marvell can have up to
> > 10 (!). Adding all of them to the reg property of the host controller
> > could work but it needs some way to match the reg entry to the root port
> > similar to option 1 below.
> 
> Thomas just posted an implementation like this, please see:
> 
> http://www.spinics.net/lists/arm-kernel/msg228749.html

Oh well. I don't like it, but if that's the way it has to be, then so be
it. Any reason why the reg-names can't match the root port's node name
so that it can be used directly instead of going through hoops to
construct a string from extra parameters? Like below:

	pcie-controller {
		regs = <...>;
		reg-names = ..., "pcie@1,0", "pcie@2,0";

		pcie@1,0 {
			...
		};

		pcie@2,0 {
			...
		};
	};

Thierry
Mitch Bradley March 8, 2013, 7:43 p.m. UTC | #35
On 3/8/2013 9:12 AM, Thierry Reding wrote:
> On Fri, Mar 08, 2013 at 09:52:28AM -0700, Jason Gunthorpe wrote:
>> On Fri, Mar 08, 2013 at 08:14:44AM +0100, Thierry Reding wrote:
>>> On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
>>>> On 03/07/2013 02:47 PM, Thierry Reding wrote:
>>> [...]
>>>>> In a nutshell (since some of the context isn't quoted anymore) the
>>>>> problem that we're trying to solve is that some of the embedded SoCs
>>>>> require per-root-port registers for configuration. The PCI DT
>>>>> specification doesn't make any provisions for this. A few alternatives
>>>>> have been discussed so far:
>>>>
>>>> I'm not sure I follow. This is different than the host controller
>>>> registers? Why would this not just be multiple entries in the reg property?
>>>
>>> Well the register regions are per root-port. On Tegra20 there's 2 of
>>> them, Tegra30 has 3 and if I understand correctly Marvell can have up to
>>> 10 (!). Adding all of them to the reg property of the host controller
>>> could work but it needs some way to match the reg entry to the root port
>>> similar to option 1 below.
>>
>> Thomas just posted an implementation like this, please see:
>>
>> http://www.spinics.net/lists/arm-kernel/msg228749.html

The example in that posting looks messed up to me.

1) It has "reg = <0x0800 0 0 0 0>", but 0x0800 0 0  is not a valid
address in the address space defined by its parent - because the form of
the parent's ranges property indicates that it's a PCI-style address
form.  0x0800 0 0 lacks the top bits that indicate non-relocatable and
the type (I/O, memory, etc).

2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
<0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.

What is address space represented by the pcie-controller node?  reg
properties in child nodes must contain proper addresses in some
well-defined address space.  It looks like this example is trying to put
half of the PCI bus abstraction in the pcie-controller node and the rest
in the pcie sub-nodes, with some handwaving address-space splicing
between the two.  That's not kosher.  Each level of the tree needs a
coherent address space definition, otherwise the address space
"calculus" breaks.

> 
> Oh well. I don't like it, but if that's the way it has to be, then so be
> it. Any reason why the reg-names can't match the root port's node name
> so that it can be used directly instead of going through hoops to
> construct a string from extra parameters? Like below:
> 
> 	pcie-controller {
> 		regs = <...>;
> 		reg-names = ..., "pcie@1,0", "pcie@2,0";
> 
> 		pcie@1,0 {
> 			...
> 		};
> 
> 		pcie@2,0 {
> 			...
> 		};
> 	};
> 
> Thierry
> 
> 
> 
> _______________________________________________
> devicetree-discuss mailing list
> devicetree-discuss@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/devicetree-discuss
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 8, 2013, 8:02 p.m. UTC | #36
On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:

> >> http://www.spinics.net/lists/arm-kernel/msg228749.html
> 
> The example in that posting looks messed up to me.
> 
> 1) It has "reg = <0x0800 0 0 0 0>", but 0x0800 0 0  is not a valid
> address in the address space defined by its parent - because the form of
> the parent's ranges property indicates that it's a PCI-style address
> form.  0x0800 0 0 lacks the top bits that indicate non-relocatable and
> the type (I/O, memory, etc).

You need to review the OF PCI bindings to make sense of this.  The
subnodes are PCI devices. Those PCI devices show up in the
configuration space (ie lspci). They are the PCI root port bridges.

The reg value follows the OF PCI spec and has the configuration
address of the bridge. For the first port's bridge this address in
lspci format is 00:01.0 which encodes to <0x800 0 0>

The Linux OF PCI core uses the reg value in this format to match the
OF node in the DT to the PCI device node as it does PCI discovery.

> 2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
> <0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.

@0,1,0 (bus,device,fn) could be more appropriate, but that is
cosmetic?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 8, 2013, 8:13 p.m. UTC | #37
On Fri, Mar 08, 2013 at 01:02:46PM -0700, Jason Gunthorpe wrote:
> On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
[...]
> > 2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
> > <0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.
> 
> @0,1,0 (bus,device,fn) could be more appropriate, but that is
> cosmetic?

The OF PCI specification is pretty strict about this as well. It says in
section 2.2.1.3. that only the DD and DD,FF forms can appear in a device
path.

Thierry
Rob Herring March 8, 2013, 11:12 p.m. UTC | #38
On 03/08/2013 01:14 AM, Thierry Reding wrote:
> On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
>> On 03/07/2013 02:47 PM, Thierry Reding wrote:
> [...]
>>> In a nutshell (since some of the context isn't quoted anymore) the
>>> problem that we're trying to solve is that some of the embedded SoCs
>>> require per-root-port registers for configuration. The PCI DT
>>> specification doesn't make any provisions for this. A few alternatives
>>> have been discussed so far:
>>
>> I'm not sure I follow. This is different than the host controller
>> registers? Why would this not just be multiple entries in the reg property?
> 
> Well the register regions are per root-port. On Tegra20 there's 2 of
> them, Tegra30 has 3 and if I understand correctly Marvell can have up to
> 10 (!). Adding all of them to the reg property of the host controller
> could work but it needs some way to match the reg entry to the root port
> similar to option 1 below.

The compatible property of the PCI host controller can imply what each
index of the reg property entries is for.

> 
> Adding a property in the root port nodes seems like a cleaner and more
> accurate representation of the hardware to me, but if that's not
> acceptable perhaps we need to bite the bullet and add the code to look
> the registers up from the parent's reg property.

What I don't like is a new property defined to describe mmio addresses.
We already have a property for that and it is "reg". But I think I'm
still missing something:

>>> 			pci@0,1 {
>>> 				...
>>> 				reg = <0x00000800 0 0 0 0>;

Is this a PCI bus address?

>>> 				regs = <0x80000000 0x00001000>;
>>> 			};

Rob

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 8, 2013, 11:46 p.m. UTC | #39
On 3/8/2013 10:02 AM, Jason Gunthorpe wrote:
> On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
> 
>>>> http://www.spinics.net/lists/arm-kernel/msg228749.html
>>
>> The example in that posting looks messed up to me.
>>
>> 1) It has "reg = <0x0800 0 0 0 0>", but 0x0800 0 0  is not a valid
>> address in the address space defined by its parent - because the form of
>> the parent's ranges property indicates that it's a PCI-style address
>> form.  0x0800 0 0 lacks the top bits that indicate non-relocatable and
>> the type (I/O, memory, etc).
> 
> You need to review the OF PCI bindings to make sense of this.  The
> subnodes are PCI devices. Those PCI devices show up in the
> configuration space (ie lspci). They are the PCI root port bridges.

As it turns out, I wrote those bindings, almost 20 years ago.

Having dug through old versions of that patch, I think I see the source
of the confusion.

In a very early version of the patch -
  http://www.spinics.net/lists/arm-kernel/msg211839.html
the pcie-controller address space had 1 address cell, essentially
propagating the CPU address down to the subordinate nodes, and the
subordinate pcie nodes had no child address specification.  Somebody
correctly observed that PCI buses need to export 3/2 address/size cells
to their children.

But that stipulation applies only to the subordinate pcie nodes, not
necessarily to the enclosing pcie-controller node.

A later version -
   http://permalink.gmane.org/gmane.linux.kernel.pci/20358
applied 3/2 PCI addressing at both levels.  That is somewhat sensible if
you treat the top level as a "PCI root complex" and if the subordinate
port control registers are really PCI config header blocks.

But look at the ranges entries therein.  PCI config address <0x800 ..>
maps to CPU address 0xd004000 size 0x2000 and <0x1000 ..> maps to
0xd0044000 size 0x2000.  0x800 size 0x2000 encroaches into the <0x1000
..> range.  That can't be right.

That, plus the first version of the patch, makes me think that these
root-port control registers might not really be PCI config registers, in
which case the use of PCI addressing at the top level is a fiction.

Furthermore, we have the fact that pcie@0,0 corresponds to
marvell,pcie-port = <0> and marvell,pcie-lane = <0>, and similarly for
pcie@0,1 and pcie@0,2.  That correspondence still appears in the recent
patch.

So it looks like the "@0,0" addressing attempts to represent lane,port
instead of device,function PCI config space addressing.  0,0 is not the
right PCI device/function number for reg=<0x800 ...> - 0x800 is device
1, not device 0.

Another problem is the device_type values.  If the top level is to be
interpreted as a PCI addressing domain, it should have a device_type=pci
property.

One solution is for the top node to use 1-cell addresses, passing the
CPU address to the subordinates.  The subordinate nodes use the standard
PCI address representation.  The subordinate reg properties just list
the CPU address of the control registers.  Each subordinate has a ranges
properties to translate PCI to CPU addresses and a bus-range property
for PCI bus numbers (unless those are determined dynamically).

What you lose with that is the ability to refer to the nodes by
port,lane.  If that is important, then the top address domain needs an
additional cell to say whether a given address is a CPU address or
port,lane.  The top node would need a ranges to map the various forms
into CPU addresses.  The child reg properties could use the port,lane
form, and marvell,pcie-{port,lane} would disappear.

In neither case would you need the controversial "reg-names" thing.

The tradeoff:

Existing proposed patch: Violates addressing rules, requires funny new
"reg-names" mechanism, pretends to create a PCI bus level that does not
exist.

One-cell top address: Follows addressing rules, may be able to use
existing "simple-bus" address translation code, does not permit the
pretty @lane,port human-readable form.

Two-cell top address: Follows addressing rules, possibly requires new
code to translate this particular 2-address-cell format, permits
@lane,port representation.

> 
> The reg value follows the OF PCI spec and has the configuration
> address of the bridge. For the first port's bridge this address in
> lspci format is 00:01.0 which encodes to <0x800 0 0>
> 
> The Linux OF PCI core uses the reg value in this format to match the
> OF node in the DT to the PCI device node as it does PCI discovery.
> 
>> 2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
>> <0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.
> 
> @0,1,0 (bus,device,fn) could be more appropriate, but that is
> cosmetic?

Except that the @0,0 in this case seems to represent port,lane and not
device,function , as argued above.

> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 9, 2013, 1:31 a.m. UTC | #40
On Fri, Mar 08, 2013 at 01:46:13PM -1000, Mitch Bradley wrote:
> On 3/8/2013 10:02 AM, Jason Gunthorpe wrote:
> > On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
> > 
> >>>> http://www.spinics.net/lists/arm-kernel/msg228749.html
> >>
> >> The example in that posting looks messed up to me.
> >>
> >> 1) It has "reg = <0x0800 0 0 0 0>", but 0x0800 0 0  is not a valid
> >> address in the address space defined by its parent - because the form of
> >> the parent's ranges property indicates that it's a PCI-style address
> >> form.  0x0800 0 0 lacks the top bits that indicate non-relocatable and
> >> the type (I/O, memory, etc).
> > 
> > You need to review the OF PCI bindings to make sense of this.  The
> > subnodes are PCI devices. Those PCI devices show up in the
> > configuration space (ie lspci). They are the PCI root port bridges.
> 
> As it turns out, I wrote those bindings, almost 20 years ago.
> 
> Having dug through old versions of that patch, I think I see the source
> of the confusion.

Not sure, I think this has gone into a weird tangent, your conclusions
don't match how the PCI-E root complex is presented to the kernel.

Lets just go back to the start?

The pci-controller node is a root complex.

The children of that node are elements in the root complex - they are
the root port bridges. They are all on PCI bus 0.

Each root port bridge responds to configuration cycles, and has a PCI
configuration space. They show up in lspci.

The host controller device driver knows how to issue configuration
cycles, and it knows how to deliver configuration cycles on bus 0 to
the bridges.

This one:
+			pcie@0,0 {
+				device_type = "pci";
+				reg = <0x0800 0 0 0 0>;

Responds to bus 0, device 1, function 0.

Each root port bridge has an associated non-PCI MMIO address
space. The child above is associated with 0xd0040000. This MMIO
address space is needed by the host controller driver to operate the
port.

The text after the @ is a mistake, Linux doesn't enforce anything
about this text so nobody noticed till now, it should be corrected to
@1,0, similarly for the second one.

Also, from your comments the top level should gain a 'device_type =
"pci"'.

So, from that point, does the DT start to make sense? Are there other
problems?

The question at hand is how do you associate the non-PCI MMIO space
0xd0040000 with the root port bridge 'pcie@1,0' ? The patch proposes
to use a reg array in the controller plus reg-names to do this.

I have trouble making sense of your suggestions to switch away from
5dw addressing. It is critical that the DT child stanza be associated
with the PCI discovered root port bridge. By my understanding Linux
does this based on the configuration space format 'reg' value. How is
that association possible if you switch things away from 5dw
addressing?

Also, this is for firmware-less embedded. Linux is required to scan
the bus and do full address assignment of all PCI devies, including
the root port bridges. The ranges property on the top node specifies
the addressing apertures that are available for this process. This is
common practice, there are many examples of this in kernel dts for PCI
controllers.

It is impossible to specify any detail for the bridges (be it address
ranges or bus number ranges) because we have no idea what discovery
will find, or what values Linux will choose to assign.

> In neither case would you need the controversial "reg-names" thing.

reg-names has been a standard feature in the Linux kernel for a bit
already..

Thanks a lot for looking at this,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni March 9, 2013, 8:58 a.m. UTC | #41
Dear Thierry Reding,

On Fri, 8 Mar 2013 20:12:27 +0100, Thierry Reding wrote:

> Oh well. I don't like it, but if that's the way it has to be, then so be
> it.

I don't like it too much either, but it's not a big problem either. It
works, and satisfies the DT requirements reasonably.

> Any reason why the reg-names can't match the root port's node name
> so that it can be used directly instead of going through hoops to
> construct a string from extra parameters? Like below:
> 
> 	pcie-controller {
> 		regs = <...>;
> 		reg-names = ..., "pcie@1,0", "pcie@2,0";
> 
> 		pcie@1,0 {
> 			...
> 		};
> 
> 		pcie@2,0 {
> 			...
> 		};
> 	};

Yes, I thought about doing this, but in my driver, I also use the
"pcie0.0" string to request the address decoding window from the
mvebu-mbus driver. So I would anyway have to construct this "pciX.Y"
string. But if you feel like using "pcie@X,Y" for the reg-names is
better even if I still need to construct the "pcieX.Y" string, then I
will be perfectly ok with making this change.

Best regards,

Thomas
Thierry Reding March 9, 2013, 11:10 a.m. UTC | #42
On Fri, Mar 08, 2013 at 05:12:04PM -0600, Rob Herring wrote:
> On 03/08/2013 01:14 AM, Thierry Reding wrote:
> > On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
> >> On 03/07/2013 02:47 PM, Thierry Reding wrote:
> > [...]
> >>> In a nutshell (since some of the context isn't quoted anymore) the
> >>> problem that we're trying to solve is that some of the embedded SoCs
> >>> require per-root-port registers for configuration. The PCI DT
> >>> specification doesn't make any provisions for this. A few alternatives
> >>> have been discussed so far:
> >>
> >> I'm not sure I follow. This is different than the host controller
> >> registers? Why would this not just be multiple entries in the reg property?
> > 
> > Well the register regions are per root-port. On Tegra20 there's 2 of
> > them, Tegra30 has 3 and if I understand correctly Marvell can have up to
> > 10 (!). Adding all of them to the reg property of the host controller
> > could work but it needs some way to match the reg entry to the root port
> > similar to option 1 below.
> 
> The compatible property of the PCI host controller can imply what each
> index of the reg property entries is for.
> 
> > 
> > Adding a property in the root port nodes seems like a cleaner and more
> > accurate representation of the hardware to me, but if that's not
> > acceptable perhaps we need to bite the bullet and add the code to look
> > the registers up from the parent's reg property.
> 
> What I don't like is a new property defined to describe mmio addresses.
> We already have a property for that and it is "reg".

Okay, understood.

> But I think I'm still missing something:
> 
> >>> 			pci@0,1 {
> >>> 				...
> >>> 				reg = <0x00000800 0 0 0 0>;
> 
> Is this a PCI bus address?

Yes. The example is slightly wrong. pci@0,1 should actually be pci@1,0.
That means it is device 1, function 0 on the bus (implicitly 0 in this
case). That matches the values in the reg property, whose first cell is
defined as follows, quoting the PCI OF specification:

	npt000ss bbbbbbbb dddddfff rrrrrrrr

Where:

	n         is 0 if the address is relocatable, 1 otherwise
	p         is 1 if the addressable region is "prefetchable", 0
	          otherwise
	t         is 1 if the address is aliased (for non-relocatable
	          I/O), below 1 MB (for Memory), or below 64 KB (for
	          relocatable I/O).
	ss        is the space code, denoting the address space
	bbbbbbbb  is the 8-bit Bus Number
	ddddd     is the 5-bit Device Number
	fff       is the 3-bit Function Number
	rrrrrrrr  is the 8-bit Register Number

Thierry
Mitch Bradley March 10, 2013, 4:52 a.m. UTC | #43
On 3/8/2013 3:31 PM, Jason Gunthorpe wrote:
> On Fri, Mar 08, 2013 at 01:46:13PM -1000, Mitch Bradley wrote:
>> On 3/8/2013 10:02 AM, Jason Gunthorpe wrote:
>>> On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
>>>
>>>>>> http://www.spinics.net/lists/arm-kernel/msg228749.html
>>>>
>>>> The example in that posting looks messed up to me.
>>>>
>>>> 1) It has "reg = <0x0800 0 0 0 0>", but 0x0800 0 0  is not a valid
>>>> address in the address space defined by its parent - because the
form of
>>>> the parent's ranges property indicates that it's a PCI-style address
>>>> form.  0x0800 0 0 lacks the top bits that indicate non-relocatable and
>>>> the type (I/O, memory, etc).
>>>
>>> You need to review the OF PCI bindings to make sense of this.  The
>>> subnodes are PCI devices. Those PCI devices show up in the
>>> configuration space (ie lspci). They are the PCI root port bridges.
>>
>> As it turns out, I wrote those bindings, almost 20 years ago.
>>
>> Having dug through old versions of that patch, I think I see the source
>> of the confusion.
>
> Not sure, I think this has gone into a weird tangent, your conclusions
> don't match how the PCI-E root complex is presented to the kernel.
>
> Lets just go back to the start?
>
> The pci-controller node is a root complex.
>
> The children of that node are elements in the root complex - they are
> the root port bridges. They are all on PCI bus 0.
>
> Each root port bridge responds to configuration cycles, and has a PCI
> configuration space. They show up in lspci.
>
> The host controller device driver knows how to issue configuration
> cycles, and it knows how to deliver configuration cycles on bus 0 to
> the bridges.
>
> This one:
> +			pcie@0,0 {
> +				device_type = "pci";
> +				reg = <0x0800 0 0 0 0>;
>
> Responds to bus 0, device 1, function 0.
>
> Each root port bridge has an associated non-PCI MMIO address
> space. The child above is associated with 0xd0040000. This MMIO
> address space is needed by the host controller driver to operate the
> port.
>
> The text after the @ is a mistake, Linux doesn't enforce anything
> about this text so nobody noticed till now, it should be corrected to
> @1,0, similarly for the second one.
>
> Also, from your comments the top level should gain a 'device_type =
> "pci"'.
>
> So, from that point, does the DT start to make sense? Are there other
> problems?
>
> The question at hand is how do you associate the non-PCI MMIO space
> 0xd0040000 with the root port bridge 'pcie@1,0' ? The patch proposes
> to use a reg array in the controller plus reg-names to do this.
>
> I have trouble making sense of your suggestions to switch away from
> 5dw addressing. It is critical that the DT child stanza be associated
> with the PCI discovered root port bridge. By my understanding Linux
> does this based on the configuration space format 'reg' value. How is
> that association possible if you switch things away from 5dw
> addressing?
>
> Also, this is for firmware-less embedded. Linux is required to scan
> the bus and do full address assignment of all PCI devies, including
> the root port bridges. The ranges property on the top node specifies
> the addressing apertures that are available for this process. This is
> common practice, there are many examples of this in kernel dts for PCI
> controllers.

Okay, I think I finally get it.  The Marvell root port bridge setup
registers look like standard config headers, even though they aren't
really in config space (because you need a different access method for
them, compared to real config accesses to downstream devices).

So, in order for the common code that enumerates the PCI bus to work,
you need to "spoof" the config accesses so that when you try to access
those particular "pseudo config headers", it uses the MMIO method
instead of the real config access method.

If that is indeed the case, them I would vote for a slight modification
of the intermediate patch that I cited earlier - the one in which the
ranges property includes translations from those special config
addresses into CPU addresses.  The modification is to fix the sizes,
changing 0x2000 to 0x800, so those ranges entries do not overlap in the
child address domain.

With that change, plus fixing the "@port,lane" strings to be "@dev,fun",
the DT would not raise all the red flags that led me down this rathole.

The controller driver would still have to implement the magic spoofing
of those funky config headers, but at least it would be doing it based
on well-established representation principles, specifically ranges
entries and well-formed child addresses, without having to resort to the
new sideband "port and lane" address representation.

Here's what it would look like:

pcie-controller {
	compatible = "marvell,armada-370-pcie";
	status = "disabled";

	#address-cells = <3>;
	#size-cells = <2>;

	bus-range = <0x00 0xff>;

	ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root
port config header */
	          0x00001000 0 0          0xd4008000 0 0x00000800   /* Root
port config header */
		  0x82000000 0 0xe0000000 0xe0000000 0 0x08000000   /*
non-prefetchable memory */
	          0x81000000 0 0          0xe8000000 0 0x00100000>; /*
downstream I/O */

	pcie@1,0 {
		device_type = "pci";
		reg = <0x0800 0 0 0 0>;
		#address-cells = <3>;
		#size-cells = <2>;
		#interrupt-cells = <1>;
		interrupt-map-mask = <0 0 0 0>;
		interrupt-map = <0 0 0 0 &mpic 58>;
		marvell,pcie-port = <0>;
		marvell,pcie-lane = <0>;
		clocks = <&gateclk 5>;
		status = "disabled";
	};

	pcie@2,0 {
		device_type = "pci";
		reg = <0x1000 0 0 0 0>;
		#address-cells = <3>;
		#size-cells = <2>;
		#interrupt-cells = <1>;
		interrupt-map-mask = <0 0 0 0>;
		interrupt-map = <0 0 0 0 &mpic 62>;
		marvell,pcie-port = <1>;
		marvell,pcie-lane = <0>;
		clocks = <&gateclk 9>;
		status = "disabled";
	};
};


>
> It is impossible to specify any detail for the bridges (be it address
> ranges or bus number ranges) because we have no idea what discovery
> will find, or what values Linux will choose to assign.
>
>> In neither case would you need the controversial "reg-names" thing.
>
> reg-names has been a standard feature in the Linux kernel for a bit
> already..
>
> Thanks a lot for looking at this,
> Jason
>
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 10, 2013, 5:04 a.m. UTC | #44
On 3/8/2013 1:12 PM, Rob Herring wrote:
> On 03/08/2013 01:14 AM, Thierry Reding wrote:
>> On Thu, Mar 07, 2013 at 06:05:33PM -0600, Rob Herring wrote:
>>> On 03/07/2013 02:47 PM, Thierry Reding wrote:
>> [...]
>>>> In a nutshell (since some of the context isn't quoted anymore) the
>>>> problem that we're trying to solve is that some of the embedded SoCs
>>>> require per-root-port registers for configuration. The PCI DT
>>>> specification doesn't make any provisions for this. A few alternatives
>>>> have been discussed so far:
>>>
>>> I'm not sure I follow. This is different than the host controller
>>> registers? Why would this not just be multiple entries in the reg property?
>>
>> Well the register regions are per root-port. On Tegra20 there's 2 of
>> them, Tegra30 has 3 and if I understand correctly Marvell can have up to
>> 10 (!). Adding all of them to the reg property of the host controller
>> could work but it needs some way to match the reg entry to the root port
>> similar to option 1 below.
> 
> The compatible property of the PCI host controller can imply what each
> index of the reg property entries is for.
> 
>>
>> Adding a property in the root port nodes seems like a cleaner and more
>> accurate representation of the hardware to me, but if that's not
>> acceptable perhaps we need to bite the bullet and add the code to look
>> the registers up from the parent's reg property.
> 
> What I don't like is a new property defined to describe mmio addresses.
> We already have a property for that and it is "reg". But I think I'm
> still missing something:


As stated in my recent reply to Jason, I thing the correct property is
"ranges".  "Ranges" translates mappable child address space addresses
into parent addresses, and that is exactly what is going on.  A specific
subset of config addresses is mappable into parent MMIO space.

When done that way, there is no need for the multi-entry reg property at
the top level, and the correspondence between specific root port DT
nodes and their MMIO addresses is easily determined by matching child
reg properties to parent ranges entries.

> 
>>>> 			pci@0,1 {
>>>> 				...
>>>> 				reg = <0x00000800 0 0 0 0>;
> 
> Is this a PCI bus address?
> 
>>>> 				regs = <0x80000000 0x00001000>;
>>>> 			};
> 
> Rob
> 
> _______________________________________________
> devicetree-discuss mailing list
> devicetree-discuss@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/devicetree-discuss
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 10, 2013, 6:55 a.m. UTC | #45
On Sat, Mar 09, 2013 at 06:52:13PM -1000, Mitch Bradley wrote:

> Okay, I think I finally get it.  The Marvell root port bridge setup
> registers look like standard config headers, even though they aren't
> really in config space (because you need a different access method for
> them, compared to real config accesses to downstream devices).

Well, thats pretty close. On Marvell the MMIO space has a lot of
stuff, some of it is config related, lots of it isn't. In fact most of
it isn't. Tegra is different, the MMIO region is exactly the config
space of the root port bridge.

Tegra could probably happily and sanely do as you've described (well,
see below), but it is wonky for Marvell. It looks like there are more
drivers coming that have this need - so I had hoped for a general
answer to the 'per-port MMIO space' problem that is unrelated to the
MMIO spaces role in config access.

> So, in order for the common code that enumerates the PCI bus to work,
> you need to "spoof" the config accesses so that when you try to access
> those particular "pseudo config headers", it uses the MMIO method
> instead of the real config access method.

There are rules for config access in both Tegra and Marvell that are
not trivial. Again, the driver takes are of this and all Linux sees is
a nice compliant config space.

> If that is indeed the case, them I would vote for a slight modification
> of the intermediate patch that I cited earlier - the one in which the
> ranges property includes translations from those special config
> addresses into CPU addresses.  The modification is to fix the sizes,
> changing 0x2000 to 0x800, so those ranges entries do not overlap in the
> child address domain.

On Marvell the size of the MMIO space is 0x2000, there are imporant
registers there beyond index 0x800 (todays driver may not touch them
but HW has them). Reducing the size doesn't seem appropriate.

> pcie-controller {
> 	compatible = "marvell,armada-370-pcie";
> 	status = "disabled";

        device_type = "pci";                                                                    

Here as well, as you noted before?

> 	#address-cells = <3>;
> 	#size-cells = <2>;
> 
> 	bus-range = <0x00 0xff>;
> 
> 	ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root
> port config header */
> 	          0x00001000 0 0          0xd4008000 0 0x00000800   /* Root
> port config header */
> 		  0x82000000 0 0xe0000000 0xe0000000 0 0x08000000   /*
> non-prefetchable memory */
> 	          0x81000000 0 0          0xe8000000 0 0x00100000>; /*
> downstream I/O */
> 
> 	pcie@1,0 {
> 		device_type = "pci";
> 		reg = <0x0800 0 0 0 0>;

Okay, this was Thierry's original idea, and it was my note that this
didn't seem like a good choice. I'm not wedded to that, but I'll
explain my reasoning a bit.

Two things bothered me
  - Describing a CPU MMIO mapping with a config address space seems
    wonky
  - Linux's OF core doesn't parse a 'reg' property under
    'device_type = "pci"' as something CPU mappable, it only looks
    to assigned-addresses for that. So the OF core would need to learn
    how to handle this case, and it would have to be well defined..

    Thierry's original patch avoided this problem by not using
    device_type = "pci"..

Also, did you mean to have a 0 size for the 'reg'? How will things
know how big the MMIO region is? Can't just assume 0x800..

Again, thanks!
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thomas Petazzoni March 10, 2013, 3:06 p.m. UTC | #46
Dear Mitch Bradley,

On Sat, 09 Mar 2013 19:04:51 -1000, Mitch Bradley wrote:
 
> As stated in my recent reply to Jason, I thing the correct property is
> "ranges".  "Ranges" translates mappable child address space addresses
> into parent addresses, and that is exactly what is going on.  A specific
> subset of config addresses is mappable into parent MMIO space.

The PCI configuration space is *not* mapped in the MMIO space on
Marvell hardware. In the MMIO space of each PCIe interface, there are
many registers, only *two* of which are dedicated to accessing the PCI
configuration space:

 * One register to set the offset in the PCI configuration space.

 * One register to read or write a value in the PCI configuration, at
   the offset specified in the first register.

See the implementation of mvebu_pcie_hw_rd_conf() and
mvebu_pcie_hw_wr_conf() in the driver.

So really, the values specified in the reg = <...> property are *not*
the PCI configuration spaces mapped in the MMIO space. They represent a
bunch of per PCIe interface registers used to configure them, get the
status of the link... and access, through an indirect mechanism, the
PCI configuration space.

Does this helps?

Thanks,

Thomas
Thomas Petazzoni March 10, 2013, 3:09 p.m. UTC | #47
Dear Thierry Reding,

On Fri, 8 Mar 2013 21:13:40 +0100, Thierry Reding wrote:
> On Fri, Mar 08, 2013 at 01:02:46PM -0700, Jason Gunthorpe wrote:
> > On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
> [...]
> > > 2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
> > > <0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.
> > 
> > @0,1,0 (bus,device,fn) could be more appropriate, but that is
> > cosmetic?
> 
> The OF PCI specification is pretty strict about this as well. It says in
> section 2.2.1.3. that only the DD and DD,FF forms can appear in a device
> path.

Note that in the case of my driver, the @X,Y represent the port and
lane of the PCIe interface. Not sure if it is correct, I can change the
names to whatever is appropriate, those names aren't used anywhere in
the driver.

Best regards,

Thomas
Mitch Bradley March 10, 2013, 6:33 p.m. UTC | #48
On 3/10/2013 5:06 AM, Thomas Petazzoni wrote:
> Dear Mitch Bradley,
> 
> On Sat, 09 Mar 2013 19:04:51 -1000, Mitch Bradley wrote:
>  
>> As stated in my recent reply to Jason, I thing the correct property is
>> "ranges".  "Ranges" translates mappable child address space addresses
>> into parent addresses, and that is exactly what is going on.  A specific
>> subset of config addresses is mappable into parent MMIO space.
> 
> The PCI configuration space is *not* mapped in the MMIO space on
> Marvell hardware. In the MMIO space of each PCIe interface, there are
> many registers, only *two* of which are dedicated to accessing the PCI
> configuration space:
> 
>  * One register to set the offset in the PCI configuration space.
> 
>  * One register to read or write a value in the PCI configuration, at
>    the offset specified in the first register.
> 
> See the implementation of mvebu_pcie_hw_rd_conf() and
> mvebu_pcie_hw_wr_conf() in the driver.
> 
> So really, the values specified in the reg = <...> property are *not*
> the PCI configuration spaces mapped in the MMIO space. They represent a
> bunch of per PCIe interface registers used to configure them, get the
> status of the link... and access, through an indirect mechanism, the
> PCI configuration space.
> 
> Does this helps?

I agree that PCI config space accesses to *downstream* devices is via an
indirect-access register pair.

The question is, does that indirect-access mechanism apply also to the
internal config headers for the root port bridges?

According to section 20.15 of the MV78230 functional spec that I am
looking at, the configuration header registers are mapped to the
internal memory space.  That section is unclear about whether those
registers are CPU-accessible via indirect-access config transactions.
When the PCIe hardware is configured for endpoint mode, the internal
headers can be accessed via PCIe config transactions from an external
port, but in root complex mode, the possibility of indirect access from
the CPU is not mentioned.

The manual is a little vague in some respects, but it does say quite
clearly that MMIO access to the root port bridge config header is
possible, and it lists the MMIO addresses thereof (section A.10).

So we all agree that access to external (downstream) config registers is
via the indirect register pair.  The unclear thing is whether the
internal config registers for the root port bridges can be indirectly
accessed.

Do we have empirical evidence that indirect-access works to the internal
root port bridge config headers?

> 
> Thanks,
> 
> Thomas
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 11, 2013, 5:46 a.m. UTC | #49
On 3/9/2013 8:55 PM, Jason Gunthorpe wrote:
> On Sat, Mar 09, 2013 at 06:52:13PM -1000, Mitch Bradley wrote:
> 
>> Okay, I think I finally get it.  The Marvell root port bridge setup
>> registers look like standard config headers, even though they aren't
>> really in config space (because you need a different access method for
>> them, compared to real config accesses to downstream devices).
> 
> Well, thats pretty close. On Marvell the MMIO space has a lot of
> stuff, some of it is config related, lots of it isn't. In fact most of
> it isn't. Tegra is different, the MMIO region is exactly the config
> space of the root port bridge.
> 
> Tegra could probably happily and sanely do as you've described (well,
> see below), but it is wonky for Marvell. It looks like there are more
> drivers coming that have this need - so I had hoped for a general
> answer to the 'per-port MMIO space' problem that is unrelated to the
> MMIO spaces role in config access.
> 
>> So, in order for the common code that enumerates the PCI bus to work,
>> you need to "spoof" the config accesses so that when you try to access
>> those particular "pseudo config headers", it uses the MMIO method
>> instead of the real config access method.
> 
> There are rules for config access in both Tegra and Marvell that are
> not trivial. Again, the driver takes are of this and all Linux sees is
> a nice compliant config space.

Thanks for explaining that.

So it seems that we are faced with two requirements that are somewhat at
odds with one another.

1) Some of the root port bridge registers have to be accessed via config
space access functions so common PCI enumeration code will work.  To
represent this with the usual DT structure, the top root-complex needs
to define a 3/2 address space so its children
can have standard PCI reg properties.  Presumably, if those registers
are being programmed by common code, Marvell-specific code would
restrict its role to setting up config-space access functions, leaving
the actual touching of the registers to the common code.

2) Marvell chips have additional non-standard per-root-port registers
that generic PCI code would not understand.  These registers would be
touched only by Marvell-specific code.

The two kinds of registers are adjacent in MMIO space.  However, unless
I am misunderstanding this MV78230 manual, the highest "config header"
register index is 0x134, well below the 0x1000 size limit of a PCIe
config header.  Some of the extra registers are at 0x8xx, and others are
above 0x1800.

That suggests the following:

For the "config header" registers that use generic PCI code, use a
ranges entry to associate (pass through) MMIO addresses to the config
header portion of the register block.  This parameterizes the code that
sets up the special-case PCI config access.  That specification
technique is general and could be used not only for the Marvell case,
but also for any other chip that has some number of direct-mapped config
headers.

For requirement (2), the top node has a reg property listing the
portions of the address space that are consumed by the driver at the top
level instead of being passed through to the PCI addressing domain.  E.g.

  reg = <0xd0040800 0x1800>, <0xd0080800 0x1800>;

Thus we have accurately described the two aspects of the true situation.
 The PCI-compliant "config header" registers are passed through to the
child nodes where they can be dealt with in the normal PCI fashion.  The
non-PCI-compliant register footprint lives within the Marvell-specific
root-complex driver.

The root-complex driver presumably needs associate non-compliant
register blocks with specific child nodes.  That can be done by
requiring that the reg entries are in the same order as the config-space
ranges entries.

Anticipating the possible objection that ARM's 0x1000-byte page size
does not permit virtual-to-physical mapping at 0x800-byte granularity:
The device tree does not guarantee that reg entries are page-aligned; it
simply tries to describe the reality, even though it might be messy.

> 
>> If that is indeed the case, them I would vote for a slight modification
>> of the intermediate patch that I cited earlier - the one in which the
>> ranges property includes translations from those special config
>> addresses into CPU addresses.  The modification is to fix the sizes,
>> changing 0x2000 to 0x800, so those ranges entries do not overlap in the
>> child address domain.

BTW I have completely changed my mind about the overlap thing.

I said that it was bogus to use size=0x2000 for a config space header.
 That was based on an interpretation - which I now dislike - of the
meaning of 3-cell config addresses.  By that old interpretation,
size=0x800 would also be bogus, because bits 10-8 of the config address
are for the function number.

Consider the following question, which I have never previously
considered, at least not explicitly:

Q: What would be the 3-cell representation of the Command/Status
register address (offset 4) in device 1, function 1?

One obvious - but weak - answer is  <0x00000904 0 0> .  It's obvious
because the value 0x80000904 is what you put in the cf8 indirect access
address register.  But it is weak because it unnecessarily restricts the
config header size, and because it works differently than offsets
applied to memory and I/O space addresses.  Furthermore, coupling the
offset to the cf8 register is dodgy because of the funniness where you
have to but the byte-selector bits 1..0 not in the address register cf8,
but rather in the data register cfc.

The better answer is  <0x00000900 0 4>, using the third cell for the
offset, as with IO and memory space offsets.

Under this better interpretation, config space sizes larger than 0x100
are no problem.  The first cell is just a selector.  The size is "added"
to the third cell, so it does not encroach into the first cell's
device,function bits.

This better interpretation easily handles PCIe's 0x1000-byte extended
config headers, and trivially accomodates even larger sizes should a
future PCI variant further increase the header size. as described way
down below.

> 
> On Marvell the size of the MMIO space is 0x2000, there are imporant
> registers there beyond index 0x800 (todays driver may not touch them
> but HW has them). Reducing the size doesn't seem appropriate.

Per the above, the current idea is not to reduce the size, but rather to
split the 0x2000 into two pieces, accurately reflecting the fact that
part of it is PCI-compliant and can be handled by generic PCI drivers,
while the rest must be handled differently, outside of standard PCI
land.  Commingling the two seems dubious.

> 
>> pcie-controller {
>> 	compatible = "marvell,armada-370-pcie";
>> 	status = "disabled";
> 
>         device_type = "pci";                                                                    
> 
> Here as well, as you noted before?
> 
>> 	#address-cells = <3>;
>> 	#size-cells = <2>;
>>
>> 	bus-range = <0x00 0xff>;
>>
>> 	ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root
>> port config header */
>> 	          0x00001000 0 0          0xd4008000 0 0x00000800   /* Root
>> port config header */
>> 		  0x82000000 0 0xe0000000 0xe0000000 0 0x08000000   /*
>> non-prefetchable memory */
>> 	          0x81000000 0 0          0xe8000000 0 0x00100000>; /*
>> downstream I/O */
>>
>> 	pcie@1,0 {
>> 		device_type = "pci";
>> 		reg = <0x0800 0 0 0 0>;
> 
> Okay, this was Thierry's original idea, and it was my note that this
> didn't seem like a good choice. I'm not wedded to that, but I'll
> explain my reasoning a bit.
> 
> Two things bothered me
>   - Describing a CPU MMIO mapping with a config address space seems
>     wonky

I agree that mapping config space is sort of a jarring concept, but I
think that's because PCs have polluted the mindspace, not because there
is anything inherently bad about it.  The original PCI spec
fundamentally treated config space as just another (rather segmented)
linear address space.  It specified the cf8/cfc indirect access
mechanism not as a deep semantic feature, but rather as an
implementation hack to work around addressing limitations of PC chipsets
and the addressing-deficient popular OS of the time (Windows 3.1 which
was a veneer over DOS).  The PCI spec defined configuration mechanism #2
as a direct map into PC I/O space.  RISC chipsets of that era (e.g.
Alpha, Power PC) often direct-mapped config space into some form of
load/store space.

But PCs soon encroached on the RISC market and RISC system vendors
started to "leverage" PC I/O chipsets.  The indirect access "config
mechanism #1" became part of the mindset.  Indirect access was hardcoded
in enough software that chipset designers would provide indirect access
even if direct-mapping was possible.

Fundamentally, a config header is just a block of registers, addressable
as an index relative to some base.  It might require some specific
access path - but that is true with any block of registers.  Even
direct-mapped MMIO registers often require specific semantics such as
non-cached, non-write-buffered, or special address space ids.

The PCI SIG expected PCIe extended config space to be direct-mapped -
see slide 25 of

http://www.pcisig.com/specifications/pciexpress/technical_library/dev_con_09_02/specifications/pciexpress/PCIExpress_SoftwareandConfigurationModel.pdf

So mappability of config space was intended from the get-go.  The real
wonky thing is the "indirect access zombie that cannot die".

>   - Linux's OF core doesn't parse a 'reg' property under
>     'device_type = "pci"' as something CPU mappable, it only looks
>     to assigned-addresses for that. So the OF core would need to learn
>     how to handle this case, and it would have to be well defined..

The way I envisioned it working is that the root-complex node defines
its own config space access functions (the "ops" argument to
pci_scan_root_bus).  For config addresses listed in ranges, the
functions do MMIO.  For others, they use the chipset's indirect-access
registers.  The OF core is not involved.

> 
>     Thierry's original patch avoided this problem by not using
>     device_type = "pci"..
> 
> Also, did you mean to have a 0 size for the 'reg'? How will things
> know how big the MMIO region is? Can't just assume 0x800..

Sorry, that was a typo.

> 
> Again, thanks!
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 11, 2013, 7:46 a.m. UTC | #50
On Sun, Mar 10, 2013 at 07:46:04PM -1000, Mitch Bradley wrote:
> On 3/9/2013 8:55 PM, Jason Gunthorpe wrote:
> > On Sat, Mar 09, 2013 at 06:52:13PM -1000, Mitch Bradley wrote:
> > 
> >> Okay, I think I finally get it.  The Marvell root port bridge setup
> >> registers look like standard config headers, even though they aren't
> >> really in config space (because you need a different access method for
> >> them, compared to real config accesses to downstream devices).
> > 
> > Well, thats pretty close. On Marvell the MMIO space has a lot of
> > stuff, some of it is config related, lots of it isn't. In fact most of
> > it isn't. Tegra is different, the MMIO region is exactly the config
> > space of the root port bridge.
> > 
> > Tegra could probably happily and sanely do as you've described (well,
> > see below), but it is wonky for Marvell. It looks like there are more
> > drivers coming that have this need - so I had hoped for a general
> > answer to the 'per-port MMIO space' problem that is unrelated to the
> > MMIO spaces role in config access.
> > 
> >> So, in order for the common code that enumerates the PCI bus to work,
> >> you need to "spoof" the config accesses so that when you try to access
> >> those particular "pseudo config headers", it uses the MMIO method
> >> instead of the real config access method.
> > 
> > There are rules for config access in both Tegra and Marvell that are
> > not trivial. Again, the driver takes are of this and all Linux sees is
> > a nice compliant config space.
> 
> Thanks for explaining that.
> 
> So it seems that we are faced with two requirements that are somewhat at
> odds with one another.
> 
> 1) Some of the root port bridge registers have to be accessed via config
> space access functions so common PCI enumeration code will work.  To
> represent this with the usual DT structure, the top root-complex needs
> to define a 3/2 address space so its children
> can have standard PCI reg properties.  Presumably, if those registers
> are being programmed by common code, Marvell-specific code would
> restrict its role to setting up config-space access functions, leaving
> the actual touching of the registers to the common code.
> 
> 2) Marvell chips have additional non-standard per-root-port registers
> that generic PCI code would not understand.  These registers would be
> touched only by Marvell-specific code.
> 
> The two kinds of registers are adjacent in MMIO space.  However, unless
> I am misunderstanding this MV78230 manual, the highest "config header"
> register index is 0x134, well below the 0x1000 size limit of a PCIe
> config header.  Some of the extra registers are at 0x8xx, and others are
> above 0x1800.
> 
> That suggests the following:
> 
> For the "config header" registers that use generic PCI code, use a
> ranges entry to associate (pass through) MMIO addresses to the config
> header portion of the register block.  This parameterizes the code that
> sets up the special-case PCI config access.  That specification
> technique is general and could be used not only for the Marvell case,
> but also for any other chip that has some number of direct-mapped config
> headers.
> 
> For requirement (2), the top node has a reg property listing the
> portions of the address space that are consumed by the driver at the top
> level instead of being passed through to the PCI addressing domain.  E.g.
> 
>   reg = <0xd0040800 0x1800>, <0xd0080800 0x1800>;
> 
> Thus we have accurately described the two aspects of the true situation.
>  The PCI-compliant "config header" registers are passed through to the
> child nodes where they can be dealt with in the normal PCI fashion.  The
> non-PCI-compliant register footprint lives within the Marvell-specific
> root-complex driver.
> 
> The root-complex driver presumably needs associate non-compliant
> register blocks with specific child nodes.  That can be done by
> requiring that the reg entries are in the same order as the config-space
> ranges entries.
> 
> Anticipating the possible objection that ARM's 0x1000-byte page size
> does not permit virtual-to-physical mapping at 0x800-byte granularity:
> The device tree does not guarantee that reg entries are page-aligned; it
> simply tries to describe the reality, even though it might be messy.
> 
> > 
> >> If that is indeed the case, them I would vote for a slight modification
> >> of the intermediate patch that I cited earlier - the one in which the
> >> ranges property includes translations from those special config
> >> addresses into CPU addresses.  The modification is to fix the sizes,
> >> changing 0x2000 to 0x800, so those ranges entries do not overlap in the
> >> child address domain.
> 
> BTW I have completely changed my mind about the overlap thing.
> 
> I said that it was bogus to use size=0x2000 for a config space header.
>  That was based on an interpretation - which I now dislike - of the
> meaning of 3-cell config addresses.  By that old interpretation,
> size=0x800 would also be bogus, because bits 10-8 of the config address
> are for the function number.
> 
> Consider the following question, which I have never previously
> considered, at least not explicitly:
> 
> Q: What would be the 3-cell representation of the Command/Status
> register address (offset 4) in device 1, function 1?
> 
> One obvious - but weak - answer is  <0x00000904 0 0> .  It's obvious
> because the value 0x80000904 is what you put in the cf8 indirect access
> address register.  But it is weak because it unnecessarily restricts the
> config header size, and because it works differently than offsets
> applied to memory and I/O space addresses.  Furthermore, coupling the
> offset to the cf8 register is dodgy because of the funniness where you
> have to but the byte-selector bits 1..0 not in the address register cf8,
> but rather in the data register cfc.
> 
> The better answer is  <0x00000900 0 4>, using the third cell for the
> offset, as with IO and memory space offsets.
> 
> Under this better interpretation, config space sizes larger than 0x100
> are no problem.  The first cell is just a selector.  The size is "added"
> to the third cell, so it does not encroach into the first cell's
> device,function bits.
> 
> This better interpretation easily handles PCIe's 0x1000-byte extended
> config headers, and trivially accomodates even larger sizes should a
> future PCI variant further increase the header size. as described way
> down below.

That makes a lot of sense. It also mirrors parts of a discussion we
previously had when we first discussed a DT binding for Tegra PCIe.

Thanks for taking the time to go over this in so much detail. I very
much appreciate it.

> >> 	#address-cells = <3>;
> >> 	#size-cells = <2>;
> >>
> >> 	bus-range = <0x00 0xff>;
> >>
> >> 	ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root
> >> port config header */
> >> 	          0x00001000 0 0          0xd4008000 0 0x00000800   /* Root
> >> port config header */
> >> 		  0x82000000 0 0xe0000000 0xe0000000 0 0x08000000   /*
> >> non-prefetchable memory */
> >> 	          0x81000000 0 0          0xe8000000 0 0x00100000>; /*
> >> downstream I/O */
> >>
> >> 	pcie@1,0 {
> >> 		device_type = "pci";
> >> 		reg = <0x0800 0 0 0 0>;
> > 
> > Okay, this was Thierry's original idea, and it was my note that this
> > didn't seem like a good choice. I'm not wedded to that, but I'll
> > explain my reasoning a bit.
> > 
> > Two things bothered me
> >   - Describing a CPU MMIO mapping with a config address space seems
> >     wonky
> 
> I agree that mapping config space is sort of a jarring concept, but I
> think that's because PCs have polluted the mindspace, not because there
> is anything inherently bad about it.  The original PCI spec
> fundamentally treated config space as just another (rather segmented)
> linear address space.  It specified the cf8/cfc indirect access
> mechanism not as a deep semantic feature, but rather as an
> implementation hack to work around addressing limitations of PC chipsets
> and the addressing-deficient popular OS of the time (Windows 3.1 which
> was a veneer over DOS).  The PCI spec defined configuration mechanism #2
> as a direct map into PC I/O space.  RISC chipsets of that era (e.g.
> Alpha, Power PC) often direct-mapped config space into some form of
> load/store space.
> 
> But PCs soon encroached on the RISC market and RISC system vendors
> started to "leverage" PC I/O chipsets.  The indirect access "config
> mechanism #1" became part of the mindset.  Indirect access was hardcoded
> in enough software that chipset designers would provide indirect access
> even if direct-mapping was possible.
> 
> Fundamentally, a config header is just a block of registers, addressable
> as an index relative to some base.  It might require some specific
> access path - but that is true with any block of registers.  Even
> direct-mapped MMIO registers often require specific semantics such as
> non-cached, non-write-buffered, or special address space ids.
> 
> The PCI SIG expected PCIe extended config space to be direct-mapped -
> see slide 25 of
> 
> http://www.pcisig.com/specifications/pciexpress/technical_library/dev_con_09_02/specifications/pciexpress/PCIExpress_SoftwareandConfigurationModel.pdf
> 
> So mappability of config space was intended from the get-go.  The real
> wonky thing is the "indirect access zombie that cannot die".
> 
> >   - Linux's OF core doesn't parse a 'reg' property under
> >     'device_type = "pci"' as something CPU mappable, it only looks
> >     to assigned-addresses for that. So the OF core would need to learn
> >     how to handle this case, and it would have to be well defined..
> 
> The way I envisioned it working is that the root-complex node defines
> its own config space access functions (the "ops" argument to
> pci_scan_root_bus).  For config addresses listed in ranges, the
> functions do MMIO.  For others, they use the chipset's indirect-access
> registers.  The OF core is not involved.
> 
> > 
> >     Thierry's original patch avoided this problem by not using
> >     device_type = "pci"..

I think the OF core still needs to be involved in order to translate the
reg = <0x0800 0 0 0 0>; entry into a resource that can be ioremap()'ed.
As Jason already said, the OF core only does that for assigned-addresses
when device_type = "pci".

Otherwise the pci_ops implementation still couldn't access the MMIO
registers. Aiming for a driver-specific solution doesn't seem like a
good idea but if the functionality is common enough to be required by
two or more drivers perhaps a new helper could be created for exactly
this purpose.

Perhaps another alternative would be to extend the OF core to translate
the entries in the reg property as well. Or maybe I misunderstood what
you said.

Thierry
Thierry Reding March 11, 2013, 8:08 a.m. UTC | #51
On Sun, Mar 10, 2013 at 04:09:37PM +0100, Thomas Petazzoni wrote:
> Dear Thierry Reding,
> 
> On Fri, 8 Mar 2013 21:13:40 +0100, Thierry Reding wrote:
> > On Fri, Mar 08, 2013 at 01:02:46PM -0700, Jason Gunthorpe wrote:
> > > On Fri, Mar 08, 2013 at 09:43:11AM -1000, Mitch Bradley wrote:
> > [...]
> > > > 2) The "@0,0" and "@1,0" suffixes do not correspond to the reg values
> > > > <0x0800 0 0 0 0> and <0x1000 0 0 0 0> using any rule that I know.
> > > 
> > > @0,1,0 (bus,device,fn) could be more appropriate, but that is
> > > cosmetic?
> > 
> > The OF PCI specification is pretty strict about this as well. It says in
> > section 2.2.1.3. that only the DD and DD,FF forms can appear in a device
> > path.
> 
> Note that in the case of my driver, the @X,Y represent the port and
> lane of the PCIe interface. Not sure if it is correct, I can change the
> names to whatever is appropriate, those names aren't used anywhere in
> the driver.

I think they need to be changed to be @dev,fn in order to conform to the
OF PCI binding.

Thierry
Mitch Bradley March 11, 2013, 6:04 p.m. UTC | #52
On 3/10/2013 9:46 PM, Thierry Reding wrote:
> On Sun, Mar 10, 2013 at 07:46:04PM -1000, Mitch Bradley wrote:
>> On 3/9/2013 8:55 PM, Jason Gunthorpe wrote:
>>> On Sat, Mar 09, 2013 at 06:52:13PM -1000, Mitch Bradley wrote:
>>>
>>>> Okay, I think I finally get it.  The Marvell root port bridge setup
>>>> registers look like standard config headers, even though they aren't
>>>> really in config space (because you need a different access method for
>>>> them, compared to real config accesses to downstream devices).
>>>
>>> Well, thats pretty close. On Marvell the MMIO space has a lot of
>>> stuff, some of it is config related, lots of it isn't. In fact most of
>>> it isn't. Tegra is different, the MMIO region is exactly the config
>>> space of the root port bridge.
>>>
>>> Tegra could probably happily and sanely do as you've described (well,
>>> see below), but it is wonky for Marvell. It looks like there are more
>>> drivers coming that have this need - so I had hoped for a general
>>> answer to the 'per-port MMIO space' problem that is unrelated to the
>>> MMIO spaces role in config access.
>>>
>>>> So, in order for the common code that enumerates the PCI bus to work,
>>>> you need to "spoof" the config accesses so that when you try to access
>>>> those particular "pseudo config headers", it uses the MMIO method
>>>> instead of the real config access method.
>>>
>>> There are rules for config access in both Tegra and Marvell that are
>>> not trivial. Again, the driver takes are of this and all Linux sees is
>>> a nice compliant config space.
>>
>> Thanks for explaining that.
>>
>> So it seems that we are faced with two requirements that are somewhat at
>> odds with one another.
>>
>> 1) Some of the root port bridge registers have to be accessed via config
>> space access functions so common PCI enumeration code will work.  To
>> represent this with the usual DT structure, the top root-complex needs
>> to define a 3/2 address space so its children
>> can have standard PCI reg properties.  Presumably, if those registers
>> are being programmed by common code, Marvell-specific code would
>> restrict its role to setting up config-space access functions, leaving
>> the actual touching of the registers to the common code.
>>
>> 2) Marvell chips have additional non-standard per-root-port registers
>> that generic PCI code would not understand.  These registers would be
>> touched only by Marvell-specific code.
>>
>> The two kinds of registers are adjacent in MMIO space.  However, unless
>> I am misunderstanding this MV78230 manual, the highest "config header"
>> register index is 0x134, well below the 0x1000 size limit of a PCIe
>> config header.  Some of the extra registers are at 0x8xx, and others are
>> above 0x1800.
>>
>> That suggests the following:
>>
>> For the "config header" registers that use generic PCI code, use a
>> ranges entry to associate (pass through) MMIO addresses to the config
>> header portion of the register block.  This parameterizes the code that
>> sets up the special-case PCI config access.  That specification
>> technique is general and could be used not only for the Marvell case,
>> but also for any other chip that has some number of direct-mapped config
>> headers.
>>
>> For requirement (2), the top node has a reg property listing the
>> portions of the address space that are consumed by the driver at the top
>> level instead of being passed through to the PCI addressing domain.  E.g.
>>
>>   reg = <0xd0040800 0x1800>, <0xd0080800 0x1800>;
>>
>> Thus we have accurately described the two aspects of the true situation.
>>  The PCI-compliant "config header" registers are passed through to the
>> child nodes where they can be dealt with in the normal PCI fashion.  The
>> non-PCI-compliant register footprint lives within the Marvell-specific
>> root-complex driver.
>>
>> The root-complex driver presumably needs associate non-compliant
>> register blocks with specific child nodes.  That can be done by
>> requiring that the reg entries are in the same order as the config-space
>> ranges entries.
>>
>> Anticipating the possible objection that ARM's 0x1000-byte page size
>> does not permit virtual-to-physical mapping at 0x800-byte granularity:
>> The device tree does not guarantee that reg entries are page-aligned; it
>> simply tries to describe the reality, even though it might be messy.
>>
>>>
>>>> If that is indeed the case, them I would vote for a slight modification
>>>> of the intermediate patch that I cited earlier - the one in which the
>>>> ranges property includes translations from those special config
>>>> addresses into CPU addresses.  The modification is to fix the sizes,
>>>> changing 0x2000 to 0x800, so those ranges entries do not overlap in the
>>>> child address domain.
>>
>> BTW I have completely changed my mind about the overlap thing.
>>
>> I said that it was bogus to use size=0x2000 for a config space header.
>>  That was based on an interpretation - which I now dislike - of the
>> meaning of 3-cell config addresses.  By that old interpretation,
>> size=0x800 would also be bogus, because bits 10-8 of the config address
>> are for the function number.
>>
>> Consider the following question, which I have never previously
>> considered, at least not explicitly:
>>
>> Q: What would be the 3-cell representation of the Command/Status
>> register address (offset 4) in device 1, function 1?
>>
>> One obvious - but weak - answer is  <0x00000904 0 0> .  It's obvious
>> because the value 0x80000904 is what you put in the cf8 indirect access
>> address register.  But it is weak because it unnecessarily restricts the
>> config header size, and because it works differently than offsets
>> applied to memory and I/O space addresses.  Furthermore, coupling the
>> offset to the cf8 register is dodgy because of the funniness where you
>> have to but the byte-selector bits 1..0 not in the address register cf8,
>> but rather in the data register cfc.
>>
>> The better answer is  <0x00000900 0 4>, using the third cell for the
>> offset, as with IO and memory space offsets.
>>
>> Under this better interpretation, config space sizes larger than 0x100
>> are no problem.  The first cell is just a selector.  The size is "added"
>> to the third cell, so it does not encroach into the first cell's
>> device,function bits.
>>
>> This better interpretation easily handles PCIe's 0x1000-byte extended
>> config headers, and trivially accomodates even larger sizes should a
>> future PCI variant further increase the header size. as described way
>> down below.
> 
> That makes a lot of sense. It also mirrors parts of a discussion we
> previously had when we first discussed a DT binding for Tegra PCIe.
> 
> Thanks for taking the time to go over this in so much detail. I very
> much appreciate it.
> 
>>>> 	#address-cells = <3>;
>>>> 	#size-cells = <2>;
>>>>
>>>> 	bus-range = <0x00 0xff>;
>>>>
>>>> 	ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root
>>>> port config header */
>>>> 	          0x00001000 0 0          0xd4008000 0 0x00000800   /* Root
>>>> port config header */
>>>> 		  0x82000000 0 0xe0000000 0xe0000000 0 0x08000000   /*
>>>> non-prefetchable memory */
>>>> 	          0x81000000 0 0          0xe8000000 0 0x00100000>; /*
>>>> downstream I/O */
>>>>
>>>> 	pcie@1,0 {
>>>> 		device_type = "pci";
>>>> 		reg = <0x0800 0 0 0 0>;
>>>
>>> Okay, this was Thierry's original idea, and it was my note that this
>>> didn't seem like a good choice. I'm not wedded to that, but I'll
>>> explain my reasoning a bit.
>>>
>>> Two things bothered me
>>>   - Describing a CPU MMIO mapping with a config address space seems
>>>     wonky
>>
>> I agree that mapping config space is sort of a jarring concept, but I
>> think that's because PCs have polluted the mindspace, not because there
>> is anything inherently bad about it.  The original PCI spec
>> fundamentally treated config space as just another (rather segmented)
>> linear address space.  It specified the cf8/cfc indirect access
>> mechanism not as a deep semantic feature, but rather as an
>> implementation hack to work around addressing limitations of PC chipsets
>> and the addressing-deficient popular OS of the time (Windows 3.1 which
>> was a veneer over DOS).  The PCI spec defined configuration mechanism #2
>> as a direct map into PC I/O space.  RISC chipsets of that era (e.g.
>> Alpha, Power PC) often direct-mapped config space into some form of
>> load/store space.
>>
>> But PCs soon encroached on the RISC market and RISC system vendors
>> started to "leverage" PC I/O chipsets.  The indirect access "config
>> mechanism #1" became part of the mindset.  Indirect access was hardcoded
>> in enough software that chipset designers would provide indirect access
>> even if direct-mapping was possible.
>>
>> Fundamentally, a config header is just a block of registers, addressable
>> as an index relative to some base.  It might require some specific
>> access path - but that is true with any block of registers.  Even
>> direct-mapped MMIO registers often require specific semantics such as
>> non-cached, non-write-buffered, or special address space ids.
>>
>> The PCI SIG expected PCIe extended config space to be direct-mapped -
>> see slide 25 of
>>
>> http://www.pcisig.com/specifications/pciexpress/technical_library/dev_con_09_02/specifications/pciexpress/PCIExpress_SoftwareandConfigurationModel.pdf
>>
>> So mappability of config space was intended from the get-go.  The real
>> wonky thing is the "indirect access zombie that cannot die".
>>
>>>   - Linux's OF core doesn't parse a 'reg' property under
>>>     'device_type = "pci"' as something CPU mappable, it only looks
>>>     to assigned-addresses for that. So the OF core would need to learn
>>>     how to handle this case, and it would have to be well defined..
>>
>> The way I envisioned it working is that the root-complex node defines
>> its own config space access functions (the "ops" argument to
>> pci_scan_root_bus).  For config addresses listed in ranges, the
>> functions do MMIO.  For others, they use the chipset's indirect-access
>> registers.  The OF core is not involved.
>>
>>>
>>>     Thierry's original patch avoided this problem by not using
>>>     device_type = "pci"..
> 
> I think the OF core still needs to be involved in order to translate the
> reg = <0x0800 0 0 0 0>; entry into a resource that can be ioremap()'ed.
> As Jason already said, the OF core only does that for assigned-addresses
> when device_type = "pci".
> 
> Otherwise the pci_ops implementation still couldn't access the MMIO
> registers. 

I'm not sure I understand your point that <0x0800 ...> needs to be
translated into a resource.

(a) Existing code that needs to match a child nodes to a discovered PCI
hardware device uses the existing of_pci_find_child_device(parent,
devfn) from of_pci.c .  That routine "just works", because the reg =
<0x0800 ..> entry is standard.

(b) The discovery/enumeration code needs to access config space via
pci_ops.  The root complex driver implements pci_ops based on a trivial
parsing of ranges (omitting irrelevant details):

pci_op_read(devfn, pos) {
   loop_over_ranges_entries {
      if (to_devfn(ranges_entry.child) == devfn) {
         return mmio_read(ranges_entry.parent + pos);
   }
   return indirect_read(devfn, pos);
}

The actual implementation might setup different data structures, but the
pseudocode captures the essential points:  The input argument is devfn,
and a simple scan of ranges controls whether to use MMIO or indirect
access for a given devfn value.  Every aspect of the scanning loop is
trivial.  The loop itself is just a stride-5 walk over an array of integers.

"ranges_entry.parent" is be32_to_cpu(ranges_entry[3])

and

"to_devfn(ranges_entry.child)" is "be32_to_cpup(ranges_entry[0]) >> 8"

(you don't want to mask the result in this case because not masking
prevents non-config-space ranges entries from matching devfn.)

> Aiming for a driver-specific solution doesn't seem like a
> good idea but if the functionality is common enough to be required by
> two or more drivers perhaps a new helper could be created for exactly
> this purpose.

Indeed - but it is often best to wait and see rather than guessing about
what sort of common helper is needed.  I usually guess wrong in such cases.

> 
> Perhaps another alternative would be to extend the OF core to translate
> the entries in the reg property as well. Or maybe I misunderstood what
> you said.
> 
> Thierry
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 11, 2013, 6:15 p.m. UTC | #53
On Sun, Mar 10, 2013 at 07:46:04PM -1000, Mitch Bradley wrote:

> So it seems that we are faced with two requirements that are somewhat at
> odds with one another.
> 
> 1) Some of the root port bridge registers have to be accessed via config
> space access functions so common PCI enumeration code will work.  To
> represent this with the usual DT structure, the top root-complex needs
> to define a 3/2 address space so its children
> can have standard PCI reg properties.  Presumably, if those registers
> are being programmed by common code, Marvell-specific code would
> restrict its role to setting up config-space access functions, leaving
> the actual touching of the registers to the common code.
> 
> 2) Marvell chips have additional non-standard per-root-port registers
> that generic PCI code would not understand.  These registers would be
> touched only by Marvell-specific code.
> 
> The two kinds of registers are adjacent in MMIO space.  However, unless
> I am misunderstanding this MV78230 manual, the highest "config header"
> register index is 0x134, well below the 0x1000 size limit of a PCIe
> config header.  Some of the extra registers are at 0x8xx, and others are
> above 0x1800.

The register block you are looking at is for the cores 'target end
port mode', and the MMIO version is for internal use only, to allow
the CPU to configure RO bits, and other tasks. The target core will
allow access to that space via config cycles, either internally
(through the indirection register) or externally (from the off-chip
root complex) generated.

However - the driver runs the core in a 'root port bridge mode' where
the config header register block you are looking at is inhibited. The
Marvell IP block requires software support to run in bridge mode. So
Marvell really has only (2), while Tegra has only (1).

> For requirement (2), the top node has a reg property listing the
> portions of the address space that are consumed by the driver at the top
> level instead of being passed through to the PCI addressing domain.  E.g.
> 
>   reg = <0xd0040800 0x1800>, <0xd0080800 0x1800>;

Okay, so this is agreeing with what Thomas already has:

http://www.spinics.net/lists/arm-kernel/msg228749.html

With your two notes:
 - Correct the pcie@x,y to match the OF spec
 - Add a 'device_type = "pci"' to the pcie-controller node

Is that right?
 
> I said that it was bogus to use size=0x2000 for a config space header.
>  That was based on an interpretation - which I now dislike - of the
> meaning of 3-cell config addresses.  By that old interpretation,
> size=0x800 would also be bogus, because bits 10-8 of the config address
> are for the function number.

Hum, I thought the spec was pretty clear on this point:

 00 denotes Configuration Space, in which case:
  [..]
     bbbbbbbb,ddddd,fff,rrrrrrrrr is the Configuration Space address
     hh...hh,ll...ll must be zero

And also the text at 2.1.4.4..

So you get an 8 bit register offset, placed in the highest DW..

Can you read things another way?

Is there an updated spec that supports PCI-E extended config space?

> Consider the following question, which I have never previously
> considered, at least not explicitly:
>
> Q: What would be the 3-cell representation of the Command/Status
> register address (offset 4) in device 1, function 1?

Well, see section 11.1.2, where it provides an example
for the 'Expansion ROM base address register (0x30)' as being:

   02xxxx30 00000000 00000000 

Also the f-code bindings say:

 The data type 'config-addr' refers to the 'phys.hi' cell of the
 numerical representation of a Configuration Space address.

So there is an strong convention to use the 'r' bits as the
offset..

Further, review section 12 about how ranges should be treated - it
specifically says that the b,d,f bits in ranges should be 0, and the
child address should have those bits masked prior to searching the
ranges.

Section 12 would seem to forbid this:

      ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root port config header */
                0x00001000 0 0          0xd4008000 0 0x00000800   /* Root port config header */

Are you reading that differently?

> > Two things bothered me
> >   - Describing a CPU MMIO mapping with a config address space seems
> >     wonky
> 
> I agree that mapping config space is sort of a jarring concept, but I
> think that's because PCs have polluted the mindspace, not because
> there

I'm well aware of MMIO config space acces, that isn't so jarring. What
is so jarring is the idea that the OF translation would work like:

<0x00000900 0 0> -> 0xd4004000
<0x00000901 0 0> -> 0xd4004001

Which is completely unlike any ranges translation I've ever seen.

Basically, I look at how the register offset is encoded in the higest
dword plus the statement in section 12, and and conclude the there
wasn't an intention to model a memory map'd config space through OF.

What am I missing here?

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 11, 2013, 6:23 p.m. UTC | #54
> (b) The discovery/enumeration code needs to access config space via
> pci_ops.  The root complex driver implements pci_ops based on a trivial
> parsing of ranges (omitting irrelevant details):
> 
> pci_op_read(devfn, pos) {
>    loop_over_ranges_entries {
>       if (to_devfn(ranges_entry.child) == devfn) {
>          return mmio_read(ranges_entry.parent + pos);
                            ^^^^^^^^^^^^^^^^^^^

This has to be converted through all enclosing node's ranges prior to
being used as a CPU address - the OF core has all the code to do this,
but a new entry point would be needed for this specific application...

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 11, 2013, 7:49 p.m. UTC | #55
On 3/11/2013 8:23 AM, Jason Gunthorpe wrote:
>> (b) The discovery/enumeration code needs to access config space via
>> pci_ops.  The root complex driver implements pci_ops based on a trivial
>> parsing of ranges (omitting irrelevant details):
>>
>> pci_op_read(devfn, pos) {
>>    loop_over_ranges_entries {
>>       if (to_devfn(ranges_entry.child) == devfn) {
>>          return mmio_read(ranges_entry.parent + pos);
>                             ^^^^^^^^^^^^^^^^^^^
> 
> This has to be converted through all enclosing node's ranges prior to
> being used as a CPU address - the OF core has all the code to do this,
> but a new entry point would be needed for this specific application...

Agreed.  But I was considering that detail to be outside the scope of
this discussion because the same thing must happen for any physical
address at this level of the tree.

> 
> Cheers,
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 11, 2013, 9:50 p.m. UTC | #56
On 3/11/2013 8:15 AM, Jason Gunthorpe wrote:
> On Sun, Mar 10, 2013 at 07:46:04PM -1000, Mitch Bradley wrote:
> 
>> So it seems that we are faced with two requirements that are somewhat at
>> odds with one another.
>>
>> 1) Some of the root port bridge registers have to be accessed via config
>> space access functions so common PCI enumeration code will work.  To
>> represent this with the usual DT structure, the top root-complex needs
>> to define a 3/2 address space so its children
>> can have standard PCI reg properties.  Presumably, if those registers
>> are being programmed by common code, Marvell-specific code would
>> restrict its role to setting up config-space access functions, leaving
>> the actual touching of the registers to the common code.
>>
>> 2) Marvell chips have additional non-standard per-root-port registers
>> that generic PCI code would not understand.  These registers would be
>> touched only by Marvell-specific code.
>>
>> The two kinds of registers are adjacent in MMIO space.  However, unless
>> I am misunderstanding this MV78230 manual, the highest "config header"
>> register index is 0x134, well below the 0x1000 size limit of a PCIe
>> config header.  Some of the extra registers are at 0x8xx, and others are
>> above 0x1800.
> 
> The register block you are looking at is for the cores 'target end
> port mode', and the MMIO version is for internal use only, to allow
> the CPU to configure RO bits, and other tasks. The target core will
> allow access to that space via config cycles, either internally
> (through the indirection register) or externally (from the off-chip
> root complex) generated.
> 
> However - the driver runs the core in a 'root port bridge mode' where
> the config header register block you are looking at is inhibited. The
> Marvell IP block requires software support to run in bridge mode. So
> Marvell really has only (2), while Tegra has only (1).

Interesting.

For Marvell, if Marvell has only (2), does that imply that standard PCI
discovery and enumeration code cannot find the root port bridges, and
also there is no way to auto-configure the bridges with common pci-pci
bridge code?

> 
>> For requirement (2), the top node has a reg property listing the
>> portions of the address space that are consumed by the driver at the top
>> level instead of being passed through to the PCI addressing domain.  E.g.
>>
>>   reg = <0xd0040800 0x1800>, <0xd0080800 0x1800>;
> 
> Okay, so this is agreeing with what Thomas already has:
> 
> http://www.spinics.net/lists/arm-kernel/msg228749.html
> 
> With your two notes:
>  - Correct the pcie@x,y to match the OF spec
>  - Add a 'device_type = "pci"' to the pcie-controller node
> 
> Is that right?

At this point I am confused again.  There was talk of the need to use
standard PCI enumeration code to deal with the bridges, thus the need
for 3/2 to describe the interface between root-complex and the child
root-port nodes.  But now I think I'm hearing that these particular root
ports bear no resemblance to standard pcie bridges.  So I am back to
"not sure why we are using 3/2 PCI addresses across an interface
boundary that has no relation to PCI addressing".  Sorry if I'm being
obtuse.

For Tegra, I think it all makes sense, and I still like the ranges
representation to control the config-space spoofing.  But for Marvell,
it seems like there is neither the need for nor the possibility of
spoofing, nor for PCI-style addressing at the complex/port boundary.

>  
>> I said that it was bogus to use size=0x2000 for a config space header.
>>  That was based on an interpretation - which I now dislike - of the
>> meaning of 3-cell config addresses.  By that old interpretation,
>> size=0x800 would also be bogus, because bits 10-8 of the config address
>> are for the function number.
> 
> Hum, I thought the spec was pretty clear on this point:
> 
>  00 denotes Configuration Space, in which case:
>   [..]
>      bbbbbbbb,ddddd,fff,rrrrrrrrr is the Configuration Space address
>      hh...hh,ll...ll must be zero
> 
> And also the text at 2.1.4.4..
> 
> So you get an 8 bit register offset, placed in the highest DW..
> 
> Can you read things another way?

The reading is clear, but there is a crucial point that is missing.

I don't recall ever having seen anything but 0 in rrrrrrrr.  PCI reg
properties always list the base address of config header.  There is
never a need for a reg entry to point into the middle of a config
header.  The granularity is always at the dev,function level.  So the
fact that the spec puts the offset in those bits is moot, because the
offset is never used.  (I will address your counterargument below.)

I didn't anticipate that fact when I developed the representation, nor
did I anticipate that the rather-rigidly structured config space would
be extended years later.

With my "new interpretation", you still never actually set any of the
"offset bits".  The only thing you get from it is permission to use
larger size values without creating algebraic nonsense.  You never
actually do arithmetic on that representation.

> 
> Is there an updated spec that supports PCI-E extended config space?

http://www.openfirmware.org/ofwg/bindings/pci/pci-express.txt

It puts the extended config address bits in 27:24 of the high word - but
that said, I doubt that those bits have ever been set to nonzero.  Again
because you never need to point to an individual register using this
representation.  I will call David Kahn, who lives down the street from
me, later today and see what he thinks.


> 
>> Consider the following question, which I have never previously
>> considered, at least not explicitly:
>>
>> Q: What would be the 3-cell representation of the Command/Status
>> register address (offset 4) in device 1, function 1?
> 
> Well, see section 11.1.2, where it provides an example
> for the 'Expansion ROM base address register (0x30)' as being:
> 
>    02xxxx30 00000000 00000000 
> 
> Also the f-code bindings say:
> 
>  The data type 'config-addr' refers to the 'phys.hi' cell of the
>  numerical representation of a Configuration Space address.
> 
> So there is an strong convention to use the 'r' bits as the
> offset..

You're right, the spec does indeed say that.  But in practice, it never
actually happened that way.  Expansion ROMs, in practice, had to be
copied into memory because the tended to disappear under certain
settings of various enable bits.  And even that copying had to be done
very carefully, because expansion ROM access had an annoying tendency to
hang some bus interfaces.

So, yeah, the spec implies the convention, but I don't think it was ever
used.

> 
> Further, review section 12 about how ranges should be treated - it
> specifically says that the b,d,f bits in ranges should be 0, and the
> child address should have those bits masked prior to searching the
> ranges.
> 
> Section 12 would seem to forbid this:
> 
>       ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root port config header */
>                 0x00001000 0 0          0xd4008000 0 0x00000800   /* Root port config header */
> 
> Are you reading that differently?

I wasn't reading it at all, until you pointed it out to me :-)  I was
just reasoning based on my mental model.

Again, I agree with your reading of the spec, but I can't think of any
reason why relaxing that restriction to permit config space mappings
would be a problem.

> 
>>> Two things bothered me
>>>   - Describing a CPU MMIO mapping with a config address space seems
>>>     wonky
>>
>> I agree that mapping config space is sort of a jarring concept, but I
>> think that's because PCs have polluted the mindspace, not because
>> there
> 
> I'm well aware of MMIO config space acces, that isn't so jarring. What
> is so jarring is the idea that the OF translation would work like:
> 
> <0x00000900 0 0> -> 0xd4004000
> <0x00000901 0 0> -> 0xd4004001
> 
> Which is completely unlike any ranges translation I've ever seen.

You never actually say <0x901 0 0> in the DT.  What you say "there is a
config header based at <0x900 0 0>".  The offset is maintained
separately, not added directly to the DT representation of the config
header base address.

The most useful interpretation of a ranges entry is that the tuples
(child_base,offset) and (parent_base,offset) correspond for 0 <= offset
< size.  In some cases, you might be able to use arithmetic to merge
base and offset; in other cases you can't.

> 
> Basically, I look at how the register offset is encoded in the higest
> dword plus the statement in section 12, and and conclude the there
> wasn't an intention to model a memory map'd config space through OF.
> 
> What am I missing here?

The real point of section 12 is that you have to attend to the fact that
PCI a structured address space, not just a single linear space.

As you say, the text does not convey the intention of mapping config
space, but I can say categorically that I certainly did not intent "thou
shalt not map config space".


> 
> Regards,
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 11, 2013, 11:25 p.m. UTC | #57
On Mon, Mar 11, 2013 at 11:50:19AM -1000, Mitch Bradley wrote:

> > However - the driver runs the core in a 'root port bridge mode' where
> > the config header register block you are looking at is inhibited. The
> > Marvell IP block requires software support to run in bridge mode. So
> > Marvell really has only (2), while Tegra has only (1).
> 
> Interesting.
> 
> For Marvell, if Marvell has only (2), does that imply that standard PCI
> discovery and enumeration code cannot find the root port bridges, and
> also there is no way to auto-configure the bridges with common pci-pci
> bridge code?

The driver is required to take care of all of this. From the common
code's perspective there is a bridge config header, and writing to it
controls the device, as the PCI spec intends. But the driver doesn't
copy those read/writes 1:1 to any HW register block, it distributes
them around the device's register space as necessary to create the
PCI bridge config header.

The DT is still modeling the HW, there really are root port bridge(s),
that are completely conformant at the TLP level, but the IP designers
choose to leave the detail of the bridge config space up to a SW
implementation.

> At this point I am confused again.  There was talk of the need to
> use standard PCI enumeration code to deal with the bridges, thus the
> need for 3/2 to describe the interface between root-complex and the
> child root-port nodes.

I think your confusion is coming from your expectation that the PCI-E
IP block would be an entirely self contained root port, rather than
'tools to build a root port, with the right driver SW'.

> > Further, review section 12 about how ranges should be treated - it
> > specifically says that the b,d,f bits in ranges should be 0, and the
> > child address should have those bits masked prior to searching the
> > ranges.
> > 
> > Section 12 would seem to forbid this:
> > 
> >       ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root port config header */
> >                 0x00001000 0 0          0xd4008000 0 0x00000800   /* Root port config header */
> > 
> > Are you reading that differently?
> 
> I wasn't reading it at all, until you pointed it out to me :-)  I was
> just reasoning based on my mental model.
> 
> Again, I agree with your reading of the spec, but I can't think of any
> reason why relaxing that restriction to permit config space mappings
> would be a problem.

Okay, just wanted to confirm that, I think I follow you now. :)

So you are basically proposing a small update to Section 12 that would
read something like:

'When matching a config space address, the b,d,f bits are valid in the
 ranges and left unmasked in the child. The r bits in the ranges
 should be 0, and the r bits from the child are copied in the last DW
 and masked off in the first DW.'

Which, as you note, would not conflict with any existing usage..

Though, I can already see that the above language is not good enough,
it doesn't elegantly represent standard ECAM, for instance.. How would
you even represent ECAM with ranges??

Another wrinkle is that tegra not only has the per-bridge memory
mapped config space, bit it also has 4 'all port' ECAM-like memory
mapped config spaces. The rules around them are sufficiently
specialized that I'm not sure a ranges mapping would ever be
practical.. So right out of the gate we'd fail to capture memory
mapped config access via ranges on at least one SOC.

So, to me, this seems like overkill. Only ECAM has a hope of having a
common implementation, everything else will be host driver specific,
so why go to the trouble to revise the OF PCI specification for config
space?

Particularly since there are patches waiting on this question being
resolved :)

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 11, 2013, 11:38 p.m. UTC | #58
On 3/11/2013 1:25 PM, Jason Gunthorpe wrote:
> On Mon, Mar 11, 2013 at 11:50:19AM -1000, Mitch Bradley wrote:
> 
>>> However - the driver runs the core in a 'root port bridge mode' where
>>> the config header register block you are looking at is inhibited. The
>>> Marvell IP block requires software support to run in bridge mode. So
>>> Marvell really has only (2), while Tegra has only (1).
>>
>> Interesting.
>>
>> For Marvell, if Marvell has only (2), does that imply that standard PCI
>> discovery and enumeration code cannot find the root port bridges, and
>> also there is no way to auto-configure the bridges with common pci-pci
>> bridge code?
> 
> The driver is required to take care of all of this. From the common
> code's perspective there is a bridge config header, and writing to it
> controls the device, as the PCI spec intends. But the driver doesn't
> copy those read/writes 1:1 to any HW register block, it distributes
> them around the device's register space as necessary to create the
> PCI bridge config header.
> 
> The DT is still modeling the HW, there really are root port bridge(s),
> that are completely conformant at the TLP level, but the IP designers
> choose to leave the detail of the bridge config space up to a SW
> implementation.

Oh, ouch, deep spoofing.  I had to do that in system management mode
firmware for a Geode chip.  What a pain that was.

> 
>> At this point I am confused again.  There was talk of the need to
>> use standard PCI enumeration code to deal with the bridges, thus the
>> need for 3/2 to describe the interface between root-complex and the
>> child root-port nodes.
> 
> I think your confusion is coming from your expectation that the PCI-E
> IP block would be an entirely self contained root port, rather than
> 'tools to build a root port, with the right driver SW'.
> 
>>> Further, review section 12 about how ranges should be treated - it
>>> specifically says that the b,d,f bits in ranges should be 0, and the
>>> child address should have those bits masked prior to searching the
>>> ranges.
>>>
>>> Section 12 would seem to forbid this:
>>>
>>>       ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root port config header */
>>>                 0x00001000 0 0          0xd4008000 0 0x00000800   /* Root port config header */
>>>
>>> Are you reading that differently?
>>
>> I wasn't reading it at all, until you pointed it out to me :-)  I was
>> just reasoning based on my mental model.
>>
>> Again, I agree with your reading of the spec, but I can't think of any
>> reason why relaxing that restriction to permit config space mappings
>> would be a problem.
> 
> Okay, just wanted to confirm that, I think I follow you now. :)
> 
> So you are basically proposing a small update to Section 12 that would
> read something like:
> 
> 'When matching a config space address, the b,d,f bits are valid in the
>  ranges and left unmasked in the child. The r bits in the ranges
>  should be 0, and the r bits from the child are copied in the last DW
>  and masked off in the first DW.'
> 
> Which, as you note, would not conflict with any existing usage..
> 
> Though, I can already see that the above language is not good enough,
> it doesn't elegantly represent standard ECAM, for instance.. How would
> you even represent ECAM with ranges??
> 
> Another wrinkle is that tegra not only has the per-bridge memory
> mapped config space, bit it also has 4 'all port' ECAM-like memory
> mapped config spaces. The rules around them are sufficiently
> specialized that I'm not sure a ranges mapping would ever be
> practical.. So right out of the gate we'd fail to capture memory
> mapped config access via ranges on at least one SOC.
> 
> So, to me, this seems like overkill. Only ECAM has a hope of having a
> common implementation, everything else will be host driver specific,
> so why go to the trouble to revise the OF PCI specification for config
> space?
> 
> Particularly since there are patches waiting on this question being
> resolved :)

Okay, so I'll just bow out.  Do what you have to do.

> 
> Regards,
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 12, 2013, 7:08 a.m. UTC | #59
On Mon, Mar 11, 2013 at 01:38:38PM -1000, Mitch Bradley wrote:
> On 3/11/2013 1:25 PM, Jason Gunthorpe wrote:
> > On Mon, Mar 11, 2013 at 11:50:19AM -1000, Mitch Bradley wrote:
> > 
> >>> However - the driver runs the core in a 'root port bridge mode' where
> >>> the config header register block you are looking at is inhibited. The
> >>> Marvell IP block requires software support to run in bridge mode. So
> >>> Marvell really has only (2), while Tegra has only (1).
> >>
> >> Interesting.
> >>
> >> For Marvell, if Marvell has only (2), does that imply that standard PCI
> >> discovery and enumeration code cannot find the root port bridges, and
> >> also there is no way to auto-configure the bridges with common pci-pci
> >> bridge code?
> > 
> > The driver is required to take care of all of this. From the common
> > code's perspective there is a bridge config header, and writing to it
> > controls the device, as the PCI spec intends. But the driver doesn't
> > copy those read/writes 1:1 to any HW register block, it distributes
> > them around the device's register space as necessary to create the
> > PCI bridge config header.
> > 
> > The DT is still modeling the HW, there really are root port bridge(s),
> > that are completely conformant at the TLP level, but the IP designers
> > choose to leave the detail of the bridge config space up to a SW
> > implementation.
> 
> Oh, ouch, deep spoofing.  I had to do that in system management mode
> firmware for a Geode chip.  What a pain that was.
> 
> > 
> >> At this point I am confused again.  There was talk of the need to
> >> use standard PCI enumeration code to deal with the bridges, thus the
> >> need for 3/2 to describe the interface between root-complex and the
> >> child root-port nodes.
> > 
> > I think your confusion is coming from your expectation that the PCI-E
> > IP block would be an entirely self contained root port, rather than
> > 'tools to build a root port, with the right driver SW'.
> > 
> >>> Further, review section 12 about how ranges should be treated - it
> >>> specifically says that the b,d,f bits in ranges should be 0, and the
> >>> child address should have those bits masked prior to searching the
> >>> ranges.
> >>>
> >>> Section 12 would seem to forbid this:
> >>>
> >>>       ranges = <0x00000800 0 0          0xd4004000 0 0x00000800   /* Root port config header */
> >>>                 0x00001000 0 0          0xd4008000 0 0x00000800   /* Root port config header */
> >>>
> >>> Are you reading that differently?
> >>
> >> I wasn't reading it at all, until you pointed it out to me :-)  I was
> >> just reasoning based on my mental model.
> >>
> >> Again, I agree with your reading of the spec, but I can't think of any
> >> reason why relaxing that restriction to permit config space mappings
> >> would be a problem.
> > 
> > Okay, just wanted to confirm that, I think I follow you now. :)
> > 
> > So you are basically proposing a small update to Section 12 that would
> > read something like:
> > 
> > 'When matching a config space address, the b,d,f bits are valid in the
> >  ranges and left unmasked in the child. The r bits in the ranges
> >  should be 0, and the r bits from the child are copied in the last DW
> >  and masked off in the first DW.'
> > 
> > Which, as you note, would not conflict with any existing usage..
> > 
> > Though, I can already see that the above language is not good enough,
> > it doesn't elegantly represent standard ECAM, for instance.. How would
> > you even represent ECAM with ranges??
> > 
> > Another wrinkle is that tegra not only has the per-bridge memory
> > mapped config space, bit it also has 4 'all port' ECAM-like memory
> > mapped config spaces. The rules around them are sufficiently
> > specialized that I'm not sure a ranges mapping would ever be
> > practical.. So right out of the gate we'd fail to capture memory
> > mapped config access via ranges on at least one SOC.
> > 
> > So, to me, this seems like overkill. Only ECAM has a hope of having a
> > common implementation, everything else will be host driver specific,
> > so why go to the trouble to revise the OF PCI specification for config
> > space?
> > 
> > Particularly since there are patches waiting on this question being
> > resolved :)
> 
> Okay, so I'll just bow out.  Do what you have to do.

So to recapitulate, we agree that configuration space can be translated
through the "ranges" property. That means the only missing link is a new
function to translate not only "assigned-addresses" but also the "reg"
property for PCI devices. Is that it?

And for Marvell the non-configuration-space per-root-port registers need
to be mapped separately from the configuration space windows?

Thierry
Jason Gunthorpe March 12, 2013, 3:57 p.m. UTC | #60
On Tue, Mar 12, 2013 at 08:08:52AM +0100, Thierry Reding wrote:

> So to recapitulate, we agree that configuration space can be translated
> through the "ranges" property. That means the only missing link is a new
> function to translate not only "assigned-addresses" but also the "reg"
> property for PCI devices. Is that it?

No, the first conclusion is that placing a config space in ranges is
against the language in the current spec (see section 12).

The second conclusion is that there is probably a way forward to
update the spec in a backwards compatible way to model ECAM, but would
require more analysis.

Finally, there was no objections to the approach in Thomas's patch,
other than the note to fix the '@1,0' and the extra device_type="pci"
..

> And for Marvell the non-configuration-space per-root-port registers need
> to be mapped separately from the configuration space windows?

No, there are no configuration space windows on Marvell, so only one mapping.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 12, 2013, 8:38 p.m. UTC | #61
On Tue, Mar 12, 2013 at 09:57:49AM -0600, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2013 at 08:08:52AM +0100, Thierry Reding wrote:
> 
> > So to recapitulate, we agree that configuration space can be translated
> > through the "ranges" property. That means the only missing link is a new
> > function to translate not only "assigned-addresses" but also the "reg"
> > property for PCI devices. Is that it?
> 
> No, the first conclusion is that placing a config space in ranges is
> against the language in the current spec (see section 12).
> 
> The second conclusion is that there is probably a way forward to
> update the spec in a backwards compatible way to model ECAM, but would
> require more analysis.
> 
> Finally, there was no objections to the approach in Thomas's patch,
> other than the note to fix the '@1,0' and the extra device_type="pci"

That's not what I deduced from Mitch's comments. He said explicitly that
he liked the idea to represent mapped configuration space using the
ranges property. Mitch, correct me if I misinterpret what you said.

I've also verified that things actually do work with the binding that I
proposed for Tegra, even if I add device_type = "pci" to the root port
nodes. The reason is that the OF core looks up the type of the *parent*
node to find a matching bus. This happens to be the pcie-controller node
which is not a PCI device and therefore the address translation works
properly.

So let me quote the latest binding that I'm using:

	pcie-controller {
		compatible = "nvidia,tegra20-pcie";
		reg = <0x80003000 0x00000800   /* PADS registers */
		       0x80003800 0x00000200   /* AFI registers */
		       0x90000000 0x10000000>; /* configuration space */
		reg-names = "pads", "afi", "cs";
		interrupts = <0 98 0x04   /* controller interrupt */
		              0 99 0x04>; /* MSI interrupt */
		interrupt-names = "intr", "msi";

		bus-range = <0x00 0xff>;
		#address-cells = <3>;
		#size-cells = <2>;

		ranges = <0x00000800 0 0 0x80000000 0 0x00001000   /* port 0 configuration space */
			  0x00001000 0 0 0x80001000 0 0x00001000   /* port 1 configuration space */
			  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
			  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
			  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */

		clocks = <&tegra_car 70>, <&tegra_car 72>, <&tegra_car 74>,
			 <&tegra_car 118>;
		clock-names = "pex", "afi", "pcie_xclk", "pll_e";
		status = "disabled";

		pci@1,0 {
			device_type = "pci";
			reg = <0x000800 0 0 0 0x1000>;
			status = "disabled";

			#address-cells = <3>;
			#size-cells = <2>;

			nvidia,num-lanes = <2>;
		};

		pci@2,0 {
			device_type = "pci";
			reg = <0x001000 0 0 0 0x1000>;
			status = "disabled";

			#address-cells = <3>;
			#size-cells = <2>;

			nvidia,num-lanes = <2>;
		};
	};

I can't find anything wrong in the above. As far as I can tell, there's
nothing in it that's in conflict with the specification.

Thierry
Jason Gunthorpe March 12, 2013, 9:03 p.m. UTC | #62
On Tue, Mar 12, 2013 at 09:38:19PM +0100, Thierry Reding wrote:

> I've also verified that things actually do work with the binding that I
> proposed for Tegra, even if I add device_type = "pci" to the root port
> nodes. The reason is that the OF core looks up the type of the *parent*
> node to find a matching bus. This happens to be the pcie-controller node
> which is not a PCI device and therefore the address translation works
> properly.

Mitch pointed out early on that the parent node needs a 'device_type =
"pci"' as well.

The pcie-controller node would be classified as a 'bus node' according
to the spec.

Not going down the of_pci_* code paths for address translation at the
root port bridge nodes is certainly not right.

> ranges = <0x00000800 0 0 0x80000000 0 0x00001000   /* port 0 configuration space */
>           0x00001000 0 0 0x80001000 0 0x00001000   /* port 1 configuration space */

Section 12 says that the 'b,d,f' bits are 0 in ranges, so the above is
in conflict with that language:

 In particular, the phys.hi fields of the child address spaces in the
 "ranges" property for PCI does not contain the same information as
 "reg" property entries within PCI nodes. The only information that is
 present in "ranges" phys.hi entries are the non-relocatable,
 prefetchable and the PCI address space bits for which the entry
 applies. I.e., only the n, p and ss bits are present; the bbbbbbbb,
 ddddd, fff and rrrrrrrr fields are 0.

And the further requirement:

 When an address is to be mapped through a PCI bus bridge node, the
 phys.hi value of the address to be mapped and the child field of a
 "ranges" entry should be masked so that only the ss bits are compared.

Which is embodied by the code in of_bus_pci_map, meaning the OF core
won't even look at the bits in the high DW when searching ranges.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 12, 2013, 9:30 p.m. UTC | #63
On Tue, Mar 12, 2013 at 03:03:28PM -0600, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2013 at 09:38:19PM +0100, Thierry Reding wrote:
> 
> > I've also verified that things actually do work with the binding that I
> > proposed for Tegra, even if I add device_type = "pci" to the root port
> > nodes. The reason is that the OF core looks up the type of the *parent*
> > node to find a matching bus. This happens to be the pcie-controller node
> > which is not a PCI device and therefore the address translation works
> > properly.
> 
> Mitch pointed out early on that the parent node needs a 'device_type =
> "pci"' as well.
> 
> The pcie-controller node would be classified as a 'bus node' according
> to the spec.
> 
> Not going down the of_pci_* code paths for address translation at the
> root port bridge nodes is certainly not right.

I'm not so sure. Why should the pcie-controller be a PCI device? It
really isn't. Instead it is a device on the processor/SoC/platform bus
that bridges to the PCI bus. As such its children should be PCI devices
to reflect that the translation happens within the pcie-controller node.

In the above it is logical to not set the pcie-controller's device_type
property to "pci".

> > ranges = <0x00000800 0 0 0x80000000 0 0x00001000   /* port 0 configuration space */
> >           0x00001000 0 0 0x80001000 0 0x00001000   /* port 1 configuration space */
> 
> Section 12 says that the 'b,d,f' bits are 0 in ranges, so the above is
> in conflict with that language:
> 
>  In particular, the phys.hi fields of the child address spaces in the
>  "ranges" property for PCI does not contain the same information as
>  "reg" property entries within PCI nodes. The only information that is
>  present in "ranges" phys.hi entries are the non-relocatable,
>  prefetchable and the PCI address space bits for which the entry
>  applies. I.e., only the n, p and ss bits are present; the bbbbbbbb,
>  ddddd, fff and rrrrrrrr fields are 0.
> 
> And the further requirement:
> 
>  When an address is to be mapped through a PCI bus bridge node, the
>  phys.hi value of the address to be mapped and the child field of a
>  "ranges" entry should be masked so that only the ss bits are compared.
> 
> Which is embodied by the code in of_bus_pci_map, meaning the OF core
> won't even look at the bits in the high DW when searching ranges.

Okay, so the problem with that is that we won't be able to include more
entries for configuration space within a ranges property because we
can't uniquely map them (they both have the same ss field values). One
possibility would be to differentiate using the phys.mid and phys.low
cells but they need to be 0 according to the spec.

The other alternative would be to amend the specification. Besides the
fact that the specification says so I don't see any reason why this
shouldn't be allowed.

Thierry
Jason Gunthorpe March 12, 2013, 10:08 p.m. UTC | #64
On Tue, Mar 12, 2013 at 10:30:06PM +0100, Thierry Reding wrote:

> > Not going down the of_pci_* code paths for address translation at the
> > root port bridge nodes is certainly not right.
> 
> I'm not so sure. Why should the pcie-controller be a PCI device?

The spec is clear on that point as well:

 device_type
  A Standard Package conforming to this specification and corresponding
  to a device that implements a PCI bus shall implement this property
  with the string value "pci"

The children of a pcie-controller node are PCI devices, thus the
pcie-controller node 'implements a PCI bus'. Or, as you say 'device on
the processor/SoC/platform bus that bridges to the PCI bus'

Think of device_type as also meaning bus_type and it makes more
logical sense, the name is terrible, but its usage is governed by the
OF docs..

> The other alternative would be to amend the specification. Besides the
> fact that the specification says so I don't see any reason why this
> shouldn't be allowed.

'the specification says so' *IS* the reason. DT isn't a free for all
where you get to do whatever you want, or whatever 'feels' right. It
is supposed to be a stable, OS agnostic ABI. That means bindings have
to follow the specs (when available).

Maybe a future revision will support PCI-E ECAM, but we don't know
what that will look like, and I'm pretty sure you don't want to hold
up your patches until an IEEE committee gets around to deciding
something ;)

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 12, 2013, 11:25 p.m. UTC | #65
On 3/12/2013 12:08 PM, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2013 at 10:30:06PM +0100, Thierry Reding wrote:
> 
>>> Not going down the of_pci_* code paths for address translation at the
>>> root port bridge nodes is certainly not right.
>>
>> I'm not so sure. Why should the pcie-controller be a PCI device?
> 
> The spec is clear on that point as well:
> 
>  device_type
>   A Standard Package conforming to this specification and corresponding
>   to a device that implements a PCI bus shall implement this property
>   with the string value "pci"
> 
> The children of a pcie-controller node are PCI devices, thus the
> pcie-controller node 'implements a PCI bus'. Or, as you say 'device on
> the processor/SoC/platform bus that bridges to the PCI bus'
> 
> Think of device_type as also meaning bus_type and it makes more
> logical sense, the name is terrible, but its usage is governed by the
> OF docs..

I beg to differ about the name.  device_type means "what kind of
functionality is implemented by this device", not "what interface
connects this device to the rest of the system".   The use of
"device_type" for that meaning is perfectly reasonable in the English
language.

The DT answers the "what interface connects me to the system" question
by looking at the parent.

> 
>> The other alternative would be to amend the specification. Besides the
>> fact that the specification says so I don't see any reason why this
>> shouldn't be allowed.
> 
> 'the specification says so' *IS* the reason. DT isn't a free for all
> where you get to do whatever you want, or whatever 'feels' right. It
> is supposed to be a stable, OS agnostic ABI. That means bindings have
> to follow the specs (when available).
> 
> Maybe a future revision will support PCI-E ECAM, but we don't know
> what that will look like, and I'm pretty sure you don't want to hold
> up your patches until an IEEE committee gets around to deciding
> something ;)

The Open Firmware Working Group quit doing things under the IEEE banner
a long long time ago, precisely for the reason that the IEEE process was
too slow, cumbersome, and expensive.  The group thereafter published
"Recommended Practices" documents under its own non-authority instead of
going through the IEEE ratification process.  The group's only
"authority" arose from the fact that the members had expertise and
worked for companies that needed to use the results.

The Working Group is now inactive.  I am in contact with many of the
former members, but by and large they are no longer interested in
firmware standardization.  Many of them work for companies that now hold
their cards closely, that seemingly being the only way to make enough
money to stay in business.

In other words, there is no committee.

New versions of the document in question, or of any document for that
matter, can be issued by the simple expedient of interested parties
coming to agreement and writing down that agreement.  In practice, that
is what happens in Linux land for nearly every issue.

At this time, the primary "interest group" for device tree issues
revolves around the Linux kernel - the devicetree-discuss list is where
all the action is.  I suppose that Oracle still cares about Open
Firmware on its SPARC systems, but it is likely that Oracle is going to
do whatever they want no matter what Linux people say or do.  In
principle, the DT is supposed to be OS-agnostic, but in practice, which
other OS both uses the DT and has enough users to matter?  So it would
be entirely reasonable for a group of Linux kernel people to publish
revised documents, as has already been done in the form of
Documentation/ files setting down usage rules.


> 
> Jason
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 13, 2013, 8:18 a.m. UTC | #66
On Tue, Mar 12, 2013 at 04:08:54PM -0600, Jason Gunthorpe wrote:
> On Tue, Mar 12, 2013 at 10:30:06PM +0100, Thierry Reding wrote:
> 
> > > Not going down the of_pci_* code paths for address translation at the
> > > root port bridge nodes is certainly not right.
> > 
> > I'm not so sure. Why should the pcie-controller be a PCI device?
> 
> The spec is clear on that point as well:
> 
>  device_type
>   A Standard Package conforming to this specification and corresponding
>   to a device that implements a PCI bus shall implement this property
>   with the string value "pci"
> 
> The children of a pcie-controller node are PCI devices, thus the
> pcie-controller node 'implements a PCI bus'. Or, as you say 'device on
> the processor/SoC/platform bus that bridges to the PCI bus'
> 
> Think of device_type as also meaning bus_type and it makes more
> logical sense, the name is terrible, but its usage is governed by the
> OF docs..
> 
> > The other alternative would be to amend the specification. Besides the
> > fact that the specification says so I don't see any reason why this
> > shouldn't be allowed.
> 
> 'the specification says so' *IS* the reason. DT isn't a free for all
> where you get to do whatever you want, or whatever 'feels' right. It
> is supposed to be a stable, OS agnostic ABI. That means bindings have
> to follow the specs (when available).
> 
> Maybe a future revision will support PCI-E ECAM, but we don't know
> what that will look like, and I'm pretty sure you don't want to hold
> up your patches until an IEEE committee gets around to deciding
> something ;)

Mitch already answered this. The specification is now almost 15 years
old and it couldn't possibly have foreseen all of the future use-cases.
If the specification is too restrictive and Mitch gives his blessing to
remove some of the restrictions, I don't see any reason why we shouldn't
do so if it lets us represent the reality of hardware more accurately in
DT.

Furthermore we're not discussing radical changes. None of the changes
will be backwards-incompatible, but they will allow recent hardware to
be represented more correctly or conveniently.

Thierry
Jason Gunthorpe March 13, 2013, 5:02 p.m. UTC | #67
On Wed, Mar 13, 2013 at 09:18:15AM +0100, Thierry Reding wrote:

> Mitch already answered this. The specification is now almost 15 years
> old and it couldn't possibly have foreseen all of the future use-cases.
> If the specification is too restrictive and Mitch gives his blessing to
> remove some of the restrictions, I don't see any reason why we shouldn't
> do so if it lets us represent the reality of hardware more accurately in
> DT.

I understand the spec is old, and I have no problem with making a
Linux specific revision, but do *that* - don't bury some random
deviation inside the bindings for a driver. I even suggested some
language, but I feel more thought is needed to consider how to model
the standardized ECAM mechanism..

> Furthermore we're not discussing radical changes. None of the changes
> will be backwards-incompatible, but they will allow recent hardware to
> be represented more correctly or conveniently.

Sure, but it is still inconsistent, one MMIO config mechansim is in
ranges, one is in regs. Plus I don't think tegra's case is a great
starting point to design a spec update, it's config access mechanism
is especially strange...

Anyhow, I think this has been hashed to death, as long as your final
binding has the 'device_type = pci' on the pcie-controller node I
think it will be fine.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 13, 2013, 7:26 p.m. UTC | #68
On Wed, Mar 13, 2013 at 11:02:05AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 13, 2013 at 09:18:15AM +0100, Thierry Reding wrote:
> 
> > Mitch already answered this. The specification is now almost 15 years
> > old and it couldn't possibly have foreseen all of the future use-cases.
> > If the specification is too restrictive and Mitch gives his blessing to
> > remove some of the restrictions, I don't see any reason why we shouldn't
> > do so if it lets us represent the reality of hardware more accurately in
> > DT.
> 
> I understand the spec is old, and I have no problem with making a
> Linux specific revision, but do *that* - don't bury some random
> deviation inside the bindings for a driver. I even suggested some
> language, but I feel more thought is needed to consider how to model
> the standardized ECAM mechanism..

As Mitch already said there's very little chance that the specification
update will be ratified through IEEE, so I think that we might just as
well put a corresponding text somewhere below Documentation/devicetree.

Also note that this has absolutely nothing to do with ECAM. All I'm
proposing is to allow the reg property of a root port to be translated
into a CPU addressable memory region through the ranges property. This
turns out to actually work properly with the current OF core in Linux.

> > Furthermore we're not discussing radical changes. None of the changes
> > will be backwards-incompatible, but they will allow recent hardware to
> > be represented more correctly or conveniently.
> 
> Sure, but it is still inconsistent, one MMIO config mechansim is in
> ranges, one is in regs. Plus I don't think tegra's case is a great
> starting point to design a spec update, it's config access mechanism
> is especially strange...

Again, it is still the most accurate way to describe the hardware. I
know it's not terribly beautiful and doesn't match with what you'd
expect from PC hardware. However it is still a reality and something the
kernel will have to deal with. Marvell hardware isn't very pretty either
but that shouldn't exclude it from being supported by Linux.

> Anyhow, I think this has been hashed to death, as long as your final
> binding has the 'device_type = pci' on the pcie-controller node I
> think it will be fine.

No. The pcie-controller is *not* a PCI device. It is a PCI host bridge.
It is the instance that translates from the platform to the PCI bus. Why
should it be device_type = "pci"? And we've also been over this before,
making that change stops the proposed binding from working properly
because the entry in the reg property can't be translated into a CPU
addressable memory region.

Thierry
Jason Gunthorpe March 13, 2013, 7:59 p.m. UTC | #69
On Wed, Mar 13, 2013 at 08:26:28PM +0100, Thierry Reding wrote:

> As Mitch already said there's very little chance that the specification
> update will be ratified through IEEE, so I think that we might just as
> well put a corresponding text somewhere below Documentation/devicetree.

Sure, I'm fine with that.
 
> Also note that this has absolutely nothing to do with ECAM. All I'm
> proposing is to allow the reg property of a root port to be translated
> into a CPU addressable memory region through the ranges property. This
> turns out to actually work properly with the current OF core in Linux.

ECAM is the only standard based mechanism that could possibly use this
new config space ranges concept. Whatever you do for tegra shouldn't
stomp on the possibility of making that work in the future. So at
least think about how ECAM could be modeled for a few mins before
going ahead.

Tegra is the non-standard unusual case here, an update to the PCI OF
binding should be of general use.

> > Anyhow, I think this has been hashed to death, as long as your final
> > binding has the 'device_type = pci' on the pcie-controller node I
> > think it will be fine.
> 
> No. The pcie-controller is *not* a PCI device. It is a PCI host bridge.
> It is the instance that translates from the platform to the PCI bus. Why
> should it be device_type = "pci"? 

Spec says so, Mitch said so, other PCI host bridge bindings in the
tree work like this. Do a grep in the powerpc tree, for instance.

arch/x86/platform/ce4100/falconfalls.dts is a pretty comprehensive
example.

> And we've also been over this before, making that change stops the
> proposed binding from working properly because the entry in the reg
> property can't be translated into a CPU addressable memory region.

Sigh, that's because the change makes things follow the OF PCI
spec. The Linux OF core follows the spec. Config space ranges binding
is specifically *defined to not work* in the spec. Why should it work
today?

You have to deal with this problem somehow, and omitting the top level
device_type is not a solution.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 13, 2013, 8:54 p.m. UTC | #70
On Wed, Mar 13, 2013 at 01:59:01PM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 13, 2013 at 08:26:28PM +0100, Thierry Reding wrote:
> 
> > As Mitch already said there's very little chance that the specification
> > update will be ratified through IEEE, so I think that we might just as
> > well put a corresponding text somewhere below Documentation/devicetree.
> 
> Sure, I'm fine with that.
>  
> > Also note that this has absolutely nothing to do with ECAM. All I'm
> > proposing is to allow the reg property of a root port to be translated
> > into a CPU addressable memory region through the ranges property. This
> > turns out to actually work properly with the current OF core in Linux.
> 
> ECAM is the only standard based mechanism that could possibly use this
> new config space ranges concept. Whatever you do for tegra shouldn't
> stomp on the possibility of making that work in the future. So at
> least think about how ECAM could be modeled for a few mins before
> going ahead.
> 
> Tegra is the non-standard unusual case here, an update to the PCI OF
> binding should be of general use.

But I'm not trying to encode ECAM in the DT. ECAM isn't the problem at
all. The only problem is that each root port has a separate window in
the CPU address space that is used to access that port's configuration
space.

ECAM only comes into play when accessing the configuration space of
subordinate devices (via type 1 transactions). But at that point the
device tree doesn't need to know about it. The only thing that needs to
be in DT is the memory region that the controller should write to in
order to generate type 1 transactions. And that's a problem that has
long been solved.

> > > Anyhow, I think this has been hashed to death, as long as your final
> > > binding has the 'device_type = pci' on the pcie-controller node I
> > > think it will be fine.
> > 
> > No. The pcie-controller is *not* a PCI device. It is a PCI host bridge.
> > It is the instance that translates from the platform to the PCI bus. Why
> > should it be device_type = "pci"? 
> 
> Spec says so, Mitch said so, other PCI host bridge bindings in the
> tree work like this. Do a grep in the powerpc tree, for instance.
> 
> arch/x86/platform/ce4100/falconfalls.dts is a pretty comprehensive
> example.

Yes, yes. We all know by now what the spec says. The whole point is
that maybe the spec needs to be updated and this requirement lifted.
What I keep trying to say is that in case of Tegra the PCIe controller
does not provide the PCI bus. It doesn't appear as a device on the PCI
bus and it doesn't have configuration space itself.

Only the PCI root ports provide access to the subordinate busses. And
only they appear on the PCI bus.

> > And we've also been over this before, making that change stops the
> > proposed binding from working properly because the entry in the reg
> > property can't be translated into a CPU addressable memory region.
> 
> Sigh, that's because the change makes things follow the OF PCI
> spec. The Linux OF core follows the spec. Config space ranges binding
> is specifically *defined to not work* in the spec. Why should it work
> today?

Okay. I keep understanding what you're saying. But the whole point of
this discussion is that maybe the spec should be amended to allow this
to work. In fact it does work with the current implementation in the
Linux kernel. And what I've been trying to explain is that I think this
to be the most accurate description of the Tegra hardware.

> You have to deal with this problem somehow, and omitting the top level
> device_type is not a solution.

So we've already agreed that the specification might be too restrictive,
and I'm trying to solve this problem by finding an accurate description
in DT to represent the hardware without following the spec to the letter
because it wasn't written with this kind of hardware in mind.

Tegra's PCIe controller doesn't actually implement the bus. In fact we
have to jump through hoops to make both root ports appear on the root
bus specifically because there's no common access to the configuration
space that includes the host bridge or the root ports.

Thierry
Mitch Bradley March 13, 2013, 8:58 p.m. UTC | #71
On 3/13/2013 9:26 AM, Thierry Reding wrote:
> On Wed, Mar 13, 2013 at 11:02:05AM -0600, Jason Gunthorpe wrote:
>> On Wed, Mar 13, 2013 at 09:18:15AM +0100, Thierry Reding wrote:
>>
>>> Mitch already answered this. The specification is now almost 15 years
>>> old and it couldn't possibly have foreseen all of the future use-cases.
>>> If the specification is too restrictive and Mitch gives his blessing to
>>> remove some of the restrictions, I don't see any reason why we shouldn't
>>> do so if it lets us represent the reality of hardware more accurately in
>>> DT.
>>
>> I understand the spec is old, and I have no problem with making a
>> Linux specific revision, but do *that* - don't bury some random
>> deviation inside the bindings for a driver. I even suggested some
>> language, but I feel more thought is needed to consider how to model
>> the standardized ECAM mechanism..
> 
> As Mitch already said there's very little chance that the specification
> update will be ratified through IEEE, so I think that we might just as
> well put a corresponding text somewhere below Documentation/devicetree.
> 
> Also note that this has absolutely nothing to do with ECAM. All I'm
> proposing is to allow the reg property of a root port to be translated
> into a CPU addressable memory region through the ranges property. This
> turns out to actually work properly with the current OF core in Linux.
> 
>>> Furthermore we're not discussing radical changes. None of the changes
>>> will be backwards-incompatible, but they will allow recent hardware to
>>> be represented more correctly or conveniently.
>>
>> Sure, but it is still inconsistent, one MMIO config mechansim is in
>> ranges, one is in regs. Plus I don't think tegra's case is a great
>> starting point to design a spec update, it's config access mechanism
>> is especially strange...
> 
> Again, it is still the most accurate way to describe the hardware. I
> know it's not terribly beautiful and doesn't match with what you'd
> expect from PC hardware. However it is still a reality and something the
> kernel will have to deal with. Marvell hardware isn't very pretty either
> but that shouldn't exclude it from being supported by Linux.
> 
>> Anyhow, I think this has been hashed to death, as long as your final
>> binding has the 'device_type = pci' on the pcie-controller node I
>> think it will be fine.
> 
> No. The pcie-controller is *not* a PCI device. It is a PCI host bridge.
> It is the instance that translates from the platform to the PCI bus. Why
> should it be device_type = "pci"? 

device_type="pci" means that this device implements PCI functionality,
which means that it is either a PCI host bridge or a PCI-to-PCI bridge.

"device_type" answers the question "What does this device do?".  The
alternative question "What bus does this device plug into?" is answered
by looking at the parent.

In this case, the answer to "what does pcie_controller do?" is "it
implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.

Having just re-read the code in drivers/of/address.c, I think I have a
deeper understanding of a few practical issues:

In order to translate a 3/2 PCI address via of_get_pci_address(), the
parent device (i.e. root complex / pcie-controller) must have both
'device_type = "pci"' (else of_match_bus() won't find the pci version of
"struct of_bus") and also 'name = "pci"' (else strcmp(bus->name, "pci")
will fail).

So it would seem that, if you want to use address.c verbatim (for the
interface between pcie-controller and its direct children), not only do
you need device_type = pci, but you also need to rename
"pcie-controller" to "pci".

But then you run into the problem that the pci variant of struct of_bus
uses "assigned-addresses" instead of "reg".  So it still doesn't work
as-is.  To make it work, you would need to add an "assigned-addresses"
property to each root port node.  That property value could be in
non-relocatable memory space, translatable via normal rules, in which
case the "map config space to MMIO via ranges" discussion is moot for
Marvell.

Which re-raises a question for which I have not yet heard a compelling
answer:  Why use 3/2 PCI addressing between Marvell pcie-controller and
its root ports (with their unusual programming model)?  The Marvell
hardware address structure between controller and root ports is
definitely not 3/2.  Certainly you need 3/2 addressing below
(implemented by) the root ports, but I just don't see the utility of
pretending 3/2 addressing at the complex-to-port interface.  If the root
port hardware used the common programming model, with real config
headers and stuff, 3/2 would be good because you could use existing
drivers.  But since you need a special root-port driver anyway, why go
to the trouble of emulating non-existent addressing?



And we've also been over this before,
> making that change stops the proposed binding from working properly
> because the entry in the reg property can't be translated into a CPU
> addressable memory region.
> 
> Thierry
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 13, 2013, 9:33 p.m. UTC | #72
On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
[...]
> In this case, the answer to "what does pcie_controller do?" is "it
> implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.

Alright, that's 2 against 1. I don't have much of a choice but to yield.

> Having just re-read the code in drivers/of/address.c, I think I have a
> deeper understanding of a few practical issues:
> 
> In order to translate a 3/2 PCI address via of_get_pci_address(), the
> parent device (i.e. root complex / pcie-controller) must have both
> 'device_type = "pci"' (else of_match_bus() won't find the pci version of
> "struct of_bus") and also 'name = "pci"' (else strcmp(bus->name, "pci")
> will fail).
> 
> So it would seem that, if you want to use address.c verbatim (for the
> interface between pcie-controller and its direct children), not only do
> you need device_type = pci, but you also need to rename
> "pcie-controller" to "pci".

bus->name refers to that of the of_bus structure, not the node name of
the controller.

> But then you run into the problem that the pci variant of struct of_bus
> uses "assigned-addresses" instead of "reg".  So it still doesn't work
> as-is.  To make it work, you would need to add an "assigned-addresses"
> property to each root port node.  That property value could be in
> non-relocatable memory space, translatable via normal rules, in which
> case the "map config space to MMIO via ranges" discussion is moot for
> Marvell.

At the risk of quoting the spec: it says (4.1.2) that the entries in the
"assigned-addresses" property correspond to the function's configuration
space BARs. In this case however that's no longer true. But if everybody
thinks that repurposing the "assigned-addresses" property is more
acceptable than leaving out the device_type = "pci" property from the
pcie-controller node, then so be it.

> Which re-raises a question for which I have not yet heard a compelling
> answer:  Why use 3/2 PCI addressing between Marvell pcie-controller and
> its root ports (with their unusual programming model)?  The Marvell
> hardware address structure between controller and root ports is
> definitely not 3/2.  Certainly you need 3/2 addressing below
> (implemented by) the root ports, but I just don't see the utility of
> pretending 3/2 addressing at the complex-to-port interface.  If the root
> port hardware used the common programming model, with real config
> headers and stuff, 3/2 would be good because you could use existing
> drivers.  But since you need a special root-port driver anyway, why go
> to the trouble of emulating non-existent addressing?

I think the reason is that reg needs to be in the standard format for
the device node matching code to work. There are various checks to make
sure the reg property has entries which are 3/2 each.

There's also the issue that the pcie-controller's ranges property
contains the mapping of the I/O and memory windows and therefore needs
the 3/2 addressing to encode the type.

Thierry
Thierry Reding March 13, 2013, 10:02 p.m. UTC | #73
On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
[...]
> But then you run into the problem that the pci variant of struct of_bus
> uses "assigned-addresses" instead of "reg".  So it still doesn't work
> as-is.  To make it work, you would need to add an "assigned-addresses"
> property to each root port node.  That property value could be in
> non-relocatable memory space, translatable via normal rules, in which
> case the "map config space to MMIO via ranges" discussion is moot for
> Marvell.

So if I understand all of the above correctly, the Tegra binding should
look like this:

	pcie-controller {
		compatible = "nvidia,tegra20-pcie";
		device_type = "pci";
		reg = <0x80003000 0x00000800   /* PADS registers */
		       0x80003800 0x00000200   /* AFI registers */
		       0x90000000 0x10000000>; /* configuration space */
		reg-names = "pads", "afi", "cs";
		interrupts = <0 98 0x04   /* controller interrupt */
		              0 99 0x04>; /* MSI interrupt */
		interrupt-names = "intr", "msi";

		bus-range = <0x00 0xff>;
		#address-cells = <3>;
		#size-cells = <2>;

		ranges = <0x82000800 0 0 0x80000000 0 0x00001000   /* port 0 registers */
			  0x82001000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
			  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
			  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
			  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */

		clocks = <&tegra_car 70>, <&tegra_car 72>, <&tegra_car 74>,
			 <&tegra_car 118>;
		clock-names = "pex", "afi", "pcie_xclk", "pll_e";
		status = "disabled";

		pci@1,0 {
			device_type = "pci";
			assigned-addresses = <0x82000800 0 0 0x80000000 0x1000>;
			reg = <0x000800 0 0 0 0>;
			status = "disabled";

			#address-cells = <3>;
			#size-cells = <2>;

			nvidia,num-lanes = <2>;
		};

		pci@2,0 {
			device_type = "pci";
			assigned-addresses = <0x82001000 0 0 0x80001000 0x1000>;
			reg = <0x001000 0 0 0 0>;
			status = "disabled";

			#address-cells = <3>;
			#size-cells = <2>;

			nvidia,num-lanes = <2>;
		};
	};

Does that look about correct?

One problem I see is that now the ranges property contains three entries
for 32-bit memory space:

	0x82000800 0 0 0x80000000 0 0x00001000   /* port 0 registers */
	0x82001000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
	0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */

But we used to rely on the ss field to parse the ranges property and
configure the windows in the controller properly so that accesses to
these regions will generate the correct transactions.

Now the code will actually match the first entry and assume that it
represents the non-prefetchable 32-bit memory space and program the
controller to forward accesses to the 0x80000000-0x80000fff region
as PCI memory write transactions.

I suppose we could add a check to verify that devfn == 0. However does
the above ranges property not imply that the brige on bus 0, device 1
has a 32-bit memory space from 0x80000000 to 0x80000fff and the bridge
on bus 0, device 2 has a 32-bit memory space from 0x80001000 to
0x80001fff? Which matches with the wording in the spec that the
"assigned-addresses" correspond to the bridge's BARs.

Thierry
Jason Gunthorpe March 13, 2013, 10:21 p.m. UTC | #74
On Wed, Mar 13, 2013 at 11:02:35PM +0100, Thierry Reding wrote:
 
> Does that look about correct?

By my reading of the spec the entries in ranges should not
have the b,d,f bits set..

Although it is not a big stretch to include them..

> Now the code will actually match the first entry and assume that it
> represents the non-prefetchable 32-bit memory space and program the
> controller to forward accesses to the 0x80000000-0x80000fff region
> as PCI memory write transactions.

Yah, it makes a mess of determining the host bridge aperture. Require
the aperture to be first/last?

Make use of the 2nd dword?

Go back to the idea of the top level reg?

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 13, 2013, 10:22 p.m. UTC | #75
On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:

> port hardware used the common programming model, with real config
> headers and stuff, 3/2 would be good because you could use existing
> drivers.  But since you need a special root-port driver anyway, why go
> to the trouble of emulating non-existent addressing?

Yes, I'm sorry, I've tried not to explore the complexity behind this
reasoning here.. Those discussions went on for a long time and I'm
tired of rehashing them :)

It isn't existing drivers that are valuable, it is the entire common
PCI infrastructure for dealing with discovery and address assignment
that is has value.

The common infrastructure can only allocate addresses within fixed
regions assigned to each PCI domain - and it divides those addreses
regions between ports in the root complex via the standard PCI-E root
port bridge memory windows.

Now, the largest SOC has 10 of the PCI-E ports in it, with 32 bit
addressing. It is essential that a small region of address space be
set aside for PCI-E use, and thus it is critical that resource
assignment be done across all ports, drawing on a common pool of
address space.

Thus two choices:
 1) Keep the PCI core the same and let the host bridge driver
    provide the missing configuration elements of the root
    complex
 2) Upgrade the PCI common core to deal with dynamic cross
    domain resource allocation, including hot plug.

Based on discussion/analysis #1 was chosen. Lots of reasons.

From that choice flows the DT bindings, which are now required to
model the PCI bus 0.

Does that help explain the rational?

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 13, 2013, 10:48 p.m. UTC | #76
On 3/13/2013 11:33 AM, Thierry Reding wrote:
> On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
> [...]
>> In this case, the answer to "what does pcie_controller do?" is "it
>> implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.
> 
> Alright, that's 2 against 1. I don't have much of a choice but to yield.

All issues of "voting" aside, 'device_type = pci' is what tells
of_get_pci_address() to use the 3/2 interpretation.  So if you want a
node to implement 3/2 addresses, it must say device_type = pci, unless
you do address translation some other way.

> 
>> Having just re-read the code in drivers/of/address.c, I think I have a
>> deeper understanding of a few practical issues:
>>
>> In order to translate a 3/2 PCI address via of_get_pci_address(), the
>> parent device (i.e. root complex / pcie-controller) must have both
>> 'device_type = "pci"' (else of_match_bus() won't find the pci version of
>> "struct of_bus") and also 'name = "pci"' (else strcmp(bus->name, "pci")
>> will fail).
>>
>> So it would seem that, if you want to use address.c verbatim (for the
>> interface between pcie-controller and its direct children), not only do
>> you need device_type = pci, but you also need to rename
>> "pcie-controller" to "pci".
> 
> bus->name refers to that of the of_bus structure, not the node name of
> the controller.

D'oh!  Completely correct.  That's good; I didn't like the idea of
needing both device_type and name = pci.

> 
>> But then you run into the problem that the pci variant of struct of_bus
>> uses "assigned-addresses" instead of "reg".  So it still doesn't work
>> as-is.  To make it work, you would need to add an "assigned-addresses"
>> property to each root port node.  That property value could be in
>> non-relocatable memory space, translatable via normal rules, in which
>> case the "map config space to MMIO via ranges" discussion is moot for
>> Marvell.
> 
> At the risk of quoting the spec: it says (4.1.2) that the entries in the
> "assigned-addresses" property correspond to the function's configuration
> space BARs. In this case however that's no longer true.

Right, but we got into this mess by pretending that a PCI domain exists
despite the fact that the hardware is not PCI at that level, in order to
accommodate some existing Linux code.  Since we are already "faking it
to deal with existing code", it doesn't seem particularly strange to
take advantage of how existing code works, in order to accomplish the
forgery.

> But if everybody
> thinks that repurposing the "assigned-addresses" property is more
> acceptable than leaving out the device_type = "pci" property from the
> pcie-controller node, then so be it.

I'm confused again.  Why does leaving out device_type = pci make it work?

> 
>> Which re-raises a question for which I have not yet heard a compelling
>> answer:  Why use 3/2 PCI addressing between Marvell pcie-controller and
>> its root ports (with their unusual programming model)?  The Marvell
>> hardware address structure between controller and root ports is
>> definitely not 3/2.  Certainly you need 3/2 addressing below
>> (implemented by) the root ports, but I just don't see the utility of
>> pretending 3/2 addressing at the complex-to-port interface.  If the root
>> port hardware used the common programming model, with real config
>> headers and stuff, 3/2 would be good because you could use existing
>> drivers.  But since you need a special root-port driver anyway, why go
>> to the trouble of emulating non-existent addressing?
> 
> I think the reason is that reg needs to be in the standard format for
> the device node matching code to work. There are various checks to make
> sure the reg property has entries which are 3/2 each.

Could you point me to some of those checks so I can understand this better?

> 
> There's also the issue that the pcie-controller's ranges property
> contains the mapping of the I/O and memory windows and therefore needs
> the 3/2 addressing to encode the type.

Why can't that be done in each of the root port nodes?

> 
> Thierry
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rob Herring March 14, 2013, 12:43 a.m. UTC | #77
On 03/13/2013 05:48 PM, Mitch Bradley wrote:
> On 3/13/2013 11:33 AM, Thierry Reding wrote:
>> On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
>> [...]
>>> In this case, the answer to "what does pcie_controller do?" is "it
>>> implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.
>>
>> Alright, that's 2 against 1. I don't have much of a choice but to yield.
> 
> All issues of "voting" aside, 'device_type = pci' is what tells
> of_get_pci_address() to use the 3/2 interpretation.  So if you want a
> node to implement 3/2 addresses, it must say device_type = pci, unless
> you do address translation some other way.

I should note that device_type is used for OF, but is supposed to not be
used for FDT as matching against compatible properties is preferred. I
don't have a good reason as to why, but Mitch may know the history.
However, there are numerous exceptions to that for compatibility and to
work with existing s/w. So this may be one of those cases.

Rob

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mitch Bradley March 14, 2013, 1:20 a.m. UTC | #78
On 3/13/2013 2:43 PM, Rob Herring wrote:
> On 03/13/2013 05:48 PM, Mitch Bradley wrote:
>> On 3/13/2013 11:33 AM, Thierry Reding wrote:
>>> On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
>>> [...]
>>>> In this case, the answer to "what does pcie_controller do?" is "it
>>>> implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.
>>>
>>> Alright, that's 2 against 1. I don't have much of a choice but to yield.
>>
>> All issues of "voting" aside, 'device_type = pci' is what tells
>> of_get_pci_address() to use the 3/2 interpretation.  So if you want a
>> node to implement 3/2 addresses, it must say device_type = pci, unless
>> you do address translation some other way.
> 
> I should note that device_type is used for OF, but is supposed to not be
> used for FDT as matching against compatible properties is preferred. I
> don't have a good reason as to why, but Mitch may know the history.
> However, there are numerous exceptions to that for compatibility and to
> work with existing s/w. So this may be one of those cases.

Indeed.  The history is that, originally, the "name" property was
supposed to be a very precise name, like what is now the most specific
part of "compatible".  And "device_type" was "what, in general, does
this device do".

But with rule, pathnames quickly became unusable to humans:

  /marvell,mv87230/marvell,svxyz/adaptec,12345/wd,8876

What the heck is that?  Model names are like hashes; they have no rhyme
or reason and humans cannot remember them for long.  In order for a
random person to have a clue about what something was, it was much
better for pathname components to represent the generic function:

  /pci/pci/scsi/disk

Now it's clear that we are looking at a bridged PCI domain hosting a
SCSI disk. That's very useful for a system administrator; the
gobbledygook name is not.

Furthermore, in light of the facts that model numbers change much more
rapidly than programming interfaces, companies clone each others
programming interfaces, and companies change their names, a much more
capable way of describing programming models was needed.

So the "generic names" recommended practice changed things to make
"name" mean what "device_type" used to vaguely mean, "device_type" was
deprecated (it was never very precisely defined, and
mutually-conflicting usage patterns had developed), and "compatible" in
its present form was introduced.  "Generic names" was quite
controversial at the time - I think David Kahn wanted to strangle me at
one point - but it was the right thing.  (Not to mention the
embarrassing mistake of the underscore in device_type, mea culpa.)

That being the case, I personally would like for device_type to
disappear forever.  But the fact exists that address.c currently uses it
(and fdt.c duly parses it), which I why I said that, if you want to use
of_get_pci_address() in its present form, device_type=pci must be
present in the bus node that implements the 3/2 address domain.

So, yeah, it's compability with existing s/w, namely the PCI support in
address.c .

> 
> Rob
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Stephen Warren March 14, 2013, 4:56 a.m. UTC | #79
On 03/13/2013 03:33 PM, Thierry Reding wrote:
> On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote: 
> [...]
>> In this case, the answer to "what does pcie_controller do?" is
>> "it implements a PCI bus" below.  So 'device_type = "pci"' is
>> appropriate.
> 
> Alright, that's 2 against 1. I don't have much of a choice but to
> yield.

Just one note here though re: how the Tegra HW works:

The Tegra "PCIe controller" HW translates from a SoC-internal bus to
another SoC-internal bus. The "PCIe root ports" translate from that
second SoC-internal bus to a PCIe bus. That's exactly why the PCIe
root port configuration registers don't show up via type 0 PCIe
configuration transactions. Thus, from a HW perspective, it really is
true that the PCIe root ports are PCI devices, but the "PCIe
controller" really isn't anything to do with PCIe.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 14, 2013, 7:11 a.m. UTC | #80
On Wed, Mar 13, 2013 at 12:48:28PM -1000, Mitch Bradley wrote:
> On 3/13/2013 11:33 AM, Thierry Reding wrote:
> > On Wed, Mar 13, 2013 at 10:58:02AM -1000, Mitch Bradley wrote:
> > [...]
> >> In this case, the answer to "what does pcie_controller do?" is "it
> >> implements a PCI bus" below.  So 'device_type = "pci"' is appropriate.
> > 
> > Alright, that's 2 against 1. I don't have much of a choice but to yield.
> 
> All issues of "voting" aside, 'device_type = pci' is what tells
> of_get_pci_address() to use the 3/2 interpretation.  So if you want a
> node to implement 3/2 addresses, it must say device_type = pci, unless
> you do address translation some other way.

of_bus_default_map() can also handle #address-cells > 2. We explicitly
added that special case back when the bindings still looked very
different and we needed to match the exact entry.

This is the reason why it actually works to not use device_type = "pci"
for the pcie-controller. For subordinate device address translation the
code will look at the parent bus, see that it isn't a PCI device and use
the default bus, which in turn will simply memcmp() the first 3 cells of
the "reg" property entry against those in the "ranges" property.

> >> But then you run into the problem that the pci variant of struct of_bus
> >> uses "assigned-addresses" instead of "reg".  So it still doesn't work
> >> as-is.  To make it work, you would need to add an "assigned-addresses"
> >> property to each root port node.  That property value could be in
> >> non-relocatable memory space, translatable via normal rules, in which
> >> case the "map config space to MMIO via ranges" discussion is moot for
> >> Marvell.
> > 
> > At the risk of quoting the spec: it says (4.1.2) that the entries in the
> > "assigned-addresses" property correspond to the function's configuration
> > space BARs. In this case however that's no longer true.
> 
> Right, but we got into this mess by pretending that a PCI domain exists
> despite the fact that the hardware is not PCI at that level, in order to
> accommodate some existing Linux code.

Actually at that level the hardware is indeed PCI. These entries are for
the root port nodes, which are in fact the bridges to the PCI bus. This
is what I've been trying to say all along. The PCIe controller is just a
collection of registers that allow CPU address space to be bridged to
an internal bus (FPCI, which is in fact a relic from the HyperTransport
roots of the controller IIUC) and setup/handle MSI.

The transition to the PCI fabric happens in the root ports. There is no
PCI host bridge.

> Since we are already "faking it to deal with existing code", it doesn't
> seem particularly strange to take advantage of how existing code works,
> in order to accomplish the forgery.

So we're back to square one. How is this kind of faking (repurposing the
"assigned-addresses" property) any better than translating the "reg"
property through an "ss = 00" entry in the "ranges" property?

> > But if everybody
> > thinks that repurposing the "assigned-addresses" property is more
> > acceptable than leaving out the device_type = "pci" property from the
> > pcie-controller node, then so be it.
> 
> I'm confused again.  Why does leaving out device_type = pci make it work?

I've already said this above. The default bus implementation handles the
case where #address-cells > 2 by comparing the first three cells, but
not go through the additional range matching that would otherwise need
to take place. Instead it treats the address cells as some kind of
identifier to match between the "reg" and "ranges" entries. That's very
much the way that the PCI bus implementation works, except for some
extra PCI-specific checks.

> >> Which re-raises a question for which I have not yet heard a compelling
> >> answer:  Why use 3/2 PCI addressing between Marvell pcie-controller and
> >> its root ports (with their unusual programming model)?  The Marvell
> >> hardware address structure between controller and root ports is
> >> definitely not 3/2.  Certainly you need 3/2 addressing below
> >> (implemented by) the root ports, but I just don't see the utility of
> >> pretending 3/2 addressing at the complex-to-port interface.  If the root
> >> port hardware used the common programming model, with real config
> >> headers and stuff, 3/2 would be good because you could use existing
> >> drivers.  But since you need a special root-port driver anyway, why go
> >> to the trouble of emulating non-existent addressing?
> > 
> > I think the reason is that reg needs to be in the standard format for
> > the device node matching code to work. There are various checks to make
> > sure the reg property has entries which are 3/2 each.
> 
> Could you point me to some of those checks so I can understand this better?

The code for that is in drivers/of/of_pci.c. The __of_pci_pci_compare()
function looks like this:

	static inline int __of_pci_pci_compare(struct device_node *node,
					       unsigned int devfn)
	{
		unsigned int size;
		const __be32 *reg = of_get_property(node, "reg", &size);

		if (!reg || size < 5 * sizeof(__be32))
			return 0;
		return ((be32_to_cpup(&reg[0]) >> 8) & 0xff) == devfn;
	}

This is the function that's called for each node to check whether it
corresponds to a given PCI device by matching the devfn to the value
found in the first cell of a device node's "reg" property.

> > There's also the issue that the pcie-controller's ranges property
> > contains the mapping of the I/O and memory windows and therefore needs
> > the 3/2 addressing to encode the type.
> 
> Why can't that be done in each of the root port nodes?

It could possibly be done, but we'd be lying. The way the hardware works
is that we configure a single window to map to each of the FPCI regions
for I/O and (non-)prefetchable memory spaces. So the pcie-controller
node's "ranges" property contains three entries, one for each of those
spaces which the driver uses to program the corresponding registers. The
same information is passed to the resource assignment code of the PCI
core so that it can partition these windows as appropriate and allocate
sub-windows that are assigned to the BARs of the root-ports.

Note that the programming of the root-port BARs is vital here. If memory
within one of the FPCI-mapped windows is accessed, the hardware will
look for a matching address in the root port BARs in order to determine
the port through which the transactions should be sent.

Thierry
Thierry Reding March 14, 2013, 9:01 a.m. UTC | #81
On Wed, Mar 13, 2013 at 04:21:02PM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 13, 2013 at 11:02:35PM +0100, Thierry Reding wrote:
>  
> > Does that look about correct?
> 
> By my reading of the spec the entries in ranges should not
> have the b,d,f bits set..
> 
> Although it is not a big stretch to include them..
> 
> > Now the code will actually match the first entry and assume that it
> > represents the non-prefetchable 32-bit memory space and program the
> > controller to forward accesses to the 0x80000000-0x80000fff region
> > as PCI memory write transactions.
> 
> Yah, it makes a mess of determining the host bridge aperture. Require
> the aperture to be first/last?
> 
> Make use of the 2nd dword?
> 
> Go back to the idea of the top level reg?

It turns out that this works with the Tegra driver because it uses the
new of_pci_process_ranges() function and simply overwrites earlier
matches by subsequent ones.

	ranges = <0x82000000 0 0 0x80000000 0 0x00001000   /* port 0 registers */
		  0x82000000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
		  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
		  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
		  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */

So the second entry would overwrite the non-prefetchable memory region
as established by the first entry. Finally that will also be overwritten
by the fourth entry.

So the above will actually work along with the corresponding root-port
"assigned-addresses" properties. I still don't like it much because I
don't think it accurately reflects the hardware. Furthermore it has the
same kludgy, non-spec conformant smack that my original proposal had
because it uses assigned-addresses for something it wasn't intended to.
Also the mapping is actually not correct, because the first two entries
don't refer to 32-bit memory space. They really map configuration space
and therefore the ss field is wrong IMO (it should be 00, but that won't
work with the OF core). One of the very early versions of my Tegra PCIe
patch series actually introduced an IORESOURCE_CS type to account for
this but it didn't go anywhere and became obsolete when I switched to
the new binding which side-steps this issue by not using the PCI bus'
translation mechanism.

Thierry
Jason Gunthorpe March 14, 2013, 5:25 p.m. UTC | #82
[trimm'd the cc list]

On Thu, Mar 14, 2013 at 10:01:20AM +0100, Thierry Reding wrote:

> It turns out that this works with the Tegra driver because it uses the
> new of_pci_process_ranges() function and simply overwrites earlier
> matches by subsequent ones.
> 
> 	ranges = <0x82000000 0 0 0x80000000 0 0x00001000   /* port 0 registers */
> 		  0x82000000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
> 		  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
> 		  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
> 		  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */

Okay.. There is still something funny here, the 3rd dword of the child
address should not be 0 in every line and there shouldn't be overlaps
in the child address space.

I'm assuming 0x80000000, 0xa0000000 and 0xb0000000 are real CPU physical
addresses?

Then it should probably look like:

ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */

Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
address 0xa0000000' - this is the 'normal' case, I assume that is what
happens on tegra?

It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
with address 0' - this translation is something Linux typically
expects..

Then you'd go on to have:

               pci@1,0 {
                        device_type = "pci";
                        assigned-addresses = <0x82000000 0 0x80000000  0 0x1000>;
                        reg = <0x000800 0 0 0 0>;
               }
               pci@2,0 {
                        device_type = "pci";
                        assigned-addresses = <0x82000000 0 0x80001000  0 0x1000>;
                        reg = <0x001000 0 0 0 0>;
               }

Notice I've made the upper dw of assigned-addresses's size 0 and
included the full 3dw from the appropriate ranges line.

> So the above will actually work along with the corresponding root-port
> "assigned-addresses" properties. I still don't like it much because I
> don't think it accurately reflects the hardware. 

There are lots of valid ways to model the same HW :(

Bear in mind, for the PCI case - the OF PCI bindings model the HW
through the eyes of the abstractions in the PCI specification. That is
to say, they are not supposed to be an exact representation of the on
chip architecture.

Perhaps this would be clearer if you used 'pcie-root-complex' as the
name of the top level node?

> same kludgy, non-spec conformant smack that my original proposal had
> because it uses assigned-addresses for something it wasn't intended
> to.

Yes, only the top level 'reg' method avoids going outside any specs.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 14, 2013, 8:38 p.m. UTC | #83
On Thu, Mar 14, 2013 at 11:25:55AM -0600, Jason Gunthorpe wrote:
> [trimm'd the cc list]
> 
> On Thu, Mar 14, 2013 at 10:01:20AM +0100, Thierry Reding wrote:
> 
> > It turns out that this works with the Tegra driver because it uses the
> > new of_pci_process_ranges() function and simply overwrites earlier
> > matches by subsequent ones.
> > 
> > 	ranges = <0x82000000 0 0 0x80000000 0 0x00001000   /* port 0 registers */
> > 		  0x82000000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
> > 		  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
> > 		  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
> > 		  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */
> 
> Okay.. There is still something funny here, the 3rd dword of the child
> address should not be 0 in every line and there shouldn't be overlaps
> in the child address space.
> 
> I'm assuming 0x80000000, 0xa0000000 and 0xb0000000 are real CPU physical
> addresses?

Yes.

> Then it should probably look like:
> 
> ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
> 	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
> 	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
> 	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
> 	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */
> 
> Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
> address 0xa0000000' - this is the 'normal' case, I assume that is what
> happens on tegra?
> 
> It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
> with address 0' - this translation is something Linux typically
> expects..
> 
> Then you'd go on to have:
> 
>                pci@1,0 {
>                         device_type = "pci";
>                         assigned-addresses = <0x82000000 0 0x80000000  0 0x1000>;
>                         reg = <0x000800 0 0 0 0>;
>                }
>                pci@2,0 {
>                         device_type = "pci";
>                         assigned-addresses = <0x82000000 0 0x80001000  0 0x1000>;
>                         reg = <0x001000 0 0 0 0>;
>                }
> 
> Notice I've made the upper dw of assigned-addresses's size 0 and
> included the full 3dw from the appropriate ranges line.

Okay, that all makes sense.

> > So the above will actually work along with the corresponding root-port
> > "assigned-addresses" properties. I still don't like it much because I
> > don't think it accurately reflects the hardware. 
> 
> There are lots of valid ways to model the same HW :(
> 
> Bear in mind, for the PCI case - the OF PCI bindings model the HW
> through the eyes of the abstractions in the PCI specification. That is
> to say, they are not supposed to be an exact representation of the on
> chip architecture.

That seems to be at odds with most other uses of DT that I've come
across. Generally the guideline seems to be to describe hardware
irrespective of the underlying implementation and leave it up to the
driver to translate the DT description into something the OS can use.

> Perhaps this would be clearer if you used 'pcie-root-complex' as the
> name of the top level node?

Perhaps. I'm not sure.

> > same kludgy, non-spec conformant smack that my original proposal had
> > because it uses assigned-addresses for something it wasn't intended
> > to.
> 
> Yes, only the top level 'reg' method avoids going outside any specs.

Yes. It has a couple of other disadvantages, though, so if what we've
been discussing here is in any way acceptable I'd rather go with that
solution, even if I'm not entirely happy about it either.

Thierry
Jason Gunthorpe March 14, 2013, 9:05 p.m. UTC | #84
On Thu, Mar 14, 2013 at 09:38:58PM +0100, Thierry Reding wrote:
> >                pci@1,0 {
> >                         device_type = "pci";
> >                         assigned-addresses = <0x82000000 0 0x80000000  0 0x1000>;

Sorry, I missed this. The b,d,f bits should be set:

                        assigned-addresses = <0x82000800 0 0x80000000  0 0x1000>;


> >                         reg = <0x000800 0 0 0 0>;
> >                }
> >                pci@2,0 {
> >                         device_type = "pci";
> >                         assigned-addresses = <0x82000000 0 0x80001000  0 0x1000>;

                         assigned-addresses = <0x82001000 0 0x80001000  0 0x1000>;


The 'ranges' stays the same, the PCI 3/2 parse will ignore the b,d,f
bits when comparing against ranges.

The use of r == 0 for assigned-addresses should be enough to
disambiguate this case from any future use of assigned-addresses - the
normal case has r == offset of the associated BAR != 0.

Cheers,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 14, 2013, 9:09 p.m. UTC | #85
On Thu, Mar 14, 2013 at 11:25:55AM -0600, Jason Gunthorpe wrote:
[...]
> I'm assuming 0x80000000, 0xa0000000 and 0xb0000000 are real CPU physical
> addresses?
> 
> Then it should probably look like:
> 
> ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
> 	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
> 	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
> 	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
> 	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */
> 
> Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
> address 0xa0000000' - this is the 'normal' case, I assume that is what
> happens on tegra?
> 
> It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
> with address 0' - this translation is something Linux typically
> expects..

Both of the above paragraphs are true. However accesses to the windows
at 0x80000000 and 0x80001000 don't generate memory TLPs but type 0
configuration space TLPs.

So my first instinct was to make the first cell of the first two entries
0, but that doesn't work, since the OF core expects to find either
memory or I/O spaces. ss == 2 isn't quite right here.

Thierry
Mitch Bradley March 14, 2013, 9:10 p.m. UTC | #86
On 3/14/2013 10:38 AM, Thierry Reding wrote:
> On Thu, Mar 14, 2013 at 11:25:55AM -0600, Jason Gunthorpe wrote:
>> [trimm'd the cc list]
>>
>> On Thu, Mar 14, 2013 at 10:01:20AM +0100, Thierry Reding wrote:
>>
>>> It turns out that this works with the Tegra driver because it uses the
>>> new of_pci_process_ranges() function and simply overwrites earlier
>>> matches by subsequent ones.
>>>
>>> 	ranges = <0x82000000 0 0 0x80000000 0 0x00001000   /* port 0 registers */
>>> 		  0x82000000 0 0 0x80001000 0 0x00001000   /* port 1 registers */
>>> 		  0x81000000 0 0 0x82000000 0 0x00010000   /* downstream I/O */
>>> 		  0x82000000 0 0 0xa0000000 0 0x10000000   /* non-prefetchable memory */
>>> 		  0xc2000000 0 0 0xb0000000 0 0x10000000>; /* prefetchable memory */
>>
>> Okay.. There is still something funny here, the 3rd dword of the child
>> address should not be 0 in every line and there shouldn't be overlaps
>> in the child address space.
>>
>> I'm assuming 0x80000000, 0xa0000000 and 0xb0000000 are real CPU physical
>> addresses?
> 
> Yes.
> 
>> Then it should probably look like:
>>
>> ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
>> 	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
>> 	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
>> 	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
>> 	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */
>>
>> Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
>> address 0xa0000000' - this is the 'normal' case, I assume that is what
>> happens on tegra?
>>
>> It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
>> with address 0' - this translation is something Linux typically
>> expects..
>>
>> Then you'd go on to have:
>>
>>                pci@1,0 {
>>                         device_type = "pci";
>>                         assigned-addresses = <0x82000000 0 0x80000000  0 0x1000>;
>>                         reg = <0x000800 0 0 0 0>;
>>                }
>>                pci@2,0 {
>>                         device_type = "pci";
>>                         assigned-addresses = <0x82000000 0 0x80001000  0 0x1000>;
>>                         reg = <0x001000 0 0 0 0>;
>>                }
>>
>> Notice I've made the upper dw of assigned-addresses's size 0 and
>> included the full 3dw from the appropriate ranges line.
> 
> Okay, that all makes sense.
> 
>>> So the above will actually work along with the corresponding root-port
>>> "assigned-addresses" properties. I still don't like it much because I
>>> don't think it accurately reflects the hardware. 
>>
>> There are lots of valid ways to model the same HW :(
>>
>> Bear in mind, for the PCI case - the OF PCI bindings model the HW
>> through the eyes of the abstractions in the PCI specification. That is
>> to say, they are not supposed to be an exact representation of the on
>> chip architecture.
> 
> That seems to be at odds with most other uses of DT that I've come
> across. Generally the guideline seems to be to describe hardware
> irrespective of the underlying implementation and leave it up to the
> driver to translate the DT description into something the OS can use.

This can be a gray area where one just has to make a judgement call.
You can think of the hardware at various levels of abstraction and
sometimes it is not obvious where to draw the line.  "Tell the truth"
tends to be more stable in the long run, but can be more work in the
short run, and it can be hard to predict the future well enough to
choose the optimum approach.

> 
>> Perhaps this would be clearer if you used 'pcie-root-complex' as the
>> name of the top level node?
> 
> Perhaps. I'm not sure.
> 
>>> same kludgy, non-spec conformant smack that my original proposal had
>>> because it uses assigned-addresses for something it wasn't intended
>>> to.
>>
>> Yes, only the top level 'reg' method avoids going outside any specs.
> 
> Yes. It has a couple of other disadvantages, though, so if what we've
> been discussing here is in any way acceptable I'd rather go with that
> solution, even if I'm not entirely happy about it either.
> 
> Thierry
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jason Gunthorpe March 14, 2013, 9:29 p.m. UTC | #87
On Thu, Mar 14, 2013 at 10:09:26PM +0100, Thierry Reding wrote:

> > ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
> > 	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
> > 	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
> > 	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
> > 	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */
> > 
> > Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
> > address 0xa0000000' - this is the 'normal' case, I assume that is what
> > happens on tegra?
> > 
> > It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
> > with address 0' - this translation is something Linux typically
> > expects..
> 
> Both of the above paragraphs are true. However accesses to the windows
> at 0x80000000 and 0x80001000 don't generate memory TLPs but type 0
> configuration space TLPs.

By my understanding access to 0x80000000/0x80001000 doesn't generate
any externally visible TLPs?

IHMO modeling this register space as a controller-internal MMIO region
associated with the bridge is reasonable... After all, you are
iomapping it and accessing it with readl/writel - those are MMIO
functions..

> So my first instinct was to make the first cell of the first two entries
> 0, but that doesn't work, since the OF core expects to find either
> memory or I/O spaces. 

Right, there is no specified way to translate config ss through
ranges.

Regards,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thierry Reding March 14, 2013, 9:37 p.m. UTC | #88
On Thu, Mar 14, 2013 at 03:29:21PM -0600, Jason Gunthorpe wrote:
> On Thu, Mar 14, 2013 at 10:09:26PM +0100, Thierry Reding wrote:
> 
> > > ranges = <0x82000000 0 0x80000000  0x80000000  0 0x00001000   /* port 0 registers */
> > > 	  0x82000000 0 0x80001000  0x80001000  0 0x00001000   /* port 1 registers */
> > > 	  0x81000000 0          0  0x82000000  0 0x00010000   /* downstream I/O */
> > > 	  0x82000000 0 0xa0000000  0xa0000000  0 0x10000000   /* non-prefetchable memory */
> > > 	  0xc2000000 0 0xb0000000  0xb0000000  0 0x10000000>; /* prefetchable memory */
> > > 
> > > Which says 'access to CPU address 0xa0000000 produces a PCI-E memory TLP with
> > > address 0xa0000000' - this is the 'normal' case, I assume that is what
> > > happens on tegra?
> > > 
> > > It also says 'access to CPU address 0x82000000 produces a PCI-E IO TLP
> > > with address 0' - this translation is something Linux typically
> > > expects..
> > 
> > Both of the above paragraphs are true. However accesses to the windows
> > at 0x80000000 and 0x80001000 don't generate memory TLPs but type 0
> > configuration space TLPs.
> 
> By my understanding access to 0x80000000/0x80001000 doesn't generate
> any externally visible TLPs?

Now that you mention it, that's probably correct, yes.

> IHMO modeling this register space as a controller-internal MMIO region
> associated with the bridge is reasonable... After all, you are
> iomapping it and accessing it with readl/writel - those are MMIO
> functions..

Yes, I think that'd be okay.

Thierry
diff mbox

Patch

diff --git a/Documentation/devicetree/bindings/pci/mvebu-pci.txt b/Documentation/devicetree/bindings/pci/mvebu-pci.txt
new file mode 100644
index 0000000..bd621c8
--- /dev/null
+++ b/Documentation/devicetree/bindings/pci/mvebu-pci.txt
@@ -0,0 +1,174 @@ 
+* Marvell EBU PCIe interfaces
+
+Mandatory properties:
+- compatible: for now only "marvell,armada-370-xp-pcie" is supported
+- #address-cells, set to <3>
+- #size-cells, set to <2>
+- #interrupt-cells, set to <1>
+- bus-range: PCI bus numbers covered
+- ranges: standard PCI-style address ranges, describing the PCIe
+  registers for each PCIe interface, and then ranges for the PCI
+  memory and I/O regions.
+- interrupt-map-mask and interrupt-map are standard PCI Device Tree
+  properties to describe the interrupts associated to each PCI
+  interface.
+
+In addition, the Device Tree node must have sub-nodes describing each
+PCIe interface, having the following mandatory properties:
+- reg: the address and size of the PCIe registers (translated
+  addresses according to the ranges property of the parent)
+- clocks: the clock associated to this PCIe interface
+- marvell,pcie-port: the physical PCIe port number
+- status: either "disabled" or "okay"
+
+and the following optional properties:
+- marvell,pcie-lane: the physical PCIe lane number, for ports having
+  multiple lanes. If this property is not found, we assume that the
+  value is 0.
+
+Example:
+
+pcie-controller {
+	compatible = "marvell,armada-370-xp-pcie";
+	status = "disabled";
+
+	#address-cells = <3>;
+	#size-cells = <2>;
+
+	bus-range = <0x00 0xff>;
+
+	ranges = <0x00000800 0 0xd0040000 0xd0040000 0 0x00002000   /* port 0.0 registers */
+	          0x00004800 0 0xd0042000 0xd0042000 0 0x00002000   /* port 2.0 registers */
+	          0x00001000 0 0xd0044000 0xd0044000 0 0x00002000   /* port 0.1 registers */
+	          0x00001800 0 0xd0048000 0xd0048000 0 0x00002000   /* port 0.2 registers */
+	          0x00002000 0 0xd004C000 0xd004C000 0 0x00002000   /* port 0.3 registers */
+		  0x00002800 0 0xd0080000 0xd0080000 0 0x00002000   /* port 1.0 registers */
+	          0x00005000 0 0xd0082000 0xd0082000 0 0x00002000   /* port 3.0 registers */
+		  0x00003000 0 0xd0084000 0xd0084000 0 0x00002000   /* port 1.1 registers */
+		  0x00003800 0 0xd0088000 0xd0088000 0 0x00002000   /* port 1.2 registers */
+		  0x00004000 0 0xd008C000 0xd008C000 0 0x00002000   /* port 1.3 registers */
+		  0x81000000 0 0	  0xc0000000 0 0x00100000   /* downstream I/O */
+		  0x82000000 0 0	  0xc1000000 0 0x08000000>; /* non-prefetchable memory */
+
+	#interrupt-cells = <1>;
+	interrupt-map-mask = <0xf800 0 0 1>;
+	interrupt-map = <0x0800 0 0 1 &mpic 58
+		         0x1000 0 0 1 &mpic 59
+			 0x1800 0 0 1 &mpic 60
+			 0x2000 0 0 1 &mpic 61
+			 0x2800 0 0 1 &mpic 62
+		         0x3000 0 0 1 &mpic 63
+			 0x3800 0 0 1 &mpic 64
+			 0x4000 0 0 1 &mpic 65
+			 0x4800 0 0 1 &mpic 99
+			 0x5000 0 0 1 &mpic 103>;
+
+	pcie@0,0 {
+		device_type = "pciex";
+		reg = <0x0800 0 0xd0040000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 5>;
+		status = "disabled";
+	};
+
+	pcie@0,1 {
+		device_type = "pciex";
+		reg = <0x1000 0 0xd0044000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 6>;
+		status = "disabled";
+	};
+
+	pcie@0,2 {
+		device_type = "pciex";
+		reg = <0x1800 0 0xd0048000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 7>;
+		status = "disabled";
+	};
+
+	pcie@0,3 {
+		device_type = "pciex";
+		reg = <0x2000 0 0xd004C000 0 0xC000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <0>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 8>;
+		status = "disabled";
+	};
+
+	pcie@1,0 {
+		device_type = "pciex";
+		reg = <0x2800 0 0xd0080000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 9>;
+		status = "disabled";
+	};
+
+	pcie@1,1 {
+		device_type = "pciex";
+		reg = <0x3000 0 0xd0084000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <1>;
+		clocks = <&gateclk 10>;
+		status = "disabled";
+	};
+
+	pcie@1,2 {
+		device_type = "pciex";
+		reg = <0x3800 0 0xd0088000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <2>;
+		clocks = <&gateclk 11>;
+		status = "disabled";
+	};
+
+	pcie@1,3 {
+		device_type = "pciex";
+		reg = <0x4000 0 0xd008C000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <1>;
+		marvell,pcie-lane = <3>;
+		clocks = <&gateclk 12>;
+		status = "disabled";
+	};
+	pcie@2,0 {
+		device_type = "pciex";
+		reg = <0x4800 0 0xd0042000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <2>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 26>;
+		status = "disabled";
+	};
+
+	pcie@3,0 {
+		device_type = "pciex";
+		reg = <0x5000 0 0xd0082000 0 0x2000>;
+		#address-cells = <3>;
+		#size-cells = <2>;
+		marvell,pcie-port = <3>;
+		marvell,pcie-lane = <0>;
+		clocks = <&gateclk 27>;
+		status = "disabled";
+	};
+};
diff --git a/drivers/pci/host/Kconfig b/drivers/pci/host/Kconfig
index cc3a1af..03e15e7 100644
--- a/drivers/pci/host/Kconfig
+++ b/drivers/pci/host/Kconfig
@@ -1,4 +1,10 @@ 
 menu "PCI host controller drivers"
 	depends on PCI
 
+config PCI_MVEBU
+	bool "Marvell EBU PCIe controller"
+	depends on ARCH_MVEBU
+	select PCI_SW_HOST_BRIDGE
+	select PCI_SW_PCI_PCI_BRIDGE
+
 endmenu
diff --git a/drivers/pci/host/Makefile b/drivers/pci/host/Makefile
new file mode 100644
index 0000000..3ad563f
--- /dev/null
+++ b/drivers/pci/host/Makefile
@@ -0,0 +1,4 @@ 
+obj-$(CONFIG_PCI_MVEBU) += pci-mvebu.o
+CFLAGS_pci-mvebu.o += \
+	-I$(srctree)/arch/arm/plat-orion/include \
+	-I$(srctree)/arch/arm/mach-mvebu/include
diff --git a/drivers/pci/host/pci-mvebu.c b/drivers/pci/host/pci-mvebu.c
new file mode 100644
index 0000000..82d13da
--- /dev/null
+++ b/drivers/pci/host/pci-mvebu.c
@@ -0,0 +1,688 @@ 
+/*
+ * PCIe driver for Marvell Armada 370 and Armada XP SoCs
+ *
+ * This file is licensed under the terms of the GNU General Public
+ * License version 2.  This program is licensed "as is" without any
+ * warranty of any kind, whether express or implied.
+ */
+
+#include <linux/kernel.h>
+#include <linux/pci.h>
+#include <linux/clk.h>
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/platform_device.h>
+#include <linux/of_address.h>
+#include <linux/of_pci.h>
+#include <linux/of_irq.h>
+#include <linux/of_platform.h>
+#include <plat/pcie.h>
+#include <mach/addr-map.h>
+
+/*
+ * This product ID is registered by Marvell, and used when the Marvell
+ * SoC is not the root complex, but an endpoint on the PCIe bus. It is
+ * therefore safe to re-use this PCI ID for our emulated PCI-to-PCI
+ * bridge.
+ */
+#define MARVELL_EMULATED_PCI_PCI_BRIDGE_ID 0x7846
+
+/* PCI configuration space of a PCI-to-PCI bridge */
+struct mvebu_sw_pci_bridge {
+	u16 vendor;
+	u16 device;
+	u16 command;
+	u16 status;
+	u16 class;
+	u8 interface;
+	u8 revision;
+	u8 bist;
+	u8 header_type;
+	u8 latency_timer;
+	u8 cache_line_size;
+	u32 bar[2];
+	u8 primary_bus;
+	u8 secondary_bus;
+	u8 subordinate_bus;
+	u8 secondary_latency_timer;
+	u8 iobase;
+	u8 iolimit;
+	u16 secondary_status;
+	u16 membase;
+	u16 memlimit;
+	u16 prefmembase;
+	u16 prefmemlimit;
+	u32 prefbaseupper;
+	u32 preflimitupper;
+	u16 iobaseupper;
+	u16 iolimitupper;
+	u8 cappointer;
+	u8 reserved1;
+	u16 reserved2;
+	u32 romaddr;
+	u8 intline;
+	u8 intpin;
+	u16 bridgectrl;
+};
+
+struct mvebu_pcie_port;
+
+/* Structure representing all PCIe interfaces */
+struct mvebu_pcie {
+	struct platform_device *pdev;
+	struct mvebu_pcie_port *ports;
+	struct resource io;
+	struct resource realio;
+	struct resource mem;
+	struct resource busn;
+	int nports;
+};
+
+/* Structure representing one PCIe interface */
+struct mvebu_pcie_port {
+	void __iomem *base;
+	spinlock_t conf_lock;
+	int haslink;
+	u32 port;
+	u32 lane;
+	int devfn;
+	struct clk *clk;
+	struct mvebu_sw_pci_bridge bridge;
+	struct device_node *dn;
+	struct mvebu_pcie *pcie;
+};
+
+static void mvebu_pcie_setup_io_window(struct mvebu_pcie_port *port,
+				       int enable)
+{
+	unsigned long iobase, iolimit;
+
+	if (port->bridge.iolimit < port->bridge.iobase)
+		return;
+
+	iolimit = 0xFFF | ((port->bridge.iolimit & 0xF0) << 8) |
+		(port->bridge.iolimitupper << 16);
+	iobase = ((port->bridge.iobase & 0xF0) << 8) |
+		(port->bridge.iobaseupper << 16);
+
+	if (enable) {
+		unsigned long physbase = port->pcie->io.start + iobase;
+		armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						physbase, iobase,
+						iolimit-iobase,
+						IORESOURCE_IO);
+		pci_ioremap_io(iobase, physbase);
+	}
+	else
+		armada_370_xp_free_pcie_window(iobase);
+}
+
+static void mvebu_pcie_setup_mem_window(struct mvebu_pcie_port *port,
+					int enable)
+{
+	unsigned long membase, memlimit;
+
+	if (port->bridge.memlimit < port->bridge.membase)
+		return;
+
+	membase = ((port->bridge.membase & 0xFFF0) << 16);
+	memlimit = ((port->bridge.memlimit & 0xFFF0) << 16) | 0xFFFFF;
+
+	if (enable)
+		armada_370_xp_alloc_pcie_window(port->port, port->lane,
+						membase, ORION_ADDR_MAP_NO_REMAP,
+						memlimit-membase,
+						IORESOURCE_MEM);
+	else
+		armada_370_xp_free_pcie_window(membase);
+}
+
+static void mvebu_handle_pcie_command(struct mvebu_pcie_port *port, u16 old,
+				      u16 new)
+{
+	/* Enabling an I/O window ? */
+	if (!(old & PCI_COMMAND_IO) && (new & PCI_COMMAND_IO))
+		mvebu_pcie_setup_io_window(port, 1);
+
+	/* Disabling an I/O window ? */
+	if ((old & PCI_COMMAND_IO) && !(new & PCI_COMMAND_IO))
+		mvebu_pcie_setup_io_window(port, 0);
+
+	/* Enabling a memory window ? */
+	if (!(old & PCI_COMMAND_MEMORY) && (new & PCI_COMMAND_MEMORY))
+		mvebu_pcie_setup_mem_window(port, 1);
+
+	/* Disabling a memory window ? */
+	if ((old & PCI_COMMAND_MEMORY) && !(new & PCI_COMMAND_MEMORY))
+		mvebu_pcie_setup_mem_window(port, 0);
+}
+
+/*
+ * Initialize the configuration space of the PCI-to-PCI bridge
+ * associated with the given PCIe interface.
+ */
+static void mvebu_sw_pci_bridge_init(struct mvebu_pcie_port *port)
+{
+	struct mvebu_sw_pci_bridge *bridge = &port->bridge;
+
+	memset(bridge, 0, sizeof(struct mvebu_sw_pci_bridge));
+
+	bridge->status = PCI_STATUS_CAP_LIST;
+	bridge->class = PCI_CLASS_BRIDGE_PCI;
+	bridge->vendor = PCI_VENDOR_ID_MARVELL;
+	bridge->device = MARVELL_EMULATED_PCI_PCI_BRIDGE_ID;
+	bridge->header_type = PCI_HEADER_TYPE_BRIDGE;
+	bridge->cache_line_size = 0x10;
+
+	/* We support 32 bits I/O addressing */
+	bridge->iobase = PCI_IO_RANGE_TYPE_32;
+	bridge->iolimit = PCI_IO_RANGE_TYPE_32;
+}
+
+/*
+ * Read the configuration space of the PCI-to-PCI bridge associated to
+ * the given PCIe interface.
+ */
+static int mvebu_sw_pci_bridge_read(struct mvebu_pcie_port *port,
+				  unsigned int where, int size, u32 *value)
+{
+	struct mvebu_sw_pci_bridge *bridge = &port->bridge;
+
+	switch (where & ~3) {
+	case PCI_VENDOR_ID:
+		*value = bridge->device << 16 | bridge->vendor;
+		break;
+
+	case PCI_COMMAND:
+		*value = bridge->status << 16 | bridge->command;
+		break;
+
+	case PCI_CLASS_REVISION:
+		*value = bridge->class << 16 | bridge->interface << 8 |
+			 bridge->revision;
+		break;
+
+	case PCI_CACHE_LINE_SIZE:
+		*value = bridge->bist << 24 | bridge->header_type << 16 |
+			 bridge->latency_timer << 8 | bridge->cache_line_size;
+		break;
+
+	case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_1:
+		*value = bridge->bar[((where & ~3) - PCI_BASE_ADDRESS_0) / 4];
+		break;
+
+	case PCI_PRIMARY_BUS:
+		*value = (bridge->secondary_latency_timer << 24 |
+			  bridge->subordinate_bus         << 16 |
+			  bridge->secondary_bus           <<  8 |
+			  bridge->primary_bus);
+		break;
+
+	case PCI_IO_BASE:
+		*value = (bridge->secondary_status << 16 |
+			  bridge->iolimit          <<  8 |
+			  bridge->iobase);
+		break;
+
+	case PCI_MEMORY_BASE:
+		*value = (bridge->memlimit << 16 | bridge->membase);
+		break;
+
+	case PCI_PREF_MEMORY_BASE:
+		*value = (bridge->prefmemlimit << 16 | bridge->prefmembase);
+		break;
+
+	case PCI_PREF_BASE_UPPER32:
+		*value = bridge->prefbaseupper;
+		break;
+
+	case PCI_PREF_LIMIT_UPPER32:
+		*value = bridge->preflimitupper;
+		break;
+
+	case PCI_IO_BASE_UPPER16:
+		*value = (bridge->iolimitupper << 16 | bridge->iobaseupper);
+		break;
+
+	case PCI_ROM_ADDRESS1:
+		*value = 0;
+		break;
+
+	default:
+		*value = 0xffffffff;
+		return PCIBIOS_BAD_REGISTER_NUMBER;
+	}
+
+	if (size == 2)
+		*value = (*value >> (8 * (where & 3))) & 0xffff;
+	else if (size == 1)
+		*value = (*value >> (8 * (where & 3))) & 0xff;
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/* Write to the PCI-to-PCI bridge configuration space */
+static int mvebu_sw_pci_bridge_write(struct mvebu_pcie_port *port,
+				     unsigned int where, int size, u32 value)
+{
+	struct mvebu_sw_pci_bridge *bridge = &port->bridge;
+	u32 mask, reg;
+	int err;
+
+	if (size == 4)
+		mask = 0x0;
+	else if (size == 2)
+		mask = ~(0xffff << ((where & 3) * 8));
+	else if (size == 1)
+		mask = ~(0xff << ((where & 3) * 8));
+	else
+		return PCIBIOS_BAD_REGISTER_NUMBER;
+
+	err = mvebu_sw_pci_bridge_read(port, where & ~3, 4, &reg);
+	if (err)
+		return err;
+
+	value = (reg & mask) | value << ((where & 3) * 8);
+
+	switch (where & ~3) {
+	case PCI_COMMAND:
+		mvebu_handle_pcie_command(port, bridge->command,
+					  value & 0xffff);
+		bridge->command = value & 0xffff;
+		bridge->status = value >> 16;
+		break;
+
+	case PCI_BASE_ADDRESS_0 ... PCI_BASE_ADDRESS_1:
+		bridge->bar[((where & ~3) - PCI_BASE_ADDRESS_0) / 4] = value;
+		break;
+
+	case PCI_IO_BASE:
+		/*
+		 * We also keep bit 1 set, it is a read-only bit that
+		 * indicates we support 32 bits addressing for the
+		 * I/O
+		 */
+		bridge->iobase = (value & 0xff) | PCI_IO_RANGE_TYPE_32;
+		bridge->iolimit = ((value >> 8) & 0xff) | PCI_IO_RANGE_TYPE_32;
+		bridge->secondary_status = value >> 16;
+		break;
+
+	case PCI_MEMORY_BASE:
+		bridge->membase = value & 0xffff;
+		bridge->memlimit = value >> 16;
+		break;
+
+	case PCI_PREF_MEMORY_BASE:
+		bridge->prefmembase = value & 0xffff;
+		bridge->prefmemlimit = value >> 16;
+		break;
+
+	case PCI_PREF_BASE_UPPER32:
+		bridge->prefbaseupper = value;
+		break;
+
+	case PCI_PREF_LIMIT_UPPER32:
+		bridge->preflimitupper = value;
+		break;
+
+	case PCI_IO_BASE_UPPER16:
+		bridge->iobaseupper = value & 0xffff;
+		bridge->iolimitupper = value >> 16;
+		break;
+
+	case PCI_PRIMARY_BUS:
+		bridge->primary_bus             = value & 0xff;
+		bridge->secondary_bus           = (value >> 8) & 0xff;
+		bridge->subordinate_bus         = (value >> 16) & 0xff;
+		bridge->secondary_latency_timer = (value >> 24) & 0xff;
+		orion_pcie_set_local_bus_nr(port->base, bridge->secondary_bus);
+		break;
+
+	default:
+		break;
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+static inline struct mvebu_pcie *sys_to_pcie(struct pci_sys_data *sys)
+{
+	return sys->private_data;
+}
+
+/* Find the PCIe interface that corresponds to the given bus */
+static struct mvebu_pcie_port *mvebu_find_port_from_bus(struct mvebu_pcie *pcie,
+							int bus)
+{
+	int porti;
+
+	for (porti = 0; porti < pcie->nports; porti++)
+		if (pcie->ports[porti].bridge.secondary_bus == bus)
+			return &pcie->ports[porti];
+
+	return NULL;
+}
+
+/* PCI configuration space write function */
+static int mvebu_pcie_wr_conf(struct pci_bus *bus, u32 devfn,
+			      int where, int size, u32 val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface.
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int ret;
+
+		port = mvebu_find_port_from_bus(pcie, bus->number);
+		if (!port)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		if (!port->haslink)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		if (PCI_SLOT(devfn) != 0)
+			return PCIBIOS_DEVICE_NOT_FOUND;
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_wr_conf(port->base, bus,
+					 PCI_DEVFN(1, PCI_FUNC(devfn)),
+					 where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Access the emulated PCI-to-PCI bridges.
+		 */
+		if (PCI_SLOT(devfn) >= 1 &&
+		    PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return mvebu_sw_pci_bridge_write(port, where, size, val);
+		} else {
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+
+	return PCIBIOS_SUCCESSFUL;
+}
+
+/* PCI configuration space read function */
+static int mvebu_pcie_rd_conf(struct pci_bus *bus, u32 devfn, int where,
+			      int size, u32 *val)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(bus->sysdata);
+
+	if (bus->number != 0) {
+		/*
+		 * Accessing a real PCIe interface.
+		 */
+		struct mvebu_pcie_port *port;
+		unsigned long flags;
+		int ret;
+
+		port = mvebu_find_port_from_bus(pcie, bus->number);
+		if (!port) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		if (!port->haslink || PCI_SLOT(devfn) != 0) {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+
+		spin_lock_irqsave(&port->conf_lock, flags);
+		ret = orion_pcie_rd_conf(port->base, bus,
+					 PCI_DEVFN(1, PCI_FUNC(devfn)),
+					 where, size, val);
+		spin_unlock_irqrestore(&port->conf_lock, flags);
+
+		return ret;
+	} else {
+		/*
+		 * Access the emulated PCI-to-PCI bridges.
+		 */
+		if (PCI_SLOT(devfn) >= 1 &&
+		    PCI_SLOT(devfn) <= pcie->nports) {
+			struct mvebu_pcie_port *port;
+			int porti = PCI_SLOT(devfn) - 1;
+			port = &pcie->ports[porti];
+			return mvebu_sw_pci_bridge_read(port, where, size, val);
+		} else {
+			*val = 0xffffffff;
+			return PCIBIOS_DEVICE_NOT_FOUND;
+		}
+	}
+}
+
+static struct pci_ops mvebu_pcie_ops = {
+	.read = mvebu_pcie_rd_conf,
+	.write = mvebu_pcie_wr_conf,
+};
+
+static int __init mvebu_pcie_setup(int nr, struct pci_sys_data *sys)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(sys);
+	int i;
+
+	pci_add_resource_offset(&sys->resources, &pcie->realio, sys->io_offset);
+	pci_add_resource_offset(&sys->resources, &pcie->mem, sys->mem_offset);
+	pci_add_resource(&sys->resources, &pcie->busn);
+
+	for (i = 0; i < pcie->nports; i++) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+		orion_pcie_setup(port->base);
+	}
+
+	return 1;
+}
+
+static int __init mvebu_pcie_map_irq(const struct pci_dev *dev, u8 slot, u8 pin)
+{
+	struct of_irq oirq;
+	int ret;
+
+	ret = of_irq_map_pci(dev, &oirq);
+	if (ret) {
+		return ret;
+	}
+
+	return irq_create_of_mapping(oirq.controller, oirq.specifier,
+				     oirq.size);
+}
+
+static struct pci_bus *mvebu_pcie_scan_bus(int nr, struct pci_sys_data *sys)
+{
+	struct mvebu_pcie *pcie = sys_to_pcie(sys);
+	return pci_scan_root_bus(&pcie->pdev->dev, sys->busnr,
+				 &mvebu_pcie_ops, sys, &sys->resources);
+}
+
+resource_size_t mvebu_pcie_align_resource(struct pci_dev *dev,
+					  const struct resource *res,
+					  resource_size_t start,
+					  resource_size_t size,
+					  resource_size_t align)
+{
+	if (dev->bus->number != 0)
+		return start;
+
+	/*
+	 * On the PCI-to-PCI bridge side, the I/O windows must have at
+	 * least a 64 KB size and be aligned on their size, and the
+	 * memory windows must have at least a 1 MB size and be
+	 * aligned on their size
+	 */
+	if (res->flags & IORESOURCE_IO)
+		return round_up(start, max((resource_size_t)SZ_64K, size));
+	else if (res->flags & IORESOURCE_MEM)
+		return round_up(start, max((resource_size_t)SZ_1M, size));
+	else
+		return start;
+}
+
+static void mvebu_pcie_enable(struct mvebu_pcie *pcie)
+{
+	struct hw_pci hw;
+
+	memset(&hw, 0, sizeof(hw));
+
+	hw.nr_controllers = 1;
+	hw.private_data   = (void **)&pcie;
+	hw.setup          = mvebu_pcie_setup;
+	hw.scan           = mvebu_pcie_scan_bus;
+	hw.map_irq        = mvebu_pcie_map_irq;
+	hw.ops            = &mvebu_pcie_ops;
+	hw.align_resource = mvebu_pcie_align_resource;
+
+	pci_common_init(&hw);
+}
+
+static int __init mvebu_pcie_probe(struct platform_device *pdev)
+{
+	struct mvebu_pcie *pcie;
+	struct device_node *np = pdev->dev.of_node;
+	struct device_node *child;
+	const __be32 *range = NULL;
+	struct resource res;
+	int i, ret;
+
+	pcie = devm_kzalloc(&pdev->dev, sizeof(struct mvebu_pcie),
+			    GFP_KERNEL);
+	if (!pcie)
+		return -ENOMEM;
+
+	pcie->pdev = pdev;
+
+	/* Get the I/O and memory ranges from DT */
+	while ((range = of_pci_process_ranges(np, &res, range)) != NULL) {
+		if (resource_type(&res) == IORESOURCE_IO) {
+			memcpy(&pcie->io, &res, sizeof(res));
+			memcpy(&pcie->realio, &res, sizeof(res));
+			pcie->io.name = "I/O";
+			pcie->realio.start &= 0xFFFFF;
+			pcie->realio.end   &= 0xFFFFF;
+		}
+		if (resource_type(&res) == IORESOURCE_MEM) {
+			memcpy(&pcie->mem, &res, sizeof(res));
+			pcie->mem.name = "MEM";
+		}
+	}
+
+	/* Get the bus range */
+	ret = of_pci_parse_bus_range(np, &pcie->busn);
+	if (ret) {
+		dev_err(&pdev->dev, "failed to parse bus-range property: %d\n",
+			ret);
+		return ret;
+	}
+
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		if (!of_device_is_available(child))
+			continue;
+		pcie->nports++;
+	}
+
+	pcie->ports = devm_kzalloc(&pdev->dev, pcie->nports *
+				   sizeof(struct mvebu_pcie_port),
+				   GFP_KERNEL);
+	if (!pcie->ports)
+		return -ENOMEM;
+
+	i = 0;
+	for_each_child_of_node(pdev->dev.of_node, child) {
+		struct mvebu_pcie_port *port = &pcie->ports[i];
+
+		if (!of_device_is_available(child))
+			continue;
+
+		port->pcie = pcie;
+
+		if (of_property_read_u32(child, "marvell,pcie-port",
+					 &port->port)) {
+			dev_warn(&pdev->dev,
+				 "ignoring PCIe DT node, missing pcie-port property\n");
+			continue;
+		}
+
+		if (of_property_read_u32(child, "marvell,pcie-lane",
+					 &port->lane))
+			port->lane = 0;
+
+		port->devfn = of_pci_get_devfn(child);
+		if (port->devfn < 0)
+			continue;
+
+		port->base = of_iomap(child, 0);
+		if (!port->base) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot map registers\n",
+				port->port, port->lane);
+			continue;
+		}
+
+		if (orion_pcie_link_up(port->base)) {
+			port->haslink = 1;
+			dev_info(&pdev->dev, "PCIe%d.%d: link up\n",
+				 port->port, port->lane);
+		} else {
+			port->haslink = 0;
+			dev_info(&pdev->dev, "PCIe%d.%d: link down\n",
+				 port->port, port->lane);
+		}
+
+		port->clk = of_clk_get_by_name(child, NULL);
+		if (!port->clk) {
+			dev_err(&pdev->dev, "PCIe%d.%d: cannot get clock\n",
+			       port->port, port->lane);
+			iounmap(port->base);
+			port->haslink = 0;
+			continue;
+		}
+
+		port->dn = child;
+
+		clk_prepare_enable(port->clk);
+		spin_lock_init(&port->conf_lock);
+
+		mvebu_sw_pci_bridge_init(port);
+
+		i++;
+	}
+
+	mvebu_pcie_enable(pcie);
+
+	return 0;
+}
+
+static const struct of_device_id mvebu_pcie_of_match_table[] = {
+	{ .compatible = "marvell,armada-370-xp-pcie", },
+	{},
+};
+MODULE_DEVICE_TABLE(of, mvebu_pcie_of_match_table);
+
+static struct platform_driver mvebu_pcie_driver = {
+	.driver = {
+		.owner = THIS_MODULE,
+		.name = "mvebu-pcie",
+		.of_match_table =
+		   of_match_ptr(mvebu_pcie_of_match_table),
+	},
+};
+
+static int mvebu_pcie_init(void)
+{
+	return platform_driver_probe(&mvebu_pcie_driver,
+				     mvebu_pcie_probe);
+}
+
+subsys_initcall(mvebu_pcie_init);
+
+MODULE_AUTHOR("Thomas Petazzoni <thomas.petazzoni@free-electrons.com>");
+MODULE_DESCRIPTION("Marvell EBU PCIe driver");
+MODULE_LICENSE("GPLv2");