
[08/15] PCI: mvebu: Propagate errors when updating PCI_IO_BASE and PCI_MEM_BASE registers

Message ID 20211125124605.25915-9-pali@kernel.org
State New
Series: pci: mvebu: Various fixes

Commit Message

Pali Rohár Nov. 25, 2021, 12:45 p.m. UTC
Properly propagate failure from the mvebu_pcie_add_windows() function back to
the caller, mvebu_pci_bridge_emul_base_conf_write(), and correctly update the
PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
On error, set the base value higher than the limit value, which indicates that
the address range is disabled. When IO is unsupported, leave the IO registers
zeroed, as required by the PCIe base specification.

Signed-off-by: Pali Rohár <pali@kernel.org>
Cc: stable@vger.kernel.org
---
 drivers/pci/controller/pci-mvebu.c | 82 ++++++++++++++++++++----------
 1 file changed, 55 insertions(+), 27 deletions(-)
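
As an illustration of the mechanism the commit message describes, here is a
minimal standalone C sketch (the struct, helper and values below are
assumptions for illustration, not the driver's actual code or the patch hunk):

#include <stdint.h>
#include <stdio.h>

/* Illustrative stand-ins for the emulated type-1 header fields. */
struct emul_bridge {
        uint16_t membase;       /* PCI_MEM_BASE  */
        uint16_t memlimit;      /* PCI_MEM_LIMIT */
};

/* Stand-in for mvebu_pcie_add_windows(): may fail for mbus reasons. */
static int add_windows(uint64_t base, uint64_t size)
{
        (void)base;
        return size > (128ULL << 20) ? -1 : 0;  /* pretend >128 MB fails */
}

/*
 * The idea of the patch: if programming the window into the real HW
 * fails, make base > limit in the emulated registers, which per the
 * PCI spec means the window is disabled.
 */
static void write_mem_window(struct emul_bridge *b, uint16_t membase,
                             uint16_t memlimit, uint64_t cpu_base,
                             uint64_t size)
{
        if (add_windows(cpu_base, size) < 0) {
                b->membase = 0xfff0;    /* base  0xfff00000            */
                b->memlimit = 0x0000;   /* limit 0x000fffff -> b > l   */
                return;
        }
        b->membase = membase;
        b->memlimit = memlimit;
}

int main(void)
{
        struct emul_bridge b = { 0, 0 };

        /* Ask for a 256 MB window: the pretend backend rejects it. */
        write_mem_window(&b, 0xe000, 0xeff0, 0xe0000000, 256ULL << 20);
        printf("membase=%#x memlimit=%#x disabled=%d\n",
               (unsigned)b.membase, (unsigned)b.memlimit,
               b.membase > b.memlimit);
        return 0;
}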

Comments

Bjorn Helgaas Jan. 7, 2022, 9:55 p.m. UTC | #1
On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> Properly propagate failure from mvebu_pcie_add_windows() function back to
> the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> On error set base value higher than limit value which indicates that
> address range is disabled. 

Does the spec say that if software programs something invalid,
hardware should proactively set the base and limit registers to
disable the window?

I'm not sure I've seen hardware that does this, and it seems ... maybe
a little aggressive.

What happens if software writes the base and limit in the wrong order,
so the window is invalid after the first write but valid after the
second?  That actually sounds like it could be a sensible strategy to
prevent a partially-configured window from being active.

Bjorn
Pali Rohár Jan. 7, 2022, 10:28 p.m. UTC | #2
On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > Properly propagate failure from mvebu_pcie_add_windows() function back to
> > the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> > PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> > On error set base value higher than limit value which indicates that
> > address range is disabled. 
> 
> Does the spec say that if software programs something invalid,
> hardware should proactively set the base and limit registers to
> disable the window?

No. But this patch addresses something totally different. Software can do a
fully valid operation, e.g. try to set a forwarding memory window as large
as possible. But because this driver "emulates" a PCI bridge by calling a
software/kernel function (mvebu_pcie_add_windows), some operations which
cannot happen in real HW are possible in software.

For example, there are limitations on the sizes of forwarding memory windows,
because they are handled by the mvebu-mbus driver, which is responsible for
configuring the mapping and forwarding of PCIe I/O and MEM windows. And due
to the Marvell HW, there are restrictions which are not present in PCIe HW.

Currently, if such an error happens, the kernel is obviously not able to set
the PCIe windows and it just prints warnings to dmesg. Trying to access these
windows would, in the worst case, result in crashes.

With this change, when the mvebu_pcie_add_windows() function fails, the
emulated config space records that the particular forwarding window is
disabled. I think it is better to indicate in config space what the current
"reality" of the hardware configuration is. If the window is disabled in the
real HW (meaning in the mvebu-mbus driver), then show it also in the
emulated config space of the PCI bridge.

Do you have a better idea of what the emulated PCI bridge should do if
software tries to set a fully valid configuration of the forwarding window,
but it is not possible to achieve it (even though a compliant PCI bridge
must be able to do it)?

> I'm not sure I've seen hardware that does this, and it seems ... maybe
> a little aggressive.
> 
> What happens if software writes the base and limit in the wrong order,
> so the window is invalid after the first write but valid after the
> second?  That actually sounds like it could be a sensible strategy to
> prevent a partially-configured window from being active.
> 
> Bjorn

An invalid window (limit < base) means that the window is disabled. And
pci-mvebu.c, in its callbacks from pci-bridge-emul.c, should correctly
handle it and propagate the information about the disablement to the
mvebu-mbus driver.

After the window is valid again (limit > base), pci-mvebu.c calls
mvebu-mbus to set up the new mapping.
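
For reference, a standalone sketch of how a type-1 bridge I/O base/limit pair
decodes into a window and why limit < base reads as "disabled" (plain C
illustration based on the PCI spec register layout, not driver code):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Decode a bridge I/O window from PCI_IO_BASE/PCI_IO_LIMIT plus the
 * UPPER16 registers (32-bit I/O addressing assumed).  Bits [7:4] of the
 * 8-bit registers are address bits [15:12]; the low 12 bits of the base
 * are implicitly 0 and of the limit 0xfff.
 */
static void decode_io_window(uint8_t io_base, uint8_t io_limit,
                             uint16_t io_base_upper, uint16_t io_limit_upper,
                             uint32_t *base, uint32_t *limit, bool *enabled)
{
        *base  = ((uint32_t)io_base_upper  << 16) | ((io_base  & 0xf0) << 8);
        *limit = ((uint32_t)io_limit_upper << 16) | ((io_limit & 0xf0) << 8)
                 | 0xfff;
        *enabled = *base <= *limit;     /* limit < base => window disabled */
}

int main(void)
{
        uint32_t base, limit;
        bool enabled;

        /* A valid window: [io 0x1000-0xefff] */
        decode_io_window(0x10, 0xe0, 0x0000, 0x0000, &base, &limit, &enabled);
        printf("[io %#x-%#x] %s\n", (unsigned)base, (unsigned)limit,
               enabled ? "enabled" : "disabled");

        /* What the patch writes on error: base above limit => disabled */
        decode_io_window(0xf0, 0x0f, 0x0000, 0x0000, &base, &limit, &enabled);
        printf("[io %#x-%#x] %s\n", (unsigned)base, (unsigned)limit,
               enabled ? "enabled" : "disabled");
        return 0;
}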
Bjorn Helgaas Jan. 7, 2022, 11:16 p.m. UTC | #3
On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > Properly propagate failure from mvebu_pcie_add_windows() function back to
> > > the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> > > PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> > > On error set base value higher than limit value which indicates that
> > > address range is disabled. 
> > 
> > Does the spec say that if software programs something invalid,
> > hardware should proactively set the base and limit registers to
> > disable the window?
> 
> No. But this patch address something totally different. Software can do
> fully valid operation, e.g. try to set forwarding memory window as large
> as possible. But because this driver "emulates" pci bridge by calling
> software/kernel function (mvebu_pcie_add_windows), some operations which
> in real HW cannot happen, are possible in software.
> 
> For example there are limitations in sizes of forwarding memory windows,
> because it is done by mvebu-mbus driver, which is responsible for
> configuring mapping and forwarding of PCIe I/O and MEM windows. And due
> to Marvell HW, there are restrictions which are not in PCIe HW.
> 
> Currently if such error happens, obviously kernel is not able to set
> PCIe windows and it just print warnings to dmesg. Trying to access these
> windows would result in the worst case in crashes.
> 
> With this change when mvebu_pcie_add_windows() function fails then into
> emulated config space is put information that particular forwarding
> window is disabled. I think that it is better to indicate it in config
> space what is the current "reality" of hardware configuration. If window
> is disabled in real-HW (meaning in mvebu-mbus driver) then show it also
> in emulated config space of pci bridge.
> 
> Do you have better idea what should emulated pci bridge do, if software
> try to set fully valid configuration of forwarding window, but it is not
> possible to achieve it (even compliant PCI bridge must be able to do
> it)?

On an ACPI system, the host bridge window sizes are constrained by the
host bridge _CRS method.  I assume there's a similar constraint in DT.

Is the fact that mvebu_pcie_add_windows() can fail a symptom of a DT
that describes more available space than mvebu-bus can map?

> > I'm not sure I've seen hardware that does this, and it seems ... maybe
> > a little aggressive.
> > 
> > What happens if software writes the base and limit in the wrong order,
> > so the window is invalid after the first write but valid after the
> > second?  That actually sounds like it could be a sensible strategy to
> > prevent a partially-configured window from being active.
> 
> Invalid window (limit < base) means that window is disabled. And
> pci-mvebu.c in its callbacks from pci-bridge-emul.c should correctly
> handle it and propagates information about disablement to mvebu-mbus
> driver.
> 
> After window is valid again (limit > base) then pci-mvebu.c call
> mvebu-mbus to setup new mapping.

Not sure I'm understanding the code correctly.  Here's the sort of
thing I'm worried about, but maybe this is actually impossible:

Let's say software writes (0x00, 0xff) to the I/O (base, limit), which
describes the [io 0x0000-0xffff] window.  If mvebu-mbus can't handle
that, it looks like you set the (base, limit) to (0xf0, 0x0f), which
would describe [io 0xf000-0x0fff], which is invalid.

The software writes 0x40 to the limit, so now we have (0xf0, 0x40), or
[io 0xf000-0x40ff].  That's still invalid, but software thinks the
0x00 it wrote to the base is still there.

Bjorn
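
A deliberately simplified standalone simulation of the sequence described
above (hypothetical helpers, not the driver's code); it ends in the
(0xf0, 0x40) state while software still believes the base it wrote (0x00) is
in effect:

#include <stdint.h>
#include <stdio.h>

/* Emulated I/O (base, limit) register pair, 8-bit form as in the spec. */
struct io_window { uint8_t base, limit; };

/* Pretend the backend (mvebu-mbus in the real driver) rejects the window. */
static int backend_map(uint8_t base, uint8_t limit)
{
        (void)base; (void)limit;
        return -1;
}

static void write_base(struct io_window *w, uint8_t v)
{
        w->base = v;
        if (backend_map(w->base, w->limit) < 0) {
                /* Error handling as in the patch: force base > limit. */
                w->base = 0xf0;
                w->limit = 0x0f;
        }
}

static void write_limit(struct io_window *w, uint8_t v)
{
        w->limit = v;   /* a later write only updates the limit... */
}

int main(void)
{
        struct io_window w = { 0xf0, 0x00 };    /* starts disabled */

        write_base(&w, 0x00);   /* software asks for a window at 0x0000 */
        write_limit(&w, 0x40);  /* ...then writes the limit */

        /* Software believes the base is 0x00, but the emulated value is 0xf0. */
        printf("base=%#x limit=%#x\n", (unsigned)w.base, (unsigned)w.limit);
        return 0;
}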
Pali Rohár Jan. 7, 2022, 11:46 p.m. UTC | #4
On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > Properly propagate failure from mvebu_pcie_add_windows() function back to
> > > > the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> > > > PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> > > > On error set base value higher than limit value which indicates that
> > > > address range is disabled. 
> > > 
> > > Does the spec say that if software programs something invalid,
> > > hardware should proactively set the base and limit registers to
> > > disable the window?
> > 
> > No. But this patch address something totally different. Software can do
> > fully valid operation, e.g. try to set forwarding memory window as large
> > as possible. But because this driver "emulates" pci bridge by calling
> > software/kernel function (mvebu_pcie_add_windows), some operations which
> > in real HW cannot happen, are possible in software.
> > 
> > For example there are limitations in sizes of forwarding memory windows,
> > because it is done by mvebu-mbus driver, which is responsible for
> > configuring mapping and forwarding of PCIe I/O and MEM windows. And due
> > to Marvell HW, there are restrictions which are not in PCIe HW.
> > 
> > Currently if such error happens, obviously kernel is not able to set
> > PCIe windows and it just print warnings to dmesg. Trying to access these
> > windows would result in the worst case in crashes.
> > 
> > With this change when mvebu_pcie_add_windows() function fails then into
> > emulated config space is put information that particular forwarding
> > window is disabled. I think that it is better to indicate it in config
> > space what is the current "reality" of hardware configuration. If window
> > is disabled in real-HW (meaning in mvebu-mbus driver) then show it also
> > in emulated config space of pci bridge.
> > 
> > Do you have better idea what should emulated pci bridge do, if software
> > try to set fully valid configuration of forwarding window, but it is not
> > possible to achieve it (even compliant PCI bridge must be able to do
> > it)?
> 
> On an ACPI system, the host bridge window sizes are constrained by the
> host bridge _CRS method.  I assume there's a similar constraint in DT.
> 
> Is the fact that mvebu_pcie_add_windows() can fail a symptom of a DT
> that describes more available space than mvebu-bus can map?

Memory maps for mvebu are more complicated. There is no explicit size in
the DT ranges property, as it is dynamically allocated by mvebu-mbus:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-385.dtsi?h=v5.15#n47

On some Armada platforms (I think it is AXP) there are a lot of SerDes lanes
with PCIe functionality (I think six or seven?), but the "shared memory pool"
which mvebu-mbus allocates to consumers is not big enough to allow e.g.
256 MB + 64 kB for every PCIe port.

There is an upper limit on the number of mvebu memory slots in HW and each
slot has size restrictions. So mvebu-mbus has to deal with splitting a
requested PCIe window across more mvebu memory slots... So even if there is
memory available for assigning, there does not have to be a free slot.

Due to the nature of pluggable PCIe expansion cards, you are basically free
to put any PCIe card you like into any PCIe slot. So it would be up to the
user to choose a combination of cards which passes all those mvebu window
and slot restrictions... Otherwise the kernel just says that it cannot
satisfy the card's BAR assignment because it is not possible to set the
forwarding windows correctly.

Moreover, it is possible to bind / unbind the pci-mvebu device dynamically
at runtime (also by rmmod), so the whole resource allocation in mvebu-mbus
is dynamic even during system runtime. So theoretically a user can unbind
one driver to free some memory and then bind another (which needs more
memory).

I think that this pci-mvebu driver and HW are very unusual in both resource
assignment and the supported features and requirements from SW.

> > > I'm not sure I've seen hardware that does this, and it seems ... maybe
> > > a little aggressive.
> > > 
> > > What happens if software writes the base and limit in the wrong order,
> > > so the window is invalid after the first write but valid after the
> > > second?  That actually sounds like it could be a sensible strategy to
> > > prevent a partially-configured window from being active.
> > 
> > Invalid window (limit < base) means that window is disabled. And
> > pci-mvebu.c in its callbacks from pci-bridge-emul.c should correctly
> > handle it and propagates information about disablement to mvebu-mbus
> > driver.
> > 
> > After window is valid again (limit > base) then pci-mvebu.c call
> > mvebu-mbus to setup new mapping.
> 
> Not sure I'm understanding the code correctly.  Here's the sort of
> thing I'm worried about, but maybe this is actually impossible:
> 
> Let's say software writes (0x00, 0xff) to the I/O (base, limit), which
> describes the [io 0x0000-0xffff] window.  If mvebu-mbus can't handle
> that, it looks like you set the (base, limit) to (0xf0, 0x0f), which
> would describe [io 0xf000-0x0fff], which is invalid.
> 
> The software writes 0x40 to the limit, so now we have (0xf0, 0x40), or
> [io 0xf000-0x40ff].  That's still invalid, but software thinks the
> 0x00 it wrote to the base is still there.
> 
> Bjorn

I see. In this situation it does not work correctly.

But isn't the kernel itself "privileged" to set up the forwarding windows?
Currently the kernel does not do that, so do we need to care about it?

Or do you have an idea of how to handle this kind of situation? Or how to
handle these kinds of errors?
Bjorn Helgaas Jan. 13, 2022, 12:19 a.m. UTC | #5
On Sat, Jan 08, 2022 at 12:46:58AM +0100, Pali Rohár wrote:
> On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> > On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > > Properly propagate failure from mvebu_pcie_add_windows() function back to
> > > > > the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> > > > > PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> > > > > On error set base value higher than limit value which indicates that
> > > > > address range is disabled. 
> > > > 
> > > > Does the spec say that if software programs something invalid,
> > > > hardware should proactively set the base and limit registers to
> > > > disable the window?
> > > 
> > > No. But this patch address something totally different. Software can do
> > > fully valid operation, e.g. try to set forwarding memory window as large
> > > as possible. But because this driver "emulates" pci bridge by calling
> > > software/kernel function (mvebu_pcie_add_windows), some operations which
> > > in real HW cannot happen, are possible in software.
> > > 
> > > For example there are limitations in sizes of forwarding memory windows,
> > > because it is done by mvebu-mbus driver, which is responsible for
> > > configuring mapping and forwarding of PCIe I/O and MEM windows. And due
> > > to Marvell HW, there are restrictions which are not in PCIe HW.
> > > 
> > > Currently if such error happens, obviously kernel is not able to set
> > > PCIe windows and it just print warnings to dmesg. Trying to access these
> > > windows would result in the worst case in crashes.
> > > 
> > > With this change when mvebu_pcie_add_windows() function fails then into
> > > emulated config space is put information that particular forwarding
> > > window is disabled. I think that it is better to indicate it in config
> > > space what is the current "reality" of hardware configuration. If window
> > > is disabled in real-HW (meaning in mvebu-mbus driver) then show it also
> > > in emulated config space of pci bridge.
> > > 
> > > Do you have better idea what should emulated pci bridge do, if software
> > > try to set fully valid configuration of forwarding window, but it is not
> > > possible to achieve it (even compliant PCI bridge must be able to do
> > > it)?
> > 
> > On an ACPI system, the host bridge window sizes are constrained by the
> > host bridge _CRS method.  I assume there's a similar constraint in DT.
> > 
> > Is the fact that mvebu_pcie_add_windows() can fail a symptom of a DT
> > that describes more available space than mvebu-bus can map?
> 
> Memory maps for mvebu are more complicated. There is no explicit size in
> DT ranges property as it is dynamically allocated by mvebu-mbus:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-385.dtsi?h=v5.15#n47

I wish I knew how to really interpret those "ranges" properties.  (Is
there a good description in Documentation/ somewhere?  All I've found
so far is https://elinux.org/Device_Tree_Usage, which is good, but
doesn't match this example completely.)

I see:

  pciec: pcie {
    ranges = <...>;
    pcie1: pcie@1,0 {
      ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
	        0x81000000 0 0 0x81000000 0x1 0 1 0>;
    };
    pcie2: pcie@2,0 {
      ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0
	        0x81000000 0 0 0x81000000 0x2 0 1 0>;
    };
    pcie3: pcie@3,0 {
      ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0
	        0x81000000 0 0 0x81000000 0x3 0 1 0>;
    };
    pcie4: pcie@4,0 {
      ranges = <0x82000000 0 0 0x82000000 0x4 0 1 0
	        0x81000000 0 0 0x81000000 0x4 0 1 0>;
    };
  };

What does this look like in dmesg, i.e., what CPU address ranges are
mapped to what PCI bus addresses?

Are pcie1, pcie2, etc Root Ports?  Or are they each separate host
bridges (they each have "bus-range = <0x00 0xff>")?

Is space from pciec dynamically assigned to pcie1, pcie2, etc?  If so,
I assume there are more restrictions on the size and alignment than on
PCI bridge windows, which allow size/alignment down to 1MB?

I'm trying to see how this could be described in ACPI because that's a
fairly general model that accommodates most machines.  Possibly
describing mvebu in ACPI would involve losing some flexibility.

Bjorn
Pali Rohár Jan. 13, 2022, 10:35 a.m. UTC | #6
On Wednesday 12 January 2022 18:19:21 Bjorn Helgaas wrote:
> On Sat, Jan 08, 2022 at 12:46:58AM +0100, Pali Rohár wrote:
> > On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> > > On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > > > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > > > Properly propagate failure from mvebu_pcie_add_windows() function back to
> > > > > > the caller mvebu_pci_bridge_emul_base_conf_write() and correctly updates
> > > > > > PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16 registers on error.
> > > > > > On error set base value higher than limit value which indicates that
> > > > > > address range is disabled. 
> > > > > 
> > > > > Does the spec say that if software programs something invalid,
> > > > > hardware should proactively set the base and limit registers to
> > > > > disable the window?
> > > > 
> > > > No. But this patch address something totally different. Software can do
> > > > fully valid operation, e.g. try to set forwarding memory window as large
> > > > as possible. But because this driver "emulates" pci bridge by calling
> > > > software/kernel function (mvebu_pcie_add_windows), some operations which
> > > > in real HW cannot happen, are possible in software.
> > > > 
> > > > For example there are limitations in sizes of forwarding memory windows,
> > > > because it is done by mvebu-mbus driver, which is responsible for
> > > > configuring mapping and forwarding of PCIe I/O and MEM windows. And due
> > > > to Marvell HW, there are restrictions which are not in PCIe HW.
> > > > 
> > > > Currently if such error happens, obviously kernel is not able to set
> > > > PCIe windows and it just print warnings to dmesg. Trying to access these
> > > > windows would result in the worst case in crashes.
> > > > 
> > > > With this change when mvebu_pcie_add_windows() function fails then into
> > > > emulated config space is put information that particular forwarding
> > > > window is disabled. I think that it is better to indicate it in config
> > > > space what is the current "reality" of hardware configuration. If window
> > > > is disabled in real-HW (meaning in mvebu-mbus driver) then show it also
> > > > in emulated config space of pci bridge.
> > > > 
> > > > Do you have better idea what should emulated pci bridge do, if software
> > > > try to set fully valid configuration of forwarding window, but it is not
> > > > possible to achieve it (even compliant PCI bridge must be able to do
> > > > it)?
> > > 
> > > On an ACPI system, the host bridge window sizes are constrained by the
> > > host bridge _CRS method.  I assume there's a similar constraint in DT.
> > > 
> > > Is the fact that mvebu_pcie_add_windows() can fail a symptom of a DT
> > > that describes more available space than mvebu-bus can map?
> > 
> > Memory maps for mvebu are more complicated. There is no explicit size in
> > DT ranges property as it is dynamically allocated by mvebu-mbus:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-385.dtsi?h=v5.15#n47
> 
> I wish I knew how to really interpret those "ranges" properties.  (Is
> there a good description in Documentation/ somewhere?  All I've found
> so far is https://elinux.org/Device_Tree_Usage, which is good, but
> doesn't match this example completely.)
> 
> I see:
> 
>   pciec: pcie {
>     ranges = <...>;
>     pcie1: pcie@1,0 {
>       ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
> 	        0x81000000 0 0 0x81000000 0x1 0 1 0>;
>     };
>     pcie2: pcie@2,0 {
>       ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0
> 	        0x81000000 0 0 0x81000000 0x2 0 1 0>;
>     };
>     pcie3: pcie@3,0 {
>       ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0
> 	        0x81000000 0 0 0x81000000 0x3 0 1 0>;
>     };
>     pcie4: pcie@4,0 {
>       ranges = <0x82000000 0 0 0x82000000 0x4 0 1 0
> 	        0x81000000 0 0 0x81000000 0x4 0 1 0>;
>     };
>   };
> 
> What does this look like in dmesg, i.e., what CPU address ranges are
> mapped to what PCI bus addresses?

These explicit ranges in DT are probably ignored, as they are invalid.
You can see them (0xffffffffffffffff) in dmesg. MEM and I/O resources are
parsed in the pci-mvebu.c driver, in the
mvebu_pcie_parse_request_resources() function.

Here is the relevant dmesg output:

[    0.671607] mvebu-pcie soc:pcie: host bridge /soc/pcie ranges:
[    0.677474] mvebu-pcie soc:pcie:      MEM 0x00f1080000..0x00f1081fff -> 0x0000080000
[    0.685245] mvebu-pcie soc:pcie:      MEM 0x00f1040000..0x00f1041fff -> 0x0000040000
[    0.693017] mvebu-pcie soc:pcie:      MEM 0x00f1044000..0x00f1045fff -> 0x0000044000
[    0.700788] mvebu-pcie soc:pcie:      MEM 0x00f1048000..0x00f1049fff -> 0x0000048000
[    0.708562] mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
[    0.716856] mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
[    0.725146] mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
[    0.733438] mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
[    0.741730] mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
[    0.750022] mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
[    0.758314] mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
[    0.766606] mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
[    0.907763] mvebu-pcie soc:pcie: PCI host bridge to bus 0000:00
[    0.913698] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.919203] pci_bus 0000:00: root bus resource [mem 0xf1080000-0xf1081fff] (bus address [0x00080000-0x00081fff])
[    0.929406] pci_bus 0000:00: root bus resource [mem 0xf1040000-0xf1041fff] (bus address [0x00040000-0x00041fff])
[    0.939608] pci_bus 0000:00: root bus resource [mem 0xf1044000-0xf1045fff] (bus address [0x00044000-0x00045fff])
[    0.949812] pci_bus 0000:00: root bus resource [mem 0xf1048000-0xf1049fff] (bus address [0x00048000-0x00049fff])
[    0.960014] pci_bus 0000:00: root bus resource [mem 0xe0000000-0xe7ffffff]
[    0.966908] pci_bus 0000:00: root bus resource [io  0x1000-0xeffff]
[    0.973261] pci 0000:00:01.0: [11ab:6820] type 01 class 0x060400
[    0.979458] pci 0000:00:02.0: [11ab:6820] type 01 class 0x060400
[    0.985643] pci 0000:00:03.0: [11ab:6820] type 01 class 0x060400
[    0.992482] PCI: bus0: Fast back to back transfers disabled
[    0.998075] pci 0000:00:01.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.006108] pci 0000:00:02.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.014134] pci 0000:00:03.0: bridge configuration invalid ([bus 00-00]), reconfiguring
[    1.022242] pci 0000:01:00.0: [1e0f:0001] type 00 class 0x010802
[    1.028289] pci 0000:01:00.0: reg 0x10: [mem 0xc0000000-0xc0003fff 64bit]
[    1.035233] pci 0000:01:00.0: PME# supported from D0 D3hot
[    1.040787] pci 0000:01:00.0: 2.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s PCIe x1 link at 0000:00:01.0 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)
[    1.056630] PCI: bus1: Fast back to back transfers disabled
[    1.062217] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    1.068935] pci 0000:02:00.0: [168c:003c] type 00 class 0x028000
[    1.074975] pci 0000:02:00.0: reg 0x10: [mem 0xc8000000-0xc81fffff 64bit]
[    1.081817] pci 0000:02:00.0: reg 0x30: [mem 0xc8200000-0xc820ffff pref]
[    1.088611] pci 0000:02:00.0: supports D1 D2
[    1.093765] PCI: bus2: Fast back to back transfers disabled
[    1.099356] pci_bus 0000:02: busn_res: [bus 02-ff] end is updated to 02
[    1.106070] pci 0000:03:00.0: [168c:002e] type 00 class 0x028000
[    1.112110] pci 0000:03:00.0: reg 0x10: [mem 0xd0000000-0xd000ffff 64bit]
[    1.119031] pci 0000:03:00.0: supports D1
[    1.123049] pci 0000:03:00.0: PME# supported from D0 D1 D3hot
[    1.129688] PCI: bus3: Fast back to back transfers disabled
[    1.135274] pci_bus 0000:03: busn_res: [bus 03-ff] end is updated to 03
[    1.141926] pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xe00fffff]
[    1.148738] pci 0000:00:02.0: BAR 8: assigned [mem 0xe0200000-0xe04fffff]
[    1.155544] pci 0000:00:03.0: BAR 8: assigned [mem 0xe0100000-0xe01fffff]
[    1.162357] pci 0000:01:00.0: BAR 0: assigned [mem 0xe0000000-0xe0003fff 64bit]
[    1.169696] pci 0000:00:01.0: PCI bridge to [bus 01]
[    1.174674] pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xe00fffff]
[    1.181490] pci 0000:02:00.0: BAR 0: assigned [mem 0xe0200000-0xe03fffff 64bit]
[    1.188830] pci 0000:02:00.0: BAR 6: assigned [mem 0xe0400000-0xe040ffff pref]
[    1.196074] pci 0000:00:02.0: PCI bridge to [bus 02]
[    1.201050] pci 0000:00:02.0:   bridge window [mem 0xe0200000-0xe04fffff]
[    1.207867] pci 0000:03:00.0: BAR 0: assigned [mem 0xe0100000-0xe010ffff 64bit]
[    1.215200] pci 0000:00:03.0: PCI bridge to [bus 03]
[    1.220179] pci 0000:00:03.0:   bridge window [mem 0xe0100000-0xe01fffff]
[    1.227028] pcieport 0000:00:01.0: enabling device (0140 -> 0142)
[    1.233186] pcieport 0000:00:02.0: enabling device (0140 -> 0142)
[    1.239339] pcieport 0000:00:03.0: enabling device (0140 -> 0142)

And some more information from /proc:

$ cat /proc/ioports
00001000-000effff : PCI I/O

$ cat /proc/iomem
...
e0000000-e7ffffff : PCI MEM
  e0000000-e00fffff : PCI Bus 0000:01
    e0000000-e0003fff : 0000:01:00.0
      e0000000-e0003fff : nvme
  e0100000-e01fffff : PCI Bus 0000:03
    e0100000-e010ffff : 0000:03:00.0
      e0100000-e010ffff : ath9k
  e0200000-e04fffff : PCI Bus 0000:02
    e0200000-e03fffff : 0000:02:00.0
      e0200000-e03fffff : ath
    e0400000-e040ffff : 0000:02:00.0
...

> Are pcie1, pcie2, etc Root Ports?  Or are they each separate host
> bridges (they each have "bus-range = <0x00 0xff>")?

From the kernel's point of view they are root ports. But in reality each of
these root ports is on a separate bus segment; the kernel pci-mvebu.c driver
merges all these segments/domains into one host bridge and puts all the root
ports onto bus 0.

Here is lspci -tvnn output with topology:

$ lspci -tvnn
-[0000:00]-+-01.0-[01]----00.0  Device [1e0f:0001]
           +-02.0-[02]----00.0  Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
           \-03.0-[03]----00.0  Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e]

And without topology:

$ lspci -nn
00:01.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:6820] (rev 04)
00:02.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:6820] (rev 04)
00:03.0 PCI bridge [0604]: Marvell Technology Group Ltd. Device [11ab:6820] (rev 04)
01:00.0 Non-Volatile memory controller [0108]: Device [1e0f:0001]
02:00.0 Network controller [0280]: Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
03:00.0 Network controller [0280]: Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e] (rev 01)

Buses 1, 2 and 3 represent mPCIe cards; all of them are in reality on
separate bus segments and on different HW host bridges. So they do *not*
share access to config space, do *not* share INTx interrupts, etc...

> Is space from pciec dynamically assigned to pcie1, pcie2, etc?  If so,
> I assume there are more restrictions on the size and alignment than on
> PCI bridge windows, which allow size/alignment down to 1MB?

Yes, exactly. I do not know all the restrictions offhand. At least there is
a fixed number of memory slots and each has to be of size 2^N. They are
dynamically assigned by the kernel mbus driver at the time when somebody
updates the BASE/LIMIT registers. And that kernel mbus driver takes care of
splitting a non-aligned window size across more slots of size 2^N. And the
resources are shared from a pool with other HW parts (e.g. DMA), so other
drivers loaded in the kernel can "eat" the available slots before pci-mvebu,
and then there may be nothing left to allocate for PCI.

But most Armada boards do not export all peripherals from the SoC;
unconnected ones are disabled in DT and therefore exhaustion should not
happen.

> I'm trying to see how this could be described in ACPI because that's a
> fairly general model that accommodates most machines.  Possibly
> describing mvebu in ACPI would involve losing some flexibility.
> 
> Bjorn

I do not understand the ACPI model very well and I am under the impression
that it is impossible to represent mvebu in ACPI.
Bjorn Helgaas Jan. 20, 2022, 5:50 p.m. UTC | #7
On Thu, Jan 13, 2022 at 11:35:23AM +0100, Pali Rohár wrote:
> On Wednesday 12 January 2022 18:19:21 Bjorn Helgaas wrote:
> > On Sat, Jan 08, 2022 at 12:46:58AM +0100, Pali Rohár wrote:
> > > On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> > > > On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > > > > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > > > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > > > > Properly propagate failure from mvebu_pcie_add_windows()
> > > > > > > function back to the caller
> > > > > > > mvebu_pci_bridge_emul_base_conf_write() and correctly
> > > > > > > updates PCI_IO_BASE, PCI_MEM_BASE and
> > > > > > > PCI_IO_BASE_UPPER16 registers on error.  On error set
> > > > > > > base value higher than limit value which indicates that
> > > > > > > address range is disabled. 
> > > > > > 
> > > > > > Does the spec say that if software programs something
> > > > > > invalid, hardware should proactively set the base and
> > > > > > limit registers to disable the window?
> > > > > 
> > > > > No. But this patch address something totally different.
> > > > > Software can do fully valid operation, e.g. try to set
> > > > > forwarding memory window as large as possible. But because
> > > > > this driver "emulates" pci bridge by calling software/kernel
> > > > > function (mvebu_pcie_add_windows), some operations which in
> > > > > real HW cannot happen, are possible in software.
> > > > > 
> > > > > For example there are limitations in sizes of forwarding
> > > > > memory windows, because it is done by mvebu-mbus driver,
> > > > > which is responsible for configuring mapping and forwarding
> > > > > of PCIe I/O and MEM windows. And due to Marvell HW, there
> > > > > are restrictions which are not in PCIe HW.
> > > > > 
> > > > > Currently if such error happens, obviously kernel is not
> > > > > able to set PCIe windows and it just print warnings to
> > > > > dmesg. Trying to access these windows would result in the
> > > > > worst case in crashes.
> > > > > 
> > > > > With this change when mvebu_pcie_add_windows() function
> > > > > fails then into emulated config space is put information
> > > > > that particular forwarding window is disabled. I think that
> > > > > it is better to indicate it in config space what is the
> > > > > current "reality" of hardware configuration. If window is
> > > > > disabled in real-HW (meaning in mvebu-mbus driver) then show
> > > > > it also in emulated config space of pci bridge.
> > > > > 
> > > > > Do you have better idea what should emulated pci bridge do,
> > > > > if software try to set fully valid configuration of
> > > > > forwarding window, but it is not possible to achieve it
> > > > > (even compliant PCI bridge must be able to do it)?
> > > > 
> > > > On an ACPI system, the host bridge window sizes are
> > > > constrained by the host bridge _CRS method.  I assume there's
> > > > a similar constraint in DT.
> > > > 
> > > > Is the fact that mvebu_pcie_add_windows() can fail a symptom
> > > > of a DT that describes more available space than mvebu-bus can
> > > > map?
> > > 
> > > Memory maps for mvebu are more complicated. There is no explicit
> > > size in DT ranges property as it is dynamically allocated by
> > > mvebu-mbus:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-385.dtsi?h=v5.15#n47
> > 
> > I wish I knew how to really interpret those "ranges" properties.
> > (Is there a good description in Documentation/ somewhere?  All
> > I've found so far is https://elinux.org/Device_Tree_Usage, which
> > is good, but doesn't match this example completely.)
> > 
> > I see:
> > 
> >   pciec: pcie {
> >     ranges = <...>;
> >     pcie1: pcie@1,0 {
> >       ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
> > 	        0x81000000 0 0 0x81000000 0x1 0 1 0>;
> >     };
> >     pcie2: pcie@2,0 {
> >       ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0
> > 	        0x81000000 0 0 0x81000000 0x2 0 1 0>;
> >     };
> >     pcie3: pcie@3,0 {
> >       ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0
> > 	        0x81000000 0 0 0x81000000 0x3 0 1 0>;
> >     };
> >     pcie4: pcie@4,0 {
> >       ranges = <0x82000000 0 0 0x82000000 0x4 0 1 0
> > 	        0x81000000 0 0 0x81000000 0x4 0 1 0>;
> >     };
> >   };
> > 
> > What does this look like in dmesg, i.e., what CPU address ranges are
> > mapped to what PCI bus addresses?
> 
> These explicit ranges in DT are probably ignored as they are invalid.
> You can see them (0xffffffffffffffff) in dmesg. 

Are you saying that this DT ranges and the dmesg line are connected?

  ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
            0x81000000 0 0 0x81000000 0x1 0 1 0>;

  mvebu-pcie soc:pcie: MEM 0xffffffffffffffff..0x00fffffffe -> 0x0100000000

1) It would be nice if there were a hint somewhere in Documentation/
that would allow mere mortals to see the connection there.

2) Why do we have these DT entries if they are invalid and useless?

> MEM and I/O resources are parsed in pci-mvebu.c driver in
> mvebu_pcie_parse_request_resources() function.

So mvebu-mbus.c fills in the static mbus_state from the DT
"pcie-mem-aperture", which seems unconnected to the DT descriptions of
the PCI controllers:

  static struct mvebu_mbus_state mbus_state;

  mvebu_mbus_dt_init
    mvebu_mbus_get_pcie_resources(&mbus_state.pcie_mem_aperture)
      of_property_read_u32_array("pcie-mem-aperture")

  mvebu_pcie_probe
    mvebu_pcie_parse_request_resources
      mvebu_mbus_get_pcie_mem_aperture(&pcie->mem)
	*res = mbus_state.pcie_mem_aperture
      pci_add_resource(&bridge->windows, &pcie->mem)

> Here is relevant dmesg output:
> 
> mvebu-pcie soc:pcie: host bridge /soc/pcie ranges:
> mvebu-pcie soc:pcie:      MEM 0x00f1080000..0x00f1081fff -> 0x0000080000
> mvebu-pcie soc:pcie:      MEM 0x00f1040000..0x00f1041fff -> 0x0000040000
> mvebu-pcie soc:pcie:      MEM 0x00f1044000..0x00f1045fff -> 0x0000044000
> mvebu-pcie soc:pcie:      MEM 0x00f1048000..0x00f1049fff -> 0x0000048000
> mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
> mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
> mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
> mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
> mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
> mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
> mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
> mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
> mvebu-pcie soc:pcie: PCI host bridge to bus 0000:00
> pci_bus 0000:00: root bus resource [bus 00-ff]
> pci_bus 0000:00: root bus resource [mem 0xf1080000-0xf1081fff] (bus address [0x00080000-0x00081fff])
> pci_bus 0000:00: root bus resource [mem 0xf1040000-0xf1041fff] (bus address [0x00040000-0x00041fff])
> pci_bus 0000:00: root bus resource [mem 0xf1044000-0xf1045fff] (bus address [0x00044000-0x00045fff])
> pci_bus 0000:00: root bus resource [mem 0xf1048000-0xf1049fff] (bus address [0x00048000-0x00049fff])
> pci_bus 0000:00: root bus resource [mem 0xe0000000-0xe7ffffff]

I see 0xf1080000-0xf1081fff, 0xf1040000-0xf1041fff, etc mentioned in
the DT info above, but I don't see where [mem 0xe0000000-0xe7ffffff]
came from.

Regardless, this means PCI thinks [mem 0xe0000000-0xe7ffffff] is
available on bus 00 and can be assigned to devices on bus 00 according
to the normal PCI rules (BARs aligned on size, PCI bridge windows
aligned on 1MB and multiple of 1MB in size).  IIUC, mvebu imposes
additional alignment constraints on the bridge windows.

These are the bridge window assignments from your dmesg:

> pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xe00fffff]
> pci 0000:00:02.0: BAR 8: assigned [mem 0xe0200000-0xe04fffff]
> pci 0000:00:03.0: BAR 8: assigned [mem 0xe0100000-0xe01fffff]

> pci 0000:00:01.0: PCI bridge to [bus 01]
> pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xe00fffff]
> pci 0000:00:02.0: PCI bridge to [bus 02]
> pci 0000:00:02.0:   bridge window [mem 0xe0200000-0xe04fffff]
> pci 0000:00:03.0: PCI bridge to [bus 03]
> pci 0000:00:03.0:   bridge window [mem 0xe0100000-0xe01fffff]

The PCI core knows nothing about the mvebu constraints.  Are we just
lucky here that when PCI assigned these bridge windows, they happen to
be supported on mvebu?  What happens if PCI decides it needs 29MB on
bus 01?

> > Are pcie1, pcie2, etc Root Ports?  Or are they each separate host
> > bridges (they each have "bus-range = <0x00 0xff>")?
> 
> From kernel point of view they are root ports. But in reality every of
> these root port is on separate bus segment, but kernel pci-mvebu.c
> driver merges all these segments/domains into one host bridge and put
> all root ports into bus 0.
> 
> Here is lspci -tvnn output with topology:
> 
> $ lspci -tvnn
> -[0000:00]-+-01.0-[01]----00.0  Device [1e0f:0001]
>            +-02.0-[02]----00.0  Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
>            \-03.0-[03]----00.0  Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e]

> Buses 1, 2 and 3 represents mPCIe cards, all of them are in reality
> in separate bus segments and on different HW host bridges. So they
> do *not* share access to config space, do *not* share INTx
> interrupts, etc...
> 
> > Is space from pciec dynamically assigned to pcie1, pcie2, etc?  If
> > so, I assume there are more restrictions on the size and alignment
> > than on PCI bridge windows, which allow size/alignment down to
> > 1MB?
> 
> Yes, exactly. I do not know now all restrictions. At least there are
> fixed number of memory slots and each has to be of size 2^N. They
> are dynamically assigned by kernel mbus driver at time when somebody
> updates BASE/LIMIT registers. And that kernel mbus driver takes care
> to split non-aligned window size to more slots of size 2^N. And
> resources are shared from pool with other HW parts (e.g. DMA), so
> other drivers loaded in kernel can "eat" available slots before
> pci-mvebu and then there does not have to be nothing to allocate for
> PCI.

So IIUC,

  pcie1 == 00:01.0 Root Port
  pcie2 == 00:02.0 Root Port
  pcie3 == 00:03.0 Root Port

From a software point of view, they're all under a single host bridge,
and Linux assumes everything under a host bridge plays by the PCI
rules.

In this case, the root ports *don't* play by the rules since they have
additional alignment restrictions, so I think these really should be
described as separate host bridges in DT with the address space
carved up statically among them.

It's common on x86 to have multiple host bridges that all appear to
software to be in domain 0000.  The bus number ranges under each are
static, e.g., one bridge has [bus 00-7f] and another has [bus 80-ff].

> But most Armada boards do not have exported all peripherals from SoC,
> unconnected are disabled in DT and therefore exhaustion should not
> happen.
> 
> > I'm trying to see how this could be described in ACPI because that's a
> > fairly general model that accommodates most machines.  Possibly
> > describing mvebu in ACPI would involve losing some flexibility.
> 
> I do not understand APCI model very well and I'm in impression that it
> is impossible to represent mvebu in ACPI.

It could be described as a separate host bridge for every root port.
ACPI uses _CRS (current resource settings) to describe the apertures
to PCI and any address translation.  Currently the _CRS description is
static, but ACPI does allow those resource assignments to be modified
via _PRS (possible resource settings) and _SRS (set resource
settings).

Bjorn
Pali Rohár Jan. 20, 2022, 7:08 p.m. UTC | #8
On Thursday 20 January 2022 11:50:47 Bjorn Helgaas wrote:
> On Thu, Jan 13, 2022 at 11:35:23AM +0100, Pali Rohár wrote:
> > On Wednesday 12 January 2022 18:19:21 Bjorn Helgaas wrote:
> > > On Sat, Jan 08, 2022 at 12:46:58AM +0100, Pali Rohár wrote:
> > > > On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> > > > > On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > > > > > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > > > > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > > > > > Properly propagate failure from mvebu_pcie_add_windows()
> > > > > > > > function back to the caller
> > > > > > > > mvebu_pci_bridge_emul_base_conf_write() and correctly
> > > > > > > > updates PCI_IO_BASE, PCI_MEM_BASE and
> > > > > > > > PCI_IO_BASE_UPPER16 registers on error.  On error set
> > > > > > > > base value higher than limit value which indicates that
> > > > > > > > address range is disabled. 
> > > > > > > 
> > > > > > > Does the spec say that if software programs something
> > > > > > > invalid, hardware should proactively set the base and
> > > > > > > limit registers to disable the window?
> > > > > > 
> > > > > > No. But this patch address something totally different.
> > > > > > Software can do fully valid operation, e.g. try to set
> > > > > > forwarding memory window as large as possible. But because
> > > > > > this driver "emulates" pci bridge by calling software/kernel
> > > > > > function (mvebu_pcie_add_windows), some operations which in
> > > > > > real HW cannot happen, are possible in software.
> > > > > > 
> > > > > > For example there are limitations in sizes of forwarding
> > > > > > memory windows, because it is done by mvebu-mbus driver,
> > > > > > which is responsible for configuring mapping and forwarding
> > > > > > of PCIe I/O and MEM windows. And due to Marvell HW, there
> > > > > > are restrictions which are not in PCIe HW.
> > > > > > 
> > > > > > Currently if such error happens, obviously kernel is not
> > > > > > able to set PCIe windows and it just print warnings to
> > > > > > dmesg. Trying to access these windows would result in the
> > > > > > worst case in crashes.
> > > > > > 
> > > > > > With this change when mvebu_pcie_add_windows() function
> > > > > > fails then into emulated config space is put information
> > > > > > that particular forwarding window is disabled. I think that
> > > > > > it is better to indicate it in config space what is the
> > > > > > current "reality" of hardware configuration. If window is
> > > > > > disabled in real-HW (meaning in mvebu-mbus driver) then show
> > > > > > it also in emulated config space of pci bridge.
> > > > > > 
> > > > > > Do you have better idea what should emulated pci bridge do,
> > > > > > if software try to set fully valid configuration of
> > > > > > forwarding window, but it is not possible to achieve it
> > > > > > (even compliant PCI bridge must be able to do it)?
> > > > > 
> > > > > On an ACPI system, the host bridge window sizes are
> > > > > constrained by the host bridge _CRS method.  I assume there's
> > > > > a similar constraint in DT.
> > > > > 
> > > > > Is the fact that mvebu_pcie_add_windows() can fail a symptom
> > > > > of a DT that describes more available space than mvebu-bus can
> > > > > map?
> > > > 
> > > > Memory maps for mvebu are more complicated. There is no explicit
> > > > size in DT ranges property as it is dynamically allocated by
> > > > mvebu-mbus:
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-385.dtsi?h=v5.15#n47
> > > 
> > > I wish I knew how to really interpret those "ranges" properties.
> > > (Is there a good description in Documentation/ somewhere?  All
> > > I've found so far is https://elinux.org/Device_Tree_Usage, which
> > > is good, but doesn't match this example completely.)
> > > 
> > > I see:
> > > 
> > >   pciec: pcie {
> > >     ranges = <...>;
> > >     pcie1: pcie@1,0 {
> > >       ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
> > > 	        0x81000000 0 0 0x81000000 0x1 0 1 0>;
> > >     };
> > >     pcie2: pcie@2,0 {
> > >       ranges = <0x82000000 0 0 0x82000000 0x2 0 1 0
> > > 	        0x81000000 0 0 0x81000000 0x2 0 1 0>;
> > >     };
> > >     pcie3: pcie@3,0 {
> > >       ranges = <0x82000000 0 0 0x82000000 0x3 0 1 0
> > > 	        0x81000000 0 0 0x81000000 0x3 0 1 0>;
> > >     };
> > >     pcie4: pcie@4,0 {
> > >       ranges = <0x82000000 0 0 0x82000000 0x4 0 1 0
> > > 	        0x81000000 0 0 0x81000000 0x4 0 1 0>;
> > >     };
> > >   };
> > > 
> > > What does this look like in dmesg, i.e., what CPU address ranges are
> > > mapped to what PCI bus addresses?
> > 
> > These explicit ranges in DT are probably ignored as they are invalid.
> > You can see them (0xffffffffffffffff) in dmesg. 
> 
> Are you saying that this DT ranges and the dmesg line are connected?
> 
>   ranges = <0x82000000 0 0 0x82000000 0x1 0 1 0
>             0x81000000 0 0 0x81000000 0x1 0 1 0>;
> 
>   mvebu-pcie soc:pcie: MEM 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
> 
> 1) It would be nice if there were a hint somewhere in Documentation/
> that would allow mere mortals to see the connection there.

I agree that a lot of documentation related to the Marvell PCIe controllers,
both pci-mvebu.c and pci-aardvark.c, is missing. The main problem now is
that it is hard to find information which explains everything... like the
mysterious Memory Controller (which is now explained):
https://lore.kernel.org/linux-pci/20211003120944.3lmwxylnhlp2kfj7@pali/

> 2) Why do we have these DT entries if they are invalid and useless?

Sorry, I do not know. I was not involved in introducing the DT files for
this platform. I'm just observing how it works...

> > MEM and I/O resources are parsed in pci-mvebu.c driver in
> > mvebu_pcie_parse_request_resources() function.
> 
> So mvebu-mbus.c fills in the static mbus_state from the DT
> "pcie-mem-aperture", which seems unconnected to the DT descriptions of
> the PCI controllers:
> 
>   static struct mvebu_mbus_state mbus_state;
> 
>   mvebu_mbus_dt_init
>     mvebu_mbus_get_pcie_resources(&mbus_state.pcie_mem_aperture)
>       of_property_read_u32_array("pcie-mem-aperture")
> 
>   mvebu_pcie_probe
>     mvebu_pcie_parse_request_resources
>       mvebu_mbus_get_pcie_mem_aperture(&pcie->mem)
> 	*res = mbus_state.pcie_mem_aperture
>       pci_add_resource(&bridge->windows, &pcie->mem)
> 
> > Here is relevant dmesg output:
> > 
> > mvebu-pcie soc:pcie: host bridge /soc/pcie ranges:
> > mvebu-pcie soc:pcie:      MEM 0x00f1080000..0x00f1081fff -> 0x0000080000
> > mvebu-pcie soc:pcie:      MEM 0x00f1040000..0x00f1041fff -> 0x0000040000
> > mvebu-pcie soc:pcie:      MEM 0x00f1044000..0x00f1045fff -> 0x0000044000
> > mvebu-pcie soc:pcie:      MEM 0x00f1048000..0x00f1049fff -> 0x0000048000
> > mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
> > mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0100000000
> > mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
> > mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0200000000
> > mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
> > mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0300000000
> > mvebu-pcie soc:pcie:      MEM 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
> > mvebu-pcie soc:pcie:       IO 0xffffffffffffffff..0x00fffffffe -> 0x0400000000
> > mvebu-pcie soc:pcie: PCI host bridge to bus 0000:00
> > pci_bus 0000:00: root bus resource [bus 00-ff]
> > pci_bus 0000:00: root bus resource [mem 0xf1080000-0xf1081fff] (bus address [0x00080000-0x00081fff])
> > pci_bus 0000:00: root bus resource [mem 0xf1040000-0xf1041fff] (bus address [0x00040000-0x00041fff])
> > pci_bus 0000:00: root bus resource [mem 0xf1044000-0xf1045fff] (bus address [0x00044000-0x00045fff])
> > pci_bus 0000:00: root bus resource [mem 0xf1048000-0xf1049fff] (bus address [0x00048000-0x00049fff])
> > pci_bus 0000:00: root bus resource [mem 0xe0000000-0xe7ffffff]
> 
> I see 0xf1080000-0xf1081fff, 0xf1040000-0xf1041fff, etc mentioned in
> the DT info above, but I don't see where [mem 0xe0000000-0xe7ffffff]
> came from.

0xe0000000 is defined in the above-mentioned "pcie-mem-aperture" DT property,
which is in the armada-38x.dtsi DTS file included from armada-385.dtsi:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/armada-38x.dtsi?h=v5.15#n42

> Regardless, this means PCI thinks [mem 0xe0000000-0xe7ffffff] is
> available on bus 00 and can be assigned to devices on bus 00 according
> to the normal PCI rules (BARs aligned on size, PCI bridge windows
> aligned on 1MB and multiple of 1MB in size).  IIUC, mvebu imposes
> additional alignment constraints on the bridge windows.
> 
> These are the bridge window assignments from your dmesg:
> 
> > pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xe00fffff]
> > pci 0000:00:02.0: BAR 8: assigned [mem 0xe0200000-0xe04fffff]
> > pci 0000:00:03.0: BAR 8: assigned [mem 0xe0100000-0xe01fffff]
> 
> > pci 0000:00:01.0: PCI bridge to [bus 01]
> > pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xe00fffff]
> > pci 0000:00:02.0: PCI bridge to [bus 02]
> > pci 0000:00:02.0:   bridge window [mem 0xe0200000-0xe04fffff]
> > pci 0000:00:03.0: PCI bridge to [bus 03]
> > pci 0000:00:03.0:   bridge window [mem 0xe0100000-0xe01fffff]
> 
> The PCI core knows nothing about the mvebu constraints.  Are we just
> lucky here that when PCI assigned these bridge windows, they happen to
> be supported on mvebu?  What happens if PCI decides it needs 29MB on
> bus 01?

In this case pci-mvebu.c splits the 29MB window into contiguous ranges whose
sizes are powers of two (16MB + 8MB + 4MB + 1MB) and then registers each
range with an mbus slot. The code is in the function
mvebu_pcie_add_windows():
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/pci-mvebu.c?h=v5.15#n300

So at the end there is a contiguous 29MB PCIe window, it just "eats" 4 mbus
slots.

This function may fail (if there are not enough free mbus slots) and this
patch propagates that failure back to the caller.
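
The splitting described here can be sketched as a greedy decomposition into
power-of-two chunks (a standalone illustration of the idea, not the actual
mvebu_pcie_add_windows() code; mbus alignment constraints are glossed over):

#include <stdint.h>
#include <stdio.h>

/*
 * Split a window of 'size' bytes at 'base' into power-of-two chunks, the
 * way a 29 MB request ends up as 16 MB + 8 MB + 4 MB + 1 MB.  Each chunk
 * would consume one mbus slot in the real hardware.
 */
static void split_window(uint64_t base, uint64_t size)
{
        while (size) {
                /* largest power of two not exceeding the remaining size */
                uint64_t chunk = 1ULL << (63 - __builtin_clzll(size));

                printf("slot: %#llx-%#llx (%llu MB)\n",
                       (unsigned long long)base,
                       (unsigned long long)(base + chunk - 1),
                       (unsigned long long)(chunk >> 20));
                base += chunk;
                size -= chunk;
        }
}

int main(void)
{
        split_window(0xe0000000, 29ULL << 20);  /* 29 MB at 0xe0000000 */
        return 0;
}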

> > > Are pcie1, pcie2, etc Root Ports?  Or are they each separate host
> > > bridges (they each have "bus-range = <0x00 0xff>")?
> > 
> > From kernel point of view they are root ports. But in reality every of
> > these root port is on separate bus segment, but kernel pci-mvebu.c
> > driver merges all these segments/domains into one host bridge and put
> > all root ports into bus 0.
> > 
> > Here is lspci -tvnn output with topology:
> > 
> > $ lspci -tvnn
> > -[0000:00]-+-01.0-[01]----00.0  Device [1e0f:0001]
> >            +-02.0-[02]----00.0  Qualcomm Atheros QCA986x/988x 802.11ac Wireless Network Adapter [168c:003c]
> >            \-03.0-[03]----00.0  Qualcomm Atheros AR9287 Wireless Network Adapter (PCI-Express) [168c:002e]
> 
> > Buses 1, 2 and 3 represents mPCIe cards, all of them are in reality
> > in separate bus segments and on different HW host bridges. So they
> > do *not* share access to config space, do *not* share INTx
> > interrupts, etc...
> > 
> > > Is space from pciec dynamically assigned to pcie1, pcie2, etc?  If
> > > so, I assume there are more restrictions on the size and alignment
> > > than on PCI bridge windows, which allow size/alignment down to
> > > 1MB?
> > 
> > Yes, exactly. I do not know now all restrictions. At least there are
> > fixed number of memory slots and each has to be of size 2^N. They
> > are dynamically assigned by kernel mbus driver at time when somebody
> > updates BASE/LIMIT registers. And that kernel mbus driver takes care
> > to split non-aligned window size to more slots of size 2^N. And
> > resources are shared from pool with other HW parts (e.g. DMA), so
> > other drivers loaded in kernel can "eat" available slots before
> > pci-mvebu and then there does not have to be nothing to allocate for
> > PCI.
> 
> So IIUC,
> 
>   pcie1 == 00:01.0 Root Port
>   pcie2 == 00:02.0 Root Port
>   pcie3 == 00:03.0 Root Port
> 
> From a software point of view, they're all under a single host bridge,
> and Linux assumes everything under a host bridge plays by the PCI
> rules.

Yes.

> In this case, the root ports *don't* play by the rules since they have
> additional alignment restrictions, so I think these really should be
> described as separate host bridges in DT with the address space
> carved up statically among them.

I fully agree with you.

But the pci-mvebu.c driver and also its DT bindings are written differently.
Changing that would probably not be simple, due to backward compatibility,
and would take development resources...

> It's common on x86 to have multiple host bridges that all appear to
> software to be in domain 0000.  The bus number ranges under each are
> static, e.g., one bridge has [bus 00-7f] and another has [bus 80-ff].

For mvebu they are dynamic and the kernel assigns them at boot. As the lspci
topology printed above is simple, the first bridge has been assigned
[bus 01-01], the second bridge [bus 02-02] and the third bridge [bus 03-03].

> > But most Armada boards do not have exported all peripherals from SoC,
> > unconnected are disabled in DT and therefore exhaustion should not
> > happen.
> > 
> > > I'm trying to see how this could be described in ACPI because that's a
> > > fairly general model that accommodates most machines.  Possibly
> > > describing mvebu in ACPI would involve losing some flexibility.
> > 
> > I do not understand the ACPI model very well and I'm under the
> > impression that it is impossible to represent mvebu in ACPI.
> 
> It could be described as a separate host bridge for every root port.
> ACPI uses _CRS (current resource settings) to describe the apertures
> to PCI and any address translation.  Currently the _CRS description is
> static, but ACPI does allow those resource assignments to be modified
> via _PRS (possible resource settings) and _SRS (set resource
> settings).
> 
> Bjorn
Bjorn Helgaas Jan. 20, 2022, 7:37 p.m. UTC | #9
On Thu, Jan 20, 2022 at 08:08:26PM +0100, Pali Rohár wrote:
> On Thursday 20 January 2022 11:50:47 Bjorn Helgaas wrote:
> > On Thu, Jan 13, 2022 at 11:35:23AM +0100, Pali Rohár wrote:
> > > On Wednesday 12 January 2022 18:19:21 Bjorn Helgaas wrote:
> > > > On Sat, Jan 08, 2022 at 12:46:58AM +0100, Pali Rohár wrote:
> > > > > On Friday 07 January 2022 17:16:17 Bjorn Helgaas wrote:
> > > > > > On Fri, Jan 07, 2022 at 11:28:26PM +0100, Pali Rohár wrote:
> > > > > > > On Friday 07 January 2022 15:55:04 Bjorn Helgaas wrote:
> > > > > > > > On Thu, Nov 25, 2021 at 01:45:58PM +0100, Pali Rohár wrote:
> > > > > > > > > Properly propagate failure from
> > > > > > > > > mvebu_pcie_add_windows() function back to the caller
> > > > > > > > > mvebu_pci_bridge_emul_base_conf_write() and
> > > > > > > > > correctly updates PCI_IO_BASE, PCI_MEM_BASE and
> > > > > > > > > PCI_IO_BASE_UPPER16 registers on error.  On error
> > > > > > > > > set base value higher than limit value which
> > > > > > > > > indicates that address range is disabled. 

> > Regardless, this means PCI thinks [mem 0xe0000000-0xe7ffffff] is
> > available on bus 00 and can be assigned to devices on bus 00
> > according to the normal PCI rules (BARs aligned on size, PCI
> > bridge windows aligned on 1MB and multiple of 1MB in size).  IIUC,
> > mvebu imposes additional alignment constraints on the bridge
> > windows.
> > 
> > These are the bridge window assignments from your dmesg:
> > 
> > > pci 0000:00:01.0: BAR 8: assigned [mem 0xe0000000-0xe00fffff]
> > > pci 0000:00:02.0: BAR 8: assigned [mem 0xe0200000-0xe04fffff]
> > > pci 0000:00:03.0: BAR 8: assigned [mem 0xe0100000-0xe01fffff]
> > 
> > > pci 0000:00:01.0: PCI bridge to [bus 01]
> > > pci 0000:00:01.0:   bridge window [mem 0xe0000000-0xe00fffff]
> > > pci 0000:00:02.0: PCI bridge to [bus 02]
> > > pci 0000:00:02.0:   bridge window [mem 0xe0200000-0xe04fffff]
> > > pci 0000:00:03.0: PCI bridge to [bus 03]
> > > pci 0000:00:03.0:   bridge window [mem 0xe0100000-0xe01fffff]
> > 
> > The PCI core knows nothing about the mvebu constraints.  Are we
> > just lucky here that when PCI assigned these bridge windows, they
> > happen to be supported on mvebu?  What happens if PCI decides it
> > needs 29MB on bus 01?
> 
> In this case pci-mvebu.c splits the 29MB window into contiguous ranges of
> power-of-two size (16MB + 8MB + 4MB + 1MB) and then registers each range
> to an mbus slot. The code is in the function mvebu_pcie_add_windows():
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/pci/controller/pci-mvebu.c?h=v5.15#n300
> 
> So in the end there is a continuous 29MB PCIe window; it just "eats"
> 4 mbus slots.
> 
> This function may fail (if there are not enough free mbus slots) and
> this patch propagates that failure back to the caller.

This failure cannot occur in conforming PCI hardware.  I guess if you
want to propagate the error from mvebu_pcie_add_windows() back to
mvebu_pci_bridge_emul_base_conf_write() and do something there, I'm OK
with that.

But change the commit log so it doesn't say "... and correctly update
PCI_IO_BASE, PCI_MEM_BASE and PCI_IO_BASE_UPPER16" because this is
completely device-specific behavior and is not "correct" per any PCI
spec.

Instead, say something about how mvebu doesn't support arbitrary
windows and we're disabling the window completely if we can't provide
what's requested.  
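
For reference, "disabled" here just means that, decoded with the standard
PCI-to-PCI bridge base/limit rules, the base ends up above the limit. A
minimal standalone sketch using the register values the patch writes on
error (the I/O case assumes mvebu_has_ioport() is true):

/*
 * Standalone sketch: decode the emulated bridge registers with the
 * standard PCI-to-PCI bridge rules.  The values below are what the patch
 * writes on error; base ending up above limit marks the window disabled.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint8_t  iobase   = 0xf0;   /* PCI_IO_BASE      */
	uint8_t  iolimit  = 0x00;   /* PCI_IO_LIMIT     */
	uint16_t membase  = 0xfff0; /* PCI_MEMORY_BASE  */
	uint16_t memlimit = 0x0000; /* PCI_MEMORY_LIMIT */

	/* I/O: bits 7:4 hold address bits 15:12; limit has low 12 bits set */
	uint32_t io_base   = (uint32_t)(iobase & 0xf0) << 8;
	uint32_t io_limit  = ((uint32_t)(iolimit & 0xf0) << 8) | 0xfff;
	/* MEM: bits 15:4 hold address bits 31:20; limit has low 20 bits set */
	uint32_t mem_base  = (uint32_t)(membase & 0xfff0) << 16;
	uint32_t mem_limit = ((uint32_t)(memlimit & 0xfff0) << 16) | 0xfffff;

	printf("IO  window: %s\n", io_base  > io_limit  ? "disabled" : "enabled");
	printf("MEM window: %s\n", mem_base > mem_limit ? "disabled" : "enabled");
	return 0;
}

With these values both windows decode as disabled; in the no-ioport case
the patch instead leaves the I/O registers zeroed, since the bridge then
has no I/O window at all.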

Maybe this error warrants a clue in dmesg?  How would a user figure
out what's going on in this situation?  From the patch, it looks like
we would assign resources to a device, but the device just would not
work because the root port window was silently disabled?

Bjorn
diff mbox series

Patch

diff --git a/drivers/pci/controller/pci-mvebu.c b/drivers/pci/controller/pci-mvebu.c
index a0b661972436..12afa565bfcf 100644
--- a/drivers/pci/controller/pci-mvebu.c
+++ b/drivers/pci/controller/pci-mvebu.c
@@ -315,7 +315,7 @@  static void mvebu_pcie_del_windows(struct mvebu_pcie_port *port,
  * areas each having a power of two size. We start from the largest
  * one (i.e highest order bit set in the size).
  */
-static void mvebu_pcie_add_windows(struct mvebu_pcie_port *port,
+static int mvebu_pcie_add_windows(struct mvebu_pcie_port *port,
 				   unsigned int target, unsigned int attribute,
 				   phys_addr_t base, size_t size,
 				   phys_addr_t remap)
@@ -336,7 +336,7 @@  static void mvebu_pcie_add_windows(struct mvebu_pcie_port *port,
 				&base, &end, ret);
 			mvebu_pcie_del_windows(port, base - size_mapped,
 					       size_mapped);
-			return;
+			return ret;
 		}
 
 		size -= sz;
@@ -345,16 +345,20 @@  static void mvebu_pcie_add_windows(struct mvebu_pcie_port *port,
 		if (remap != MVEBU_MBUS_NO_REMAP)
 			remap += sz;
 	}
+
+	return 0;
 }
 
-static void mvebu_pcie_set_window(struct mvebu_pcie_port *port,
+static int mvebu_pcie_set_window(struct mvebu_pcie_port *port,
 				  unsigned int target, unsigned int attribute,
 				  const struct mvebu_pcie_window *desired,
 				  struct mvebu_pcie_window *cur)
 {
+	int ret;
+
 	if (desired->base == cur->base && desired->remap == cur->remap &&
 	    desired->size == cur->size)
-		return;
+		return 0;
 
 	if (cur->size != 0) {
 		mvebu_pcie_del_windows(port, cur->base, cur->size);
@@ -369,30 +373,35 @@  static void mvebu_pcie_set_window(struct mvebu_pcie_port *port,
 	}
 
 	if (desired->size == 0)
-		return;
+		return 0;
+
+	ret = mvebu_pcie_add_windows(port, target, attribute, desired->base,
+				     desired->size, desired->remap);
+	if (ret) {
+		cur->size = 0;
+		cur->base = 0;
+		return ret;
+	}
 
-	mvebu_pcie_add_windows(port, target, attribute, desired->base,
-			       desired->size, desired->remap);
 	*cur = *desired;
+	return 0;
 }
 
-static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port)
+static int mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port)
 {
 	struct mvebu_pcie_window desired = {};
 	struct pci_bridge_emul_conf *conf = &port->bridge.conf;
 
 	/* Are the new iobase/iolimit values invalid? */
 	if (conf->iolimit < conf->iobase ||
-	    conf->iolimitupper < conf->iobaseupper) {
-		mvebu_pcie_set_window(port, port->io_target, port->io_attr,
-				      &desired, &port->iowin);
-		return;
-	}
+	    conf->iolimitupper < conf->iobaseupper)
+		return mvebu_pcie_set_window(port, port->io_target, port->io_attr,
+					     &desired, &port->iowin);
 
 	if (!mvebu_has_ioport(port)) {
 		dev_WARN(&port->pcie->pdev->dev,
 			 "Attempt to set IO when IO is disabled\n");
-		return;
+		return -EOPNOTSUPP;
 	}
 
 	/*
@@ -410,21 +419,19 @@  static void mvebu_pcie_handle_iobase_change(struct mvebu_pcie_port *port)
 			desired.remap) +
 		       1;
 
-	mvebu_pcie_set_window(port, port->io_target, port->io_attr, &desired,
-			      &port->iowin);
+	return mvebu_pcie_set_window(port, port->io_target, port->io_attr, &desired,
+				     &port->iowin);
 }
 
-static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
+static int mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
 {
 	struct mvebu_pcie_window desired = {.remap = MVEBU_MBUS_NO_REMAP};
 	struct pci_bridge_emul_conf *conf = &port->bridge.conf;
 
 	/* Are the new membase/memlimit values invalid? */
-	if (conf->memlimit < conf->membase) {
-		mvebu_pcie_set_window(port, port->mem_target, port->mem_attr,
-				      &desired, &port->memwin);
-		return;
-	}
+	if (conf->memlimit < conf->membase)
+		return mvebu_pcie_set_window(port, port->mem_target, port->mem_attr,
+					     &desired, &port->memwin);
 
 	/*
 	 * We read the PCI-to-PCI bridge emulated registers, and
@@ -436,8 +443,8 @@  static void mvebu_pcie_handle_membase_change(struct mvebu_pcie_port *port)
 	desired.size = (((conf->memlimit & 0xFFF0) << 16) | 0xFFFFF) -
 		       desired.base + 1;
 
-	mvebu_pcie_set_window(port, port->mem_target, port->mem_attr, &desired,
-			      &port->memwin);
+	return mvebu_pcie_set_window(port, port->mem_target, port->mem_attr, &desired,
+				     &port->memwin);
 }
 
 static pci_bridge_emul_read_status_t
@@ -522,15 +529,36 @@  mvebu_pci_bridge_emul_base_conf_write(struct pci_bridge_emul *bridge,
 		break;
 
 	case PCI_IO_BASE:
-		mvebu_pcie_handle_iobase_change(port);
+		if ((mask & 0xffff) && mvebu_pcie_handle_iobase_change(port)) {
+			/* On error disable IO range */
+			conf->iobase &= ~0xf0;
+			conf->iolimit &= ~0xf0;
+			conf->iobaseupper = cpu_to_le16(0x0000);
+			conf->iolimitupper = cpu_to_le16(0x0000);
+			if (mvebu_has_ioport(port))
+				conf->iobase |= 0xf0;
+		}
 		break;
 
 	case PCI_MEMORY_BASE:
-		mvebu_pcie_handle_membase_change(port);
+		if (mvebu_pcie_handle_membase_change(port)) {
+			/* On error disable mem range */
+			conf->membase = cpu_to_le16(le16_to_cpu(conf->membase) & ~0xfff0);
+			conf->memlimit = cpu_to_le16(le16_to_cpu(conf->memlimit) & ~0xfff0);
+			conf->membase = cpu_to_le16(le16_to_cpu(conf->membase) | 0xfff0);
+		}
 		break;
 
 	case PCI_IO_BASE_UPPER16:
-		mvebu_pcie_handle_iobase_change(port);
+		if (mvebu_pcie_handle_iobase_change(port)) {
+			/* On error disable IO range */
+			conf->iobase &= ~0xf0;
+			conf->iolimit &= ~0xf0;
+			conf->iobaseupper = cpu_to_le16(0x0000);
+			conf->iolimitupper = cpu_to_le16(0x0000);
+			if (mvebu_has_ioport(port))
+				conf->iobase |= 0xf0;
+		}
 		break;
 
 	case PCI_PRIMARY_BUS: