
[v2,1/2] PCI: Take multifunction devices into account when distributing resources

Message ID 20221114115953.40236-1-mika.westerberg@linux.intel.com
State New
Series [v2,1/2] PCI: Take multifunction devices into account when distributing resources

Commit Message

Mika Westerberg Nov. 14, 2022, 11:59 a.m. UTC
A PCIe switch upstream port may be one of the functions of a multifunction
device. The resource distribution code does not take this into account
properly and therefore it expands the upstream port resource windows too
much, not leaving space for the other functions (in the multifunction
device). This leads to an issue that Jonathan reported. He runs QEMU
with the following topology (QEMU parameters):

 -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2	\
 -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on	\
 -device e1000,bus=root_port13,addr=0.1 			\
 -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3	\
 -device e1000,bus=fun1

The first e1000 NIC here is another function in the switch upstream
port. This leads to following errors:

  pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
  pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
  pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
  e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]

Fix this by taking the possible multifunction devices into account when
upstream port resources are distributed.
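
For illustration, assuming the memory window alignment here is 1 MiB: the
root port above offers a 2 MiB window [mem 0x10200000-0x103fffff] to bus 02.
Before handing that window down to bus 03, the peer function's 128 KiB BAR
is rounded up to the window alignment and subtracted from the available
space, so the upstream port no longer consumes the entire window and room
is left for 02:00.1.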

Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
---
The previous version of the series can be found here:

  https://lore.kernel.org/linux-pci/20221103103254.30497-1-mika.westerberg@linux.intel.com/

Changes from v1:
  * Re-worded the commit message to hopefully explain the problem
    better
  * Added Link: to the bug report
  * Update the comment according to Bjorn's suggestion
  * Dropped the ->multifunction check
  * Use %#llx in log format.

 drivers/pci/setup-bus.c | 56 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 52 insertions(+), 4 deletions(-)

Comments

Wysocki, Rafael J Nov. 14, 2022, 6:34 p.m. UTC | #1
On 11/14/2022 12:59 PM, Mika Westerberg wrote:
> PCIe switch upstream port may be one of the functions of a multifunction
> device. The resource distribution code does not take this into account
> properly and therefore it expands the upstream port resource windows too
> much, not leaving space for the other functions (in the multifunction
> device) and this leads to an issue that Jonathan reported. He runs QEMU
> with the following topoology (QEMU parameters):
>
>   -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2	\
>   -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on	\
>   -device e1000,bus=root_port13,addr=0.1 			\
>   -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3	\
>   -device e1000,bus=fun1
>
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
>
>    pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
>    pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
>    pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
>    e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
>
> Fix this by taking into account the possible multifunction devices when
> uptream port resources are distributed.
>
> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

for both patches in this series.


> ---
> The previous version of the series can be found here:
>
>    https://lore.kernel.org/linux-pci/20221103103254.30497-1-mika.westerberg@linux.intel.com/
>
> Changes from v1:
>    * Re-worded the commit message to hopefully explain the problem
>      better
>    * Added Link: to the bug report
>    * Update the comment according to Bjorn's suggestion
>    * Dropped the ->multifunction check
>    * Use %#llx in log format.
>
>   drivers/pci/setup-bus.c | 56 ++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 52 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..f3f39aa82dda 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1830,10 +1830,58 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>   	 * bridges below.
>   	 */
>   	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> +		/* Upstream port must be the first */
> +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> +		if (!bridge->subordinate)
> +			return;
> +
> +		/*
> +		 * It is possible to have switch upstream port as a part of a
> +		 * multifunction device. For this reason reduce the space
> +		 * available for distribution by the amount required by the
> +		 * peers of the upstream port.
> +		 */
> +		list_for_each_entry(dev, &bus->devices, bus_list) {
> +			int i;
> +
> +			if (dev == bridge)
> +				continue;
> +
> +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +				const struct resource *dev_res = &dev->resource[i];
> +				resource_size_t dev_sz;
> +				struct resource *b_res;
> +
> +				if (dev_res->flags & IORESOURCE_IO) {
> +					b_res = &io;
> +				} else if (dev_res->flags & IORESOURCE_MEM) {
> +					if (dev_res->flags & IORESOURCE_PREFETCH)
> +						b_res = &mmio_pref;
> +					else
> +						b_res = &mmio;
> +				} else {
> +					continue;
> +				}
> +
> +				/* Size aligned to bridge window */
> +				align = pci_resource_alignment(bridge, b_res);
> +				dev_sz = ALIGN(resource_size(dev_res), align);
> +
> +				pci_dbg(dev, "%pR aligned to %#llx\n", dev_res,
> +					(unsigned long long)dev_sz);
> +
> +				if (dev_sz >= resource_size(b_res))
> +					memset(b_res, 0, sizeof(*b_res));
> +				else
> +					b_res->end -= dev_sz;
> +
> +				pci_dbg(bridge, "updated available to %pR\n", b_res);
> +			}
> +		}
> +
> +		pci_bus_distribute_available_resources(bridge->subordinate,
> +						       add_list, io, mmio,
> +						       mmio_pref);
>   		return;
>   	}
>
Jonathan Cameron Nov. 16, 2022, 9:46 a.m. UTC | #2
On Mon, 14 Nov 2022 13:59:52 +0200
Mika Westerberg <mika.westerberg@linux.intel.com> wrote:

> PCIe switch upstream port may be one of the functions of a multifunction
> device. The resource distribution code does not take this into account
> properly and therefore it expands the upstream port resource windows too
> much, not leaving space for the other functions (in the multifunction
> device) and this leads to an issue that Jonathan reported. He runs QEMU
> with the following topoology (QEMU parameters):
> 
>  -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2	\
>  -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on	\
>  -device e1000,bus=root_port13,addr=0.1 			\
>  -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3	\
>  -device e1000,bus=fun1
Other than the fact that the example makes me look like a crazed maniac
(the wonder of minimal test cases)...

One comment inline but either way as far as I can tell (not being
particularly familiar with this code) it looks good to me.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


> 
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
> 
>   pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
>   pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
>   pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
>   e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> 
> Fix this by taking into account the possible multifunction devices when
> uptream port resources are distributed.
> 
> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
> The previous version of the series can be found here:
> 
>   https://lore.kernel.org/linux-pci/20221103103254.30497-1-mika.westerberg@linux.intel.com/
> 
> Changes from v1:
>   * Re-worded the commit message to hopefully explain the problem
>     better
>   * Added Link: to the bug report
>   * Update the comment according to Bjorn's suggestion
>   * Dropped the ->multifunction check
>   * Use %#llx in log format.
> 
>  drivers/pci/setup-bus.c | 56 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..f3f39aa82dda 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1830,10 +1830,58 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	 * bridges below.
>  	 */
>  	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> +		/* Upstream port must be the first */
> +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> +		if (!bridge->subordinate)
> +			return;
> +
> +		/*
> +		 * It is possible to have switch upstream port as a part of a
> +		 * multifunction device. For this reason reduce the space
> +		 * available for distribution by the amount required by the
> +		 * peers of the upstream port.
> +		 */
> +		list_for_each_entry(dev, &bus->devices, bus_list) {

You 'could' use list_for_each_entry_continue().
It might be a tiny bit tidier, but meh, the current logic is pretty clear anyway
and avoids the need to take a copy of the pointer to the first element.
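
For reference, a rough (untested) sketch of that alternative, just to show
the shape; the per-resource accounting from the patch would go inside the
loop body unchanged:

	dev = bridge;
	/*
	 * Continue the walk from the entry after the upstream port, so
	 * the "dev == bridge" check is no longer needed.
	 */
	list_for_each_entry_continue(dev, &bus->devices, bus_list) {
		/* ... account for dev's resources as in the patch ... */
	}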

> +			int i;
> +
> +			if (dev == bridge)
> +				continue;
> +
> +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +				const struct resource *dev_res = &dev->resource[i];
> +				resource_size_t dev_sz;
> +				struct resource *b_res;
> +
> +				if (dev_res->flags & IORESOURCE_IO) {
> +					b_res = &io;
> +				} else if (dev_res->flags & IORESOURCE_MEM) {
> +					if (dev_res->flags & IORESOURCE_PREFETCH)
> +						b_res = &mmio_pref;
> +					else
> +						b_res = &mmio;
> +				} else {
> +					continue;
> +				}
> +
> +				/* Size aligned to bridge window */
> +				align = pci_resource_alignment(bridge, b_res);
> +				dev_sz = ALIGN(resource_size(dev_res), align);
> +
> +				pci_dbg(dev, "%pR aligned to %#llx\n", dev_res,
> +					(unsigned long long)dev_sz);
> +
> +				if (dev_sz >= resource_size(b_res))
> +					memset(b_res, 0, sizeof(*b_res));
> +				else
> +					b_res->end -= dev_sz;
> +
> +				pci_dbg(bridge, "updated available to %pR\n", b_res);
> +			}
> +		}
> +
> +		pci_bus_distribute_available_resources(bridge->subordinate,
> +						       add_list, io, mmio,
> +						       mmio_pref);
>  		return;
>  	}
>
Bjorn Helgaas Nov. 17, 2022, 11:10 p.m. UTC | #3
On Mon, Nov 14, 2022 at 01:59:52PM +0200, Mika Westerberg wrote:
> PCIe switch upstream port may be one of the functions of a multifunction
> device.

I don't think this is specific to PCIe, is it?  Can't we have a
multi-function device where one function is a conventional PCI bridge?
Actually, I don't think "multi-function" is relevant at all -- you
iterate over all the devices on the bus below.  For PCIe, that likely
means multiple functions of the same device, but it could be separate
devices in conventional PCI.

> The resource distribution code does not take this into account
> properly and therefore it expands the upstream port resource windows too
> much, not leaving space for the other functions (in the multifunction
> device)

I guess the window expansion here is done by adjust_bridge_window()?

> and this leads to an issue that Jonathan reported. He runs QEMU
> with the following topoology (QEMU parameters):
> 
>  -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2	\
>  -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on	\
>  -device e1000,bus=root_port13,addr=0.1 			\
>  -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3	\
>  -device e1000,bus=fun1
> 
> The first e1000 NIC here is another function in the switch upstream
> port. This leads to following errors:
> 
>   pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
>   pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
>   pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
>   e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]

To make sure I have this right:

  00:04.0 Root Port; makes [mem 0x10200000-0x103fffff] available on bus 02
  02:00.0 Switch Upstream Port; makes that entire window available on bus 03
  02:00.1 NIC (nothing left for it)

> Fix this by taking into account the possible multifunction devices when
> uptream port resources are distributed.
> 
> Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> ---
> The previous version of the series can be found here:
> 
>   https://lore.kernel.org/linux-pci/20221103103254.30497-1-mika.westerberg@linux.intel.com/
> 
> Changes from v1:
>   * Re-worded the commit message to hopefully explain the problem
>     better
>   * Added Link: to the bug report
>   * Update the comment according to Bjorn's suggestion
>   * Dropped the ->multifunction check
>   * Use %#llx in log format.
> 
>  drivers/pci/setup-bus.c | 56 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 52 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index b4096598dbcb..f3f39aa82dda 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1830,10 +1830,58 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	 * There is only one bridge on the bus so it gets all available
>  	 * resources which it can then distribute to the possible hotplug
>  	 * bridges below.

This comment might need to be updated (even if there's only one
bridge, we're going to account for other functions in the
multi-function device).

But might we not have other devices on the bus even if they're not in
the same multi-function device?  What if we had this scenario?

  00:1f.0 bridge window [mem 0x10200000-0x103fffff] to [bus 04-05]
  04:00.0 bridge to [bus 05]
  04:01.0 NIC [mem size 0x00020000]
  04:02.0 NIC [mem size 0x00020000]

We can't let 04:00.0 route the entire [mem 0x10200000-0x103fffff]
window to bus 05.

>  	if (hotplug_bridges + normal_bridges == 1) {
> -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> -		if (dev->subordinate)
> -			pci_bus_distribute_available_resources(dev->subordinate,
> -				add_list, io, mmio, mmio_pref);
> +		/* Upstream port must be the first */
> +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> +		if (!bridge->subordinate)
> +			return;
> +
> +		/*
> +		 * It is possible to have switch upstream port as a part of a
> +		 * multifunction device. For this reason reduce the space
> +		 * available for distribution by the amount required by the
> +		 * peers of the upstream port.
> +		 */
> +		list_for_each_entry(dev, &bus->devices, bus_list) {

It seems like maybe we ought to do this regardless of how many bridges
there are on the bus.  Don't we always want to assign space to devices
on this bus before distributing the leftovers to downstream buses?
E.g., maybe this should be done before the adjust_bridge_window()
calls?

> +			int i;
> +
> +			if (dev == bridge)
> +				continue;
> +
> +			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +				const struct resource *dev_res = &dev->resource[i];
> +				resource_size_t dev_sz;
> +				struct resource *b_res;
> +
> +				if (dev_res->flags & IORESOURCE_IO) {
> +					b_res = &io;
> +				} else if (dev_res->flags & IORESOURCE_MEM) {
> +					if (dev_res->flags & IORESOURCE_PREFETCH)
> +						b_res = &mmio_pref;
> +					else
> +						b_res = &mmio;
> +				} else {
> +					continue;
> +				}
> +
> +				/* Size aligned to bridge window */
> +				align = pci_resource_alignment(bridge, b_res);
> +				dev_sz = ALIGN(resource_size(dev_res), align);
> +
> +				pci_dbg(dev, "%pR aligned to %#llx\n", dev_res,
> +					(unsigned long long)dev_sz);
> +
> +				if (dev_sz >= resource_size(b_res))
> +					memset(b_res, 0, sizeof(*b_res));
> +				else
> +					b_res->end -= dev_sz;
> +
> +				pci_dbg(bridge, "updated available to %pR\n", b_res);
> +			}
> +		}
> +
> +		pci_bus_distribute_available_resources(bridge->subordinate,
> +						       add_list, io, mmio,
> +						       mmio_pref);
>  		return;
>  	}
>  
> -- 
> 2.35.1
>
Mika Westerberg Nov. 18, 2022, 8:57 a.m. UTC | #4
On Thu, Nov 17, 2022 at 05:10:34PM -0600, Bjorn Helgaas wrote:
> On Mon, Nov 14, 2022 at 01:59:52PM +0200, Mika Westerberg wrote:
> > PCIe switch upstream port may be one of the functions of a multifunction
> > device.
> 
> I don't think this is specific to PCIe, is it?  Can't we have a
> multi-function device where one function is a conventional PCI bridge?
> Actually, I don't think "multi-function" is relevant at all -- you
> iterate over all the devices on the bus below.  For PCIe, that likely
> means multiple functions of the same device, but it could be separate
> devices in conventional PCI.

Yes it can be but I was trying to explain the problem we encountered and
that's related to PCIe.

I can leave this out if you think it is better that way.

> > The resource distribution code does not take this into account
> > properly and therefore it expands the upstream port resource windows too
> > much, not leaving space for the other functions (in the multifunction
> > device)
> 
> I guess the window expansion here is done by adjust_bridge_window()?

Yes but the resources are distributed in pci_bus_distribute_available_resources().

> 
> > and this leads to an issue that Jonathan reported. He runs QEMU
> > with the following topoology (QEMU parameters):
> > 
> >  -device pcie-root-port,port=0,id=root_port13,chassis=0,slot=2	\
> >  -device x3130-upstream,id=sw1,bus=root_port13,multifunction=on	\
> >  -device e1000,bus=root_port13,addr=0.1 			\
> >  -device xio3130-downstream,id=fun1,bus=sw1,chassis=0,slot=3	\
> >  -device e1000,bus=fun1
> > 
> > The first e1000 NIC here is another function in the switch upstream
> > port. This leads to following errors:
> > 
> >   pci 0000:00:04.0: bridge window [mem 0x10200000-0x103fffff] to [bus 02-04]
> >   pci 0000:02:00.0: bridge window [mem 0x10200000-0x103fffff] to [bus 03-04]
> >   pci 0000:02:00.1: BAR 0: failed to assign [mem size 0x00020000]
> >   e1000 0000:02:00.1: can't ioremap BAR 0: [??? 0x00000000 flags 0x0]
> 
> To make sure I have this right:
> 
>   00:04.0 Root Port; makes [mem 0x10200000-0x103fffff] available on bus 02
>   02:00.0 Switch Upstream Port; makes that entire window available on bus 03
>   02:00.1 NIC (nothing left for it)

Correct

> 
> > Fix this by taking into account the possible multifunction devices when
> > uptream port resources are distributed.
> > 
> > Link: https://lore.kernel.org/linux-pci/20221014124553.0000696f@huawei.com/
> > Reported-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > ---
> > The previous version of the series can be found here:
> > 
> >   https://lore.kernel.org/linux-pci/20221103103254.30497-1-mika.westerberg@linux.intel.com/
> > 
> > Changes from v1:
> >   * Re-worded the commit message to hopefully explain the problem
> >     better
> >   * Added Link: to the bug report
> >   * Update the comment according to Bjorn's suggestion
> >   * Dropped the ->multifunction check
> >   * Use %#llx in log format.
> > 
> >  drivers/pci/setup-bus.c | 56 ++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 52 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index b4096598dbcb..f3f39aa82dda 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -1830,10 +1830,58 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
> >  	 * There is only one bridge on the bus so it gets all available
> >  	 * resources which it can then distribute to the possible hotplug
> >  	 * bridges below.
> 
> This comment might need to be updated (even if there's only one
> bridge, we're going to account for other functions in the
> multi-function device).
> 
> But might we not have other devices on the bus even if they're not in
> the same multi-function device?  What if we had this scenario?
> 
>   00:1f.0 bridge window [mem 0x10200000-0x103fffff] to [bus 04-05]
>   04:00.0 bridge to [bus 05]
>   04:01.0 NIC [mem size 0x00020000]
>   04:02.0 NIC [mem size 0x00020000]
> 
> We can't let 04:00.0 route the entire [mem 0x10200000-0x103fffff]
> window to bus 05.

Good point. I will update the comment accordingly.

> >  	if (hotplug_bridges + normal_bridges == 1) {
> > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > -		if (dev->subordinate)
> > -			pci_bus_distribute_available_resources(dev->subordinate,
> > -				add_list, io, mmio, mmio_pref);
> > +		/* Upstream port must be the first */
> > +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > +		if (!bridge->subordinate)
> > +			return;
> > +
> > +		/*
> > +		 * It is possible to have switch upstream port as a part of a
> > +		 * multifunction device. For this reason reduce the space
> > +		 * available for distribution by the amount required by the
> > +		 * peers of the upstream port.
> > +		 */
> > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> 
> It seems like maybe we ought to do this regardless of how many bridges
> there are on the bus.  Don't we always want to assign space to devices
> on this bus before distributing the leftovers to downstream buses?

Yes we do.

> E.g., maybe this should be done before the adjust_bridge_window()
> calls?

With the current code it is clear that we deal with the upstream port.
At least in PCIe it is not allowed to have anything other than downstream
ports on that internal bus, so the only case we would need to do this is
the switch upstream port.

Let me know if you still want me to move this before adjust_bridge_window();
I can do that in v3. Probably needs a comment too.
Bjorn Helgaas Nov. 18, 2022, 12:29 p.m. UTC | #5
Hi Mika,

On Fri, Nov 18, 2022 at 10:57:12AM +0200, Mika Westerberg wrote:
> On Thu, Nov 17, 2022 at 05:10:34PM -0600, Bjorn Helgaas wrote:
> > On Mon, Nov 14, 2022 at 01:59:52PM +0200, Mika Westerberg wrote:
> > > PCIe switch upstream port may be one of the functions of a multifunction
> > > device.
> > 
> > I don't think this is specific to PCIe, is it?  Can't we have a
> > multi-function device where one function is a conventional PCI bridge?
> > Actually, I don't think "multi-function" is relevant at all -- you
> > iterate over all the devices on the bus below.  For PCIe, that likely
> > means multiple functions of the same device, but it could be separate
> > devices in conventional PCI.
> 
> Yes it can be but I was trying to explain the problem we encountered and
> that's related to PCIe.
> 
> I can leave this out if you think it is better that way.

Not necessarily, I'm just hoping this change is generic enough for all
PCI and PCIe topologies.

> > > The resource distribution code does not take this into account
> > > properly and therefore it expands the upstream port resource windows too
> > > much, not leaving space for the other functions (in the multifunction
> > > device)
> > 
> > I guess the window expansion here is done by adjust_bridge_window()?
> 
> Yes but the resources are distributed in
> pci_bus_distribute_available_resources().

Yep, sounds good, I was just confirming my understanding of the code.
The main point of this patch is to *reduce* the size of the windows to
leave room for peers of the bridge, so I had to look a bit to figure
out where they got expanded.

> > >  	if (hotplug_bridges + normal_bridges == 1) {
> > > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > -		if (dev->subordinate)
> > > -			pci_bus_distribute_available_resources(dev->subordinate,
> > > -				add_list, io, mmio, mmio_pref);
> > > +		/* Upstream port must be the first */
> > > +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > +		if (!bridge->subordinate)
> > > +			return;
> > > +
> > > +		/*
> > > +		 * It is possible to have switch upstream port as a part of a
> > > +		 * multifunction device. For this reason reduce the space
> > > +		 * available for distribution by the amount required by the
> > > +		 * peers of the upstream port.
> > > +		 */
> > > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> > 
> > It seems like maybe we ought to do this regardless of how many bridges
> > there are on the bus.  Don't we always want to assign space to devices
> > on this bus before distributing the leftovers to downstream buses?
> 
> Yes we do.
> 
> > E.g., maybe this should be done before the adjust_bridge_window()
> > calls?
> 
> With the current code it is clear that we deal with the upstream port.
> At least in PCIe it is not allowed to have anything else than downstream
> ports on that internal bus so the only case we would need to do this is
> the switch upstream port.
> 
> Let me know if you still want me to move this before adjust_bridge_window()
> I can do that in v3. Probably needs a comment too.

Hmm, I don't know exactly how to do this, but I don't think this code
should be PCIe-specific, which I think means it shouldn't depend on
how many bridges are on the bus.

I guess the existing assumption that a bridge must be the first device
on the bus is a hidden assumption that this is PCIe.  That might be a
mistake from the past.

I haven't tried it, but I wonder if we could reproduce the same
problem in a conventional PCI topology with qemu.

Bjorn
Mika Westerberg Nov. 21, 2022, 11:47 a.m. UTC | #6
Hi,

On Fri, Nov 18, 2022 at 06:29:51AM -0600, Bjorn Helgaas wrote:
> Hi Mika,
> 
> On Fri, Nov 18, 2022 at 10:57:12AM +0200, Mika Westerberg wrote:
> > On Thu, Nov 17, 2022 at 05:10:34PM -0600, Bjorn Helgaas wrote:
> > > On Mon, Nov 14, 2022 at 01:59:52PM +0200, Mika Westerberg wrote:
> > > > PCIe switch upstream port may be one of the functions of a multifunction
> > > > device.
> > > 
> > > I don't think this is specific to PCIe, is it?  Can't we have a
> > > multi-function device where one function is a conventional PCI bridge?
> > > Actually, I don't think "multi-function" is relevant at all -- you
> > > iterate over all the devices on the bus below.  For PCIe, that likely
> > > means multiple functions of the same device, but it could be separate
> > > devices in conventional PCI.
> > 
> > Yes it can be but I was trying to explain the problem we encountered and
> > that's related to PCIe.
> > 
> > I can leave this out if you think it is better that way.
> 
> Not necessarily, I'm just hoping this change is generic enough for all
> PCI and PCIe topologies.

Okay maybe I'll just drop the "multi-function" part of it?

> > > > The resource distribution code does not take this into account
> > > > properly and therefore it expands the upstream port resource windows too
> > > > much, not leaving space for the other functions (in the multifunction
> > > > device)
> > > 
> > > I guess the window expansion here is done by adjust_bridge_window()?
> > 
> > Yes but the resources are distributed in
> > pci_bus_distribute_available_resources().
> 
> Yep, sounds good, I was just confirming my understanding of the code.
> The main point of this patch is to *reduce* the size of the windows to
> leave room for peers of the bridge, so I had to look a bit to figure
> out where they got expanded.

Understood :)

> > > >  	if (hotplug_bridges + normal_bridges == 1) {
> > > > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > > -		if (dev->subordinate)
> > > > -			pci_bus_distribute_available_resources(dev->subordinate,
> > > > -				add_list, io, mmio, mmio_pref);
> > > > +		/* Upstream port must be the first */
> > > > +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > > +		if (!bridge->subordinate)
> > > > +			return;
> > > > +
> > > > +		/*
> > > > +		 * It is possible to have switch upstream port as a part of a
> > > > +		 * multifunction device. For this reason reduce the space
> > > > +		 * available for distribution by the amount required by the
> > > > +		 * peers of the upstream port.
> > > > +		 */
> > > > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> > > 
> > > It seems like maybe we ought to do this regardless of how many bridges
> > > there are on the bus.  Don't we always want to assign space to devices
> > > on this bus before distributing the leftovers to downstream buses?
> > 
> > Yes we do.
> > 
> > > E.g., maybe this should be done before the adjust_bridge_window()
> > > calls?
> > 
> > With the current code it is clear that we deal with the upstream port.
> > At least in PCIe it is not allowed to have anything else than downstream
> > ports on that internal bus so the only case we would need to do this is
> > the switch upstream port.
> > 
> > Let me know if you still want me to move this before adjust_bridge_window()
> > I can do that in v3. Probably needs a comment too.
> 
> Hmm, I don't know exactly how to do this, but I don't think this code
> should be PCIe-specific, which I think means it shouldn't depend on
> how many bridges are on the bus.
> 
> I guess the existing assumption that a bridge must be the first device
> on the bus is a hidden assumption that this is PCIe.  That might be a
> mistake from the past.
> 
> I haven't tried it, but I wonder if we could reproduce the same
> problem in a conventional PCI topology with qemu.

I'm not an expert in QEMU, but I think I was able to create such a topology
with the following command line (parts copied from Jonathan's):

qemu-system-aarch64                                                             \
        -M virt,nvdimm=on,gic-version=3 -m 4g,maxmem=8G,slots=8 -cpu max -smp 4 \
        -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd                           \
        -nographic -no-reboot                                                   \
        -kernel Image                          					\
        -initrd rootfs.cpio.bz2              					\
        -device pcie-pci-bridge,id=br1                                          \
        -device pci-bridge,chassis_nr=1,bus=br1,id=br2,shpc=on,addr=2           \
        -device e1000,bus=br1,addr=3                                            \
        -device e1000,bus=br1,addr=4                                            \
        -device e1000,bus=br2,addr=5

However, the problem does not reproduce with or without the patch. Below is
'lspci -v' output without this patch and with 5632e2beaf9d5dda694c0572684dea783d8a9492 reverted:

00:02.0 PCI bridge: Red Hat, Inc. Device 000e (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 47
        Memory at 8000004000 (64-bit, non-prefetchable) [size=256]
        Bus: primary=00, secondary=01, subordinate=02, sec-latency=0
        I/O behind bridge: 1000-2fff [size=8K] [16-bit]
        Memory behind bridge: 10000000-102fffff [size=3M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Capabilities: [8c] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [84] Power Management version 3
        Capabilities: [48] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
        Capabilities: [40] Hot-plug capable
        Capabilities: [100] Advanced Error Reporting

01:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 45
        Memory at 10240000 (64-bit, non-prefetchable) [size=256]
        Bus: primary=01, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 1000-1fff [size=4K] [16-bit]
        Memory behind bridge: 10000000-100fffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Capabilities: [4c] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [48] Slot ID: 0 slots, First+, chassis 01
        Capabilities: [40] Hot-plug capable

01:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 46
        Memory at 10200000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2000 [size=64]
        Expansion ROM at 10100000 [disabled] [size=512K]
        Kernel driver in use: e1000

01:04.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 47
        Memory at 10220000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2040 [size=64]
        Expansion ROM at 10180000 [disabled] [size=512K]
        Kernel driver in use: e1000

02:05.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 46
        Memory at 10080000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 1000 [size=64]
        Expansion ROM at 10000000 [disabled] [size=512K]
        Kernel driver in use: e1000

And with this patch (5632e2beaf9d5dda694c0572684dea783d8a9492 still reverted):

00:02.0 PCI bridge: Red Hat, Inc. Device 000e (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 47
        Memory at 8000004000 (64-bit, non-prefetchable) [size=256]
        Bus: primary=00, secondary=01, subordinate=02, sec-latency=0
        I/O behind bridge: 1000-2fff [size=8K] [16-bit]
        Memory behind bridge: 10000000-102fffff [size=3M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Capabilities: [8c] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [84] Power Management version 3
        Capabilities: [48] Express PCI-Express to PCI/PCI-X Bridge, MSI 00
        Capabilities: [40] Hot-plug capable
        Capabilities: [100] Advanced Error Reporting

01:02.0 PCI bridge: Red Hat, Inc. QEMU PCI-PCI bridge (prog-if 00 [Normal decode])
        Flags: bus master, 66MHz, fast devsel, latency 0, IRQ 45
        Memory at 10240000 (64-bit, non-prefetchable) [size=256]
        Bus: primary=01, secondary=02, subordinate=02, sec-latency=0
        I/O behind bridge: 1000-1fff [size=4K] [16-bit]
        Memory behind bridge: 10000000-100fffff [size=1M] [32-bit]
        Prefetchable memory behind bridge: [disabled] [64-bit]
        Capabilities: [4c] MSI: Enable- Count=1/1 Maskable+ 64bit+
        Capabilities: [48] Slot ID: 0 slots, First+, chassis 01
        Capabilities: [40] Hot-plug capable

01:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 46
        Memory at 10200000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2000 [size=64]
        Expansion ROM at 10100000 [disabled] [size=512K]
        Kernel driver in use: e1000

01:04.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 47
        Memory at 10220000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 2040 [size=64]
        Expansion ROM at 10180000 [disabled] [size=512K]
        Kernel driver in use: e1000

02:05.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
        Subsystem: Red Hat, Inc. QEMU Virtual Machine
        Flags: bus master, fast devsel, latency 0, IRQ 46
        Memory at 10080000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 1000 [size=64]
        Expansion ROM at 10000000 [disabled] [size=512K]
        Kernel driver in use: e1000
Bjorn Helgaas Nov. 21, 2022, 10:45 p.m. UTC | #7
On Mon, Nov 21, 2022 at 01:47:16PM +0200, Mika Westerberg wrote:
> On Fri, Nov 18, 2022 at 06:29:51AM -0600, Bjorn Helgaas wrote:
> > On Fri, Nov 18, 2022 at 10:57:12AM +0200, Mika Westerberg wrote:
> > > On Thu, Nov 17, 2022 at 05:10:34PM -0600, Bjorn Helgaas wrote:
> > > > On Mon, Nov 14, 2022 at 01:59:52PM +0200, Mika Westerberg wrote:
> > > > > PCIe switch upstream port may be one of the functions of a multifunction
> > > > > device.
> > > > 
> > > > I don't think this is specific to PCIe, is it?  Can't we have a
> > > > multi-function device where one function is a conventional PCI bridge?
> > > > Actually, I don't think "multi-function" is relevant at all -- you
> > > > iterate over all the devices on the bus below.  For PCIe, that likely
> > > > means multiple functions of the same device, but it could be separate
> > > > devices in conventional PCI.
> > > 
> > > Yes it can be but I was trying to explain the problem we encountered and
> > > that's related to PCIe.
> > > 
> > > I can leave this out if you think it is better that way.
> > 
> > Not necessarily, I'm just hoping this change is generic enough for all
> > PCI and PCIe topologies.
> 
> Okay maybe I'll just drop the "multi-function" part of it?
> 
> > > > > The resource distribution code does not take this into account
> > > > > properly and therefore it expands the upstream port resource windows too
> > > > > much, not leaving space for the other functions (in the multifunction
> > > > > device)
> > > > 
> > > > I guess the window expansion here is done by adjust_bridge_window()?
> > > 
> > > Yes but the resources are distributed in
> > > pci_bus_distribute_available_resources().
> > 
> > Yep, sounds good, I was just confirming my understanding of the code.
> > The main point of this patch is to *reduce* the size of the windows to
> > leave room for peers of the bridge, so I had to look a bit to figure
> > out where they got expanded.
> 
> Understood :)
> 
> > > > >  	if (hotplug_bridges + normal_bridges == 1) {
> > > > > -		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > > > -		if (dev->subordinate)
> > > > > -			pci_bus_distribute_available_resources(dev->subordinate,
> > > > > -				add_list, io, mmio, mmio_pref);
> > > > > +		/* Upstream port must be the first */
> > > > > +		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
> > > > > +		if (!bridge->subordinate)
> > > > > +			return;
> > > > > +
> > > > > +		/*
> > > > > +		 * It is possible to have switch upstream port as a part of a
> > > > > +		 * multifunction device. For this reason reduce the space
> > > > > +		 * available for distribution by the amount required by the
> > > > > +		 * peers of the upstream port.
> > > > > +		 */
> > > > > +		list_for_each_entry(dev, &bus->devices, bus_list) {
> > > > 
> > > > It seems like maybe we ought to do this regardless of how many bridges
> > > > there are on the bus.  Don't we always want to assign space to devices
> > > > on this bus before distributing the leftovers to downstream buses?
> > > 
> > > Yes we do.
> > > 
> > > > E.g., maybe this should be done before the adjust_bridge_window()
> > > > calls?
> > > 
> > > With the current code it is clear that we deal with the upstream port.
> > > At least in PCIe it is not allowed to have anything else than downstream
> > > ports on that internal bus so the only case we would need to do this is
> > > the switch upstream port.
> > > 
> > > Let me know if you still want me to move this before adjust_bridge_window()
> > > I can do that in v3. Probably needs a comment too.
> > 
> > Hmm, I don't know exactly how to do this, but I don't think this code
> > should be PCIe-specific, which I think means it shouldn't depend on
> > how many bridges are on the bus.
> > 
> > I guess the existing assumption that a bridge must be the first device
> > on the bus is a hidden assumption that this is PCIe.  That might be a
> > mistake from the past.
> > 
> > I haven't tried it, but I wonder if we could reproduce the same
> > problem in a conventional PCI topology with qemu.
> 
> I'm not an expert in QEMU but I think I was able to create such topology
> with the following command line (parts copied from Jonathan's):
> 
> qemu-system-aarch64                                                             \
>         -M virt,nvdimm=on,gic-version=3 -m 4g,maxmem=8G,slots=8 -cpu max -smp 4 \
>         -bios /usr/share/qemu-efi-aarch64/QEMU_EFI.fd                           \
>         -nographic -no-reboot                                                   \
>         -kernel Image                          					\
>         -initrd rootfs.cpio.bz2              					\
>         -device pcie-pci-bridge,id=br1                                          \
>         -device pci-bridge,chassis_nr=1,bus=br1,id=br2,shpc=on,addr=2           \
>         -device e1000,bus=br1,addr=3                                            \
>         -device e1000,bus=br1,addr=4                                            \
>         -device e1000,bus=br2,addr=5
> 
> However, the problem does not reproduce with or without the patch. The
> below is 'lspci -v' without this patch and 5632e2beaf9d5dda694c0572684dea783d8a9492 reverted:

IIUC, the summary is this:

  00:02.0 bridge window [mem 0x10000000-0x102fffff] to [bus 01-02]
  01:02.0 bridge window [mem 0x10000000-0x100fffff] to [bus 02]
  01:03.0 NIC BAR [mem 0x10200000-0x1021ffff]
  01:04.0 NIC BAR [mem 0x10220000-0x1023ffff]
  02:05.0 NIC BAR [mem 0x10080000-0x1009ffff]

and it's the same with and without the current patch.

Are all these assignments done by BIOS, or did Linux update them?
Did we exercise the same "distribute available resources" path as in
the PCIe case?  I expect we *should*, because there really shouldn't
be any PCI vs PCIe differences in how resources are handled.  This is
why I'm not comfortable with assumptions here that depend on PCIe.

I can't tell from Jonathan's PCIe case whether we got a working config
from BIOS or not because our logging of bridge windows is kind of
poor.

Bjorn
Mika Westerberg Nov. 22, 2022, 6:42 a.m. UTC | #8
Hi,

On Mon, Nov 21, 2022 at 04:45:48PM -0600, Bjorn Helgaas wrote:
> IIUC, the summary is this:
> 
>   00:02.0 bridge window [mem 0x10000000-0x102fffff] to [bus 01-02]
>   01:02.0 bridge window [mem 0x10000000-0x100fffff] to [bus 02]
>   01:03.0 NIC BAR [mem 0x10200000-0x1021ffff]
>   01:04.0 NIC BAR [mem 0x10220000-0x1023ffff]
>   02:05.0 NIC BAR [mem 0x10080000-0x1009ffff]
> 
> and it's the same with and without the current patch.
> 
> Are all these assignments done by BIOS, or did Linux update them?

> Did we exercise the same "distribute available resources" path as in
> the PCIe case?  I expect we *should*, because there really shouldn't
> be any PCI vs PCIe differences in how resources are handled.  This is
> why I'm not comfortable with assumptions here that depend on PCIe.
> 
> I can't tell from Jonathan's PCIe case whether we got a working config
> from BIOS or not because our logging of bridge windows is kind of
> poor.

This is ARM64 so there is no "BIOS" involved (something similar though).

It is the same "system" that Jonathan used where the regression happened
with the multifunction PCIe configuration with the exception that I'm
now using PCI devices instead of PCIe as you asked.

I'm not 100% sure if all the same code paths are used here, though.
Jonathan Cameron Nov. 22, 2022, 11:45 a.m. UTC | #9
On Tue, 22 Nov 2022 08:42:58 +0200
Mika Westerberg <mika.westerberg@linux.intel.com> wrote:

> Hi,
> 
> On Mon, Nov 21, 2022 at 04:45:48PM -0600, Bjorn Helgaas wrote:
> > IIUC, the summary is this:
> > 
> >   00:02.0 bridge window [mem 0x10000000-0x102fffff] to [bus 01-02]
> >   01:02.0 bridge window [mem 0x10000000-0x100fffff] to [bus 02]
> >   01:03.0 NIC BAR [mem 0x10200000-0x1021ffff]
> >   01:04.0 NIC BAR [mem 0x10220000-0x1023ffff]
> >   02:05.0 NIC BAR [mem 0x10080000-0x1009ffff]
> > 
> > and it's the same with and without the current patch.
> > 
> > Are all these assignments done by BIOS, or did Linux update them?  
> 
> > Did we exercise the same "distribute available resources" path as in
> > the PCIe case?  I expect we *should*, because there really shouldn't
> > be any PCI vs PCIe differences in how resources are handled.  This is
> > why I'm not comfortable with assumptions here that depend on PCIe.
> > 
> > I can't tell from Jonathan's PCIe case whether we got a working config
> > from BIOS or not because our logging of bridge windows is kind of
> > poor.  
> 
> This is ARM64 so there is no "BIOS" involved (something similar though).

It's EDK2 in my tests  - so very similar to other arch.
Possible to boot without though and rely on DT, but various things don't
work yet...

> 
> It is the same "system" that Jonathan used where the regression happened
> with the multifunction PCIe configuration with the exception that I'm
> now using PCI devices instead of PCIe as you asked.
> 
> I'm not 100% sure if the all the same code paths are used here, though.
> 

I wondered if it was possibly to do with fairly minimal handling of pci-pxb
(the weird root bridge) in EDK2, so tried the obvious of hanging your PCI
test below one of those rather than directly below a normal bridge.
Despite shuffling things around into configurations
I thought might trigger the problem, it all seems fine.

Note that I can't currently test the pxb-pcie configurations without EDK2
as arm-virt doesn't provide the relevant DT for those root bridges yet
(it's on my todo list as it's a prereq for getting the QEMU CXL ARM64
emulation upstream)

So no guarantees as I don't understand fully why PCI is ending up
with different handling.

From liberal distribution of printk()s it seems that for PCI bridges
pci_bridge_resources_not_assigned() is false, but for the PCI express
example it is true.  My first instinct is that this is a quirk of the EDK2 handling.
I'll have a dig, but I'm not an expert in EDK2 at all, so may not get
to the bottom of this.

Ultimately it seems that when the OS takes over, the prefetchable memory
resources are not configured in the PCIe case, but are in the PCI case.
So we aren't currently walking the new code for PCI.

Jonathan
Mika Westerberg Nov. 22, 2022, 12:21 p.m. UTC | #10
On Tue, Nov 22, 2022 at 11:45:41AM +0000, Jonathan Cameron wrote:
> On Tue, 22 Nov 2022 08:42:58 +0200
> Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> 
> > Hi,
> > 
> > On Mon, Nov 21, 2022 at 04:45:48PM -0600, Bjorn Helgaas wrote:
> > > IIUC, the summary is this:
> > > 
> > >   00:02.0 bridge window [mem 0x10000000-0x102fffff] to [bus 01-02]
> > >   01:02.0 bridge window [mem 0x10000000-0x100fffff] to [bus 02]
> > >   01:03.0 NIC BAR [mem 0x10200000-0x1021ffff]
> > >   01:04.0 NIC BAR [mem 0x10220000-0x1023ffff]
> > >   02:05.0 NIC BAR [mem 0x10080000-0x1009ffff]
> > > 
> > > and it's the same with and without the current patch.
> > > 
> > > Are all these assignments done by BIOS, or did Linux update them?  
> > 
> > > Did we exercise the same "distribute available resources" path as in
> > > the PCIe case?  I expect we *should*, because there really shouldn't
> > > be any PCI vs PCIe differences in how resources are handled.  This is
> > > why I'm not comfortable with assumptions here that depend on PCIe.
> > > 
> > > I can't tell from Jonathan's PCIe case whether we got a working config
> > > from BIOS or not because our logging of bridge windows is kind of
> > > poor.  
> > 
> > This is ARM64 so there is no "BIOS" involved (something similar though).
> 
> It's EDK2 in my tests  - so very similar to other arch.
> Possible to boot without though and rely on DT, but various things don't
> work yet...

Okay.

> > It is the same "system" that Jonathan used where the regression happened
> > with the multifunction PCIe configuration with the exception that I'm
> > now using PCI devices instead of PCIe as you asked.
> > 
> > I'm not 100% sure if the all the same code paths are used here, though.
> > 
> 
> I wondered if it was possibly to do with fairly minimal handling of pci-pxb
> (the weird root bridge) in EDK2, so tried the obvious of hanging your PCI
> test below one of those rather than directly below a normal bridge.
> Despite shuffling things around into configurations
> I thought might trigger the problem, it all seems fine.

I also did some other experiments, like trying to add pcie-root-port first,
but that did not trigger the issue either (unless I missed something).

> Note that I can't currently test the pxb-pcie configurations without EDK2
> as arm-virt doesn't provide the relevant DT for those root bridges yet
> (it's on my todo list as it's a prereq for getting the QEMU CXL ARM64
> emulation upstream)
> 
> So no guarantees as I don't understand fully why PCI is ending up
> with different handling.
> 
> From liberal distribution of printk()s it seems that for PCI bridges
> pci_bridge_resources_not_assigned() is false, but for PCI express
> example it is true.  My first instinct is quirk of the EDK2 handling? 
> I'll have a dig, but I'm not an expert in EDK2 at all, so may not get
> to the bottom of this.
> 
> Ultimately it seems that when the OS takes over the prefetchable memory
> resources are not configured for the PCIe case, but are for the PCI case.
> So we aren't currently walking the new code for PCI.

I think the reason for this "difference" is that we have this in
__pci_bus_size_bridges():

               if (bus->self->is_hotplug_bridge) {
                        additional_io_size  = pci_hotplug_io_size;
                        additional_mmio_size = pci_hotplug_mmio_size;
                        additional_mmio_pref_size = pci_hotplug_mmio_pref_size;
                }

For QEMU PCIe root/downstream ports this might be true, so we end up with
"additional resources" in the resource list and therefore the kernel
tries to do the allocation, whereas in the PCI case it is not. I tried to end
up with the same code path in my command line with this:

  -device pci-bridge,chassis_nr=1,bus=br1,id=br2,shpc=on,addr=2

The "shpc=on" should make it hotplug bridge as well but apparently that
is not happening.
Bjorn Helgaas Nov. 22, 2022, 5:26 p.m. UTC | #11
On Tue, Nov 22, 2022 at 11:45:41AM +0000, Jonathan Cameron wrote:
> On Tue, 22 Nov 2022 08:42:58 +0200
> Mika Westerberg <mika.westerberg@linux.intel.com> wrote:
> > On Mon, Nov 21, 2022 at 04:45:48PM -0600, Bjorn Helgaas wrote:
> > > IIUC, the summary is this:
> > > 
> > >   00:02.0 bridge window [mem 0x10000000-0x102fffff] to [bus 01-02]
> > >   01:02.0 bridge window [mem 0x10000000-0x100fffff] to [bus 02]
> > >   01:03.0 NIC BAR [mem 0x10200000-0x1021ffff]
> > >   01:04.0 NIC BAR [mem 0x10220000-0x1023ffff]
> > >   02:05.0 NIC BAR [mem 0x10080000-0x1009ffff]
> > > 
> > > and it's the same with and without the current patch.
> > > 
> > > Are all these assignments done by BIOS, or did Linux update them?  
> > 
> > > Did we exercise the same "distribute available resources" path as in
> > > the PCIe case?  I expect we *should*, because there really shouldn't
> > > be any PCI vs PCIe differences in how resources are handled.  This is
> > > why I'm not comfortable with assumptions here that depend on PCIe.
> > > 
> > > I can't tell from Jonathan's PCIe case whether we got a working config
> > > from BIOS or not because our logging of bridge windows is kind of
> > > poor.  
> > 
> > This is ARM64 so there is no "BIOS" involved (something similar though).
> 
> It's EDK2 in my tests  - so very similar to other arch.
> Possible to boot without though and rely on DT, but various things don't
> work yet...

Doesn't matter whether it's BIOS/EFI/EDK2/etc.; the question was really
whether anything has programmed BARs and windows before Linux gets
started.

> From liberal distribution of printk()s it seems that for PCI bridges
> pci_bridge_resources_not_assigned() is false, but for PCI express
> example it is true.  My first instinct is quirk of the EDK2 handling? 
> I'll have a dig, but I'm not an expert in EDK2 at all, so may not get
> to the bottom of this.

I don't know what pci_bridge_resources_not_assigned() is.

> Ultimately it seems that when the OS takes over the prefetchable memory
> resources are not configured for the PCIe case, but are for the PCI case.
> So we aren't currently walking the new code for PCI.

Whatever the reason for this is, it doesn't sound like something Linux
should assume or rely on.

Bjorn

Patch

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index b4096598dbcb..f3f39aa82dda 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1830,10 +1830,58 @@  static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 	 * bridges below.
 	 */
 	if (hotplug_bridges + normal_bridges == 1) {
-		dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
-		if (dev->subordinate)
-			pci_bus_distribute_available_resources(dev->subordinate,
-				add_list, io, mmio, mmio_pref);
+		/* Upstream port must be the first */
+		bridge = list_first_entry(&bus->devices, struct pci_dev, bus_list);
+		if (!bridge->subordinate)
+			return;
+
+		/*
+		 * It is possible to have switch upstream port as a part of a
+		 * multifunction device. For this reason reduce the space
+		 * available for distribution by the amount required by the
+		 * peers of the upstream port.
+		 */
+		list_for_each_entry(dev, &bus->devices, bus_list) {
+			int i;
+
+			if (dev == bridge)
+				continue;
+
+			for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+				const struct resource *dev_res = &dev->resource[i];
+				resource_size_t dev_sz;
+				struct resource *b_res;
+
+				if (dev_res->flags & IORESOURCE_IO) {
+					b_res = &io;
+				} else if (dev_res->flags & IORESOURCE_MEM) {
+					if (dev_res->flags & IORESOURCE_PREFETCH)
+						b_res = &mmio_pref;
+					else
+						b_res = &mmio;
+				} else {
+					continue;
+				}
+
+				/* Size aligned to bridge window */
+				align = pci_resource_alignment(bridge, b_res);
+				dev_sz = ALIGN(resource_size(dev_res), align);
+
+				pci_dbg(dev, "%pR aligned to %#llx\n", dev_res,
+					(unsigned long long)dev_sz);
+
+				if (dev_sz >= resource_size(b_res))
+					memset(b_res, 0, sizeof(*b_res));
+				else
+					b_res->end -= dev_sz;
+
+				pci_dbg(bridge, "updated available to %pR\n", b_res);
+			}
+		}
+
+		pci_bus_distribute_available_resources(bridge->subordinate,
+						       add_list, io, mmio,
+						       mmio_pref);
 		return;
 	}