[v5,2/9] PCI: Take bridge window alignment into account when distributing resources

Message ID 20180416103453.46232-3-mika.westerberg@linux.intel.com
State Changes Requested
Delegated to: Bjorn Helgaas
Series
  • PCI: Fixes and cleanups for native PCIe and ACPI hotplug

Commit Message

Mika Westerberg April 16, 2018, 10:34 a.m.
When hot-adding a PCIe switch, the way we currently distribute resources
does not always work well, because devices connected to the switch may
need their MMIO resources aligned to something other than the default
1 MB boundary. For example, the Intel Gigabit ET2 quad port server
adapter includes a PCIe switch leading to four GbE NIC devices that want
their MMIO resources aligned to a 2 MB boundary instead.

The current resource distribution code does not take this alignment into
account and may try to add too much resource space to the extension
hotplug bridge(s). The resulting bridge window is too big, which makes
the resource assignment operation fail, leaving us with a bridge window
holding only the minimal amount (1 MB) of MMIO space.

Here is what happens when an Intel Gigabit ET2 quad port server adapter
is hot-added:

  pci 0000:39:00.0: BAR 14: assigned [mem 0x53300000-0x6a0fffff]
                                          ^^^^^^^^^^
  pci 0000:3a:01.0: BAR 14: assigned [mem 0x53400000-0x547fffff]
                                          ^^^^^^^^^^
The above shows that the downstream bridge (3a:01.0) window is aligned
to 2 MB, unlike the upstream bridge (39:00.0) window, which is aligned
to 1 MB. The remaining MMIO space (0x15a00000) is then assigned to the
hotplug bridge (3a:04.0), but the assignment fails:

  pci 0000:3a:04.0: BAR 14: no space for [mem size 0x15a00000]
  pci 0000:3a:04.0: BAR 14: failed to assign [mem size 0x15a00000]

The MMIO resource is calculated as follows:

  start = 0x54800000
  end = 0x54800000 + 0x15a00000 - 1 = 0x6a1fffff

This results in the bridge window [mem 0x54800000-0x6a1fffff], which
ends after the upstream bridge window [mem 0x53300000-0x6a0fffff],
explaining the failure above. Because of this, Linux falls back to the
default allocation of 1 MB, as can be seen in the 'lspci' output:

 39:00.0 Memory behind bridge: 53300000-6a0fffff [size=366M]
   3a:01.0 Memory behind bridge: 53400000-547fffff [size=20M]
   3a:04.0 Memory behind bridge: 53300000-533fffff [size=1M]

The hotplug bridge 3a:04.0 occupies only a 1 MB MMIO window, which is
clearly not enough for extending the PCIe topology later if more devices
are hot-added.

Fix this by subtracting the properly aligned non-hotplug downstream
bridge window size from the remaining resources used for extension.
After this change the resource allocation looks like this:

  39:00.0 Memory behind bridge: 53300000-6a0fffff [size=366M]
    3a:01.0 Memory behind bridge: 53400000-547fffff [size=20M]
    3a:04.0 Memory behind bridge: 54800000-6a0fffff [size=345M]

This matches the expectation: all the extra MMIO resource space (345 MB)
is allocated to the extension hotplug bridge (3a:04.0).

Fixes: 1a5767725cec ("PCI: Distribute available resources to hotplug-capable bridges")
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: stable@vger.kernel.org
---
 drivers/pci/setup-bus.c | 41 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas April 25, 2018, 10:38 p.m. | #1
On Mon, Apr 16, 2018 at 01:34:46PM +0300, Mika Westerberg wrote:
> When hot-adding a PCIe switch the way we currently distribute resources
> does not always work well because devices connected to the switch might
> need to have their MMIO resources aligned to something else than the
> default 1 MB boundary. For example Intel Gigabit ET2 quad port server
> adapter includes PCIe switch leading to 4 x GbE NIC devices that want
> to have their MMIO resources aligned to 2 MB boundary instead.
> 
> The current resource distribution code does not take this alignment into
> account and might try to add too much resources for the extension
> hotplug bridge(s). The resulting bridge window is too big which makes
> the resource assignment operation fail, and we are left with a bridge
> window with minimal amount (1 MB) of MMIO space.
> 
> Here is what happens when an Intel Gigabit ET2 quad port server adapter
> is hot-added:
> 
>   pci 0000:39:00.0: BAR 14: assigned [mem 0x53300000-0x6a0fffff]
>                                           ^^^^^^^^^^
>   pci 0000:3a:01.0: BAR 14: assigned [mem 0x53400000-0x547fffff]
>                                           ^^^^^^^^^^
> The above shows that the downstream bridge (3a:01.0) window is aligned
> to 2 MB instead of 1 MB as is the upstream bridge (39:00.0) window. The
> remaining MMIO space (0x15a00000) is assigned to the hotplug bridge
> (3a:04.0) but it fails:
> 
>   pci 0000:3a:04.0: BAR 14: no space for [mem size 0x15a00000]
>   pci 0000:3a:04.0: BAR 14: failed to assign [mem size 0x15a00000]
> 
> The MMIO resource is calculated as follows:
> 
>   start = 0x54800000
>   end = 0x54800000 + 0x15a00000 - 1 = 0x6a1fffff
> 
> This results bridge window [mem 0x54800000 - 0x6a1fffff] and it ends
> after the upstream bridge window [mem 0x53300000-0x6a0fffff] explaining
> the above failure. Because of this Linux falls back to the default
> allocation of 1 MB as can be seen from 'lspci' output:
> 
>  39:00.0 Memory behind bridge: 53300000-6a0fffff [size=366M]
>    3a:01.0 Memory behind bridge: 53400000-547fffff [size=20M]
>    3a:04.0 Memory behind bridge: 53300000-533fffff [size=1M]
> 
> The hotplug bridge 3a:04.0 only occupies 1 MB MMIO window which is
> clearly not enough for extending the PCIe topology later if more devices
> are to be hot-added.
> 
> Fix this by substracting properly aligned non-hotplug downstream bridge
> window size from the remaining resources used for extension. After this
> change the resource allocation looks like:
> 
>   39:00.0 Memory behind bridge: 53300000-6a0fffff [size=366M]
>     3a:01.0 Memory behind bridge: 53400000-547fffff [size=20M]
>     3a:04.0 Memory behind bridge: 54800000-6a0fffff [size=345M]
> 
> This matches the expectation. All the extra MMIO resource space (345 MB)
> is allocated to the extension hotplug bridge (3a:04.0).

Sorry, I've spent a lot of time trying to trace through this code, and
I'm still hopelessly confused.  Can you post the complete "lspci -vv"
output and the dmesg log (including the hot-add event) somewhere and
include a URL to it?

I think I understand the problem you're solving:

  - You have 366M, 1M-aligned, available for things on bus 3a
  - You assign 20M, 2M-aligned to 3a:01.0
  - This leaves 346M for other things on bus 3a, but it's not all
    contiguous because the 20M is in the middle.
  - The remaining 346M might be 1M on one side and 345M on the other
    (and there are many other possibilities, e.g., 3M + 343M, 5M +
    341M, ..., 345M + 1M).
  - The current code tries to assign all 346M to 3a:04.0, which
    fails because that space is not contiguous, so it falls back to
    allocating 1M, which works but is insufficient for future
    hot-adds.

Obviously this patch makes *this* situation work: it assigns 345M to
3a:04.0 and (I assume) leaves the 1M unused.  But I haven't been able
to convince myself that this patch works *in general*.

For example, what if we assigned the 20M from the end of the 366M
window instead of the beginning, so the 345M piece is below the 20M
and there's 1M left above it?  That is legal and should work, but I
suspect this patch would ignore the 345M piece and again assign 1M to
3a:04.0.

Or what if there are several hotplug bridges on bus 3a?  This example
has two, but there could be many more.

Or what if there are normal bridges as well as hotplug bridges on bus
3a?  Or if they're in arbitrary orders?

> Fixes: 1a5767725cec ("PCI: Distribute available resources to hotplug-capable bridges")
> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> Cc: stable@vger.kernel.org

Given my confusion about this, I doubt this satisfies the stable
kernel "obviously correct" rule.

s/substracting/subtracting/ above

> ---
>  drivers/pci/setup-bus.c | 41 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 40 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 072784f55ea5..eb3059fb7f63 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -1878,6 +1878,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  	resource_size_t available_mmio, resource_size_t available_mmio_pref)
>  {
>  	resource_size_t remaining_io, remaining_mmio, remaining_mmio_pref;
> +	resource_size_t io_start, mmio_start, mmio_pref_start;
>  	unsigned int normal_bridges = 0, hotplug_bridges = 0;
>  	struct resource *io_res, *mmio_res, *mmio_pref_res;
>  	struct pci_dev *dev, *bridge = bus->self;
> @@ -1942,11 +1943,16 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  			remaining_mmio_pref -= resource_size(res);
>  	}
>  
> +	io_start = io_res->start;
> +	mmio_start = mmio_res->start;
> +	mmio_pref_start = mmio_pref_res->start;
> +
>  	/*
>  	 * Go over devices on this bus and distribute the remaining
>  	 * resource space between hotplug bridges.
>  	 */
>  	for_each_pci_bridge(dev, bus) {
> +		resource_size_t align;
>  		struct pci_bus *b;
>  
>  		b = dev->subordinate;
> @@ -1964,7 +1970,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  				available_io, available_mmio,
>  				available_mmio_pref);
>  		} else if (dev->is_hotplug_bridge) {
> -			resource_size_t align, io, mmio, mmio_pref;
> +			resource_size_t io, mmio, mmio_pref;
>  
>  			/*
>  			 * Distribute available extra resources equally
> @@ -1977,11 +1983,13 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  			io = div64_ul(available_io, hotplug_bridges);
>  			io = min(ALIGN(io, align), remaining_io);
>  			remaining_io -= io;
> +			io_start += io;
>  
>  			align = pci_resource_alignment(bridge, mmio_res);
>  			mmio = div64_ul(available_mmio, hotplug_bridges);
>  			mmio = min(ALIGN(mmio, align), remaining_mmio);
>  			remaining_mmio -= mmio;
> +			mmio_start += mmio;
>  
>  			align = pci_resource_alignment(bridge, mmio_pref_res);
>  			mmio_pref = div64_ul(available_mmio_pref,
> @@ -1989,9 +1997,40 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
>  			mmio_pref = min(ALIGN(mmio_pref, align),
>  					remaining_mmio_pref);
>  			remaining_mmio_pref -= mmio_pref;
> +			mmio_pref_start += mmio_pref;
>  
>  			pci_bus_distribute_available_resources(b, add_list, io,
>  							       mmio, mmio_pref);
> +		} else {
> +			/*
> +			 * For normal bridges, track start of the parent
> +			 * bridge window to make sure we align the
> +			 * remaining space which is distributed to the
> +			 * hotplug bridges properly.
> +			 */
> +			resource_size_t aligned;
> +			struct resource *res;
> +
> +			res = &dev->resource[PCI_BRIDGE_RESOURCES + 0];
> +			io_start += resource_size(res);
> +			aligned = ALIGN(io_start,
> +					pci_resource_alignment(dev, res));
> +			if (aligned > io_start)
> +				remaining_io -= aligned - io_start;
> +
> +			res = &dev->resource[PCI_BRIDGE_RESOURCES + 1];
> +			mmio_start += resource_size(res);
> +			aligned = ALIGN(mmio_start,
> +					pci_resource_alignment(dev, res));
> +			if (aligned > mmio_start)
> +				remaining_mmio -= aligned - mmio_start;
> +
> +			res = &dev->resource[PCI_BRIDGE_RESOURCES + 2];
> +			mmio_pref_start += resource_size(res);
> +			aligned = ALIGN(mmio_pref_start,
> +					pci_resource_alignment(dev, res));
> +			if (aligned > mmio_pref_start)
> +				remaining_mmio_pref -= aligned - mmio_pref_start;
>  		}
>  	}
>  }
> -- 
> 2.16.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Mika Westerberg April 26, 2018, 12:23 p.m. | #2
On Wed, Apr 25, 2018 at 05:38:54PM -0500, Bjorn Helgaas wrote:
> On Mon, Apr 16, 2018 at 01:34:46PM +0300, Mika Westerberg wrote:
> > [...]
> 
> Sorry, I've spent a lot of time trying to trace through this code, and
> I'm still hopelessly confused.  Can you post the complete "lspci -vv"
> output and the dmesg log (including the hot-add event) somewhere and
> include a URL to it?

I sent you the logs and lspci output, both with and without this patch,
when connecting a full chain of 6 Thunderbolt devices where 3 of them
include those NICs with 4 Ethernet ports. The resulting topology
includes a total of 6 + 3 + 1 PCIe switches.

> I think I understand the problem you're solving:
> 
>   - You have 366M, 1M-aligned, available for things on bus 3a
>   - You assign 20M, 2M-aligned to 3a:01.0
>   - This leaves 346M for other things on bus 3a, but it's not all
>     contiguous because the 20M is in the middle.
>   - The remaining 346M might be 1M on one side and 345M on the other
>     (and there are many other possibilities, e.g., 3M + 343M, 5M +
>     341M, ..., 345M + 1M).
>   - The current code tries to assign all 346M to 3a:04.0, which
>     fails because that space is not contiguous, so it falls back to
>     allocating 1M, which works but is insufficient for future
>     hot-adds.

My understanding is that the 20M is aligned to 2M, so we need to take
that into account when we distribute the remaining space, which makes it
345M instead of the 346M it would be without the alignment.

> Obviously this patch makes *this* situation work: it assigns 345M to
> 3a:04.0 and (I assume) leaves the 1M unused.  But I haven't been able
> to convince myself that this patch works *in general*.

I've tested this patch with a full chain of devices, with all three of
my Intel Gigabit ET2 quad port server adapters connected along with
other devices, and the issue does not happen.

> For example, what if we assigned the 20M from the end of the 366M
> window instead of the beginning, so the 345M piece is below the 20M
> and there's 1M left above it?  That is legal and should work, but I
> suspect this patch would ignore the 345M piece and again assign 1M to
> 3a:04.0.

It should work such that resources are first allocated to the
non-hotplug bridges, and after that everything else is put into the
hotplug bridges.

> Or what if there are several hotplug bridges on bus 3a?  This example
> has two, but there could be many more.
> 
> Or what if there are normal bridges as well as hotplug bridges on bus
> 3a?  Or if they're in arbitrary orders?

A Thunderbolt host router with two ports has such a configuration:
there are two hotplug ports and two normal ports (there could be more),
and it is hot-added as well. At least that works. For the other
arbitrary scenarios, it is hard to say without actually testing on real
hardware.

> > Fixes: 1a5767725cec ("PCI: Distribute available resources to hotplug-capable bridges")
> > Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
> > Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
> > Cc: stable@vger.kernel.org
> 
> Given my confusion about this, I doubt this satisfies the stable
> kernel "obviously correct" rule.

Fair enough.

> s/substracting/subtracting/ above

OK, thanks.

I'm also fine with dropping this patch altogether and just filing a
kernel bugzilla with this information attached. Maybe someone else can
provide a better fix eventually. This is not a really common situation
anyway, because typically a Thunderbolt device includes only PCIe
endpoints (not PCIe switches with a bunch of endpoints connected).
Furthermore, I tried the same in Windows, and it does not handle it
properly either ;-)
Bjorn Helgaas May 1, 2018, 8:32 p.m. | #3
On Thu, Apr 26, 2018 at 03:23:33PM +0300, Mika Westerberg wrote:
> On Wed, Apr 25, 2018 at 05:38:54PM -0500, Bjorn Helgaas wrote:
> > On Mon, Apr 16, 2018 at 01:34:46PM +0300, Mika Westerberg wrote:
> > > [...]
> > 
> > Sorry, I've spent a lot of time trying to trace through this code, and
> > I'm still hopelessly confused.  Can you post the complete "lspci -vv"
> > output and the dmesg log (including the hot-add event) somewhere and
> > include a URL to it?
> 
> I sent you the logs and lspci output both with and without this patch
> when I connect a full chain of 6 Thunderbolt devices where 3 of them
> include those NICs with 4 ethernet ports. The resulting topology
> includes total of 6 + 3 + 1 PCIe switches.

Thanks, I opened https://bugzilla.kernel.org/show_bug.cgi?id=199581
and attached the info you sent.

> > I think I understand the problem you're solving:
> > 
> >   - You have 366M, 1M-aligned, available for things on bus 3a
> >   - You assign 20M, 2M-aligned to 3a:01.0
> >   - This leaves 346M for other things on bus 3a, but it's not all
> >     contiguous because the 20M is in the middle.
> >   - The remaining 346M might be 1M on one side and 345M on the other
> >     (and there are many other possibilities, e.g., 3M + 343M, 5M +
> >     341M, ..., 345M + 1M).
> >   - The current code tries to assign all 346M to 3a:04.0, which
> >     fails because that space is not contiguous, so it falls back to
> >     allocating 1M, which works but is insufficient for future
> >     hot-adds.
> 
> My understanding is that the 20M is aligned to 2M so we need to take
> that into account when we distribute the remaining space which makes it
> 345 instead of 346 which it would be without the alignment.

I think that's what I said above, or did I miss something?

> > Obviously this patch makes *this* situation work: it assigns 345M to
> > 3a:04.0 and (I assume) leaves the 1M unused.  But I haven't been able
> > to convince myself that this patch works *in general*.
> 
> I've tested this patch with full chain of devices with all my three
> Intel Gigabit ET2 quad port server adapters connected there along with
> other devices and the issue does not happen.
> 
> > For example, what if we assigned the 20M from the end of the 366M
> > window instead of the beginning, so the 345M piece is below the 20M
> > and there's 1M left above it?  That is legal and should work, but I
> > suspect this patch would ignore the 345M piece and again assign 1M to
> > 3a:04.0.
> 
> It should work so that it first allocates resources for the non-hotplug
> bridges and after that everything else is put to hotplug bridges.
> 
> > Or what if there are several hotplug bridges on bus 3a?  This example
> > has two, but there could be many more.
> > 
> > Or what if there are normal bridges as well as hotplug bridges on bus
> > 3a?  Or if they're in arbitrary orders?
> 
> Thunderbolt host router with two ports has such configuration where
> there are two hotplug ports and two normal ports (there could be more)
> and it is hot-added as well. At least that works. With the other
> arbitrary scenarios, it is hard to say without actually testing it on a
> real hardware.

This is where it gets hard for me -- I'm not really comfortable if we
have to convince ourselves that code is correct by testing every
scenario.  It's a lot better if we can convince ourselves by reasoning
about what the code does.  That's not very reliable either, but if we
understand the code, we at least have a hope of being able to fix the
bugs we missed in our reasoning.

> Also I'm fine dropping this patch altogether and just file a kernel
> bugzilla with this information attached. Maybe someone else can provide
> a better fix eventually. This is not really common situation anyway
> because typically you have only PCIe endpoints included in a Thunderbolt
> device (not PCIe switches with a bunch of endpoints connected).
> Furthermore, I tried the same in Windows and it does not handle it
> properly either ;-)

OK, I opened the bugzilla and attached the info.  Thanks!
Mika Westerberg May 3, 2018, 12:39 p.m. | #4
On Tue, May 01, 2018 at 03:32:46PM -0500, Bjorn Helgaas wrote:
> This is where it gets hard for me -- I'm not really comfortable if we
> have to convince ourselves that code is correct by testing every
> scenario.  It's a lot better if we can convince ourselves by reasoning
> about what the code does.  That's not very reliable either, but if we
> understand the code, we at least have a hope of being able to fix the
> bugs we missed in our reasoning.

I took another look at the code, and we can calculate everything
upfront, before resources get distributed to the hotplug bridges. I
tried it, and it still works on my test systems. Does the patch below
look more acceptable to you?

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 072784f55ea5..cbc80dd5e8d8 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1881,6 +1881,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 	unsigned int normal_bridges = 0, hotplug_bridges = 0;
 	struct resource *io_res, *mmio_res, *mmio_pref_res;
 	struct pci_dev *dev, *bridge = bus->self;
+	resource_size_t align, io_align, mem_align;
 
 	io_res = &bridge->resource[PCI_BRIDGE_RESOURCES + 0];
 	mmio_res = &bridge->resource[PCI_BRIDGE_RESOURCES + 1];
@@ -1919,27 +1920,43 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 			normal_bridges++;
 	}
 
+	io_align = window_alignment(bus, IORESOURCE_IO);
+	mem_align = window_alignment(bus, IORESOURCE_MEM);
+
 	for_each_pci_bridge(dev, bus) {
-		const struct resource *res;
+		struct resource *res;
 
 		if (dev->is_hotplug_bridge)
 			continue;
 
 		/*
 		 * Reduce the available resource space by what the
-		 * bridge and devices below it occupy.
+		 * bridge and devices below it occupy taking into
+		 * account alignment if it differs from the default.
 		 */
 		res = &dev->resource[PCI_BRIDGE_RESOURCES + 0];
-		if (!res->parent && available_io > resource_size(res))
+		if (!res->parent && available_io > resource_size(res)) {
 			remaining_io -= resource_size(res);
+			align = pci_resource_alignment(dev, res);
+			if (align > io_align)
+				remaining_io -= align - io_align;
+		}
 
 		res = &dev->resource[PCI_BRIDGE_RESOURCES + 1];
-		if (!res->parent && available_mmio > resource_size(res))
+		if (!res->parent && available_mmio > resource_size(res)) {
 			remaining_mmio -= resource_size(res);
+			align = pci_resource_alignment(dev, res);
+			if (align > mem_align)
+				remaining_mmio -= align - mem_align;
+		}
 
 		res = &dev->resource[PCI_BRIDGE_RESOURCES + 2];
-		if (!res->parent && available_mmio_pref > resource_size(res))
+		if (!res->parent && available_mmio_pref > resource_size(res)) {
 			remaining_mmio_pref -= resource_size(res);
+			align = pci_resource_alignment(dev, res);
+			if (align > mem_align)
+				remaining_mmio_pref -= align - mem_align;
+		}
 	}
 
 	/*
@@ -1964,7 +1981,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 				available_io, available_mmio,
 				available_mmio_pref);
 		} else if (dev->is_hotplug_bridge) {
-			resource_size_t align, io, mmio, mmio_pref;
+			resource_size_t io, mmio, mmio_pref;
 
 			/*
 			 * Distribute available extra resources equally

Patch

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 072784f55ea5..eb3059fb7f63 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -1878,6 +1878,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 	resource_size_t available_mmio, resource_size_t available_mmio_pref)
 {
 	resource_size_t remaining_io, remaining_mmio, remaining_mmio_pref;
+	resource_size_t io_start, mmio_start, mmio_pref_start;
 	unsigned int normal_bridges = 0, hotplug_bridges = 0;
 	struct resource *io_res, *mmio_res, *mmio_pref_res;
 	struct pci_dev *dev, *bridge = bus->self;
@@ -1942,11 +1943,16 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 			remaining_mmio_pref -= resource_size(res);
 	}
 
+	io_start = io_res->start;
+	mmio_start = mmio_res->start;
+	mmio_pref_start = mmio_pref_res->start;
+
 	/*
 	 * Go over devices on this bus and distribute the remaining
 	 * resource space between hotplug bridges.
 	 */
 	for_each_pci_bridge(dev, bus) {
+		resource_size_t align;
 		struct pci_bus *b;
 
 		b = dev->subordinate;
@@ -1964,7 +1970,7 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 				available_io, available_mmio,
 				available_mmio_pref);
 		} else if (dev->is_hotplug_bridge) {
-			resource_size_t align, io, mmio, mmio_pref;
+			resource_size_t io, mmio, mmio_pref;
 
 			/*
 			 * Distribute available extra resources equally
@@ -1977,11 +1983,13 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 			io = div64_ul(available_io, hotplug_bridges);
 			io = min(ALIGN(io, align), remaining_io);
 			remaining_io -= io;
+			io_start += io;
 
 			align = pci_resource_alignment(bridge, mmio_res);
 			mmio = div64_ul(available_mmio, hotplug_bridges);
 			mmio = min(ALIGN(mmio, align), remaining_mmio);
 			remaining_mmio -= mmio;
+			mmio_start += mmio;
 
 			align = pci_resource_alignment(bridge, mmio_pref_res);
 			mmio_pref = div64_ul(available_mmio_pref,
@@ -1989,9 +1997,40 @@ static void pci_bus_distribute_available_resources(struct pci_bus *bus,
 			mmio_pref = min(ALIGN(mmio_pref, align),
 					remaining_mmio_pref);
 			remaining_mmio_pref -= mmio_pref;
+			mmio_pref_start += mmio_pref;
 
 			pci_bus_distribute_available_resources(b, add_list, io,
 							       mmio, mmio_pref);
+		} else {
+			/*
+			 * For normal bridges, track start of the parent
+			 * bridge window to make sure we align the
+			 * remaining space which is distributed to the
+			 * hotplug bridges properly.
+			 */
+			resource_size_t aligned;
+			struct resource *res;
+
+			res = &dev->resource[PCI_BRIDGE_RESOURCES + 0];
+			io_start += resource_size(res);
+			aligned = ALIGN(io_start,
+					pci_resource_alignment(dev, res));
+			if (aligned > io_start)
+				remaining_io -= aligned - io_start;
+
+			res = &dev->resource[PCI_BRIDGE_RESOURCES + 1];
+			mmio_start += resource_size(res);
+			aligned = ALIGN(mmio_start,
+					pci_resource_alignment(dev, res));
+			if (aligned > mmio_start)
+				remaining_mmio -= aligned - mmio_start;
+
+			res = &dev->resource[PCI_BRIDGE_RESOURCES + 2];
+			mmio_pref_start += resource_size(res);
+			aligned = ALIGN(mmio_pref_start,
+					pci_resource_alignment(dev, res));
+			if (aligned > mmio_pref_start)
+				remaining_mmio_pref -= aligned - mmio_pref_start;
 		}
 	}
 }