[RFC] pci: Change bus number assignment policy
diff mbox series

Message ID 20190312065724.28583-1-oohall@gmail.com
State New
Headers show
Series
  • [RFC] pci: Change bus number assignment policy
Related show

Checks

Context Check Description
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot-dco fail Signed-off-by missing
snowpatch_ozlabs/snowpatch_job_snowpatch-skiboot success Test snowpatch/job/snowpatch-skiboot on branch master
snowpatch_ozlabs/apply_patch success Successfully applied on branch master (261ca8e779e5138869a45f174caa49be6a274501)

Commit Message

Oliver O'Halloran March 12, 2019, 6:57 a.m. UTC
Change the way we assign bus numbers so that rather than assigning
bus numbers densely, we spread the available bus numbers across the
downstream bridges of a given topology.

For PCIe topologies with a wide-fan out this allows hot-plugging
large numbers of devices into a downstream port near the top of the
PCIe topology.

This patch changes the bus number assignment policy to evenly split the
available bus numbers across the bridges on bus. This allows us to
support hot-adding of large numbers of devices near the root of the
topology. Currently we only assign a handful of spare buses to each
hotplug port so if a large number of devices are added at the root
we won't have enough spare bus numbers for them.

e.g. On a zaius system with a two-switch (HBA switch and drawer switch) topology:

Before:

-+-[0030:00]---00.0-[01-40]----00.0-[02-40]--+-04.0-[03-1e]----00.0-[04-1e]--+-08.0-[05]----00.0
 |                                           |                               +-09.0-[06]----00.0
 |                                           |                               +-0a.0-[07-0b]--
 |                                           |                               +-0b.0-[0c-10]--
 |                                           |                               +-10.0-[11]----00.0
 |                                           |                               +-11.0-[12]----00.0
 |                                           |                               +-12.0-[13-17]--
 |                                           |                               +-13.0-[18-1c]--
 |                                           |                               +-15.0-[1d]----00.0
 |                                           |                               \-16.0-[1e]----00.0
 |                                           +-05.0-[1f-36]----00.0-[20-36]--+-04.0-[21]----00.0
 |                                           |                               +-05.0-[22]----00.0
 |                                           |                               +-06.0-[23]----00.0
 |                                           |                               +-07.0-[24]----00.0
 |                                           |                               +-0c.0-[25-29]--
 |                                           |                               +-0d.0-[2a]----00.0
 |                                           |                               +-0e.0-[2b-2f]--
 |                                           |                               +-0f.0-[30]----00.0
 |                                           |                               +-14.0-[31-35]--
 |                                           |                               \-17.0-[36]----00.0
 |                                           +-06.0-[37-3b]--
 |                                           \-07.0-[3c-40]--

After:

-+-[0030:00]---00.0-[01-fe]----00.0-[02-fd]--+-04.0-[03-41]----00.0-[04-40]--+-08.0-[05-0a]----00.0
 |                                           |                               +-09.0-[0b-10]----00.0
 |                                           |                               +-0a.0-[11-16]--
 |                                           |                               +-0b.0-[17-1c]--
 |                                           |                               +-10.0-[1d-22]----00.0
 |                                           |                               +-11.0-[23-28]----00.0
 |                                           |                               +-12.0-[29-2e]--
 |                                           |                               +-13.0-[2f-34]--
 |                                           |                               +-15.0-[35-3a]----00.0
 |                                           |                               \-16.0-[3b-40]----00.0
 |                                           +-05.0-[42-80]----00.0-[43-7f]--+-04.0-[44-49]----00.0
 |                                           |                               +-05.0-[4a-4f]--
 |                                           |                               +-06.0-[50-55]----00.0
 |                                           |                               +-07.0-[56-5b]----00.0
 |                                           |                               +-0c.0-[5c-61]--
 |                                           |                               +-0d.0-[62-67]----00.0
 |                                           |                               +-0e.0-[68-6d]--
 |                                           |                               +-0f.0-[6e-73]----00.0
 |                                           |                               +-14.0-[74-79]--
 |                                           |                               \-17.0-[7a-7f]----00.0
 |                                           +-06.0-[81-bf]--
 |                                           \-07.0-[c0-fe]--

This does however have the disadvantage that we can't really support
deep rather than wide topologies. In the above example you can see that
when we hit the lowest level there is only 5 buses available per port,
so if we had an architecture where downstream storage was daisy-chained
we would run out bus numbers pretty quickly.

These are solvable problems, but I figure I should see what people think
before spending a lot of time on this.

Not-Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
Cc: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
---
Sergey, would something like this work for the NVMe drawers you've been
working with? I think we'll need to support bus-number reassignment at
eventually, but if we could kick that can down the road a bit it'd be
helpful.
---
 core/pci.c | 44 ++++++++++++++++----------------------------
 1 file changed, 16 insertions(+), 28 deletions(-)

Comments

Oliver O'Halloran March 12, 2019, 6:58 a.m. UTC | #1
On Tue, Mar 12, 2019 at 5:57 PM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> Change the way we assign bus numbers so that rather than assigning
> bus numbers densely, we spread the available bus numbers across the
> downstream bridges of a given topology.
>
> For PCIe topologies with a wide-fan out this allows hot-plugging
> large numbers of devices into a downstream port near the top of the
> PCIe topology.
>
> This patch changes the bus number assignment policy to evenly split the
> available bus numbers across the bridges on bus. This allows us to
> support hot-adding of large numbers of devices near the root of the
> topology. Currently we only assign a handful of spare buses to each
> hotplug port so if a large number of devices are added at the root
> we won't have enough spare bus numbers for them.

I really should have proof-read that. Ah well.
Sergey Miroshnichenko March 13, 2019, 1:01 p.m. UTC | #2
Hello Oliver,

On 3/12/19 9:57 AM, Oliver O'Halloran wrote:
> Change the way we assign bus numbers so that rather than assigning
> bus numbers densely, we spread the available bus numbers across the
> downstream bridges of a given topology.
> 
> For PCIe topologies with a wide-fan out this allows hot-plugging
> large numbers of devices into a downstream port near the top of the
> PCIe topology.
> 
> This patch changes the bus number assignment policy to evenly split the
> available bus numbers across the bridges on bus. This allows us to
> support hot-adding of large numbers of devices near the root of the
> topology. Currently we only assign a handful of spare buses to each
> hotplug port so if a large number of devices are added at the root
> we won't have enough spare bus numbers for them.
> 
> e.g. On a zaius system with a two-switch (HBA switch and drawer switch) topology:
> 
> Before:
> 
> -+-[0030:00]---00.0-[01-40]----00.0-[02-40]--+-04.0-[03-1e]----00.0-[04-1e]--+-08.0-[05]----00.0
>  |                                           |                               +-09.0-[06]----00.0
>  |                                           |                               +-0a.0-[07-0b]--
>  |                                           |                               +-0b.0-[0c-10]--
>  |                                           |                               +-10.0-[11]----00.0
>  |                                           |                               +-11.0-[12]----00.0
>  |                                           |                               +-12.0-[13-17]--
>  |                                           |                               +-13.0-[18-1c]--
>  |                                           |                               +-15.0-[1d]----00.0
>  |                                           |                               \-16.0-[1e]----00.0
>  |                                           +-05.0-[1f-36]----00.0-[20-36]--+-04.0-[21]----00.0
>  |                                           |                               +-05.0-[22]----00.0
>  |                                           |                               +-06.0-[23]----00.0
>  |                                           |                               +-07.0-[24]----00.0
>  |                                           |                               +-0c.0-[25-29]--
>  |                                           |                               +-0d.0-[2a]----00.0
>  |                                           |                               +-0e.0-[2b-2f]--
>  |                                           |                               +-0f.0-[30]----00.0
>  |                                           |                               +-14.0-[31-35]--
>  |                                           |                               \-17.0-[36]----00.0
>  |                                           +-06.0-[37-3b]--
>  |                                           \-07.0-[3c-40]--
> 
> After:
> 
> -+-[0030:00]---00.0-[01-fe]----00.0-[02-fd]--+-04.0-[03-41]----00.0-[04-40]--+-08.0-[05-0a]----00.0
>  |                                           |                               +-09.0-[0b-10]----00.0
>  |                                           |                               +-0a.0-[11-16]--
>  |                                           |                               +-0b.0-[17-1c]--
>  |                                           |                               +-10.0-[1d-22]----00.0
>  |                                           |                               +-11.0-[23-28]----00.0
>  |                                           |                               +-12.0-[29-2e]--
>  |                                           |                               +-13.0-[2f-34]--
>  |                                           |                               +-15.0-[35-3a]----00.0
>  |                                           |                               \-16.0-[3b-40]----00.0
>  |                                           +-05.0-[42-80]----00.0-[43-7f]--+-04.0-[44-49]----00.0
>  |                                           |                               +-05.0-[4a-4f]--
>  |                                           |                               +-06.0-[50-55]----00.0
>  |                                           |                               +-07.0-[56-5b]----00.0
>  |                                           |                               +-0c.0-[5c-61]--
>  |                                           |                               +-0d.0-[62-67]----00.0
>  |                                           |                               +-0e.0-[68-6d]--
>  |                                           |                               +-0f.0-[6e-73]----00.0
>  |                                           |                               +-14.0-[74-79]--
>  |                                           |                               \-17.0-[7a-7f]----00.0
>  |                                           +-06.0-[81-bf]--
>  |                                           \-07.0-[c0-fe]--
> 
> This does however have the disadvantage that we can't really support
> deep rather than wide topologies. In the above example you can see that
> when we hit the lowest level there is only 5 buses available per port,
> so if we had an architecture where downstream storage was daisy-chained
> we would run out bus numbers pretty quickly.
> 
> These are solvable problems, but I figure I should see what people think
> before spending a lot of time on this.
> 
> Not-Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> Cc: Sergey Miroshnichenko <s.miroshnichenko@yadro.com>
> ---
> Sergey, would something like this work for the NVMe drawers you've been
> working with? I think we'll need to support bus-number reassignment at
> eventually, but if we could kick that can down the road a bit it'd be
> helpful.

Reserving more bus numbers per downstream port, covering all the 0-0xff
range, would be definitely helpful for the current state of Linux kernel.

But with the "pci=realloc" command line argument the kernel will
reassign the bus numbers during boot: compactly or based on the
"hpbussize" argument.

I have a patchset for Linux that handles a bridge hot-plugged in the
middle of a dense PCIe tree, we are actively using it internally for
about a year on our Vesnin servers and also Xeons, but we was going to
propose upstreaming it later - after the "Movable BARs" serie. These
patches increment bus numbers of a "tail" of the tree, freeing a space
for the new bridge, while all the working drivers remain working. The
hardest part was rebuilding procfs and sysfs entries and simlinks.

Best regards,
Serge

> ---
>  core/pci.c | 44 ++++++++++++++++----------------------------
>  1 file changed, 16 insertions(+), 28 deletions(-)
> 
> diff --git a/core/pci.c b/core/pci.c
> index 454b50102e59..d5537b6a376b 100644
> --- a/core/pci.c
> +++ b/core/pci.c
> @@ -743,9 +743,10 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  		     bool scan_downstream)
>  {
>  	struct pci_device *pd = NULL, *rc = NULL;
> -	uint8_t dev, fn, next_bus, max_sub, save_max;
> +	uint8_t dev, fn, next_bus, max_sub;
>  	uint32_t scan_map;
>  	bool use_max;
> +	int bridges = 0, buses_per_bridge;
>  
>  	/* Decide what to scan  */
>  	scan_map = parent ? parent->scan_map : phb->scan_map;
> @@ -810,7 +811,18 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  
>  	next_bus = bus + 1;
>  	max_sub = bus;
> -	save_max = max_bus;
> +
> +	list_for_each(list, pd, link)
> +		if (pd->is_bridge)
> +			bridges++;
> +
> +	buses_per_bridge = max_bus - next_bus - 1;
> +	if (bridges)
> +		buses_per_bridge /= bridges;
> +
> +	PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n",
> +		bridges, next_bus, max_bus,  scan_downstream ? "" : "not ",
> +		buses_per_bridge);
>  
>  	/* Scan down bridges */
>  	list_for_each(list, pd, link) {
> @@ -819,32 +831,8 @@ uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
>  		if (!pd->is_bridge)
>  			continue;
>  
> -		/* We need to figure out a new bus number to start from.
> -		 *
> -		 * This can be tricky due to our HW constraints which differ
> -		 * from bridge to bridge so we are going to let the phb
> -		 * driver decide what to do. This can return us a maximum
> -		 * bus number to assign as well
> -		 *
> -		 * This function will:
> -		 *
> -		 *  - Return the bus number to use as secondary for the
> -		 *    bridge or 0 for a failure
> -		 *
> -		 *  - "max_bus" will be adjusted to represent the max
> -		 *    subordinate that can be associated with the downstream
> -		 *    device
> -		 *
> -		 *  - "use_max" will be set to true if the returned max_bus
> -		 *    *must* be used as the subordinate bus number of that
> -		 *    bridge (when we need to give aligned powers of two's
> -		 *    on P7IOC). If is is set to false, we just adjust the
> -		 *    subordinate bus number based on what we probed.
> -		 *
> -		 */
> -		max_bus = save_max;
> -		next_bus = phb->ops->choose_bus(phb, pd, next_bus,
> -						&max_bus, &use_max);
> +		use_max = 1;
> +		max_bus = next_bus + buses_per_bridge;
>  
>  		/* Configure the bridge with the returned values */
>  		if (next_bus <= bus) {
>

Patch
diff mbox series

diff --git a/core/pci.c b/core/pci.c
index 454b50102e59..d5537b6a376b 100644
--- a/core/pci.c
+++ b/core/pci.c
@@ -743,9 +743,10 @@  uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
 		     bool scan_downstream)
 {
 	struct pci_device *pd = NULL, *rc = NULL;
-	uint8_t dev, fn, next_bus, max_sub, save_max;
+	uint8_t dev, fn, next_bus, max_sub;
 	uint32_t scan_map;
 	bool use_max;
+	int bridges = 0, buses_per_bridge;
 
 	/* Decide what to scan  */
 	scan_map = parent ? parent->scan_map : phb->scan_map;
@@ -810,7 +811,18 @@  uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
 
 	next_bus = bus + 1;
 	max_sub = bus;
-	save_max = max_bus;
+
+	list_for_each(list, pd, link)
+		if (pd->is_bridge)
+			bridges++;
+
+	buses_per_bridge = max_bus - next_bus - 1;
+	if (bridges)
+		buses_per_bridge /= bridges;
+
+	PCIERR(phb, pd->bdfn, "found %d [%x:%x] downstream bridges, %sscanning down, %d\n",
+		bridges, next_bus, max_bus,  scan_downstream ? "" : "not ",
+		buses_per_bridge);
 
 	/* Scan down bridges */
 	list_for_each(list, pd, link) {
@@ -819,32 +831,8 @@  uint8_t pci_scan_bus(struct phb *phb, uint8_t bus, uint8_t max_bus,
 		if (!pd->is_bridge)
 			continue;
 
-		/* We need to figure out a new bus number to start from.
-		 *
-		 * This can be tricky due to our HW constraints which differ
-		 * from bridge to bridge so we are going to let the phb
-		 * driver decide what to do. This can return us a maximum
-		 * bus number to assign as well
-		 *
-		 * This function will:
-		 *
-		 *  - Return the bus number to use as secondary for the
-		 *    bridge or 0 for a failure
-		 *
-		 *  - "max_bus" will be adjusted to represent the max
-		 *    subordinate that can be associated with the downstream
-		 *    device
-		 *
-		 *  - "use_max" will be set to true if the returned max_bus
-		 *    *must* be used as the subordinate bus number of that
-		 *    bridge (when we need to give aligned powers of two's
-		 *    on P7IOC). If is is set to false, we just adjust the
-		 *    subordinate bus number based on what we probed.
-		 *
-		 */
-		max_bus = save_max;
-		next_bus = phb->ops->choose_bus(phb, pd, next_bus,
-						&max_bus, &use_max);
+		use_max = 1;
+		max_bus = next_bus + buses_per_bridge;
 
 		/* Configure the bridge with the returned values */
 		if (next_bus <= bus) {