
[v9,00/26] PCI: Allow BAR movement during boot and hotplug

Message ID 20201218174011.340514-1-s.miroshnichenko@yadro.com

Message

Sergei Miroshnichenko Dec. 18, 2020, 5:39 p.m. UTC
Currently PCI hotplug works on top of resources which are usually reserved:
by BIOS, bootloader, firmware, or by the kernel (pci=hpmemsize=XM). These
resources are gaps in the address space where BARs of new devices may fit,
plus an extra bus number per port, so bridges can be hot-added. This series
aims at the BARs problem: it teaches the kernel how to redistribute them at
run time, so hotplug becomes predictable and cross-platform. A follow-up
patchset will propose a solution for bus numbers, and another one -- for
the powerpc arch-specific problems.

To make room for BARs of new hot-added devices, the kernel can pause the
drivers of working PCI devices and reshuffle the assigned BARs. When a
driver is un-paused by the kernel, it should ioremap() the new addresses of
its BARs.

Drivers indicate their support for the feature by implementing the new hooks
.bar_fixed(), .rescan_prepare() and .rescan_done() in struct pci_driver.
If a driver doesn't yet support the feature, the BARs of its devices are
treated as immovable and handled in the same way as resources with the
IORESOURCE_PCI_FIXED flag: they are guaranteed to remain untouched.
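
As a rough illustration, a driver opting in could look like the sketch
below. The hook prototypes, the foo_* names and the choice of BARs are
assumptions made for this example, not taken verbatim from the series:

/* Illustrative sketch only: the hook prototypes are assumed from the
 * hook names above; adjust to the actual series. */
#include <linux/io.h>
#include <linux/pci.h>

struct foo_priv {
	void __iomem *regs;
};

/* Keep BAR 0 fixed; any other BAR may be moved while the driver is
 * paused during a rescan. */
static bool foo_bar_fixed(struct pci_dev *pdev, int resno)
{
	return resno == 0;
}

static void foo_rescan_prepare(struct pci_dev *pdev)
{
	struct foo_priv *priv = pci_get_drvdata(pdev);

	/* Quiesce the hardware and drop the old mapping. */
	iounmap(priv->regs);
	priv->regs = NULL;
}

static void foo_rescan_done(struct pci_dev *pdev)
{
	struct foo_priv *priv = pci_get_drvdata(pdev);

	/* BAR 2 may live at a new address now: map it again. */
	priv->regs = ioremap(pci_resource_start(pdev, 2),
			     pci_resource_len(pdev, 2));
}

static struct pci_driver foo_driver = {
	.name		= "foo",
	/* .id_table, .probe, .remove, ... */
	.bar_fixed	= foo_bar_fixed,
	.rescan_prepare	= foo_rescan_prepare,
	.rescan_done	= foo_rescan_done,
};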

This approach was discussed during LPC 2020 [1].

The feature is usable not only for hotplug, but also for booting with a
complete set of BARs assigned, and for Resizable BARs.

Tested on a number of x86_64 machines without any special kernel command
line arguments (some of them with the older v8 revision of this patchset):
 - PC: i7-5930K + ASUS X99-A;
 - PC: i5-8500 + ASUS Z370-F;
 - PC: AMD Ryzen 7 3700X + ASRock X570 + AMD 5700 XT (Resizable BARs);
 - Supermicro Super Server/H11SSL-i: AMD EPYC 7251;
 - HP ProLiant DL380 G5: Xeon X5460;
 - Dell Inspiron N5010: i5 M 480;
 - Dell Precision M6600: i7-2920XM.

Also tested on Power8 (Vesnin) and Power9 (Nicole) ppc64le machines, but
with an extra patchset, which will be sent upstream a bit later.

The set contains:
 01-02: Independent bugfixes; neither is directly related to the movable
        BARs feature, but without them the rest of this series will not
        work as expected;

 03-14: Implement the essentials of the feature;

 15-21: Performance improvements for movable BARs and pciehp;

 22: Enable the feature by default;

 23-24: Add movable BARs support to nvme and portdrv;

 25-26: Debugging helpers.

This patchset is part of our work on adding support for hot-adding chains
of chassis full of other bridges, NVMe drives, SAS HBAs, GPUs, etc. without
special requirements such as a Hot-Plug Controller, or reservation of bus
numbers or memory regions by firmware.

[1] https://linuxplumbersconf.org/event/7/contributions/847/attachments/584/1035/lpc2020_sergmir.pdf
    LPC 2020: VFIO/IOMMU/PCI MC: PCI hotplug: movable BARs and bus numbers

Changes since v8:
 - Fixed support for Resizable BARs, tested with amdgpu (no changes in the
   driver);
 - Added a new hook .bar_fixed(), so only part of a device's BARs can be
   movable; this new call also makes the API a bit cleaner to use;
 - Memory reservation (via pci=hpmemsize, etc.) now works as well;
 - The fix for a race condition (patch 01) has been simplified;
 - Simplified: the .realloc_range field is not needed anymore;
 - Added some debug messages;
 - Fixed compilation warnings and a NULL dereference.

Changes since v7:
 - Added some documentation;
 - Replaced every occurrence of the word "immovable" with "fixed";
 - Don't touch PNP, ACPI resources anymore;
 - Replaced double rescan with triple rescan:
   * first try every BAR;
   * if that failed, retry without BARs which weren't assigned before;
   * if that failed, retry without BARs of hotplugged devices;
 - Reassign BARs during boot only if the BIOS didn't assign all requested
   BARs;
 - Fixed up PCIBIOS_MIN_MEM instead of ignoring it;
 - Now the feature auto-disables in the presence of a transparent bridge;
 - Improved support of runtime PM;
 - Fixed issues with incorrectly released bridge windows;
 - Fixed calculating bridge window size.
 
Changes since v6:
 - Added a fix for hotplug on AMD Epyc + Supermicro H11SSL-i by ignoring
   PCIBIOS_MIN_MEM;
 - Fixed a workaround which marks VGA BARs as immovable;
 - Fixed misleading "can't claim BAR ... no compatible bridge window" error
   messages;
 - Refactored the code, reduced the amount of patches;
 - Excluded PowerPC-specific arch patches; they will be sent separately;
 - Disabled for PowerNV by default - waiting for the PCIPOCALYPSE patchset;
 - Fixed reports from the kbuild test robot.

Changes since v5:
 - Simplified the disable flag, now it is "pci=no_movable_buses";
 - More deliberate marking of BARs as immovable;
 - Mark BARs used by unbound drivers as immovable;
 - Ignore BAR assignment by non-kernel program components, so the kernel is
   now able to distribute BARs in an optimal and predictable way;
 - Moved PowerNV-specific patches here from the older "powerpc/powernv/pci:
   Make hotplug self-sufficient, independent of FW and DT" series;
 - Fix EEH cache rebuilding and PE allocation for PowerNV during rescan.

Changes since v4:
 - The feature is enabled by default (turned on by one of the last patches);
 - Add pci_dev_movable_bars_supported(dev) instead of marking the immovable
   BARs with the IORESOURCE_PCI_FIXED flag;
 - Set up PCIe bridges during rescan via sysfs, so MPS settings are now
   configured not only during system boot or pcihp events;
 - Allow movement of switch's BARs if claimed by portdrv;
 - Update EEH address caches after rescan for powerpc;
 - Don't completely disable hot-added devices whose BARs can't be fit --
   just disable their BARs, so the devices are still visible in lspci etc.;
 - Clearer names: fixed_range_hard -> immovable_range, fixed_range_soft ->
   realloc_range;
 - Drop the patch for pci_restore_config_space() - fixed by properly using
   the runtime PM.

Changes since v3:
 - Rebased to the upstream, so the patches apply cleanly again.

Changes since v2:
 - Fixed double-assignment of bridge windows;
 - Fixed assignment of fixed prefetched resources;
 - Fixed releasing of fixed resources;
 - Fixed a debug message;
 - Removed auto-enabling the movable BARs for x86 - let's rely on the
   "pcie_movable_bars=force" option for now;
 - Reordered the patches - bugfixes first.

Changes since v1:
 - Add a "pcie_movable_bars={ off | force }" command line argument;
 - Handle the IORESOURCE_PCI_FIXED flag properly;
 - Don't move BARs of devices which don't support the feature;
 - Guarantee that new hotplugged devices will not steal memory from working
   devices by ignoring the failing new devices with the new PCI_DEV_IGNORE
   flag;
 - Add rescan_prepare()+rescan_done() to the struct pci_driver instead of
   using the reset_prepare()+reset_done() from struct pci_error_handlers;
 - Add a bugfix of a race condition;
 - Fixed hotplug in a non-pre-enabled (by BIOS/firmware) bridge;
 - Fix the compatibility of the feature with pm_runtime and D3-state;
 - Hotplug events from pciehp can also move BARs;
 - Add support of the feature to the NVMe driver.

Sergei Miroshnichenko (26):
  PCI: Fix race condition in pci_enable/disable_device()
  PCI: Ensure a bridge has I/O and MEM access for hot-added devices
  PCI: hotplug: Initial support of the movable BARs feature
  PCI: Add version of release_child_resources() aware of fixed BARs
  PCI: hotplug: Fix reassigning the released BARs
  PCI: hotplug: Recalculate every bridge window during rescan
  PCI: hotplug: Don't allow hot-added devices to steal resources
  PCI: Reassign BARs if BIOS/bootloader had assigned not all of them
  PCI: Movable BARs: Make just a single attempt to assign bridge
    resources
  PCI: hotplug: Calculate fixed parts of bridge windows
  PCI: Include fixed BARs into the bus size calculating
  PCI: Make sure bridge windows include their fixed BARs
  PCI: hotplug: Add support of fixed BARs to pci_assign_resource()
  PCI: hotplug: Sort fixed BARs before assignment
  x86/PCI/ACPI: Fix up PCIBIOS_MIN_MEM if value computed from e820 is
    invalid
  PCI: hotplug: Configure MPS after manual bus rescan
  PCI: hotplug: Don't disable the released bridge windows immediately
  PCI: pciehp: Trigger a domain rescan on hp events when enabled movable
    BARs
  PCI: Don't claim fixed BARs
  PCI: hotplug: Retry bus assignment without reserved space
  PCI: Rescan the bus to resize a BAR
  PCI: hotplug: Enable the movable BARs feature by default
  PCI/portdrv: Declare support of movable BARs
  nvme-pci: Handle movable BARs
  PCI: Add a message for updating BARs
  resource: increase max nesting level for /proc/iomem

 Documentation/PCI/pci.rst                     |  63 +++
 .../admin-guide/kernel-parameters.txt         |   1 +
 arch/powerpc/platforms/powernv/pci.c          |   2 +
 arch/powerpc/platforms/pseries/setup.c        |   2 +
 arch/x86/pci/acpi.c                           |  15 +
 drivers/nvme/host/pci.c                       |  27 +-
 drivers/pci/bus.c                             |   2 +-
 drivers/pci/hotplug/pciehp_pci.c              |   5 +
 drivers/pci/iov.c                             |   2 +
 drivers/pci/pci.c                             |  24 ++
 drivers/pci/pci.h                             |  34 ++
 drivers/pci/pcie/portdrv_pci.c                |   7 +
 drivers/pci/probe.c                           | 401 +++++++++++++++++-
 drivers/pci/setup-bus.c                       | 227 +++++++---
 drivers/pci/setup-res.c                       | 105 ++++-
 include/linux/pci.h                           |  16 +
 kernel/resource.c                             |   2 +-
 17 files changed, 863 insertions(+), 72 deletions(-)


base-commit: a409ed156a90093a03fe6a93721ddf4c591eac87

Comments

Bjorn Helgaas Jan. 28, 2021, 2:53 p.m. UTC | #1
On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei Miroshnichenko wrote:
> Currently PCI hotplug works on top of resources which are usually reserved:
> by BIOS, bootloader, firmware, or by the kernel (pci=hpmemsize=XM). These
> resources are gaps in the address space where BARs of new devices may fit,
> and extra bus number per port, so bridges can be hot-added. This series aim
> the BARs problem: it shows the kernel how to redistribute them on the run,
> so the hotplug becomes predictable and cross-platform. A follow-up patchset
> will propose a solution for bus numbers. And another one -- for the powerpc
> arch-specific problems.

Hi Sergei,

It looks like there's a lot of good work here, but I don't really know
how to move forward with it.

I can certainly see scenarios where this functionality will be useful,
but the series currently doesn't mention bug reports that it fixes.  I
suspect there *are* some related bug reports, e.g., for Thunderbolt
hotplug.  We should dig them up, include pointers to them, and get the
reporters to test the series and provide feedback.

We had some review traffic on earlier versions, but as far as I can
see, nobody has stepped up with any actual signs of approval, e.g.,
Tested-by, Reviewed-by, or Acked-by tags.  That's a problem because
this touches a lot of sensitive code and I don't want to be stuck
fixing issues all by myself.  When issues crop up, I look to the
author and reviewers to help out.

Along that line, I'm a little concerned about how we will be able to
reproduce and debug problems.  Issues will likely depend on the
topology, the resources of the specific devices involved, and a
specific sequence of hotplug operations.  Those may be hard to
reproduce even for the reporter.  Maybe this is nothing more than a
grep pattern to extract relevant events from dmesg.  Ideal, but maybe
impractical, would be a way to reproduce things in a VM using qemu and
a script to simulate hotplugs.

> To arrange a space for BARs of new hot-added devices, the kernel can pause
> the drivers of working PCI devices and reshuffle the assigned BARs. When a
> driver is un-paused by the kernel, it should ioremap() the new addresses of
> its BARs.
> 
> Drivers indicate their support of the feature by implementing the new hooks
> .bar_fixed(), .rescan_prepare(), .rescan_done() in the struct pci_driver.
> If a driver doesn't yet support the feature, BARs of its devices are seen
> as immovable and handled in the same way as resources with the PCI_FIXED
> flag: they are guaranteed to remain untouched.
> 
> This approached was discussed during LPC 2020 [1].
> 
> The feature is usable not only for hotplug, but also for booting with a
> complete set of BARs assigned, and for Resizable BARs.

I'm interested in more details about both of these.  What does "a
complete set of BARs assigned" mean?  On x86, the BIOS usually assigns
all the BARs ahead of time, but I know that's not the case for all
arches.

For Resizable BARs, is the point that this allows more flexibility in
resizing BARs because we can now move things out of the way?

> Tested on a number of x86_64 machines without any special kernel command
> line arguments (some of them -- with older v8 revision of this patchset):
>  - PC: i7-5930K + ASUS X99-A;
>  - PC: i5-8500 + ASUS Z370-F;
>  - PC: AMD Ryzen 7 3700X + ASRock X570 + AMD 5700 XT (Resizable BARs);
>  - Supermicro Super Server/H11SSL-i: AMD EPYC 7251;
>  - HP ProLiant DL380 G5: Xeon X5460;
>  - Dell Inspiron N5010: i5 M 480;
>  - Dell Precision M6600: i7-2920XM.

Does this testing show no change in behavior and no regressions, or
does it show that this series fixes cases that previously did not
work?  If the latter, some bugzilla reports with before/after dmesg
logs would give some insight into how this works and also some good
justification.

> Also tested on a Power8 (Vesnin) and Power9 (Nicole) ppc64le machines, but
> with extra patchset, which is to be sent upstream a bit later.
> 
> The set contains:
>  01-02: Independent bugfixes, both are not related directly to the movable
>         BARs feature, but without them the rest of this series will not
> 	work as expected;
> 
>  03-14: Implement the essentials of the feature;
> 
>  15-21: Performance improvements for movable BARs and pciehp;
> 
>  22: Enable the feature by default;
> 
>  23-24: Add movable BARs support to nvme and portdrv;
> 
>  25-26: Debugging helpers.
> 
> This patchset is a part of our work on adding support for hot-adding
> chains of chassis full of other bridges, NVME drives, SAS HBAs, GPUs, etc.
> without special requirements such as Hot-Plug Controller, reservation of
> bus numbers or memory regions by firmware, etc.

This is a little bit of a chicken and egg situation.  I suspect many
people would like this functionality, but currently we either avoid it
because it's "known not to work" or we hack around it with the
firmware reservations you mention.

Bjorn
Lukas Wunner Jan. 28, 2021, 8:39 p.m. UTC | #2
On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei Miroshnichenko wrote:
> > Currently PCI hotplug works on top of resources which are usually reserved:
> > by BIOS, bootloader, firmware, or by the kernel (pci=hpmemsize=XM). These
> > resources are gaps in the address space where BARs of new devices may fit,
> > and extra bus number per port, so bridges can be hot-added. This series aim
> > the BARs problem: it shows the kernel how to redistribute them on the run,
> > so the hotplug becomes predictable and cross-platform. A follow-up patchset
> > will propose a solution for bus numbers. And another one -- for the powerpc
> > arch-specific problems.
> 
> I can certainly see scenarios where this functionality will be useful,
> but the series currently doesn't mention bug reports that it fixes.  I
> suspect there *are* some related bug reports, e.g., for Thunderbolt
> hotplug.  We should dig them up, include pointers to them, and get the
> reporters to test the series and provide feedback.

In case it helps, an earlier version of the series was referenced
in this LWN article more than 2 years ago (scroll down to the
"Moving BARs" section at the end of the article):

https://lwn.net/Articles/767885/

The article provides some context:  Specifically, macOS is capable
of resizing and moving BARs, so this series sort of helps us catch
up with the competition.

With Thunderbolt, this series is particularly useful if
(a) PCIe cards are hot-added with large BARs (such as GPUs) and/or
(b) the Thunderbolt daisy-chain is very long.

Thunderbolt is essentially a cascade of nested hotplug ports,
so if more and more devices are added, it's easy to see that
the top-level hotplug port's BAR window may run out of space.

My understanding is that Sergei's use case doesn't involve
Thunderbolt at all but rather hotplugging of GPUs and network
cards in PowerPC servers in a datacenter, which may have the
same kinds of issues.

I intended to review and test this iteration of the series more
closely, but haven't been able to carve out the required time.
I'm adding some Thunderbolt folks to cc in the hope that they
can at least test the series on their development branch.
Getting this upstreamed should really be in the best interest
of Intel and other promulgators of Thunderbolt.

Thanks,

Lukas
Mika Westerberg Feb. 1, 2021, 12:55 p.m. UTC | #3
Hi,

On Thu, Jan 28, 2021 at 09:39:29PM +0100, Lukas Wunner wrote:
> On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> > On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei Miroshnichenko wrote:
> > > Currently PCI hotplug works on top of resources which are usually reserved:
> > > by BIOS, bootloader, firmware, or by the kernel (pci=hpmemsize=XM). These
> > > resources are gaps in the address space where BARs of new devices may fit,
> > > and extra bus number per port, so bridges can be hot-added. This series aim
> > > the BARs problem: it shows the kernel how to redistribute them on the run,
> > > so the hotplug becomes predictable and cross-platform. A follow-up patchset
> > > will propose a solution for bus numbers. And another one -- for the powerpc
> > > arch-specific problems.
> > 
> > I can certainly see scenarios where this functionality will be useful,
> > but the series currently doesn't mention bug reports that it fixes.  I
> > suspect there *are* some related bug reports, e.g., for Thunderbolt
> > hotplug.  We should dig them up, include pointers to them, and get the
> > reporters to test the series and provide feedback.
> 
> In case it helps, an earlier version of the series was referenced
> in this LWN article more than 2 years ago (scroll down to the
> "Moving BARs" section at the end of the article):
> 
> https://lwn.net/Articles/767885/
> 
> The article provides some context:  Specifically, macOS is capable
> of resizing and moving BARs, so this series sort of helps us catch
> up with the competition.
> 
> With Thunderbolt, this series is particularly useful if
> (a) PCIe cards are hot-added with large BARs (such as GPUs) and/or
> (b) the Thunderbolt daisy-chain is very long.
> 
> Thunderbolt is essentially a cascade of nested hotplug ports,
> so if more and more devices are added, it's easy to see that
> the top-level hotplug port's BAR window may run out of space.
> 
> My understanding is that Sergei's use case doesn't involve
> Thunderbolt at all but rather hotplugging of GPUs and network
> cards in PowerPC servers in a datacenter, which may have the
> same kinds of issues.
> 
> I intended to review and test this iteration of the series more
> closely, but haven't been able to carve out the required time.
> I'm adding some Thunderbolt folks to cc in the hope that they
> can at least test the series on their development branch.
> Getting this upstreamed should really be in the best interest
> of Intel and other promulgators of Thunderbolt.

Sure. It seems that this series was submitted in December, so it probably
doesn't apply to pci.git/next anymore. Anyway, I can give it a try on a TBT
capable system if someone tells me what exactly to test ;-) Probably at
least that the existing functionality still works, but maybe something else
too?
Sergei Miroshnichenko Feb. 3, 2021, 7:59 p.m. UTC | #4
Hi Bjorn,

On Thu, 2021-01-28 at 08:53 -0600, Bjorn Helgaas wrote:
> I can certainly see scenarios where this functionality will be
> useful,
> but the series currently doesn't mention bug reports that it
> fixes.  I
> suspect there *are* some related bug reports, e.g., for Thunderbolt
> hotplug.  We should dig them up, include pointers to them, and get
> the
> reporters to test the series and provide feedback.

It never occurred to me to actually search for reports, thanks.

> We had some review traffic on earlier versions, but as far as I can
> see, nobody has stepped up with any actual signs of approval, e.g.,
> Tested-by, Reviewed-by, or Acked-by tags.  That's a problem because
> this touches a lot of sensitive code and I don't want to be stuck
> fixing issues all by myself.  When issues crop up, I look to the
> author and reviewers to help out.

Indeed, I had only a few informal personal responses [from outside of
our company]:
 - some machines have no regressions with the kernel built with
   Clang-12+LTO+IAS;
 - NVMe hotplug started working for AMD Epyc with a PEX switch;
 - another system with several PEX switches is under ongoing
   experiments.

Providing an ability to quickly return to the old BAR allocation behaviour
is the reason why I keep wrapping the most intrusive code with "if
(pci_can_move_bars)". Originally I wanted it to be disabled by default,
available only for the truly desperate. But there was an objection that the
code would never be tested this way.

> Along that line, I'm a little concerned about how we will be able to
> reproduce and debug problems.  Issues will likely depend on the
> topology, the resources of the specific devices involved, and a
> specific sequence of hotplug operations.  Those may be hard to
> reproduce even for the reporter.  Maybe this is nothing more than a
> grep pattern to extract relevant events from dmesg.  Ideal, but maybe
> impractical, would be a way to reproduce things in a VM using qemu
> and
> a script to simulate hotplugs.

I haven't yet tried qemu for PCI debugging; I definitely need to take a
look. To help with that, the patchset is supplied with additional prints,
but CONFIG_PCI_DEBUG=y is still preferred. A simulation will reveal whether
additional debug messages are needed.

> > The feature is usable not only for hotplug, but also for booting
> > with a
> > complete set of BARs assigned, and for Resizable BARs.
> 
> I'm interested in more details about both of these.  What does "a
> complete set of BARs assigned" mean?  On x86, the BIOS usually
> assigns
> all the BARs ahead of time, but I know that's not the case for all
> arches.

Usually yes, but we have at least one x86 PC that skips some BARs (I need
to check its particular model and version again; IIRC that was the Z370-F)
-- not the crucial ones like BAR0, though. That really took me by surprise,
so now it is one more case covered.

> For Resizable BARs, is the point that this allows more flexibility in
> resizing BARs because we can now move things out of the way?

Not only moving things out of the way: most importantly, the bridge windows
are now floating as well, so there is more freedom for a new BAR to "choose"
a place.

An awfully synthetic example:

|                       RC Address Space                       |
| orig GPU BAR | fixed BAR |                                   |
|              | fixed BAR |     Resized GPU BAR        |      |
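
For reference, below is a minimal driver-side sketch using the existing
in-kernel Resizable BAR helpers (pci_release_resource() and
pci_resize_resource()); the foo_* name, the BAR number and the requested
size are made up for illustration. The point is that, with movable BARs,
neighbouring BARs and bridge windows can be relocated, so such a request
is more likely to fit:

/* Sketch under assumptions: the device, BAR number and size are
 * invented for this example. */
#include <linux/pci.h>

static int foo_grow_bar0(struct pci_dev *pdev)
{
	/* Size is encoded as in the Resizable BAR capability:
	 * 0 = 1 MB, 8 = 256 MB, 19 = 512 GB. */
	const int new_size = 8;		/* ask for 256 MB */
	u16 cmd;
	int ret;

	/* Stop memory decoding while BAR 0 is re-assigned. */
	pci_read_config_word(pdev, PCI_COMMAND, &cmd);
	pci_write_config_word(pdev, PCI_COMMAND, cmd & ~PCI_COMMAND_MEMORY);

	pci_release_resource(pdev, 0);
	ret = pci_resize_resource(pdev, 0, new_size);

	pci_write_config_word(pdev, PCI_COMMAND, cmd);

	/* On success, the caller must ioremap() BAR 0 at its new address. */
	return ret;
}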

> > Tested on a number of x86_64 machines without any special kernel
> > command
> > line arguments (some of them -- with older v8 revision of this
> > patchset):
> >  - PC: i7-5930K + ASUS X99-A;
> >  - PC: i5-8500 + ASUS Z370-F;
> >  - PC: AMD Ryzen 7 3700X + ASRock X570 + AMD 5700 XT (Resizable
> > BARs);
> >  - Supermicro Super Server/H11SSL-i: AMD EPYC 7251;
> >  - HP ProLiant DL380 G5: Xeon X5460;
> >  - Dell Inspiron N5010: i5 M 480;
> >  - Dell Precision M6600: i7-2920XM.
> 
> Does this testing show no change in behavior and no regressions, or
> does it show that this series fixes cases that previously did not
> work?  If the latter, some bugzilla reports with before/after dmesg
> logs would give some insight into how this works and also some good
> justification.

Both, actually, but the dmesg logs were never published, so they will
be the next step -- combined with qemu scripts, if it works.

> > This patchset is a part of our work on adding support for hot-
> > adding
> > chains of chassis full of other bridges, NVME drives, SAS HBAs,
> > GPUs, etc.
> > without special requirements such as Hot-Plug Controller,
> > reservation of
> > bus numbers or memory regions by firmware, etc.
> 
> This is a little bit of a chicken and egg situation.  I suspect many
> people would like this functionality, but currently we either avoid
> it
> because it's "known not to work" or we hack around it with the
> firmware reservations you mention.

The current (v9) patchset became more symbiotic with other hotplug
facilities: it now respects pci=hpmemsize etc., and only drops the
reservation if such an allocation is impossible.

Attention Button for Assisted Hotplug (pciehp driver) was always
supported. It is also fully compatible with DPC.

The goal is to squeeze as much as possible even without special
hw+fw+sw support, when the only available tool is a manual rescan via
sysfs.
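
For reference, such a manual rescan is just a write to the standard
/sys/bus/pci/rescan attribute; a minimal user-space sketch (run as root,
error handling kept short):

/* Equivalent of: echo 1 > /sys/bus/pci/rescan */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/sys/bus/pci/rescan", O_WRONLY);

	if (fd < 0)
		return 1;
	if (write(fd, "1", 1) != 1) {
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}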

Serge
Sergei Miroshnichenko Feb. 3, 2021, 8:01 p.m. UTC | #5
On Thu, 2021-01-28 at 21:39 +0100, Lukas Wunner wrote:
> With Thunderbolt, this series is particularly useful if
> (a) PCIe cards are hot-added with large BARs (such as GPUs) and/or
> (b) the Thunderbolt daisy-chain is very long.
> 
> Thunderbolt is essentially a cascade of nested hotplug ports,
> so if more and more devices are added, it's easy to see that
> the top-level hotplug port's BAR window may run out of space.
> 
> My understanding is that Sergei's use case doesn't involve
> Thunderbolt at all but rather hotplugging of GPUs and network
> cards in PowerPC servers in a datacenter, which may have the
> same kinds of issues.

Hi Lukas,

I have yet to find some Thunderbolt hardware and try it.

Actually, we are hotplugging not only endpoints, but nested PCIe
switches as well: those are PCIe JBOD chassis (with NVMes and SAS
drives).

But to deal with them we have to use an additional patchset, "Movable
bus numbers", which was also described during LPC2020 (including its
problematic points). Here is its RFC, which hasn't changed much since
then:

https://lore.kernel.org/linux-pci/20191024172157.878735-5-s.miroshnichenko@yadro.com/T/

Can Thunderbolt have a chain of switches deep enough to have problems
with bus numbers, when reserving is not the best option?

Serge

> I intended to review and test this iteration of the series more
> closely, but haven't been able to carve out the required time.
> I'm adding some Thunderbolt folks to cc in the hope that they
> can at least test the series on their development branch.
> Getting this upstreamed should really be in the best interest
> of Intel and other promulgators of Thunderbolt.
> 
> Thanks,
> 
> Lukas
Sergei Miroshnichenko Feb. 3, 2021, 8:17 p.m. UTC | #6
Hi Mika,

On Mon, 2021-02-01 at 14:55 +0200, Mika Westerberg wrote:
> On Thu, Jan 28, 2021 at 09:39:29PM +0100, Lukas Wunner wrote:
> > On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> > > On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei Miroshnichenko
> > > wrote:
> > > > ...
> > 
> > I intended to review and test this iteration of the series more
> > closely, but haven't been able to carve out the required time.
> > I'm adding some Thunderbolt folks to cc in the hope that they
> > can at least test the series on their development branch.
> > Getting this upstreamed should really be in the best interest
> > of Intel and other promulgators of Thunderbolt.
> 
> Sure. It seems that this series was submitted in December so probably
> not applicable to the pci.git/next anymore. Anyways, I can give it a
> try
> on a TBT capable system if someone tells me what exactly to test ;-)
> Probably at least that the existing functionality still works but
> something else maybe too?

For setups that worked fine, the only expected change is a possibly
slightly different BAR layout (in /proc/iomem), and there should be the
same number (or more) of BARs assigned as before.

But if there are any problematic setups which weren't able to arrange
new BARs, this patchset may push things a bit further.

In a few days I'll provide an updated branch for our mirror of the
kernel on Github, with a complete and bumped set of patches, reducing
the steps required to test them.

Thanks!

Serge
Hinko Kocevar Feb. 4, 2021, 8:26 a.m. UTC | #7
Dear,

On 2/3/21 8:59 PM, Sergei Miroshnichenko wrote:
> Hi Bjorn,
> 
> On Thu, 2021-01-28 at 08:53 -0600, Bjorn Helgaas wrote:
>> I can certainly see scenarios where this functionality will be
>> useful,
>> but the series currently doesn't mention bug reports that it
>> fixes.  I
>> suspect there *are* some related bug reports, e.g., for Thunderbolt
>> hotplug.  We should dig them up, include pointers to them, and get
>> the
>> reporters to test the series and provide feedback.
> 
> It never occurred to me to actually search for reports, thanks.
> 
>> We had some review traffic on earlier versions, but as far as I can
>> see, nobody has stepped up with any actual signs of approval, e.g.,
>> Tested-by, Reviewed-by, or Acked-by tags.  That's a problem because
>> this touches a lot of sensitive code and I don't want to be stuck
>> fixing issues all by myself.  When issues crop up, I look to the
>> author and reviewers to help out.
> 
> Indeed, I had only a few informal personal responses [from outside of
> our company]:
>   - some machines have no regressions with the kernel built with Clang-
> 12+LTO+IAS;
>   - NVMe hotplug started working for AMD Epyc with a PEX switch;
>   - Another system with several PEX switches is under ongoing
> experiments.
> 


I reached out to Sergei a while ago about these patches on systems I 
work with. They involve a PEX switch on a separate card (MCH).

FWIW, I find this patchset really interesting and useful in modular
systems such as microTCA [1]. In those, up to 12 PCIe hot-swappable
mezzanine cards (AMCs) can be added/removed at any time. Each can be a
PCIe endpoint. The infrastructure usually involves a CPU card with a
PCIe root complex. In addition, a PCIe switch is located on the carrier
hub (MCH) card, which handles the PCIe links for the AMCs.

At this time, if the system is booted without a certain slot being
populated at CPU/Linux startup, adding a card requires a system power
cycle. Today there are already (minor?) issues with the hotplug/hotswap
of the cards that are present at Linux bootup, hence folks who use these
systems usually do not rely on any of these features; as a result boxes
are power cycled. I'm not saying it is all Linux's fault, as microTCA is
inherently complex due to the modularity and versatility of
configurations one can have, even without hotplugging. It is also a
matter of understanding how to configure and use hotplugging, which can
be a challenge on its own.

Some preliminary tests of this patchset that I managed to run so far
suggest that newly added cards are nicely recognized by Linux and are
ready for use without a CPU reboot. I currently do not have access to
the microTCA boxes I use. I hope to be in a better place in a couple of
weeks and will look into these patches again!

Thanks!
//hinko

[1] 
https://www.picmg.org/wp-content/uploads/MicroTCA_Short_Form_Sept_2006.pdf

> Providing an ability to quickly return to the old BARs allocating is
> the reason why I keep wrapping the most intrusive code with "if
> (pci_can_move_bars)". Originally I wanted it to be disabled by default,
> available for those truly desperate. But there was an objection that
> the code will not be ever tested this way.
> 
>> Along that line, I'm a little concerned about how we will be able to
>> reproduce and debug problems.  Issues will likely depend on the
>> topology, the resources of the specific devices involved, and a
>> specific sequence of hotplug operations.  Those may be hard to
>> reproduce even for the reporter.  Maybe this is nothing more than a
>> grep pattern to extract relevant events from dmesg.  Ideal, but maybe
>> impractical, would be a way to reproduce things in a VM using qemu
>> and
>> a script to simulate hotplugs.
> 
> I haven't yet tried the qemu for PCI debugging, definitely need to take
> a look. To help with that, the patchset is supplied with additional
> prints, but CONFIG_PCI_DEBUG=y is still preferred. A simulation will
> reveal if additional debug messages are needed.
> 
>>> The feature is usable not only for hotplug, but also for booting
>>> with a
>>> complete set of BARs assigned, and for Resizable BARs.
>>
>> I'm interested in more details about both of these.  What does "a
>> complete set of BARs assigned" mean?  On x86, the BIOS usually
>> assigns
>> all the BARs ahead of time, but I know that's not the case for all
>> arches.
> 
> Usually yes, but we have at least one x86 PC that skips some BARs (need
> to check out again its particular model and version, IIRC that was
> Z370-F) -- not the crucial ones like BAR0, though. That really got me
> by surprise, so now it is one more case covered.
> 
>> For Resizable BARs, is the point that this allows more flexibility in
>> resizing BARs because we can now move things out of the way?
> 
> Not only things out of the way, but most importantly the bridge windows
> are now floating as well. So it's more freedom for a new BAR to
> "choose" a place.
> 
> An awfully synthetic example:
> 
> |                       RC Address Space                       |
> | orig GPU BAR | fixed BAR |                                   |
> |              | fixed BAR |     Resized GPU BAR        |      |
> 
>>> Tested on a number of x86_64 machines without any special kernel
>>> command
>>> line arguments (some of them -- with older v8 revision of this
>>> patchset):
>>>   - PC: i7-5930K + ASUS X99-A;
>>>   - PC: i5-8500 + ASUS Z370-F;
>>>   - PC: AMD Ryzen 7 3700X + ASRock X570 + AMD 5700 XT (Resizable
>>> BARs);
>>>   - Supermicro Super Server/H11SSL-i: AMD EPYC 7251;
>>>   - HP ProLiant DL380 G5: Xeon X5460;
>>>   - Dell Inspiron N5010: i5 M 480;
>>>   - Dell Precision M6600: i7-2920XM.
>>
>> Does this testing show no change in behavior and no regressions, or
>> does it show that this series fixes cases that previously did not
>> work?  If the latter, some bugzilla reports with before/after dmesg
>> logs would give some insight into how this works and also some good
>> justification.
> 
> Both, actually, but the dmesg logs were never published, so they will
> be the next step -- combined with qemu scripts, if it works.
> 
>>> This patchset is a part of our work on adding support for hot-
>>> adding
>>> chains of chassis full of other bridges, NVME drives, SAS HBAs,
>>> GPUs, etc.
>>> without special requirements such as Hot-Plug Controller,
>>> reservation of
>>> bus numbers or memory regions by firmware, etc.
>>
>> This is a little bit of a chicken and egg situation.  I suspect many
>> people would like this functionality, but currently we either avoid
>> it
>> because it's "known not to work" or we hack around it with the
>> firmware reservations you mention.
> 
> The current (v9) patchset became more symbiotic with other hotplug
> facilities: it started to respect pci=hpmemsize etc, and only drops it
> if such allocation is impossible.
> 
> Attention Button for Assisted Hotplug (pciehp driver) was always
> supported. It is also fully compatible with DPC.
> 
> The goal is to squeeze as much as possible even without special
> hw+fw+sw support, when the only available tool is a manual rescan via
> sysfs.
> 
> Serge
>
David Laight Feb. 4, 2021, 9:34 a.m. UTC | #8
> Actually, we are hotplugging not only endpoints, but nested PCIe
> switches as well: those are PCIe JBOD chassis (with NVMes and SAS
> drives).

From what I remember of playing with some PCI hot-plug hardware
and cardbus extenders to PCI chassis many years ago, bus numbers
were actually a big problem.

A surprising number of I/O cards contain a bridge - so they need a
bus number if hot-plugged.

(In spite of the marketing hype, hot-plug basically let you remove
a working card and replace it with an identical one!
Modern drivers and OSes are likely to handle the errors from
faulty cards better.)

The initial allocation of bus numbers could easily spread out
the unused bus numbers.
Doing that for memory resources may have other problems.
(You probably don't want to allocate the entire range to the
root hub?)

Are the bus numbers exposed as keys/filenames in /sys?
In that case changing them after boot is problematic.
You'd need a map of virtual bus numbers to physical ones.

As well as your 'suspend/resume' sequence, it should
be possible to send a card-remove/insert sequence to
an idle driver.

There is another case where BARs might need moving.
The PCIe spec doesn't allow very long (200ms?) between
reset removal (which might be close to power on) and
the requirement for endpoints to answer config cycles.
If you have to load a large FPGA from a serial EEPROM
this is actually a real problem.
So if the OS does a full probe of the PCI devices it may
find endpoints (or even bridges) that weren't actually there
when the BIOS (or equivalent) did its earlier probe.
Finding space in the middle of the PCI devices for an endpoint
that wants two 1MB BARs is unlikely to succeed!

	David

Mika Westerberg Feb. 4, 2021, 10:49 a.m. UTC | #9
On Wed, Feb 03, 2021 at 08:17:14PM +0000, Sergei Miroshnichenko wrote:
> Hi Mika,
> 
> On Mon, 2021-02-01 at 14:55 +0200, Mika Westerberg wrote:
> > On Thu, Jan 28, 2021 at 09:39:29PM +0100, Lukas Wunner wrote:
> > > On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> > > > On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei Miroshnichenko
> > > > wrote:
> > > > > ...
> > > 
> > > I intended to review and test this iteration of the series more
> > > closely, but haven't been able to carve out the required time.
> > > I'm adding some Thunderbolt folks to cc in the hope that they
> > > can at least test the series on their development branch.
> > > Getting this upstreamed should really be in the best interest
> > > of Intel and other promulgators of Thunderbolt.
> > 
> > Sure. It seems that this series was submitted in December so probably
> > not applicable to the pci.git/next anymore. Anyways, I can give it a
> > try
> > on a TBT capable system if someone tells me what exactly to test ;-)
> > Probably at least that the existing functionality still works but
> > something else maybe too?
> 
> For setups that worked fine, the only expected change is a possible
> little different BAR layout (in /proc/iomem), and there should the same
> quantity (or more) of BARs assigned than before.
> 
> But if there are any problematic setups, which weren't able to arrange
> new BARs, this patchset may push a bit further.

Got it.

> In a few days I'll provide an updated branch for our mirror of the
> kernel on Github, with a complete and bumped set of patches, reducing
> the steps required to test them.

Sounds good, thanks!
Sergei Miroshnichenko Feb. 10, 2021, 7:40 p.m. UTC | #10
On Thu, 2021-02-04 at 12:49 +0200, Mika Westerberg
wrote:
> On Wed, Feb 03, 2021 at 08:17:14PM +0000, Sergei Miroshnichenko
> wrote:
> > On Mon, 2021-02-01 at 14:55 +0200, Mika Westerberg wrote:
> > > On Thu, Jan 28, 2021 at 09:39:29PM +0100, Lukas Wunner wrote:
> > > > On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> > > > > On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei
> > > > > Miroshnichenko
> > > > > wrote:
> > > > > > ...
> > > > 
> > > > I intended to review and test this iteration of the series more
> > > > closely, but haven't been able to carve out the required time.
> > > > I'm adding some Thunderbolt folks to cc in the hope that they
> > > > can at least test the series on their development branch.
> > > > Getting this upstreamed should really be in the best interest
> > > > of Intel and other promulgators of Thunderbolt.
> > > 
> > > Sure. It seems that this series was submitted in December so
> > > probably
> > > not applicable to the pci.git/next anymore. Anyways, I can give
> > > it a
> > > try
> > > on a TBT capable system if someone tells me what exactly to test
> > > ;-)
> > > Probably at least that the existing functionality still works but
> > > something else maybe too?
> > 
> > For setups that worked fine, the only expected change is a possible
> > little different BAR layout (in /proc/iomem), and there should the
> > same
> > quantity (or more) of BARs assigned than before.
> > 
> > But if there are any problematic setups, which weren't able to
> > arrange
> > new BARs, this patchset may push a bit further.
> 
> Got it.
> 
> > In a few days I'll provide an updated branch for our mirror of the
> > kernel on Github, with a complete and bumped set of patches,
> > reducing
> > the steps required to test them.
> 
> Sounds good, thanks!

Hi Mika,

The branch is finally ready, so if you still have time for that, please
take a look:

https://github.com/YADRO-KNS/linux/tree/yadro/pcie_hotplug/movable_bars_v9.1


And a bit off topic: I've also updated the branch with a complete set
of our hotplug-related patches, including an RFC series of movable bus
numbers for hot-adding large and/or nested switches (at the cost of
breaking the sysfs+procfs ABI). In case anyone ever wants to try that,
the pci=realloc,movable_buses kernel command line arguments are needed:

https://github.com/YADRO-KNS/linux/commits/yadro/pcie_hotplug/movable_buses_v9.1

Thanks!

Serge
Lukas Wunner Feb. 10, 2021, 9:46 p.m. UTC | #11
On Wed, Feb 10, 2021 at 07:40:06PM +0000, Sergei Miroshnichenko wrote:
> The branch is finally ready, so if you still have time for that, please
> take a look:
> 
> https://github.com/YADRO-KNS/linux/tree/yadro/pcie_hotplug/movable_bars_v9.1

Just a quick drive-by comment, in commit

    scsi: mpt3sas: Handle a surprise PCI unplug faster

you probably want to call pci_dev_is_disconnected() instead of
adding a new flag.

Thanks,

Lukas
Mika Westerberg Feb. 11, 2021, 11:39 a.m. UTC | #12
Hi,

On Wed, Feb 10, 2021 at 07:40:06PM +0000, Sergei Miroshnichenko wrote:
> On Thu, 2021-02-04 at 12:49 +0200, Mika Westerberg
> wrote:
> > On Wed, Feb 03, 2021 at 08:17:14PM +0000, Sergei Miroshnichenko
> > wrote:
> > > On Mon, 2021-02-01 at 14:55 +0200, Mika Westerberg wrote:
> > > > On Thu, Jan 28, 2021 at 09:39:29PM +0100, Lukas Wunner wrote:
> > > > > On Thu, Jan 28, 2021 at 08:53:16AM -0600, Bjorn Helgaas wrote:
> > > > > > On Fri, Dec 18, 2020 at 08:39:45PM +0300, Sergei
> > > > > > Miroshnichenko
> > > > > > wrote:
> > > > > > > ...
> > > > > 
> > > > > I intended to review and test this iteration of the series more
> > > > > closely, but haven't been able to carve out the required time.
> > > > > I'm adding some Thunderbolt folks to cc in the hope that they
> > > > > can at least test the series on their development branch.
> > > > > Getting this upstreamed should really be in the best interest
> > > > > of Intel and other promulgators of Thunderbolt.
> > > > 
> > > > Sure. It seems that this series was submitted in December so
> > > > probably
> > > > not applicable to the pci.git/next anymore. Anyways, I can give
> > > > it a
> > > > try
> > > > on a TBT capable system if someone tells me what exactly to test
> > > > ;-)
> > > > Probably at least that the existing functionality still works but
> > > > something else maybe too?
> > > 
> > > For setups that worked fine, the only expected change is a possible
> > > little different BAR layout (in /proc/iomem), and there should the
> > > same
> > > quantity (or more) of BARs assigned than before.
> > > 
> > > But if there are any problematic setups, which weren't able to
> > > arrange
> > > new BARs, this patchset may push a bit further.
> > 
> > Got it.
> > 
> > > In a few days I'll provide an updated branch for our mirror of the
> > > kernel on Github, with a complete and bumped set of patches,
> > > reducing
> > > the steps required to test them.
> > 
> > Sounds good, thanks!
> 
> Hi Mika,
> 
> The branch is finally ready, so if you still have time for that, please
> take a look:
> 
> https://github.com/YADRO-KNS/linux/tree/yadro/pcie_hotplug/movable_bars_v9.1

Thanks for sharing!

I tried this series on top of v5.11-rc7 on a Dell XPS 9380 so that I
have two TBT3 devices connected. Each device includes a PCIe switch and
an xHCI endpoint.

What I see is that the hotplug downstream port does not have enough bus
numbers (and other resources) allocated, so when attaching the second
device it does not fit there anymore. The resulting 'lspci -t' output
looks like below:

-[0000:00]-+-00.0
           +-02.0
           +-04.0
           +-08.0
           +-12.0
           +-14.0
           +-14.2
           +-15.0
           +-15.1
           +-16.0
           +-1c.0-[01]----00.0
           +-1c.6-[02]----00.0
           +-1d.0-[03-3b]----00.0-[04-3b]--+-00.0-[05]----00.0
           |                               +-01.0-[06-1f]----00.0-[07-09]--+-02.0-[08]----00.0
           |                               |                               \-04.0-[09]----00.0-[0a]--
           |                               +-02.0-[20]----00.0
           |                               \-04.0-[21-3b]--
           +-1d.4-[3c]----00.0
           +-1f.0
           +-1f.3
           +-1f.4
           \-1f.5

So the last PCIe switch is not visible anymore, and the xHCI on the
second TBT device is not functional either.

On the mainline kernel I get this:

-[0000:00]-+-00.0
           +-02.0
           +-04.0
           +-08.0
           +-12.0
           +-14.0
           +-14.2
           +-15.0
           +-15.1
           +-16.0
           +-1c.0-[01]----00.0
           +-1c.6-[02]----00.0
           +-1d.0-[03-3b]----00.0-[04-3b]--+-00.0-[05]----00.0
           |                               +-01.0-[06-1f]----00.0-[07-1f]--+-02.0-[08]----00.0
           |                               |                               \-04.0-[09-1f]----00.0-[0a-1f]--+-02.0-[0b]----00.0
           |                               |                                                               \-04.0-[0c-1f]--
           |                               +-02.0-[20]----00.0
           |                               \-04.0-[21-3b]--
           +-1d.4-[3c]----00.0
           +-1f.0
           +-1f.3
           +-1f.4
           \-1f.5

In this topology I can add yet another TBT device and there are still
resources available and all the endpoints are functional too.

I can send you the full dmesg and lspci -vv output if needed.
Sergei Miroshnichenko Feb. 11, 2021, 5:45 p.m. UTC | #13
On Thu, 2021-02-11 at 13:39 +0200, mika.westerberg@linux.intel.com
wrote:
> On Wed, Feb 10, 2021 at 07:40:06PM +0000, Sergei Miroshnichenko
> wrote:
> > On Thu, 2021-02-04 at 12:49 +0200, Mika Westerberg
> > wrote:
> > > ...
> > The branch is finally ready, so if you still have time for that,
> > please
> > take a look:
> > 
> > https://github.com/YADRO-KNS/linux/tree/yadro/pcie_hotplug/movable_bars_v9.1
> 
> Thanks for sharing!
> 
> I tried this series on top of v5.11-rc7 on a Dell XPS 9380 so that I
> have two TBT3 devices connected. Each device includes PCIe switch and
> a
> xHCI endpoint.
> 
> What I see that the hotplug downstream port does not have enough bus
> numbers (and other resources allocated) so when attaching the second
> device it does not fit there anymore. The resulting 'lspci -t' output
> looks like below:
> 
> -[0000:00]-+-00.0
>            +-02.0
>            +-04.0
>            +-08.0
>            +-12.0
>            +-14.0
>            +-14.2
>            +-15.0
>            +-15.1
>            +-16.0
>            +-1c.0-[01]----00.0
>            +-1c.6-[02]----00.0
>            +-1d.0-[03-3b]----00.0-[04-3b]--+-00.0-[05]----00.0
>            |                               +-01.0-[06-1f]----00.0-
> [07-09]--+-02.0-[08]----00.0
>            |                               |                         
>       \-04.0-[09]----00.0-[0a]--
>            |                               +-02.0-[20]----00.0
>            |                               \-04.0-[21-3b]--
>            +-1d.4-[3c]----00.0
>            +-1f.0
>            +-1f.3
>            +-1f.4
>            \-1f.5
> 
> So the last PCIE switch is not visible anymore, and the xHCI on the
> second TBT device is not functional either.
> 
> On the mainline kernel I get this:
> 
> -[0000:00]-+-00.0
>            +-02.0
>            +-04.0
>            +-08.0
>            +-12.0
>            +-14.0
>            +-14.2
>            +-15.0
>            +-15.1
>            +-16.0
>            +-1c.0-[01]----00.0
>            +-1c.6-[02]----00.0
>            +-1d.0-[03-3b]----00.0-[04-3b]--+-00.0-[05]----00.0
>            |                               +-01.0-[06-1f]----00.0-
> [07-1f]--+-02.0-[08]----00.0
>            |                               |                         
>       \-04.0-[09-1f]----00.0-[0a-1f]--+-02.0-[0b]----00.0
>            |                               |                         
>                                       \-04.0-[0c-1f]--
>            |                               +-02.0-[20]----00.0
>            |                               \-04.0-[21-3b]--
>            +-1d.4-[3c]----00.0
>            +-1f.0
>            +-1f.3
>            +-1f.4
>            \-1f.5
> 
> In this topology I can add yet another TBT device and there are still
> resources available and all the endpoints are functional too.
> 
> I can send you the full dmesg and lspci -vv output if needed.


What a pity. Yes, please, I would of course like to take a look at why
that happened and compare what went wrong (before and after the
hotplug: lspci -tv, dmesg -t and /proc/iomem with /proc/ioports, if it
wouldn't be too much trouble).

The first patchset wasn't intended to change the bus number behavior,
so I also have to check that locally. And despite the lack of bus
numbers, new endpoints should still be visible; this is for me to check
as well.

Thanks a lot for testing!

Serge
Mika Westerberg Feb. 12, 2021, 12:52 p.m. UTC | #14
Hi,

On Thu, Feb 11, 2021 at 05:45:20PM +0000, Sergei Miroshnichenko wrote:
> What a pity. Yes, please, I would of course like to take a look why
> that happened, and compare, what went wrong (before and after the
> hotplug: lspci -tv, dmesg -t and /proc/iomem with /proc/ioports, if it
> wouldn't be too much trouble).

I just sent these logs to you in a separate email. Let me know if you
need more.

Thanks!
Sergei Miroshnichenko Feb. 12, 2021, 8:54 p.m. UTC | #15
On Fri, 2021-02-12 at 14:52 +0200, mika.westerberg@linux.intel.com
wrote:
> Hi,
> 
> On Thu, Feb 11, 2021 at 05:45:20PM +0000, Sergei Miroshnichenko
> wrote:
> > What a pity. Yes, please, I would of course like to take a look why
> > that happened, and compare, what went wrong (before and after the
> > hotplug: lspci -tv, dmesg -t and /proc/iomem with /proc/ioports, if
> > it
> > wouldn't be too much trouble).
> 
> I just sent these logs to you in a separate email. Let me know if you
> need more.

Thanks, from them it's clear that the "full rescan" approach currently
doesn't involve pci_bus_distribute_available_resources(); that's why
hot-adding a second nested switch breaks: because of non-distributed
free bus numbers. The first switch seems to have been hot-added just
fine, with BARs being moved a bit.

This is to be fixed in v10, along with the
mpt3sas+pci_dev_is_disconnected() issue Lukas found (thanks Lukas!),
the CONFIG_DEBUG_LOCK_ALLOC macro, and more useful debug messages.

Serge
Mika Westerberg Feb. 15, 2021, 2:56 p.m. UTC | #16
On Fri, Feb 12, 2021 at 08:54:18PM +0000, Sergei Miroshnichenko wrote:
> On Fri, 2021-02-12 at 14:52 +0200, mika.westerberg@linux.intel.com
> wrote:
> > Hi,
> > 
> > On Thu, Feb 11, 2021 at 05:45:20PM +0000, Sergei Miroshnichenko
> > wrote:
> > > What a pity. Yes, please, I would of course like to take a look why
> > > that happened, and compare, what went wrong (before and after the
> > > hotplug: lspci -tv, dmesg -t and /proc/iomem with /proc/ioports, if
> > > it
> > > wouldn't be too much trouble).
> > 
> > I just sent these logs to you in a separate email. Let me know if you
> > need more.
> 
> Thanks, from them it's clear that the "full rescan" apprach currently
> doesn't involve the pci_bus_distribute_available_resources(), that's
> why hot-adding a second nested switch breaks: because of non-
> distributed free bus numbers. The first switch seems was hot-added just
> fine, with BARs being moved a bit.
> 
> This is to be fixed in v10, along with the
> mpt3sas+pci_dev_is_disconnected() moment Lukas had found (thanks
> Lukas!), CONFIG_DEBUG_LOCK_ALLOC macro, and a more useful debug
> messages.

Great, thanks! :)