[1/3] x86/quirks: Scan all busses for early PCI quirks

Message ID 20181018183721.27467-1-gpiccoli@canonical.com
State New
Delegated to: Bjorn Helgaas
Headers show
Series
  • [1/3] x86/quirks: Scan all busses for early PCI quirks
Related show

Commit Message

Guilherme G. Piccoli Oct. 18, 2018, 6:37 p.m.
Recently was noticed in an HP GEN9 system that kdump couldn't succeed
due to an irq storm coming from an Intel NIC, narrowed down to be lack
of clearing the MSI/MSI-X enable bits during the kdump kernel boot.
For that, we need an early quirk to manually turn off MSI/MSI-X for
PCI devices - this was worked as an optional boot parameter in a
subsequent patch.

Problem is that in our test system, the Intel NICs were not present in
any secondary bus under the first PCIe root complex, so they couldn't
be reached by the recursion in check_dev_quirk(). Modern systems,
specially with multi-processors and multiple NUMA nodes expose multiple
root complexes, describing more than one PCI hierarchy domain. Currently
the simple recursion present in the early-quirks code from x86 starts a
descending recursion from bus 0000:00, and reach many other busses by
navigating this hierarchy walking through the bridges. This is not
enough in systems with more than one root complex/host bridge, since
the recursion won't "traverse" to other root complexes by starting
statically in 0000:00 (for more details, see [0]).

This patch hence implements the full bus/device/function scan in
early_quirks(), by checking all possible busses instead of using a
recursion based on the first root bus or limiting the search scope to
the first 32 busses (like it was done in the beginning [1]).

[0] https://bugs.launchpad.net/bugs/1797990

[1] From historical perspective, early PCI scan dates back
to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64",
on October/2003. It initially restricted the search to the first
32 busses and slots.

Due to a potential bug found in Nvidia chipsets, the scan
was changed to run only in the first root bus: see
commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")

Finally, secondary busses reachable from the 1st bus were re-added back by:
commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")

Reported-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 arch/x86/kernel/early-quirks.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Bjorn Helgaas Oct. 18, 2018, 10:15 p.m. | #1
On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> Recently was noticed in an HP GEN9 system that kdump couldn't succeed
> due to an irq storm coming from an Intel NIC, narrowed down to be lack
> of clearing the MSI/MSI-X enable bits during the kdump kernel boot.
> For that, we need an early quirk to manually turn off MSI/MSI-X for
> PCI devices - this was worked as an optional boot parameter in a
> subsequent patch.
> 
> Problem is that in our test system, the Intel NICs were not present in
> any secondary bus under the first PCIe root complex, so they couldn't
> be reached by the recursion in check_dev_quirk(). Modern systems,
> specially with multi-processors and multiple NUMA nodes expose multiple
> root complexes, describing more than one PCI hierarchy domain. Currently
> the simple recursion present in the early-quirks code from x86 starts a
> descending recursion from bus 0000:00, and reach many other busses by
> navigating this hierarchy walking through the bridges. This is not
> enough in systems with more than one root complex/host bridge, since
> the recursion won't "traverse" to other root complexes by starting
> statically in 0000:00 (for more details, see [0]).
> 
> This patch hence implements the full bus/device/function scan in
> early_quirks(), by checking all possible busses instead of using a
> recursion based on the first root bus or limiting the search scope to
> the first 32 busses (like it was done in the beginning [1]).

I don't want to expand the early quirk infrastructure unless there is
absolutely no other way to solve this.  The early quirk stuff is
x86-specific, and it's not obvious that this problem is x86-only.

This patch scans buses 0-255, but still only in domain 0, so it won't
help with even more complicated systems that use other domains.

I'm not an IRQ expert, but it seems wrong to me that we are enabling
this interrupt before we're ready for it.  The MSI should target an
IOAPIC.  Can't that IOAPIC entry be masked until later?  I guess the
kdump kernel doesn't know what MSI address the device might be using.

Could the IRQ core be more tolerant of this somehow, e.g., if it
notices incoming interrupts with no handler, could it disable the
IOAPIC entry and fall back to polling periodically until a handler is
added?

> [0] https://bugs.launchpad.net/bugs/1797990
> 
> [1] From historical perspective, early PCI scan dates back
> to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64",
> on October/2003. It initially restricted the search to the first
> 32 busses and slots.
> 
> Due to a potential bug found in Nvidia chipsets, the scan
> was changed to run only in the first root bus: see
> commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")
> 
> Finally, secondary busses reachable from the 1st bus were re-added back by:
> commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")
> 
> Reported-by: Dan Streetman <ddstreet@canonical.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
> ---
>  arch/x86/kernel/early-quirks.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 50d5848bf22e..fd50f9e21623 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
>  	u16 vendor;
>  	u16 device;
>  	u8 type;
> -	u8 sec;
>  	int i;
>  
>  	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
> @@ -760,11 +759,8 @@ static int __init check_dev_quirk(int num, int slot, int func)
>  	type = read_pci_config_byte(num, slot, func,
>  				    PCI_HEADER_TYPE);
>  
> -	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
> -		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
> -		if (sec > num)
> -			early_pci_scan_bus(sec);
> -	}
> +	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
> +		return -1;
>  
>  	if (!(type & 0x80))
>  		return -1;
> @@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
>  
>  void __init early_quirks(void)
>  {
> +	int bus;
> +
>  	if (!early_pci_allowed())
>  		return;
>  
> -	early_pci_scan_bus(0);
> +	for (bus = 0; bus < 256; bus++)
> +		early_pci_scan_bus(bus);
>  }
> -- 
> 2.19.0
>
Guilherme G. Piccoli Oct. 22, 2018, 8:35 p.m. | #2
On 18/10/2018 19:15, Bjorn Helgaas wrote:
> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> [...] 

> I don't want to expand the early quirk infrastructure unless there is
> absolutely no other way to solve this.  The early quirk stuff is
> x86-specific, and it's not obvious that this problem is x86-only.
> 
> This patch scans buses 0-255, but still only in domain 0, so it won't
> help with even more complicated systems that use other domains.
> 
> I'm not an IRQ expert, but it seems wrong to me that we are enabling
> this interrupt before we're ready for it.  The MSI should target an
> IOAPIC.  Can't that IOAPIC entry be masked until later?  I guess the
> kdump kernel doesn't know what MSI address the device might be using.
> 
> Could the IRQ core be more tolerant of this somehow, e.g., if it
> notices incoming interrupts with no handler, could it disable the
> IOAPIC entry and fall back to polling periodically until a handler is
> added?

Hi Bjorn, thanks for your quick reply.
I understand your point, but I think this is inherently an architecture
problem. No matter what solution we decide for, it'll need to be applied
in early boot time, like before the PCI layer gets initialized.

So, I think a first step would be to split the solution "timing" in 2
possibilities:

a) We could try to disable MSIs or whatever approach we take in the
quiesce path of crash_kexec(), before the bootstrap of the kdump kernel.
The pro is we could use PCI handlers to do it generically. The con is
it'd touch that delicate shutdown path, from a broken kernel, and this
is unreliable. Also, I've noticed changes in those crash paths
usually gain huge amount of criticism by community, seems nobody wants
to change a bit of this code, if not utterly necessary.

b) Continue using an early boot approach. IMO, this would be per-arch by
nature.
Currently, powerpc for example does not suffer this issue due to their
arch code performing a FW-aided PCI fundamental reset in the devices[0].
On the other hand, x86 has no generic fundamental reset infrastructure
to my knowledge (we tried some alternatives, like a Bridge reset[1] that
didn't work, or zeroing the the command register, which worked), but if
we go with the IOAPIC way of handling this (which we tried a bit and
failed too), it'll be even more arch-dependent, since IOAPIC is x86 concept.


After discussing here internally, an alternative way for this MSI
approach work without requiring the change in the early PCI
infrastructure is to check if we're in kdump kernel and perform manually
the full scan in that case, instead of changing the generic case as
proposed here. This would still be x86-only, but again, it's difficult
if not impossible to fix all archs using the same code here.

Finally, about multi-domain PCI topologies, I've never saw it on x86, I
wasn't aware that such things existed in x86 - but if required we can
quickly extend the logic to contemplate it too.

Thanks again, looking forward for you suggestions.
Cheers,


Guilherme

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/pci-ioda.c#n3992

[1] Based in https://patchwork.kernel.org/patch/2562841, adapted to work
in early boot time.
Bjorn Helgaas Oct. 23, 2018, 5:03 p.m. | #3
On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> > On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> > [...] 
> I understand your point, but I think this is inherently an architecture
> problem. No matter what solution we decide for, it'll need to be applied
> in early boot time, like before the PCI layer gets initialized.

This is the part I want to know more about.  Apparently there's some
event X between early_quirks() and the PCI device enumeration, and we
must disable MSIs before X:

  setup_arch()
      early_quirks()                     # arch/x86/kernel/early-quirks.c
      early_pci_clear_msi()
  ...
  X
  ...
  pci_scan_root_bus_bridge()
    ...
    DECLARE_PCI_FIXUP_EARLY              # drivers/pci/quirks.c

I want to know specifically what X is.  If we don't know what X is and
all we know is "we have to disable MSIs earlier than PCI init", then
we're likely to break things again in the future by changing the order
of disabling MSIs and whatever X is.

Bjorn

Patch

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 50d5848bf22e..fd50f9e21623 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -731,7 +731,6 @@  static int __init check_dev_quirk(int num, int slot, int func)
 	u16 vendor;
 	u16 device;
 	u8 type;
-	u8 sec;
 	int i;
 
 	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
@@ -760,11 +759,8 @@  static int __init check_dev_quirk(int num, int slot, int func)
 	type = read_pci_config_byte(num, slot, func,
 				    PCI_HEADER_TYPE);
 
-	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
-		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
-		if (sec > num)
-			early_pci_scan_bus(sec);
-	}
+	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
+		return -1;
 
 	if (!(type & 0x80))
 		return -1;
@@ -787,8 +783,11 @@  static void __init early_pci_scan_bus(int bus)
 
 void __init early_quirks(void)
 {
+	int bus;
+
 	if (!early_pci_allowed())
 		return;
 
-	early_pci_scan_bus(0);
+	for (bus = 0; bus < 256; bus++)
+		early_pci_scan_bus(bus);
 }