[3/3] x86/quirks: Add parameter to clear MSIs early on boot

Message ID 20181018183721.27467-3-gpiccoli@canonical.com
State New
Delegated to: Bjorn Helgaas
Series
  • [1/3] x86/quirks: Scan all busses for early PCI quirks

Commit Message

Guilherme G. Piccoli Oct. 18, 2018, 6:37 p.m.
We observed a kdump failure on x86 that was narrowed down to an MSI irq
storm coming from a PCI network device. The bug manifests as a lack of
progress in the boot process of the kdump kernel, and a flood of kernel
messages like:

[...]
[ 342.265294] do_IRQ: 0.155 No irq handler for vector
[ 342.266916] do_IRQ: 0.155 No irq handler for vector
[ 347.258422] do_IRQ: 14053260 callbacks suppressed
[...]

The root cause of the issue is that the kexec process of the kdump kernel
doesn't ensure PCI devices are reset or MSI capabilities are disabled,
so a PCI adapter could produce a huge amount of irqs which would steal
all the processing time of the CPU (especially since we usually restrict
the kdump kernel to a single CPU).

This patch implements the kernel parameter "pci=clearmsi" to clear the
MSI/MSI-X enable bits in the Message Control register for all PCI devices
during early boot time, thus preventing potential issues in the kexec'ed
kernel. PCI spec also supports/enforces this need (see PCI Local Bus
spec sections 6.8.1.3 and 6.8.2.3).

Suggested-by: Dan Streetman <ddstreet@canonical.com>
Suggested-by: Gavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 .../admin-guide/kernel-parameters.txt         |  6 ++++
 arch/x86/include/asm/pci-direct.h             |  1 +
 arch/x86/kernel/early-quirks.c                | 32 +++++++++++++++++++
 arch/x86/pci/common.c                         |  4 +++
 4 files changed, 43 insertions(+)
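
The register manipulation described in the commit message can be tried
outside the kernel. The following is a minimal userspace sketch, not the
patch itself: it models one function's 256-byte config space as a plain
array, walks the capability list, and clears the MSI and MSI-X enable bits
in Message Control. The helpers cfg_read16(), cfg_write16(), find_cap(),
clear_msi_enable() and build_demo_device() are hypothetical names used only
for this illustration; the register constants are assumed to match
include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE 256

/* Register layout, assumed to match include/uapi/linux/pci_regs.h */
#define PCI_STATUS            0x06
#define PCI_STATUS_CAP_LIST   0x10
#define PCI_CAPABILITY_LIST   0x34
#define PCI_CAP_ID_MSI        0x05
#define PCI_CAP_ID_MSIX       0x11
#define PCI_MSI_FLAGS         0x02   /* Message Control offset in the MSI cap */
#define PCI_MSI_FLAGS_ENABLE  0x0001
#define PCI_MSIX_FLAGS        0x02   /* Message Control offset in the MSI-X cap */
#define PCI_MSIX_FLAGS_ENABLE 0x8000

/* Hypothetical stand-ins for read_pci_config_16()/write_pci_config_16(). */
static uint16_t cfg_read16(const uint8_t *cfg, unsigned int off)
{
	assert(off + 1 < CFG_SIZE);
	return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, unsigned int off, uint16_t val)
{
	assert(off + 1 < CFG_SIZE);
	cfg[off] = (uint8_t)(val & 0xff);
	cfg[off + 1] = (uint8_t)(val >> 8);
}

/* Walk the capability list; return the offset of cap_id, or 0 if absent. */
static unsigned int find_cap(const uint8_t *cfg, uint8_t cap_id)
{
	unsigned int pos, ttl = 48;	/* bound the walk in case the list loops */

	if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
		return 0;

	pos = cfg[PCI_CAPABILITY_LIST] & ~3u;
	while (pos >= 0x40 && ttl--) {
		if (cfg[pos] == cap_id)
			return pos;
		pos = cfg[pos + 1] & ~3u;
	}
	return 0;
}

/* Clear the MSI and MSI-X enable bits; return how many capabilities were found. */
static int clear_msi_enable(uint8_t *cfg)
{
	unsigned int pos;
	uint16_t ctrl;
	int found = 0;

	pos = find_cap(cfg, PCI_CAP_ID_MSI);
	if (pos) {
		ctrl = cfg_read16(cfg, pos + PCI_MSI_FLAGS);
		ctrl = (uint16_t)(ctrl & ~PCI_MSI_FLAGS_ENABLE);
		cfg_write16(cfg, pos + PCI_MSI_FLAGS, ctrl);
		found++;
	}

	pos = find_cap(cfg, PCI_CAP_ID_MSIX);
	if (pos) {
		ctrl = cfg_read16(cfg, pos + PCI_MSIX_FLAGS);
		ctrl = (uint16_t)(ctrl & ~PCI_MSIX_FLAGS_ENABLE);
		cfg_write16(cfg, pos + PCI_MSIX_FLAGS, ctrl);
		found++;
	}
	return found;
}

/* Made-up device: MSI cap at 0x50 and MSI-X cap at 0x70, both left enabled. */
static void build_demo_device(uint8_t *cfg)
{
	memset(cfg, 0, CFG_SIZE);
	cfg_write16(cfg, PCI_STATUS, PCI_STATUS_CAP_LIST);
	cfg[PCI_CAPABILITY_LIST] = 0x50;

	cfg[0x50] = PCI_CAP_ID_MSI;
	cfg[0x51] = 0x70;				/* next capability */
	cfg_write16(cfg, 0x50 + PCI_MSI_FLAGS, 0x0081);	/* enable + 64-bit capable */

	cfg[0x70] = PCI_CAP_ID_MSIX;
	cfg[0x71] = 0x00;				/* end of list */
	cfg_write16(cfg, 0x70 + PCI_MSIX_FLAGS, 0x8003);/* enable + table size field */
}
```

Calling build_demo_device() and then clear_msi_enable() on the same buffer
should leave every other Message Control bit untouched; only the enable
bits change. Unlike the patch, this sketch does no read-back after the
write, since there is no posted write to flush in a plain array.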

Comments

Sinan Kaya Oct. 18, 2018, 8:08 p.m. | #1
On 10/18/2018 2:37 PM, Guilherme G. Piccoli wrote:
> We observed a kdump failure in x86 that was narrowed down to MSI irq
> storm coming from a PCI network device. The bug manifests as a lack of
> progress in the boot process of kdump kernel, and a flood of kernel
> messages like:
> 
> [...]
> [ 342.265294] do_IRQ: 0.155 No irq handler for vector
> [ 342.266916] do_IRQ: 0.155 No irq handler for vector
> [ 347.258422] do_IRQ: 14053260 callbacks suppressed
> [...]

These kinds of issues are usually fixed by fixing the network driver's
shutdown routine to ensure that MSI interrupts are cleared there.
Guilherme G. Piccoli Oct. 18, 2018, 8:13 p.m. | #2
On 18/10/2018 17:08, Sinan Kaya wrote:
> On 10/18/2018 2:37 PM, Guilherme G. Piccoli wrote:
>> We observed a kdump failure in x86 that was narrowed down to MSI irq
>> storm coming from a PCI network device. The bug manifests as a lack of
>> progress in the boot process of kdump kernel, and a flood of kernel
>> messages like:
>>
>> [...]
>> [ 342.265294] do_IRQ: 0.155 No irq handler for vector
>> [ 342.266916] do_IRQ: 0.155 No irq handler for vector
>> [ 347.258422] do_IRQ: 14053260 callbacks suppressed
>> [...]
> 
> These kind of issues are usually fixed by fixing the network driver's
> shutdown routine to ensure that MSI interrupts are cleared there.


Sinan, I'm not sure shutdown handlers for drivers are called in panic
kexec (I remember an old experiment I did, in which loading a kernel
with "kexec -p" didn't trigger the handlers).

But this case is even worse, because the NICs were in PCI passthrough
mode, using vfio. So, they were completely unaware of what happened
in the host kernel.

Also, this is spec compliant - system reset events should guarantee the
bits are cleared (although kexec is not exactly a system reset, it's
similar).

Cheers,


Guilherme
Sinan Kaya Oct. 18, 2018, 8:30 p.m. | #3
On 10/18/2018 4:13 PM, Guilherme G. Piccoli wrote:
>> These kind of issues are usually fixed by fixing the network driver's
>> shutdown routine to ensure that MSI interrupts are cleared there.
> 
> Sinan, I'm not sure shutdown handlers for drivers are called in panic
> kexec (I remember of an old experiment I did, loading a kernel
> with "kexec -p" didn't trigger the handlers).

AFAIK, all shutdown (not remove) routines are called before launching the next
kernel, even in the crash scenario. It is not safe to start the new kernel while
hardware is doing DMA to system memory and triggering interrupts.

The shutdown routine in the PCI core used to disable MSI/MSI-X on behalf of all
endpoints, but it was later decided that this is the responsibility of the
endpoint driver.

commit fda78d7a0ead144f4b2cdb582dcba47911f4952c
Author: Prarit Bhargava <prarit@redhat.com>
Date:   Thu Jan 26 14:07:47 2017 -0500

     PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()

     The pci_bus_type .shutdown method, pci_device_shutdown(), is called from
     device_shutdown() in the kernel restart and shutdown paths.

     Previously, pci_device_shutdown() called pci_msi_shutdown() and
     pci_msix_shutdown().  This disables MSI and MSI-X, which causes the device
     to fall back to raising interrupts via INTx.  But the driver is still bound
     to the device, it doesn't know about this change, and it likely doesn't
     have an INTx handler, so these INTx interrupts cause "nobody cared"
     warnings like this:

       irq 16: nobody cared (try booting with the "irqpoll" option)
       CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.2-1.el7_UNSUPPORTED.x86_64 #1
       Hardware name: Hewlett-Packard HP Z820 Workstation/158B, BIOS J63 v03.90 06/
       ...

     The MSI disabling code was added by d52877c7b1af ("pci/irq: let
     pci_device_shutdown to call pci_msi_shutdown v2") because a driver left MSI
     enabled and kdump failed because the kexeced kernel wasn't prepared to
     receive the MSI interrupts.

     Subsequent commits 1851617cd2da ("PCI/MSI: Disable MSI at enumeration even
     if kernel doesn't support MSI") and e80e7edc55ba ("PCI/MSI: Initialize MSI
     capability for all architectures") changed the kexeced kernel to disable
     all MSIs itself so it no longer depends on the crashed kernel to clean up
     after itself.

     Stop disabling MSI/MSI-X in pci_device_shutdown().  This resolves the
     "nobody cared" unhandled IRQ issue above.  It also allows PCI serial
     devices, which may rely on the MSI interrupts, to continue outputting
     messages during reboot/shutdown.

     [bhelgaas: changelog, drop pci_msi_shutdown() and pci_msix_shutdown() calls
     altogether]
     Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=187351
     Signed-off-by: Prarit Bhargava <prarit@redhat.com>
     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
     CC: Alex Williamson <alex.williamson@redhat.com>
     CC: David Arcari <darcari@redhat.com>
     CC: Myron Stowe <mstowe@redhat.com>
     CC: Lukas Wunner <lukas@wunner.de>
     CC: Keith Busch <keith.busch@intel.com>
     CC: Mika Westerberg <mika.westerberg@linux.intel.com>



> 
> But this case is even worse, because the NICs were in PCI passthrough
> mode, using vfio. So, they were completely unaware of what happened
> in the host kernel.
> 
> Also, this is spec compliant - system reset events should guarantee the
> bits are cleared (although kexec is not exactly a system reset, it's
> similar)
Guilherme G. Piccoli Oct. 22, 2018, 7:44 p.m. | #4
On 18/10/2018 17:30, Sinan Kaya wrote:
> 
> AFAIK, all shutdown (not remove) routines are called before launching
> the next
> kernel even in crash scenario. It is not safe to start the new kernel while
> hardware is doing a DMA to the system memory and triggering interrupts.

Hi Sinan,

I agree with you, it's definitely not safe to start a new kernel with
in-flight DMA transactions, but in the crash scenario I think the
rationale was that the running kernel is broken, so it's even more unreliable
to try to gracefully shut down the devices than to hope for the best and start
the kdump kernel right away heheh

The fact is that the shutdown handlers are not called in the crash scenario.
They come from device_shutdown(); the code paths are as follows:

Regular kexec flow:

syscall_reboot()
  kernel_kexec()
    kernel_restart_prepare()
      device_shutdown()
    machine_kexec()
	
Although if CONFIG_KEXEC_JUMP is set, it doesn't call device_shutdown()
either.


Crash kexec flow:

__crash_kexec()
  machine_kexec()

There are some entry points to __crash_kexec(), like panic() or die() on
x86, for example.
To validate this, one can boot a kernel with the "initcall_debug" parameter
and perform a kexec: if the shutdown handlers are called, a dev_info()
call prints one message per device.
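
The difference between the two flows can be made concrete with a toy model.
The sketch below is plain userspace C and only mirrors the call graphs
written out above; it is not kernel code. All model_* names are hypothetical,
and the kexec_jump flag simply encodes the statement that the regular path
skips device_shutdown() when CONFIG_KEXEC_JUMP is in play.

```c
#include <stdbool.h>
#include <stdio.h>

/* Counts how often the (modelled) driver shutdown handlers were run. */
static int shutdown_calls;

static void model_device_shutdown(void)
{
	shutdown_calls++;
	puts("device_shutdown(): ->shutdown() handlers run");
}

static void model_machine_kexec(void)
{
	puts("machine_kexec(): jumping to the new kernel");
}

static void model_kernel_restart_prepare(void)
{
	model_device_shutdown();
}

/* Regular kexec: devices are quiesced first, unless the jump variant is used. */
static void model_kernel_kexec(bool kexec_jump)
{
	if (!kexec_jump)
		model_kernel_restart_prepare();
	model_machine_kexec();
}

/* Crash kexec (reached from panic()/die()): straight to the new kernel. */
static void model_crash_kexec(void)
{
	model_machine_kexec();
}
```

Running model_kernel_kexec(false) bumps shutdown_calls, while
model_crash_kexec() leaves it unchanged, which is why a device left with
MSI enabled by the crashed kernel can still be raising interrupts when the
kdump kernel starts.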


> Shutdown routine in PCI core used to disable MSI/MSI-x on behalf of all
> endpoints but it was later decided that this is the responsibility of the
> endpoint driver.
> 

This may be a good idea: using the PCI layer to disable MSIs in the
quiesce path of the broken kernel. I'll follow up on this discussion in
my reply to Bjorn.

Thanks,


Guilherme

Patch

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 92eb1f42240d..aeb510e484d4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3161,6 +3161,12 @@ 
 		nomsi		[MSI] If the PCI_MSI kernel config parameter is
 				enabled, this kernel boot option can be used to
 				disable the use of MSI interrupts system-wide.
+		clearmsi	[X86] Clears MSI/MSI-X enable bits early in boot
+				time in order to avoid issues like adapters
+				screaming irqs and preventing boot progress.
+				Also, it enforces the PCI Local Bus spec
+				rule that those bits should be 0 in system reset
+				events (useful for kexec/kdump cases).
 		noioapicquirk	[APIC] Disable all boot interrupt quirks.
 				Safety option to keep boot IRQs enabled. This
 				should never be necessary.
diff --git a/arch/x86/include/asm/pci-direct.h b/arch/x86/include/asm/pci-direct.h
index 813996305bf5..ebb3db2eee41 100644
--- a/arch/x86/include/asm/pci-direct.h
+++ b/arch/x86/include/asm/pci-direct.h
@@ -15,5 +15,6 @@  extern void write_pci_config(u8 bus, u8 slot, u8 func, u8 offset, u32 val);
 extern void write_pci_config_byte(u8 bus, u8 slot, u8 func, u8 offset, u8 val);
 extern void write_pci_config_16(u8 bus, u8 slot, u8 func, u8 offset, u16 val);
 
+extern unsigned int pci_early_clear_msi;
 extern int early_pci_allowed(void);
 #endif /* _ASM_X86_PCI_DIRECT_H */
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index fd50f9e21623..21060d80441e 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -28,6 +28,37 @@ 
 #include <asm/irq_remapping.h>
 #include <asm/early_ioremap.h>
 
+static void __init early_pci_clear_msi(int bus, int slot, int func)
+{
+	int pos;
+	u16 ctrl;
+
+	if (likely(!pci_early_clear_msi))
+		return;
+
+	pr_info_once("Clearing MSI/MSI-X enable bits early in boot (quirk)\n");
+
+	pos = pci_early_find_cap(bus, slot, func, PCI_CAP_ID_MSI);
+	if (pos) {
+		ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS);
+		ctrl &= ~PCI_MSI_FLAGS_ENABLE;
+		write_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS, ctrl);
+
+		/* Read again to flush previous write */
+		ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSI_FLAGS);
+	}
+
+	pos = pci_early_find_cap(bus, slot, func, PCI_CAP_ID_MSIX);
+	if (pos) {
+		ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS);
+		ctrl &= ~PCI_MSIX_FLAGS_ENABLE;
+		write_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS, ctrl);
+
+		/* Read again to flush previous write */
+		ctrl = read_pci_config_16(bus, slot, func, pos + PCI_MSIX_FLAGS);
+	}
+}
+
 static void __init fix_hypertransport_config(int num, int slot, int func)
 {
 	u32 htcfg;
@@ -709,6 +740,7 @@  static struct chipset early_qrk[] __initdata = {
 		PCI_CLASS_BRIDGE_HOST, PCI_ANY_ID, 0, force_disable_hpet},
 	{ PCI_VENDOR_ID_BROADCOM, 0x4331,
 	  PCI_CLASS_NETWORK_OTHER, PCI_ANY_ID, 0, apple_airport_reset},
+	{ PCI_ANY_ID, PCI_ANY_ID, 0, 0, 0, early_pci_clear_msi},
 	{}
 };
 
diff --git a/arch/x86/pci/common.c b/arch/x86/pci/common.c
index d4ec117c1142..7f6f85bd47a3 100644
--- a/arch/x86/pci/common.c
+++ b/arch/x86/pci/common.c
@@ -32,6 +32,7 @@  int noioapicreroute = 1;
 #endif
 int pcibios_last_bus = -1;
 unsigned long pirq_table_addr;
+unsigned int pci_early_clear_msi;
 const struct pci_raw_ops *__read_mostly raw_pci_ops;
 const struct pci_raw_ops *__read_mostly raw_pci_ext_ops;
 
@@ -604,6 +605,9 @@  char *__init pcibios_setup(char *str)
 	} else if (!strcmp(str, "skip_isa_align")) {
 		pci_probe |= PCI_CAN_SKIP_ISA_ALIGN;
 		return NULL;
+	} else if (!strcmp(str, "clearmsi")) {
+		pci_early_clear_msi = 1;
+		return NULL;
 	} else if (!strcmp(str, "noioapicquirk")) {
 		noioapicquirk = 1;
 		return NULL;