
[D,v2,0/3] Add kernel parameter 'pci=clearmsi' to clear MSI(X)s early on boot

Message ID 20181108170558.17534-1-mfo@canonical.com

Message

Mauricio Faria de Oliveira Nov. 8, 2018, 5:05 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1797990

(Note: the patch set for Cosmic/Disco should be identical,
 as both trees' master-next branches are at the 4.18.0-11.12
 level, with some more patches in Cosmic, but Disco is
 submitted as well for the sake of process guidelines and
 completeness.)

[Changelog]
 * v2:
   - Reorder patch 1 as 3 to allow for the next change:
   - Gate the bus-scan differences with the cmdline option
     (patch 3 only).  Now all functional changes are gated.

[Impact]

 * A kexec/crash kernel might get stuck and fail to boot
   (for crash kernel, kdump fails to collect a crashdump)
   if a PCI device is buggy/stuck/looping and triggers a
   continuous flood of MSI(X) interrupts (that the kernel
   does not yet know about).

 * This fix made it possible to obtain crashdumps when
   debugging a heavy-load scenario, in which a heavily
   loaded network adapter wouldn't stop triggering MSI-X
   interrupts even after panic()->kdump kicked in.

 * This fix disables MSI(X) in all PCI devices on early
   boot (this is OK as it's (re-)enabled normally later)
   with a kernel cmdline parameter (disabled by default).
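
   To illustrate what "clearing the MSI/MSI-X enable bits" means at the
   PCI config-space level, here is a minimal user-space sketch. It is not
   the code from the patches: it operates on an in-memory copy of a
   256-byte config space rather than on real hardware, and the helper
   names (cfg_read16, cfg_write16, clear_msi_enable_bits) are
   hypothetical. The offsets and bit masks are the standard PCI ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard PCI config-space offsets, capability IDs and bits. */
#define PCI_STATUS           0x06
#define PCI_STATUS_CAP_LIST  0x10    /* device implements a capability list */
#define PCI_CAPABILITY_LIST  0x34    /* offset of the first capability */
#define PCI_CAP_ID_MSI       0x05
#define PCI_CAP_ID_MSIX      0x11
#define PCI_MSG_CTRL         0x02    /* Message Control, relative to the capability */
#define PCI_MSI_ENABLE       0x0001  /* MSI enable bit in Message Control */
#define PCI_MSIX_ENABLE      0x8000  /* MSI-X enable bit in Message Control */

/* Hypothetical accessors for a little-endian config-space buffer. */
static uint16_t cfg_read16(const uint8_t *cfg, size_t off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void cfg_write16(uint8_t *cfg, size_t off, uint16_t val)
{
    cfg[off] = (uint8_t)(val & 0xff);
    cfg[off + 1] = (uint8_t)(val >> 8);
}

/*
 * Hypothetical helper: walk the capability list of one device and clear
 * the MSI and MSI-X enable bits. Returns how many enable bits were set
 * and have now been cleared.
 */
static int clear_msi_enable_bits(uint8_t cfg[256])
{
    int cleared = 0;
    int ttl = 48;   /* bound the walk in case the list loops */
    uint8_t pos;

    if (!(cfg_read16(cfg, PCI_STATUS) & PCI_STATUS_CAP_LIST))
        return 0;

    pos = cfg[PCI_CAPABILITY_LIST] & 0xfc;
    while (pos >= 0x40 && ttl-- > 0) {
        uint8_t id = cfg[pos];
        uint16_t mask = 0;

        if (id == 0xff)     /* no device / invalid read */
            break;
        if (id == PCI_CAP_ID_MSI)
            mask = PCI_MSI_ENABLE;
        else if (id == PCI_CAP_ID_MSIX)
            mask = PCI_MSIX_ENABLE;

        if (mask) {
            uint16_t ctrl = cfg_read16(cfg, (size_t)pos + PCI_MSG_CTRL);

            if (ctrl & mask) {
                cfg_write16(cfg, (size_t)pos + PCI_MSG_CTRL,
                            (uint16_t)(ctrl & ~mask));
                cleared++;
            }
        }
        pos = cfg[pos + 1] & 0xfc;  /* next-capability pointer */
    }
    return cleared;
}
```

   Only the enable bit is touched; the other Message Control bits
   are left as they were, which is why a driver can re-enable
   MSI(X) normally later in boot.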

[Test Case]

 * A synthetic test-case is not yet available; however,
   the affected system/workload triggered the problem
   consistently, and it was used for development/testing.

 * We'll update this bug once a synthetic test-case is
   available; we're working on patching QEMU for this.

 * $ dmesg | grep 'Clearing MSI'
   [    0.000000] Clearing MSI/MSI-X enable bits early in boot (quirk)

 * The output of 'dmesg -t | sort' was compared with the
   option disabled and enabled, in both boot and kexec modes,
   and only expected differences were found (MHz, PIDs, MIPS).

[Regression Potential] 

 * The potential area for regressions is early boot,
   particularly effects of applying quirks during PCI
   bus scan, which is changed/broader w/ these patches.

 * However, all quirks are applied based on PCI ID
   matching, so would only apply if actually targeting
   a new device.

 * Moreover, the new quirk is only applied based on
   a kernel cmdline parameter that is disabled by
   default, which further constrains when it is
   actually in effect.
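
   As a rough illustration of the gating described above, the
   user-space sketch below checks whether a given option appears among
   the comma-separated values of a "pci=" token on a command line. The
   kernel has its own parser for "pci=" options; this helper
   (pci_option_enabled) is hypothetical and only shows the idea that
   the quirk stays off unless 'clearmsi' is explicitly requested.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return true if `opt` is one of the comma-separated
 * values of a "pci=" token in the space-separated `cmdline`.
 */
static bool pci_option_enabled(const char *cmdline, const char *opt)
{
    size_t optlen = strlen(opt);
    const char *p = cmdline;

    while (*p) {
        const char *tok;
        size_t toklen;

        while (*p == ' ')           /* skip separators between tokens */
            p++;
        tok = p;
        while (*p && *p != ' ')     /* find the end of this token */
            p++;
        toklen = (size_t)(p - tok);

        if (toklen > 4 && strncmp(tok, "pci=", 4) == 0) {
            const char *v = tok + 4;
            const char *end = tok + toklen;

            while (v < end) {
                const char *comma = memchr(v, ',', (size_t)(end - v));
                const char *vend = comma ? comma : end;

                /* exact match only, so "clearmsix" does not match "clearmsi" */
                if ((size_t)(vend - v) == optlen &&
                    strncmp(v, opt, optlen) == 0)
                    return true;
                v = comma ? comma + 1 : end;
            }
        }
    }
    return false;
}
```

   With this sketch, a command line without 'pci=clearmsi' yields
   false, so the quirk would be skipped, matching the
   disabled-by-default behavior.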

[Other Info]
 
 * The patch series is still under review/discussion
   upstream, but it's relatively important for Ubuntu
   users at this point, and after internal discussions
   we decided to submit it for SRU.

 * These are links to the linux-pci archive with the
   patches [1, 2, 3]

   [1] [PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks
       https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com/

   [2] [PATCH 2/3] x86/PCI: Export find_cap() to be used in early PCI code
       https://lore.kernel.org/linux-pci/20181018183721.27467-2-gpiccoli@canonical.com/

   [3] [PATCH 3/3] x86/quirks: Add parameter to clear MSIs early on boot
       https://lore.kernel.org/linux-pci/20181018183721.27467-3-gpiccoli@canonical.com/

Guilherme G. Piccoli (3):
  UBUNTU: SAUCE: x86/PCI: Export find_cap() to be used in early PCI code
  UBUNTU: SAUCE: x86/quirks: Add parameter to clear MSIs early on boot
  UBUNTU: SAUCE: x86/quirks: Scan all busses for early PCI quirks

 .../admin-guide/kernel-parameters.txt         |  6 +++
 arch/x86/include/asm/pci-direct.h             |  2 +
 arch/x86/kernel/aperture_64.c                 | 30 +-------------
 arch/x86/kernel/early-quirks.c                | 41 +++++++++++++++++++
 arch/x86/pci/common.c                         |  4 ++
 arch/x86/pci/early.c                          | 25 +++++++++++
 6 files changed, 80 insertions(+), 28 deletions(-)

Comments

Thadeu Lima de Souza Cascardo Nov. 8, 2018, 5:15 p.m. UTC | #1
Same considerations that I had for Xenial.
Thadeu Lima de Souza Cascardo Nov. 8, 2018, 5:21 p.m. UTC | #2
Hi, Mauricio.

So, disco master-next does not correspond yet to our unstable development,
which is already on top of 4.19. The patchset does not apply cleanly to
unstable master branch. Can you submit for unstable, then?

Thanks.
Cascardo.
Mauricio Faria de Oliveira Nov. 8, 2018, 5:28 p.m. UTC | #3
On Thu, Nov 8, 2018 at 3:21 PM Thadeu Lima de Souza Cascardo <cascardo@canonical.com> wrote:

> Hi, Mauricio.
>
> So, disco master-next does not correspond yet to our unstable development,
> which is already on top of 4.19. The patchset does not apply cleanly to
> unstable master branch. Can you submit for unstable, then?
>

Sure; it will take a little while to build/test, but it should hit the list
today as well.


> Thanks.
> Cascardo.
>