Message ID | 20181018183721.27467-1-gpiccoli@canonical.com
---|---
State | Changes Requested
Delegated to | Bjorn Helgaas
Series | [1/3] x86/quirks: Scan all busses for early PCI quirks
On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> Recently was noticed in an HP GEN9 system that kdump couldn't succeed due to an irq storm coming from an Intel NIC, narrowed down to be lack of clearing the MSI/MSI-X enable bits during the kdump kernel boot. For that, we need an early quirk to manually turn off MSI/MSI-X for PCI devices - this was worked as an optional boot parameter in a subsequent patch.
>
> Problem is that in our test system, the Intel NICs were not present in any secondary bus under the first PCIe root complex, so they couldn't be reached by the recursion in check_dev_quirk(). Modern systems, specially with multi-processors and multiple NUMA nodes expose multiple root complexes, describing more than one PCI hierarchy domain. Currently the simple recursion present in the early-quirks code from x86 starts a descending recursion from bus 0000:00, and reach many other busses by navigating this hierarchy walking through the bridges. This is not enough in systems with more than one root complex/host bridge, since the recursion won't "traverse" to other root complexes by starting statically in 0000:00 (for more details, see [0]).
>
> This patch hence implements the full bus/device/function scan in early_quirks(), by checking all possible busses instead of using a recursion based on the first root bus or limiting the search scope to the first 32 busses (like it was done in the beginning [1]).

I don't want to expand the early quirk infrastructure unless there is absolutely no other way to solve this. The early quirk stuff is x86-specific, and it's not obvious that this problem is x86-only.

This patch scans buses 0-255, but still only in domain 0, so it won't help with even more complicated systems that use other domains.

I'm not an IRQ expert, but it seems wrong to me that we are enabling this interrupt before we're ready for it. The MSI should target an IOAPIC. Can't that IOAPIC entry be masked until later? I guess the kdump kernel doesn't know what MSI address the device might be using.

Could the IRQ core be more tolerant of this somehow, e.g., if it notices incoming interrupts with no handler, could it disable the IOAPIC entry and fall back to polling periodically until a handler is added?

> [0] https://bugs.launchpad.net/bugs/1797990
>
> [1] From historical perspective, early PCI scan dates back to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64", on October/2003. It initially restricted the search to the first 32 busses and slots.
>
> Due to a potential bug found in Nvidia chipsets, the scan was changed to run only in the first root bus: see commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")
>
> Finally, secondary busses reachable from the 1st bus were re-added back by: commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")
>
> Reported-by: Dan Streetman <ddstreet@canonical.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
> ---
>  arch/x86/kernel/early-quirks.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 50d5848bf22e..fd50f9e21623 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
>  	u16 vendor;
>  	u16 device;
>  	u8 type;
> -	u8 sec;
>  	int i;
>
>  	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
> @@ -760,11 +759,8 @@
>  	type = read_pci_config_byte(num, slot, func,
>  				    PCI_HEADER_TYPE);
>
> -	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
> -		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
> -		if (sec > num)
> -			early_pci_scan_bus(sec);
> -	}
> +	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
> +		return -1;
>
>  	if (!(type & 0x80))
>  		return -1;
> @@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
>
>  void __init early_quirks(void)
>  {
> +	int bus;
> +
>  	if (!early_pci_allowed())
>  		return;
>
> -	early_pci_scan_bus(0);
> +	for (bus = 0; bus < 256; bus++)
> +		early_pci_scan_bus(bus);
>  }
> --
> 2.19.0
On 18/10/2018 19:15, Bjorn Helgaas wrote:
> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> [...]
> I don't want to expand the early quirk infrastructure unless there is absolutely no other way to solve this. The early quirk stuff is x86-specific, and it's not obvious that this problem is x86-only.
>
> This patch scans buses 0-255, but still only in domain 0, so it won't help with even more complicated systems that use other domains.
>
> I'm not an IRQ expert, but it seems wrong to me that we are enabling this interrupt before we're ready for it. The MSI should target an IOAPIC. Can't that IOAPIC entry be masked until later? I guess the kdump kernel doesn't know what MSI address the device might be using.
>
> Could the IRQ core be more tolerant of this somehow, e.g., if it notices incoming interrupts with no handler, could it disable the IOAPIC entry and fall back to polling periodically until a handler is added?

Hi Bjorn, thanks for your quick reply. I understand your point, but I think this is inherently an architecture problem. No matter what solution we decide on, it will need to be applied early in boot, before the PCI layer gets initialized. So, a first step would be to split the solution "timing" into two possibilities:

a) We could try to disable MSIs (or whatever approach we take) in the quiesce path of crash_kexec(), before the bootstrap of the kdump kernel. The pro is that we could use PCI handlers to do it generically. The con is that it would touch that delicate shutdown path from a broken kernel, which is unreliable. Also, I've noticed that changes in those crash paths usually draw a huge amount of criticism from the community; it seems nobody wants to change a bit of this code unless utterly necessary.

b) Continue using an early boot approach. IMO, this would be per-arch by nature. Currently, powerpc for example does not suffer from this issue because its arch code performs a FW-aided PCI fundamental reset on the devices [0]. On the other hand, x86 has no generic fundamental reset infrastructure to my knowledge (we tried some alternatives, like a bridge reset [1] that didn't work, or zeroing the command register, which worked), but if we go the IOAPIC way of handling this (which we tried a bit and failed too), it will be even more arch-dependent, since the IOAPIC is an x86 concept.

After discussing here internally, an alternative way to make this MSI approach work without changing the early PCI infrastructure is to check whether we're in a kdump kernel and manually perform the full scan only in that case, instead of changing the generic case as proposed here. This would still be x86-only, but again, it's difficult if not impossible to fix all arches with the same code here.

Finally, about multi-domain PCI topologies: I've never seen one on x86 - I wasn't aware such things existed there - but if required we can quickly extend the logic to cover that too.

Thanks again, looking forward to your suggestions.
Cheers,

Guilherme

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/pci-ioda.c#n3992
[1] Based on https://patchwork.kernel.org/patch/2562841, adapted to work in early boot time.
On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> > On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> > [...]
> I understand your point, but I think this is inherently an architecture problem. No matter what solution we decide for, it'll need to be applied in early boot time, like before the PCI layer gets initialized.

This is the part I want to know more about. Apparently there's some event X between early_quirks() and the PCI device enumeration, and we must disable MSIs before X:

  setup_arch()
    early_quirks()              # arch/x86/kernel/early-quirks.c
      early_pci_clear_msi()
  ...
  X
  ...
  pci_scan_root_bus_bridge()
  ...
  DECLARE_PCI_FIXUP_EARLY       # drivers/pci/quirks.c

I want to know specifically what X is. If we don't know what X is and all we know is "we have to disable MSIs earlier than PCI init", then we're likely to break things again in the future by changing the order of disabling MSIs and whatever X is.

Bjorn
On 23/10/2018 14:03, Bjorn Helgaas wrote:
> On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
>> On 18/10/2018 19:15, Bjorn Helgaas wrote:
>>> [...]
>> I understand your point, but I think this is inherently an architecture problem. No matter what solution we decide for, it'll need to be applied in early boot time, like before the PCI layer gets initialized.
>
> This is the part I want to know more about. Apparently there's some event X between early_quirks() and the PCI device enumeration, and we must disable MSIs before X:
>
>   setup_arch()
>     early_quirks()              # arch/x86/kernel/early-quirks.c
>       early_pci_clear_msi()
>   ...
>   X
>   ...
>   pci_scan_root_bus_bridge()
>   ...
>   DECLARE_PCI_FIXUP_EARLY       # drivers/pci/quirks.c
>
> I want to know specifically what X is. If we don't know what X is and all we know is "we have to disable MSIs earlier than PCI init", then we're likely to break things again in the future by changing the order of disabling MSIs and whatever X is.
>
> Bjorn

Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread more than two years later, but recent discussions led to a better understanding of this 'X' point, thanks to Thomas!

For those that deleted this thread from their email clients, it's available in [0] - the summary is that we faced an IRQ storm really early in boot, due to bogus PCIe device MSI behavior, when booting a kdump kernel. This led the machine to get stuck in boot, so we couldn't kdump. The solution hereby proposed is to clear MSI interrupts early in x86 boot, if a parameter is provided. I don't have the reproducer anymore, and it was pretty hard to reproduce in virtual environments.

So, about the 'X', Bjorn: in another recent thread about IRQ storms [1], Thomas clarified that, and after a brief discussion, it seems there's no better way to prevent the MSI storm other than clearing the MSI capability early in boot. As discussed both here and in thread [1], this is indeed a per-architecture issue (powerpc is not subject to it, due to a better FW reset mechanism), so I think we could still benefit from having this idea implemented upstream, at least on x86 (we could expand to other architectures in the future, if desired).

As a "test" data point, this was implemented in Ubuntu (the same 3 patches present in this series) for ~2 years and we haven't received bug reports - I mention that because I understand your concerns about expanding the early PCI quirks scope.

Let me know your thoughts. I'd suggest everyone read thread [1], which addresses a similar issue at a different "moment" of the system boot and provides more insight on why the early MSI clearing seems to make sense.

Thanks,

Guilherme

[0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com
[1] https://lore.kernel.org/lkml/87y2js3ghv.fsf@nanos.tec.linutronix.de
On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> > [...]
> > This is the part I want to know more about. Apparently there's some event X between early_quirks() and the PCI device enumeration, and we must disable MSIs before X:
> >
> >   setup_arch()
> >     early_quirks()              # arch/x86/kernel/early-quirks.c
> >       early_pci_clear_msi()
> >   ...
> >   X
> >   ...
> >   pci_scan_root_bus_bridge()
> >   ...
> >   DECLARE_PCI_FIXUP_EARLY       # drivers/pci/quirks.c
> >
> > I want to know specifically what X is. If we don't know what X is and all we know is "we have to disable MSIs earlier than PCI init", then we're likely to break things again in the future by changing the order of disabling MSIs and whatever X is.
>
> Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years later, but recent discussions led to a better understanding of this 'X' point, thanks to Thomas!
>
> For those that deleted this thread from their email clients, it's available in [0] - the summary is that we faced an IRQ storm really early in boot, due to a bogus PCIe device MSI behavior, when booting a kdump kernel. This led the machine to get stuck in the boot and we couldn't kdump. The solution hereby proposed is to clear MSI interrupts early in x86, if a parameter is provided. I don't have the reproducer anymore and it was pretty hard to reproduce in virtual environments.
>
> So, about the 'X' Bjorn, in another recent thread about IRQ storms [1], Thomas clarified that and after a brief discussion, it seems there's no better way to prevent the MSI storm other than clearing the MSI capability early in boot. As discussed both here and in thread [1], this is indeed a per-architecture issue (powerpc is not subject to that, due to a better FW reset mechanism), so I think we still could benefit in having this idea implemented upstream, at least in x86 (we could expand to other architectures if desired, in the future).
>
> As a "test" data point, this was implemented in Ubuntu (same 3 patches present in this series) for ~2 years and we haven't received bug reports - I'm saying that because I understand your concerns about expanding the early PCI quirks scope.
>
> Let me know your thoughts. I'd suggest all to read thread [1], which addresses a similar issue but in a different "moment" of the system boot and provides some more insight on why the early MSI clearing seems to make sense.

I guess Thomas' patch [2] (from thread [1]) doesn't solve this problem?

I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific, and (b) even on x86, early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports.

If I understand Thomas' email correctly, the IRQ storm occurs here:

  start_kernel
    setup_arch
      early_quirks                  # x86-only
        ...
        read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
          outl(..., 0xcf8)          # PCI segment 0 only
          inw(0xcfc)
    local_irq_enable
      ...
      native_irq_enable
        asm("sti")                  # <-- enable IRQ, storm occurs

native_irq_enable() happens long before we discover PCI host bridges and run the normal PCI quirks, so those would be too late to disable MSIs.

It doesn't seem practical to disable MSIs in the kdump kernel at the PCI level. I was hoping we could disable them somewhere in the IRQ code, e.g., at IOAPICs, but I think Thomas is saying that's not feasible.

It seems like the only option left is to disable MSIs before the kexec. We used to clear the MSI/MSI-X Enable bits in pci_device_shutdown(), but that broke console devices that relied on MSI and caused "nobody cared" warnings when the devices fell back to using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()") left them unchanged.

pci_device_shutdown() still clears the Bus Master Enable bit if we're doing a kexec and the device is in D0-D3hot, which should also disable MSI/MSI-X. Why doesn't this solve the problem? Is this because the device causing the storm was in PCI_UNKNOWN state?

> [0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com
> [1] https://lore.kernel.org/lkml/87y2js3ghv.fsf@nanos.tec.linutronix.de

[2] https://lore.kernel.org/lkml/87tuueftou.fsf@nanos.tec.linutronix.de/

Notes to my future self about related changes:

2008-04-23 d52877c7b1af ("pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2")
    Disable MSI before kexec because new kernel isn't prepared for MSI

2011-10-17 d5dea7d95c48 ("PCI: msi: Disable msi interrupts when we initialize a pci device")
    Disable MSI/MSI-X at boot; only works for new kernels with CONFIG_PCI_MSI=y

2012-04-27 b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown")
    Disable bus mastering on shutdown (if enable/disable nested correctly)

2013-02-04 7897e6022761 ("PCI: Disable Bus Master unconditionally in pci_device_shutdown()")
    Disable bus mastering unconditionally (ignore nested enable/disable)

2013-03-14 6e0eda3c3898 ("PCI: Don't try to disable Bus Master on disconnected PCI devices")
    Don't touch bus mastering for D3cold or unknown state

2015-05-07 1851617cd2da ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
    Disable MSI/MSI-X at boot even without CONFIG_PCI_MSI=y; broke Open Firmware arches

2015-10-21 e80e7edc55ba ("PCI/MSI: Initialize MSI capability for all architectures")
    Disable MSI/MSI-X at boot for all arches, including Open Firmware

2017-01-26 fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()")
    Leave MSI enabled before kexec; disabling causes device to use INTx, which drivers aren't prepared for, causing "nobody cared" warnings
Bjorn,

On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
>> On 23/10/2018 14:03, Bjorn Helgaas wrote:
>>> [...]
> I guess Thomas' patch [2] (from thread [1]) doesn't solve this problem?

No. As I explained in [1], the patch from [2] cannot solve it: [2] addresses the problem Liu was trying to solve, and it requires that there is a registered interrupt handler which knows how to shut up the interrupt.

> I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific and (b) even on x86 early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports.
>
> If I understand Thomas' email correctly, the IRQ storm occurs here:
>
>   start_kernel
>     setup_arch
>       early_quirks                  # x86-only
>         ...
>         read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
>           outl(..., 0xcf8)          # PCI segment 0 only
>           inw(0xcfc)
>     local_irq_enable
>       ...
>       native_irq_enable
>         asm("sti")                  # <-- enable IRQ, storm occurs
>
> native_irq_enable() happens long before we discover PCI host bridges and run the normal PCI quirks, so those would be too late to disable MSIs.

Correct.

> It doesn't seem practical to disable MSIs in the kdump kernel at the PCI level. I was hoping we could disable them somewhere in the IRQ code, e.g., at IOAPICs, but I think Thomas is saying that's not feasible.

MSIs are not even going near the IOAPIC, and as long as the interrupt core does not have an interrupt set up for the device, it has no idea where to look to shut it down. Actually, it does not even reach the interrupt core: the raised vector arrives at the CPU and the low level code sees "no handler associated, ignore it". We cannot do anything from the low level code, because all we know is that the vector was raised; we have absolutely zero clue where it came from. At that point the IO-APIC interrupts are definitely not the problem, because they are all disabled.

> It seems like the only option left is to disable MSIs before the kexec. We used to clear the MSI/MSI-X Enable bits in pci_device_shutdown(), but that broke console devices that relied on MSI and caused "nobody cared" warnings when the devices fell back to using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()") left them unchanged.

That might be solvable, because INTx arrives at the IO-APIC and we could mask all the INTx related IO-APIC lines, but that's icky because of this:

> pci_device_shutdown() still clears the Bus Master Enable bit if we're doing a kexec and the device is in D0-D3hot, which should also disable MSI/MSI-X. Why doesn't this solve the problem? Is this because the device causing the storm was in PCI_UNKNOWN state?

That's indeed a really good question.

Thanks,

        tglx
Bjorn,

On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> pci_device_shutdown() still clears the Bus Master Enable bit if we're doing a kexec and the device is in D0-D3hot, which should also disable MSI/MSI-X. Why doesn't this solve the problem? Is this because the device causing the storm was in PCI_UNKNOWN state?
>
> That's indeed a really good question.

So we do that on kexec, but is that true when starting a kdump kernel from a kernel crash? I doubt it.

Thanks,

        tglx
On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're doing a kexec and the device is in D0-D3hot, which should also disable MSI/MSI-X. Why doesn't this solve the problem? Is this because the device causing the storm was in PCI_UNKNOWN state?
> >
> > That's indeed a really good question.
>
> So we do that on kexec, but is that true when starting a kdump kernel from a kernel crash? I doubt it.

Ah, right, I bet that's it, thanks. The kdump path is basically this:

  crash_kexec
    machine_kexec

while the usual kexec path is:

  kernel_kexec
    kernel_restart_prepare
      device_shutdown
        while (!list_empty(&devices_kset->list))
          dev->bus->shutdown
            pci_device_shutdown       # pci_bus_type.shutdown
    machine_kexec

So maybe we need to explore doing some or all of device_shutdown() in the crash_kexec() path as well as in the kernel_kexec() path.
Bjorn,

On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> [...]
>> So we do that on kexec, but is that true when starting a kdump kernel from a kernel crash? I doubt it.
>
> Ah, right, I bet that's it, thanks. The kdump path is basically this:
>
>   crash_kexec
>     machine_kexec
>
> while the usual kexec path is:
>
>   kernel_kexec
>     kernel_restart_prepare
>       device_shutdown
>         while (!list_empty(&devices_kset->list))
>           dev->bus->shutdown
>             pci_device_shutdown       # pci_bus_type.shutdown
>     machine_kexec
>
> So maybe we need to explore doing some or all of device_shutdown() in the crash_kexec() path as well as in the kernel_kexec() path.

The problem is that if the machine crashed, anything you attempt before starting the crash kernel reduces the chance that the crash kernel actually starts.

Is there something at the root bridge level which allows us to tell the underlying busses to shut up, reset or go into a defined state? That might avoid chasing lists which might already be unreliable.

Thanks,

        tglx
[+cc Rafael for question about ACPI method for PCI host bridge reset]

On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> > [...]
> > So maybe we need to explore doing some or all of device_shutdown() in the crash_kexec() path as well as in the kernel_kexec() path.
>
> The problem is that if the machine crashed anything you try to attempt before starting the crash kernel is reducing the chance that the crash kernel actually starts.

Right.

> Is there something at the root bridge level which allows to tell the underlying busses to shut up, reset or go into a defined state? That might avoid chasing lists which might be already unreliable.

Maybe we need some kind of crash_device_shutdown() that does the minimal thing to protect the kdump kernel from devices.

The programming model for conventional PCI host bridges and PCIe Root Complexes is device-specific since they're outside the PCI domain. There probably *are* ways to do those things, but you would need a native host bridge driver or something like an ACPI method. I'm not aware of an ACPI way to do this, but I added Rafael in case he is.

A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary.

Bjorn
Bjorn Helgaas <helgaas@kernel.org> writes:
> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>
> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>> [...]
>> Is there something at the root bridge level which allows to tell the underlying busses to shut up, reset or go into a defined state? That might avoid chasing lists which might be already unreliable.
>
> Maybe we need some kind of crash_device_shutdown() that does the minimal thing to protect the kdump kernel from devices.

The kdump kernel does not use any memory the original kernel uses, which should be a minimal and fairly robust level of protection until the device drivers can be loaded and get ahold of things.

> The programming model for conventional PCI host bridges and PCIe Root Complexes is device-specific since they're outside the PCI domain. There probably *are* ways to do those things, but you would need a native host bridge driver or something like an ACPI method. I'm not aware of an ACPI way to do this, but I added Rafael in case he is.
>
> A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary.

Unless I am confused, DMA'ing to memory that is not already in use is completely broken whether or not you are using the kdump kernel.

Eric
ebiederm@xmission.com (Eric W. Biederman) writes:
> Bjorn Helgaas <helgaas@kernel.org> writes:
>> [...]
>> Maybe we need some kind of crash_device_shutdown() that does the minimal thing to protect the kdump kernel from devices.
>
> The kdump kernel does not use any memory the original kernel uses. Which should be a minimal and fairly robust level of protection until the device drivers can be loaded and get ahold of things.
>
>> A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary.
>
> Unless I am confused DMA'ing to memory that is not already in use is completely broken wether or not you are using the kdump kernel.

Bah. I was confused because I had not read up-thread. MSI mixes DMA and irqs, so confusion is easy.

So the problem is screaming irqs when the kernel is booting up. This is a fundamentally tricky problem. For ordinary irqs you can have this with level triggered irqs, and the kernel has code that will shut down the irq at the ioapic level. Then the kernel continues by polling the irq source.

I am still missing details, but my first question is: can our general solution to screaming level triggered irqs apply?

How can edge triggered MSI irqs be screaming?

Is there something we can do in enabling the APICs or IOAPICs that would allow this to be handled better? From memory, when we enable the APICs and IOAPICs we completely clear the APIC entries, and so should be disabling sources.

Is the problem perhaps that we wind up using an APIC entry that was previously used for the MSI interrupt as something else when we reprogram them? Even with this, why doesn't the generic code to stop screaming irqs apply here?

Eric
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote: > ebiederm@xmission.com (Eric W. Biederman) writes: > For ordinary irqs you can have this with level triggered irqs > and the kernel has code that will shutdown the irq at the ioapic > level. Then the kernel continues by polling the irq source. > > I am still missing details but my first question is can our general > solution to screaming level triggered irqs apply? No. > How can edge triggered MSI irqs be screaming? > > Is there something we can do in enabling the APICs or IOAPICs that > would allow this to be handled better. My memory when we enable > the APICs and IOAPICs we completely clear the APIC entries and so > should be disabling sources. Yes, but MSI has nothing to do with APIC/IOAPIC > Is the problem perhaps that we wind up using an APIC entry that was > previously used for the MSI interrupt as something else when we > reprogram them? Even with this why doesn't the generic code > to stop screaming irqs apply here? Again. No. The problem cannot be solved at the APIC level. The APIC is the receiving end of MSI and has absolutely no control over it. An MSI interrupt is a (DMA) write to the local APIC base address 0xfeexxxxx which has the target CPU and control bits encoded in the lower bits. The written data is the vector and more control bits. The only way to stop that is to shut it up at the PCI device level. Assume the following situation: - PCI device has MSI enabled and a valid target vector assigned - Kernel crashes - Kdump kernel starts - PCI device raises interrupts which result in the MSI write - Kdump kernel enables interrupts and the pending vector is raised in the CPU. - The CPU has no interrupt descriptor assigned to the vector and does not even know where the interrupt originated from. So it treats it like any other spurious interrupt to an unassigned vector, emits a ratelimited message and ACKs the interrupt at the APIC. 
- PCI device behaves stupid and reraises the interrupt for whatever reason. - Lather, rinse and repeat. Unfortunately there is no way to tell the APIC "Mask vector X" and the dump kernel does neither know which device it comes from nor does it have enumerated PCI completely which would reset the device and shutup the spew. Due to the interrupt storm it does not get that far. Thanks, tglx
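For illustration, the MSI write Thomas describes can be sketched as follows. This is a deliberately simplified model of the xAPIC-era encoding (physical destination mode, fixed delivery, edge trigger); the helper names are hypothetical, not kernel APIs.

```c
#include <stdint.h>

/* The device's MSI capability holds an address/data pair; raising the
 * interrupt is a DMA write of `data` to `addr`.  Field layout per the
 * x86 architecture docs: address bits 31:20 are fixed at 0xFEE, bits
 * 19:12 carry the destination APIC ID; the low byte of the data is the
 * vector.  Hypothetical helpers for the common (simplest) case. */

static inline uint32_t msi_compose_addr(uint8_t dest_apic_id)
{
    return 0xFEE00000u | ((uint32_t)dest_apic_id << 12);
}

static inline uint32_t msi_compose_data(uint8_t vector)
{
    /* Fixed delivery mode, edge triggered: all control bits zero. */
    return vector;
}

static inline uint8_t msi_addr_dest(uint32_t addr)
{
    return (uint8_t)((addr >> 12) & 0xFF);
}
```

This is also why the storm cannot be masked at the receiving end: nothing in this write identifies the originating device, only the target CPU and vector.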
On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote: > Unfortunately there is no way to tell the APIC "Mask vector X" and the > dump kernel does neither know which device it comes from nor does it > have enumerated PCI completely which would reset the device and shutup > the spew. Due to the interrupt storm it does not get that far. Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable on all active PCI devices in the crashing kernel before starting the crash kernel? So that the crash kernel starts with a clean slate? Guilherme's original patches from 2018 iterate over all 256 PCI buses. That might impact boot time negatively. The reason he has to do that is because the crashing kernel doesn't know which devices exist and which have interrupts enabled. However the crashing kernel has that information. It should either disable interrupts itself or pass the necessary information to the crashing kernel as setup_data or whatever. Guilherme's patches add a "clearmsi" command line parameter. I guess the idea is that the parameter is always passed to the crash kernel but the patches don't do that, which seems odd. Thanks, Lukas
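A minimal sketch of what Lukas proposes — set DisINTx, clear MSI Enable, clear MSI-X Enable — operating on a 256-byte copy of config space rather than real config accessors. Register offsets and bit positions follow the PCI specification; `quiesce_device()` is a hypothetical name, not code from the actual patches.

```c
#include <stdint.h>

#define PCI_COMMAND               0x04
#define PCI_CAP_PTR               0x34
#define PCI_CAP_ID_MSI            0x05
#define PCI_CAP_ID_MSIX           0x11
#define PCI_COMMAND_INTX_DISABLE  (1u << 10)
#define PCI_MSI_FLAGS_ENABLE      (1u << 0)
#define PCI_MSIX_FLAGS_ENABLE     (1u << 15)

static uint16_t rd16(const uint8_t *cfg, unsigned int off)
{
    return (uint16_t)(cfg[off] | (cfg[off + 1] << 8));
}

static void wr16(uint8_t *cfg, unsigned int off, uint16_t v)
{
    cfg[off]     = (uint8_t)(v & 0xFF);
    cfg[off + 1] = (uint8_t)(v >> 8);
}

/* Mask INTx and clear the MSI/MSI-X enable bits for one function. */
void quiesce_device(uint8_t cfg[256])
{
    uint8_t pos;
    int guard = 48;   /* bound the walk in case of a corrupt cap list */

    /* Set DisINTx (bit 10 of the COMMAND register). */
    wr16(cfg, PCI_COMMAND,
         rd16(cfg, PCI_COMMAND) | PCI_COMMAND_INTX_DISABLE);

    /* Walk the capability list; Message Control sits at cap + 2 for
     * both MSI (enable = bit 0) and MSI-X (enable = bit 15). */
    for (pos = cfg[PCI_CAP_PTR]; pos && guard--; pos = cfg[pos + 1]) {
        uint16_t ctrl = rd16(cfg, pos + 2);

        if (cfg[pos] == PCI_CAP_ID_MSI)
            wr16(cfg, pos + 2, ctrl & (uint16_t)~PCI_MSI_FLAGS_ENABLE);
        else if (cfg[pos] == PCI_CAP_ID_MSIX)
            wr16(cfg, pos + 2, ctrl & (uint16_t)~PCI_MSIX_FLAGS_ENABLE);
    }
}
```

The per-device work is tiny; as the rest of the thread discusses, the hard part is knowing which devices to visit, and whether the crashing kernel can be trusted to do even this much.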
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote: > On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote: >> Unfortunately there is no way to tell the APIC "Mask vector X" and the >> dump kernel does neither know which device it comes from nor does it >> have enumerated PCI completely which would reset the device and shutup >> the spew. Due to the interrupt storm it does not get that far. > > Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable > on all active PCI devices in the crashing kernel before starting the > crash kernel? So that the crash kernel starts with a clean slate? > > Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that > is because the crashing kernel doesn't know which devices exist and > which have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. As I explained before: The problem with doing anything between crashing and starting the crash kernel is reducing the chance of actually starting it. The kernel crashed for whatever reason, so it's in a completely undefined state. Ergo there is no 'just do something'. You really have to think hard about what can be done safely with the least probability of running into another problem. Thanks, tglx
Thomas Gleixner <tglx@linutronix.de> writes: > On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote: >> ebiederm@xmission.com (Eric W. Biederman) writes: >> For ordinary irqs you can have this with level triggered irqs >> and the kernel has code that will shutdown the irq at the ioapic >> level. Then the kernel continues by polling the irq source. >> >> I am still missing details but my first question is can our general >> solution to screaming level triggered irqs apply? > > No. > >> How can edge triggered MSI irqs be screaming? >> >> Is there something we can do in enabling the APICs or IOAPICs that >> would allow this to be handled better. My memory when we enable >> the APICs and IOAPICs we completely clear the APIC entries and so >> should be disabling sources. > > Yes, but MSI has nothing to do with APIC/IOAPIC Yes, sorry. It has been long enough that the details were paged out of my memory. >> Is the problem perhaps that we wind up using an APIC entry that was >> previously used for the MSI interrupt as something else when we >> reprogram them? Even with this why doesn't the generic code >> to stop screaming irqs apply here? > > Again. No. The problem cannot be solved at the APIC level. The APIC is > the receiving end of MSI and has absolutely no control over it. > > An MSI interrupt is a (DMA) write to the local APIC base address > 0xfeexxxxx which has the target CPU and control bits encoded in the > lower bits. The written data is the vector and more control bits. > > The only way to stop that is to shut it up at the PCI device level. Or to write to magic chipset registers that will stop transforming DMA writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has such registers (because the point of the IOMMU is to limit the problems rogue devices can cause). Without an IOMMU I don't know if x86 has any such registers. I remember that other platforms have an interrupt controller that does provide some level of control. 
That x86 does not is what makes this an x86-specific problem.

The generic solution is to have the PCI code set bus master disables when it is enumerating and initializing devices. Last time I was paying attention that was actually the policy of the PCI layer, and drivers that did not enable bus mastering were considered buggy.

Looking at patch 3/3, what this patchset does is an early disable of the MSI registers. Which is mostly reasonable, especially as, as has been pointed out, the only location the x86 vector and x86 CPU can be found is in the MSI configuration registers. That also seems reasonable. But Bjorn's concern about not finding all devices in all domains does seem real.

There are a handful of devices where the Bus Master disable bit doesn't disable bus mastering. I wonder if there are devices where MSI and MSI-X disables don't fully work. It seems completely possible to have MSI or MSI-X equivalent registers at a non-standard location, as drivers must be loaded to handle them.

So if we can safely and reliably disable DMA and MSI at the generic PCI device level during boot up, I am all for it.

How difficult would it be to tell the IOMMUs to stop passing traffic through in an early PCI quirk? The problem setup was apparently someone using the device directly from a VM, so I presume there is an IOMMU in that configuration.

> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

So the question is how do we make this robust? Can we perhaps disable all interrupts in this case and limp along in polling mode until the PCI bus has been enumerated?
It is nice and desirable to be able to use the hardware in high-performance mode in a kexec-on-panic situation, but if we can detect a problem and figure out how to limp along, sometimes that is acceptable.

The failure mode in the kexec-on-panic kernel is definitely the correct one: we could not figure out how to wrestle the hardware into usability, so we fail to take a crash dump or do anything else that might corrupt the system.

Eric
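As background for the bus-mastering point above: since an MSI is itself an upstream memory write, clearing Bus Master Enable (bit 2 of the COMMAND register) — which is what pci_device_shutdown() does on the ordinary kexec path — should stop a device's MSIs too. A register-level sketch (pure bit manipulation, hypothetical helper name):

```c
#include <stdint.h>

#define PCI_COMMAND          0x04u      /* config-space offset */
#define PCI_COMMAND_MASTER   (1u << 2)  /* Bus Master Enable   */

/* Return the COMMAND register value with Bus Master Enable cleared,
 * leaving all other bits intact.  With BME clear the device may not
 * issue memory writes upstream, which includes MSI messages. */
static inline uint16_t pci_command_clear_master(uint16_t command)
{
    return command & (uint16_t)~PCI_COMMAND_MASTER;
}
```

The caveat from the discussion applies: a handful of devices do not honor this bit, and in the crash path we cannot even reach the register reliably.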
First of all, thanks everybody for the great insights/discussion! This thread ended up being a great learning experience for me, at least.

Given the large number of replies and intermixed comments, I wouldn't be able to respond inline to all, so I'll try another approach below.

From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time. That doesn't seem like a robust solution because (a) the problem affects all arches but early_quirks() is x86-specific and (b) even on x86 early_quirks() only works for PCI segment 0 because it relies on the 0xCF8/0xCFC I/O ports."

Ah. I wasn't aware of that limitation; I thought enhancing the early_quirks() search to go through all buses would fix that, thanks for the clarification! And again, worth clarifying that this is not a problem affecting all arches _in practice_ - PowerPC for example has the FW primitives allowing a powerful PCI controller (out-of-band) reset, usually preventing this kind of issue.

[0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com

From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level if that's possible, or reset/disable bus mastering/disable MSI/etc on individual PCI devices if necessary."

From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses. That might impact boot time negatively. The reason he has to do that is because the crashing kernel doesn't know which devices exist and which have interrupts enabled. However the crashing kernel has that information. It should either disable interrupts itself or pass the necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter. I guess the idea is that the parameter is always passed to the crash kernel but the patches don't do that, which seems odd."

Very good points Lukas, thanks!
The reason for not adding the clearmsi thing as a default kdump procedure is somewhat related to your first point: it impacts boot time a bit, and it's additional logic in the kdump kernel, which we know is (sometimes) the last resort for gathering additional data to debug a potentially complex issue. That said, I'd rather not enforce this kind of "policy" on everybody, so my implementation relies on having it as a parameter, and the kdump userspace counterpart could then choose whether or not to add such a mechanism to the kdump kernel parameter list.

About passing the data to the next kernel: this is very interesting. We could do something like that either through setup_data (as you said) or using a new proposal, the PKRAM thing [a]. This way we wouldn't need a crash_device_shutdown(); instead, when the kernel is loading a crash kernel (through kexec -p) we can collect all the necessary information that'll be passed to the crash kernel (obviously, if we are collecting PCI topology information, we need callbacks in the PCI hotplug add/del path to update this information).

[a] https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

Below, inline reply to Eric's last message.

On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
>
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause). Without an IOMMU I don't know if x86 has any
> such registers. I remember that other platforms have an interrupt
> controller that does provide some level of control.
> That x86 does not is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3 what this patchset does is an early disable of
> of the msi registers. Which is mostly reasonable. Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
>
> That also seems reasonable. But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
>
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk? The problem setup was apparently someone
> using the device directly from a VM. So I presume there is an IOMMU
> in that configuration.

This is a good idea, I think - we considered something like that in theory while working the problem back in 2018, but given I'm even less of an expert in IOMMU than I am in PCI, the option was to go with the PCI approach. And you are right, the original problem is a device in PCI-PT to a VM, and a tool messing with the PF device in the SR-IOV (to collect some stats) from the host side, through VFIO IIRC.

Anyway, we could split the problem in two after all the considerations in the thread, I believe:

(1) If it's possible to set up the IOMMU to prevent MSIs, by "blocking" the DMA writes for PCI devices *until* the PCI core code properly initializes the devices, that'd handle the majority of the cases I guess (IOMMU usage is quite common nowadays).

(2) Collecting PCI topology information in a running kernel and passing that to the kdump kernel would allow us to disable the PCI devices' MSI capabilities; the only problem here is that I couldn't see how to do that early enough, before the 'sti' executes, without relying on early_quirks(). ACPI maybe? As per Bjorn's comment explaining the limitations of early_quirks(), this problem seems not to be trivial.
I'm a bit against doing that in crash_kexec() for the reasons mentioned in the thread, especially by Thomas and Eric - but if there's no other way... maybe this is the way to go.

Cheers,

Guilherme

P.S. Fixed Gavin's bouncing address, sorry for that.
"Guilherme G. Piccoli" <gpiccoli@canonical.com> writes: > First of all, thanks everybody for the great insights/discussion! This > thread ended-up being a great learning for (at least) me. > > Given the big amount of replies and intermixed comments, I wouldn't be > able to respond inline to all, so I'll try another approach below. > > > From Bjorn: > "I think [0] proposes using early_quirks() to disable MSIs at boot-time. > That doesn't seem like a robust solution because (a) the problem affects > all arches but early_quirks() is x86-specific and (b) even on x86 > early_quirks() only works for PCI segment 0 because it relies on the > 0xCF8/0xCFC I/O ports." > > Ah. I wasn't aware of that limitation, I thought enhancing the > early_quirks() search to go through all buses would fix that, thanks for > the clarification! And again, worth to clarify that this is not a > problem affecting all arches _in practice_ - PowerPC for example has the > FW primitives allowing a powerful PCI controller (out-of-band) reset, > preventing this kind of issue usually. > > [0] > https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com > > > From Bjorn: > "A crash_device_shutdown() could do something at the host bridge level > if that's possible, or reset/disable bus mastering/disable MSI/etc on > individual PCI devices if necessary." > > From Lukas: > "Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that is > because the crashing kernel doesn't know which devices exist and which > have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. > > Guilherme's patches add a "clearmsi" command line parameter. I guess > the idea is that the parameter is always passed to the crash kernel but > the patches don't do that, which seems odd." 
> > Very good points Lukas, thanks! The reason of not adding the clearmsi > thing as a default kdump procedure is kinda related to your first point: > it impacts a bit boot-time, also it's an additional logic in the kdump > kernel, which we know is (sometimes) the last resort in gathering > additional data to debug a potentially complex issue. That said, I'd not > like to enforce this kind of "policy" to everybody, so my implementation > relies on having it as a parameter, and the kdump userspace counter-part > could then have a choice in adding or not such mechanism to the kdump > kernel parameter list. > > About passing the data to next kernel, this is very interesting, we > could do something like that either through setup_data (as you said) or > using a new proposal, the PKRAM thing [a]. > This way we wouldn't need a crash_device_shutdown(), but instead when > kernel is loading a crash kernel (through kexec -p) we can collect all > the necessary information that'll be passed to the crash kernel > (obviously that if we are collecting PCI topology information, we need > callbacks in the PCI hotplug add/del path to update this information). > > [a] > https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/ > > Below, inline reply to Eric's last message. > > > On 15/11/2020 17:46, Eric W. Biederman wrote: >>> [...] >>> An MSI interrupt is a (DMA) write to the local APIC base address >>> 0xfeexxxxx which has the target CPU and control bits encoded in the >>> lower bits. The written data is the vector and more control bits. >>> >>> The only way to stop that is to shut it up at the PCI device level. >> >> Or to write to magic chipset registers that will stop transforming DMA >> writes to 0xfeexxxxx into x86 interrupts. With an IOMMU I know x86 has >> such registers (because the point of the IOMMU is to limit the problems >> rogue devices can cause). Without an IOMMU I don't know if x86 has any >> such registers. 
I remember that other platforms have an interrupt >> controller that does provide some level of control. That x86 does not >> is what makes this an x86 specific problem. >> [...] >> Looking at patch 3/3 what this patchset does is an early disable of >> of the msi registers. Which is mostly reasonable. Especially as has >> been pointed out the only location the x86 vector and x86 cpu can >> be found is in the msi configuration registers. >> >> That also seems reasonable. But Bjorn's concern about not finding all >> devices in all domains does seem real. >> [...] >> So if we can safely and reliably disable DMA and MSI at the generic PCI >> device level during boot up I am all for it. >> >> >> How difficult would it be to tell the IOMMUs to stop passing traffic >> through in an early pci quirk? The problem setup was apparently someone >> using the device directly from a VM. So I presume there is an IOMMU >> in that configuration. > > This is a good idea I think - we considered something like that in > theory while working the problem back in 2018, but given I'm even less > expert in IOMMU that I already am in PCI, the option was to go with the > PCI approach. And you are right, the original problem is a device in > PCI-PT to a VM, and a tool messing with the PF device in the SR-IOV (to > collect some stats) from the host side, through VFIO IIRC. > > Anyway, we could split the problem in two after all the considerations > in the thread, I believe: > > (1) If possible to set-up the IOMMU to prevent MSIs, by "blocking" the > DMA writes for PCI devices *until* PCI core code properly initialize the > devices, that'd handle the majority of the cases I guess (IOMMU usage is > quite common nowadays). 
> > (2) Collecting PCI topology information in a running kernel and passing > that to the kdump kernel would allow us to disable the PCI devices MSI > capabilities, the only problem here is that I couldn't see how to do > that early enough, before the 'sti' executes, without relying in > early_quirks(). ACPI maybe? As per Bjorn comment when explaining the > limitations of early_quirks(), this problem seems not to be trivial. > > I'm a bit against doing that in crash_kexec() for the reasons mentioned > in the thread, specially by Thomas and Eric - but if there's no other > way...maybe this is the way to go. The way to do that would be to collect the set of pci devices when the kexec on panic kernel is loaded, not during crash_kexec. If someone performs a device hotplug they would need to reload the kexec on panic kernel. I am not necessarily endorsing that just pointing out how it can be done. Eric
On Mon, Nov 16, 2020 at 6:45 PM Eric W. Biederman <ebiederm@xmission.com> wrote: > The way to do that would be to collect the set of pci devices when the > kexec on panic kernel is loaded, not during crash_kexec. If someone > performs a device hotplug they would need to reload the kexec on panic > kernel. > > I am not necessarily endorsing that just pointing out how it can be > done. > > Eric Thanks Eric, I agree! I think if we use something like PKRAM (a more dynamic approach) we could have the PCI hotplug path updating the data to-be-passed to the crash kernel, so the crash kernel doesn't even need to be loaded again.
On Mon, Nov 16, 2020 at 05:31:36PM -0300, Guilherme G. Piccoli wrote: > First of all, thanks everybody for the great insights/discussion! This > thread ended-up being a great learning for (at least) me. > > Given the big amount of replies and intermixed comments, I wouldn't be > able to respond inline to all, so I'll try another approach below. > > > From Bjorn: > "I think [0] proposes using early_quirks() to disable MSIs at boot-time. > That doesn't seem like a robust solution because (a) the problem affects > all arches but early_quirks() is x86-specific and (b) even on x86 > early_quirks() only works for PCI segment 0 because it relies on the > 0xCF8/0xCFC I/O ports." > > Ah. I wasn't aware of that limitation, I thought enhancing the > early_quirks() search to go through all buses would fix that, thanks for > the clarification! And again, worth to clarify that this is not a > problem affecting all arches _in practice_ - PowerPC for example has the > FW primitives allowing a powerful PCI controller (out-of-band) reset, > preventing this kind of issue usually. > > [0] > https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com > > > From Bjorn: > "A crash_device_shutdown() could do something at the host bridge level > if that's possible, or reset/disable bus mastering/disable MSI/etc on > individual PCI devices if necessary." > > From Lukas: > "Guilherme's original patches from 2018 iterate over all 256 PCI buses. > That might impact boot time negatively. The reason he has to do that is > because the crashing kernel doesn't know which devices exist and which > have interrupts enabled. However the crashing kernel has that > information. It should either disable interrupts itself or pass the > necessary information to the crashing kernel as setup_data or whatever. I don't think passing the device information to the kdump kernel is really practical. 
The kdump kernel would use it to do PCI config writes to disable MSIs before enabling IRQs, and it doesn't know how to access config space that early. We could invent special "early config access" things, but that gets really complicated really fast. Config access depends on ACPI MCFG tables, firmware interfaces, and in many cases, on the native host bridge drivers in drivers/pci/controllers/. I think we need to disable MSIs in the crashing kernel before the kexec. It adds a little more code in the crash_kexec() path, but it seems like a worthwhile tradeoff. Bjorn
Bjorn Helgaas <helgaas@kernel.org> writes:

> I don't think passing the device information to the kdump kernel is
> really practical. The kdump kernel would use it to do PCI config
> writes to disable MSIs before enabling IRQs, and it doesn't know how
> to access config space that early.

I don't think it is particularly practical either. But in practice on x86 it is either MMIO writes or 0xcf8-style writes, and we could pass a magic table that would have all of that information.

> We could invent special "early config access" things, but that gets
> really complicated really fast. Config access depends on ACPI MCFG
> tables, firmware interfaces, and in many cases, on the native host
> bridge drivers in drivers/pci/controllers/.

I do agree that the practical problem with passing information early is that it gets us into the weeds and creates code that we only care about in the case of kexec-on-panic. It is much better to make the existing code more robust, so that we reduce our dependency on firmware doing the right thing.

> I think we need to disable MSIs in the crashing kernel before the
> kexec. It adds a little more code in the crash_kexec() path, but it
> seems like a worthwhile tradeoff.

Disabling MSIs in the b0rken kernel is not possible. Walking the device tree or even a significant subset of it hugely increases the chances that we will run into something that is incorrect in the known broken kernel. I expect the code to do that would double or triple the amount of code that must be executed in the known broken kernel. The last time something like that happened (switching from xchg to ordinary locks) we had cases that stopped working. Walking all of the PCI devices in the system is much more invasive.

That is not to downplay the problems of figuring out how to disable things in early boot. My two top candidates are poking the IOMMUs early to shut things off, and figuring out if we can delay enabling interrupts until we have initialized PCI.
Poking at IOMMUs early should work for most systems with ``enterprise'' hardware. Systems where people care about kdump the most. Eric
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote: > Bjorn Helgaas <helgaas@kernel.org> writes: > My two top candidates are poking the IOMMUs early to shut things off, > and figuring out if we can delay enabling interrupts until we have > initialized pci. Keeping interrupts disabled post PCI initialization would be nice, but that requires feeding the whole init machinery through a shredder and collecting the bits and pieces. > Poking at IOMMUs early should work for most systems with ``enterprise'' > hardware. Systems where people care about kdump the most. The IOMMU IRQ remapping part _is_ initialized early and _before_ interrupts are enabled. But that does not solve the problem either simply because then the IOMMU will catch the rogue MSIs and you get an interrupt storm on the IOMMU error interrupt. This one is not going to be turned off because the IOMMU error interrupt handler will handle each of them and tell the core code that everything is under control. As I explained several times now, the only way to shut this up reliably is at the source. Curing the symptom is almost never a good solution. Thanks, tglx
On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec. It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> decreases the chances that we will run into something that is incorrect
> in the known broken kernel. I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel. The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working. Walking all of
> the pci devices in the system is much more invasive.

I think we could try to decouple this problem into two, if that makes sense. Bjorn/others can jump in and correct me if I'm wrong.

So, the problem is to walk a PCI topology tree, identify every device and disable MSI (and maybe INTx) in them, and the big deal with doing that in the broken kernel before the kexec is that this task is complex, mainly due to the tree walk. But what if we keep a table (a simple 2-D array) with the address and data to be written to disable the MSIs, and before the kexec we could have a parameter enabling a function that just goes through this array and performs the writes?

The table itself would be constructed by the PCI core (and updated in the hotplug path) when device discovery is happening. This table would live in a protected area in memory, with no write/execute access; that way, if the kernel itself tries to corrupt it, we get a fault (yeah, I know DMAs can corrupt anywhere, but the IOMMU could protect against that). If the parameter "kdump_clear_msi" is passed in the cmdline of the regular kernel, for example, then the function walks this simple table and performs the devices' writes before the kexec...
If that's not possible or a bad idea for any reason, I still think the early_quirks() idea proposed here is not something we should discard, because it'll help a lot of users even with its limitations (in our case it worked very well).

Also, taking the opportunity to clarify my understanding of the limitations of that approach: Bjorn, in our reproducer machine we had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and 0000:80 - are those all under "segment 0" in your terminology? In our case the troublemaker was under 0000:80, and the early_quirks() approach shut the device up successfully.

Thanks,

Guilherme
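Guilherme's precomputed-table idea could look roughly like this: while the system is healthy, the PCI core records one entry per enabled MSI/MSI-X capability, and just before kexec-on-panic a tiny loop replays masked writes with no tree walk or allocation. All names here are hypothetical; the read/write callbacks stand in for real config accessors (a mock pair is included for illustration only).

```c
#include <stdint.h>
#include <stddef.h>

/* One entry per enabled MSI/MSI-X capability, built at discovery time
 * and updated on hotplug; `clear` holds the enable bit(s) to knock out
 * of the Message Control register at `offset`. */
struct msi_kill_entry {
    uint32_t bdf;      /* segment/bus/devfn of the device */
    uint16_t offset;   /* Message Control config offset   */
    uint16_t clear;    /* bits to clear (the enable bit)  */
};

typedef uint16_t (*cfg_read16_fn)(uint32_t bdf, uint16_t off);
typedef void (*cfg_write16_fn)(uint32_t bdf, uint16_t off, uint16_t val);

/* The whole crash-path cost: one read-modify-write per entry. */
void msi_kill_replay(const struct msi_kill_entry *tbl, size_t n,
                     cfg_read16_fn rd, cfg_write16_fn wr)
{
    for (size_t i = 0; i < n; i++)
        wr(tbl[i].bdf, tbl[i].offset,
           rd(tbl[i].bdf, tbl[i].offset) & (uint16_t)~tbl[i].clear);
}

/* Tiny mock backing store for demonstration: `bdf` indexes a register
 * array directly, ignoring the offset. */
static uint16_t mock_regs[4];
static uint16_t mock_rd(uint32_t bdf, uint16_t off)
{
    (void)off;
    return mock_regs[bdf];
}
static void mock_wr(uint32_t bdf, uint16_t off, uint16_t v)
{
    (void)off;
    mock_regs[bdf] = v;
}
```

The trade-off Thomas raises still applies: even this loop runs in the broken kernel, so keeping it to straight-line writes over a read-only table is the point of the design.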
On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote: > But that does not solve the problem either simply because then the IOMMU > will catch the rogue MSIs and you get an interrupt storm on the IOMMU > error interrupt. Not if you can tell the IOMMU to stop reporting those errors. We can easily do it per-device for DMA errors; not quite sure what granularity we have for interrupts. Perhaps the Intel IOMMU only lets you set the Fault Processing Disable bit per IRTE entry, and you still get faults for Compatibility Format interrupts? Not sure about AMD...
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>> But that does not solve the problem either simply because then the IOMMU
>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>> error interrupt.
>
> Not if you can tell the IOMMU to stop reporting those errors.
>
> We can easily do it per-device for DMA errors; not quite sure what
> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
> you set the Fault Processing Disable bit per IRTE entry, and you still
> get faults for Compatibility Format interrupts? Not sure about AMD...

It looks like the fault (DMAR) and event (AMD) interrupts can be disabled in the IOMMU. That might help to bridge the gap until the PCI bus is scanned in full glory and the devices can be shut up for real.

If we make this conditional for a crash dump kernel that might do the trick.

Lots of _might_ there :)

Thanks,

tglx
Thomas Gleixner <tglx@linutronix.de> writes:
> On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
>> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>>> But that does not solve the problem either simply because then the IOMMU
>>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>>> error interrupt.
>>
>> Not if you can tell the IOMMU to stop reporting those errors.
>>
>> We can easily do it per-device for DMA errors; not quite sure what
>> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
>> you set the Fault Processing Disable bit per IRTE entry, and you still
>> get faults for Compatibility Format interrupts? Not sure about AMD...
>
> It looks like the fault (DMAR) and event (AMD) interrupts can be
> disabled in the IOMMU. That might help to bridge the gap until the PCI
> bus is scanned in full glory and the devices can be shut up for real.
>
> If we make this conditional for a crash dump kernel that might do the
> trick.
>
> Lot's of _might_ there :)

Worth testing.  Folks tracking this down, is this enough of a hint for
you to write a patch and test?

Also worth checking how close irqpoll is to handling a case like this.
At least historically it did a pretty good job at shutting down problem
interrupts.

I really find it weird that an edge-triggered irq was firing fast
enough to stop a system from booting.  Level-triggered irqs do that if
they are acknowledged without actually being handled.  I think
edge-triggered irqs only fire when another event comes in, and it seems
odd to get so many actual events causing interrupts that the system
soft locks up.  Is my memory of that situation confused?

I recommend making these facilities general debug facilities so that
they can be used for cases other than crash dump.  The crash dump
kernel would always enable them because it can assume that the hardware
is very likely in a wonky state.

Eric
On Tue, Nov 17, 2020 at 09:04:07AM -0300, Guilherme Piccoli wrote:
> Also, taking here the opportunity to clarify my understanding about
> the limitations of that approach: Bjorn, in our reproducer machine we
> had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
> 0000:80 - are those all under "segment 0" as per your verbiage?

Yes.  The "0000" is the PCI segment (or "domain" in the Linux PCI
core).  It's common on x86 to have multiple host bridges in segment
0000.
Thanks a lot Bjorn! I confess that, except for PPC64 server machines, I
have never seen other "domains" or segments. Is it common on x86 to
have that? And are the early_quirks() restricted to the first segment,
no matter how many host bridges we have in segment 0000?

Thanks again!
On Wed, Nov 18, 2020 at 07:36:08PM -0300, Guilherme Piccoli wrote:
> Thanks a lot Bjorn! I confess except for PPC64 Server machines, I
> never saw other "domains" or segments. Is it common in x86 to have
> that? The early_quirks() are restricted to the first segment, no
> matter how many host bridges we have in segment 0000?

I don't know whether it's *common* to have multiple domains on x86, but
they're definitely used on large systems.  This includes some lspci
info from an HPE Superdome Flex system:

  https://lore.kernel.org/lkml/5350E02A-6457-41A8-8F33-AF67BFDAEE3E@fb.com/

The early quirks in arch/x86/kernel/early-quirks.c (not the
DECLARE_PCI_FIXUP_EARLY quirks in drivers/pci/quirks.c) are restricted
to segment 0, no matter how many host bridges there are.  This is
because they use read_pci_config_16(), which uses a config access
mechanism that has no provision for a segment number.

Bjorn
Thank you for the clarification Bjorn! I was on vacation..sorry for my delay. Closing the loop here, I understand we're not getting this patch merged (due to its restriction to domain 0) and there was a suggestion in the thread of trying to block MSIs from the IOMMU init code (which also have the restriction of only working in iommu-enabled systems). Hope the thread is helpful and if somebody faces this issue, can comment here and at least find this approach, maybe test the patch. Thanks to all involved!
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 50d5848bf22e..fd50f9e21623 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
 	u16 vendor;
 	u16 device;
 	u8 type;
-	u8 sec;
 	int i;
 
 	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
@@ -760,11 +759,8 @@ static int __init check_dev_quirk(int num, int slot, int func)
 
 	type = read_pci_config_byte(num, slot, func, PCI_HEADER_TYPE);
 
-	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
-		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
-		if (sec > num)
-			early_pci_scan_bus(sec);
-	}
+	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
+		return -1;
 
 	if (!(type & 0x80))
 		return -1;
@@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
 
 void __init early_quirks(void)
 {
+	int bus;
+
 	if (!early_pci_allowed())
 		return;
 
-	early_pci_scan_bus(0);
+	for (bus = 0; bus < 256; bus++)
+		early_pci_scan_bus(bus);
 }
It was recently noticed on an HP Gen9 system that kdump couldn't
succeed due to an irq storm coming from an Intel NIC, narrowed down to
the lack of clearing of the MSI/MSI-X enable bits during the kdump
kernel boot. For that, we need an early quirk to manually turn off
MSI/MSI-X for PCI devices - this is added as an optional boot parameter
in a subsequent patch.

The problem is that in our test system, the Intel NICs were not present
on any secondary bus under the first PCIe root complex, so they
couldn't be reached by the recursion in check_dev_quirk(). Modern
systems, especially with multiple processors and multiple NUMA nodes,
expose multiple root complexes, describing more than one PCI hierarchy
domain. Currently the simple recursion present in the x86 early-quirks
code starts a descending recursion from bus 0000:00, and reaches many
other buses by navigating this hierarchy, walking through the bridges.
This is not enough on systems with more than one root complex/host
bridge, since the recursion won't "traverse" to other root complexes by
starting statically at 0000:00 (for more details, see [0]).

This patch hence implements a full bus/device/function scan in
early_quirks(), by checking all possible buses instead of using a
recursion based on the first root bus or limiting the search scope to
the first 32 buses (like it was done in the beginning [1]).

[0] https://bugs.launchpad.net/bugs/1797990

[1] From a historical perspective, the early PCI scan dates back to
BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64", in
October 2003. It initially restricted the search to the first 32 buses
and slots.
Due to a potential bug found in Nvidia chipsets, the scan was changed
to run only on the first root bus: see commit 8659c406ade3 ("x86: only
scan the root bus in early PCI quirks").

Finally, secondary buses reachable from the 1st bus were re-added back
by commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary
buses").

Reported-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 arch/x86/kernel/early-quirks.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)