diff mbox series

[1/3] x86/quirks: Scan all busses for early PCI quirks

Message ID 20181018183721.27467-1-gpiccoli@canonical.com
State Changes Requested
Delegated to: Bjorn Helgaas
Headers show
Series [1/3] x86/quirks: Scan all busses for early PCI quirks | expand

Commit Message

Guilherme G. Piccoli Oct. 18, 2018, 6:37 p.m. UTC
Recently was noticed in an HP GEN9 system that kdump couldn't succeed
due to an irq storm coming from an Intel NIC, narrowed down to be lack
of clearing the MSI/MSI-X enable bits during the kdump kernel boot.
For that, we need an early quirk to manually turn off MSI/MSI-X for
PCI devices - this was worked as an optional boot parameter in a
subsequent patch.

Problem is that in our test system, the Intel NICs were not present in
any secondary bus under the first PCIe root complex, so they couldn't
be reached by the recursion in check_dev_quirk(). Modern systems,
specially with multi-processors and multiple NUMA nodes expose multiple
root complexes, describing more than one PCI hierarchy domain. Currently
the simple recursion present in the early-quirks code from x86 starts a
descending recursion from bus 0000:00, and reach many other busses by
navigating this hierarchy walking through the bridges. This is not
enough in systems with more than one root complex/host bridge, since
the recursion won't "traverse" to other root complexes by starting
statically in 0000:00 (for more details, see [0]).

This patch hence implements the full bus/device/function scan in
early_quirks(), by checking all possible busses instead of using a
recursion based on the first root bus or limiting the search scope to
the first 32 busses (like it was done in the beginning [1]).

[0] https://bugs.launchpad.net/bugs/1797990

[1] From historical perspective, early PCI scan dates back
to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64",
on October/2003. It initially restricted the search to the first
32 busses and slots.

Due to a potential bug found in Nvidia chipsets, the scan
was changed to run only in the first root bus: see
commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")

Finally, secondary busses reachable from the 1st bus were re-added back by:
commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")

Reported-by: Dan Streetman <ddstreet@canonical.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
---
 arch/x86/kernel/early-quirks.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

Comments

Bjorn Helgaas Oct. 18, 2018, 10:15 p.m. UTC | #1
On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> Recently was noticed in an HP GEN9 system that kdump couldn't succeed
> due to an irq storm coming from an Intel NIC, narrowed down to be lack
> of clearing the MSI/MSI-X enable bits during the kdump kernel boot.
> For that, we need an early quirk to manually turn off MSI/MSI-X for
> PCI devices - this was worked as an optional boot parameter in a
> subsequent patch.
> 
> Problem is that in our test system, the Intel NICs were not present in
> any secondary bus under the first PCIe root complex, so they couldn't
> be reached by the recursion in check_dev_quirk(). Modern systems,
> specially with multi-processors and multiple NUMA nodes expose multiple
> root complexes, describing more than one PCI hierarchy domain. Currently
> the simple recursion present in the early-quirks code from x86 starts a
> descending recursion from bus 0000:00, and reach many other busses by
> navigating this hierarchy walking through the bridges. This is not
> enough in systems with more than one root complex/host bridge, since
> the recursion won't "traverse" to other root complexes by starting
> statically in 0000:00 (for more details, see [0]).
> 
> This patch hence implements the full bus/device/function scan in
> early_quirks(), by checking all possible busses instead of using a
> recursion based on the first root bus or limiting the search scope to
> the first 32 busses (like it was done in the beginning [1]).

I don't want to expand the early quirk infrastructure unless there is
absolutely no other way to solve this.  The early quirk stuff is
x86-specific, and it's not obvious that this problem is x86-only.

This patch scans buses 0-255, but still only in domain 0, so it won't
help with even more complicated systems that use other domains.

I'm not an IRQ expert, but it seems wrong to me that we are enabling
this interrupt before we're ready for it.  The MSI should target an
IOAPIC.  Can't that IOAPIC entry be masked until later?  I guess the
kdump kernel doesn't know what MSI address the device might be using.

Could the IRQ core be more tolerant of this somehow, e.g., if it
notices incoming interrupts with no handler, could it disable the
IOAPIC entry and fall back to polling periodically until a handler is
added?

> [0] https://bugs.launchpad.net/bugs/1797990
> 
> [1] From historical perspective, early PCI scan dates back
> to BitKeeper, added by Andi Kleen's "[PATCH] APIC fixes for x86-64",
> on October/2003. It initially restricted the search to the first
> 32 busses and slots.
> 
> Due to a potential bug found in Nvidia chipsets, the scan
> was changed to run only in the first root bus: see
> commit 8659c406ade3 ("x86: only scan the root bus in early PCI quirks")
> 
> Finally, secondary busses reachable from the 1st bus were re-added back by:
> commit 850c321027c2 ("x86/quirks: Reintroduce scanning of secondary buses")
> 
> Reported-by: Dan Streetman <ddstreet@canonical.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
> ---
>  arch/x86/kernel/early-quirks.c | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
> index 50d5848bf22e..fd50f9e21623 100644
> --- a/arch/x86/kernel/early-quirks.c
> +++ b/arch/x86/kernel/early-quirks.c
> @@ -731,7 +731,6 @@ static int __init check_dev_quirk(int num, int slot, int func)
>  	u16 vendor;
>  	u16 device;
>  	u8 type;
> -	u8 sec;
>  	int i;
>  
>  	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
> @@ -760,11 +759,8 @@ static int __init check_dev_quirk(int num, int slot, int func)
>  	type = read_pci_config_byte(num, slot, func,
>  				    PCI_HEADER_TYPE);
>  
> -	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
> -		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
> -		if (sec > num)
> -			early_pci_scan_bus(sec);
> -	}
> +	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
> +		return -1;
>  
>  	if (!(type & 0x80))
>  		return -1;
> @@ -787,8 +783,11 @@ static void __init early_pci_scan_bus(int bus)
>  
>  void __init early_quirks(void)
>  {
> +	int bus;
> +
>  	if (!early_pci_allowed())
>  		return;
>  
> -	early_pci_scan_bus(0);
> +	for (bus = 0; bus < 256; bus++)
> +		early_pci_scan_bus(bus);
>  }
> -- 
> 2.19.0
>
Guilherme G. Piccoli Oct. 22, 2018, 8:35 p.m. UTC | #2
On 18/10/2018 19:15, Bjorn Helgaas wrote:
> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> [...] 

> I don't want to expand the early quirk infrastructure unless there is
> absolutely no other way to solve this.  The early quirk stuff is
> x86-specific, and it's not obvious that this problem is x86-only.
> 
> This patch scans buses 0-255, but still only in domain 0, so it won't
> help with even more complicated systems that use other domains.
> 
> I'm not an IRQ expert, but it seems wrong to me that we are enabling
> this interrupt before we're ready for it.  The MSI should target an
> IOAPIC.  Can't that IOAPIC entry be masked until later?  I guess the
> kdump kernel doesn't know what MSI address the device might be using.
> 
> Could the IRQ core be more tolerant of this somehow, e.g., if it
> notices incoming interrupts with no handler, could it disable the
> IOAPIC entry and fall back to polling periodically until a handler is
> added?

Hi Bjorn, thanks for your quick reply.
I understand your point, but I think this is inherently an architecture
problem. No matter what solution we decide for, it'll need to be applied
in early boot time, like before the PCI layer gets initialized.

So, I think a first step would be to split the solution "timing" in 2
possibilities:

a) We could try to disable MSIs or whatever approach we take in the
quiesce path of crash_kexec(), before the bootstrap of the kdump kernel.
The pro is we could use PCI handlers to do it generically. The con is
it'd touch that delicate shutdown path, from a broken kernel, and this
is unreliable. Also, I've noticed changes in those crash paths
usually gain huge amount of criticism by community, seems nobody wants
to change a bit of this code, if not utterly necessary.

b) Continue using an early boot approach. IMO, this would be per-arch by
nature.
Currently, powerpc for example does not suffer this issue due to their
arch code performing a FW-aided PCI fundamental reset in the devices[0].
On the other hand, x86 has no generic fundamental reset infrastructure
to my knowledge (we tried some alternatives, like a Bridge reset[1] that
didn't work, or zeroing the the command register, which worked), but if
we go with the IOAPIC way of handling this (which we tried a bit and
failed too), it'll be even more arch-dependent, since IOAPIC is x86 concept.


After discussing here internally, an alternative way for this MSI
approach work without requiring the change in the early PCI
infrastructure is to check if we're in kdump kernel and perform manually
the full scan in that case, instead of changing the generic case as
proposed here. This would still be x86-only, but again, it's difficult
if not impossible to fix all archs using the same code here.

Finally, about multi-domain PCI topologies, I've never saw it on x86, I
wasn't aware that such things existed in x86 - but if required we can
quickly extend the logic to contemplate it too.

Thanks again, looking forward for you suggestions.
Cheers,


Guilherme

[0]
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/powerpc/platforms/powernv/pci-ioda.c#n3992

[1] Based in https://patchwork.kernel.org/patch/2562841, adapted to work
in early boot time.
Bjorn Helgaas Oct. 23, 2018, 5:03 p.m. UTC | #3
On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> > On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> > [...] 
> I understand your point, but I think this is inherently an architecture
> problem. No matter what solution we decide for, it'll need to be applied
> in early boot time, like before the PCI layer gets initialized.

This is the part I want to know more about.  Apparently there's some
event X between early_quirks() and the PCI device enumeration, and we
must disable MSIs before X:

  setup_arch()
      early_quirks()                     # arch/x86/kernel/early-quirks.c
      early_pci_clear_msi()
  ...
  X
  ...
  pci_scan_root_bus_bridge()
    ...
    DECLARE_PCI_FIXUP_EARLY              # drivers/pci/quirks.c

I want to know specifically what X is.  If we don't know what X is and
all we know is "we have to disable MSIs earlier than PCI init", then
we're likely to break things again in the future by changing the order
of disabling MSIs and whatever X is.

Bjorn
Guilherme G. Piccoli Nov. 6, 2020, 1:14 p.m. UTC | #4
On 23/10/2018 14:03, Bjorn Helgaas wrote:
> On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
>> On 18/10/2018 19:15, Bjorn Helgaas wrote:
>>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
>>> [...] 
>> I understand your point, but I think this is inherently an architecture
>> problem. No matter what solution we decide for, it'll need to be applied
>> in early boot time, like before the PCI layer gets initialized.
> 
> This is the part I want to know more about.  Apparently there's some
> event X between early_quirks() and the PCI device enumeration, and we
> must disable MSIs before X:
> 
>   setup_arch()
>       early_quirks()                     # arch/x86/kernel/early-quirks.c
>       early_pci_clear_msi()
>   ...
>   X
>   ...
>   pci_scan_root_bus_bridge()
>     ...
>     DECLARE_PCI_FIXUP_EARLY              # drivers/pci/quirks.c
> 
> I want to know specifically what X is.  If we don't know what X is and
> all we know is "we have to disable MSIs earlier than PCI init", then
> we're likely to break things again in the future by changing the order
> of disabling MSIs and whatever X is.
> 
> Bjorn
> 

Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
later, but recent discussions led to a better understanding of this 'X'
point, thanks to Thomas!

For those that deleted this thread from their email clients, it's
available in [0] - the summary is that we faced an IRQ storm really
early in boot, due to a bogus PCIe device MSI behavior, when booting a
kdump kernel. This led the machine to get stuck in the boot and we
couldn't kdump. The solution hereby proposed is to clear MSI interrupts
early in x86, if a parameter is provided. I don't have the reproducer
anymore and it was pretty hard to reproduce in virtual environments.

So, about the 'X' Bjorn, in another recent thread about IRQ storms [1],
Thomas clarified that and after a brief discussion, it seems there's no
better way to prevent the MSI storm other than clearing the MSI
capability early in boot. As discussed both here and in thread [1], this
is indeed a per-architecture issue (powerpc is not subject to that, due
to a better FW reset mechanism), so I think we still could benefit in
having this idea implemented upstream, at least in x86 (we could expand
to other architectures if desired, in the future).

As a "test" data point, this was implemented in Ubuntu (same 3 patches
present in this series) for ~2 years and we haven't received bug reports
- I'm saying that because I understand your concerns about expanding the
early PCI quirks scope.

Let me know your thoughts. I'd suggest all to read thread [1], which
addresses a similar issue but in a different "moment" of the system boot
and provides some more insight on why the early MSI clearing seems to
make sense.

Thanks,


Guilherme


[0]
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com

[1] https://lore.kernel.org/lkml/87y2js3ghv.fsf@nanos.tec.linutronix.de
Bjorn Helgaas Nov. 13, 2020, 4:46 p.m. UTC | #5
On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> > On Mon, Oct 22, 2018 at 05:35:06PM -0300, Guilherme G. Piccoli wrote:
> >> On 18/10/2018 19:15, Bjorn Helgaas wrote:
> >>> On Thu, Oct 18, 2018 at 03:37:19PM -0300, Guilherme G. Piccoli wrote:
> >>> [...] 
> >> I understand your point, but I think this is inherently an architecture
> >> problem. No matter what solution we decide for, it'll need to be applied
> >> in early boot time, like before the PCI layer gets initialized.
> > 
> > This is the part I want to know more about.  Apparently there's some
> > event X between early_quirks() and the PCI device enumeration, and we
> > must disable MSIs before X:
> > 
> >   setup_arch()
> >       early_quirks()                     # arch/x86/kernel/early-quirks.c
> >       early_pci_clear_msi()
> >   ...
> >   X
> >   ...
> >   pci_scan_root_bus_bridge()
> >     ...
> >     DECLARE_PCI_FIXUP_EARLY              # drivers/pci/quirks.c
> > 
> > I want to know specifically what X is.  If we don't know what X is and
> > all we know is "we have to disable MSIs earlier than PCI init", then
> > we're likely to break things again in the future by changing the order
> > of disabling MSIs and whatever X is.
> 
> Hi Bjorn (and all CCed), I'm sorry to necro-bump a thread >2 years
> later, but recent discussions led to a better understanding of this 'X'
> point, thanks to Thomas!
> 
> For those that deleted this thread from their email clients, it's
> available in [0] - the summary is that we faced an IRQ storm really
> early in boot, due to a bogus PCIe device MSI behavior, when booting a
> kdump kernel. This led the machine to get stuck in the boot and we
> couldn't kdump. The solution hereby proposed is to clear MSI interrupts
> early in x86, if a parameter is provided. I don't have the reproducer
> anymore and it was pretty hard to reproduce in virtual environments.
> 
> So, about the 'X' Bjorn, in another recent thread about IRQ storms [1],
> Thomas clarified that and after a brief discussion, it seems there's no
> better way to prevent the MSI storm other than clearing the MSI
> capability early in boot. As discussed both here and in thread [1], this
> is indeed a per-architecture issue (powerpc is not subject to that, due
> to a better FW reset mechanism), so I think we still could benefit in
> having this idea implemented upstream, at least in x86 (we could expand
> to other architectures if desired, in the future).
> 
> As a "test" data point, this was implemented in Ubuntu (same 3 patches
> present in this series) for ~2 years and we haven't received bug reports
> - I'm saying that because I understand your concerns about expanding the
> early PCI quirks scope.
> 
> Let me know your thoughts. I'd suggest all to read thread [1], which
> addresses a similar issue but in a different "moment" of the system boot
> and provides some more insight on why the early MSI clearing seems to
> make sense.

I guess Thomas' patch [2] (from thread [1]) doesn't solve this
problem?

I think [0] proposes using early_quirks() to disable MSIs at
boot-time.  That doesn't seem like a robust solution because (a) the
problem affects all arches but early_quirks() is x86-specific and (b)
even on x86 early_quirks() only works for PCI segment 0 because it
relies on the 0xCF8/0xCFC I/O ports.

If I understand Thomas' email correctly, the IRQ storm occurs here:

  start_kernel
    setup_arch
      early_quirks               # x86-only
        ...
          read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
            outl(..., 0xcf8)     # PCI segment 0 only
            inw(0xcfc)
    local_irq_enable
      ...
        native_irq_enable
          asm("sti")             # <-- enable IRQ, storm occurs

native_irq_enable() happens long before we discover PCI host bridges
and run the normal PCI quirks, so those would be too late to disable
MSIs.

It doesn't seem practical to disable MSIs in the kdump kernel at the
PCI level.  I was hoping we could disable them somewhere in the IRQ
code, e.g., at IOAPICs, but I think Thomas is saying that's not
feasible.

It seems like the only option left is to disable MSIs before the
kexec.  We used to clear the MSI/MSI-X Enable bits in
pci_device_shutdown(), but that broke console devices that relied on
MSI and caused "nobody cared" warnings when the devices fell back to
using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
pci_device_shutdown()") left them unchanged.

pci_device_shutdown() still clears the Bus Master Enable bit if we're
doing a kexec and the device is in D0-D3hot, which should also disable
MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
device causing the storm was in PCI_UNKNOWN state?

> [0] https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com
> 
> [1] https://lore.kernel.org/lkml/87y2js3ghv.fsf@nanos.tec.linutronix.de

[2] https://lore.kernel.org/lkml/87tuueftou.fsf@nanos.tec.linutronix.de/

Notes to my future self about related changes:

  2008-04-23 d52877c7b1af ("pci/irq: let pci_device_shutdown to call pci_msi_shutdown v2")
    Disable MSI before kexec because new kernel isn't prepared for MSI

  2011-10-17 d5dea7d95c48 ("PCI: msi: Disable msi interrupts when we initialize a pci device")
    Disable MSI/MSI-X at boot; only works for new kernels with
    CONFIG_PCI_MSI=y

  2012-04-27 b566a22c2332 ("PCI: disable Bus Master on PCI device shutdown")
    Disable bus mastering on shutdown (if enable/disable nested correctly)

  2013-02-04 7897e6022761 ("PCI: Disable Bus Master unconditionally in pci_device_shutdown()")
    Disable bus mastering unconditionally (ignore nested enable/disable)

  2013-03-14 6e0eda3c3898 ("PCI: Don't try to disable Bus Master on disconnected PCI devices")
    Don't touch bus mastering for D3cold or unknown state

  2015-05-07 1851617cd2da ("PCI/MSI: Disable MSI at enumeration even if kernel doesn't support MSI")
    Disable MSI/MSI-X at boot even without CONFIG_PCI_MSI=y; broke
    Open Firmware arches

  2015-10-21 e80e7edc55ba ("PCI/MSI: Initialize MSI capability for all architectures")
    Disable MSI/MSI-X at boot for all arches, including Open Firmware

  2017-01-26 fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in pci_device_shutdown()")
    Leave MSI enabled before kexec; disabling causes device to use INTx,
    which drivers aren't prepared for, causing "nobody cared" warnings
Thomas Gleixner Nov. 13, 2020, 11:31 p.m. UTC | #6
Bjorn,

On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> On Fri, Nov 06, 2020 at 10:14:14AM -0300, Guilherme G. Piccoli wrote:
>> On 23/10/2018 14:03, Bjorn Helgaas wrote:
> I guess Thomas' patch [2] (from thread [1]) doesn't solve this
> problem?

No. As I explained in [1] patch from [2] cannot solve it because the
patch from [2] which is what Liu was trying to solve requires that there
is a registered interrupt handler which knows how to shut up the
interrupt.

> I think [0] proposes using early_quirks() to disable MSIs at
> boot-time.  That doesn't seem like a robust solution because (a) the
> problem affects all arches but early_quirks() is x86-specific and (b)
> even on x86 early_quirks() only works for PCI segment 0 because it
> relies on the 0xCF8/0xCFC I/O ports.
>
> If I understand Thomas' email correctly, the IRQ storm occurs here:
>
>   start_kernel
>     setup_arch
>       early_quirks               # x86-only
>         ...
>           read_pci_config_16(num, slot, func, PCI_VENDOR_ID)
>             outl(..., 0xcf8)     # PCI segment 0 only
>             inw(0xcfc)
>     local_irq_enable
>       ...
>         native_irq_enable
>           asm("sti")             # <-- enable IRQ, storm occurs
>
> native_irq_enable() happens long before we discover PCI host bridges
> and run the normal PCI quirks, so those would be too late to disable
> MSIs.

Correct.

> It doesn't seem practical to disable MSIs in the kdump kernel at the
> PCI level.  I was hoping we could disable them somewhere in the IRQ
> code, e.g., at IOAPICs, but I think Thomas is saying that's not
> feasible.

MSIs are not even going near the IOAPIC and as long as the interrupt
core does not have an interrupt set up for the device is has no idea
where to look at to shut it down. Actually it does not even reach the
interrupt core. The raised vector arrives at the CPU and the low level
code sees: No handler associated, ignore it. We cannot do anything from
the low level code because all we know is that the vector was raised,
but we have absolutely zero clue where that came from. At that point the
IO-APIC interrupts are definitely not the problem because they are all
disabled.

> It seems like the only option left is to disable MSIs before the
> kexec.  We used to clear the MSI/MSI-X Enable bits in
> pci_device_shutdown(), but that broke console devices that relied on
> MSI and caused "nobody cared" warnings when the devices fell back to
> using INTx, so fda78d7a0ead ("PCI/MSI: Stop disabling MSI/MSI-X in
> pci_device_shutdown()") left them unchanged.

That might be solvable because INTx arrives at the IO-APIC and we could
mask all the INTx related IO-APIC lines, but that's icky because of
this:

> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> doing a kexec and the device is in D0-D3hot, which should also disable
> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> device causing the storm was in PCI_UNKNOWN state?

That's indeed a really good question.

Thanks,

        tglx
Thomas Gleixner Nov. 13, 2020, 11:40 p.m. UTC | #7
Bjorn,

On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> doing a kexec and the device is in D0-D3hot, which should also disable
>> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> device causing the storm was in PCI_UNKNOWN state?
>
> That's indeed a really good question.

So we do that on kexec, but is that true when starting a kdump kernel
from a kernel crash? I doubt it.

Thanks,

        tglx
Bjorn Helgaas Nov. 14, 2020, 8:39 p.m. UTC | #8
On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> Bjorn,
> 
> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> >> device causing the storm was in PCI_UNKNOWN state?
> >
> > That's indeed a really good question.
> 
> So we do that on kexec, but is that true when starting a kdump kernel
> from a kernel crash? I doubt it.

Ah, right, I bet that's it, thanks.  The kdump path is basically this:

  crash_kexec
    machine_kexec

while the usual kexec path is:

  kernel_kexec
    kernel_restart_prepare
      device_shutdown
        while (!list_empty(&devices_kset->list))
          dev->bus->shutdown
            pci_device_shutdown            # pci_bus_type.shutdown
    machine_kexec

So maybe we need to explore doing some or all of device_shutdown() in
the crash_kexec() path as well as in the kernel_kexec() path.
Thomas Gleixner Nov. 14, 2020, 8:58 p.m. UTC | #9
Bjorn,

On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> device causing the storm was in PCI_UNKNOWN state?
>> >
>> > That's indeed a really good question.
>> 
>> So we do that on kexec, but is that true when starting a kdump kernel
>> from a kernel crash? I doubt it.
>
> Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>
>   crash_kexec
>     machine_kexec
>
> while the usual kexec path is:
>
>   kernel_kexec
>     kernel_restart_prepare
>       device_shutdown
>         while (!list_empty(&devices_kset->list))
>           dev->bus->shutdown
>             pci_device_shutdown            # pci_bus_type.shutdown
>     machine_kexec
>
> So maybe we need to explore doing some or all of device_shutdown() in
> the crash_kexec() path as well as in the kernel_kexec() path.

The problem is that if the machine crashed anything you try to attempt
before starting the crash kernel is reducing the chance that the crash
kernel actually starts.

Is there something at the root bridge level which allows to tell the
underlying busses to shut up, reset or go into a defined state? That
might avoid chasing lists which might be already unreliable.

Thanks,

        tglx
Bjorn Helgaas Nov. 14, 2020, 9:22 p.m. UTC | #10
[+cc Rafael for question about ACPI method for PCI host bridge reset]

On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
> >> >> device causing the storm was in PCI_UNKNOWN state?
> >> >
> >> > That's indeed a really good question.
> >> 
> >> So we do that on kexec, but is that true when starting a kdump kernel
> >> from a kernel crash? I doubt it.
> >
> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
> >
> >   crash_kexec
> >     machine_kexec
> >
> > while the usual kexec path is:
> >
> >   kernel_kexec
> >     kernel_restart_prepare
> >       device_shutdown
> >         while (!list_empty(&devices_kset->list))
> >           dev->bus->shutdown
> >             pci_device_shutdown            # pci_bus_type.shutdown
> >     machine_kexec
> >
> > So maybe we need to explore doing some or all of device_shutdown() in
> > the crash_kexec() path as well as in the kernel_kexec() path.
> 
> The problem is that if the machine crashed anything you try to attempt
> before starting the crash kernel is reducing the chance that the crash
> kernel actually starts.

Right.

> Is there something at the root bridge level which allows to tell the
> underlying busses to shut up, reset or go into a defined state? That
> might avoid chasing lists which might be already unreliable.

Maybe we need some kind of crash_device_shutdown() that does the
minimal thing to protect the kdump kernel from devices.

The programming model for conventional PCI host bridges and PCIe Root
Complexes is device-specific since they're outside the PCI domain.
There probably *are* ways to do those things, but you would need a
native host bridge driver or something like an ACPI method.  I'm not
aware of an ACPI way to do this, but I added Rafael in case he is.

A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary.

Bjorn
Eric W. Biederman Nov. 15, 2020, 2:05 p.m. UTC | #11
Bjorn Helgaas <helgaas@kernel.org> writes:

> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>
> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>> >> >> device causing the storm was in PCI_UNKNOWN state?
>> >> >
>> >> > That's indeed a really good question.
>> >> 
>> >> So we do that on kexec, but is that true when starting a kdump kernel
>> >> from a kernel crash? I doubt it.
>> >
>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>> >
>> >   crash_kexec
>> >     machine_kexec
>> >
>> > while the usual kexec path is:
>> >
>> >   kernel_kexec
>> >     kernel_restart_prepare
>> >       device_shutdown
>> >         while (!list_empty(&devices_kset->list))
>> >           dev->bus->shutdown
>> >             pci_device_shutdown            # pci_bus_type.shutdown
>> >     machine_kexec
>> >
>> > So maybe we need to explore doing some or all of device_shutdown() in
>> > the crash_kexec() path as well as in the kernel_kexec() path.
>> 
>> The problem is that if the machine crashed anything you try to attempt
>> before starting the crash kernel is reducing the chance that the crash
>> kernel actually starts.
>
> Right.
>
>> Is there something at the root bridge level which allows to tell the
>> underlying busses to shut up, reset or go into a defined state? That
>> might avoid chasing lists which might be already unreliable.
>
> Maybe we need some kind of crash_device_shutdown() that does the
> minimal thing to protect the kdump kernel from devices.

The kdump kernel does not use any memory the original kernel uses.
Which should be a minimal and fairly robust level of protection
until the device drivers can be loaded and get ahold of things.

> The programming model for conventional PCI host bridges and PCIe Root
> Complexes is device-specific since they're outside the PCI domain.
> There probably *are* ways to do those things, but you would need a
> native host bridge driver or something like an ACPI method.  I'm not
> aware of an ACPI way to do this, but I added Rafael in case he is.
>
> A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary.

Unless I am confused DMA'ing to memory that is not already in use
is completely broken wether or not you are using the kdump kernel.

Eric
Eric W. Biederman Nov. 15, 2020, 2:29 p.m. UTC | #12
ebiederm@xmission.com (Eric W. Biederman) writes:

> Bjorn Helgaas <helgaas@kernel.org> writes:
>
>> [+cc Rafael for question about ACPI method for PCI host bridge reset]
>>
>> On Sat, Nov 14, 2020 at 09:58:08PM +0100, Thomas Gleixner wrote:
>>> On Sat, Nov 14 2020 at 14:39, Bjorn Helgaas wrote:
>>> > On Sat, Nov 14, 2020 at 12:40:10AM +0100, Thomas Gleixner wrote:
>>> >> On Sat, Nov 14 2020 at 00:31, Thomas Gleixner wrote:
>>> >> > On Fri, Nov 13 2020 at 10:46, Bjorn Helgaas wrote:
>>> >> >> pci_device_shutdown() still clears the Bus Master Enable bit if we're
>>> >> >> doing a kexec and the device is in D0-D3hot, which should also disable
>>> >> >> MSI/MSI-X.  Why doesn't this solve the problem?  Is this because the
>>> >> >> device causing the storm was in PCI_UNKNOWN state?
>>> >> >
>>> >> > That's indeed a really good question.
>>> >> 
>>> >> So we do that on kexec, but is that true when starting a kdump kernel
>>> >> from a kernel crash? I doubt it.
>>> >
>>> > Ah, right, I bet that's it, thanks.  The kdump path is basically this:
>>> >
>>> >   crash_kexec
>>> >     machine_kexec
>>> >
>>> > while the usual kexec path is:
>>> >
>>> >   kernel_kexec
>>> >     kernel_restart_prepare
>>> >       device_shutdown
>>> >         while (!list_empty(&devices_kset->list))
>>> >           dev->bus->shutdown
>>> >             pci_device_shutdown            # pci_bus_type.shutdown
>>> >     machine_kexec
>>> >
>>> > So maybe we need to explore doing some or all of device_shutdown() in
>>> > the crash_kexec() path as well as in the kernel_kexec() path.
>>> 
>>> The problem is that if the machine crashed anything you try to attempt
>>> before starting the crash kernel is reducing the chance that the crash
>>> kernel actually starts.
>>
>> Right.
>>
>>> Is there something at the root bridge level which allows to tell the
>>> underlying busses to shut up, reset or go into a defined state? That
>>> might avoid chasing lists which might be already unreliable.
>>
>> Maybe we need some kind of crash_device_shutdown() that does the
>> minimal thing to protect the kdump kernel from devices.
>
> The kdump kernel does not use any memory the original kernel uses.
> Which should be a minimal and fairly robust level of protection
> until the device drivers can be loaded and get ahold of things.
>
>> The programming model for conventional PCI host bridges and PCIe Root
>> Complexes is device-specific since they're outside the PCI domain.
>> There probably *are* ways to do those things, but you would need a
>> native host bridge driver or something like an ACPI method.  I'm not
>> aware of an ACPI way to do this, but I added Rafael in case he is.
>>
>> A crash_device_shutdown() could do something at the host bridge level
>> if that's possible, or reset/disable bus mastering/disable MSI/etc on
>> individual PCI devices if necessary.
>
> Unless I am confused DMA'ing to memory that is not already in use
> is completely broken wether or not you are using the kdump kernel.

Bah.  I was confused because I had not read up-thread.

MSI mixes DMA and irqs so confusion is easy.

So the problem is screaming irqs when the kernel is booting up.
This is a fundamentally tricky problem.

For ordinary irqs you can have this with level triggered irqs
and the kernel has code that will shutdown the irq at the ioapic
level.  Then the kernel continues by polling the irq source.

I am still missing details but my first question is can our general
solution to screaming level triggered irqs apply?

How can edge triggered MSI irqs be screaming?

Is there something we can do in enabling the APICs or IOAPICs that
would allow this to be handled better.  My memory when we enable
the APICs and IOAPICs we completely clear the APIC entries and so
should be disabling sources.

Is the problem perhaps that we wind up using an APIC entry that was
previously used for the MSI interrupt as something else when we
reprogram them?  Even with this why doesn't the generic code
to stop screaming irqs apply here?

Eric
Thomas Gleixner Nov. 15, 2020, 3:11 p.m. UTC | #13
On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
> ebiederm@xmission.com (Eric W. Biederman) writes:
> For ordinary irqs you can have this with level triggered irqs
> and the kernel has code that will shutdown the irq at the ioapic
> level.  Then the kernel continues by polling the irq source.
>
> I am still missing details but my first question is can our general
> solution to screaming level triggered irqs apply?

No.

> How can edge triggered MSI irqs be screaming?
>
> Is there something we can do in enabling the APICs or IOAPICs that
> would allow this to be handled better.  My memory when we enable
> the APICs and IOAPICs we completely clear the APIC entries and so
> should be disabling sources.

Yes, but MSI has nothing to do with APIC/IOAPIC

> Is the problem perhaps that we wind up using an APIC entry that was
> previously used for the MSI interrupt as something else when we
> reprogram them?  Even with this why doesn't the generic code
> to stop screaming irqs apply here?

Again. No. The problem cannot be solved at the APIC level. The APIC is
the receiving end of MSI and has absolutely no control over it.

An MSI interrupt is a (DMA) write to the local APIC base address
0xfeexxxxx which has the target CPU and control bits encoded in the
lower bits. The written data is the vector and more control bits.

The only way to stop that is to shut it up at the PCI device level.

Assume the following situation:

  - PCI device has MSI enabled and a valid target vector assigned

  - Kernel crashes

  - Kdump kernel starts

  - PCI device raises interrupts which result in the MSI write

  - Kdump kernel enables interrupts and the pending vector is raised in
    the CPU.

  - The CPU has no interrupt descriptor assigned to the vector
    and does not even know where the interrupt originated from. So it
    treats it like any other spurious interrupt to an unassigned vector,
    emits a ratelimited message and ACKs the interrupt at the APIC.

  - PCI device behaves stupid and reraises the interrupt for whatever
    reason.

  - Lather, rinse and repeat.

Unfortunately there is no way to tell the APIC "Mask vector X" and the
dump kernel does neither know which device it comes from nor does it
have enumerated PCI completely which would reset the device and shutup
the spew. Due to the interrupt storm it does not get that far.

Thanks,

        tglx
Lukas Wunner Nov. 15, 2020, 5:01 p.m. UTC | #14
On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
on all active PCI devices in the crashing kernel before starting the
crash kernel?  So that the crash kernel starts with a clean slate?

Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively.  The reason he has to do that
is because the crashing kernel doesn't know which devices exist and
which have interrupts enabled.  However the crashing kernel has that
information.  It should either disable interrupts itself or pass the
necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter.  I guess
the idea is that the parameter is always passed to the crash kernel
but the patches don't do that, which seems odd.

Thanks,

Lukas
Thomas Gleixner Nov. 15, 2020, 7:18 p.m. UTC | #15
On Sun, Nov 15 2020 at 18:01, Lukas Wunner wrote:
> On Sun, Nov 15, 2020 at 04:11:43PM +0100, Thomas Gleixner wrote:
>> Unfortunately there is no way to tell the APIC "Mask vector X" and the
>> dump kernel does neither know which device it comes from nor does it
>> have enumerated PCI completely which would reset the device and shutup
>> the spew. Due to the interrupt storm it does not get that far.
>
> Can't we just set DisINTx, clear MSI Enable and clear MSI-X Enable
> on all active PCI devices in the crashing kernel before starting the
> crash kernel?  So that the crash kernel starts with a clean slate?
>
> Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that
> is because the crashing kernel doesn't know which devices exist and
> which have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.

As I explained before: The problem with doing anything between crashing
and starting the crash kernel is reducing the chance of actually
starting it. The kernel crashed for whatever reason, so it's in a
completely undefined state.

Ergo there is no 'just do something'. You really have to think hard
about what can be done safely with the least probability of running into
another problem.

Thanks,

        tglx
Eric W. Biederman Nov. 15, 2020, 8:46 p.m. UTC | #16
Thomas Gleixner <tglx@linutronix.de> writes:

> On Sun, Nov 15 2020 at 08:29, Eric W. Biederman wrote:
>> ebiederm@xmission.com (Eric W. Biederman) writes:
>> For ordinary irqs you can have this with level triggered irqs
>> and the kernel has code that will shutdown the irq at the ioapic
>> level.  Then the kernel continues by polling the irq source.
>>
>> I am still missing details but my first question is can our general
>> solution to screaming level triggered irqs apply?
>
> No.
>
>> How can edge triggered MSI irqs be screaming?
>>
>> Is there something we can do in enabling the APICs or IOAPICs that
>> would allow this to be handled better.  My memory when we enable
>> the APICs and IOAPICs we completely clear the APIC entries and so
>> should be disabling sources.
>
> Yes, but MSI has nothing to do with APIC/IOAPIC

Yes, sorry.  It has been long enough that the details were paged out
of my memory.

>> Is the problem perhaps that we wind up using an APIC entry that was
>> previously used for the MSI interrupt as something else when we
>> reprogram them?  Even with this why doesn't the generic code
>> to stop screaming irqs apply here?
>
> Again. No. The problem cannot be solved at the APIC level. The APIC is
> the receiving end of MSI and has absolutely no control over it.
>
> An MSI interrupt is a (DMA) write to the local APIC base address
> 0xfeexxxxx which has the target CPU and control bits encoded in the
> lower bits. The written data is the vector and more control bits.
>
> The only way to stop that is to shut it up at the PCI device level.

Or to write to magic chipset registers that will stop transforming DMA
writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
such registers (because the point of the IOMMU is to limit the problems
rogue devices can cause).  Without an IOMMU I don't know if x86 has any
such registers.  I remember that other platforms have an interrupt
controller that does provide some level of control.  That x86 does not
is what makes this an x86 specific problem.

The generic solution is to have the PCI code set bus master disables
when it is enumerationg and initializing devices.  Last time I was
paying attention that was actually the policy of the pci layer and
drivers that did not enable bus mastering were considered buggy.

Looking at patch 3/3 what this patchset does is an early disable of
of the msi registers.  Which is mostly reasonable.  Especially as has
been pointed out the only location the x86 vector and x86 cpu can
be found is in the msi configuration registers.

That also seems reasonable.  But Bjorn's concern about not finding all
devices in all domains does seem real.

There are a handful of devices where the Bus master disable bit doesn't
disable bus mastering.  I wonder if there are devices where MSI and MSIX
disables don't fully work.  It seems completely possible to have MSI or
MSIX equivalent registers at a non-standard location as drivers must be
loaded to handle them.

So if we can safely and reliably disable DMA and MSI at the generic PCI
device level during boot up I am all for it.


How difficult would it be to tell the IOMMUs to stop passing traffic
through in an early pci quirk?  The problem setup was apparently someone
using the device directly from a VM.  So I presume there is an IOMMU
in that configuration.

> Unfortunately there is no way to tell the APIC "Mask vector X" and the
> dump kernel does neither know which device it comes from nor does it
> have enumerated PCI completely which would reset the device and shutup
> the spew. Due to the interrupt storm it does not get that far.

So the question is how do we make this robust?

Can we perhaps disable all interrupts in this case and limp along
in polling mode until the pci bus has been enumerated?

It is nice and desirable to be able to use the hardware in high
performance mode in a kexec-on-panic situation but if we can detect a
problem and figure out how to limp along sometimes that is acceptable.

The failure mode in the kexec-on-panic kernel is definitely the corect
one.  We could not figure out how to wrestle the hardware into usability
so we fail to take a crash dump or do anything else that might corrupt
the system.

Eric
Guilherme G. Piccoli Nov. 16, 2020, 8:31 p.m. UTC | #17
First of all, thanks everybody for the great insights/discussion! This
thread ended-up being a great learning for (at least) me.

Given the big amount of replies and intermixed comments, I wouldn't be
able to respond inline to all, so I'll try another approach below.


From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time.
That doesn't seem like a robust solution because (a) the problem affects
all arches but early_quirks() is x86-specific and (b) even on x86
early_quirks() only works for PCI segment 0 because it relies on the
0xCF8/0xCFC I/O ports."

Ah. I wasn't aware of that limitation, I thought enhancing the
early_quirks() search to go through all buses would fix that, thanks for
the clarification! And again, worth to clarify that this is not a
problem affecting all arches _in practice_ - PowerPC for example has the
FW primitives allowing a powerful PCI controller (out-of-band) reset,
preventing this kind of issue usually.

[0]
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com


From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary."

From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively.  The reason he has to do that is
because the crashing kernel doesn't know which devices exist and which
have interrupts enabled.  However the crashing kernel has that
information.  It should either disable interrupts itself or pass the
necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter.  I guess
the idea is that the parameter is always passed to the crash kernel but
the patches don't do that, which seems odd."

Very good points Lukas, thanks! The reason of not adding the clearmsi
thing as a default kdump procedure is kinda related to your first point:
it impacts a bit boot-time, also it's an additional logic in the kdump
kernel, which we know is (sometimes) the last resort in gathering
additional data to debug a potentially complex issue. That said, I'd not
like to enforce this kind of "policy" to everybody, so my implementation
relies on having it as a parameter, and the kdump userspace counter-part
could then have a choice in adding or not such mechanism to the kdump
kernel parameter list.

About passing the data to next kernel, this is very interesting, we
could do something like that either through setup_data (as you said) or
using a new proposal, the PKRAM thing [a].
This way we wouldn't need a crash_device_shutdown(), but instead when
kernel is loading a crash kernel (through kexec -p) we can collect all
the necessary information that'll be passed to the crash kernel
(obviously that if we are collecting PCI topology information, we need
callbacks in the PCI hotplug add/del path to update this information).

[a]
https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

Below, inline reply to Eric's last message.


On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
> 
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
> such registers.  I remember that other platforms have an interrupt
> controller that does provide some level of control.  That x86 does not
> is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3 what this patchset does is an early disable of
> of the msi registers.  Which is mostly reasonable.  Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
> 
> That also seems reasonable.  But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
> 
> 
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk?  The problem setup was apparently someone
> using the device directly from a VM.  So I presume there is an IOMMU
> in that configuration.

This is a good idea I think - we considered something like that in
theory while working the problem back in 2018, but given I'm even less
expert in IOMMU that I already am in PCI, the option was to go with the
PCI approach. And you are right, the original problem is a device in
PCI-PT to a VM, and a tool messing with the PF device in the SR-IOV (to
collect some stats) from the host side, through VFIO IIRC.

Anyway, we could split the problem in two after all the considerations
in the thread, I believe:

(1) If possible to set-up the IOMMU to prevent MSIs, by "blocking" the
DMA writes for PCI devices *until* PCI core code properly initialize the
devices, that'd handle the majority of the cases I guess (IOMMU usage is
quite common nowadays).

(2) Collecting PCI topology information in a running kernel and passing
that to the kdump kernel would allow us to disable the PCI devices MSI
capabilities, the only problem here is that I couldn't see how to do
that early enough, before the 'sti' executes, without relying in
early_quirks(). ACPI maybe? As per Bjorn comment when explaining the
limitations of early_quirks(), this problem seems not to be trivial.

I'm a bit against doing that in crash_kexec() for the reasons mentioned
in the thread, specially by Thomas and Eric - but if there's no other
way...maybe this is the way to go.

Cheers,


Guilherme


P.S. Fixed Gavin's bouncing address, sorry for that.
Eric W. Biederman Nov. 16, 2020, 9:45 p.m. UTC | #18
"Guilherme G. Piccoli" <gpiccoli@canonical.com> writes:

> First of all, thanks everybody for the great insights/discussion! This
> thread ended-up being a great learning for (at least) me.
>
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
>
>
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
>
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, worth to clarify that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> preventing this kind of issue usually.
>
> [0]
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com
>
>
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
>
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that is
> because the crashing kernel doesn't know which devices exist and which
> have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.
>
> Guilherme's patches add a "clearmsi" command line parameter.  I guess
> the idea is that the parameter is always passed to the crash kernel but
> the patches don't do that, which seems odd."
>
> Very good points Lukas, thanks! The reason of not adding the clearmsi
> thing as a default kdump procedure is kinda related to your first point:
> it impacts a bit boot-time, also it's an additional logic in the kdump
> kernel, which we know is (sometimes) the last resort in gathering
> additional data to debug a potentially complex issue. That said, I'd not
> like to enforce this kind of "policy" to everybody, so my implementation
> relies on having it as a parameter, and the kdump userspace counter-part
> could then have a choice in adding or not such mechanism to the kdump
> kernel parameter list.
>
> About passing the data to next kernel, this is very interesting, we
> could do something like that either through setup_data (as you said) or
> using a new proposal, the PKRAM thing [a].
> This way we wouldn't need a crash_device_shutdown(), but instead when
> kernel is loading a crash kernel (through kexec -p) we can collect all
> the necessary information that'll be passed to the crash kernel
> (obviously that if we are collecting PCI topology information, we need
> callbacks in the PCI hotplug add/del path to update this information).
>
> [a]
> https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/
>
> Below, inline reply to Eric's last message.
>
>
> On 15/11/2020 17:46, Eric W. Biederman wrote:
>>> [...]
>>> An MSI interrupt is a (DMA) write to the local APIC base address
>>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>>> lower bits. The written data is the vector and more control bits.
>>>
>>> The only way to stop that is to shut it up at the PCI device level.
>> 
>> Or to write to magic chipset registers that will stop transforming DMA
>> writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
>> such registers (because the point of the IOMMU is to limit the problems
>> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
>> such registers.  I remember that other platforms have an interrupt
>> controller that does provide some level of control.  That x86 does not
>> is what makes this an x86 specific problem.
>> [...]
>> Looking at patch 3/3 what this patchset does is an early disable of
>> of the msi registers.  Which is mostly reasonable.  Especially as has
>> been pointed out the only location the x86 vector and x86 cpu can
>> be found is in the msi configuration registers.
>> 
>> That also seems reasonable.  But Bjorn's concern about not finding all
>> devices in all domains does seem real.
>> [...]
>> So if we can safely and reliably disable DMA and MSI at the generic PCI
>> device level during boot up I am all for it.
>> 
>> 
>> How difficult would it be to tell the IOMMUs to stop passing traffic
>> through in an early pci quirk?  The problem setup was apparently someone
>> using the device directly from a VM.  So I presume there is an IOMMU
>> in that configuration.
>
> This is a good idea I think - we considered something like that in
> theory while working the problem back in 2018, but given I'm even less
> expert in IOMMU that I already am in PCI, the option was to go with the
> PCI approach. And you are right, the original problem is a device in
> PCI-PT to a VM, and a tool messing with the PF device in the SR-IOV (to
> collect some stats) from the host side, through VFIO IIRC.
>
> Anyway, we could split the problem in two after all the considerations
> in the thread, I believe:
>
> (1) If possible to set-up the IOMMU to prevent MSIs, by "blocking" the
> DMA writes for PCI devices *until* PCI core code properly initialize the
> devices, that'd handle the majority of the cases I guess (IOMMU usage is
> quite common nowadays).
>
> (2) Collecting PCI topology information in a running kernel and passing
> that to the kdump kernel would allow us to disable the PCI devices MSI
> capabilities, the only problem here is that I couldn't see how to do
> that early enough, before the 'sti' executes, without relying in
> early_quirks(). ACPI maybe? As per Bjorn comment when explaining the
> limitations of early_quirks(), this problem seems not to be trivial.
>
> I'm a bit against doing that in crash_kexec() for the reasons mentioned
> in the thread, specially by Thomas and Eric - but if there's no other
> way...maybe this is the way to go.

The way to do that would be to collect the set of pci devices when the
kexec on panic kernel is loaded, not during crash_kexec.  If someone
performs a device hotplug they would need to reload the kexec on panic
kernel.

I am not necessarily endorsing that just pointing out how it can be
done.

Eric
Guilherme G. Piccoli Nov. 16, 2020, 9:49 p.m. UTC | #19
On Mon, Nov 16, 2020 at 6:45 PM Eric W. Biederman <ebiederm@xmission.com> wrote:
> The way to do that would be to collect the set of pci devices when the
> kexec on panic kernel is loaded, not during crash_kexec.  If someone
> performs a device hotplug they would need to reload the kexec on panic
> kernel.
>
> I am not necessarily endorsing that just pointing out how it can be
> done.
>
> Eric

Thanks Eric, I agree! I think if we use something like PKRAM (a more
dynamic approach) we could have the PCI hotplug path updating the data
to-be-passed to the crash kernel, so the crash kernel doesn't even
need to be loaded again.
Bjorn Helgaas Nov. 17, 2020, 12:19 a.m. UTC | #20
On Mon, Nov 16, 2020 at 05:31:36PM -0300, Guilherme G. Piccoli wrote:
> First of all, thanks everybody for the great insights/discussion! This
> thread ended-up being a great learning for (at least) me.
> 
> Given the big amount of replies and intermixed comments, I wouldn't be
> able to respond inline to all, so I'll try another approach below.
> 
> 
> From Bjorn:
> "I think [0] proposes using early_quirks() to disable MSIs at boot-time.
> That doesn't seem like a robust solution because (a) the problem affects
> all arches but early_quirks() is x86-specific and (b) even on x86
> early_quirks() only works for PCI segment 0 because it relies on the
> 0xCF8/0xCFC I/O ports."
> 
> Ah. I wasn't aware of that limitation, I thought enhancing the
> early_quirks() search to go through all buses would fix that, thanks for
> the clarification! And again, worth to clarify that this is not a
> problem affecting all arches _in practice_ - PowerPC for example has the
> FW primitives allowing a powerful PCI controller (out-of-band) reset,
> preventing this kind of issue usually.
> 
> [0]
> https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com
> 
> 
> From Bjorn:
> "A crash_device_shutdown() could do something at the host bridge level
> if that's possible, or reset/disable bus mastering/disable MSI/etc on
> individual PCI devices if necessary."
> 
> From Lukas:
> "Guilherme's original patches from 2018 iterate over all 256 PCI buses.
> That might impact boot time negatively.  The reason he has to do that is
> because the crashing kernel doesn't know which devices exist and which
> have interrupts enabled.  However the crashing kernel has that
> information.  It should either disable interrupts itself or pass the
> necessary information to the crashing kernel as setup_data or whatever.

I don't think passing the device information to the kdump kernel is
really practical.  The kdump kernel would use it to do PCI config
writes to disable MSIs before enabling IRQs, and it doesn't know how
to access config space that early.

We could invent special "early config access" things, but that gets
really complicated really fast.  Config access depends on ACPI MCFG
tables, firmware interfaces, and in many cases, on the native host
bridge drivers in drivers/pci/controllers/.

I think we need to disable MSIs in the crashing kernel before the
kexec.  It adds a little more code in the crash_kexec() path, but it
seems like a worthwhile tradeoff.

Bjorn
Eric W. Biederman Nov. 17, 2020, 1:06 a.m. UTC | #21
Bjorn Helgaas <helgaas@kernel.org> writes:

> I don't think passing the device information to the kdump kernel is
> really practical.  The kdump kernel would use it to do PCI config
> writes to disable MSIs before enabling IRQs, and it doesn't know how
> to access config space that early.

I don't think it is particularly practical either.  But in practice
on x86 it is either mmio writes or 0xcf8 style writes and we could
pass a magic table that would have all of that information.

> We could invent special "early config access" things, but that gets
> really complicated really fast.  Config access depends on ACPI MCFG
> tables, firmware interfaces, and in many cases, on the native host
> bridge drivers in drivers/pci/controllers/.

I do agree that the practical problem with passing information early
is that gets us into the weeds and creates code that we only care
about in the case of kexec-on-panic.  It is much better to make the
existing code more robust, so that we reduce our dependency on firmware
doing the right thing.

> I think we need to disable MSIs in the crashing kernel before the
> kexec.  It adds a little more code in the crash_kexec() path, but it
> seems like a worthwhile tradeoff.

Disabling MSIs in the b0rken kernel is not possible.

Walking the device tree or even a significant subset of it hugely
decreases the chances that we will run into something that is incorrect
in the known broken kernel.  I expect the code to do that would double
or triple the amount of code that must be executed in the known broken
kernel.  The last time something like that happened (switching from xchg
to ordinary locks) we had cases that stopped working.  Walking all of
the pci devices in the system is much more invasive.

That is not to downplay the problems of figuring out how to disable
things in early boot.

My two top candidates are poking the IOMMUs early to shut things off,
and figuring out if we can delay enabling interrupts until we have
initialized pci.

Poking at IOMMUs early should work for most systems with ``enterprise''
hardware.  Systems where people care about kdump the most.

Eric
Thomas Gleixner Nov. 17, 2020, 9:53 a.m. UTC | #22
On Mon, Nov 16 2020 at 19:06, Eric W. Biederman wrote:
> Bjorn Helgaas <helgaas@kernel.org> writes:
> My two top candidates are poking the IOMMUs early to shut things off,
> and figuring out if we can delay enabling interrupts until we have
> initialized pci.

Keeping interrupts disabled post PCI initialization would be nice, but
that requires feeding the whole init machinery through a shredder and
collecting the bits and pieces.

> Poking at IOMMUs early should work for most systems with ``enterprise''
> hardware.  Systems where people care about kdump the most.

The IOMMU IRQ remapping part _is_ initialized early and _before_
interrupts are enabled.

But that does not solve the problem either simply because then the IOMMU
will catch the rogue MSIs and you get an interrupt storm on the IOMMU
error interrupt.

This one is not going to be turned off because the IOMMU error interrupt
handler will handle each of them and tell the core code that everything
is under control.

As I explained several times now, the only way to shut this up reliably
is at the source. Curing the symptom is almost never a good solution.

Thanks,

        tglx
Guilherme G. Piccoli Nov. 17, 2020, 12:04 p.m. UTC | #23
On Mon, Nov 16, 2020 at 10:07 PM Eric W. Biederman
<ebiederm@xmission.com> wrote:
> [...]
> > I think we need to disable MSIs in the crashing kernel before the
> > kexec.  It adds a little more code in the crash_kexec() path, but it
> > seems like a worthwhile tradeoff.
>
> Disabling MSIs in the b0rken kernel is not possible.
>
> Walking the device tree or even a significant subset of it hugely
> decreases the chances that we will run into something that is incorrect
> in the known broken kernel.  I expect the code to do that would double
> or triple the amount of code that must be executed in the known broken
> kernel.  The last time something like that happened (switching from xchg
> to ordinary locks) we had cases that stopped working.  Walking all of
> the pci devices in the system is much more invasive.
>

I think we could try to decouple this problem in 2, if that makes
sense. Bjorn/others can jump in and correct me if I'm wrong. So, the
problem is to walk a PCI topology tree, identify every device and
disable MSI(/INTx maybe) in them, and the big deal with doing that in
the broken kernel before the kexec is that this task is complex due to
the tree walk, mainly. But what if we keep a table (a simple 2-D
array) with the address and data to be written to disable the MSIs,
and before the kexec we could have a parameter enabling a function
that just goes through this array and performs the writes?

The table itself would be constructed by the PCI core (and updated in
the hotplug path), when device discovery is happening. This table
would live in a protected area in memory, with no write/execute
access, this way if the kernel itself tries to corrupt that, we get a
fault (yeah, I know DMAs can corrupt anywhere, but IOMMU could protect
against that). If the parameter "kdump_clear_msi" is passed in the
cmdline of the regular kernel, for example, then the function walks
this simple table and performs the devices' writes before the kexec...

If that's not possible or a bad idea for any reason, I still think the
early_quirks() idea hereby proposed is not something we should
discard, because it'll help a lot of users even with its limitations
(in our case it worked very well).
Also, taking here the opportunity to clarify my understanding about
the limitations of that approach: Bjorn, in our reproducer machine we
had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
0000:80 - are those all under "segment 0" as per your verbiage? In our
case the troublemaker was under 0000:80, and the early_quirks() shut
the device up successfully.

Thanks,


Guilherme
David Woodhouse Nov. 17, 2020, 12:19 p.m. UTC | #24
On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
> But that does not solve the problem either simply because then the IOMMU
> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
> error interrupt.

Not if you can tell the IOMMU to stop reporting those errors.

We can easily do it per-device for DMA errors; not quite sure what
granularity we have for interrupts. Perhaps the Intel IOMMU only lets
you set the Fault Processing Disable bit per IRTE entry, and you still
get faults for Compatibility Format interrupts? Not sure about AMD...
Thomas Gleixner Nov. 17, 2020, 7:34 p.m. UTC | #25
On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>> But that does not solve the problem either simply because then the IOMMU
>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>> error interrupt.
>
> Not if you can tell the IOMMU to stop reporting those errors.
>
> We can easily do it per-device for DMA errors; not quite sure what
> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
> you set the Fault Processing Disable bit per IRTE entry, and you still
> get faults for Compatibility Format interrupts? Not sure about AMD...

It looks like the fault (DMAR) and event (AMD) interrupts can be
disabled in the IOMMU. That might help to bridge the gap until the PCI
bus is scanned in full glory and the devices can be shut up for real.

If we make this conditional for a crash dump kernel that might do the
trick.

Lot's of _might_ there :)

Thanks

        tglx
Eric W. Biederman Nov. 17, 2020, 10:25 p.m. UTC | #26
Thomas Gleixner <tglx@linutronix.de> writes:

> On Tue, Nov 17 2020 at 12:19, David Woodhouse wrote:
>> On Tue, 2020-11-17 at 10:53 +0100, Thomas Gleixner wrote:
>>> But that does not solve the problem either simply because then the IOMMU
>>> will catch the rogue MSIs and you get an interrupt storm on the IOMMU
>>> error interrupt.
>>
>> Not if you can tell the IOMMU to stop reporting those errors.
>>
>> We can easily do it per-device for DMA errors; not quite sure what
>> granularity we have for interrupts. Perhaps the Intel IOMMU only lets
>> you set the Fault Processing Disable bit per IRTE entry, and you still
>> get faults for Compatibility Format interrupts? Not sure about AMD...
>
> It looks like the fault (DMAR) and event (AMD) interrupts can be
> disabled in the IOMMU. That might help to bridge the gap until the PCI
> bus is scanned in full glory and the devices can be shut up for real.
>
> If we make this conditional for a crash dump kernel that might do the
> trick.
>
> Lot's of _might_ there :)

Worth testing.

Folks tracking this down is this enough of a hint for you to write a
patch and test?

Also worth checking how close irqpoll is to handling a case like this.
At least historically it did a pretty good job at shutting down problem
interrupts.

I really find it weird that an edge triggered irq was firing fast enough
to stop a system from booting.  Level triggered irqs do that if they are
acknolwedged without actually being handled.  I think edge triggered
irqs only fire when another event comes in, and it seems odd to get so
many actual events causing interrupts that the system soft locks
up.  Is my memory of that situation confused?

I recommend making these facilities general debug facilities so that
they can be used for cases other than crash dump.  The crash dump kernel
would always enable them because it can assume that the hardware is
very likely in a wonky state.

Eric
Bjorn Helgaas Nov. 18, 2020, 9:05 p.m. UTC | #27
On Tue, Nov 17, 2020 at 09:04:07AM -0300, Guilherme Piccoli wrote:

> Also, taking here the opportunity to clarify my understanding about
> the limitations of that approach: Bjorn, in our reproducer machine we
> had 3 parents in the PCI tree (as per lspci -t), 0000:00, 0000:ff and
> 0000:80 - are those all under "segment 0" as per your verbiage?

Yes.  The "0000" is the PCI segment (or "domain" in the Linux PCI core).
It's common on x86 to have multiple host bridges in segment 0000.
Guilherme G. Piccoli Nov. 18, 2020, 10:36 p.m. UTC | #28
Thanks a lot Bjorn! I confess except for PPC64 Server machines, I
never saw other "domains" or segments. Is it common in x86 to have
that? The early_quirks() are restricted to the first segment, no
matter how many host bridges we have in segment 0000?

Thanks again!
Bjorn Helgaas Nov. 30, 2020, 8:20 p.m. UTC | #29
On Wed, Nov 18, 2020 at 07:36:08PM -0300, Guilherme Piccoli wrote:
> Thanks a lot Bjorn! I confess except for PPC64 Server machines, I
> never saw other "domains" or segments. Is it common in x86 to have
> that? The early_quirks() are restricted to the first segment, no
> matter how many host bridges we have in segment 0000?

I don't know whether it's *common* to have multiple domains on x86,
but they're definitely used on large systems.  This includes some
lspci info from an HPE Superdome Flex system:
https://lore.kernel.org/lkml/5350E02A-6457-41A8-8F33-AF67BFDAEE3E@fb.com/

The early quirks in arch/x86/kernel/early-quirks.c (not the
DECLARE_PCI_FIXUP_EARLY quirks in drivers/pci/quirks.c) are restricted
to segment 0, no matter how many host bridges there are.  This is
because they use read_pci_config_16(), which uses a config access
mechanism that has no provision for a segment number.

Bjorn
Guilherme G. Piccoli Dec. 14, 2020, 6:32 p.m. UTC | #30
Thank you for the clarification Bjorn! I was on vacation..sorry for my delay.

Closing the loop here, I understand we're not getting this patch
merged (due to its restriction to domain 0) and there was a suggestion
in the thread of trying to block MSIs from the IOMMU init code (which
also have the restriction of only working in iommu-enabled systems).
Hope the thread is helpful and if somebody faces this issue, can
comment here and at least find this approach, maybe test the patch.

Thanks to all involved!
diff mbox series

Patch

diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
index 50d5848bf22e..fd50f9e21623 100644
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -731,7 +731,6 @@  static int __init check_dev_quirk(int num, int slot, int func)
 	u16 vendor;
 	u16 device;
 	u8 type;
-	u8 sec;
 	int i;
 
 	class = read_pci_config_16(num, slot, func, PCI_CLASS_DEVICE);
@@ -760,11 +759,8 @@  static int __init check_dev_quirk(int num, int slot, int func)
 	type = read_pci_config_byte(num, slot, func,
 				    PCI_HEADER_TYPE);
 
-	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE) {
-		sec = read_pci_config_byte(num, slot, func, PCI_SECONDARY_BUS);
-		if (sec > num)
-			early_pci_scan_bus(sec);
-	}
+	if ((type & 0x7f) == PCI_HEADER_TYPE_BRIDGE)
+		return -1;
 
 	if (!(type & 0x80))
 		return -1;
@@ -787,8 +783,11 @@  static void __init early_pci_scan_bus(int bus)
 
 void __init early_quirks(void)
 {
+	int bus;
+
 	if (!early_pci_allowed())
 		return;
 
-	early_pci_scan_bus(0);
+	for (bus = 0; bus < 256; bus++)
+		early_pci_scan_bus(bus);
 }