MSI interrupt for xhci still lost on 5.6-rc6 after cpu hotplug

Message ID 806c51fa-992b-33ac-61a9-00a606f82edb@linux.intel.com
State New

Commit Message

Mathias Nyman March 18, 2020, 7:25 p.m. UTC
Hi

I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
the "Plug non-maskable MSI affinity race" patch.

I can see this on a couple platforms, I'm running a script that first generates
a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
and on cpus:

for i in 1 3 5 7; do
	echo "1" > /sys/devices/system/cpu/cpu$i/online
done
echo "A" > "/proc/irq/*/smp_affinity"
echo "A" > "/proc/irq/*/smp_affinity"
echo "F" > "/proc/irq/*/smp_affinity"
for i in 1 3 5 7; do
	echo "0" > /sys/devices/system/cpu/cpu$i/online
done
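
Side note: with the quotes, the glob in the three smp_affinity lines does
not expand (the redirect targets a literal path), so a per-file loop along
these lines is presumably what is intended for each of the three writes:

for f in /proc/irq/*/smp_affinity; do
	echo "A" > "$f" 2>/dev/null
done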

I added some very simple debugging but I don't really know what to look for.
xhci interrupts (122) just stop after setting msi affinity; it survived many
similar msi_set_affinity() calls before this.

I'm not that familiar with the inner workings of this, but I'll be happy to
help out with adding debugging and testing patches.

Details:

 cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
   0:         26          0          0          0          0          0          0          0   IO-APIC    2-edge      timer
   1:          0          0          0          0          0          7          0          0   IO-APIC    1-edge      i8042
   4:          0          4      59941          0          0          0          0          0   IO-APIC    4-edge      ttyS0
   8:          0          0          0          0          0          0          1          0   IO-APIC    8-edge      rtc0
   9:          0         40          8          0          0          0          0          0   IO-APIC    9-fasteoi   acpi
  16:          0          0          0          0          0          0          0          0   IO-APIC   16-fasteoi   i801_smbus
 120:          0          0        293          0          0          0          0          0   PCI-MSI 32768-edge      i915
 121:        728          0          0         58          0          0          0          0   PCI-MSI 520192-edge      enp0s31f6
 122:      63575       2271          0       1957       7262          0          0          0   PCI-MSI 327680-edge      xhci_hcd
 123:          0          0          0          0          0          0          0          0   PCI-MSI 514048-edge      snd_hda_intel:card0
 NMI:          0          0          0          0          0          0          0          0   Non-maskable interrupts
 
trace snippet: 
      <idle>-0     [001] d.h.   129.676900: xhci_irq: xhci irq
      <idle>-0     [001] d.h.   129.677507: xhci_irq: xhci irq
      <idle>-0     [001] d.h.   129.677556: xhci_irq: xhci irq
      <idle>-0     [001] d.h.   129.677647: xhci_irq: xhci irq
      <...>-14    [001] d..1   129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6
      <idle>-0     [003] d.h.   129.682639: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.682769: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.682908: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.683552: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.683677: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.683819: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.689017: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.689140: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.689307: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.689984: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.690107: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.690278: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.695541: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.695674: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.695839: xhci_irq: xhci irq
      <idle>-0     [003] d.H.   129.696667: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.696797: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.696973: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.702288: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.702380: xhci_irq: xhci irq
      <idle>-0     [003] d.h.   129.702493: xhci_irq: xhci irq
 migration/3-24    [003] d..1   129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0
 kworker/0:0-5     [000] d.h.   131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
      <idle>-0     [000] d.h.   137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
 kworker/0:0-5     [000] d.h.   145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0



Thanks
-Mathias

Comments

Evan Green March 19, 2020, 8:24 p.m. UTC | #1
On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman
<mathias.nyman@linux.intel.com> wrote:
>
> Hi
>
> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
> the "Plug non-maskable MSI affinity race" patch.
>
> I can see this on a couple platforms, I'm running a script that first generates
> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
> and on cpus:
>
> for i in 1 3 5 7; do
>         echo "1" > /sys/devices/system/cpu/cpu$i/online
> done
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "F" > "/proc/irq/*/smp_affinity"
> for i in 1 3 5 7; do
>         echo "0" > /sys/devices/system/cpu/cpu$i/online
> done
>
> I added some very simple debugging but I don't really know what to look for.
> xhci interrupts (122) just stop after setting msi affinity; it survived many
> similar msi_set_affinity() calls before this.
>
> I'm not that familiar with the inner workings of this, but I'll be happy to
> help out with adding debugging and testing patches.

How quickly are you able to reproduce this when you run your script?
Does reverting Thomas' patch make it repro faster? Can you send the
output of lspci -vvv for the xhci device?

-Evan
Mathias Nyman March 20, 2020, 8:07 a.m. UTC | #2
On 19.3.2020 22.24, Evan Green wrote:
> On Wed, Mar 18, 2020 at 12:23 PM Mathias Nyman
> <mathias.nyman@linux.intel.com> wrote:
>>
>> Hi
>>
>> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
>> the "Plug non-maskable MSI affinity race" patch.
>>
>> I can see this on a couple platforms, I'm running a script that first generates
>> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
>> and on cpus:
>>
>> for i in 1 3 5 7; do
>>         echo "1" > /sys/devices/system/cpu/cpu$i/online
>> done
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "F" > "/proc/irq/*/smp_affinity"
>> for i in 1 3 5 7; do
>>         echo "0" > /sys/devices/system/cpu/cpu$i/online
>> done
>>
>> I added some very simple debugging but I don't really know what to look for.
>> xhci interrupts (122) just stop after setting msi affinity; it survived many
>> similar msi_set_affinity() calls before this.
>>
>> I'm not that familiar with the inner workings of this, but I'll be happy to
>> help out with adding debugging and testing patches.
> 
> How quickly are you able to reproduce this when you run your script?

Less than a minute

> Does reverting Thomas' patch make it repro faster? Can you send the
> output of lspci -vvv for the xhci device?

It's roughly the same after reverting
"6f1a4891a592 x86/apic/msi: Plug non-maskable MSI affinity race"

# lspci -vvv -d :a3af
00:14.0 USB controller: Intel Corporation Device a3af (prog-if 30 [XHCI])
        Subsystem: Intel Corporation Device 7270
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=medium >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin ? routed to IRQ 122
        Region 0: Memory at b1020000 (64-bit, non-prefetchable) [size=64K]
        Capabilities: [70] Power Management version 2
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0-,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [80] MSI: Enable+ Count=1/8 Maskable- 64bit+
                Address: 00000000fee08000  Data: 4021
        Kernel driver in use: xhci_hcd
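
For reference, "Maskable-" above is exactly the non-maskable MSI case the
quirk code has to deal with, and the message decodes as follows per the x86
MSI message format (a worked decode, not part of the lspci output):

  Data 0x4021:        bits 7:0 = 0x21 -> vector 33, matching the traces
                      bit 14          -> level assert, as Linux composes MSI data
  Address 0xfee08000: bits 19:12 = 0x08 -> destination APIC ID 8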

-Mathias
Thomas Gleixner March 20, 2020, 9:52 a.m. UTC | #3
Mathias,

Mathias Nyman <mathias.nyman@linux.intel.com> writes:
> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
> the "Plug non-maskable MSI affinity race" patch.
>
> I can see this on a couple platforms, I'm running a script that first generates
> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
> and on cpus:
>
> for i in 1 3 5 7; do
> 	echo "1" > /sys/devices/system/cpu/cpu$i/online
> done
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "A" > "/proc/irq/*/smp_affinity"
> echo "F" > "/proc/irq/*/smp_affinity"
> for i in 1 3 5 7; do
> 	echo "0" > /sys/devices/system/cpu/cpu$i/online
> done
> trace snippet: 
>       <idle>-0     [001] d.h.   129.676900: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677507: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677556: xhci_irq: xhci irq
>       <idle>-0     [001] d.h.   129.677647: xhci_irq: xhci irq
>       <...>-14     [001] d..1   129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6

Looks like a regular affinity setting in interrupt context, but I can't
make sense of the time stamps

>       <idle>-0     [003] d.h.   129.682639: xhci_irq: xhci irq
>       <idle>-0     [003] d.h.   129.702380: xhci_irq: xhci irq
>       <idle>-0     [003] d.h.   129.702493: xhci_irq: xhci irq
>  migration/3-24    [003] d..1   129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0

So this is a CPU offline operation and after that irq 122 is silent, right?

>  kworker/0:0-5     [000] d.h.   131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>       <idle>-0     [000] d.h.   137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>  kworker/0:0-5     [000] d.h.   145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0

That kworker context looks fishy. Can you please enable stacktraces in
the tracer so I can see the call chains leading to this? OTOH that's irq
121 not 122. Anyway moar information is always useful.

And please add the patch below.

Thanks,

        tglx

8<---------------
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -243,6 +243,7 @@ u64 arch_irq_stat(void)
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");
 
 	desc = __this_cpu_read(vector_irq[vector]);
+	trace_printk("vector: %u desc %lx\n", vector, (unsigned long) desc);
 	if (likely(!IS_ERR_OR_NULL(desc))) {
 		if (IS_ENABLED(CONFIG_X86_32))
 			handle_irq(desc, regs);
Mathias Nyman March 23, 2020, 9:42 a.m. UTC | #4
On 20.3.2020 11.52, Thomas Gleixner wrote:
> Mathias,
> 
> Mathias Nyman <mathias.nyman@linux.intel.com> writes:
>> I can reproduce the lost MSI interrupt issue on 5.6-rc6 which includes
>> the "Plug non-maskable MSI affinity race" patch.
>>
>> I can see this on a couple platforms, I'm running a script that first generates
>> a lot of usb traffic, and then in a busyloop sets irq affinity and turns off
>> and on cpus:
>>
>> for i in 1 3 5 7; do
>> 	echo "1" > /sys/devices/system/cpu/cpu$i/online
>> done
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "A" > "/proc/irq/*/smp_affinity"
>> echo "F" > "/proc/irq/*/smp_affinity"
>> for i in 1 3 5 7; do
>> 	echo "0" > /sys/devices/system/cpu/cpu$i/online
>> done
>> trace snippet: 
>>       <idle>-0     [001] d.h.   129.676900: xhci_irq: xhci irq
>>       <idle>-0     [001] d.h.   129.677507: xhci_irq: xhci irq
>>       <idle>-0     [001] d.h.   129.677556: xhci_irq: xhci irq
>>       <idle>-0     [001] d.h.   129.677647: xhci_irq: xhci irq
>>       <...>-14     [001] d..1   129.679802: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 2 -> 6
> 
> Looks like a regular affinity setting in interrupt context, but I can't
> make sense of the time stamps

I think so; everything still worked normally after this one.

> 
>>       <idle>-0     [003] d.h.   129.682639: xhci_irq: xhci irq
>>       <idle>-0     [003] d.h.   129.702380: xhci_irq: xhci irq
>>       <idle>-0     [003] d.h.   129.702493: xhci_irq: xhci irq
>>  migration/3-24    [003] d..1   129.703150: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 6 -> 0
> 
> So this is a CPU offline operation and after that irq 122 is silent, right?

Yes, after this irq 122 was silent.

> 
>>  kworker/0:0-5     [000] d.h.   131.328790: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   133.312704: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   135.360786: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>       <idle>-0     [000] d.h.   137.344694: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   139.128679: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   141.312686: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   143.360703: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
>>  kworker/0:0-5     [000] d.h.   145.344791: msi_set_affinity: direct update msi 121, vector 34 -> 34, apicid: 0 -> 0
> 
> That kworker context looks fishy. Can you please enable stacktraces in
> the tracer so I can see the call chains leading to this? OTOH that's irq
> 121 not 122. Anyway moar information is always useful.
> 
> And please add the patch below.
> 

Full function trace with patch is huge, can be found compressed at

https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing

xhci_traces.tgz contains:
trace_full: full function trace. 
trace: timestamps ~48.29 to ~48.93 of the trace above, the section with the last xhci irq
trace_prink_only: only the trace_printk() lines of "trace" above

This time xhci interrupts stopped after
migration/3-24    [003] d..1    48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4

Thanks
-Mathias
Thomas Gleixner March 23, 2020, 2:10 p.m. UTC | #5
Mathias,

Mathias Nyman <mathias.nyman@linux.intel.com> writes:
> Full function trace with patch is huge, can be found compressed at
>
> This time xhci interrupts stopped after
> migration/3-24    [003] d..1    48.530271: msi_set_affinity: twostep update msi, irq 122, vector 33 -> 34, apicid: 6 -> 4

thanks for providing the data. I think I decoded the issue. Can you
please test the patch below?

Thanks,

        tglx
----
diff --git a/arch/x86/kernel/apic/msi.c b/arch/x86/kernel/apic/msi.c
index 159bd0cb8548..49aec92fa35b 100644
--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -63,12 +63,13 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 {
 	struct irq_cfg old_cfg, *cfg = irqd_cfg(irqd);
 	struct irq_data *parent = irqd->parent_data;
-	unsigned int cpu;
+	unsigned int cpu, old_vector;
 	int ret;
 
 	/* Save the current configuration */
 	cpu = cpumask_first(irq_data_get_effective_affinity_mask(irqd));
 	old_cfg = *cfg;
+	old_vector = old_cfg.vector;
 
 	/* Allocate a new target vector */
 	ret = parent->chip->irq_set_affinity(parent, mask, force);
@@ -90,10 +91,10 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 */
 	if (!irqd_msi_nomask_quirk(irqd) ||
 	    cfg->vector == old_cfg.vector ||
-	    old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
+	    old_vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
 	    cfg->dest_apicid == old_cfg.dest_apicid) {
 		irq_msi_update_msg(irqd, cfg);
-		return ret;
+		goto check_old;
 	}
 
 	/*
@@ -102,7 +103,7 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	 */
 	if (WARN_ON_ONCE(cpu != smp_processor_id())) {
 		irq_msi_update_msg(irqd, cfg);
-		return ret;
+		goto check_old;
 	}
 
 	/*
@@ -163,6 +164,11 @@ msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
 	if (lapic_vector_set_in_irr(cfg->vector))
 		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
 
+check_old:
+	if (!in_irq() && old_vector != MANAGED_IRQ_SHUTDOWN_VECTOR &&
+	    irqd_msi_nomask_quirk(irqd) && lapic_vector_set_in_irr(old_vector))
+		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
+
 	return ret;
 }
Mathias Nyman March 23, 2020, 8:32 p.m. UTC | #6
On 23.3.2020 16.10, Thomas Gleixner wrote:
> 
> thanks for providing the data. I think I decoded the issue. Can you
> please test the patch below?
> 

Unfortunately it didn't help.
Traces added to xhci_trace2.tgz in the same place

https://drive.google.com/drive/folders/19AFZe32DYk4Kzxi8VYv-OWmNOCyIY6M5?usp=sharing

Thanks
-Mathias
Thomas Gleixner March 24, 2020, 12:24 a.m. UTC | #7
Mathias Nyman <mathias.nyman@linux.intel.com> writes:
> On 23.3.2020 16.10, Thomas Gleixner wrote:
>> 
>> thanks for providing the data. I think I decoded the issue. Can you
>> please test the patch below?
>
> Unfortunately it didn't help.

I did not expect that to help, simply because the same issue is caught
by the loop in fixup_irqs(). What I wanted to make sure is that there is
not something in between which causes the latter to fail.

So I stared at the trace data earlier today and looked at the xhci irq
events. They are following a more or less periodic schedule and the
forced migration on CPU hotplug definitely hits in the time frame where
the next interrupt should be raised by the device.

1) First of all, I fail to understand why new systems released
   in 2020 still use non-maskable MSI, which is the root cause of all of
   this trouble, especially on Intel systems, which are known to have
   these disastrous interrupt migration troubles.

   Please tell your hardware people to stop this. 
    
2) I have no idea why the two step mechanism fails exactly on this
   system. I tried the same test case on a skylake client and I can
   clearly see from the traces that the interrupt raised in the device
   falls exactly into the two step update and causes the IRR to be set
   which resolves the situation by IPI'ing the new target CPU.
    
   I have not found a single instance of IPI recovery in your
   traces. Instead of that your system stops working in exactly this
   situation.

   The two step mechanism tries to work around the fact that PCI does
   not support a 64bit atomic config space update. So we carefully avoid
   changing more than one 32bit value at a time, i.e. we change first
   the vector and then the destination ID (part of address_lo).  This
   ensures that the message is consistent all the time.

   But obviously on your system this does not work as expected. Why? I
   really can't tell.

   Please talk to your hardware folks.
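
   For reference, the two-step update described in 2) boils down to this
   sequence in msi_set_affinity() (a condensed sketch of the 5.6 code,
   error handling omitted):

	/* Redirect to the new vector on the current (old) CPU first */
	old_cfg.vector = cfg->vector;
	irq_msi_update_msg(irqd, &old_cfg);

	/* Then move the destination ID over to the new target CPU */
	irq_msi_update_msg(irqd, cfg);

	/* Catch an interrupt which was raised during the transition */
	if (lapic_vector_set_in_irr(cfg->vector))
		irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);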

And of course all of this is so well documented that all of us can
clearly figure out what's going on...

Thanks,

        tglx
Evan Green March 24, 2020, 4:17 p.m. UTC | #8
On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Mathias Nyman <mathias.nyman@linux.intel.com> writes:
> > On 23.3.2020 16.10, Thomas Gleixner wrote:
> >>
> >> thanks for providing the data. I think I decoded the issue. Can you
> >> please test the patch below?
> >
> > Unfortunately it didn't help.
>
> I did not expect that to help, simply because the same issue is caught
> by the loop in fixup_irqs(). What I wanted to make sure is that there is
> not something in between which causes the latter to fail.
>
> So I stared at the trace data earlier today and looked at the xhci irq
> events. They are following a more or less periodic schedule and the
> forced migration on CPU hotplug definitely hits in the time frame where
> the next interrupt should be raised by the device.
>
> 1) First of all, I fail to understand why new systems released
>    in 2020 still use non-maskable MSI, which is the root cause of all of
>    this trouble, especially on Intel systems, which are known to have
>    these disastrous interrupt migration troubles.
>
>    Please tell your hardware people to stop this.
>
> 2) I have no idea why the two step mechanism fails exactly on this
>    system. I tried the same test case on a skylake client and I can
>    clearly see from the traces that the interrupt raised in the device
>    falls exactly into the two step update and causes the IRR to be set
>    which resolves the situation by IPI'ing the new target CPU.
>
>    I have not found a single instance of IPI recovery in your
>    traces. Instead of that your system stops working in exactly this
>    situation.
>
>    The two step mechanism tries to work around the fact that PCI does
>    not support a 64bit atomic config space update. So we carefully avoid
>    changing more than one 32bit value at a time, i.e. we change first
>    the vector and then the destination ID (part of address_lo).  This
>    ensures that the message is consistent all the time.
>
>    But obviously on your system this does not work as expected. Why? I
>    really can't tell.
>
>    Please talk to your hardware folks.
>
> And of course all of this is so well documented that all of us can
> clearly figure out what's going on...

I won't pretend to know what's going on, so I'll preface this by
labeling it all as "flailing", but:

I wonder if there's some way the interrupt can get delayed between
XHCI snapping the torn value and it finding its way into the IRR. For
instance, if xhci read this value at the start of their interrupt
moderation timer period, that would be awful (I hope they don't do
this). One test patch would be to carve out 8 vectors reserved for
xhci on all cpus. Whenever you change the affinity, the assigned
vector is always reserved_base + cpu_number. That lets you exercise
the affinity switching code, but in a controlled manner where torn
interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
vector but I'm cpu 2). I might struggle to write such a change, but in
theory it's doable.
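
A rough illustration of the idea, with entirely made-up names and no claim
that it maps onto the vector allocator as-is:

#define XHCI_RSVD_BASE	0xd0	/* hypothetical reserved vector base */

	/* Affinity change: CPU n always gets vector XHCI_RSVD_BASE + n.
	 * In the handler, a torn message then shows up as a mismatch:
	 */
	if (vector != XHCI_RSVD_BASE + smp_processor_id())
		pr_warn("torn MSI: CPU %u got CPU %u's vector\n",
			smp_processor_id(), vector - XHCI_RSVD_BASE);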

Actually the slightly easier experiment might be to reserve a single
vector for xhci on all cores. We'd strongly expect the problem to go
away, since now there are no more torn writes, as the vector is always
the same. If it somehow didn't go away, you'd know there's more to the
story.

I was alternately trying to build a theory in my head about the write
somehow being posted and getting out of order, but I don't think that
can happen.

Another experiment would be to try my old patch in [1]. I'm not
advocating for this patch as a solution; Thomas and Bjorn have
convinced me that it will break the rest of the world. But your PCI
device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd
expect to at least see it mask your problem. Again, if it didn't, that
might be an interesting datapoint.

[1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/

-Evan
Thomas Gleixner March 24, 2020, 7:03 p.m. UTC | #9
Evan Green <evgreen@chromium.org> writes:
> On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>> And of course all of this is so well documented that all of us can
>> clearly figure out what's going on...
>
> I won't pretend to know what's going on, so I'll preface this by
> labeling it all as "flailing", but:
>
> I wonder if there's some way the interrupt can get delayed between
> XHCI snapping the torn value and it finding its way into the IRR. For
> instance, if xhci read this value at the start of their interrupt
> moderation timer period, that would be awful (I hope they don't do
> this). One test patch would be to carve out 8 vectors reserved for
> xhci on all cpus. Whenever you change the affinity, the assigned
> vector is always reserved_base + cpu_number. That lets you exercise
> the affinity switching code, but in a controlled manner where torn
> interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
> vector but I'm cpu 2). I might struggle to write such a change, but in
> theory it's doable.

Well, the point is that we don't see a spurious interrupt on any
CPU. We added a traceprintk into do_IRQ() and that would immediately
tell us where the thing goes off into lala land. Which it didn't.

> I was alternately trying to build a theory in my head about the write
> somehow being posted and getting out of order, but I don't think that
> can happen.

If that happens then the lost XHCI interrupt is the least of your
worries.

Thanks,

        tglx
Mathias Nyman March 25, 2020, 5:12 p.m. UTC | #10
On 24.3.2020 18.17, Evan Green wrote:
> 
> Another experiment would be to try my old patch in [1]. I'm not
> advocating for this patch as a solution; Thomas and Bjorn have
> convinced me that it will break the rest of the world. But your PCI
> device 0xa3af seems to be Comet Lake. I was also on Comet Lake. So I'd
> expect to at least see it mask your problem. Again, if it didn't, that
> might be an interesting datapoint.
> 
> [1] https://lore.kernel.org/lkml/20200117162444.v2.1.I9c7e72144ef639cc135ea33ef332852a6b33730f@changeid/
> 

Yes, I'm testing on a Comet Lake; oddly, your patch doesn't mask the problem.
I did a few rounds; it seems that xhci interrupts don't always stop when both the
vector and the apicid are updated. Sometimes just an apicid update triggers it.

(tail of) round 1:
          <idle>-0     [001] d.h.   249.040051: xhci_irq: xhci irq
          <idle>-0     [001] d.h.   249.040122: do_IRQ: vector: 33 desc ffff9e84075b1800
          <idle>-0     [001] d.h.   249.040125: xhci_irq: xhci irq
     migration/1-14    [001] d..1   249.045231: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
     migration/3-24    [003] d..1   249.069563: msi_set_affinity: twostep update msi, irq 123, vector 33 -> 35, apicid: 6 -> 0
          <idle>-0     [000] d.h.   250.294934: do_IRQ: vector: 33 desc ffff9e84066df200
          <idle>-0     [000] d.h.   250.294940: msi_set_affinity: direct update msi 122, vector 33 -> 33, apicid: 0 -> 0
          <idle>-0     [000] d.h.   252.276099: do_IRQ: vector: 33 desc ffff9e84066df200
round 2:
          <idle>-0     [001] d.h.    75.580577: do_IRQ: vector: 33 desc ffffa0449d084a00
          <idle>-0     [001] d.h.    75.580578: xhci_irq: xhci irq
          <idle>-0     [001] d.h.    75.580639: do_IRQ: vector: 33 desc ffffa0449d084a00
          <idle>-0     [001] d.h.    75.580640: xhci_irq: xhci irq
           <...>-14    [001] d..1    75.585814: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
     migration/3-24    [003] d..1    75.606381: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
          <idle>-0     [000] d.h.    76.275792: do_IRQ: vector: 35 desc ffffa04486009a00

round 3:
          <idle>-0     [001] d.h.   117.383266: xhci_irq: xhci irq
          <idle>-0     [001] d.h.   117.383407: do_IRQ: vector: 33 desc ffff92515d0d5800
          <idle>-0     [001] d.h.   117.383410: xhci_irq: xhci irq
     migration/1-14    [001] d..1   117.388527: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 2 -> 6
     migration/3-24    [003] d..1   117.413877: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
          <idle>-0     [000] d.h.   119.290474: do_IRQ: vector: 35 desc ffff92514757b600

round 4:
          <idle>-0     [003] d.h.   147.950569: do_IRQ: vector: 33 desc ffff976fdd04ca00
          <idle>-0     [003] d.h.   147.950569: xhci_irq: xhci irq
          <idle>-0     [003] d.h.   147.950689: do_IRQ: vector: 33 desc ffff976fdd04ca00
          <idle>-0     [003] d.h.   147.950690: xhci_irq: xhci irq
     migration/3-24    [003] d..1   147.951368: msi_set_affinity: direct update msi 123, vector 33 -> 33, apicid: 6 -> 0
          <idle>-0     [000] d.h.   148.491193: do_IRQ: vector: 35 desc ffff976fdcc9da00
          <idle>-0     [000] d.h.   148.491198: msi_set_affinity: direct update msi 122, vector 35 -> 35, apicid: 0 -> 0

-Mathias
Raj, Ashok May 1, 2020, 6:43 p.m. UTC | #11
Hi Thomas

I just started looking into this to get some idea of what could be
going on. I have some questions that it would be helpful to clarify.

On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote:
> Evan Green <evgreen@chromium.org> writes:
> > On Mon, Mar 23, 2020 at 5:24 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >> And of course all of this is so well documented that all of us can
> >> clearly figure out what's going on...
> >
> > I won't pretend to know what's going on, so I'll preface this by
> > labeling it all as "flailing", but:
> >
> > I wonder if there's some way the interrupt can get delayed between
> > XHCI snapping the torn value and it finding its way into the IRR. For
> > instance, if xhci read this value at the start of their interrupt
> > moderation timer period, that would be awful (I hope they don't do
> > this). One test patch would be to carve out 8 vectors reserved for
> > xhci on all cpus. Whenever you change the affinity, the assigned
> > vector is always reserved_base + cpu_number. That lets you exercise
> > the affinity switching code, but in a controlled manner where torn
> > interrupts could be easily seen (ie hey I got an interrupt on cpu 4's
> > vector but I'm cpu 2). I might struggle to write such a change, but in
> > theory it's doable.
> 
> Well, the point is that we don't see a spurious interrupt on any
> CPU. We added a traceprintk into do_IRQ() and that would immediately
> tell us where the thing goes off into lala land. Which it didn't.

Now that we don't have the torn write issue, we did an experiment
with legacy MSI and no interrupt remap support. One line of thought was:
since we don't have a way to ensure that previous MSI writes
are globally observed, a read from the device should flush any
outstanding writes, correct? (According to PCI at least; not sure if we can
depend on this, or whether chipsets would take their own sweet time to push
it to the CPU.)

I'm not certain such a read happens today. So to make it simple I tried
to force a retrigger. In the following case of a direct update,
even though the vector isn't changing, an MSI write to the previous
destination could have been sent to the previous CPU, right?

arch/x86/kernel/apic/msi.c: msi_set_affinity()

	/*
         * Direct update is possible when:
         * - The MSI is maskable (remapped MSI does not use this code path)).
         *   The quirk bit is not set in this case.
         * - The new vector is the same as the old vector
         * - The old vector is MANAGED_IRQ_SHUTDOWN_VECTOR (interrupt starts up)
         * - The new destination CPU is the same as the old destination CPU
         */
        if (!irqd_msi_nomask_quirk(irqd) ||
            cfg->vector == old_cfg.vector ||
            old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
            cfg->dest_apicid == old_cfg.dest_apicid) {
                irq_msi_update_msg(irqd, cfg);
	-->>> force a retrigger

It appears that without a guarantee of flushing MSI writes from the device,
the check for lapic_vector_set_in_irr(vector) is still racy.

With the forced retrigger added in both places, the test didn't reveal any
lost interrupt cases.
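
Concretely, the forced retrigger experiment was along these lines (a sketch
of the idea, not necessarily the exact diff):

	irq_msi_update_msg(irqd, cfg);
	/* Experiment: retrigger unconditionally, don't rely on the IRR check */
	irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
	return ret;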

Now the second question with Interrupt Remapping Support:


intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()

The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
were in flight, they made it to the previous CPU, and any new interrupts
must be delivered to the new CPU.

Question is do we need a check similar to the legacy MSI handling

	if (lapic_vector_set_in_irr())
	    handle interrupt? 

Is there a reason we don't check whether the interrupt was delivered to the previous
CPU in intel_ir_set_affinity()? Or does send_cleanup_vector() send
an IPI to perform the cleanup?

It appears that maybe send_cleanup_vector() sends an IPI to the old cpu
and that somehow ensures the device interrupt handler actually gets
called? I lost track somewhere down there :)


Cheers,
Ashok
Thomas Gleixner May 5, 2020, 7:36 p.m. UTC | #12
Ashok,

"Raj, Ashok" <ashok.raj@linux.intel.com> writes:
> On Tue, Mar 24, 2020 at 08:03:44PM +0100, Thomas Gleixner wrote:
>> Evan Green <evgreen@chromium.org> writes:
>> Well, the point is that we don't see a spurious interrupt on any
>> CPU. We added a traceprintk into do_IRQ() and that would immediately
>> tell us where the thing goes off into lala land. Which it didn't.
>
> Now that we don't have the torn write issue, we did an experiment
> with legacy MSI and no interrupt remap support. One line of thought was:
> since we don't have a way to ensure that previous MSI writes
> are globally observed, a read from the device should flush any
> outstanding writes, correct? (According to PCI at least; not sure if we can
> depend on this, or whether chipsets would take their own sweet time to push
> it to the CPU.)

Aargh. Indeed.

> I'm not certain such a read happens today. So to make it simple I tried
> to force a retrigger. In the following case of a direct update,
> even though the vector isn't changing, an MSI write to the previous
> destination could have been sent to the previous CPU, right?

Just checked the pci msi write code and there is no read at the end
which would flush the darn posted write. Duh, I never noticed.

> With adding the forced retrigger in both places, the test didn't reveal any
> lost interrupt cases.

Not a surprise, but not what we really want.

> Now the second question with Interrupt Remapping Support:
>
> intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()
>
> The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
> were in flight, they made it to the previous CPU, and any new interrupts
> must be delivered to the new CPU.
>
> Question is do we need a check similar to the legacy MSI handling
>
> 	if (lapic_vector_set_in_irr())
> 	    handle interrupt? 
>
> Is there a reason we don't check whether the interrupt was delivered to the previous
> CPU in intel_ir_set_affinity()? Or does send_cleanup_vector() send
> an IPI to perform the cleanup?
>
> It appears that maybe send_cleanup_vector() sends an IPI to the old cpu
> and that somehow ensures the device interrupt handler actually gets
> called? I lost track somewhere down there :)

The way it works is:

    1) New vector on different CPU is allocated

    2) New vector is installed

    3) When the first interrupt on the new CPU arrives then the cleanup
       IPI is sent to the previous CPU which tears down the old vector

So if the interrupt is sent to the original CPU just before #2 then this
will be correctly handled on that CPU after the set affinity code
reenables interrupts.
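
In code terms, that check is irq_complete_move() in
arch/x86/kernel/apic/vector.c, condensed:

	if (likely(!apicd->move_in_progress))
		return;

	/* Interrupt arrived on the new target CPU: tell the old target
	 * CPU to clean up its now unused vector.
	 */
	if (apicd->cpu == smp_processor_id())
		__send_cleanup_vector(apicd);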

Can you try the patch below?

Thanks,

        tglx

8<--------------
 drivers/pci/msi.c |    2 ++
 1 file changed, 2 insertions(+)

--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc
 		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
 		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
 		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
+		readl(base + PCI_MSIX_ENTRY_DATA);
 	} else {
 		int pos = dev->msi_cap;
 		u16 msgctl;
@@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc
 			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
 					      msg->data);
 		}
+		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
 	}
 
 skip:
Ashok Raj May 5, 2020, 8:16 p.m. UTC | #13
On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote:
> Ashok,
> 
> 
> > Now the second question with Interrupt Remapping Support:
> >
> > intel_ir_set_affinity->intel_ir_reconfigure_irte()-> modify_irte()
> >
> > The flush of Interrupt Entry Cache (IEC) should ensure, if any interrupts
> > were in flight, they made it to the previous CPU, and any new interrupts
> > must be delivered to the new CPU.
> >
> > Question is do we need a check similar to the legacy MSI handling
> >
> > 	if (lapic_vector_set_in_irr())
> > 	    handle interrupt? 
> >
> > Is there a reason we don't check whether the interrupt was delivered to the previous
> > CPU in intel_ir_set_affinity()? Or does send_cleanup_vector() send
> > an IPI to perform the cleanup?
> >
> > It appears that maybe send_cleanup_vector() sends an IPI to the old cpu
> > and that somehow ensures the device interrupt handler actually gets
> > called? I lost track somewhere down there :)
> 
> The way it works is:
> 
>     1) New vector on different CPU is allocated
> 
>     2) New vector is installed
> 
>     3) When the first interrupt on the new CPU arrives then the cleanup
>        IPI is sent to the previous CPU which tears down the old vector
> 
> So if the interrupt is sent to the original CPU just before #2 then this
> will be correctly handled on that CPU after the set affinity code
> reenables interrupts.

I'll have this test tried out, but in msi_set_affinity(), in the check below
for lapic_vector_set_in_irr(cfg->vector), this is the new vector, correct?
Don't we want to check for old_cfg.vector instead?

msi_set_affinity()
{
....


        unlock_vector_lock();

        /*
         * Check whether the transition raced with a device interrupt and
         * is pending in the local APICs IRR. It is safe to do this outside
         * of vector lock as the irq_desc::lock of this interrupt is still
         * held and interrupts are disabled: The check is not accessing the
         * underlying vector store. It's just checking the local APIC's
         * IRR.
         */
        if (lapic_vector_set_in_irr(cfg->vector))
                irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


> 
> Can you try the patch below?
> 
> Thanks,
> 
>         tglx
> 
> 8<--------------
>  drivers/pci/msi.c |    2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/drivers/pci/msi.c
> +++ b/drivers/pci/msi.c
> @@ -323,6 +323,7 @@ void __pci_write_msi_msg(struct msi_desc
>  		writel(msg->address_lo, base + PCI_MSIX_ENTRY_LOWER_ADDR);
>  		writel(msg->address_hi, base + PCI_MSIX_ENTRY_UPPER_ADDR);
>  		writel(msg->data, base + PCI_MSIX_ENTRY_DATA);
> +		readl(base + PCI_MSIX_ENTRY_DATA);
>  	} else {
>  		int pos = dev->msi_cap;
>  		u16 msgctl;
> @@ -343,6 +344,7 @@ void __pci_write_msi_msg(struct msi_desc
>  			pci_write_config_word(dev, pos + PCI_MSI_DATA_32,
>  					      msg->data);
>  		}
> +		pci_read_config_word(dev, pos + PCI_MSI_FLAGS, &msgctl);
>  	}
>  
>  skip:
Thomas Gleixner May 5, 2020, 9:47 p.m. UTC | #14
Ashok,

"Raj, Ashok" <ashok.raj@intel.com> writes:
> On Tue, May 05, 2020 at 09:36:04PM +0200, Thomas Gleixner wrote:
>> The way it works is:
>> 
>>     1) New vector on different CPU is allocated
>> 
>>     2) New vector is installed
>> 
>>     3) When the first interrupt on the new CPU arrives then the cleanup
>>        IPI is sent to the previous CPU which tears down the old vector
>> 
>> So if the interrupt is sent to the original CPU just before #2 then this
>> will be correctly handled on that CPU after the set affinity code
>> reenables interrupts.
>
> I'll have this test tried out, but in msi_set_affinity(), in the check below
> for lapic_vector_set_in_irr(cfg->vector), this is the new vector, correct?
> Don't we want to check for old_cfg.vector instead?
>
> msi_set_affinity()
> {
> ....
>         unlock_vector_lock();
>
>         /*
>          * Check whether the transition raced with a device interrupt and
>          * is pending in the local APICs IRR. It is safe to do this outside
>          * of vector lock as the irq_desc::lock of this interrupt is still
>          * held and interrupts are disabled: The check is not accessing the
>          * underlying vector store. It's just checking the local APIC's
>          * IRR.
>          */
>         if (lapic_vector_set_in_irr(cfg->vector))
>                 irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);

No. This catches the transitional interrupt to the new vector on the
original CPU, i.e. the one which is running that code.

Again the steps are:

 1) Allocate new vector on new CPU

 2) Set new vector on original CPU

 3) Set new vector on new CPU

So we have 3 points where an interrupt can fire:

 A) Before #2

 B) After #2 and before #3

 C) After #3

#A is hitting the old vector which is still valid on the old CPU and
   will be handled once interrupts are enabled with the correct irq
   descriptor - Normal operation (same as with maskable MSI)

 #B This must be checked in the IRR because there is no valid vector
   on the old CPU.

#C is handled on the new vector on the new CPU
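
Roughly, on a timeline:

	 old CPU / old vector  #2  old CPU / new vector  #3  new CPU / new vector
	-----------------------|------------------------|------------------------>
	  A: old vector still      B: no valid vector       C: normal delivery
	     valid, handled           on the old CPU;
	     normally                 only the IRR check
	                              catches it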

Thanks,

        tglx
Ashok Raj May 7, 2020, 12:18 p.m. UTC | #15
Hi Thomas

We did a bit more tracing and it looks like the IRR check is actually
not happening on the right cpu. See below.

On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
> >
> > msi_set_affinity()
> > {
> > ....
> >         unlock_vector_lock();
> >
> >         /*
> >          * Check whether the transition raced with a device interrupt and
> >          * is pending in the local APICs IRR. It is safe to do this outside
> >          * of vector lock as the irq_desc::lock of this interrupt is still
> >          * held and interrupts are disabled: The check is not accessing the
> >          * underlying vector store. It's just checking the local APIC's
> >          * IRR.
> >          */
> >         if (lapic_vector_set_in_irr(cfg->vector))
> >                 irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
> 
> No. This catches the transitional interrupt to the new vector on the
> original CPU, i.e. the one which is running that code.

Mathias added some trace to his xhci driver when the isr is called.

Below is the tail of my trace with last two times xhci_irq isr is called:

    <idle>-0     [003] d.h.   200.277971: xhci_irq: xhci irq
    <idle>-0     [003] d.h.   200.278052: xhci_irq: xhci irq

Just trying to follow your steps below with traces. The traces follow
the same comments in the source.

> 
> Again the steps are:
> 
>  1) Allocate new vector on new CPU

        /* Allocate a new target vector */
        ret = parent->chip->irq_set_affinity(parent, mask, force);

migration/3-24    [003] d..1   200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0

> 
>  2) Set new vector on original CPU

        /* Redirect it to the new vector on the local CPU temporarily */
        old_cfg.vector = cfg->vector;
        irq_msi_update_msg(irqd, &old_cfg);

migration/3-24    [003] d..1   200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6
> 
>  3) Set new vector on new CPU

        /* Now transition it to the target CPU */
        irq_msi_update_msg(irqd, cfg);

     migration/3-24    [003] d..1   200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33



     if (lapic_vector_set_in_irr(cfg->vector))
	irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);


migration/3-24    [003] d..1   200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0

> 
> So we have 3 points where an interrupt can fire:
> 
>  A) Before #2
> 
>  B) After #2 and before #3
> 
>  C) After #3
> 
> #A is hitting the old vector which is still valid on the old CPU and
>    will be handled once interrupts are enabled with the correct irq
>    descriptor - Normal operation (same as with maskable MSI)
> 
> #B This must be checked in the IRR because there is no valid vector
>    on the old CPU.

The IRR check seems to happen on a random cpu (3) instead of checking for the new vector 33
on the old cpu 6?

This is the place where, if we force the retrigger without the IRR check, things seem to fix themselves.

> 
> #C is handled on the new vector on the new CPU
> 


Did we miss something? 

Cheers,
Ashok
Thomas Gleixner May 7, 2020, 12:53 p.m. UTC | #16
Ashok,

"Raj, Ashok" <ashok.raj@intel.com> writes:

> We did a bit more tracing and it looks like the IRR check is actually
> not happening on the right cpu. See below.

What?

> On Tue, May 05, 2020 at 11:47:26PM +0200, Thomas Gleixner wrote:
>> >
>> > msi_set_affinity()
>> > {
>> > ....
>> >         unlock_vector_lock();
>> >
>> >         /*
>> >          * Check whether the transition raced with a device interrupt and
>> >          * is pending in the local APICs IRR. It is safe to do this outside
>> >          * of vector lock as the irq_desc::lock of this interrupt is still
>> >          * held and interrupts are disabled: The check is not accessing the
>> >          * underlying vector store. It's just checking the local APIC's
>> >          * IRR.
>> >          */
>> >         if (lapic_vector_set_in_irr(cfg->vector))
>> >                 irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>> 
>> No. This catches the transitional interrupt to the new vector on the
>> original CPU, i.e. the one which is running that code.
>
> Mathias added some trace to his xhci driver when the isr is called.
>
> Below is the tail of my trace with last two times xhci_irq isr is called:
>
>     <idle>-0     [003] d.h.   200.277971: xhci_irq: xhci irq
>     <idle>-0     [003] d.h.   200.278052: xhci_irq: xhci irq
>
> Just trying to follow your steps below with traces. The traces follow
> the same comments in the source.
>
>> 
>> Again the steps are:
>> 
>>  1) Allocate new vector on new CPU
>
>         /* Allocate a new target vector */
>         ret = parent->chip->irq_set_affinity(parent, mask, force);
>
> migration/3-24    [003] d..1   200.283012: msi_set_affinity: msi_set_affinity: quirk: 1: new vector allocated, new cpu = 0
>
>> 
>>  2) Set new vector on original CPU
>
>         /* Redirect it to the new vector on the local CPU temporarily */
>         old_cfg.vector = cfg->vector;
>         irq_msi_update_msg(irqd, &old_cfg);
>
> migration/3-24    [003] d..1   200.283033: msi_set_affinity: msi_set_affinity: Redirect to new vector 33 on old cpu 6

On old CPU 6? This runs on CPU 3 which is wrong to begin with.

>>  3) Set new vector on new CPU
>
>         /* Now transition it to the target CPU */
>         irq_msi_update_msg(irqd, cfg);
>
>      migration/3-24    [003] d..1   200.283044: msi_set_affinity: msi_set_affinity: Transition to new target cpu 0 vector 33
>
>
>
>      if (lapic_vector_set_in_irr(cfg->vector))
> 	irq_data_get_irq_chip(irqd)->irq_retrigger(irqd);
>
>
> migration/3-24    [003] d..1   200.283046: msi_set_affinity: msi_set_affinity: Update Done [IRR 0]: irq 123 localsw: Nvec 33 Napic 0
>
>> 
>> So we have 3 points where an interrupt can fire:
>> 
>>  A) Before #2
>> 
>>  B) After #2 and before #3
>> 
>>  C) After #3
>> 
>> #A is hitting the old vector which is still valid on the old CPU and
>>    will be handled once interrupts are enabled with the correct irq
>>    descriptor - Normal operation (same as with maskable MSI)
>> 
>> #B This must be checked in the IRR because there is no valid vector
>>    on the old CPU.
>
> The IRR check seems to happen on a random cpu (3) instead of checking for the new vector 33
> on the old cpu 6?

The whole sequence runs on CPU 3. If old CPU was 6 then this should
never run on CPU 3.

> This is the place where, if we force the retrigger without the IRR check, things seem to fix themselves.

It's not fixing it. It's papering over the root cause.

> Did we miss something? 

Yes, you missed analyzing why this runs on CPU3 when the old CPU is 6. But
the last interrupt actually was on CPU3.

>     <idle>-0     [003] d.h.   200.278052: xhci_irq: xhci irq

Can you please provide the full trace and the patch you used to generate
it?

Thanks,

        tglx
Thomas Gleixner May 7, 2020, 7:41 p.m. UTC | #17
Ashok,

"Raj, Ashok" <ashok.raj@intel.com> writes:
> 
> I think I got mixed up with logical apic id and logical cpu :-(

Stuff happens.

>           <idle>-0     [000] d.h.    44.376659: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0
>           <idle>-0     [000] d.h.    44.376684: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2

>           <idle>-0     [000] d.h.    44.376685: xhci_irq: xhci irq

>           <idle>-0     [001] d.h.    44.376750: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 2
>           <idle>-0     [001] d.h.    44.376774: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 2

>           <idle>-0     [001] d.h.    44.376776: xhci_irq: xhci irq
>           <idle>-0     [001] d.h.    44.395824: xhci_irq: xhci irq

>            <...>-14    [001] d..1    44.400666: msi_set_affinity: quirk[1] new vector allocated, new apic = 6 vector = 33 this apic = 2
>            <...>-14    [001] d..1    44.400691: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 2 Nvec 33 Napic 6

>           <idle>-0     [003] d.h.    44.421021: xhci_irq: xhci irq
>           <idle>-0     [003] d.h.    44.421135: xhci_irq: xhci irq

>      migration/3-24    [003] d..1    44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6
>      migration/3-24    [003] d..1    44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0

So this last one is a direct update: straightforwardly moving it from one
CPU to the other on the same vector number.

And that's the case where we either expect the interrupt to come in on
CPU3 or on CPU0.

There is actually an example in the trace:

	<idle>-0     [000] d.h.    40.616467: msi_set_affinity: quirk[1] new vector allocated, new apic = 2 vector = 33 this apic = 0
	<idle>-0     [000] d.h.    40.616488: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 0 Nvec 33 Napic 2
	<idle>-0     [000] d.h.    40.616488: xhci_irq: xhci irq
	<idle>-0     [001] d.h.    40.616504: xhci_irq: xhci irq

>      migration/3-24    [003] d..1    44.421784: msi_set_affinity: quirk[1] new vector allocated, new apic = 0 vector = 33 this apic = 6
>      migration/3-24    [003] d..1    44.421803: msi_set_affinity: Direct Update: irq 123 Ovec=33 Oapic 6 Nvec 33 Napic 0

But as this last one is the migration thread, aka stomp machine, I
assume this is a hotplug operation, which means the CPU cannot handle
interrupts anymore. In that case we check the old vector on the
unplugged CPU in fixup_irqs() and do the retrigger from there.

Can you please add tracing to that one as well?
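
For reference, that path in fixup_irqs() is roughly (condensed from
arch/x86/kernel/irq.c of that era, locking and bookkeeping trimmed):

	unsigned int irr, vector;
	struct irq_desc *desc;
	struct irq_data *data;

	for (vector = FIRST_EXTERNAL_VECTOR; vector < NR_VECTORS; vector++) {
		if (IS_ERR_OR_NULL(__this_cpu_read(vector_irq[vector])))
			continue;

		/* Is the old vector still pending in the outgoing CPU's IRR? */
		irr = apic_read(APIC_IRR + (vector / 32 * 0x10));
		if (irr & (1 << (vector % 32))) {
			desc = __this_cpu_read(vector_irq[vector]);
			data = irq_desc_get_irq_data(desc);
			/* Resend it so it is not lost across the unplug */
			if (irq_data_get_irq_chip(data)->irq_retrigger)
				irq_data_get_irq_chip(data)->irq_retrigger(data);
		}
	}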

Thanks,

        tglx

Patch

--- a/arch/x86/kernel/apic/msi.c
+++ b/arch/x86/kernel/apic/msi.c
@@ -92,6 +92,10 @@  msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
            cfg->vector == old_cfg.vector ||
            old_cfg.vector == MANAGED_IRQ_SHUTDOWN_VECTOR ||
            cfg->dest_apicid == old_cfg.dest_apicid) {
+               trace_printk("direct update msi %u, vector %u -> %u, apicid: %u -> %u\n",
+                    irqd->irq,
+                    old_cfg.vector, cfg->vector,
+                    old_cfg.dest_apicid, cfg->dest_apicid);
                irq_msi_update_msg(irqd, cfg);
                return ret;
        }
@@ -134,7 +138,10 @@  msi_set_affinity(struct irq_data *irqd, const struct cpumask *mask, bool force)
         */
        if (IS_ERR_OR_NULL(this_cpu_read(vector_irq[cfg->vector])))
                this_cpu_write(vector_irq[cfg->vector], VECTOR_RETRIGGERED);
-
+       trace_printk("twostep update msi, irq %u, vector %u -> %u, apicid: %u -> %u\n",
+                    irqd->irq,
+                    old_cfg.vector, cfg->vector,
+                    old_cfg.dest_apicid, cfg->dest_apicid);
        /* Redirect it to the new vector on the local CPU temporarily */
        old_cfg.vector = cfg->vector;
        irq_msi_update_msg(irqd, &old_cfg);