diff mbox series

[v4,2/2] iommu/vt-d: don's issue devTLB flush request when device is disconnected

Message ID 20231220005153.3984502-3-haifeng.zhao@linux.intel.com
State New
Headers show
Series fix vt-d hard lockup when hotplug ATS capable device | expand

Commit Message

Ethan Zhao Dec. 20, 2023, 12:51 a.m. UTC
For those endpoint devices connect to system via hotplug capable ports,
users could request a warm reset to the device by flapping device's link
through setting the slot's link control register, as pciehpt_ist() DLLSC
interrupt sequence response, pciehp will unload the device driver and
then power it off. thus cause an IOMMU devTLB flush request for device to
be sent and a long time completion/timeout waiting in interrupt context.

That would cause following continuous hard lockup warning and system hang

[ 4211.433662] pcieport 0000:17:01.0: pciehp: Slot(108): Link Down
[ 4211.433664] pcieport 0000:17:01.0: pciehp: Slot(108): Card not present
[ 4223.822591] NMI watchdog: Watchdog detected hard LOCKUP on cpu 144
[ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE    kernel version xxxx
[ 4223.822623] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822623] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822624] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 1
0 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822624] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822625] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822625] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822625] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822626] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822626] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822626] FS:  0000000000000000(0000) GS:ffffa237ae400000(0000)
knlGS:0000000000000000
[ 4223.822627] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4223.822627] CR2: 00007ffe86515d80 CR3: 000002fd3000a001 CR4: 0000000000770ee0
[ 4223.822627] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4223.822628] DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
[ 4223.822628] PKRU: 55555554
[ 4223.822628] Call Trace:
[ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822629]  intel_iommu_release_device+0x1f/0x30
[ 4223.822629]  iommu_release_device+0x33/0x60
[ 4223.822629]  iommu_bus_notifier+0x7f/0x90
[ 4223.822630]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822630]  device_del+0x2e5/0x420
[ 4223.822630]  pci_remove_bus_device+0x70/0x110
[ 4223.822630]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822631]  pciehp_disable_slot+0x6b/0x100
[ 4223.822631]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822631]  pciehp_ist+0x176/0x180
[ 4223.822631]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822632]  irq_thread_fn+0x19/0x50
[ 4223.822632]  irq_thread+0x104/0x190
[ 4223.822632]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822632]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822633]  kthread+0x114/0x130
[ 4223.822633]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822633]  ret_from_fork+0x1f/0x30
[ 4223.822633] Kernel panic - not syncing: Hard LOCKUP
[ 4223.822634] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
         OE     kernel version xxxx
[ 4223.822634] Hardware name: vendorname xxxx 666-106,
BIOS 01.01.02.03.01 05/15/2023
[ 4223.822634] Call Trace:
[ 4223.822634]  <NMI>
[ 4223.822635]  dump_stack+0x6d/0x88
[ 4223.822635]  panic+0x101/0x2d0
[ 4223.822635]  ? ret_from_fork+0x11/0x30
[ 4223.822635]  nmi_panic.cold.14+0xc/0xc
[ 4223.822636]  watchdog_overflow_callback.cold.8+0x6d/0x81
[ 4223.822636]  __perf_event_overflow+0x4f/0xf0
[ 4223.822636]  handle_pmi_common+0x1ef/0x290
[ 4223.822636]  ? __set_pte_vaddr+0x28/0x40
[ 4223.822637]  ? flush_tlb_one_kernel+0xa/0x20
[ 4223.822637]  ? __native_set_fixmap+0x24/0x30
[ 4223.822637]  ? ghes_copy_tofrom_phys+0x70/0x100
[ 4223.822637]  ? __ghes_peek_estatus.isra.16+0x49/0xa0
[ 4223.822637]  intel_pmu_handle_irq+0xba/0x2b0
[ 4223.822638]  perf_event_nmi_handler+0x24/0x40
[ 4223.822638]  nmi_handle+0x4d/0xf0
[ 4223.822638]  default_do_nmi+0x49/0x100
[ 4223.822638]  exc_nmi+0x134/0x180
[ 4223.822639]  end_repeat_nmi+0x16/0x67
[ 4223.822639] RIP: 0010:qi_submit_sync+0x2c0/0x490
[ 4223.822639] Code: 48 be 00 00 00 00 00 08 00 00 49 85 74 24 20 0f 95 c1 48 8b
 57 10 83 c1 04 83 3c 1a 03 0f 84 a2 01 00 00 49 8b 04 24 8b 70 34 <40> f6 c6 10
 74 17 49 8b 04 24 8b 80 80 00 00 00 89 c2 d3 fa 41 39
[ 4223.822640] RSP: 0018:ffffc4f074f0bbb8 EFLAGS: 00000093
[ 4223.822640] RAX: ffffc4f040059000 RBX: 0000000000000014 RCX: 0000000000000005
[ 4223.822640] RDX: ffff9f3841315800 RSI: 0000000000000000 RDI: ffff9f38401a8340
[ 4223.822641] RBP: ffff9f38401a8340 R08: ffffc4f074f0bc00 R09: 0000000000000000
[ 4223.822641] R10: 0000000000000010 R11: 0000000000000018 R12: ffff9f384005e200
[ 4223.822641] R13: 0000000000000004 R14: 0000000000000046 R15: 0000000000000004
[ 4223.822641]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  ? qi_submit_sync+0x2c0/0x490
[ 4223.822642]  </NMI>
[ 4223.822642]  qi_flush_dev_iotlb+0xb1/0xd0
[ 4223.822642]  __dmar_remove_one_dev_info+0x224/0x250
[ 4223.822643]  dmar_remove_one_dev_info+0x3e/0x50
[ 4223.822643]  intel_iommu_release_device+0x1f/0x30
[ 4223.822643]  iommu_release_device+0x33/0x60
[ 4223.822643]  iommu_bus_notifier+0x7f/0x90
[ 4223.822644]  blocking_notifier_call_chain+0x60/0x90
[ 4223.822644]  device_del+0x2e5/0x420
[ 4223.822644]  pci_remove_bus_device+0x70/0x110
[ 4223.822644]  pciehp_unconfigure_device+0x7c/0x130
[ 4223.822644]  pciehp_disable_slot+0x6b/0x100
[ 4223.822645]  pciehp_handle_presence_or_link_change+0xd8/0x320
[ 4223.822645]  pciehp_ist+0x176/0x180
[ 4223.822645]  ? irq_finalize_oneshot.part.50+0x110/0x110
[ 4223.822645]  irq_thread_fn+0x19/0x50
[ 4223.822646]  irq_thread+0x104/0x190
[ 4223.822646]  ? irq_forced_thread_fn+0x90/0x90
[ 4223.822646]  ? irq_thread_check_affinity+0xe0/0xe0
[ 4223.822646]  kthread+0x114/0x130
[ 4223.822647]  ? __kthread_cancel_work+0x40/0x40
[ 4223.822647]  ret_from_fork+0x1f/0x30
[ 4223.822647] Kernel Offset: 0x6400000 from 0xffffffff81000000 (relocation
range: 0xffffffff80000000-0xffffffffbfffffff)

Fix it by checking the device's error_state in
devtlb_invalidation_with_pasid() to avoid sending meaningless devTLB flush
request to link down device that is set to pci_channel_io_perm_failure and
then power off in

pciehp_ist()
   pciehp_handle_presence_or_link_change()
     pciehp_disable_slot()
       remove_board()
         pciehp_unconfigure_device()

For SAVE_REMOVAL unplug, link is alive when iommu releases device and
issues devTLB invalidate request, wouldn't trigger such issue.

This patch works for all kinds of SURPPRISE_REMOVAL unplug operation.

Tested-by: Haorong Ye <yehaorong@bytedance.com>
Signed-off-by: Ethan Zhao <haifeng.zhao@linux.intel.com>
---
 drivers/iommu/intel/pasid.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Lukas Wunner Dec. 21, 2023, 10:39 a.m. UTC | #1
On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> For those endpoint devices connect to system via hotplug capable ports,
> users could request a warm reset to the device by flapping device's link
> through setting the slot's link control register, as pciehpt_ist() DLLSC
> interrupt sequence response, pciehp will unload the device driver and
> then power it off. thus cause an IOMMU devTLB flush request for device to
> be sent and a long time completion/timeout waiting in interrupt context.

I think the problem is in the "waiting in interrupt context".

Can you change qi_submit_sync() to *sleep* until the queue is done?
Instead of busy-waiting in atomic context?

Is the hardware capable of sending an interrupt once the queue is done?
If it is not capable, would it be viable to poll with exponential backoff
and sleep in-between polling once the polling delay increases beyond, say,
10 usec?

Again, the proposed patch is not a proper solution.  It will paper over
the issue most of the time but every once in a while someone will still
get a hard lockup splat and it will then be more difficult to reproduce
and fix if the proposed patch is accepted.


> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>          OE    kernel version xxxx

I don't see any reason to hide the kernel version.
This isn't Intel Confidential information.


> [ 4223.822628] Call Trace:
> [ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
> [ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
> [ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50

__dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
one and a half years ago, so the stack trace appears to be from
an older kernel version.

Thanks,

Lukas
Lukas Wunner Dec. 21, 2023, 11:01 a.m. UTC | #2
On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
> > For those endpoint devices connect to system via hotplug capable ports,
> > users could request a warm reset to the device by flapping device's link
> > through setting the slot's link control register, as pciehpt_ist() DLLSC
> > interrupt sequence response, pciehp will unload the device driver and
> > then power it off. thus cause an IOMMU devTLB flush request for device to
> > be sent and a long time completion/timeout waiting in interrupt context.
> 
> I think the problem is in the "waiting in interrupt context".

I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
capability to reduce the Invalidate Completion Timeout to a sane value?
Could you check whether that's supported?

Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
but that's not even a "must", it's a "should".  So devices are free to
take even longer.  We have to cut off at *some* point.

Thanks,

Lukas
Ethan Zhao Dec. 22, 2023, 1:56 a.m. UTC | #3
On 12/21/2023 6:39 PM, Lukas Wunner wrote:
> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>> For those endpoint devices connect to system via hotplug capable ports,
>> users could request a warm reset to the device by flapping device's link
>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>> interrupt sequence response, pciehp will unload the device driver and
>> then power it off. thus cause an IOMMU devTLB flush request for device to
>> be sent and a long time completion/timeout waiting in interrupt context.
> I think the problem is in the "waiting in interrupt context".
>
> Can you change qi_submit_sync() to *sleep* until the queue is done?
> Instead of busy-waiting in atomic context?

If you read that function carefully, you wouldn't say "sleep" there....

that is 'sync'ed.

>
> Is the hardware capable of sending an interrupt once the queue is done?
> If it is not capable, would it be viable to poll with exponential backoff
> and sleep in-between polling once the polling delay increases beyond, say,
> 10 usec?

I don't know if the polling along sleeping for completion of meanningless

devTLB invalidation request blindly sent to (removed/powered down/link down)

device makes sense or not.

But according to PCIe spec  6.1  10.3.1

"Software ensures no invalidations are issued to a Function when its

  ATS capability is disabled. "

>
> Again, the proposed patch is not a proper solution.  It will paper over
> the issue most of the time but every once in a while someone will still
> get a hard lockup splat and it will then be more difficult to reproduce
> and fix if the proposed patch is accepted.

Could you point out why is not proper ? Is there any other window

the hard lockup still could happen with the ATS capable devcie

supprise_removal case if we checked the connection state first ?

Please help to elaberate it.

>
>
>> [ 4223.822622] CPU: 144 PID: 1422 Comm: irq/57-pciehp Kdump: loaded Tainted: G S
>>           OE    kernel version xxxx
> I don't see any reason to hide the kernel version.
> This isn't Intel Confidential information.
>
Yes, this is the old kernel stack trace, but customer also tried lasted 
6.7rc4

(doesn't work) and the patched 6.7rc4 (fixed).


Thanks,

Ethan

>> [ 4223.822628] Call Trace:
>> [ 4223.822628]  qi_flush_dev_iotlb+0xb1/0xd0
>> [ 4223.822628]  __dmar_remove_one_dev_info+0x224/0x250
>> [ 4223.822629]  dmar_remove_one_dev_info+0x3e/0x50
> __dmar_remove_one_dev_info() was removed by db75c9573b08 in v6.0
> one and a half years ago, so the stack trace appears to be from
> an older kernel version.
>
> Thanks,
>
> Lukas
Ethan Zhao Dec. 22, 2023, 2:08 a.m. UTC | #4
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?

It is not about Intel vt-d's capability per my understanding, it is the 
third

party PCIe switch's capability, they are not aware of  ATS transation at 
all,

if its downstream port endpoint device is removed/powered-off/link-down,

it couldn't feedback the upstream iommu a fault/completion/timeout for

ATS transaction breakage reason.  While the root port could (verified).

>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
  1 min (60 sec)+50%
> but that's not even a "must", it's a "should".  So devices are free to

I could happen if blindly wait here, so we should avoid such case.


Thanks,

Ethan

> take even longer.  We have to cut off at *some* point.
>
> Thanks,
>
> Lukas
Ethan Zhao Dec. 22, 2023, 3:56 a.m. UTC | #5
On 12/21/2023 7:01 PM, Lukas Wunner wrote:
> On Thu, Dec 21, 2023 at 11:39:40AM +0100, Lukas Wunner wrote:
>> On Tue, Dec 19, 2023 at 07:51:53PM -0500, Ethan Zhao wrote:
>>> For those endpoint devices connect to system via hotplug capable ports,
>>> users could request a warm reset to the device by flapping device's link
>>> through setting the slot's link control register, as pciehpt_ist() DLLSC
>>> interrupt sequence response, pciehp will unload the device driver and
>>> then power it off. thus cause an IOMMU devTLB flush request for device to
>>> be sent and a long time completion/timeout waiting in interrupt context.
>> I think the problem is in the "waiting in interrupt context".
> I'm wondering whether Intel IOMMUs possibly have a (perhaps undocumented)
> capability to reduce the Invalidate Completion Timeout to a sane value?
> Could you check whether that's supported?
>
> Granted, the Implementation Note you've pointed to allows 1 sec + 50%,
> but that's not even a "must", it's a "should".  So devices are free to
> take even longer.  We have to cut off at *some* point.

I really "expected" there is interrrupt signal to iommu hardware when

the PCIe swtich downstream device 'gone', or some internal polling

/heartbeating the endpoint device for ATS breaking,

but so far seems there are only hotplug interrupts to downstream

control.

...

How to define the point "some" msec to timeout while software

break out the waiting loop ?  or polling if the target is gone ?


Thanks,

Ethan



>
> Thanks,
>
> Lukas
diff mbox series

Patch

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 74e8e4c17e81..7dbee9931eb6 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -481,6 +481,9 @@  devtlb_invalidation_with_pasid(struct intel_iommu *iommu,
 	if (!info || !info->ats_enabled)
 		return;
 
+	if (pci_dev_is_disconnected(to_pci_dev(dev)))
+		return;
+
 	sid = info->bus << 8 | info->devfn;
 	qdep = info->ats_qdep;
 	pfsid = info->pfsid;