Message ID | 1530571967-19099-1-git-send-email-okaya@codeaurora.org |
---|---|
Headers | show |
Series | PCI: separate hotplug handling from fatal error handling | expand |
On Mon, Jul 02, 2018 at 06:52:44PM -0400, Sinan Kaya wrote: > This is what happens when you insert a fatal error to a hotplug slot. See > multiple link up/down messages. > > /_#_[__333.699731]_pcieport_0001:00:00.0:_AER:_Uncorrected_(Fatal)_error_received:_id=0000 > [ 333.748116] pcieport 0001:00:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id > [ 333.816044] pcieport 0001:00:00.0: device [17cb:0404] error status/mask=00040000/00400000 > [ 333.871581] pcieport 0001:00:00.0: [18] Malformed TLP (First) > [ 333.914675] pcieport 0001:00:00.0: TLP Header: 40000001 000000ff 00000000 00000000 > [ 333.968397] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down > [ 334.043234] iommu: Removing device 0001:01:00.4 from group 0 > [ 334.095952] iommu: Removing device 0001:01:00.3 from group 0 > [ 334.144644] iommu: Removing device 0001:01:00.2 from group 0 > [ 334.194653] iommu: Removing device 0001:01:00.1 from group 0 > [ 334.243564] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up > [ 334.282556] iommu: Removing device 0001:01:00.0 from group 0 > [ 334.330994] pciehp 0001:00:00.0:pcie004: Slot(2): Link Up event queued; > currently getting powered off > [ 334.890587] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command > 0x13f1 (issued 282900 msec ago) > [ 335.070190] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down > [ 335.106960] pciehp 0001:00:00.0:pcie004: Slot(2): Link Down event queued; currently getting powered on > [ 335.191119] pcieport 0001:00:00.0: AER: Device recovery failed > [ 346.590153] pciehp 0001:00:00.0:pcie004: Timeout on hotplug command 0x17f1 (issued 10250 msec ago) > > > Since DPC/AER patch to refactor fatal error handling, both hotplug > driver and AER/DPC driver will try removing devices and perform enumeration on > link events/AER events. > > Perfect environment for race condition without a change. > > Try to mask hotplug interrupts during AER/DPC fatal error handling. > > Changes from v4: > * add mask_irq() and unmask_irq() callbacks to struct pcie_driver > * refactor reset_slot() to use pciehp_mask_irq() an pciehp_unmask_irq() > > Sinan Kaya (3): > PCI: pciehp: implement mask and unmask interrupt functions > PCI: pciehp: reuse pciehp_mask/unmask_irq() in reset_slot() > PCI: Mask and unmask hotplug interrupts during reset > > drivers/pci/hotplug/pciehp.h | 2 ++ > drivers/pci/hotplug/pciehp_core.c | 19 +++++++++++++++++++ > drivers/pci/hotplug/pciehp_hpc.c | 36 ++++++++++++++++++++++++++++-------- > drivers/pci/pcie/err.c | 11 +++++++++++ > drivers/pci/pcie/portdrv.h | 2 ++ > 5 files changed, 62 insertions(+), 8 deletions(-) There's been a ton of discussion about this, and I didn't see a clear consensus emerge, so I'm going to drop it for now. If it's still needed, please repost it and I'll watch for reviewers. Bjorn
Hi Bjorn, On 7/31/2018 11:44 AM, Bjorn Helgaas wrote: > There's been a ton of discussion about this, and I didn't see a clear > consensus emerge, so I'm going to drop it for now. If it's still > needed, please repost it and I'll watch for reviewers. I apologize, let me summarize the conversation for you. I proposed this [1]. Oza seems to be onboard [2]. Later, I posted v6 with this change last weekend. [patch v6 1/1] pci: pciehp: ignore link events when there is a fatal error pending Lukas and I discussed the patch last weekend under this v6 thread. AFAIK, we agreed on the direction. I'll post V7 with Lukas' comments and capture the detail from email into the commit message. Let us know what you think about it. Sinan [1] https://lkml.org/lkml/2018/7/21/13. [2] https://lkml.org/lkml/2018/7/25/183
On Tue, Jul 31, 2018 at 11:54:58AM -0700, Sinan Kaya wrote: > Hi Bjorn, > > On 7/31/2018 11:44 AM, Bjorn Helgaas wrote: > > There's been a ton of discussion about this, and I didn't see a clear > > consensus emerge, so I'm going to drop it for now. If it's still > > needed, please repost it and I'll watch for reviewers. > > I apologize, let me summarize the conversation for you. > I proposed this [1]. Oza seems to be onboard [2]. > > Later, I posted v6 with this change last weekend. Oops, I missed that this had been obsoleted, sorry for the noise. > [patch v6 1/1] pci: pciehp: ignore link events when there is a fatal error > pending > > Lukas and I discussed the patch last weekend under this v6 thread. > AFAIK, we agreed on the direction. > > I'll post V7 with Lukas' comments and capture the detail from email > into the commit message. > > Let us know what you think about it. Sounds good, I'll look for that.