Message ID | 1372425030-5759-1-git-send-email-shangw@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Rejected |
Headers | show |
On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote: > The issue was introduced by commit 37f02195 ("powerpc/pci: fix > PCI-e devices rescan issue on powerpc platform"). The field That "fix" caused more problems than it solved. There's a better approach floating around that I will merge eventually next week. Cheers, Ben.
On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote: > The issue was introduced by commit 37f02195 ("powerpc/pci: fix > PCI-e devices rescan issue on powerpc platform"). The field > (struct pci_dev::irq) is reused by PCI core to trace the base > MSI interrupt number if the MSI stuff is enabled on the corresponding > device. When running to pcibios_setup_device(), we possibly still > have enabled MSI interrupt on the device. That means "pci_dev->irq" > still have the base MSI interrupt number and it will be overwritten > if we're going fix "pci_dev->irq" again by pci_read_irq_line(). > Eventually, when we enable the device, it runs to kernel crash caused > by fetching the the MSI interrupt descriptor (struct msi_desc) from > non-MSI interrupt and using the NULL descriptor. So finally I decided instead to apply Guenter patch [PATCH v2] powerpc/pci: Improve device hotplug initialization Which fixes the underlying problem instead. I'm running some tests, so far it looks good. However, Gavin, when you have a chance on vpl3, try injecting errors to other adapters, for example the VGA adapter (you need to do lspci to trigger the EEH detection after that since there's no driver and use the "loc code" variant off errinjct) or eth2 (the cxgb3). All I get from EEH with these is: [ 362.962564] EEH: Detected PCI bus error on PHB#7-PE#10000 [ 362.962570] eeh_handle_event: Cannot find PCI bus for PHB#7-PE#10000 and [ 424.381083] EEH: Detected PCI bus error on PHB#6-PE#10000 [ 424.381089] eeh_handle_event: Cannot find PCI bus for PHB#6-PE#10000 Followed by ... nothing. This is a tree which has Cascardo patch and Gunther patch (usual location on vpl3). Can you have a look ? Cheers, Ben.
On Sun, Jun 30, 2013 at 09:09:20AM +1000, Benjamin Herrenschmidt wrote: > On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote: > > The issue was introduced by commit 37f02195 ("powerpc/pci: fix > > PCI-e devices rescan issue on powerpc platform"). The field > > (struct pci_dev::irq) is reused by PCI core to trace the base > > MSI interrupt number if the MSI stuff is enabled on the corresponding > > device. When running to pcibios_setup_device(), we possibly still > > have enabled MSI interrupt on the device. That means "pci_dev->irq" > > still have the base MSI interrupt number and it will be overwritten > > if we're going fix "pci_dev->irq" again by pci_read_irq_line(). > > Eventually, when we enable the device, it runs to kernel crash caused > > by fetching the the MSI interrupt descriptor (struct msi_desc) from > > non-MSI interrupt and using the NULL descriptor. > > So finally I decided instead to apply Guenter patch > > [PATCH v2] powerpc/pci: Improve device hotplug initialization > > Which fixes the underlying problem instead. > Guess I am not hitting above bug because I have my own patch applied ;). Thanks a lot! Guenter
On Sun, Jun 30, 2013 at 09:09:20AM +1000, Benjamin Herrenschmidt wrote: >On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote: .../... >I'm running some tests, so far it looks good. However, Gavin, when you >have a chance on vpl3, try injecting errors to other adapters, for >example the VGA adapter (you need to do lspci to trigger the EEH >detection after that since there's no driver and use the "loc code" >variant off errinjct) or eth2 (the cxgb3). > >All I get from EEH with these is: > >[ 362.962564] EEH: Detected PCI bus error on PHB#7-PE#10000 >[ 362.962570] eeh_handle_event: Cannot find PCI bus for PHB#7-PE#10000 > >and > >[ 424.381083] EEH: Detected PCI bus error on PHB#6-PE#10000 >[ 424.381089] eeh_handle_event: Cannot find PCI bus for PHB#6-PE#10000 > >Followed by ... nothing. > Ben, I think one patch was lost from mainline and that fixes the problem. https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=5fb621698e94e3af8b413d9439041fde48e2784d I had the patch applied to /home/benh/linux-test and have following commands to inject errors, everything looks good :-) errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C5-T1 -a 0x0 -m 0x0 errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C6-T1 -a 0x0 -m 0x0 Thanks, Gavin
On Sun, 2013-06-30 at 09:51 +0800, Gavin Shan wrote: > Ben, I think one patch was lost from mainline and that fixes the problem. > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=5fb621698e94e3af8b413d9439041fde48e2784d > > I had the patch applied to /home/benh/linux-test and have following commands > to inject errors, everything looks good :-) > > errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C5-T1 -a 0x0 -m 0x0 > errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C6-T1 -a 0x0 -m 0x0 Ok, thanks, it's in -next and not upstream yet. I'll ask Linus to pull that one. Cheers, Ben.
diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c index eabeec9..d3a00e8 100644 --- a/arch/powerpc/kernel/pci-common.c +++ b/arch/powerpc/kernel/pci-common.c @@ -1009,10 +1009,18 @@ void pcibios_setup_device(struct pci_dev *dev) if (ppc_md.pci_dma_dev_setup) ppc_md.pci_dma_dev_setup(dev); - /* Read default IRQs and fixup if necessary */ - pci_read_irq_line(dev); - if (ppc_md.pci_irq_fixup) - ppc_md.pci_irq_fixup(dev); + /* + * Read default IRQs and fixup if necessary. We probably + * has MSI interrupt enabled on the device and that hasn't + * been unloaded yet. For that case, "dev->irq" is tracing + * the base MSI interrupt number and it's going to overrite + * the MSI interrupt number to fix "dev->irq" here. + */ + if (!dev->msi_enabled) { + pci_read_irq_line(dev); + if (ppc_md.pci_irq_fixup) + ppc_md.pci_irq_fixup(dev); + } } void pcibios_setup_bus_devices(struct pci_bus *bus)
The issue was introduced by commit 37f02195 ("powerpc/pci: fix PCI-e devices rescan issue on powerpc platform"). The field (struct pci_dev::irq) is reused by PCI core to trace the base MSI interrupt number if the MSI stuff is enabled on the corresponding device. When running to pcibios_setup_device(), we possibly still have enabled MSI interrupt on the device. That means "pci_dev->irq" still have the base MSI interrupt number and it will be overwritten if we're going fix "pci_dev->irq" again by pci_read_irq_line(). Eventually, when we enable the device, it runs to kernel crash caused by fetching the the MSI interrupt descriptor (struct msi_desc) from non-MSI interrupt and using the NULL descriptor. The patch adds more check inside pcibios_setup_device() and don't fix the interrupt number if we already had MSI interrupt enabled on the device. Unable to handle kernel paging request for data at address 0x00000008 Faulting instruction address: 0xc0000000004177ac cpu 0x6: Vector: 300 (Data Access) at [c000000fa24b7690] pc: c0000000004177ac: .pci_restore_msi_state+0x30c/0x3b0 lr: c00000000041777c: .pci_restore_msi_state+0x2dc/0x3b0 sp: c000000fa24b7910 msr: 8000000000009032 dar: 8 dsisr: 40000000 current = 0xc000000fb68542c0 paca = 0xc00000000ecd1500 softe: 0 irq_happened: 0x00 pid = 5367, comm = eehd enter ? for help [c000000fa24b79b0] c000000000405d2c .pci_restore_state.part.27+0x11c/0x2a0 [c000000fa24b7a40] c0000000005ea128 .e1000_io_slot_reset+0xa8/0x230 [c000000fa24b7ad0] c00000000005fcd4 .eeh_report_reset+0x94/0x120 [c000000fa24b7b60] c00000000005e97c .eeh_pe_dev_traverse+0x9c/0x190 [c000000fa24b7c10] c000000000060078 .eeh_handle_event+0x218/0x330 [c000000fa24b7ca0] c0000000000602c0 .eeh_event_handler+0x130/0x1a0 [c000000fa24b7d30] c0000000000ad6f8 .kthread+0xe8/0xf0 [c000000fa24b7e30] c00000000000a05c .ret_from_kernel_thread+0x5c/0x80 Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com> --- arch/powerpc/kernel/pci-common.c | 16 ++++++++++++---- 1 files changed, 12 insertions(+), 4 deletions(-)