powerpc/pci: Avoid overriding MSI interrupt

Message ID	1372425030-5759-1-git-send-email-shangw@linux.vnet.ibm.com (mailing list archive)
State	Rejected
Headers	show Return-Path: <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org> Gateway: Authorized Use Only! Violators will be prosecuted for <linuxppc-dev@lists.ozlabs.org> from <shangw@linux.vnet.ibm.com>; Fri, 28 Jun 2013 14:10:41 +0100 Gateway: Authorized Use Only! Violators will be prosecuted; Fri, 28 Jun 2013 14:10:38 +0100 From: Gavin Shan <shangw@linux.vnet.ibm.com> To: linuxppc-dev@lists.ozlabs.org Subject: [PATCH] powerpc/pci: Avoid overriding MSI interrupt Date: Fri, 28 Jun 2013 21:10:30 +0800 Message-Id: <1372425030-5759-1-git-send-email-shangw@linux.vnet.ibm.com> Cc: Yuanquan.Chen@freescale.com, Gavin Shan <shangw@linux.vnet.ibm.com> Precedence: list MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: 7bit Errors-To: linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org Sender: "Linuxppc-dev" <linuxppc-dev-bounces+patchwork-incoming=ozlabs.org@lists.ozlabs.org>

Gavin Shan June 28, 2013, 1:10 p.m. UTC

The issue was introduced by commit 37f02195 ("powerpc/pci: fix
PCI-e devices rescan issue on powerpc platform"). The field
(struct pci_dev::irq) is reused by PCI core to trace the base
MSI interrupt number if the MSI stuff is enabled on the corresponding
device. When running to pcibios_setup_device(), we possibly still
have enabled MSI interrupt on the device. That means "pci_dev->irq"
still have the base MSI interrupt number and it will be overwritten
if we're going fix "pci_dev->irq" again by pci_read_irq_line().
Eventually, when we enable the device, it runs to kernel crash caused
by fetching the the MSI interrupt descriptor (struct msi_desc) from
non-MSI interrupt and using the NULL descriptor.

The patch adds more check inside pcibios_setup_device() and don't
fix the interrupt number if we already had MSI interrupt enabled on
the device.

Unable to handle kernel paging request for data at address 0x00000008
Faulting instruction address: 0xc0000000004177ac
cpu 0x6: Vector: 300 (Data Access) at [c000000fa24b7690]
    pc: c0000000004177ac: .pci_restore_msi_state+0x30c/0x3b0
    lr: c00000000041777c: .pci_restore_msi_state+0x2dc/0x3b0
    sp: c000000fa24b7910
   msr: 8000000000009032
   dar: 8
 dsisr: 40000000
  current = 0xc000000fb68542c0
  paca    = 0xc00000000ecd1500	 softe: 0	 irq_happened: 0x00
    pid   = 5367, comm = eehd
enter ? for help
[c000000fa24b79b0] c000000000405d2c .pci_restore_state.part.27+0x11c/0x2a0
[c000000fa24b7a40] c0000000005ea128 .e1000_io_slot_reset+0xa8/0x230
[c000000fa24b7ad0] c00000000005fcd4 .eeh_report_reset+0x94/0x120
[c000000fa24b7b60] c00000000005e97c .eeh_pe_dev_traverse+0x9c/0x190
[c000000fa24b7c10] c000000000060078 .eeh_handle_event+0x218/0x330
[c000000fa24b7ca0] c0000000000602c0 .eeh_event_handler+0x130/0x1a0
[c000000fa24b7d30] c0000000000ad6f8 .kthread+0xe8/0xf0
[c000000fa24b7e30] c00000000000a05c .ret_from_kernel_thread+0x5c/0x80

Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Gavin Shan <shangw@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/pci-common.c |   16 ++++++++++++----
 1 files changed, 12 insertions(+), 4 deletions(-)

Benjamin Herrenschmidt June 29, 2013, 7:20 a.m. UTC | #1

On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote:
> The issue was introduced by commit 37f02195 ("powerpc/pci: fix
> PCI-e devices rescan issue on powerpc platform"). The field

That "fix" caused more problems than it solved. There's a better
approach floating around that I will merge eventually next week.

Cheers,
Ben.

Benjamin Herrenschmidt June 29, 2013, 11:09 p.m. UTC | #2

On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote:
> The issue was introduced by commit 37f02195 ("powerpc/pci: fix
> PCI-e devices rescan issue on powerpc platform"). The field
> (struct pci_dev::irq) is reused by PCI core to trace the base
> MSI interrupt number if the MSI stuff is enabled on the corresponding
> device. When running to pcibios_setup_device(), we possibly still
> have enabled MSI interrupt on the device. That means "pci_dev->irq"
> still have the base MSI interrupt number and it will be overwritten
> if we're going fix "pci_dev->irq" again by pci_read_irq_line().
> Eventually, when we enable the device, it runs to kernel crash caused
> by fetching the the MSI interrupt descriptor (struct msi_desc) from
> non-MSI interrupt and using the NULL descriptor.

So finally I decided instead to apply Guenter patch

[PATCH v2] powerpc/pci: Improve device hotplug initialization

Which fixes the underlying problem instead.

I'm running some tests, so far it looks good. However, Gavin, when you
have a chance on vpl3, try injecting errors to other adapters, for
example the VGA adapter (you need to do lspci to trigger the EEH
detection after that since there's no driver and use the "loc code"
variant off errinjct) or eth2 (the cxgb3).

All I get from EEH with these is:

[  362.962564] EEH: Detected PCI bus error on PHB#7-PE#10000
[  362.962570] eeh_handle_event: Cannot find PCI bus for PHB#7-PE#10000

and

[  424.381083] EEH: Detected PCI bus error on PHB#6-PE#10000
[  424.381089] eeh_handle_event: Cannot find PCI bus for PHB#6-PE#10000

Followed by ... nothing.

This is a tree which has Cascardo patch and Gunther patch (usual
location on vpl3).

Can you have a look ?

Cheers,
Ben.

Guenter Roeck June 30, 2013, 1:29 a.m. UTC | #3

On Sun, Jun 30, 2013 at 09:09:20AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote:
> > The issue was introduced by commit 37f02195 ("powerpc/pci: fix
> > PCI-e devices rescan issue on powerpc platform"). The field
> > (struct pci_dev::irq) is reused by PCI core to trace the base
> > MSI interrupt number if the MSI stuff is enabled on the corresponding
> > device. When running to pcibios_setup_device(), we possibly still
> > have enabled MSI interrupt on the device. That means "pci_dev->irq"
> > still have the base MSI interrupt number and it will be overwritten
> > if we're going fix "pci_dev->irq" again by pci_read_irq_line().
> > Eventually, when we enable the device, it runs to kernel crash caused
> > by fetching the the MSI interrupt descriptor (struct msi_desc) from
> > non-MSI interrupt and using the NULL descriptor.
> 
> So finally I decided instead to apply Guenter patch
> 
> [PATCH v2] powerpc/pci: Improve device hotplug initialization
> 
> Which fixes the underlying problem instead.
> 

Guess I am not hitting above bug because I have my own patch applied ;).

Thanks a lot!

Guenter

Gavin Shan June 30, 2013, 1:51 a.m. UTC | #4

On Sun, Jun 30, 2013 at 09:09:20AM +1000, Benjamin Herrenschmidt wrote:
>On Fri, 2013-06-28 at 21:10 +0800, Gavin Shan wrote:

.../...

>I'm running some tests, so far it looks good. However, Gavin, when you
>have a chance on vpl3, try injecting errors to other adapters, for
>example the VGA adapter (you need to do lspci to trigger the EEH
>detection after that since there's no driver and use the "loc code"
>variant off errinjct) or eth2 (the cxgb3).
>
>All I get from EEH with these is:
>
>[  362.962564] EEH: Detected PCI bus error on PHB#7-PE#10000
>[  362.962570] eeh_handle_event: Cannot find PCI bus for PHB#7-PE#10000
>
>and
>
>[  424.381083] EEH: Detected PCI bus error on PHB#6-PE#10000
>[  424.381089] eeh_handle_event: Cannot find PCI bus for PHB#6-PE#10000
>
>Followed by ... nothing.
>

Ben, I think one patch was lost from mainline and that fixes the problem.

https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=5fb621698e94e3af8b413d9439041fde48e2784d

I had the patch applied to /home/benh/linux-test and have following commands
to inject errors, everything looks good :-)

errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C5-T1 -a 0x0 -m 0x0
errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C6-T1 -a 0x0 -m 0x0

Thanks,
Gavin

Benjamin Herrenschmidt June 30, 2013, 4:06 a.m. UTC | #5

On Sun, 2013-06-30 at 09:51 +0800, Gavin Shan wrote:

> Ben, I think one patch was lost from mainline and that fixes the problem.
> 
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=5fb621698e94e3af8b413d9439041fde48e2784d
> 
> I had the patch applied to /home/benh/linux-test and have following commands
> to inject errors, everything looks good :-)
> 
> errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C5-T1 -a 0x0 -m 0x0
> errinjct eeh -v -f 0 -p U78AB.001.WZSGBJ6-P1-C6-T1 -a 0x0 -m 0x0

Ok, thanks, it's in -next and not upstream yet. I'll ask Linus to pull
that one.

Cheers,
Ben.

powerpc/pci: Avoid overriding MSI interrupt

Commit Message

Comments

Patch