Message ID | 20181119042517.45075-1-aik@ozlabs.ru (mailing list archive) |
---|---|
State | Accepted |
Commit | c20577014f85f36d4e137d3d52a1f61225b4a3d2 |
Headers | show |
Series | [kernel] powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status | expand |
Context | Check | Description |
---|---|---|
snowpatch_ozlabs/apply_patch | success | next/apply_patch Successfully applied |
snowpatch_ozlabs/build-ppc64le | success | build succeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-ppc64be | success | build succeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-ppc64e | success | build succeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/build-pmac32 | success | build succeded & removed 0 sparse warning(s) |
snowpatch_ozlabs/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 40 lines checked |
On Mon, Nov 19, 2018 at 03:25:17PM +1100, Alexey Kardashevskiy wrote: > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in > skiboot's NPU driver does not touch the pci_error_type parameter so > it might have garbage but the powernv code analyzes it nevertheless. > > This initializes pcierr and fstate to zero in all call sites. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> The skiboot code for npu2_freeze_status() is just: *freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN; return OPAL_SUCCESS; It looks like npu_freeze_status() is affected as well. I think we could consider these skiboot bugs, but this still seems like an improvement. Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com> > --- > > Without this, this happens: > > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data > EEH: PHB#6 failure detected, location: N/A > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106 > Call Trace: > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable) > [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0 > [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160 > [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0 > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 > [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0 > [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0 > [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci] > [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 > EEH: Detected error on PHB#6 > EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail > ures. > EEH: Notify device drivers to shutdown > EEH: Beginning: 'error_detected(IO frozen)' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover' > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover' > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover' > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover' > EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover' > EEH: Collect temporary log > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 > EEH: Reset without hotplug activity > iommu: Removing device 0006:00:01.0 from group 4 > iommu: Removing device 0006:00:01.1 from group 4 > iommu: Removing device 0006:00:02.0 from group 4 > iommu: Removing device 0006:00:02.1 from group 4 > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > EEH: Sleep 5s ahead of partial hotplug > pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff pg=1000 > pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff pg=1000 > pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff pg=1000 > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000 > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000 > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000 > EEH: Beginning: 'slot_reset' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: Finished:'slot_reset' with aggregate recovery state:'none' > EEH: Notify device driver to resume > EEH: Beginning: 'resume' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: Finished:'resume' > EEH: Recovery successful. > --- > arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- > arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- > arch/powerpc/platforms/powernv/pci.c | 4 ++-- > 3 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c > index abc0be7..f380789 100644 > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe) > static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > { > struct pnv_phb *phb = pe->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > s64 rc; > int result = 0; > > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > static int pnv_eeh_get_pe_state(struct eeh_pe *pe) > { > struct pnv_phb *phb = pe->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > s64 rc; > int result; > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > index dd80744..72b5cc0 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt) > static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no) > { > struct pnv_ioda_pe *slave, *pe; > - u8 fstate, state; > - __be16 pcierr; > + u8 fstate = 0, state; > + __be16 pcierr = 0; > s64 rc; > > /* Sanity check on PE number */ > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c > index 13aef23..db230a35 100644 > --- a/arch/powerpc/platforms/powernv/pci.c > +++ b/arch/powerpc/platforms/powernv/pci.c > @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) > static void pnv_pci_config_check_eeh(struct pci_dn *pdn) > { > struct pnv_phb *phb = pdn->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > unsigned int pe_no; > s64 rc; > > -- > 2.17.1 >
Alexey Kardashevskiy <aik@ozlabs.ru> writes: > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in > skiboot's NPU driver does not touch the pci_error_type parameter so > it might have garbage but the powernv code analyzes it nevertheless. > > This initializes pcierr and fstate to zero in all call sites. > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > --- Can we tag this with a Fixes? And seems like it should probably go to stable, or can we not trigger this path on older kernels? cheers > Without this, this happens: > > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data > EEH: PHB#6 failure detected, location: N/A > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106 > Call Trace: > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable) > [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0 > [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160 > [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0 > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 > [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0 > [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0 > [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci] > [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 > EEH: Detected error on PHB#6 > EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail > ures. > EEH: Notify device drivers to shutdown > EEH: Beginning: 'error_detected(IO frozen)' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover' > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover' > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover' > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen) > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover' > EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover' > EEH: Collect temporary log > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 > EEH: Reset without hotplug activity > iommu: Removing device 0006:00:01.0 from group 4 > iommu: Removing device 0006:00:01.1 from group 4 > iommu: Removing device 0006:00:02.0 from group 4 > iommu: Removing device 0006:00:02.1 from group 4 > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > EEH: Sleep 5s ahead of partial hotplug > pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff pg=1000 > pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff pg=1000 > pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff pg=1000 > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000 > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000 > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000 > EEH: Beginning: 'slot_reset' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: Finished:'slot_reset' with aggregate recovery state:'none' > EEH: Notify device driver to resume > EEH: Beginning: 'resume' > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > EEH: Finished:'resume' > EEH: Recovery successful. > --- > arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- > arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- > arch/powerpc/platforms/powernv/pci.c | 4 ++-- > 3 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c > index abc0be7..f380789 100644 > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe) > static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > { > struct pnv_phb *phb = pe->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > s64 rc; > int result = 0; > > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > static int pnv_eeh_get_pe_state(struct eeh_pe *pe) > { > struct pnv_phb *phb = pe->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > s64 rc; > int result; > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > index dd80744..72b5cc0 100644 > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt) > static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no) > { > struct pnv_ioda_pe *slave, *pe; > - u8 fstate, state; > - __be16 pcierr; > + u8 fstate = 0, state; > + __be16 pcierr = 0; > s64 rc; > > /* Sanity check on PE number */ > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c > index 13aef23..db230a35 100644 > --- a/arch/powerpc/platforms/powernv/pci.c > +++ b/arch/powerpc/platforms/powernv/pci.c > @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) > static void pnv_pci_config_check_eeh(struct pci_dn *pdn) > { > struct pnv_phb *phb = pdn->phb->private_data; > - u8 fstate; > - __be16 pcierr; > + u8 fstate = 0; > + __be16 pcierr = 0; > unsigned int pe_no; > s64 rc; > > -- > 2.17.1
On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote: > Alexey Kardashevskiy <aik@ozlabs.ru> writes: > > > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in > > skiboot's NPU driver does not touch the pci_error_type parameter so > > it might have garbage but the powernv code analyzes it nevertheless. > > > > This initializes pcierr and fstate to zero in all call sites. > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > --- > > Can we tag this with a Fixes? And seems like it should probably go to > stable, or can we not trigger this path on older kernels? > > cheers Hmm, it's triggered by use on an NPU PE so that would be any kernel that can run on P8 or later (AFAIK). It looks like the issue was present earlier, but the code was last touched when it was moved, in... 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()") ... which was back in v4.1. Sam. > > Without this, this happens: > > > > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data > > EEH: PHB#6 failure detected, location: N/A > > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106 > > Call Trace: > > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable) > > [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0 > > [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160 > > [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0 > > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 > > [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0 > > [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0 > > [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci] > > [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] > > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 > > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 > > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 > > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 > > EEH: Detected error on PHB#6 > > EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail > > ures. > > EEH: Notify device drivers to shutdown > > EEH: Beginning: 'error_detected(IO frozen)' > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen) > > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover' > > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen) > > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover' > > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen) > > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover' > > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen) > > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover' > > EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover' > > EEH: Collect temporary log > > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 > > EEH: Reset without hotplug activity > > iommu: Removing device 0006:00:01.0 from group 4 > > iommu: Removing device 0006:00:01.1 from group 4 > > iommu: Removing device 0006:00:02.0 from group 4 > > iommu: Removing device 0006:00:02.1 from group 4 > > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > > EEH: Sleep 5s ahead of partial hotplug > > pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff pg=1000 > > pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff pg=1000 > > pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff pg=1000 > > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000 > > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000 > > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000 > > EEH: Beginning: 'slot_reset' > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > EEH: Finished:'slot_reset' with aggregate recovery state:'none' > > EEH: Notify device driver to resume > > EEH: Beginning: 'resume' > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > EEH: Finished:'resume' > > EEH: Recovery successful. > > --- > > arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- > > arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- > > arch/powerpc/platforms/powernv/pci.c | 4 ++-- > > 3 files changed, 8 insertions(+), 8 deletions(-) > > > > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c > > index abc0be7..f380789 100644 > > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c > > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c > > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe) > > static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > > { > > struct pnv_phb *phb = pe->phb->private_data; > > - u8 fstate; > > - __be16 pcierr; > > + u8 fstate = 0; > > + __be16 pcierr = 0; > > s64 rc; > > int result = 0; > > > > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > > static int pnv_eeh_get_pe_state(struct eeh_pe *pe) > > { > > struct pnv_phb *phb = pe->phb->private_data; > > - u8 fstate; > > - __be16 pcierr; > > + u8 fstate = 0; > > + __be16 pcierr = 0; > > s64 rc; > > int result; > > > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c > > index dd80744..72b5cc0 100644 > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt) > > static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no) > > { > > struct pnv_ioda_pe *slave, *pe; > > - u8 fstate, state; > > - __be16 pcierr; > > + u8 fstate = 0, state; > > + __be16 pcierr = 0; > > s64 rc; > > > > /* Sanity check on PE number */ > > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c > > index 13aef23..db230a35 100644 > > --- a/arch/powerpc/platforms/powernv/pci.c > > +++ b/arch/powerpc/platforms/powernv/pci.c > > @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) > > static void pnv_pci_config_check_eeh(struct pci_dn *pdn) > > { > > struct pnv_phb *phb = pdn->phb->private_data; > > - u8 fstate; > > - __be16 pcierr; > > + u8 fstate = 0; > > + __be16 pcierr = 0; > > unsigned int pe_no; > > s64 rc; > > > > -- > > 2.17.1 >
On 20/11/2018 14:51, Sam Bobroff wrote: > On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote: >> Alexey Kardashevskiy <aik@ozlabs.ru> writes: >> >>> The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in >>> skiboot's NPU driver does not touch the pci_error_type parameter so >>> it might have garbage but the powernv code analyzes it nevertheless. >>> >>> This initializes pcierr and fstate to zero in all call sites. >>> >>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >>> --- >> >> Can we tag this with a Fixes? And seems like it should probably go to >> stable, or can we not trigger this path on older kernels? >> >> cheers > > Hmm, it's triggered by use on an NPU PE so that would be any kernel that > can run on P8 or later (AFAIK). > > It looks like the issue was present earlier, but the code was last > touched when it was moved, in... > > 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()") > > ... which was back in v4.1. The original commit is https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57 2013-06-20 13:21:09 +0800 === powerpc/eeh: I/O chip EEH state retrieval The patch adds I/O chip backend to retrieve the state for the indicated PE. While the PE state is temperarily unavailable, the upper layer (powernv platform) should return default delay (1 second). === This was 3.10-rc5 (this is what that sha1's Makefile has). But this was not the issue till skiboot decided not to zero these parameters so "Fixes:" should point to what? > > Sam. > >>> Without this, this happens: >>> >>> pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data >>> EEH: PHB#6 failure detected, location: N/A >>> CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106 >>> Call Trace: >>> [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable) >>> [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0 >>> [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160 >>> [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0 >>> [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 >>> [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0 >>> [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0 >>> [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci] >>> [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] >>> [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 >>> [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 >>> [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 >>> [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 >>> EEH: Detected error on PHB#6 >>> EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail >>> ures. >>> EEH: Notify device drivers to shutdown >>> EEH: Beginning: 'error_detected(IO frozen)' >>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) >>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) >>> EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen) >>> EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover' >>> EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen) >>> EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover' >>> EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen) >>> EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover' >>> EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen) >>> EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover' >>> EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover' >>> EEH: Collect temporary log >>> pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 >>> EEH: Reset without hotplug activity >>> iommu: Removing device 0006:00:01.0 from group 4 >>> iommu: Removing device 0006:00:01.1 from group 4 >>> iommu: Removing device 0006:00:02.0 from group 4 >>> iommu: Removing device 0006:00:02.1 from group 4 >>> pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) >>> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) >>> EEH: Sleep 5s ahead of partial hotplug >>> pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff pg=1000 >>> pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff pg=1000 >>> pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff pg=1000 >>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000 >>> pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000 >>> pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000 >>> EEH: Beginning: 'slot_reset' >>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) >>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) >>> EEH: Finished:'slot_reset' with aggregate recovery state:'none' >>> EEH: Notify device driver to resume >>> EEH: Beginning: 'resume' >>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) >>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) >>> EEH: Finished:'resume' >>> EEH: Recovery successful. >>> --- >>> arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- >>> arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- >>> arch/powerpc/platforms/powernv/pci.c | 4 ++-- >>> 3 files changed, 8 insertions(+), 8 deletions(-) >>> >>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c >>> index abc0be7..f380789 100644 >>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c >>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c >>> @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe) >>> static int pnv_eeh_get_phb_state(struct eeh_pe *pe) >>> { >>> struct pnv_phb *phb = pe->phb->private_data; >>> - u8 fstate; >>> - __be16 pcierr; >>> + u8 fstate = 0; >>> + __be16 pcierr = 0; >>> s64 rc; >>> int result = 0; >>> >>> @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe) >>> static int pnv_eeh_get_pe_state(struct eeh_pe *pe) >>> { >>> struct pnv_phb *phb = pe->phb->private_data; >>> - u8 fstate; >>> - __be16 pcierr; >>> + u8 fstate = 0; >>> + __be16 pcierr = 0; >>> s64 rc; >>> int result; >>> >>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c >>> index dd80744..72b5cc0 100644 >>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c >>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c >>> @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt) >>> static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no) >>> { >>> struct pnv_ioda_pe *slave, *pe; >>> - u8 fstate, state; >>> - __be16 pcierr; >>> + u8 fstate = 0, state; >>> + __be16 pcierr = 0; >>> s64 rc; >>> >>> /* Sanity check on PE number */ >>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c >>> index 13aef23..db230a35 100644 >>> --- a/arch/powerpc/platforms/powernv/pci.c >>> +++ b/arch/powerpc/platforms/powernv/pci.c >>> @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) >>> static void pnv_pci_config_check_eeh(struct pci_dn *pdn) >>> { >>> struct pnv_phb *phb = pdn->phb->private_data; >>> - u8 fstate; >>> - __be16 pcierr; >>> + u8 fstate = 0; >>> + __be16 pcierr = 0; >>> unsigned int pe_no; >>> s64 rc; >>> >>> -- >>> 2.17.1 >>
On Tue, 2018-11-20 at 17:55 +1100, Alexey Kardashevskiy wrote: > > On 20/11/2018 14:51, Sam Bobroff wrote: > > On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote: > > > Alexey Kardashevskiy <aik@ozlabs.ru> writes: > > > > > > > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS > > > > call in > > > > skiboot's NPU driver does not touch the pci_error_type > > > > parameter so > > > > it might have garbage but the powernv code analyzes it > > > > nevertheless. > > > > > > > > This initializes pcierr and fstate to zero in all call sites. > > > > > > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> > > > > --- > > > > > > Can we tag this with a Fixes? And seems like it should probably > > > go to > > > stable, or can we not trigger this path on older kernels? > > > > > > cheers > > > > Hmm, it's triggered by use on an NPU PE so that would be any kernel > > that > > can run on P8 or later (AFAIK). > > > > It looks like the issue was present earlier, but the code was last > > touched when it was moved, in... > > > > 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()") > > > > ... which was back in v4.1. > > The original commit is > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57 > > 2013-06-20 13:21:09 +0800 > === > powerpc/eeh: I/O chip EEH state retrieval > > The patch adds I/O chip backend to retrieve the state for the > indicated PE. While the PE state is temperarily unavailable, > the upper layer (powernv platform) should return default delay > (1 second). > === > > This was 3.10-rc5 (this is what that sha1's Makefile has). > > But this was not the issue till skiboot decided not to zero these > parameters so "Fixes:" should point to what? It should still be that commit, it's perfectly reasonable (however unlikely) that someone could be running a 3.10 kernel with a modern skiboot. > > > > Sam. > > > > > > Without this, this happens: > > > > > > > > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data > > > > EEH: PHB#6 failure detected, location: N/A > > > > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0- > > > > le_f5a7bb7_aikATfstn1-p1 torvalds#106 > > > > Call Trace: > > > > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 > > > > (unreliable) > > > > [c000003fea9dfa00] [c000000000038480] > > > > eeh_dev_check_failure+0x1f0/0x5f0 > > > > [c000003fea9dfaa0] [c0000000000a2768] > > > > pnv_pci_read_config+0x128/0x160 > > > > [c000003fea9dfae0] [c0000000005d2b0c] > > > > pci_bus_read_config_dword+0x9c/0xf0 > > > > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 > > > > [c000003fea9dfbc0] [c0000000005e0730] > > > > pci_dev_save_and_disable+0x70/0xa0 > > > > [c000003fea9dfbf0] [c0000000005e4078] > > > > pci_try_reset_function+0x48/0xc0 > > > > [c000003fea9dfc20] [c00800001cbc1b1c] > > > > vfio_pci_ioctl+0x334/0xea0 [vfio_pci] > > > > [c000003fea9dfcf0] [c00800001ca9046c] > > > > vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] > > > > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 > > > > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 > > > > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 > > > > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 > > > > EEH: Detected error on PHB#6 > > > > EEH: This PCI device has failed 1 times in the last hour and > > > > will be permanently disabled after 5 fail > > > > ures. > > > > EEH: Notify device drivers to shutdown > > > > EEH: Beginning: 'error_detected(IO frozen)' > > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > > > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci- > > > > >error_detected(IO frozen) > > > > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can > > > > recover' > > > > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci- > > > > >error_detected(IO frozen) > > > > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can > > > > recover' > > > > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci- > > > > >error_detected(IO frozen) > > > > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can > > > > recover' > > > > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci- > > > > >error_detected(IO frozen) > > > > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can > > > > recover' > > > > EEH: Finished:'error_detected(IO frozen)' with aggregate > > > > recovery state:'can recover' > > > > EEH: Collect temporary log > > > > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 > > > > EEH: Reset without hotplug activity > > > > iommu: Removing device 0006:00:01.0 from group 4 > > > > iommu: Removing device 0006:00:01.1 from group 4 > > > > iommu: Removing device 0006:00:02.0 from group 4 > > > > iommu: Removing device 0006:00:02.1 from group 4 > > > > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) > > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) > > > > EEH: Sleep 5s ahead of partial hotplug > > > > pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff > > > > pg=1000 > > > > pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff > > > > pg=1000 > > > > pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff > > > > pg=1000 > > > > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff > > > > pg=1000 > > > > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff > > > > pg=1000 > > > > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff > > > > pg=1000 > > > > EEH: Beginning: 'slot_reset' > > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > > > EEH: Finished:'slot_reset' with aggregate recovery state:'none' > > > > EEH: Notify device driver to resume > > > > EEH: Beginning: 'resume' > > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) > > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) > > > > EEH: Finished:'resume' > > > > EEH: Recovery successful. > > > > --- > > > > arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- > > > > arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- > > > > arch/powerpc/platforms/powernv/pci.c | 4 ++-- > > > > 3 files changed, 8 insertions(+), 8 deletions(-) > > > > > > > > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c > > > > b/arch/powerpc/platforms/powernv/eeh-powernv.c > > > > index abc0be7..f380789 100644 > > > > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c > > > > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c > > > > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct > > > > eeh_pe *pe) > > > > static int pnv_eeh_get_phb_state(struct eeh_pe *pe) > > > > { > > > > struct pnv_phb *phb = pe->phb->private_data; > > > > - u8 fstate; > > > > - __be16 pcierr; > > > > + u8 fstate = 0; > > > > + __be16 pcierr = 0; > > > > s64 rc; > > > > int result = 0; > > > > > > > > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct > > > > eeh_pe *pe) > > > > static int pnv_eeh_get_pe_state(struct eeh_pe *pe) > > > > { > > > > struct pnv_phb *phb = pe->phb->private_data; > > > > - u8 fstate; > > > > - __be16 pcierr; > > > > + u8 fstate = 0; > > > > + __be16 pcierr = 0; > > > > s64 rc; > > > > int result; > > > > > > > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c > > > > b/arch/powerpc/platforms/powernv/pci-ioda.c > > > > index dd80744..72b5cc0 100644 > > > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c > > > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c > > > > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct > > > > pnv_phb *phb, int pe_no, int opt) > > > > static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int > > > > pe_no) > > > > { > > > > struct pnv_ioda_pe *slave, *pe; > > > > - u8 fstate, state; > > > > - __be16 pcierr; > > > > + u8 fstate = 0, state; > > > > + __be16 pcierr = 0; > > > > s64 rc; > > > > > > > > /* Sanity check on PE number */ > > > > diff --git a/arch/powerpc/platforms/powernv/pci.c > > > > b/arch/powerpc/platforms/powernv/pci.c > > > > index 13aef23..db230a35 100644 > > > > --- a/arch/powerpc/platforms/powernv/pci.c > > > > +++ b/arch/powerpc/platforms/powernv/pci.c > > > > @@ -602,8 +602,8 @@ static void > > > > pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) > > > > static void pnv_pci_config_check_eeh(struct pci_dn *pdn) > > > > { > > > > struct pnv_phb *phb = pdn->phb->private_data; > > > > - u8 fstate; > > > > - __be16 pcierr; > > > > + u8 fstate = 0; > > > > + __be16 pcierr = 0; > > > > unsigned int pe_no; > > > > s64 rc; > > > > > > > > -- > > > > 2.17.1
Russell Currey <ruscur@russell.cc> writes: > On Tue, 2018-11-20 at 17:55 +1100, Alexey Kardashevskiy wrote: >> >> On 20/11/2018 14:51, Sam Bobroff wrote: >> > On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote: >> > > Alexey Kardashevskiy <aik@ozlabs.ru> writes: >> > > >> > > > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS >> > > > call in >> > > > skiboot's NPU driver does not touch the pci_error_type >> > > > parameter so >> > > > it might have garbage but the powernv code analyzes it >> > > > nevertheless. >> > > > >> > > > This initializes pcierr and fstate to zero in all call sites. >> > > > >> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> >> > > > --- >> > > >> > > Can we tag this with a Fixes? And seems like it should probably >> > > go to >> > > stable, or can we not trigger this path on older kernels? >> > > >> > > cheers >> > >> > Hmm, it's triggered by use on an NPU PE so that would be any kernel >> > that >> > can run on P8 or later (AFAIK). >> > >> > It looks like the issue was present earlier, but the code was last >> > touched when it was moved, in... >> > >> > 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()") >> > >> > ... which was back in v4.1. >> >> The original commit is >> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57 >> >> 2013-06-20 13:21:09 +0800 >> === >> powerpc/eeh: I/O chip EEH state retrieval >> >> The patch adds I/O chip backend to retrieve the state for the >> indicated PE. While the PE state is temperarily unavailable, >> the upper layer (powernv platform) should return default delay >> (1 second). >> === >> >> This was 3.10-rc5 (this is what that sha1's Makefile has). >> >> But this was not the issue till skiboot decided not to zero these >> parameters so "Fixes:" should point to what? > > It should still be that commit, it's perfectly reasonable (however > unlikely) that someone could be running a 3.10 kernel with a modern > skiboot. Yeah that kernel commit is the earliest point where we could hit the bug, so that's where Fixes should point. If the bug's not actually in that commit, as it sounds here, then you can explain that in the change log. The other way to think about is if someone has back ported 8c41a7f3f7593fe57 then they would also want this fix, so pointing the Fixes tag at 8c41a7f3f7593fe57 expresses that. cheers
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c index abc0be7..f380789 100644 --- a/arch/powerpc/platforms/powernv/eeh-powernv.c +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe) static int pnv_eeh_get_phb_state(struct eeh_pe *pe) { struct pnv_phb *phb = pe->phb->private_data; - u8 fstate; - __be16 pcierr; + u8 fstate = 0; + __be16 pcierr = 0; s64 rc; int result = 0; @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe) static int pnv_eeh_get_pe_state(struct eeh_pe *pe) { struct pnv_phb *phb = pe->phb->private_data; - u8 fstate; - __be16 pcierr; + u8 fstate = 0; + __be16 pcierr = 0; s64 rc; int result; diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c index dd80744..72b5cc0 100644 --- a/arch/powerpc/platforms/powernv/pci-ioda.c +++ b/arch/powerpc/platforms/powernv/pci-ioda.c @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt) static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no) { struct pnv_ioda_pe *slave, *pe; - u8 fstate, state; - __be16 pcierr; + u8 fstate = 0, state; + __be16 pcierr = 0; s64 rc; /* Sanity check on PE number */ diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c index 13aef23..db230a35 100644 --- a/arch/powerpc/platforms/powernv/pci.c +++ b/arch/powerpc/platforms/powernv/pci.c @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no) static void pnv_pci_config_check_eeh(struct pci_dn *pdn) { struct pnv_phb *phb = pdn->phb->private_data; - u8 fstate; - __be16 pcierr; + u8 fstate = 0; + __be16 pcierr = 0; unsigned int pe_no; s64 rc;
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in skiboot's NPU driver does not touch the pci_error_type parameter so it might have garbage but the powernv code analyzes it nevertheless. This initializes pcierr and fstate to zero in all call sites. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> --- Without this, this happens: pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data EEH: PHB#6 failure detected, location: N/A CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106 Call Trace: [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable) [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0 [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160 [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0 [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250 [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0 [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0 [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci] [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio] [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00 [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120 [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80 [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70 EEH: Detected error on PHB#6 EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail ures. EEH: Notify device drivers to shutdown EEH: Beginning: 'error_detected(IO frozen)' EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen) EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover' EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen) EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover' EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen) EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover' EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen) EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover' EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover' EEH: Collect temporary log pnv_pci_dump_phb_diag_data: Unrecognized ioType 0 EEH: Reset without hotplug activity iommu: Removing device 0006:00:01.0 from group 4 iommu: Removing device 0006:00:01.1 from group 4 iommu: Removing device 0006:00:02.0 from group 4 iommu: Removing device 0006:00:02.1 from group 4 pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0 pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7) pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7) EEH: Sleep 5s ahead of partial hotplug pci 0004:04 : [PE# 00] Setting up window#0 0..3fffffff pg=1000 pci 0004:05 : [PE# 18] Setting up window#0 0..3fffffff pg=1000 pci 0004:06 : [PE# 30] Setting up window#0 0..3fffffff pg=1000 pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000 pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000 pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000 EEH: Beginning: 'slot_reset' EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) EEH: Finished:'slot_reset' with aggregate recovery state:'none' EEH: Notify device driver to resume EEH: Beginning: 'resume' EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0) EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0) EEH: Finished:'resume' EEH: Recovery successful. --- arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++---- arch/powerpc/platforms/powernv/pci-ioda.c | 4 ++-- arch/powerpc/platforms/powernv/pci.c | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-)