diff mbox series

[kernel] powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status

Message ID 20181119042517.45075-1-aik@ozlabs.ru (mailing list archive)
State Accepted
Commit c20577014f85f36d4e137d3d52a1f61225b4a3d2
Headers show
Series [kernel] powerpc/powernv/eeh/npu: Fix uninitialized variables in opal_pci_eeh_freeze_status | expand

Checks

Context Check Description
snowpatch_ozlabs/apply_patch success next/apply_patch Successfully applied
snowpatch_ozlabs/build-ppc64le success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64be success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-ppc64e success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/build-pmac32 success build succeded & removed 0 sparse warning(s)
snowpatch_ozlabs/checkpatch success total: 0 errors, 0 warnings, 0 checks, 40 lines checked

Commit Message

Alexey Kardashevskiy Nov. 19, 2018, 4:25 a.m. UTC
The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
skiboot's NPU driver does not touch the pci_error_type parameter so
it might have garbage but the powernv code analyzes it nevertheless.

This initializes pcierr and fstate to zero in all call sites.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

Without this, this happens:

pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
EEH: PHB#6 failure detected, location: N/A
CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
Call Trace:
[c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable)
[c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0
[c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160
[c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0
[c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
[c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0
[c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0
[c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
[c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
[c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
[c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
[c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
[c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
EEH: Detected error on PHB#6
EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail
ures.
EEH: Notify device drivers to shutdown
EEH: Beginning: 'error_detected(IO frozen)'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
EEH: Collect temporary log
pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
EEH: Reset without hotplug activity
iommu: Removing device 0006:00:01.0 from group 4
iommu: Removing device 0006:00:01.1 from group 4
iommu: Removing device 0006:00:02.0 from group 4
iommu: Removing device 0006:00:02.1 from group 4
pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
EEH: Sleep 5s ahead of partial hotplug
pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff pg=1000
pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff pg=1000
pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff pg=1000
pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000
pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000
pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000
EEH: Beginning: 'slot_reset'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: Finished:'slot_reset' with aggregate recovery state:'none'
EEH: Notify device driver to resume
EEH: Beginning: 'resume'
EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
EEH: Finished:'resume'
EEH: Recovery successful.
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
 arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
 arch/powerpc/platforms/powernv/pci.c         | 4 ++--
 3 files changed, 8 insertions(+), 8 deletions(-)

Comments

Sam Bobroff Nov. 19, 2018, 11:28 p.m. UTC | #1
On Mon, Nov 19, 2018 at 03:25:17PM +1100, Alexey Kardashevskiy wrote:
> The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
> skiboot's NPU driver does not touch the pci_error_type parameter so
> it might have garbage but the powernv code analyzes it nevertheless.
> 
> This initializes pcierr and fstate to zero in all call sites.
> 
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>

The skiboot code for npu2_freeze_status() is just:

	*freeze_state = OPAL_EEH_STOPPED_NOT_FROZEN;
	return OPAL_SUCCESS;

It looks like npu_freeze_status() is affected as well. I think we
could consider these skiboot bugs, but this still seems like an
improvement.

Reviewed-by: Sam Bobroff <sbobroff@linux.ibm.com>

> ---
> 
> Without this, this happens:
> 
> pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
> EEH: PHB#6 failure detected, location: N/A
> CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
> Call Trace:
> [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable)
> [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0
> [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160
> [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0
> [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
> [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0
> [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0
> [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
> [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
> [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
> [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
> [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
> EEH: Detected error on PHB#6
> EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail
> ures.
> EEH: Notify device drivers to shutdown
> EEH: Beginning: 'error_detected(IO frozen)'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
> EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
> EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
> EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
> EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
> EEH: Collect temporary log
> pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
> EEH: Reset without hotplug activity
> iommu: Removing device 0006:00:01.0 from group 4
> iommu: Removing device 0006:00:01.1 from group 4
> iommu: Removing device 0006:00:02.0 from group 4
> iommu: Removing device 0006:00:02.1 from group 4
> pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> EEH: Sleep 5s ahead of partial hotplug
> pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff pg=1000
> pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff pg=1000
> pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff pg=1000
> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000
> pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000
> pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000
> EEH: Beginning: 'slot_reset'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: Finished:'slot_reset' with aggregate recovery state:'none'
> EEH: Notify device driver to resume
> EEH: Beginning: 'resume'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: Finished:'resume'
> EEH: Recovery successful.
> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
>  arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
>  arch/powerpc/platforms/powernv/pci.c         | 4 ++--
>  3 files changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index abc0be7..f380789 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
>  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>  {
>  	struct pnv_phb *phb = pe->phb->private_data;
> -	u8 fstate;
> -	__be16 pcierr;
> +	u8 fstate = 0;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  	int result = 0;
>  
> @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>  static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
>  {
>  	struct pnv_phb *phb = pe->phb->private_data;
> -	u8 fstate;
> -	__be16 pcierr;
> +	u8 fstate = 0;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  	int result;
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index dd80744..72b5cc0 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt)
>  static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>  {
>  	struct pnv_ioda_pe *slave, *pe;
> -	u8 fstate, state;
> -	__be16 pcierr;
> +	u8 fstate = 0, state;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  
>  	/* Sanity check on PE number */
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 13aef23..db230a35 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
>  static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
>  {
>  	struct pnv_phb *phb = pdn->phb->private_data;
> -	u8	fstate;
> -	__be16	pcierr;
> +	u8	fstate = 0;
> +	__be16	pcierr = 0;
>  	unsigned int pe_no;
>  	s64	rc;
>  
> -- 
> 2.17.1
>
Michael Ellerman Nov. 20, 2018, 2:51 a.m. UTC | #2
Alexey Kardashevskiy <aik@ozlabs.ru> writes:

> The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
> skiboot's NPU driver does not touch the pci_error_type parameter so
> it might have garbage but the powernv code analyzes it nevertheless.
>
> This initializes pcierr and fstate to zero in all call sites.
>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---

Can we tag this with a Fixes? And seems like it should probably go to
stable, or can we not trigger this path on older kernels?

cheers

> Without this, this happens:
>
> pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
> EEH: PHB#6 failure detected, location: N/A
> CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
> Call Trace:
> [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable)
> [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0
> [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160
> [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0
> [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
> [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0
> [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0
> [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
> [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
> [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
> [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
> [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
> EEH: Detected error on PHB#6
> EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail
> ures.
> EEH: Notify device drivers to shutdown
> EEH: Beginning: 'error_detected(IO frozen)'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
> EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
> EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
> EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
> EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
> EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
> EEH: Collect temporary log
> pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
> EEH: Reset without hotplug activity
> iommu: Removing device 0006:00:01.0 from group 4
> iommu: Removing device 0006:00:01.1 from group 4
> iommu: Removing device 0006:00:02.0 from group 4
> iommu: Removing device 0006:00:02.1 from group 4
> pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> EEH: Sleep 5s ahead of partial hotplug
> pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff pg=1000
> pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff pg=1000
> pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff pg=1000
> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000
> pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000
> pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000
> EEH: Beginning: 'slot_reset'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: Finished:'slot_reset' with aggregate recovery state:'none'
> EEH: Notify device driver to resume
> EEH: Beginning: 'resume'
> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> EEH: Finished:'resume'
> EEH: Recovery successful.
> ---
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
>  arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
>  arch/powerpc/platforms/powernv/pci.c         | 4 ++--
>  3 files changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index abc0be7..f380789 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
>  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>  {
>  	struct pnv_phb *phb = pe->phb->private_data;
> -	u8 fstate;
> -	__be16 pcierr;
> +	u8 fstate = 0;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  	int result = 0;
>  
> @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>  static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
>  {
>  	struct pnv_phb *phb = pe->phb->private_data;
> -	u8 fstate;
> -	__be16 pcierr;
> +	u8 fstate = 0;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  	int result;
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index dd80744..72b5cc0 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt)
>  static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>  {
>  	struct pnv_ioda_pe *slave, *pe;
> -	u8 fstate, state;
> -	__be16 pcierr;
> +	u8 fstate = 0, state;
> +	__be16 pcierr = 0;
>  	s64 rc;
>  
>  	/* Sanity check on PE number */
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 13aef23..db230a35 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
>  static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
>  {
>  	struct pnv_phb *phb = pdn->phb->private_data;
> -	u8	fstate;
> -	__be16	pcierr;
> +	u8	fstate = 0;
> +	__be16	pcierr = 0;
>  	unsigned int pe_no;
>  	s64	rc;
>  
> -- 
> 2.17.1
Sam Bobroff Nov. 20, 2018, 3:51 a.m. UTC | #3
On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote:
> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> 
> > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
> > skiboot's NPU driver does not touch the pci_error_type parameter so
> > it might have garbage but the powernv code analyzes it nevertheless.
> >
> > This initializes pcierr and fstate to zero in all call sites.
> >
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > ---
> 
> Can we tag this with a Fixes? And seems like it should probably go to
> stable, or can we not trigger this path on older kernels?
> 
> cheers

Hmm, it's triggered by use on an NPU PE so that would be any kernel that
can run on P8 or later (AFAIK).

It looks like the issue was present earlier, but the code was last
touched when it was moved, in...

40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()")

... which was back in v4.1.

Sam.

> > Without this, this happens:
> >
> > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
> > EEH: PHB#6 failure detected, location: N/A
> > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
> > Call Trace:
> > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable)
> > [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0
> > [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160
> > [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0
> > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
> > [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0
> > [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0
> > [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
> > [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
> > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
> > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
> > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
> > EEH: Detected error on PHB#6
> > EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail
> > ures.
> > EEH: Notify device drivers to shutdown
> > EEH: Beginning: 'error_detected(IO frozen)'
> > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
> > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
> > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
> > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
> > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
> > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
> > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
> > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
> > EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
> > EEH: Collect temporary log
> > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
> > EEH: Reset without hotplug activity
> > iommu: Removing device 0006:00:01.0 from group 4
> > iommu: Removing device 0006:00:01.1 from group 4
> > iommu: Removing device 0006:00:02.0 from group 4
> > iommu: Removing device 0006:00:02.1 from group 4
> > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> > EEH: Sleep 5s ahead of partial hotplug
> > pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff pg=1000
> > pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff pg=1000
> > pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff pg=1000
> > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000
> > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000
> > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000
> > EEH: Beginning: 'slot_reset'
> > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > EEH: Finished:'slot_reset' with aggregate recovery state:'none'
> > EEH: Notify device driver to resume
> > EEH: Beginning: 'resume'
> > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > EEH: Finished:'resume'
> > EEH: Recovery successful.
> > ---
> >  arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
> >  arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
> >  arch/powerpc/platforms/powernv/pci.c         | 4 ++--
> >  3 files changed, 8 insertions(+), 8 deletions(-)
> >
> > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > index abc0be7..f380789 100644
> > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
> >  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
> >  {
> >  	struct pnv_phb *phb = pe->phb->private_data;
> > -	u8 fstate;
> > -	__be16 pcierr;
> > +	u8 fstate = 0;
> > +	__be16 pcierr = 0;
> >  	s64 rc;
> >  	int result = 0;
> >  
> > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
> >  static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
> >  {
> >  	struct pnv_phb *phb = pe->phb->private_data;
> > -	u8 fstate;
> > -	__be16 pcierr;
> > +	u8 fstate = 0;
> > +	__be16 pcierr = 0;
> >  	s64 rc;
> >  	int result;
> >  
> > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> > index dd80744..72b5cc0 100644
> > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt)
> >  static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
> >  {
> >  	struct pnv_ioda_pe *slave, *pe;
> > -	u8 fstate, state;
> > -	__be16 pcierr;
> > +	u8 fstate = 0, state;
> > +	__be16 pcierr = 0;
> >  	s64 rc;
> >  
> >  	/* Sanity check on PE number */
> > diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> > index 13aef23..db230a35 100644
> > --- a/arch/powerpc/platforms/powernv/pci.c
> > +++ b/arch/powerpc/platforms/powernv/pci.c
> > @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
> >  static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
> >  {
> >  	struct pnv_phb *phb = pdn->phb->private_data;
> > -	u8	fstate;
> > -	__be16	pcierr;
> > +	u8	fstate = 0;
> > +	__be16	pcierr = 0;
> >  	unsigned int pe_no;
> >  	s64	rc;
> >  
> > -- 
> > 2.17.1
>
Alexey Kardashevskiy Nov. 20, 2018, 6:55 a.m. UTC | #4
On 20/11/2018 14:51, Sam Bobroff wrote:
> On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote:
>> Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>>
>>> The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS call in
>>> skiboot's NPU driver does not touch the pci_error_type parameter so
>>> it might have garbage but the powernv code analyzes it nevertheless.
>>>
>>> This initializes pcierr and fstate to zero in all call sites.
>>>
>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>>> ---
>>
>> Can we tag this with a Fixes? And seems like it should probably go to
>> stable, or can we not trigger this path on older kernels?
>>
>> cheers
> 
> Hmm, it's triggered by use on an NPU PE so that would be any kernel that
> can run on P8 or later (AFAIK).
> 
> It looks like the issue was present earlier, but the code was last
> touched when it was moved, in...
> 
> 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()")
> 
> ... which was back in v4.1.


The original commit is
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57

2013-06-20 13:21:09 +0800
===
powerpc/eeh: I/O chip EEH state retrieval

The patch adds I/O chip backend to retrieve the state for the
indicated PE. While the PE state is temperarily unavailable,
the upper layer (powernv platform) should return default delay
(1 second).
===

This was 3.10-rc5 (this is what that sha1's Makefile has).

But this was not the issue till skiboot decided not to zero these
parameters so "Fixes:" should point to what?


> 
> Sam.
> 
>>> Without this, this happens:
>>>
>>> pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
>>> EEH: PHB#6 failure detected, location: N/A
>>> CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-le_f5a7bb7_aikATfstn1-p1 torvalds#106
>>> Call Trace:
>>> [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4 (unreliable)
>>> [c000003fea9dfa00] [c000000000038480] eeh_dev_check_failure+0x1f0/0x5f0
>>> [c000003fea9dfaa0] [c0000000000a2768] pnv_pci_read_config+0x128/0x160
>>> [c000003fea9dfae0] [c0000000005d2b0c] pci_bus_read_config_dword+0x9c/0xf0
>>> [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
>>> [c000003fea9dfbc0] [c0000000005e0730] pci_dev_save_and_disable+0x70/0xa0
>>> [c000003fea9dfbf0] [c0000000005e4078] pci_try_reset_function+0x48/0xc0
>>> [c000003fea9dfc20] [c00800001cbc1b1c] vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
>>> [c000003fea9dfcf0] [c00800001ca9046c] vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
>>> [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
>>> [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
>>> [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
>>> [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
>>> EEH: Detected error on PHB#6
>>> EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 fail
>>> ures.
>>> EEH: Notify device drivers to shutdown
>>> EEH: Beginning: 'error_detected(IO frozen)'
>>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
>>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
>>> EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci->error_detected(IO frozen)
>>> EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can recover'
>>> EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci->error_detected(IO frozen)
>>> EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can recover'
>>> EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci->error_detected(IO frozen)
>>> EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can recover'
>>> EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci->error_detected(IO frozen)
>>> EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can recover'
>>> EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'can recover'
>>> EEH: Collect temporary log
>>> pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
>>> EEH: Reset without hotplug activity
>>> iommu: Removing device 0006:00:01.0 from group 4
>>> iommu: Removing device 0006:00:01.1 from group 4
>>> iommu: Removing device 0006:00:02.0 from group 4
>>> iommu: Removing device 0006:00:02.1 from group 4
>>> pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
>>> pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
>>> EEH: Sleep 5s ahead of partial hotplug
>>> pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff pg=1000
>>> pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff pg=1000
>>> pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff pg=1000
>>> pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff pg=1000
>>> pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff pg=1000
>>> pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff pg=1000
>>> EEH: Beginning: 'slot_reset'
>>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
>>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
>>> EEH: Finished:'slot_reset' with aggregate recovery state:'none'
>>> EEH: Notify device driver to resume
>>> EEH: Beginning: 'resume'
>>> EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
>>> EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
>>> EEH: Finished:'resume'
>>> EEH: Recovery successful.
>>> ---
>>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
>>>  arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
>>>  arch/powerpc/platforms/powernv/pci.c         | 4 ++--
>>>  3 files changed, 8 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> index abc0be7..f380789 100644
>>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>>> @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
>>>  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>>>  {
>>>  	struct pnv_phb *phb = pe->phb->private_data;
>>> -	u8 fstate;
>>> -	__be16 pcierr;
>>> +	u8 fstate = 0;
>>> +	__be16 pcierr = 0;
>>>  	s64 rc;
>>>  	int result = 0;
>>>  
>>> @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
>>>  static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
>>>  {
>>>  	struct pnv_phb *phb = pe->phb->private_data;
>>> -	u8 fstate;
>>> -	__be16 pcierr;
>>> +	u8 fstate = 0;
>>> +	__be16 pcierr = 0;
>>>  	s64 rc;
>>>  	int result;
>>>  
>>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> index dd80744..72b5cc0 100644
>>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>>> @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt)
>>>  static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
>>>  {
>>>  	struct pnv_ioda_pe *slave, *pe;
>>> -	u8 fstate, state;
>>> -	__be16 pcierr;
>>> +	u8 fstate = 0, state;
>>> +	__be16 pcierr = 0;
>>>  	s64 rc;
>>>  
>>>  	/* Sanity check on PE number */
>>> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
>>> index 13aef23..db230a35 100644
>>> --- a/arch/powerpc/platforms/powernv/pci.c
>>> +++ b/arch/powerpc/platforms/powernv/pci.c
>>> @@ -602,8 +602,8 @@ static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
>>>  static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
>>>  {
>>>  	struct pnv_phb *phb = pdn->phb->private_data;
>>> -	u8	fstate;
>>> -	__be16	pcierr;
>>> +	u8	fstate = 0;
>>> +	__be16	pcierr = 0;
>>>  	unsigned int pe_no;
>>>  	s64	rc;
>>>  
>>> -- 
>>> 2.17.1
>>
Russell Currey Nov. 20, 2018, 6:59 a.m. UTC | #5
On Tue, 2018-11-20 at 17:55 +1100, Alexey Kardashevskiy wrote:
> 
> On 20/11/2018 14:51, Sam Bobroff wrote:
> > On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote:
> > > Alexey Kardashevskiy <aik@ozlabs.ru> writes:
> > > 
> > > > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS
> > > > call in
> > > > skiboot's NPU driver does not touch the pci_error_type
> > > > parameter so
> > > > it might have garbage but the powernv code analyzes it
> > > > nevertheless.
> > > > 
> > > > This initializes pcierr and fstate to zero in all call sites.
> > > > 
> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> > > > ---
> > > 
> > > Can we tag this with a Fixes? And seems like it should probably
> > > go to
> > > stable, or can we not trigger this path on older kernels?
> > > 
> > > cheers
> > 
> > Hmm, it's triggered by use on an NPU PE so that would be any kernel
> > that
> > can run on P8 or later (AFAIK).
> > 
> > It looks like the issue was present earlier, but the code was last
> > touched when it was moved, in...
> > 
> > 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()")
> > 
> > ... which was back in v4.1.
> 
> The original commit is
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57
> 
> 2013-06-20 13:21:09 +0800
> ===
> powerpc/eeh: I/O chip EEH state retrieval
> 
> The patch adds I/O chip backend to retrieve the state for the
> indicated PE. While the PE state is temperarily unavailable,
> the upper layer (powernv platform) should return default delay
> (1 second).
> ===
> 
> This was 3.10-rc5 (this is what that sha1's Makefile has).
> 
> But this was not the issue till skiboot decided not to zero these
> parameters so "Fixes:" should point to what?

It should still be that commit, it's perfectly reasonable (however
unlikely) that someone could be running a 3.10 kernel with a modern
skiboot.

> 
> 
> > Sam.
> > 
> > > > Without this, this happens:
> > > > 
> > > > pnv_eeh_get_phb_diag: Failure -7 getting PHB#6 diag-data
> > > > EEH: PHB#6 failure detected, location: N/A
> > > > CPU: 23 PID: 5939 Comm: qemu-system-ppc Not tainted 4.19.0-
> > > > le_f5a7bb7_aikATfstn1-p1 torvalds#106
> > > > Call Trace:
> > > > [c000003fea9df9c0] [c000000000a990ec] dump_stack+0xb0/0xf4
> > > > (unreliable)
> > > > [c000003fea9dfa00] [c000000000038480]
> > > > eeh_dev_check_failure+0x1f0/0x5f0
> > > > [c000003fea9dfaa0] [c0000000000a2768]
> > > > pnv_pci_read_config+0x128/0x160
> > > > [c000003fea9dfae0] [c0000000005d2b0c]
> > > > pci_bus_read_config_dword+0x9c/0xf0
> > > > [c000003fea9dfb40] [c0000000005df3d4] pci_save_state+0x64/0x250
> > > > [c000003fea9dfbc0] [c0000000005e0730]
> > > > pci_dev_save_and_disable+0x70/0xa0
> > > > [c000003fea9dfbf0] [c0000000005e4078]
> > > > pci_try_reset_function+0x48/0xc0
> > > > [c000003fea9dfc20] [c00800001cbc1b1c]
> > > > vfio_pci_ioctl+0x334/0xea0 [vfio_pci]
> > > > [c000003fea9dfcf0] [c00800001ca9046c]
> > > > vfio_device_fops_unl_ioctl+0x44/0x70 [vfio]
> > > > [c000003fea9dfd10] [c00000000039fd84] do_vfs_ioctl+0xd4/0xa00
> > > > [c000003fea9dfdb0] [c0000000003a07b4] ksys_ioctl+0x104/0x120
> > > > [c000003fea9dfe00] [c0000000003a07f8] sys_ioctl+0x28/0x80
> > > > [c000003fea9dfe20] [c00000000000b3a4] system_call+0x5c/0x70
> > > > EEH: Detected error on PHB#6
> > > > EEH: This PCI device has failed 1 times in the last hour and
> > > > will be permanently disabled after 5 fail
> > > > ures.
> > > > EEH: Notify device drivers to shutdown
> > > > EEH: Beginning: 'error_detected(IO frozen)'
> > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > > > EEH: PE#c (PCI 0006:00:01.0): Invoking vfio-pci-
> > > > >error_detected(IO frozen)
> > > > EEH: PE#c (PCI 0006:00:01.0): vfio-pci driver reports: 'can
> > > > recover'
> > > > EEH: PE#c (PCI 0006:00:01.1): Invoking vfio-pci-
> > > > >error_detected(IO frozen)
> > > > EEH: PE#c (PCI 0006:00:01.1): vfio-pci driver reports: 'can
> > > > recover'
> > > > EEH: PE#b (PCI 0006:00:02.0): Invoking vfio-pci-
> > > > >error_detected(IO frozen)
> > > > EEH: PE#b (PCI 0006:00:02.0): vfio-pci driver reports: 'can
> > > > recover'
> > > > EEH: PE#b (PCI 0006:00:02.1): Invoking vfio-pci-
> > > > >error_detected(IO frozen)
> > > > EEH: PE#b (PCI 0006:00:02.1): vfio-pci driver reports: 'can
> > > > recover'
> > > > EEH: Finished:'error_detected(IO frozen)' with aggregate
> > > > recovery state:'can recover'
> > > > EEH: Collect temporary log
> > > > pnv_pci_dump_phb_diag_data: Unrecognized ioType 0
> > > > EEH: Reset without hotplug activity
> > > > iommu: Removing device 0006:00:01.0 from group 4
> > > > iommu: Removing device 0006:00:01.1 from group 4
> > > > iommu: Removing device 0006:00:02.0 from group 4
> > > > iommu: Removing device 0006:00:02.1 from group 4
> > > > pnv_ioda_freeze_pe: Failure -7 freezing PHB#6-PE#0
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x0 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x1 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x8 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x9 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x10 (-7)
> > > > pnv_eeh_restore_config: Can't reinit PCI dev 0x11 (-7)
> > > > EEH: Sleep 5s ahead of partial hotplug
> > > > pci 0004:04     : [PE# 00] Setting up window#0 0..3fffffff
> > > > pg=1000
> > > > pci 0004:05     : [PE# 18] Setting up window#0 0..3fffffff
> > > > pg=1000
> > > > pci 0004:06     : [PE# 30] Setting up window#0 0..3fffffff
> > > > pg=1000
> > > > pci 0006:00:00.0: [PE# 0d] Setting up window 0..3fffffff
> > > > pg=1000
> > > > pci 0006:00:01.0: [PE# 0c] Setting up window 0..3fffffff
> > > > pg=1000
> > > > pci 0006:00:02.0: [PE# 0b] Setting up window 0..3fffffff
> > > > pg=1000
> > > > EEH: Beginning: 'slot_reset'
> > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > > > EEH: Finished:'slot_reset' with aggregate recovery state:'none'
> > > > EEH: Notify device driver to resume
> > > > EEH: Beginning: 'resume'
> > > > EEH: PE#d (PCI 0006:00:00.0): not actionable (1,1,0)
> > > > EEH: PE#d (PCI 0006:00:00.1): not actionable (1,1,0)
> > > > EEH: Finished:'resume'
> > > > EEH: Recovery successful.
> > > > ---
> > > >  arch/powerpc/platforms/powernv/eeh-powernv.c | 8 ++++----
> > > >  arch/powerpc/platforms/powernv/pci-ioda.c    | 4 ++--
> > > >  arch/powerpc/platforms/powernv/pci.c         | 4 ++--
> > > >  3 files changed, 8 insertions(+), 8 deletions(-)
> > > > 
> > > > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > > > b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > > > index abc0be7..f380789 100644
> > > > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > > > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > > > @@ -564,8 +564,8 @@ static void pnv_eeh_get_phb_diag(struct
> > > > eeh_pe *pe)
> > > >  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
> > > >  {
> > > >  	struct pnv_phb *phb = pe->phb->private_data;
> > > > -	u8 fstate;
> > > > -	__be16 pcierr;
> > > > +	u8 fstate = 0;
> > > > +	__be16 pcierr = 0;
> > > >  	s64 rc;
> > > >  	int result = 0;
> > > >  
> > > > @@ -603,8 +603,8 @@ static int pnv_eeh_get_phb_state(struct
> > > > eeh_pe *pe)
> > > >  static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
> > > >  {
> > > >  	struct pnv_phb *phb = pe->phb->private_data;
> > > > -	u8 fstate;
> > > > -	__be16 pcierr;
> > > > +	u8 fstate = 0;
> > > > +	__be16 pcierr = 0;
> > > >  	s64 rc;
> > > >  	int result;
> > > >  
> > > > diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c
> > > > b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > > index dd80744..72b5cc0 100644
> > > > --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> > > > +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> > > > @@ -604,8 +604,8 @@ static int pnv_ioda_unfreeze_pe(struct
> > > > pnv_phb *phb, int pe_no, int opt)
> > > >  static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int
> > > > pe_no)
> > > >  {
> > > >  	struct pnv_ioda_pe *slave, *pe;
> > > > -	u8 fstate, state;
> > > > -	__be16 pcierr;
> > > > +	u8 fstate = 0, state;
> > > > +	__be16 pcierr = 0;
> > > >  	s64 rc;
> > > >  
> > > >  	/* Sanity check on PE number */
> > > > diff --git a/arch/powerpc/platforms/powernv/pci.c
> > > > b/arch/powerpc/platforms/powernv/pci.c
> > > > index 13aef23..db230a35 100644
> > > > --- a/arch/powerpc/platforms/powernv/pci.c
> > > > +++ b/arch/powerpc/platforms/powernv/pci.c
> > > > @@ -602,8 +602,8 @@ static void
> > > > pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
> > > >  static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
> > > >  {
> > > >  	struct pnv_phb *phb = pdn->phb->private_data;
> > > > -	u8	fstate;
> > > > -	__be16	pcierr;
> > > > +	u8	fstate = 0;
> > > > +	__be16	pcierr = 0;
> > > >  	unsigned int pe_no;
> > > >  	s64	rc;
> > > >  
> > > > -- 
> > > > 2.17.1
Michael Ellerman Nov. 20, 2018, 10:58 a.m. UTC | #6
Russell Currey <ruscur@russell.cc> writes:
> On Tue, 2018-11-20 at 17:55 +1100, Alexey Kardashevskiy wrote:
>> 
>> On 20/11/2018 14:51, Sam Bobroff wrote:
>> > On Tue, Nov 20, 2018 at 01:51:06PM +1100, Michael Ellerman wrote:
>> > > Alexey Kardashevskiy <aik@ozlabs.ru> writes:
>> > > 
>> > > > The current implementation of the OPAL_PCI_EEH_FREEZE_STATUS
>> > > > call in
>> > > > skiboot's NPU driver does not touch the pci_error_type
>> > > > parameter so
>> > > > it might have garbage but the powernv code analyzes it
>> > > > nevertheless.
>> > > > 
>> > > > This initializes pcierr and fstate to zero in all call sites.
>> > > > 
>> > > > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>> > > > ---
>> > > 
>> > > Can we tag this with a Fixes? And seems like it should probably
>> > > go to
>> > > stable, or can we not trigger this path on older kernels?
>> > > 
>> > > cheers
>> > 
>> > Hmm, it's triggered by use on an NPU PE so that would be any kernel
>> > that
>> > can run on P8 or later (AFAIK).
>> > 
>> > It looks like the issue was present earlier, but the code was last
>> > touched when it was moved, in...
>> > 
>> > 40ae5f693f6a ("powerpc/powernv: Drop PHB operation get_state()")
>> > 
>> > ... which was back in v4.1.
>> 
>> The original commit is
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8c41a7f3f7593fe57
>> 
>> 2013-06-20 13:21:09 +0800
>> ===
>> powerpc/eeh: I/O chip EEH state retrieval
>> 
>> The patch adds I/O chip backend to retrieve the state for the
>> indicated PE. While the PE state is temperarily unavailable,
>> the upper layer (powernv platform) should return default delay
>> (1 second).
>> ===
>> 
>> This was 3.10-rc5 (this is what that sha1's Makefile has).
>> 
>> But this was not the issue till skiboot decided not to zero these
>> parameters so "Fixes:" should point to what?
>
> It should still be that commit, it's perfectly reasonable (however
> unlikely) that someone could be running a 3.10 kernel with a modern
> skiboot.

Yeah that kernel commit is the earliest point where we could hit the
bug, so that's where Fixes should point.

If the bug's not actually in that commit, as it sounds here, then you
can explain that in the change log.

The other way to think about is if someone has back ported
8c41a7f3f7593fe57 then they would also want this fix, so pointing the
Fixes tag at 8c41a7f3f7593fe57 expresses that.

cheers
diff mbox series

Patch

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
index abc0be7..f380789 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -564,8 +564,8 @@  static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
 static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
 {
 	struct pnv_phb *phb = pe->phb->private_data;
-	u8 fstate;
-	__be16 pcierr;
+	u8 fstate = 0;
+	__be16 pcierr = 0;
 	s64 rc;
 	int result = 0;
 
@@ -603,8 +603,8 @@  static int pnv_eeh_get_phb_state(struct eeh_pe *pe)
 static int pnv_eeh_get_pe_state(struct eeh_pe *pe)
 {
 	struct pnv_phb *phb = pe->phb->private_data;
-	u8 fstate;
-	__be16 pcierr;
+	u8 fstate = 0;
+	__be16 pcierr = 0;
 	s64 rc;
 	int result;
 
diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
index dd80744..72b5cc0 100644
--- a/arch/powerpc/platforms/powernv/pci-ioda.c
+++ b/arch/powerpc/platforms/powernv/pci-ioda.c
@@ -604,8 +604,8 @@  static int pnv_ioda_unfreeze_pe(struct pnv_phb *phb, int pe_no, int opt)
 static int pnv_ioda_get_pe_state(struct pnv_phb *phb, int pe_no)
 {
 	struct pnv_ioda_pe *slave, *pe;
-	u8 fstate, state;
-	__be16 pcierr;
+	u8 fstate = 0, state;
+	__be16 pcierr = 0;
 	s64 rc;
 
 	/* Sanity check on PE number */
diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
index 13aef23..db230a35 100644
--- a/arch/powerpc/platforms/powernv/pci.c
+++ b/arch/powerpc/platforms/powernv/pci.c
@@ -602,8 +602,8 @@  static void pnv_pci_handle_eeh_config(struct pnv_phb *phb, u32 pe_no)
 static void pnv_pci_config_check_eeh(struct pci_dn *pdn)
 {
 	struct pnv_phb *phb = pdn->phb->private_data;
-	u8	fstate;
-	__be16	pcierr;
+	u8	fstate = 0;
+	__be16	pcierr = 0;
 	unsigned int pe_no;
 	s64	rc;