Message ID | 1476437916-31010-1-git-send-email-vaibhav@linux.vnet.ibm.com (mailing list archive) |
---|---|
State | Accepted |
On Fri, 2016-10-14 at 09:38:36 UTC, Vaibhav Jain wrote:
> This patch prevents resetting the cxl adapter via sysfs in presence of
> one or more active cxl_context on it. This protects against an
> unrecoverable error caused by PSL owning a dirty cache line even after
> reset and host tries to touch the same cache line. In case a force reset
> of the card is required irrespective of any active contexts, the int
> value -1 can be stored in the 'reset' sysfs attribute of the card.
>
> The patch introduces a new atomic_t member named contexts_num inside
> struct cxl that holds the number of active context attached to the card
> , which is checked against '0' before proceeding with the reset. To
> prevent against a race condition where a context is activated just after
> reset check is performed, the contexts_num is atomically set to '-1'
> after reset-check to indicate that no more contexts can be activated on
> the card anymore.
>
> Before activating a context we atomically test if contexts_num is
> non-negative and if so, increment its value by one. In case the value of
> contexts_num is negative then it indicates that the card is about to be
> reset and context activation is error-ed out at that point.
>
> Cc: stable@vger.kernel.org
> Fixes: 62fa19d4 ("cxl: Add ability to reset the card")
> Acked-by: Frederic Barrat <fbarrat@linux.vnet.ibm.com>
> Reviewed-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Vaibhav Jain <vaibhav@linux.vnet.ibm.com>

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/70b565bbdb911023373e035225ab10

cheers
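The counting protocol described in the commit message can be sketched in userspace with C11 atomics. This is only an illustrative model under stated assumptions: struct adapter_model and the model_* functions are hypothetical names, and the patch's own kernel implementation is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical stand-in for the contexts_num member of struct cxl. */
struct adapter_model {
	atomic_int contexts_num;	/* >0 active, 0 idle, -1 locked for reset */
};

/* Increment the count unless it is negative, as one atomic step. */
static int model_context_get(struct adapter_model *a)
{
	int cur = atomic_load(&a->contexts_num);

	do {
		if (cur < 0)
			return -EBUSY;	/* a reset is pending */
		/* on CAS failure cur is reloaded and the sign check repeats */
	} while (!atomic_compare_exchange_weak(&a->contexts_num, &cur, cur + 1));

	return 0;
}

/* Drop one active context; the count must have been positive. */
static void model_context_put(struct adapter_model *a)
{
	int prev = atomic_fetch_sub(&a->contexts_num, 1);

	assert(prev > 0);
	(void)prev;
}

/* Move 0 -> -1 in one step; fail if any context is active. */
static int model_context_lock(struct adapter_model *a)
{
	int expected = 0;

	if (atomic_compare_exchange_strong(&a->contexts_num, &expected, -1))
		return 0;

	return -EBUSY;
}
```

Because both the check-and-increment and the 0 -> -1 transition are single compare-and-swap steps, a context cannot slip in between the reset check and the lock, which is the race the commit message describes.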
On 14/10/16 20:38, Vaibhav Jain wrote:
> This patch prevents resetting the cxl adapter via sysfs in presence of
> one or more active cxl_context on it.
> [...]

When I inject an EEH error, this patch causes the following WARN. Thoughts?

[ 55.965011] EEH: PHB#0 failure detected, location: N/A
[ 55.965078] CPU: 20 PID: 9933 Comm: lspci Not tainted 4.9.0-rc1-ajd-00006-g6fb17cc #4
[ 55.965080] Call Trace:
[ 55.965091] [c00000036818fab0] [c000000000950ec8] dump_stack+0xb0/0xf0 (unreliable)
[ 55.965100] [c00000036818faf0] [c00000000002eb44] eeh_dev_check_failure+0x1e4/0x540
[ 55.965107] [c00000036818fb90] [c000000000064090] pnv_pci_read_config+0xc0/0x130
[ 55.965114] [c00000036818fbd0] [c0000000004bec24] pci_user_read_config_dword+0x84/0x160
[ 55.965119] [c00000036818fc20] [c0000000004d12f4] pci_read_config+0x164/0x2a0
[ 55.965125] [c00000036818fca0] [c000000000318e70] sysfs_kf_bin_read+0x70/0xc0
[ 55.965131] [c00000036818fcc0] [c000000000317ff8] kernfs_fop_read+0xd8/0x260
[ 55.965136] [c00000036818fd10] [c000000000278b7c] __vfs_read+0x3c/0x180
[ 55.965141] [c00000036818fda0] [c000000000279e2c] vfs_read+0xac/0x1a0
[ 55.965146] [c00000036818fde0] [c00000000027bc24] SyS_pread64+0xb4/0xd0
[ 55.965152] [c00000036818fe30] [c00000000000bd20] system_call+0x38/0xfc
[ 55.965171] EEH: Detected error on PHB#0
[ 55.965173] EEH: This PCI device has failed 1 times in the last hour
[ 55.965174] EEH: Notify device drivers to shutdown
[ 55.965182] cxl afu0.0: Deactivating AFU directed mode
[ 55.965261] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 55.965263] Error detail: Unknown
[ 55.965265] HMER: 8040000000000000
[ 55.965267] Harmless Hypervisor Maintenance interrupt [Recovered]
[ 55.965268] Error detail: Unknown
[ 55.965270] HMER: 8040000000000000
[ 55.965326] cxl afu0.0: PSL Purge called with link down, ignoring
[ 55.965563] EEH: Collect temporary log
[ 55.965565] PHB3 PHB#0 Diag-data (Version: 1)
[ 55.965566] brdgCtl: 0000ffff
[ 55.965568] UtlSts: 00200000 00000000 00000000
[ 55.965570] RootSts: ffffffff ffffffff ffffffff ffffffff 0000ffff
[ 55.965571] RootErrSts: ffffffff ffffffff ffffffff
[ 55.965572] RootErrLog: ffffffff ffffffff ffffffff ffffffff
[ 55.965574] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[ 55.965575] nFir: 0000809000000000 0030006e00000000 0000800000000000
[ 55.965577] PhbSts: 0000001c00000000 0000001c00000000
[ 55.965578] Lem: 0000020000100000 40018e2400022482 0000000000100000
[ 55.965582] OutErr: 0000002000000000 0000002000000000 0000000000000000 0000000000000000
[ 55.965584] InAErr: 8000000000000000 8000000000000000 0402000000000000 0000000000000000
[ 55.965586] PE[ 0] A/B: 8000000000000000 8000000000000000
[ 55.965587] EEH: Reset without hotplug activity
[ 60.592750] EEH: Notify device drivers the completion of reset
[ 60.592760] cxl-pci 0000:01:00.0: enabling device (0140 -> 0142)
[ 60.593018] pci 0000:01 : [PE# 000] Switching PHB to CXL
[ 60.593116] pci 0000:01 : [PE# 000] Switching PHB to CXL
[ 60.622727] Adapter context unlocked with 0 active contexts
[ 60.622762] ------------[ cut here ]------------
[ 60.622771] WARNING: CPU: 12 PID: 627 at ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
[ 60.622772] Modules linked in: fuse powernv_rng rng_core leds_powernv powernv_op_panel led_class vmx_crypto ib_iser rdma_cm iw_cm ib_cm ib_core libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath bnx2x mdio libcrc32c cxl
[ 60.622794] CPU: 12 PID: 627 Comm: eehd Not tainted 4.9.0-rc1-ajd-00006-g6fb17cc #4
[ 60.622795] task: c0000003be084900 task.stack: c0000003be108000
[ 60.622797] NIP: d000000004350be0 LR: d000000004350bdc CTR: c000000000492fd0
[ 60.622799] REGS: c0000003be10b660 TRAP: 0700 Not tainted (4.9.0-rc1-ajd-00006-g6fb17cc)
[ 60.622800] MSR: 900000010282b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>
[ 60.622810] CR: 28000282 XER: 20000000
[ 60.622811] SOFTE: 1 CFAR: c00000000094fc88
[ 60.622814] GPR00: d000000004350bdc c0000003be10b8e0 d000000004379ae8 000000000000002f
[ 60.622818] GPR04: 0000000000000001 0000000000000000 00000000000003b8 0000000000000000
[ 60.622822] GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000001
[ 60.622826] GPR12: 0000000000000000 c00000000fe03000 c0000000000baac8 c0000003c5166500
[ 60.622830] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[ 60.622834] GPR20: 0000000000000000 0000000000000000 0000000000000000 c000000000b14fe8
[ 60.622837] GPR24: c000000000b14fc0 c0000003afc10400 c0000003b0c40000 0000000000000000
[ 60.622841] GPR28: c0000003c505a098 0000000000000000 c0000003afc10400 0000000000000006
[ 60.622850] NIP [d000000004350be0] cxl_adapter_context_unlock+0x60/0x80 [cxl]
[ 60.622856] LR [d000000004350bdc] cxl_adapter_context_unlock+0x5c/0x80 [cxl]
[ 60.622857] Call Trace:
[ 60.622863] [c0000003be10b8e0] [d000000004350bdc] cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
[ 60.622871] [c0000003be10b940] [d00000000435e810] cxl_configure_adapter+0x930/0x960 [cxl]
[ 60.622879] [c0000003be10b9f0] [d00000000435e88c] cxl_pci_slot_reset+0x4c/0x230 [cxl]
[ 60.622883] [c0000003be10baa0] [c000000000032cd4] eeh_report_reset+0x164/0x1a0
[ 60.622887] [c0000003be10bae0] [c000000000031220] eeh_pe_dev_traverse+0x90/0x170
[ 60.622890] [c0000003be10bb70] [c000000000033354] eeh_handle_normal_event+0x3d4/0x520
[ 60.622892] [c0000003be10bc20] [c000000000033624] eeh_handle_event+0x44/0x360
[ 60.622895] [c0000003be10bcd0] [c000000000033a58] eeh_event_handler+0x118/0x1d0
[ 60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
[ 60.622902] [c0000003be10be30] [c00000000000c0a0] ret_from_kernel_thread+0x5c/0xbc
[ 60.622903] Instruction dump:
[ 60.622905] 2f84ffff 4dfe0020 7c0802a6 7c8407b4 39200000 f8010010 f821ffa1 91230348
[ 60.622911] 3c620000 e8638070 48016639 e8410018 <0fe00000> 38210060 e8010010 7c0803a6
[ 60.622918] ---[ end trace d358551c9a007b4f ]---
[ 60.622959] cxl afu0.0: Activating AFU directed mode
[ 60.623097] EEH: Notify device driver to resume
Hi Andrew,

On 04/11/2016 at 07:27, Andrew Donnellan wrote:
> On 14/10/16 20:38, Vaibhav Jain wrote:
>> This patch prevents resetting the cxl adapter via sysfs in presence of
>> one or more active cxl_context on it.
>> [...]
>
> When I inject an EEH error, this patch causes the following WARN. Thoughts?

mmm, hard to see a relation with that patch. I couldn't reproduce
either. Could it bear any relation with the patch you're working on
(lspci called while the capi device is unconfigured)?

Fred
Frederic/Andrew,

Just recently this issue has been reported by system test without any
of the two patches you are suspecting - this patch nor the lspci patch.
I was hoping the lspci patch from Andrew can possibly solve it.
System test CQ is SW370625. The stack reported in that is same,

[ 5895.245959] EEH: PHB#2 failure detected, location: N/A
[ 5895.246078] CPU: 19 PID: 121774 Comm: lspci Not tainted 3.10.0-514.el7.ppc64le #1
[ 5895.246240] Call Trace:
[ 5895.246307] [c0000009f3707a60] [c000000000017ce0] show_stack+0x80/0x330 (unreliable)
[ 5895.246501] [c0000009f3707b10] [c0000000009b22f4] dump_stack+0x30/0x44
[ 5895.246665] [c0000009f3707b30] [c00000000003b9ac] eeh_dev_check_failure+0x21c/0x580
[ 5895.246855] [c0000009f3707bd0] [c0000000000879dc] pnv_pci_read_config+0xbc/0x160
[ 5895.247045] [c0000009f3707c10] [c000000000527d54] pci_user_read_config_dword+0x84/0x160
[ 5895.247233] [c0000009f3707c60] [c000000000547224] pci_read_config+0xf4/0x2e0
[ 5895.247398] [c0000009f3707ce0] [c0000000003efb3c] read+0x10c/0x2a0
[ 5895.247561] [c0000009f3707da0] [c00000000031d160] vfs_read+0x110/0x290
[ 5895.247726] [c0000009f3707de0] [c00000000031ec70] SyS_pread64+0xb0/0xd0

Uma Krishnan
On 05/11/16 00:15, Uma Krishnan wrote:
> Frederic/Andrew,
>
> Just recently this issue has been reported by system test without any
> of the two patches you are suspecting - this patch nor the lspci patch.
> I was hoping the lspci patch from Andrew can possibly solve it.
> System test CQ is SW370625. The stack reported in that is same,
>
> [ 5895.245959] EEH: PHB#2 failure detected, location: N/A
> [ 5895.246078] CPU: 19 PID: 121774 Comm: lspci Not tainted 3.10.0-514.el7.ppc64le #1
> [ 5895.246240] Call Trace:
> [ 5895.246307] [c0000009f3707a60] [c000000000017ce0] show_stack+0x80/0x330 (unreliable)
> [ 5895.246501] [c0000009f3707b10] [c0000000009b22f4] dump_stack+0x30/0x44
> [ 5895.246665] [c0000009f3707b30] [c00000000003b9ac] eeh_dev_check_failure+0x21c/0x580
> [ 5895.246855] [c0000009f3707bd0] [c0000000000879dc] pnv_pci_read_config+0xbc/0x160
> [ 5895.247045] [c0000009f3707c10] [c000000000527d54] pci_user_read_config_dword+0x84/0x160
> [ 5895.247233] [c0000009f3707c60] [c000000000547224] pci_read_config+0xf4/0x2e0
> [ 5895.247398] [c0000009f3707ce0] [c0000000003efb3c] read+0x10c/0x2a0
> [ 5895.247561] [c0000009f3707da0] [c00000000031d160] vfs_read+0x110/0x290
> [ 5895.247726] [c0000009f3707de0] [c00000000031ec70] SyS_pread64+0xb0/0xd0

This isn't a WARN - this stack trace is printed explicitly by the EEH
code in the case of a PHB failure. arch/powerpc/kernel/eeh.c, line 403.

Andrew
On 04/11/16 23:07, Frederic Barrat wrote:
>> When I inject an EEH error, this patch causes the following WARN.
>> Thoughts?
>
> mmm, hard to see a relation with that patch. I couldn't reproduce
> either. Could it bear any relation with the patch you're working on
> (lspci called while the capi device is unconfigured)?

No, this was without any other patches...

>> [ 60.593116] pci 0000:01 : [PE# 000] Switching PHB to CXL
>> [ 60.622727] Adapter context unlocked with 0 active contexts
>> [ 60.622762] ------------[ cut here ]------------
>> [ 60.622771] WARNING: CPU: 12 PID: 627 at ../drivers/misc/cxl/main.c:325 cxl_adapter_context_unlock+0x60/0x80 [cxl]
>> [...]
>> [ 60.622857] Call Trace:
>> [ 60.622863] [c0000003be10b8e0] [d000000004350bdc] cxl_adapter_context_unlock+0x5c/0x80 [cxl] (unreliable)
>> [ 60.622871] [c0000003be10b940] [d00000000435e810] cxl_configure_adapter+0x930/0x960 [cxl]
>> [ 60.622879] [c0000003be10b9f0] [d00000000435e88c] cxl_pci_slot_reset+0x4c/0x230 [cxl]
>> [ 60.622883] [c0000003be10baa0] [c000000000032cd4] eeh_report_reset+0x164/0x1a0
>> [ 60.622887] [c0000003be10bae0] [c000000000031220] eeh_pe_dev_traverse+0x90/0x170
>> [ 60.622890] [c0000003be10bb70] [c000000000033354] eeh_handle_normal_event+0x3d4/0x520
>> [ 60.622892] [c0000003be10bc20] [c000000000033624] eeh_handle_event+0x44/0x360
>> [ 60.622895] [c0000003be10bcd0] [c000000000033a58] eeh_event_handler+0x118/0x1d0
>> [ 60.622898] [c0000003be10bd80] [c0000000000babc8] kthread+0x108/0x130
>> [ 60.622902] [c0000003be10be30] [c00000000000c0a0] ret_from_kernel_thread+0x5c/0xbc
>> [...]
>> [ 60.622918] ---[ end trace d358551c9a007b4f ]---
>> [ 60.622959] cxl afu0.0: Activating AFU directed mode
>> [ 60.623097] EEH: Notify device driver to resume

That *definitely* looks related to this patch...

Andrew
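The trace shows cxl_adapter_context_unlock() reached from cxl_pci_slot_reset() via cxl_configure_adapter(), with the count reported as 0 rather than -1. A minimal userspace sketch of the unlock semantics stated in the patch's cxl.h comment ("Warn and force unlock otherwise") suggests why that combination would warn. The names below are hypothetical and this is a model under that assumption, not the driver's function body.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical userspace model; -1 means "locked for reset". */
struct adapter_model {
	atomic_int contexts_num;
};

/*
 * Release the lock (-1 -> 0) and return true if it was held.
 * Otherwise report the value found, force the count to 0 and
 * return false, modelling the "warn and force unlock" case.
 */
static bool model_context_unlock(struct adapter_model *a)
{
	int expected = -1;

	if (atomic_compare_exchange_strong(&a->contexts_num, &expected, 0))
		return true;

	/* expected now holds the value that was actually stored */
	fprintf(stderr, "Adapter context unlocked with %d active contexts\n",
		expected);
	atomic_store(&a->contexts_num, 0);
	return false;
}
```

In this model, calling unlock on an adapter whose count is already 0 takes the warning branch, which matches the "unlocked with 0 active contexts" message in the log. Whether the EEH recovery path really skips taking the lock first is an inference from the trace, not something confirmed in this thread.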
diff --git a/Documentation/ABI/testing/sysfs-class-cxl b/Documentation/ABI/testing/sysfs-class-cxl
index 4ba0a2a..640f65e 100644
--- a/Documentation/ABI/testing/sysfs-class-cxl
+++ b/Documentation/ABI/testing/sysfs-class-cxl
@@ -220,8 +220,11 @@ What:		/sys/class/cxl/<card>/reset
 Date:		October 2014
 Contact:	linuxppc-dev@lists.ozlabs.org
 Description:	write only
-		Writing 1 will issue a PERST to card which may cause the card
-		to reload the FPGA depending on load_image_on_perst.
+		Writing 1 will issue a PERST to card provided there are no
+		contexts active on any one of the card AFUs. This may cause
+		the card to reload the FPGA depending on load_image_on_perst.
+		Writing -1 will do a force PERST irrespective of any active
+		contexts on the card AFUs.
 Users:		https://github.com/ibm-capi/libcxl

 What:		/sys/class/cxl/<card>/perst_reloads_same_image (not in a guest)
diff --git a/drivers/misc/cxl/api.c b/drivers/misc/cxl/api.c
index f3d34b9..af23d7d 100644
--- a/drivers/misc/cxl/api.c
+++ b/drivers/misc/cxl/api.c
@@ -229,6 +229,14 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,
 	if (ctx->status == STARTED)
 		goto out; /* already started */

+	/*
+	 * Increment the mapped context count for adapter. This also checks
+	 * if adapter_context_lock is taken.
+	 */
+	rc = cxl_adapter_context_get(ctx->afu->adapter);
+	if (rc)
+		goto out;
+
 	if (task) {
 		ctx->pid = get_task_pid(task, PIDTYPE_PID);
 		ctx->glpid = get_task_pid(task->group_leader, PIDTYPE_PID);
@@ -240,6 +248,7 @@ int cxl_start_context(struct cxl_context *ctx, u64 wed,

 	if ((rc = cxl_ops->attach_process(ctx, kernel, wed, 0))) {
 		put_pid(ctx->pid);
+		cxl_adapter_context_put(ctx->afu->adapter);
 		cxl_ctx_put();
 		goto out;
 	}
diff --git a/drivers/misc/cxl/context.c b/drivers/misc/cxl/context.c
index c466ee2..5e506c1 100644
--- a/drivers/misc/cxl/context.c
+++ b/drivers/misc/cxl/context.c
@@ -238,6 +238,9 @@ int __detach_context(struct cxl_context *ctx)
 	put_pid(ctx->glpid);

 	cxl_ctx_put();
+
+	/* Decrease the attached context count on the adapter */
+	cxl_adapter_context_put(ctx->afu->adapter);
 	return 0;
 }

diff --git a/drivers/misc/cxl/cxl.h b/drivers/misc/cxl/cxl.h
index 01d372a..a144073 100644
--- a/drivers/misc/cxl/cxl.h
+++ b/drivers/misc/cxl/cxl.h
@@ -618,6 +618,14 @@ struct cxl {
 	bool perst_select_user;
 	bool perst_same_image;
 	bool psl_timebase_synced;
+
+	/*
+	 * number of contexts mapped on to this card. Possible values are:
+	 *	>0: Number of contexts mapped and new one can be mapped.
+	 *	 0: No active contexts and new ones can be mapped.
+	 *	-1: No contexts mapped and new ones cannot be mapped.
+	 */
+	atomic_t contexts_num;
 };

 int cxl_pci_alloc_one_irq(struct cxl *adapter);
@@ -944,4 +952,20 @@ bool cxl_pci_is_vphb_device(struct pci_dev *dev);

 /* decode AFU error bits in the PSL register PSL_SERR_An */
 void cxl_afu_decode_psl_serr(struct cxl_afu *afu, u64 serr);
+
+/*
+ * Increments the number of attached contexts on an adapter.
+ * In case an adapter_context_lock is taken the return -EBUSY.
+ */
+int cxl_adapter_context_get(struct cxl *adapter);
+
+/* Decrements the number of attached contexts on an adapter */
+void cxl_adapter_context_put(struct cxl *adapter);
+
+/* If no active contexts then prevents contexts from being attached */
+int cxl_adapter_context_lock(struct cxl *adapter);
+
+/* Unlock the contexts-lock if taken. Warn and force unlock otherwise */
+void cxl_adapter_context_unlock(struct cxl *adapter);
+
 #endif
diff --git a/drivers/misc/cxl/file.c b/drivers/misc/cxl/file.c
index 5fb9894..d0b421f 100644
--- a/drivers/misc/cxl/file.c
+++ b/drivers/misc/cxl/file.c
@@ -205,11 +205,22 @@ static long afu_ioctl_start_work(struct cxl_context *ctx,
 	ctx->pid = get_task_pid(current, PIDTYPE_PID);
 	ctx->glpid = get_task_pid(current->group_leader, PIDTYPE_PID);

+	/*
+	 * Increment the mapped context count for adapter. This also checks
+	 * if adapter_context_lock is taken.
+	 */
+	rc = cxl_adapter_context_get(ctx->afu->adapter);
+	if (rc) {
+		afu_release_irqs(ctx, ctx);
+		goto out;
+	}
+
 	trace_cxl_attach(ctx, work.work_element_descriptor, work.num_interrupts, amr);

 	if ((rc = cxl_ops->attach_process(ctx, false, work.work_element_descriptor,
 							amr))) {
 		afu_release_irqs(ctx, ctx);
+		cxl_adapter_context_put(ctx->afu->adapter);
 		goto out;
 	}

diff --git a/drivers/misc/cxl/guest.c b/drivers/misc/cxl/guest.c
index 9aa58a7..3e102cd 100644
--- a/drivers/misc/cxl/guest.c
+++ b/drivers/misc/cxl/guest.c
@@ -1152,6 +1152,9 @@ struct cxl *cxl_guest_init_adapter(struct device_node *np, struct platform_devic
 	if ((rc = cxl_sysfs_adapter_add(adapter)))
 		goto err_put1;

+	/* release the context lock as the adapter is configured */
+	cxl_adapter_context_unlock(adapter);
+
 	return adapter;

 err_put1:
diff --git a/drivers/misc/cxl/main.c b/drivers/misc/cxl/main.c
index d9be23b2..62e0dfb 100644
--- a/drivers/misc/cxl/main.c
+++ b/drivers/misc/cxl/main.c
@@ -243,8 +243,10 @@ struct cxl *cxl_alloc_adapter(void)
 	if (dev_set_name(&adapter->dev, "card%i", adapter->adapter_num))
 		goto err2;

-
return adapter; + /* start with context lock taken */ + atomic_set(&adapter->contexts_num, -1); + return adapter; err2: cxl_remove_adapter_nr(adapter); err1: @@ -286,6 +288,44 @@ int cxl_afu_select_best_mode(struct cxl_afu *afu) return 0; } +int cxl_adapter_context_get(struct cxl *adapter) +{ + int rc; + + rc = atomic_inc_unless_negative(&adapter->contexts_num); + return rc >= 0 ? 0 : -EBUSY; +} + +void cxl_adapter_context_put(struct cxl *adapter) +{ + atomic_dec_if_positive(&adapter->contexts_num); +} + +int cxl_adapter_context_lock(struct cxl *adapter) +{ + int rc; + /* no active contexts -> contexts_num == 0 */ + rc = atomic_cmpxchg(&adapter->contexts_num, 0, -1); + return rc ? -EBUSY : 0; +} + +void cxl_adapter_context_unlock(struct cxl *adapter) +{ + int val = atomic_cmpxchg(&adapter->contexts_num, -1, 0); + + /* + * contexts lock taken -> contexts_num == -1 + * If not true then show a warning and force reset the lock. + * This will happen when context_unlock was requested without + * doing a context_lock. 
+ */ + if (val != -1) { + atomic_set(&adapter->contexts_num, 0); + WARN(1, "Adapter context unlocked with %d active contexts", + val); + } +} + static int __init init_cxl(void) { int rc = 0; diff --git a/drivers/misc/cxl/pci.c b/drivers/misc/cxl/pci.c index 7afad84..e96be9c 100644 --- a/drivers/misc/cxl/pci.c +++ b/drivers/misc/cxl/pci.c @@ -1487,6 +1487,8 @@ static int cxl_configure_adapter(struct cxl *adapter, struct pci_dev *dev) if ((rc = cxl_native_register_psl_err_irq(adapter))) goto err; + /* Release the context lock as adapter is configured */ + cxl_adapter_context_unlock(adapter); return 0; err: diff --git a/drivers/misc/cxl/sysfs.c b/drivers/misc/cxl/sysfs.c index b043c20..a8b6d6a 100644 --- a/drivers/misc/cxl/sysfs.c +++ b/drivers/misc/cxl/sysfs.c @@ -75,12 +75,31 @@ static ssize_t reset_adapter_store(struct device *device, int val; rc = sscanf(buf, "%i", &val); - if ((rc != 1) || (val != 1)) + if ((rc != 1) || (val != 1 && val != -1)) return -EINVAL; - if ((rc = cxl_ops->adapter_reset(adapter))) - return rc; - return count; + /* + * See if we can lock the context mapping that's only allowed + * when there are no contexts attached to the adapter. Once + * taken this will also prevent any context from getting activated. + */ + if (val == 1) { + rc = cxl_adapter_context_lock(adapter); + if (rc) + goto out; + + rc = cxl_ops->adapter_reset(adapter); + /* In case reset failed release context lock */ + if (rc) + cxl_adapter_context_unlock(adapter); + + } else if (val == -1) { + /* Perform a forced adapter reset */ + rc = cxl_ops->adapter_reset(adapter); + } + +out: + return rc ? rc : count; } static ssize_t load_image_on_perst_show(struct device *device,
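For readers who want to poke at the counter protocol outside the kernel, here is a minimal userspace sketch of the behaviour the commit message describes: a context can attach only while the count is non-negative, a normal reset (value 1) needs the count to move from 0 to -1, and a forced reset (value -1) skips that check. It uses C11 atomics instead of atomic_t, reduces the actual reset to a flag, and every model_* name is hypothetical. It models the intended semantics of cxl_adapter_context_get() as stated in the commit message, not necessarily what the code in the diff returns.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the contexts_num member of struct cxl. */
static atomic_int contexts_num;

/* Intended behaviour of cxl_adapter_context_get(): fail once locked. */
static int model_get(void)
{
	int cur = atomic_load(&contexts_num);

	/* increment unless negative; a failed CAS reloads cur and retries */
	while (cur >= 0) {
		if (atomic_compare_exchange_weak(&contexts_num, &cur, cur + 1))
			return 0;
	}
	return -EBUSY;
}

/* Models cxl_adapter_context_put(): decrement only if positive. */
static void model_put(void)
{
	int cur = atomic_load(&contexts_num);

	while (cur > 0 &&
	       !atomic_compare_exchange_weak(&contexts_num, &cur, cur - 1))
		;
}

/* Models cxl_adapter_context_lock(): 0 -> -1, else -EBUSY. */
static int model_lock(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&contexts_num, &expected, -1)
		? 0 : -EBUSY;
}

/*
 * Models the decision in reset_adapter_store(). val == 1 needs the
 * lock, val == -1 skips the check. The lock stays held after a
 * successful normal reset; releasing it is left to reconfiguration.
 */
static int model_reset_store(int val, bool *did_reset)
{
	*did_reset = false;
	if (val != 1 && val != -1)
		return -EINVAL;
	if (val == 1 && model_lock())
		return -EBUSY;
	*did_reset = true;
	return 0;
}
```

With one context attached via model_get(), model_reset_store(1, ...) returns -EBUSY while model_reset_store(-1, ...) goes through. After model_put() the normal reset succeeds and leaves the count at -1, so a subsequent model_get() is refused.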