diff mbox series

powerpc/eeh: Permanently disable the removed device

Message ID 20240405131420.998618-1-ganeshgr@linux.ibm.com (mailing list archive)
State New
Headers show
Series powerpc/eeh: Permanently disable the removed device | expand

Checks

Context Check Description
snowpatch_ozlabs/github-powerpc_selftests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_ppctests success Successfully ran 8 jobs.
snowpatch_ozlabs/github-powerpc_sparse success Successfully ran 4 jobs.

Commit Message

Ganesh Goudar April 5, 2024, 1:14 p.m. UTC
When a device is hot removed on powernv, the hotplug
driver clears the device's state. However, on pseries,
if a device is removed by phyp after reaching the error
threshold, the kernel remains unaware, leading to the
device not being torn down. This prevents necessary
remediation actions like failover.

Permanently disable the device if the presence check
fails.

Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
---
 arch/powerpc/kernel/eeh.c        | 4 +++-
 arch/powerpc/kernel/eeh_driver.c | 8 +++++++-
 2 files changed, 10 insertions(+), 2 deletions(-)

Comments

Michael Ellerman April 9, 2024, 9:07 a.m. UTC | #1
Hi Ganesh,

Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
> When a device is hot removed on powernv, the hotplug
> driver clears the device's state. However, on pseries,
> if a device is removed by phyp after reaching the error
> threshold, the kernel remains unaware, leading to the
> device not being torn down. This prevents necessary
> remediation actions like failover.
>
> Permanently disable the device if the presence check
> fails.

You can wrap your changelogs a bit wider, 70 or 80 columns is fine.

> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index ab316e155ea9..8d1606406d3f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -508,7 +508,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>  	 * state, PE is in good state.
>  	 */
>  	if ((ret < 0) ||
> -	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
> +	    (ret == EEH_STATE_NOT_SUPPORT &&
> +	     dev->error_state == pci_channel_io_perm_failure) ||
> +	    eeh_state_active(ret)) {
>  		eeh_stats.false_positives++;
>  		pe->false_positives++;
>  		rc = 0;

How does this hunk relate the changelog?

This is adding an extra condition to the false positive check, so
there's a risk this causes devices to go into failure when previously
they didn't, right? So please explain why it's a good change. The
comment above the if needs updating too.

> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 48773d2d9be3..10317badf471 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -867,7 +867,13 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>  	if (!devices) {
>  		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
>  			pe->phb->global_number, pe->addr);
> -		goto out; /* nothing to recover */

The other cases that go to recover_failed usually print something at
warn level, so this probably should too. So either make the above a
pr_warn(), or change it to a warn with a more helpful message.

> +		/*
> +		 * The device is removed, Tear down its state,
> +		 * On powernv hotplug driver would take care of
> +		 * it but not on pseries, Permanently disable the
> +		 * card as it is hot removed.
> +		 */

Formatting and punctuation is weird. It can be wider, and capital letter
is only required after a full stop, not a comma.

Also you say that the powernv hotplug driver "would" take care of it,
that's past tense, is that what you mean? Does the powernv hotplug
driver still take care of it after this change? And (how) does that
driver cope with it happening here also?

> +		goto recover_failed;
>  	}
>  	
cheers
Ganesh Goudar April 15, 2024, 6:49 a.m. UTC | #2
On 4/9/24 14:37, Michael Ellerman wrote:

> Hi Ganesh,
>
> Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
>> When a device is hot removed on powernv, the hotplug
>> driver clears the device's state. However, on pseries,
>> if a device is removed by phyp after reaching the error
>> threshold, the kernel remains unaware, leading to the
>> device not being torn down. This prevents necessary
>> remediation actions like failover.
>>
>> Permanently disable the device if the presence check
>> fails.
> You can wrap your changelogs a bit wider, 70 or 80 columns is fine.

ok

>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index ab316e155ea9..8d1606406d3f 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -508,7 +508,9 @@ int eeh_dev_check_failure(struct eeh_dev *edev)
>>   	 * state, PE is in good state.
>>   	 */
>>   	if ((ret < 0) ||
>> -	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
>> +	    (ret == EEH_STATE_NOT_SUPPORT &&
>> +	     dev->error_state == pci_channel_io_perm_failure) ||
>> +	    eeh_state_active(ret)) {
>>   		eeh_stats.false_positives++;
>>   		pe->false_positives++;
>>   		rc = 0;
> How does this hunk relate the changelog?
>
> This is adding an extra condition to the false positive check, so
> there's a risk this causes devices to go into failure when previously
> they didn't, right? So please explain why it's a good change. The
> comment above the if needs updating too.

We need this change to log the event and get the device removed, I will explain this
in commit message.

>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>> index 48773d2d9be3..10317badf471 100644
>> --- a/arch/powerpc/kernel/eeh_driver.c
>> +++ b/arch/powerpc/kernel/eeh_driver.c
>> @@ -867,7 +867,13 @@ void eeh_handle_normal_event(struct eeh_pe *pe)
>>   	if (!devices) {
>>   		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
>>   			pe->phb->global_number, pe->addr);
>> -		goto out; /* nothing to recover */
> The other cases that go to recover_failed usually print something at
> warn level, so this probably should too. So either make the above a
> pr_warn(), or change it to a warn with a more helpful message.

ok

>> +		/*
>> +		 * The device is removed, Tear down its state,
>> +		 * On powernv hotplug driver would take care of
>> +		 * it but not on pseries, Permanently disable the
>> +		 * card as it is hot removed.
>> +		 */
> Formatting and punctuation is weird. It can be wider, and capital letter
> is only required after a full stop, not a comma.

ok, i will take care of it.

> Also you say that the powernv hotplug driver "would" take care of it,
> that's past tense, is that what you mean? Does the powernv hotplug
> driver still take care of it after this change? And (how) does that
> driver cope with it happening here also?

Yes, hotplug driver can still remove the device and the removal of
device is covered by pci rescan lock.

>> +		goto recover_failed;
>>   	}
>>   	
> cheers
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index ab316e155ea9..8d1606406d3f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -508,7 +508,9 @@  int eeh_dev_check_failure(struct eeh_dev *edev)
 	 * state, PE is in good state.
 	 */
 	if ((ret < 0) ||
-	    (ret == EEH_STATE_NOT_SUPPORT) || eeh_state_active(ret)) {
+	    (ret == EEH_STATE_NOT_SUPPORT &&
+	     dev->error_state == pci_channel_io_perm_failure) ||
+	    eeh_state_active(ret)) {
 		eeh_stats.false_positives++;
 		pe->false_positives++;
 		rc = 0;
diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 48773d2d9be3..10317badf471 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -867,7 +867,13 @@  void eeh_handle_normal_event(struct eeh_pe *pe)
 	if (!devices) {
 		pr_debug("EEH: Frozen PHB#%x-PE#%x is empty!\n",
 			pe->phb->global_number, pe->addr);
-		goto out; /* nothing to recover */
+		/*
+		 * The device is removed, Tear down its state,
+		 * On powernv hotplug driver would take care of
+		 * it but not on pseries, Permanently disable the
+		 * card as it is hot removed.
+		 */
+		goto recover_failed;
 	}
 
 	/* Log the event */