[v2,1/3] powerpc/eeh: Ignore error handlers in eeh_pe_reset_and_recover()

Message ID 1461331687-1069-1-git-send-email-gwshan@linux.vnet.ibm.com (mailing list archive)
State Superseded

Commit Message

Gavin Shan April 22, 2016, 1:28 p.m. UTC
The function eeh_pe_reset_and_recover() is used to recover from EEH
errors when passthrough devices are transferred to the guest and
back, meaning the device's driver is vfio-pci or none.
When the driver is vfio-pci, which provides only the error_detected()
error handler, the handler simply stops the guest, which is not the
expected behaviour. On the other hand, no error handlers will
be called if we don't have a bound driver.

This patch ignores all error handlers provided by the device driver in
eeh_pe_reset_and_recover() to avoid that unexpected behaviour.

Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
Cc: stable@vger.kernel.org #v3.18+
Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Reviewed-by: Russell Currey <ruscur@russell.cc>
---
 arch/powerpc/kernel/eeh_driver.c | 11 +----------
 1 file changed, 1 insertion(+), 10 deletions(-)

Comments

David Gibson April 26, 2016, 5:29 a.m. UTC | #1
On Fri, Apr 22, 2016 at 11:28:02PM +1000, Gavin Shan wrote:
> The function eeh_pe_reset_and_recover() is used to recover EEH
> error when the passthrough device are transferred to guest and
> backwards, meaning the device's driver is vfio-pci or none.
> When the driver is vfio-pci that provides error_detected() error
> handler only, the handler simply stops the guest and it's not
> expected behaviour. On the other hand, no error handlers will
> be called if we don't have a bound driver.
> 
> This ignores all error handlers provided by device driver in
> eeh_pe_reset_and_recover() to avoid the exceptional behaviour.
> 
> Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
> Cc: stable@vger.kernel.org #v3.18+
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
> Reviewed-by: Russell Currey <ruscur@russell.cc>
> ---
>  arch/powerpc/kernel/eeh_driver.c | 11 +----------
>  1 file changed, 1 insertion(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index fb6207d..1c7d703 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -552,7 +552,7 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *pe,
>  
>  int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>  {
> -	int result, ret;
> +	int ret;
>  
>  	/* Bail if the PE is being recovered */
>  	if (pe->state & EEH_PE_RECOVERING)
> @@ -564,9 +564,6 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>  	/* Save states */
>  	eeh_pe_dev_traverse(pe, eeh_dev_save_state, NULL);
>  
> -	/* Report error */
> -	eeh_pe_dev_traverse(pe, eeh_report_error, &result);

Ok, so after chatting to Gavin, I've made sense of this.  The basic
thing here is that eeh_pe_reset_and_recover() should be discarding any
errors from before the reset, not reporting them - the whole point is
that we know things have gone bad, and we want to clear back to a good
state.

>  	/* Issue reset */
>  	ret = eeh_reset_pe(pe);
>  	if (ret) {
> @@ -581,15 +578,9 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>  		return ret;
>  	}
>  
> -	/* Notify completion of reset */
> -	eeh_pe_dev_traverse(pe, eeh_report_reset, &result);

However, it's not clear if removing the report of a reset makes sense.
There are no current users of reset notification IIUC, but if we're
going to remove the reset reporting, we should put that in a separate
patch with its own justification, and remove the other caller as well.

>  	/* Restore device state */
>  	eeh_pe_dev_traverse(pe, eeh_dev_restore_state, NULL);
>  
> -	/* Resume */
> -	eeh_pe_dev_traverse(pe, eeh_report_resume, NULL);

And I'm not sure if it makes sense to remove the resume notification either.

>  	/* Clear recovery mode */
>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
>
Gavin Shan April 26, 2016, 10:17 a.m. UTC | #2
On Tue, Apr 26, 2016 at 03:29:59PM +1000, David Gibson wrote:
>On Fri, Apr 22, 2016 at 11:28:02PM +1000, Gavin Shan wrote:
>> The function eeh_pe_reset_and_recover() is used to recover EEH
>> error when the passthrough device are transferred to guest and
>> backwards, meaning the device's driver is vfio-pci or none.
>> When the driver is vfio-pci that provides error_detected() error
>> handler only, the handler simply stops the guest and it's not
>> expected behaviour. On the other hand, no error handlers will
>> be called if we don't have a bound driver.
>> 
>> This ignores all error handlers provided by device driver in
>> eeh_pe_reset_and_recover() to avoid the exceptional behaviour.
>> 
>> Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
>> Cc: stable@vger.kernel.org #v3.18+
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>> Reviewed-by: Russell Currey <ruscur@russell.cc>
>> ---
>>  arch/powerpc/kernel/eeh_driver.c | 11 +----------
>>  1 file changed, 1 insertion(+), 10 deletions(-)
>> 
>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>> index fb6207d..1c7d703 100644
>> --- a/arch/powerpc/kernel/eeh_driver.c
>> +++ b/arch/powerpc/kernel/eeh_driver.c
>> @@ -552,7 +552,7 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *pe,
>>  
>>  int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>  {
>> -	int result, ret;
>> +	int ret;
>>  
>>  	/* Bail if the PE is being recovered */
>>  	if (pe->state & EEH_PE_RECOVERING)
>> @@ -564,9 +564,6 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>  	/* Save states */
>>  	eeh_pe_dev_traverse(pe, eeh_dev_save_state, NULL);
>>  
>> -	/* Report error */
>> -	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
>
>Ok, so after chatting to Gavin, I've made sense of this.  The basic
>thing here is that eeh_pe_reset_and_recover() should be discarding any
>errors from before the reset, not reporting them - the whole point is
>that we know things have gone bad, and we want to clear back to a good
>state.
>
>>  	/* Issue reset */
>>  	ret = eeh_reset_pe(pe);
>>  	if (ret) {
>> @@ -581,15 +578,9 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>  		return ret;
>>  	}
>>  
>> -	/* Notify completion of reset */
>> -	eeh_pe_dev_traverse(pe, eeh_report_reset, &result);
>
>However, it's not clear if removing the report of a reset makes sense.
>There are no current users of reset notification IIUC, but if we're
>going to remove the reset reporting, we should put that in a separate
>patch with its own justification, and remove the other caller as well.
>

Thanks, David. That makes sense to me. I will split it into two patches: one
removes the eeh_report_error notification and another removes the remaining
notification handlers.

>>  	/* Restore device state */
>>  	eeh_pe_dev_traverse(pe, eeh_dev_restore_state, NULL);
>>  
>> -	/* Resume */
>> -	eeh_pe_dev_traverse(pe, eeh_report_resume, NULL);
>
>And I'm not sure if it makes sense to remove the resume notification either.
>

Based on the offline talk, we either keep all notification handlers or remove
all of them. As we can't keep eeh_report_error, we have to remove all of them.

>>  	/* Clear recovery mode */
>>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
>>  
>
>-- 
>David Gibson			| I'll have my music baroque, and my code
>david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
>				| _way_ _around_!
>http://www.ozlabs.org/~dgibson
Gavin Shan April 27, 2016, 1:16 a.m. UTC | #3
On Tue, Apr 26, 2016 at 08:17:31PM +1000, Gavin Shan wrote:
>On Tue, Apr 26, 2016 at 03:29:59PM +1000, David Gibson wrote:
>>On Fri, Apr 22, 2016 at 11:28:02PM +1000, Gavin Shan wrote:
>>> The function eeh_pe_reset_and_recover() is used to recover EEH
>>> error when the passthrough device are transferred to guest and
>>> backwards, meaning the device's driver is vfio-pci or none.
>>> When the driver is vfio-pci that provides error_detected() error
>>> handler only, the handler simply stops the guest and it's not
>>> expected behaviour. On the other hand, no error handlers will
>>> be called if we don't have a bound driver.
>>> 
>>> This ignores all error handlers provided by device driver in
>>> eeh_pe_reset_and_recover() to avoid the exceptional behaviour.
>>> 
>>> Fixes: 5cfb20b9 ("powerpc/eeh: Emulate EEH recovery for VFIO devices")
>>> Cc: stable@vger.kernel.org #v3.18+
>>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>>> Reviewed-by: Russell Currey <ruscur@russell.cc>
>>> ---
>>>  arch/powerpc/kernel/eeh_driver.c | 11 +----------
>>>  1 file changed, 1 insertion(+), 10 deletions(-)
>>> 
>>> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
>>> index fb6207d..1c7d703 100644
>>> --- a/arch/powerpc/kernel/eeh_driver.c
>>> +++ b/arch/powerpc/kernel/eeh_driver.c
>>> @@ -552,7 +552,7 @@ static int eeh_clear_pe_frozen_state(struct eeh_pe *pe,
>>>  
>>>  int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>>  {
>>> -	int result, ret;
>>> +	int ret;
>>>  
>>>  	/* Bail if the PE is being recovered */
>>>  	if (pe->state & EEH_PE_RECOVERING)
>>> @@ -564,9 +564,6 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>>  	/* Save states */
>>>  	eeh_pe_dev_traverse(pe, eeh_dev_save_state, NULL);
>>>  
>>> -	/* Report error */
>>> -	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
>>
>>Ok, so after chatting to Gavin, I've made sense of this.  The basic
>>thing here is that eeh_pe_reset_and_recover() should be discarding any
>>errors from before the reset, not reporting them - the whole point is
>>that we know things have gone bad, and we want to clear back to a good
>>state.
>>
>>>  	/* Issue reset */
>>>  	ret = eeh_reset_pe(pe);
>>>  	if (ret) {
>>> @@ -581,15 +578,9 @@ int eeh_pe_reset_and_recover(struct eeh_pe *pe)
>>>  		return ret;
>>>  	}
>>>  
>>> -	/* Notify completion of reset */
>>> -	eeh_pe_dev_traverse(pe, eeh_report_reset, &result);
>>
>>However, it's not clear if removing the report of a reset makes sense.
>>There are no current users of reset notification IIUC, but if we're
>>going to remove the reset reporting, we should put that in a separate
>>patch with its own justification, and remove the other caller as well.
>>
>
>Thanks, David. It makes sense to me. I will split it into two: one removes
>eeh_report_error notification and another removes the left notification
>handlers.
>
>>>  	/* Restore device state */
>>>  	eeh_pe_dev_traverse(pe, eeh_dev_restore_state, NULL);
>>>  
>>> -	/* Resume */
>>> -	eeh_pe_dev_traverse(pe, eeh_report_resume, NULL);
>>
>>And I'm not sure if it makes sense to remove the resume notification either.
>>
>
>Based on the offline talk, we either keep all notification handlers or remove
>all of them. As we can't keep eeh_report_error, we have to remove all of them.
>

v3 was posted for further review. Please ignore this series.

>>>  	/* Clear recovery mode */
>>>  	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
>>>  
>>
>>-- 
>>David Gibson			| I'll have my music baroque, and my code
>>david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
>>				| _way_ _around_!
>>http://www.ozlabs.org/~dgibson
>
>

Patch

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index fb6207d..1c7d703 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -552,7 +552,7 @@  static int eeh_clear_pe_frozen_state(struct eeh_pe *pe,
 
 int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 {
-	int result, ret;
+	int ret;
 
 	/* Bail if the PE is being recovered */
 	if (pe->state & EEH_PE_RECOVERING)
@@ -564,9 +564,6 @@  int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 	/* Save states */
 	eeh_pe_dev_traverse(pe, eeh_dev_save_state, NULL);
 
-	/* Report error */
-	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
-
 	/* Issue reset */
 	ret = eeh_reset_pe(pe);
 	if (ret) {
@@ -581,15 +578,9 @@  int eeh_pe_reset_and_recover(struct eeh_pe *pe)
 		return ret;
 	}
 
-	/* Notify completion of reset */
-	eeh_pe_dev_traverse(pe, eeh_report_reset, &result);
-
 	/* Restore device state */
 	eeh_pe_dev_traverse(pe, eeh_dev_restore_state, NULL);
 
-	/* Resume */
-	eeh_pe_dev_traverse(pe, eeh_report_resume, NULL);
-
 	/* Clear recovery mode */
 	eeh_pe_state_clear(pe, EEH_PE_RECOVERING);
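
For readers following the thread, the control flow that remains in
eeh_pe_reset_and_recover() after this patch can be sketched in userspace
as below. This is only a rough mock, not kernel code: struct mock_pe,
the mock_* helpers, the notify_drivers switch and the -5 error value are
all hypothetical stand-ins. The return value of the early bail-out and
the point where the recovering flag is set are assumptions as well,
since the hunks above do not show them. Passing notify_drivers=1 mimics
the old flow with its three driver notifications; passing 0 mimics the
patched flow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for EEH_PE_RECOVERING; not the kernel's value. */
#define MOCK_PE_RECOVERING 0x1

/* Hypothetical stand-in for struct eeh_pe, reduced to what the mock needs. */
struct mock_pe {
	int state;         /* bit flags, only MOCK_PE_RECOVERING is used */
	int reset_fails;   /* set non-zero to make the mock reset fail */
	int handler_calls; /* number of driver notifications delivered */
	int saved;         /* set once device state has been saved */
	int restored;      /* set once device state has been restored */
};

static void mock_save_state(struct mock_pe *pe)    { pe->saved = 1; }
static void mock_restore_state(struct mock_pe *pe) { pe->restored = 1; }
static void mock_notify(struct mock_pe *pe)        { pe->handler_calls++; }

/* Stand-in for eeh_reset_pe(); -5 is an arbitrary illustrative error. */
static int mock_reset(struct mock_pe *pe)
{
	return pe->reset_fails ? -5 : 0;
}

int mock_reset_and_recover(struct mock_pe *pe, int notify_drivers)
{
	int ret;

	assert(pe != NULL);

	/* Bail if the PE is being recovered (returning 0 is an assumption) */
	if (pe->state & MOCK_PE_RECOVERING)
		return 0;
	pe->state |= MOCK_PE_RECOVERING; /* assumed: set before saving state */

	/* Save states */
	mock_save_state(pe);
	if (notify_drivers)
		mock_notify(pe); /* "Report error", removed by the patch */

	/* Issue reset */
	ret = mock_reset(pe);
	if (ret) {
		pe->state &= ~MOCK_PE_RECOVERING;
		return ret;
	}
	if (notify_drivers)
		mock_notify(pe); /* "Notify completion of reset", removed */

	/* Restore device state */
	mock_restore_state(pe);
	if (notify_drivers)
		mock_notify(pe); /* "Resume", removed */

	/* Clear recovery mode */
	pe->state &= ~MOCK_PE_RECOVERING;
	return 0;
}
```

With notify_drivers set to 0 the mock saves state, resets, restores state
and clears the recovering flag without ever touching the driver
callbacks, which is the behaviour the commit message asks for when the
bound driver is vfio-pci or absent.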