diff mbox

[v2,3/8] powerpc/eeh: Force reset on fenced PHB

Message ID 1444276739-20372-4-git-send-email-gwshan@linux.vnet.ibm.com (mailing list archive)
State Accepted
Headers show

Commit Message

Gavin Shan Oct. 8, 2015, 3:58 a.m. UTC
On fenced PHB, the error handlers in the drivers of its subordinate
devices could return PCI_ERS_RESULT_CAN_RECOVER, indicating no reset
will be issued during the recovery. It's conflicting with the fact
that fenced PHB won't be recovered without reset.

This limits the return value from the error handlers in the drivers
of the fenced PHB's subordinate devices to PCI_ERS_RESULT_NEED_NONE
or PCI_ERS_RESULT_NEED_RESET, to ensure reset will be issued during
recovery.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/eeh_driver.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Daniel Axtens Oct. 13, 2015, 1:43 a.m. UTC | #1
Gavin Shan <gwshan@linux.vnet.ibm.com> writes:


> +	 *
> +	 * When the PHB is fenced, we have to issue a reset to recover from
> +	 * the error. Override the result if necessary to have partially
> +	 * hotplug for this case.
>  	 */
>  	pr_info("EEH: Notify device drivers to shutdown\n");
>  	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
> +	if ((pe->type & EEH_PE_PHB) &&
> +	    result != PCI_ERS_RESULT_NONE &&
> +	    result != PCI_ERS_RESULT_NEED_RESET)
> +		result = PCI_ERS_RESULT_NEED_RESET;
I think we shouldn't discard the DISCONNECT state. A driver could ask
that the device be disconnected in the error_detected callback and we
should probably honour that.

Regards,
Daniel

>  
>  	/* Get the current PCI slot state. This can take a long time,
>  	 * sometimes over 300 seconds for certain systems.
> -- 
> 2.1.0
>
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
Gavin Shan Oct. 13, 2015, 5:01 a.m. UTC | #2
On Tue, Oct 13, 2015 at 12:43:23PM +1100, Daniel Axtens wrote:
>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>
>> +	 *
>> +	 * When the PHB is fenced, we have to issue a reset to recover from
>> +	 * the error. Override the result if necessary to have partially
>> +	 * hotplug for this case.
>>  	 */
>>  	pr_info("EEH: Notify device drivers to shutdown\n");
>>  	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
>> +	if ((pe->type & EEH_PE_PHB) &&
>> +	    result != PCI_ERS_RESULT_NONE &&
>> +	    result != PCI_ERS_RESULT_NEED_RESET)
>> +		result = PCI_ERS_RESULT_NEED_RESET;
>I think we shouldn't discard the DISCONNECT state. A driver could ask
>that the device be disconnected in the error_detected callback and we
>should probably honour that.
>

Not exactly, the improvement is limited to fenced PHB, not frozen PE case.
That's ok to discard DISCONNECT which forces all PHB's subordinate devices
to offline permanently, which isn't so reasonable.

This flag (DISCONNECT) has been there before the partial hotplug is
added. I think the flag can die now with partial hotplug support.

Thanks,
Gavin

>>  
>>  	/* Get the current PCI slot state. This can take a long time,
>>  	 * sometimes over 300 seconds for certain systems.
>> -- 
>> 2.1.0
>>
>> _______________________________________________
>> Linuxppc-dev mailing list
>> Linuxppc-dev@lists.ozlabs.org
>> https://lists.ozlabs.org/listinfo/linuxppc-dev
Daniel Axtens Oct. 13, 2015, 5:18 a.m. UTC | #3
Gavin Shan <gwshan@linux.vnet.ibm.com> writes:

> On Tue, Oct 13, 2015 at 12:43:23PM +1100, Daniel Axtens wrote:
>>Gavin Shan <gwshan@linux.vnet.ibm.com> writes:
>>
>>> +	 *
>>> +	 * When the PHB is fenced, we have to issue a reset to recover from
>>> +	 * the error. Override the result if necessary to have partially
>>> +	 * hotplug for this case.
>>>  	 */
>>>  	pr_info("EEH: Notify device drivers to shutdown\n");
>>>  	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
>>> +	if ((pe->type & EEH_PE_PHB) &&
>>> +	    result != PCI_ERS_RESULT_NONE &&
>>> +	    result != PCI_ERS_RESULT_NEED_RESET)
>>> +		result = PCI_ERS_RESULT_NEED_RESET;
>>I think we shouldn't discard the DISCONNECT state. A driver could ask
>>that the device be disconnected in the error_detected callback and we
>>should probably honour that.
>>
>
> Not exactly, the improvement is limited to fenced PHB, not frozen PE case.
> That's ok to discard DISCONNECT which forces all PHB's subordinate devices
> to offline permanently, which isn't so reasonable.

I see. That makes more sense now. It's still a bit hacky but I can see
why it should be the way it is.

Reviewed-by: Daniel Axtens <dja@axtens.net>

>
> This flag (DISCONNECT) has been there before the partial hotplug is
> added. I think the flag can die now with partial hotplug support.

OK. It looks like I need to put some more thought into partial hotplug.
I'm increasingly thinking it would be worth redesigning the state
machine and the enumerations/flags to better deal with how things have
evolved over the years.

Regards,
Daniel


>
> Thanks,
> Gavin
>
>>>  
>>>  	/* Get the current PCI slot state. This can take a long time,
>>>  	 * sometimes over 300 seconds for certain systems.
>>> -- 
>>> 2.1.0
>>>
>>> _______________________________________________
>>> Linuxppc-dev mailing list
>>> Linuxppc-dev@lists.ozlabs.org
>>> https://lists.ozlabs.org/listinfo/linuxppc-dev
diff mbox

Patch

diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
index 32178a4..80dfe89 100644
--- a/arch/powerpc/kernel/eeh_driver.c
+++ b/arch/powerpc/kernel/eeh_driver.c
@@ -664,9 +664,17 @@  static void eeh_handle_normal_event(struct eeh_pe *pe)
 	 * to accomplish the reset.  Each child gets a report of the
 	 * status ... if any child can't handle the reset, then the entire
 	 * slot is dlpar removed and added.
+	 *
+	 * When the PHB is fenced, we have to issue a reset to recover from
+	 * the error. Override the result if necessary to have partially
+	 * hotplug for this case.
 	 */
 	pr_info("EEH: Notify device drivers to shutdown\n");
 	eeh_pe_dev_traverse(pe, eeh_report_error, &result);
+	if ((pe->type & EEH_PE_PHB) &&
+	    result != PCI_ERS_RESULT_NONE &&
+	    result != PCI_ERS_RESULT_NEED_RESET)
+		result = PCI_ERS_RESULT_NEED_RESET;
 
 	/* Get the current PCI slot state. This can take a long time,
 	 * sometimes over 300 seconds for certain systems.