[PATCH] Partial revert of "PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices"

Message ID fa217fe562df74980febdd57e8e3361c2ece2ae2.camel@kernel.crashing.org
State Not Applicable
Delegated to: Bjorn Helgaas
Series [PATCH] Partial revert of "PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices"

Commit Message

Benjamin Herrenschmidt Aug. 20, 2018, 4:39 a.m. UTC
This partially reverts commit 7e9084b36740b2ec263ea35efb203001f755e1d8.

This only reverts the Documentation/PCI/pci-error-recovery.txt changes.

Those changes are incorrect: they change the documentation to adapt
to the (IMHO incorrect) AER implementation, and as a result it no
longer matches the EEH implementation.

I believe the policy originally described in this document is what
everybody should implement; the changes done by that commit would
compromise, among other things, the ability to recover from errors
on storage devices.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
 Documentation/PCI/pci-error-recovery.txt | 35 +++++++-----------------
 1 file changed, 10 insertions(+), 25 deletions(-)

Comments

Bjorn Helgaas Aug. 21, 2018, 7:50 p.m. UTC | #1
On Mon, Aug 20, 2018 at 02:39:04PM +1000, Benjamin Herrenschmidt wrote:
> This partially reverts commit 7e9084b36740b2ec263ea35efb203001f755e1d8.
> 
> This only reverts the Documentation/PCI/pci-error-recovery.txt changes.
> 
> Those changes are incorrect: they change the documentation to adapt
> to the (IMHO incorrect) AER implementation, and as a result it no
> longer matches the EEH implementation.
> 
> I believe the policy originally described in this document is what
> everybody should implement; the changes done by that commit would
> compromise, among other things, the ability to recover from errors
> on storage devices.

I think we should align EEH, AER, and DPC as much as possible,
including making this documentation match the code.

Because of its name, this file *looks* like it should match the code
in the PCI core, i.e., drivers/pci/...  So I think it would be
confusing to simply apply this revert without making a more direct
connection between this documentation and the powerpc-specific EEH
code.

If we can change AER & DPC to correspond to EEH, then I think it would
make sense to apply this revert along with those AER & DPC changes so
the documentation stays in step with the code.

Oza Pawandeep Aug. 22, 2018, 4:35 a.m. UTC | #2
On 2018-08-22 01:20, Bjorn Helgaas wrote:
> On Mon, Aug 20, 2018 at 02:39:04PM +1000, Benjamin Herrenschmidt wrote:
>> This partially reverts commit 7e9084b36740b2ec263ea35efb203001f755e1d8.
>> 
>> This only reverts the Documentation/PCI/pci-error-recovery.txt changes.
>> 
>> Those changes are incorrect: they change the documentation to adapt
>> to the (IMHO incorrect) AER implementation, and as a result it no
>> longer matches the EEH implementation.
>> 
>> I believe the policy originally described in this document is what
>> everybody should implement; the changes done by that commit would
>> compromise, among other things, the ability to recover from errors
>> on storage devices.
> 
> I think we should align EEH, AER, and DPC as much as possible,
> including making this documentation match the code.
> 
> Because of its name, this file *looks* like it should match the code
> in the PCI core, i.e., drivers/pci/...  So I think it would be
> confusing to simply apply this revert without making a more direct
> connection between this documentation and the powerpc-specific EEH
> code.
> 
> If we can change AER & DPC to correspond to EEH, then I think it would
> make sense to apply this revert along with those AER & DPC changes so
> the documentation stays in step with the code.

I agree with you, Bjorn, but it looks like we are far from reaching a
consensus on the behaviour.  The other thread [PCI/AER: prevent
pcie_do_fatal_recovery from using device after it is removed] has gone
on long enough and touched a lot of things all together, so I would
like to use this thread just to discuss the AER and DPC behaviour.

Let me put forward my proposal here.

1) Right now AER and DPC both call pcie_do_fatal_recovery().  I mainly
see DPC as an error handling and recovery agent rather than something
used for hotplug, so in my opinion AER and DPC should have the same
error handling and recovery mechanism.

   So if there is a way to figure out that, in the absence of pciehp,
DPC is being used to support hotplug, then we fall back to the original
DPC mechanism (which is to remove the devices); otherwise we fall back
to the drivers' callbacks (see the sketch after this item).

   Spec 6.7.5 Async Removal:
   "The Surprise Down error resulting from async removal may trigger
Downstream Port Containment (see Section 6.2.10).  Downstream Port
Containment following an async removal may be utilized to hold the Link
of a Downstream Port in the Disabled LTSSM state while host software
recovers from the side effects of an async removal."

   pcie_do_fatal_recovery() should take care of the above, so that we
support both error handling and async removal from the DPC point of
view.
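
   Something like the following, purely as a sketch of the proposed
split; every helper name in it (dpc_acting_as_hotplug(),
pcie_remove_devices(), pcie_recover_via_driver_callbacks()) is
hypothetical and does not exist in the kernel today:

    static void pcie_do_fatal_recovery_proposed(struct pci_dev *dev)
    {
            if (dpc_acting_as_hotplug(dev)) {
                    /*
                     * Hypothetical: no pciehp, and DPC is standing in
                     * for hotplug (e.g. containment after an async
                     * removal, spec 6.7.5), so keep the current
                     * behaviour and remove the devices.
                     */
                    pcie_remove_devices(dev);
            } else {
                    /*
                     * Error handling proper: recover through the
                     * driver callbacks, the same way the ERR_NONFATAL
                     * path does.
                     */
                    pcie_recover_via_driver_callbacks(dev);
            }
    }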


2) pci_channel_io_frozen is what the framework should use to
communicate to the driver that this is ERR_FATAL; it is then left to
the driver(s) to decide whether they want a reset of the link (SBR).
A minimal sketch of that decision follows.
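
   This is how a driver could make that decision through the existing
pci_error_handlers API; "foo" is just a placeholder driver name:

    static pci_ers_result_t foo_error_detected(struct pci_dev *pdev,
                                               enum pci_channel_state state)
    {
            if (state == pci_channel_io_perm_failure)
                    return PCI_ERS_RESULT_DISCONNECT;

            /* ERR_FATAL: the slot is frozen, ask for a link reset */
            if (state == pci_channel_io_frozen)
                    return PCI_ERS_RESULT_NEED_RESET;

            /* ERR_NONFATAL: try to recover without resetting the link */
            return PCI_ERS_RESULT_CAN_RECOVER;
    }

    static const struct pci_error_handlers foo_err_handlers = {
            .error_detected = foo_error_detected,
            /* .slot_reset and .resume would complete the recovery */
    };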

3) pcie_do_nonfatal_recovery(): we should actually reset the link, e.g.
via SBR, if the driver requests a reset of the link (there is already a
TODO note there anyway):
    if (status == PCI_ERS_RESULT_NEED_RESET) {
            /*
             * TODO: Should call platform-specific
             * functions to reset slot before calling
             * drivers' slot_reset callbacks?
             */
            status = broadcast_error_message(dev,
                            state,
                            "slot_reset",
                            report_slot_reset);
    }
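
   One possible way to fill that TODO; whether
pci_reset_bridge_secondary_bus() on the upstream bridge is the right
reset primitive here is an assumption on my part, and exactly the
point to be discussed:

    if (status == PCI_ERS_RESULT_NEED_RESET) {
            struct pci_dev *bridge = pci_upstream_bridge(dev);

            /* Assumption: SBR on the upstream bridge is the reset we want */
            if (bridge)
                    pci_reset_bridge_secondary_bus(bridge);

            status = broadcast_error_message(dev,
                            state,
                            "slot_reset",
                            report_slot_reset);
    }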


Regards,
Oza.



Patch

diff --git a/Documentation/PCI/pci-error-recovery.txt b/Documentation/PCI/pci-error-recovery.txt
index 688b69121e82..0b6bb3ef449e 100644
--- a/Documentation/PCI/pci-error-recovery.txt
+++ b/Documentation/PCI/pci-error-recovery.txt
@@ -110,7 +110,7 @@  The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
-STEP 0: Error Event: ERR_NONFATAL
+STEP 0: Error Event
 -------------------
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
@@ -228,7 +228,13 @@  proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Slot Reset
+STEP 3: Link Reset
+------------------
+The platform resets the link.  This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+STEP 4: Slot Reset
 ------------------
 
 In response to a return value of PCI_ERS_RESULT_NEED_RESET, the
@@ -314,7 +320,7 @@  Failure).
 >>> However, it probably should.
 
 
-STEP 4: Resume Operations
+STEP 5: Resume Operations
 -------------------------
 The platform will call the resume() callback on all affected device
 drivers if all drivers on the segment have returned
@@ -326,7 +332,7 @@  a result code.
 At this point, if a new error happens, the platform will restart
 a new error recovery sequence.
 
-STEP 5: Permanent Failure
+STEP 6: Permanent Failure
 -------------------------
 A "permanent failure" has occurred, and the platform cannot recover
 the device.  The platform will call error_detected() with a
@@ -349,27 +355,6 @@  errors. See the discussion in powerpc/eeh-pci-error-recovery.txt
 for additional detail on real-life experience of the causes of
 software errors.
 
-STEP 0: Error Event: ERR_FATAL
--------------------
-PCI bus error is detected by the PCI hardware. On powerpc, the slot is
-isolated, in that all I/O is blocked: all reads return 0xffffffff, all
-writes are ignored.
-
-STEP 1: Remove devices
---------------------
-Platform removes the devices depending on the error agent, it could be
-this port for all subordinates or upstream component (likely downstream
-port)
-
-STEP 2: Reset link
---------------------
-The platform resets the link.  This is a PCI-Express specific step and is
-done whenever a fatal error has been detected that can be "solved" by
-resetting the link.
-
-STEP 3: Re-enumerate the devices
---------------------
-Initiates the re-enumeration.
 
 Conclusion; General Remarks
 ---------------------------