Message ID | 20210618060446.7969-1-wesley.sheng@amd.com |
---|---|
State | New |
Series | Documentation: PCI: pci-error-recovery: rearrange the general sequence |
On Fri, Jun 18, 2021 at 4:05 PM Wesley Sheng <wesley.sheng@amd.com> wrote:
>
> Reset_link() callback function was called before mmio_enabled() in
> pcie_do_recovery() function actually, so rearrange the general
> sequence betwen step 2 and step 3 accordingly.

I don't think this is true in all cases. If pcie_do_recovery() is
called with state==pci_channel_io_normal (i.e. non-fatal AER) the link
won't be reset. EEH (ppc PCI error recovery thing) also uses
.mmio_enabled() as described.
On Fri, Jun 18, 2021 at 05:21:32PM +1000, Oliver O'Halloran wrote:
> On Fri, Jun 18, 2021 at 4:05 PM Wesley Sheng <wesley.sheng@amd.com> wrote:
> >
> > Reset_link() callback function was called before mmio_enabled() in
> > pcie_do_recovery() function actually, so rearrange the general
> > sequence betwen step 2 and step 3 accordingly.
>
> I don't think this is true in all cases. If pcie_do_recovery() is
> called with state==pci_channel_io_normal (i.e. non-fatal AER) the link
> won't be reset. EEH (ppc PCI error recovery thing) also uses
> .mmio_enabled() as described.

Yes, in the case of non-fatal AER, the reset_link() callback
(aer_root_reset() for AER and dpc_reset_link() for DPC) will not be
invoked. And if .error_detected() returns PCI_ERS_RESULT_CAN_RECOVER,
.mmio_enabled() is called next.

But if pcie_do_recovery() is called with state == pci_channel_io_frozen,
the reset_link() callback is called after .error_detected() but before
.mmio_enabled().

So I thought Step 2: MMIO Enabled and Step 3: Link Reset should swap
places in the sequence.
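The two orderings discussed in this exchange can be illustrated with a small user-space model. This is only a sketch under stated assumptions: it is not the kernel's pcie_do_recovery(), and every name in it (`model_recovery`, `trace_step`, the two enums) is hypothetical and introduced for illustration. It treats the drivers' answer from .error_detected() as a single aggregated value and covers only the three steps named in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's channel state and result codes. */
enum channel_state { CHANNEL_IO_NORMAL, CHANNEL_IO_FROZEN };
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET };

/* Append one step name to a comma-separated trace; drop it if it does not fit. */
void trace_step(char *trace, size_t size, const char *step)
{
    size_t used = strlen(trace);
    size_t need = strlen(step) + (used ? 1 : 0);

    if (used + need >= size)
        return;
    if (used)
        strcat(trace, ",");
    strcat(trace, step);
}

/*
 * Record the order of the steps discussed in the thread:
 *  - error_detected always comes first;
 *  - the link reset happens only for a frozen channel (fatal error);
 *  - mmio_enabled follows only if the drivers answered CAN_RECOVER.
 */
void model_recovery(enum channel_state state, enum ers_result detected,
                    char *trace, size_t size)
{
    assert(trace != NULL && size > 0);
    trace[0] = '\0';

    trace_step(trace, size, "error_detected");

    if (state == CHANNEL_IO_FROZEN)
        trace_step(trace, size, "link_reset");

    if (detected == ERS_CAN_RECOVER)
        trace_step(trace, size, "mmio_enabled");
}
```

In this model, a frozen channel with a CAN_RECOVER answer yields the trace `error_detected,link_reset,mmio_enabled`, while a normal channel yields `error_detected,mmio_enabled`. That is the asymmetry both messages point at: the reset precedes mmio_enabled only in the fatal case.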
Please make the subject a little more specific. "rearrange the
general sequence" doesn't say anything about what was affected.

On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> Reset_link() callback function was called before mmio_enabled() in
> pcie_do_recovery() function actually, so rearrange the general
> sequence betwen step 2 and step 3 accordingly.

s/betwen/between/

Not sure "general" adds anything in this sentence. "Step 2 and step
3" are not meaningful here in the commit log. It needs to spell out
what those steps are so the log makes sense by itself.

"reset_link" does not appear in pcie_do_recovery(). I'm guessing
you're referring to the "reset_subordinates" function pointer?

> Signed-off-by: Wesley Sheng <wesley.sheng@amd.com>

I didn't quite understand your response to Oliver, so I'll wait for
your corrections and his ack before proceeding.

> ---
>  Documentation/PCI/pci-error-recovery.rst | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
> index 187f43a03200..ac6a8729ef28 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
>  and prints an error to syslog. A reboot is then required to
>  get the device working again.
>
> -STEP 2: MMIO Enabled
> +STEP 2: Link Reset
> +------------------
> +The platform resets the link. This is a PCI-Express specific step
> +and is done whenever a fatal error has been detected that can be
> +"solved" by resetting the link.
> +
> +
> +STEP 3: MMIO Enabled
>  --------------------
>  The platform re-enables MMIO to the device (but typically not the
>  DMA), and then calls the mmio_enabled() callback on all affected
> @@ -197,8 +204,8 @@ information, if any, and eventually do things like trigger a device local
>  reset or some such, but not restart operations. This callback is made if
>  all drivers on a segment agree that they can try to recover and if no automatic
>  link reset was performed by the HW. If the platform can't just re-enable IOs
> -without a slot reset or a link reset, it will not call this callback, and
> -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> +without a slot reset, it will not call this callback, and
> +instead will have gone directly or STEP 4 (Slot Reset)

s/or/to/ ?

>  .. note::
>
> @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>  such an error might cause IOs to be re-blocked for the whole
>  segment, and thus invalidate the recovery that other devices
>  on the same segment might have done, forcing the whole segment
> - into one of the next states, that is, link reset or slot reset.
> + into next states, that is, slot reset.

s/into next states/into the next state/ ?

>  The driver should return one of the following result codes:
>    - PCI_ERS_RESULT_RECOVERED
> @@ -233,17 +240,11 @@ The driver should return one of the following result codes:
>
>  The next step taken depends on the results returned by the drivers.
>  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> +proceeds to STEP 5 (Resume Operations).
>
>  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
>  proceeds to STEP 4 (Slot Reset)
>
> -STEP 3: Link Reset
> -------------------
> -The platform resets the link. This is a PCI-Express specific step
> -and is done whenever a fatal error has been detected that can be
> -"solved" by resetting the link.
> -
>  STEP 4: Slot Reset
>  ------------------
>
> --
> 2.25.1
>
On Thu, Jul 01, 2021 at 05:22:31PM -0500, Bjorn Helgaas wrote:
> Please make the subject a little more specific. "rearrange the
> general sequence" doesn't say anything about what was affected.
>
> On Fri, Jun 18, 2021 at 02:04:46PM +0800, Wesley Sheng wrote:
> > Reset_link() callback function was called before mmio_enabled() in
> > pcie_do_recovery() function actually, so rearrange the general
> > sequence betwen step 2 and step 3 accordingly.
>
> s/betwen/between/
>
> Not sure "general" adds anything in this sentence. "Step 2 and step
> 3" are not meaningful here in the commit log. It needs to spell out
> what those steps are so the log makes sense by itself.
>
> "reset_link" does not appear in pcie_do_recovery(). I'm guessing
> you're referring to the "reset_subordinates" function pointer?
>

Yes, you are right. pcieaer-howto.rst has a section named "Provide
callbacks", where the callback supplied to pcie_do_recovery() is
referred to as reset_link.

> > Signed-off-by: Wesley Sheng <wesley.sheng@amd.com>
>
> I didn't quite understand your response to Oliver, so I'll wait for
> your corrections and his ack before proceeding.
>

OK. I thought step 2 (MMIO Enabled) and step 3 (Link Reset) should swap
places in the sequence.

> > ---
> >  Documentation/PCI/pci-error-recovery.rst | 23 ++++++++++++-----------
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> >
> > diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
> > index 187f43a03200..ac6a8729ef28 100644
> > --- a/Documentation/PCI/pci-error-recovery.rst
> > +++ b/Documentation/PCI/pci-error-recovery.rst
> > @@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
> >  and prints an error to syslog. A reboot is then required to
> >  get the device working again.
> >
> > -STEP 2: MMIO Enabled
> > +STEP 2: Link Reset
> > +------------------
> > +The platform resets the link. This is a PCI-Express specific step
> > +and is done whenever a fatal error has been detected that can be
> > +"solved" by resetting the link.
> > +
> > +
> > +STEP 3: MMIO Enabled
> >  --------------------
> >  The platform re-enables MMIO to the device (but typically not the
> >  DMA), and then calls the mmio_enabled() callback on all affected
> > @@ -197,8 +204,8 @@ information, if any, and eventually do things like trigger a device local
> >  reset or some such, but not restart operations. This callback is made if
> >  all drivers on a segment agree that they can try to recover and if no automatic
> >  link reset was performed by the HW. If the platform can't just re-enable IOs
> > -without a slot reset or a link reset, it will not call this callback, and
> > -instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> > +without a slot reset, it will not call this callback, and
> > +instead will have gone directly or STEP 4 (Slot Reset)
>
> s/or/to/ ?
>
> >  .. note::
> >
> > @@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
> >  such an error might cause IOs to be re-blocked for the whole
> >  segment, and thus invalidate the recovery that other devices
> >  on the same segment might have done, forcing the whole segment
> > - into one of the next states, that is, link reset or slot reset.
> > + into next states, that is, slot reset.
>
> s/into next states/into the next state/ ?
>
> >  The driver should return one of the following result codes:
> >    - PCI_ERS_RESULT_RECOVERED
> > @@ -233,17 +240,11 @@ The driver should return one of the following result codes:
> >
> >  The next step taken depends on the results returned by the drivers.
> >  If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
> > -proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
> > +proceeds to STEP 5 (Resume Operations).
> >
> >  If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
> >  proceeds to STEP 4 (Slot Reset)
> >
> > -STEP 3: Link Reset
> > -------------------
> > -The platform resets the link. This is a PCI-Express specific step
> > -and is done whenever a fatal error has been detected that can be
> > -"solved" by resetting the link.
> > -
> >  STEP 4: Slot Reset
> >  ------------------
> >
> > --
> > 2.25.1
> >
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 187f43a03200..ac6a8729ef28 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -184,7 +184,14 @@ is STEP 6 (Permanent Failure).
 and prints an error to syslog. A reboot is then required to
 get the device working again.
 
-STEP 2: MMIO Enabled
+STEP 2: Link Reset
+------------------
+The platform resets the link. This is a PCI-Express specific step
+and is done whenever a fatal error has been detected that can be
+"solved" by resetting the link.
+
+
+STEP 3: MMIO Enabled
 --------------------
 The platform re-enables MMIO to the device (but typically not the
 DMA), and then calls the mmio_enabled() callback on all affected
@@ -197,8 +204,8 @@ information, if any, and eventually do things like trigger a device local
 reset or some such, but not restart operations. This callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
-without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+without a slot reset, it will not call this callback, and
+instead will have gone directly or STEP 4 (Slot Reset)
 
 .. note::
 
@@ -210,7 +217,7 @@ instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
 such an error might cause IOs to be re-blocked for the whole
 segment, and thus invalidate the recovery that other devices
 on the same segment might have done, forcing the whole segment
- into one of the next states, that is, link reset or slot reset.
+ into next states, that is, slot reset.
 
 The driver should return one of the following result codes:
   - PCI_ERS_RESULT_RECOVERED
@@ -233,17 +240,11 @@ The driver should return one of the following result codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to STEP 5 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
-STEP 3: Link Reset
-------------------
-The platform resets the link. This is a PCI-Express specific step
-and is done whenever a fatal error has been detected that can be
-"solved" by resetting the link.
-
 STEP 4: Slot Reset
 ------------------
Reset_link() callback function was called before mmio_enabled() in
pcie_do_recovery() function actually, so rearrange the general
sequence betwen step 2 and step 3 accordingly.

Signed-off-by: Wesley Sheng <wesley.sheng@amd.com>
---
 Documentation/PCI/pci-error-recovery.rst | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)