diff mbox series

[v18,03/11] PCI/DPC: Fix DPC recovery issue in non hotplug case

Message ID 0e6f89cd6b9e4a72293cc90fafe93487d7c2d295.1585000084.git.sathyanarayanan.kuppuswamy@linux.intel.com
State New
Headers show
Series Add Error Disconnect Recover (EDR) support | expand

Commit Message

Kuppuswamy Sathyanarayanan March 24, 2020, 12:26 a.m. UTC
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

If hotplug is supported, during DPC event, hotplug
driver would remove the affected devices and detach
the drivers on DLLSC link down event and will
re-enumerate it once the DPC recovery is handled
and link comes back online (on DLLSC LINK up event).
Hence we don't depend on .mmio_enabled or .slot_reset
callbacks in error recovery handler to restore the
device.

But if hotplug is not supported/enabled, then we need
to let the error recovery handler attempt
the recovery of the devices using slot reset.

So if hotplug is not supported, then instead of
returning PCI_ERS_RESULT_RECOVERED, return
PCI_ERS_RESULT_NEED_RESET.

Also modify the way error recovery handler processes
the recovery value.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/dpc.c | 8 ++++++++
 drivers/pci/pcie/err.c | 5 +++--
 2 files changed, 11 insertions(+), 2 deletions(-)

Comments

Bjorn Helgaas March 24, 2020, 11:49 p.m. UTC | #1
On Mon, Mar 23, 2020 at 05:26:00PM -0700, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> If hotplug is supported, during DPC event, hotplug
> driver would remove the affected devices and detach
> the drivers on DLLSC link down event and will
> re-enumerate it once the DPC recovery is handled
> and link comes back online (on DLLSC LINK up event).
> Hence we don't depend on .mmio_enabled or .slot_reset
> callbacks in error recovery handler to restore the
> device.
> 
> But if hotplug is not supported/enabled, then we need
> to let the error recovery handler attempt
> the recovery of the devices using slot reset.
> 
> So if hotplug is not supported, then instead of
> returning PCI_ERS_RESULT_RECOVERED, return
> PCI_ERS_RESULT_NEED_RESET.
> 
> Also modify the way error recovery handler processes
> the recovery value.
> 
> Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/dpc.c | 8 ++++++++
>  drivers/pci/pcie/err.c | 5 +++--
>  2 files changed, 11 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index e06f42f58d3d..0e356ed0d73f 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -13,6 +13,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/init.h>
>  #include <linux/pci.h>
> +#include <linux/pci_hotplug.h>
>  
>  #include "portdrv.h"
>  #include "../pci.h"
> @@ -144,6 +145,13 @@ static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
>  	if (!pcie_wait_for_link(pdev, true))
>  		return PCI_ERS_RESULT_DISCONNECT;
>  
> +	/*
> +	 * If hotplug is not supported/enabled then let the device
> +	 * recover using slot reset.
> +	 */
> +	if (!hotplug_is_native(pdev))
> +		return PCI_ERS_RESULT_NEED_RESET;

I don't understand why hotplug is relevant here.  This path
(dpc_reset_link()) is only used for downstream ports that support DPC.
DPC has already disabled the link, which resets everything below the
port, regardless of whether the port supports hotplug.

I do see that PCI_ERS_RESULT_NEED_RESET seems to promise a lot more
than it actually *does*.  The doc (pci-error-recovery.rst) says
.error_detected() can return PCI_ERS_RESULT_NEED_RESET to *request* a
slot reset.  But if that happens, pcie_do_recovery() doesn't do a
reset at all.  It calls the driver's .slot_reset() method, which tells
the driver "we've reset your device; please re-initialize the
hardware."

I guess this abuses PCI_ERS_RESULT_NEED_RESET by taking advantage of
that implementation deficiency in pcie_do_recovery(): we know the
downstream devices have already been reset via DPC, and returning
PCI_ERS_RESULT_NEED_RESET means we'll call .slot_reset() to tell the
driver about that reset.

I can see how this achieves the desired result, but if/when we fix
pcie_do_recovery() to actually *do* the reset promised by
PCI_ERS_RESULT_NEED_RESET, we will be doing *two* resets: the first
via DPC and a second via whatever slot reset mechanism
pcie_do_recovery() would use.

So I guess the real issue (as you allude to in the commit log) is that
we rely on hotplug to unbind/rebind the driver, and without hotplug we
need to at least tell the driver the device was reset.

I'll try to expand the comment here so it reminds me what's going on
when we have to look at this again :)  Let me know if I'm on the right
track.

>  	return PCI_ERS_RESULT_RECOVERED;
>  }
>  
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 1ac57e9e1e71..6e52591a4722 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -178,7 +178,8 @@ static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
>  		return PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
> -	if (status != PCI_ERS_RESULT_RECOVERED) {
> +	if ((status != PCI_ERS_RESULT_RECOVERED) &&
> +	    (status != PCI_ERS_RESULT_NEED_RESET)) {
>  		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
>  			pci_name(dev));
>  		return PCI_ERS_RESULT_DISCONNECT;
> @@ -206,7 +207,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>  	if (state == pci_channel_io_frozen) {
>  		pci_walk_bus(bus, report_frozen_detected, &status);
>  		status = reset_link(dev, service);
> -		if (status != PCI_ERS_RESULT_RECOVERED)
> +		if (status == PCI_ERS_RESULT_DISCONNECT)
>  			goto failed;
>  	} else {
>  		pci_walk_bus(bus, report_normal_detected, &status);
> -- 
> 2.17.1
>
Kuppuswamy Sathyanarayanan March 25, 2020, 1:17 a.m. UTC | #2
Hi Bjorn,

On 3/24/20 4:49 PM, Bjorn Helgaas wrote:
> I don't understand why hotplug is relevant here.  This path
> (dpc_reset_link()) is only used for downstream ports that support DPC.
> DPC has already disabled the link, which resets everything below the
> port, regardless of whether the port supports hotplug.
> 
> I do see that PCI_ERS_RESULT_NEED_RESET seems to promise a lot more
> than it actually*does*.  The doc (pci-error-recovery.rst) says
> .error_detected() can return PCI_ERS_RESULT_NEED_RESET to*request*  a
> slot reset.  But if that happens, pcie_do_recovery() doesn't do a
> reset at all.  It calls the driver's .slot_reset() method, which tells
> the driver "we've reset your device; please re-initialize the
> hardware."
> 
> I guess this abuses PCI_ERS_RESULT_NEED_RESET by taking advantage of
> that implementation deficiency in pcie_do_recovery(): we know the
> downstream devices have already been reset via DPC, and returning
> PCI_ERS_RESULT_NEED_RESET means we'll call .slot_reset() to tell the
> driver about that reset.
> 
> I can see how this achieves the desired result, but if/when we fix
> pcie_do_recovery() to actually*do*  the reset promised by
> PCI_ERS_RESULT_NEED_RESET, we will be doing*two*  resets: the first
> via DPC and a second via whatever slot reset mechanism
> pcie_do_recovery() would use.
When we fix this issue, if we make sure the reset logic is
implemented before we call .reset_link callback we should be
able to avoid resetting the device twice. Before we call DPC
.reset_link callback, the device link will not up and hence we
should not able to reset it.
> 
> So I guess the real issue (as you allude to in the commit log) is that
> we rely on hotplug to unbind/rebind the driver, and without hotplug we
> need to at least tell the driver the device was reset.
Agree
> 
> I'll try to expand the comment here so it reminds me what's going on
> when we have to look at this again:)   Let me know if I'm on the right
> track.
Yes, your understanding is correct.
Bjorn Helgaas March 28, 2020, 5:10 p.m. UTC | #3
On Tue, Mar 24, 2020 at 06:17:44PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> On 3/24/20 4:49 PM, Bjorn Helgaas wrote:
> > I don't understand why hotplug is relevant here.  This path
> > (dpc_reset_link()) is only used for downstream ports that support DPC.
> > DPC has already disabled the link, which resets everything below the
> > port, regardless of whether the port supports hotplug.
> > 
> > I do see that PCI_ERS_RESULT_NEED_RESET seems to promise a lot more
> > than it actually *does*.  The doc (pci-error-recovery.rst) says
> > .error_detected() can return PCI_ERS_RESULT_NEED_RESET to *request* a
> > slot reset.  But if that happens, pcie_do_recovery() doesn't do a
> > reset at all.  It calls the driver's .slot_reset() method, which tells
> > the driver "we've reset your device; please re-initialize the
> > hardware."
> > 
> > I guess this abuses PCI_ERS_RESULT_NEED_RESET by taking advantage of
> > that implementation deficiency in pcie_do_recovery(): we know the
> > downstream devices have already been reset via DPC, and returning
> > PCI_ERS_RESULT_NEED_RESET means we'll call .slot_reset() to tell the
> > driver about that reset.
> > 
> > I can see how this achieves the desired result, but if/when we fix
> > pcie_do_recovery() to actually *do* the reset promised by
> > PCI_ERS_RESULT_NEED_RESET, we will be doing *two* resets: the first
> > via DPC and a second via whatever slot reset mechanism
> > pcie_do_recovery() would use.
>
> When we fix this issue, if we make sure the reset logic is
> implemented before we call .reset_link callback we should be
> able to avoid resetting the device twice. Before we call DPC
> .reset_link callback, the device link will not up and hence we
> should not able to reset it.
>
> > So I guess the real issue (as you allude to in the commit log) is that
> > we rely on hotplug to unbind/rebind the driver, and without hotplug we
> > need to at least tell the driver the device was reset.
>
> Agree
>
> > I'll try to expand the comment here so it reminds me what's going on
> > when we have to look at this again:)   Let me know if I'm on the right
> > track.
>
> Yes, your understanding is correct.

OK, thanks.  I'm still uncomfortable with this issue, so I think I'm
going to apply this series but omit this patch.  Here's why:

1) The fact that resets may cause hotplug events isn't specific to
DPC, so I don't think dpc_reset_link() is the right place.  For
instance, aer_root_reset() ultimately does a secondary bus reset.  The
pci_slot_reset() -> pciehp_reset_slot() path goes to some trouble to
ignore the resulting hotplug event, but the pci_bus_reset() path does
not.

2) I'm not convinced that "hotplug_is_native()" is the correct test.
Even if we're using ACPI hotplug (acpiphp), that will detach the
drivers and remove the devices, won't it?

I considered something like the patch below, which partly addresses my
first concern, but not the second.  Even the first one is awfully
messy because of the different ways the aer_root_reset() path can
work.


PCI/ERR: Skip driver callbacks if reset causes hotplug remove/add

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 1ac57e9e1e71..000551a06013 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -208,6 +208,18 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		status = reset_link(dev, service);
 		if (status != PCI_ERS_RESULT_RECOVERED)
 			goto failed;
+
+		/*
+		 * If pdev supports hotplug, a link reset causes a hotplug
+		 * remove event.  If we have a hotplug driver, it will
+		 * detach all drivers of downstream devices and remove the
+		 * devices, so we can't call any driver error recovery
+		 * callbacks.  Bringing the link back up causes a hotplug
+		 * add event, and the devices should be re-enumerated and
+		 * the drivers re-attached.
+		 */
+		if (hotplug_is_native(pdev))
+			goto succeeded;
 	} else {
 		pci_walk_bus(bus, report_normal_detected, &status);
 	}
@@ -224,7 +236,11 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 		 * functions to reset slot before calling
 		 * drivers' slot_reset callbacks?
 		 */
+		pci_warn(pdev, "driver requested reset, but that's not implemented\n");
 		status = PCI_ERS_RESULT_RECOVERED;
+	}
+
+	if (status == PCI_ERS_RESULT_RECOVERED) {
 		pci_dbg(dev, "broadcast slot_reset message\n");
 		pci_walk_bus(bus, report_slot_reset, &status);
 	}
@@ -235,6 +251,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	pci_dbg(dev, "broadcast resume message\n");
 	pci_walk_bus(bus, report_resume, &status);
 
+succeeded:
 	pci_aer_clear_device_status(dev);
 	pci_cleanup_aer_uncorrect_error_status(dev);
 	pci_info(dev, "device recovery successful\n");
Kuppuswamy Sathyanarayanan March 28, 2020, 10:04 p.m. UTC | #4
Hi Bjorn,

On 3/28/20 10:10 AM, Bjorn Helgaas wrote:
> On Tue, Mar 24, 2020 at 06:17:44PM -0700, Kuppuswamy, Sathyanarayanan wrote:
>> On 3/24/20 4:49 PM, Bjorn Helgaas wrote:
>>> I don't understand why hotplug is relevant here.  This path
>>> (dpc_reset_link()) is only used for downstream ports that support DPC.
>>> DPC has already disabled the link, which resets everything below the
>>> port, regardless of whether the port supports hotplug.
>>>
>>> I do see that PCI_ERS_RESULT_NEED_RESET seems to promise a lot more
>>> than it actually *does*.  The doc (pci-error-recovery.rst) says
>>> .error_detected() can return PCI_ERS_RESULT_NEED_RESET to *request* a
>>> slot reset.  But if that happens, pcie_do_recovery() doesn't do a
>>> reset at all.  It calls the driver's .slot_reset() method, which tells
>>> the driver "we've reset your device; please re-initialize the
>>> hardware."
>>>
>>> I guess this abuses PCI_ERS_RESULT_NEED_RESET by taking advantage of
>>> that implementation deficiency in pcie_do_recovery(): we know the
>>> downstream devices have already been reset via DPC, and returning
>>> PCI_ERS_RESULT_NEED_RESET means we'll call .slot_reset() to tell the
>>> driver about that reset.
>>>
>>> I can see how this achieves the desired result, but if/when we fix
>>> pcie_do_recovery() to actually *do* the reset promised by
>>> PCI_ERS_RESULT_NEED_RESET, we will be doing *two* resets: the first
>>> via DPC and a second via whatever slot reset mechanism
>>> pcie_do_recovery() would use.
>>
>> When we fix this issue, if we make sure the reset logic is
>> implemented before we call .reset_link callback we should be
>> able to avoid resetting the device twice. Before we call DPC
>> .reset_link callback, the device link will not up and hence we
>> should not able to reset it.
>>
>>> So I guess the real issue (as you allude to in the commit log) is that
>>> we rely on hotplug to unbind/rebind the driver, and without hotplug we
>>> need to at least tell the driver the device was reset.
>>
>> Agree
>>
>>> I'll try to expand the comment here so it reminds me what's going on
>>> when we have to look at this again:)   Let me know if I'm on the right
>>> track.
>>
>> Yes, your understanding is correct.
> 
> OK, thanks.  I'm still uncomfortable with this issue, so I think I'm
> going to apply this series but omit this patch.  Here's why:
> 
> 1) The fact that resets may cause hotplug events isn't specific to
> DPC, so I don't think dpc_reset_link() is the right place.  For
> instance, aer_root_reset() ultimately does a secondary bus reset. 
Agree. Reset is common for pci_channel_io_frozen errors. I did not
look into aer_root_reset() implementation. So if state
is pci_channel_io_frozen then we can assume the slot has been
reseted.
  The
> pci_slot_reset() -> pciehp_reset_slot() path goes to some trouble to
> ignore the resulting hotplug event, but the pci_bus_reset() path does
> not.
> 
> 2) I'm not convinced that "hotplug_is_native()" is the correct test.
> Even if we're using ACPI hotplug (acpiphp), that will detach the
> drivers and remove the devices, won't it?
Yes, agreed. It does not handle ACPI hotplug case. In case of
ACPI hotplug, native_pcie_hotplug = 0. May be we need a new helper
function. hotplug_is_enabled() ?
> 
> I considered something like the patch below, which partly addresses my
> first concern, but not the second.  Even the first one is awfully
> messy because of the different ways the aer_root_reset() path can
> work.
> 
> 
> PCI/ERR: Skip driver callbacks if reset causes hotplug remove/add
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 1ac57e9e1e71..000551a06013 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -208,6 +208,18 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>   		status = reset_link(dev, service);
>   		if (status != PCI_ERS_RESULT_RECOVERED)
>   			goto failed;
> +
> +		/*
> +		 * If pdev supports hotplug, a link reset causes a hotplug
> +		 * remove event.  If we have a hotplug driver, it will
> +		 * detach all drivers of downstream devices and remove the
> +		 * devices, so we can't call any driver error recovery
> +		 * callbacks.  Bringing the link back up causes a hotplug
> +		 * add event, and the devices should be re-enumerated and
> +		 * the drivers re-attached.
> +		 */
> +		if (hotplug_is_native(pdev))
> +			goto succeeded;
>   	} else {
>   		pci_walk_bus(bus, report_normal_detected, &status);
>   	}
> @@ -224,7 +236,11 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>   		 * functions to reset slot before calling
>   		 * drivers' slot_reset callbacks?
>   		 */
> +		pci_warn(pdev, "driver requested reset, but that's not implemented\n");
>   		status = PCI_ERS_RESULT_RECOVERED;
> +	}
> +
> +	if (status == PCI_ERS_RESULT_RECOVERED) {
Moving it outside status == PCI_ERS_NEED_RESET check will let it execute
in non frozen error as well. IIUC, we should not call it on all error
types. Let me know your comments.
>   		pci_dbg(dev, "broadcast slot_reset message\n");
>   		pci_walk_bus(bus, report_slot_reset, &status);
>   	}
> @@ -235,6 +251,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
>   	pci_dbg(dev, "broadcast resume message\n");
>   	pci_walk_bus(bus, report_resume, &status);
>   
> +succeeded:
>   	pci_aer_clear_device_status(dev);
>   	pci_cleanup_aer_uncorrect_error_status(dev);
>   	pci_info(dev, "device recovery successful\n");
>
Bjorn Helgaas March 28, 2020, 10:21 p.m. UTC | #5
On Sat, Mar 28, 2020 at 03:04:05PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> Hi Bjorn,
> 
> On 3/28/20 10:10 AM, Bjorn Helgaas wrote:
> > On Tue, Mar 24, 2020 at 06:17:44PM -0700, Kuppuswamy, Sathyanarayanan wrote:
> > > On 3/24/20 4:49 PM, Bjorn Helgaas wrote:
> > > > I don't understand why hotplug is relevant here.  This path
> > > > (dpc_reset_link()) is only used for downstream ports that support DPC.
> > > > DPC has already disabled the link, which resets everything below the
> > > > port, regardless of whether the port supports hotplug.
> > > > 
> > > > I do see that PCI_ERS_RESULT_NEED_RESET seems to promise a lot more
> > > > than it actually *does*.  The doc (pci-error-recovery.rst) says
> > > > .error_detected() can return PCI_ERS_RESULT_NEED_RESET to *request* a
> > > > slot reset.  But if that happens, pcie_do_recovery() doesn't do a
> > > > reset at all.  It calls the driver's .slot_reset() method, which tells
> > > > the driver "we've reset your device; please re-initialize the
> > > > hardware."
> > > > 
> > > > I guess this abuses PCI_ERS_RESULT_NEED_RESET by taking advantage of
> > > > that implementation deficiency in pcie_do_recovery(): we know the
> > > > downstream devices have already been reset via DPC, and returning
> > > > PCI_ERS_RESULT_NEED_RESET means we'll call .slot_reset() to tell the
> > > > driver about that reset.
> > > > 
> > > > I can see how this achieves the desired result, but if/when we fix
> > > > pcie_do_recovery() to actually *do* the reset promised by
> > > > PCI_ERS_RESULT_NEED_RESET, we will be doing *two* resets: the first
> > > > via DPC and a second via whatever slot reset mechanism
> > > > pcie_do_recovery() would use.
> > > 
> > > When we fix this issue, if we make sure the reset logic is
> > > implemented before we call .reset_link callback we should be
> > > able to avoid resetting the device twice. Before we call DPC
> > > .reset_link callback, the device link will not up and hence we
> > > should not able to reset it.
> > > 
> > > > So I guess the real issue (as you allude to in the commit log) is that
> > > > we rely on hotplug to unbind/rebind the driver, and without hotplug we
> > > > need to at least tell the driver the device was reset.
> > > 
> > > Agree
> > > 
> > > > I'll try to expand the comment here so it reminds me what's going on
> > > > when we have to look at this again:)   Let me know if I'm on the right
> > > > track.
> > > 
> > > Yes, your understanding is correct.
> > 
> > OK, thanks.  I'm still uncomfortable with this issue, so I think I'm
> > going to apply this series but omit this patch.  Here's why:
> > 
> > 1) The fact that resets may cause hotplug events isn't specific to
> > DPC, so I don't think dpc_reset_link() is the right place.  For
> > instance, aer_root_reset() ultimately does a secondary bus reset.
> Agree. Reset is common for pci_channel_io_frozen errors. I did not
> look into aer_root_reset() implementation. So if state
> is pci_channel_io_frozen then we can assume the slot has been
> reseted.
>  The
> > pci_slot_reset() -> pciehp_reset_slot() path goes to some trouble to
> > ignore the resulting hotplug event, but the pci_bus_reset() path does
> > not.
> > 
> > 2) I'm not convinced that "hotplug_is_native()" is the correct test.
> > Even if we're using ACPI hotplug (acpiphp), that will detach the
> > drivers and remove the devices, won't it?
> Yes, agreed. It does not handle ACPI hotplug case. In case of
> ACPI hotplug, native_pcie_hotplug = 0. May be we need a new helper
> function. hotplug_is_enabled() ?

I'm not proposing the patch below to be applied.  I only included it
as an idea of where the hotplug testing should be.

I'm proposing to merge the pci/edr branch as-is, without these two
patches:

  PCI: move {pciehp,shpchp}_is_native() definitions to pci.c
  PCI/DPC: Fix DPC recovery issue in non hotplug case

accepting that we still have some issues in the non-hotplug case that
we can fix later.

> > I considered something like the patch below, which partly addresses my
> > first concern, but not the second.  Even the first one is awfully
> > messy because of the different ways the aer_root_reset() path can
> > work.
> > 
> > 
> > PCI/ERR: Skip driver callbacks if reset causes hotplug remove/add
> > 
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 1ac57e9e1e71..000551a06013 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -208,6 +208,18 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> >   		status = reset_link(dev, service);
> >   		if (status != PCI_ERS_RESULT_RECOVERED)
> >   			goto failed;
> > +
> > +		/*
> > +		 * If pdev supports hotplug, a link reset causes a hotplug
> > +		 * remove event.  If we have a hotplug driver, it will
> > +		 * detach all drivers of downstream devices and remove the
> > +		 * devices, so we can't call any driver error recovery
> > +		 * callbacks.  Bringing the link back up causes a hotplug
> > +		 * add event, and the devices should be re-enumerated and
> > +		 * the drivers re-attached.
> > +		 */
> > +		if (hotplug_is_native(pdev))
> > +			goto succeeded;
> >   	} else {
> >   		pci_walk_bus(bus, report_normal_detected, &status);
> >   	}
> > @@ -224,7 +236,11 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> >   		 * functions to reset slot before calling
> >   		 * drivers' slot_reset callbacks?
> >   		 */
> > +		pci_warn(pdev, "driver requested reset, but that's not implemented\n");
> >   		status = PCI_ERS_RESULT_RECOVERED;
> > +	}
> > +
> > +	if (status == PCI_ERS_RESULT_RECOVERED) {
> Moving it outside status == PCI_ERS_NEED_RESET check will let it execute
> in non frozen error as well. IIUC, we should not call it on all error
> types. Let me know your comments.
> >   		pci_dbg(dev, "broadcast slot_reset message\n");
> >   		pci_walk_bus(bus, report_slot_reset, &status);
> >   	}
> > @@ -235,6 +251,7 @@ void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
> >   	pci_dbg(dev, "broadcast resume message\n");
> >   	pci_walk_bus(bus, report_resume, &status);
> > +succeeded:
> >   	pci_aer_clear_device_status(dev);
> >   	pci_cleanup_aer_uncorrect_error_status(dev);
> >   	pci_info(dev, "device recovery successful\n");
> >
Kuppuswamy Sathyanarayanan March 28, 2020, 10:40 p.m. UTC | #6
On 3/28/20 3:21 PM, Bjorn Helgaas wrote:
>>> OK, thanks.  I'm still uncomfortable with this issue, so I think I'm
>>> going to apply this series but omit this patch.  Here's why:
>>>
>>> 1) The fact that resets may cause hotplug events isn't specific to
>>> DPC, so I don't think dpc_reset_link() is the right place.  For
>>> instance, aer_root_reset() ultimately does a secondary bus reset.
>> Agree. Reset is common for pci_channel_io_frozen errors. I did not
>> look into aer_root_reset() implementation. So if state
>> is pci_channel_io_frozen then we can assume the slot has been
>> reseted.
>>   The
>>> pci_slot_reset() -> pciehp_reset_slot() path goes to some trouble to
>>> ignore the resulting hotplug event, but the pci_bus_reset() path does
>>> not.
>>>
>>> 2) I'm not convinced that "hotplug_is_native()" is the correct test.
>>> Even if we're using ACPI hotplug (acpiphp), that will detach the
>>> drivers and remove the devices, won't it?
>> Yes, agreed. It does not handle ACPI hotplug case. In case of
>> ACPI hotplug, native_pcie_hotplug = 0. May be we need a new helper
>> function. hotplug_is_enabled() ?
> I'm not proposing the patch below to be applied.  I only included it
> as an idea of where the hotplug testing should be.
> 
> I'm proposing to merge the pci/edr branch as-is, without these two
> patches:
> 
>    PCI: move {pciehp,shpchp}_is_native() definitions to pci.c
>    PCI/DPC: Fix DPC recovery issue in non hotplug case
> 
> accepting that we still have some issues in the non-hotplug case that
> we can fix later.
Ok. I am fine with it. Thanks for working on it.
>
diff mbox series

Patch

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index e06f42f58d3d..0e356ed0d73f 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -13,6 +13,7 @@ 
 #include <linux/interrupt.h>
 #include <linux/init.h>
 #include <linux/pci.h>
+#include <linux/pci_hotplug.h>
 
 #include "portdrv.h"
 #include "../pci.h"
@@ -144,6 +145,13 @@  static pci_ers_result_t dpc_reset_link(struct pci_dev *pdev)
 	if (!pcie_wait_for_link(pdev, true))
 		return PCI_ERS_RESULT_DISCONNECT;
 
+	/*
+	 * If hotplug is not supported/enabled then let the device
+	 * recover using slot reset.
+	 */
+	if (!hotplug_is_native(pdev))
+		return PCI_ERS_RESULT_NEED_RESET;
+
 	return PCI_ERS_RESULT_RECOVERED;
 }
 
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 1ac57e9e1e71..6e52591a4722 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -178,7 +178,8 @@  static pci_ers_result_t reset_link(struct pci_dev *dev, u32 service)
 		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
-	if (status != PCI_ERS_RESULT_RECOVERED) {
+	if ((status != PCI_ERS_RESULT_RECOVERED) &&
+	    (status != PCI_ERS_RESULT_NEED_RESET)) {
 		pci_printk(KERN_DEBUG, dev, "link reset at upstream device %s failed\n",
 			pci_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
@@ -206,7 +207,7 @@  void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	if (state == pci_channel_io_frozen) {
 		pci_walk_bus(bus, report_frozen_detected, &status);
 		status = reset_link(dev, service);
-		if (status != PCI_ERS_RESULT_RECOVERED)
+		if (status == PCI_ERS_RESULT_DISCONNECT)
 			goto failed;
 	} else {
 		pci_walk_bus(bus, report_normal_detected, &status);