[5/7] PCI/DPC: Print AER status in DPC event handling

Message ID 20180620213833.25072-5-keith.busch@intel.com
State New
Delegated to: Bjorn Helgaas
Series
  • [1/7] PCI/DPC: Leave interrupts enabled while handling event

Commit Message

Keith Busch June 20, 2018, 9:38 p.m.
A DPC-enabled device suppresses ERR_(NON)FATAL messages, preventing the
AER handler from reporting error details. If the DPC trigger reason says
the downstream port detected the error, have the DPC driver collect the
AER uncorrectable status for logging, then clear the status.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/dpc.c | 6 ++++++
 1 file changed, 6 insertions(+)

Comments

Oza Pawandeep June 21, 2018, 9:16 a.m. | #1
On 2018-06-21 03:08, Keith Busch wrote:
> A DPC enabled device suppresses ERR_(NON)FATAL messages, preventing the
> AER handler from reporting error details. If the DPC trigger reason says
> the downstream port detected the error, this patch has the DPC driver
> collect the AER uncorrectable status for logging, then clears the status.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/dpc.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 1b0b25ba947c..f6098dd171f3 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -6,6 +6,7 @@
>   * Copyright (C) 2016 Intel Corp.
>   */
> 
> +#include <linux/aer.h>
>  #include <linux/delay.h>
>  #include <linux/interrupt.h>
>  #include <linux/init.h>
> @@ -161,6 +162,7 @@ static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
> 
>  static void dpc_work(struct work_struct *work)
>  {
> +	struct aer_err_info info;
>  	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
>  	struct pci_dev *pdev = dpc->dev->port;
>  	struct device *dev = &dpc->dev->device;
> @@ -185,6 +187,10 @@ static void dpc_work(struct work_struct *work)
>  	/* show RP PIO error detail information */
>  	if (dpc->rp_extensions && reason == 3 && ext_reason == 0)
>  		dpc_process_rp_pio_error(dpc);
> +	else if (reason == 0 && aer_get_device_error_info(pdev, &info)) {
> +		aer_print_error(pdev, &info);
> +		pci_cleanup_aer_uncorrect_error_status(pdev);

6.2.10 for Downstream Port Containment:

   When DPC is triggered due to receipt of an uncorrectable error Message,
   the Requester ID from the Message is recorded in the DPC Error
   Source ID register and that Message is discarded and not forwarded
   Upstream. When DPC is triggered by an unmasked uncorrectable error,
   that error will not be signaled with an uncorrectable error Message,
   even if otherwise enabled.

The message is discarded and not forwarded upstream, which means that
we should not find AER status set in the RP or Switch. In other words, at
any given time we will find either DPC or AER triggered, but not both.
Then when DPC is triggered, why do we need to call
pci_cleanup_aer_uncorrect_error_status(pdev)?

Regards,
Oza.


> +	}
> 
>  	/* We configure DPC so it only triggers on ERR_FATAL */
>  	pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);
Keith Busch June 21, 2018, 2:05 p.m. | #2
On Thu, Jun 21, 2018 at 02:46:10PM +0530, poza@codeaurora.org wrote:
> On 2018-06-21 03:08, Keith Busch wrote:
> > @@ -185,6 +187,10 @@ static void dpc_work(struct work_struct *work)
> >  	/* show RP PIO error detail information */
> >  	if (dpc->rp_extensions && reason == 3 && ext_reason == 0)
> >  		dpc_process_rp_pio_error(dpc);
> > +	else if (reason == 0 && aer_get_device_error_info(pdev, &info)) {
> > +		aer_print_error(pdev, &info);
> > +		pci_cleanup_aer_uncorrect_error_status(pdev);
> 
> 6.2.10 for Downstream Port Containment:
> 
>   When DPC is triggered due to receipt of an uncorrectable error Message,
>   the Requester ID from the Message is recorded in the DPC Error
>   Source ID register and that Message is discarded and not forwarded
>   Upstream. When DPC is triggered by an unmasked uncorrectable error,
>   that error will not be signaled with an uncorrectable error Message,
>   even if otherwise enabled.
> 
> The message is discarded and not forwarded upstream, which means that
> we should not find AER status set in the RP or Switch. In other words, at
> any given time we will find either DPC or AER triggered, but not both.
> Then when DPC is triggered, why do we need to call
> pci_cleanup_aer_uncorrect_error_status(pdev)?

According to the sequence diagram in 6.2.5, an uncorrectable error has
the corresponding bits set in the Device Status and AER Uncorrectable
Error Status registers before DPC specifics are considered. DPC just
suppresses the ERR_[NON]FATAL messages, but the detecting port's AER
status, if implemented, should reflect what occurred.
Oza Pawandeep June 22, 2018, 5:25 a.m. | #3
On 2018-06-21 19:35, Keith Busch wrote:
> On Thu, Jun 21, 2018 at 02:46:10PM +0530, poza@codeaurora.org wrote:
>> On 2018-06-21 03:08, Keith Busch wrote:
>> > @@ -185,6 +187,10 @@ static void dpc_work(struct work_struct *work)
>> >  	/* show RP PIO error detail information */
>> >  	if (dpc->rp_extensions && reason == 3 && ext_reason == 0)
>> >  		dpc_process_rp_pio_error(dpc);
>> > +	else if (reason == 0 && aer_get_device_error_info(pdev, &info)) {
>> > +		aer_print_error(pdev, &info);
>> > +		pci_cleanup_aer_uncorrect_error_status(pdev);
>> 
>> 6.2.10 for Downstream Port Containment:
>> 
>>   When DPC is triggered due to receipt of an uncorrectable error Message,
>>   the Requester ID from the Message is recorded in the DPC Error
>>   Source ID register and that Message is discarded and not forwarded
>>   Upstream. When DPC is triggered by an unmasked uncorrectable error,
>>   that error will not be signaled with an uncorrectable error Message,
>>   even if otherwise enabled.
>> 
>> The message is discarded and not forwarded upstream, which means that
>> we should not find AER status set in the RP or Switch. In other words, at
>> any given time we will find either DPC or AER triggered, but not both.
>> Then when DPC is triggered, why do we need to call
>> pci_cleanup_aer_uncorrect_error_status(pdev)?
> 
> According to the sequence diagram in 6.2.5, an uncorrectable error has
> the corresponding bits set in the Device Status and AER Uncorrectable
> Error Status registers before DPC specifics are considered. DPC just
> suppresses the ERR_[NON]FATAL messages, but the detecting port's AER
> status, if implemented, should reflect what occurred.

Okay, I see. Thanks for clarifying it.

Regards,
Oza.
Oza Pawandeep June 22, 2018, 10:11 a.m. | #4
On 2018-06-21 19:35, Keith Busch wrote:
> On Thu, Jun 21, 2018 at 02:46:10PM +0530, poza@codeaurora.org wrote:
>> On 2018-06-21 03:08, Keith Busch wrote:
>> > @@ -185,6 +187,10 @@ static void dpc_work(struct work_struct *work)
>> >  	/* show RP PIO error detail information */
>> >  	if (dpc->rp_extensions && reason == 3 && ext_reason == 0)
>> >  		dpc_process_rp_pio_error(dpc);
>> > +	else if (reason == 0 && aer_get_device_error_info(pdev, &info)) {
>> > +		aer_print_error(pdev, &info);
>> > +		pci_cleanup_aer_uncorrect_error_status(pdev);
>> 
>> 6.2.10 for Downstream Port Containment:
>> 
>>   When DPC is triggered due to receipt of an uncorrectable error Message,
>>   the Requester ID from the Message is recorded in the DPC Error
>>   Source ID register and that Message is discarded and not forwarded
>>   Upstream. When DPC is triggered by an unmasked uncorrectable error,
>>   that error will not be signaled with an uncorrectable error Message,
>>   even if otherwise enabled.
>> 
>> The message is discarded and not forwarded upstream, which means that
>> we should not find AER status set in the RP or Switch. In other words, at
>> any given time we will find either DPC or AER triggered, but not both.
>> Then when DPC is triggered, why do we need to call
>> pci_cleanup_aer_uncorrect_error_status(pdev)?
> 
> According to the sequence diagram in 6.2.5, an uncorrectable error has
> the corresponding bits set in the Device Status and AER Uncorrectable
> Error Status registers before DPC specifics are considered. DPC just
> suppresses the ERR_[NON]FATAL messages, but the detecting port's AER
> status, if implemented, should reflect what occurred.

Hi Keith,

I was thinking that the current code in pcie_do_fatal_recovery() already
makes this call:

if ((service == PCIE_PORT_SERVICE_AER) &&
	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
		/*
		 * If the error is reported by a bridge, we think this error
		 * is related to the downstream link of the bridge, so we
		 * do error recovery on all subordinates of the bridge instead
		 * of the bridge and clear the error status of the bridge.
		 */
		pci_cleanup_aer_uncorrect_error_status(dev);
	}


Instead of calling it here in the DPC driver, can we make use of that
existing call? We would probably just need to remove the
(service == PCIE_PORT_SERVICE_AER) condition.

Regards,
Oza.
Keith Busch June 22, 2018, 2:10 p.m. | #5
On Fri, Jun 22, 2018 at 03:41:50PM +0530, poza@codeaurora.org wrote:
> I was thinking that the current code in pcie_do_fatal_recovery() already
> makes this call:
> 
> if ((service == PCIE_PORT_SERVICE_AER) &&
> 	    (dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
> 		/*
> 		 * If the error is reported by a bridge, we think this error
> 		 * is related to the downstream link of the bridge, so we
> 		 * do error recovery on all subordinates of the bridge instead
> 		 * of the bridge and clear the error status of the bridge.
> 		 */
> 		pci_cleanup_aer_uncorrect_error_status(dev);
> 	}
> 
> 
> Instead of calling it here in the DPC driver, can we make use of that
> existing call? We would probably just need to remove the
> (service == PCIE_PORT_SERVICE_AER) condition.

That's really only desirable when DPC error status is 0. It should be
harmless, though, so your update is fine with me.

Patch

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 1b0b25ba947c..f6098dd171f3 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -6,6 +6,7 @@ 
  * Copyright (C) 2016 Intel Corp.
  */
 
+#include <linux/aer.h>
 #include <linux/delay.h>
 #include <linux/interrupt.h>
 #include <linux/init.h>
@@ -161,6 +162,7 @@  static void dpc_process_rp_pio_error(struct dpc_dev *dpc)
 
 static void dpc_work(struct work_struct *work)
 {
+	struct aer_err_info info;
 	struct dpc_dev *dpc = container_of(work, struct dpc_dev, work);
 	struct pci_dev *pdev = dpc->dev->port;
 	struct device *dev = &dpc->dev->device;
@@ -185,6 +187,10 @@  static void dpc_work(struct work_struct *work)
 	/* show RP PIO error detail information */
 	if (dpc->rp_extensions && reason == 3 && ext_reason == 0)
 		dpc_process_rp_pio_error(dpc);
+	else if (reason == 0 && aer_get_device_error_info(pdev, &info)) {
+		aer_print_error(pdev, &info);
+		pci_cleanup_aer_uncorrect_error_status(pdev);
+	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
 	pcie_do_fatal_recovery(pdev, PCIE_PORT_SERVICE_DPC);