[4/4] PCI/DPC: Print AER status in DPC event handling

Message ID 20171219210643.24615-4-keith.busch@intel.com
State Not Applicable
Delegated to: Bjorn Helgaas
Headers show
Series
  • [1/4] PCI/AER: Return approrpiate value when AER is not supported
Related show

Commit Message

Keith Busch Dec. 19, 2017, 9:06 p.m.
A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
which prevents the AER handler from reporting the details of the
error. This patch will have the DPC driver get the AER status registers
from the downstream port that detected the uncorrectable error, and
print out additional information.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

Comments

Sinan Kaya Dec. 21, 2017, 4:43 a.m. | #1
+Oza

On 12/19/2017 4:06 PM, Keith Busch wrote:
> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> which prevents the AER handler from reporting the details of the
> error. This patch will have the DPC driver get the AER status registers
> from the downstream port that detected the uncorrectable error, and
> print out additional information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---

Oza is doing some restructuring to unify DPC and AER error handling path per
feedback from Bjorn. It is almost done. He is testing it. 

Can this patch wait until you review his version? I'm thinking this could
be something that can be added to his series instead.

Coming back to this patch. The interrupt number for DPC and AER could be the
same or different. According to the spec, AER errors are always reported
regardless of DPC driver presence (see the famous flow chart).

If the interrupt IDs are the same for AER and DPC, your patch would introduce
double printing for AER errors.

Sinan
Keith Busch Dec. 21, 2017, 5:12 a.m. | #2
On Wed, Dec 20, 2017 at 11:43:07PM -0500, Sinan Kaya wrote:
> +Oza
> 
> On 12/19/2017 4:06 PM, Keith Busch wrote:
> > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > which prevents the AER handler from reporting the details of the
> > error. This patch will have the DPC driver get the AER status registers
> > from the downstream port that detected the uncorrectable error, and
> > print out additional information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> 
> Oza is doing some restructuring to unify DPC and AER error handling path per
> feedback from Bjorn. It is almost done. He is testing it. 
> 
> Can this patch wait until you review his version? I'm thinking this could
> be something that can be added to his series instead.

No problem.
 
> Coming back to this patch. The interrupt number for DPC and AER could be the
> same or different. 

Only if you're talking about root ports. DPC is also a feature of
switch downstream ports, which don't generate interrupts on AER events
(they don't have a Root Error Command register).

> According to the spec, AER errors are always reported
> regardless of DPC driver presence (see the famous flow chart).

> If the interrupt IDs are the same for AER and DPC, your patch would introduce
> double printing for AER errors.

The AER Uncorrectable Status Register of the detecting port would
indeed be set with the appropriate status if that's the type of error
that occured, but when DPC is enabled, the root port never observes an
ERR_FATAL/ERR_NONFATAL message required for it to get set the Root Error
Status Register. The Linux AER handler requires the Root Error Status
be set in order for it to print anything, so I don't think we're at risk
of double printing with this patch even if the root port is DPC capable.
Sinan Kaya Jan. 10, 2018, 3:52 p.m. | #3
Hi Keith,

On 12/21/2017 12:12 AM, Keith Busch wrote:
> On Wed, Dec 20, 2017 at 11:43:07PM -0500, Sinan Kaya wrote:
>> +Oza
>>
>> On 12/19/2017 4:06 PM, Keith Busch wrote:
>>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
>>> which prevents the AER handler from reporting the details of the
>>> error. This patch will have the DPC driver get the AER status registers
>>> from the downstream port that detected the uncorrectable error, and
>>> print out additional information.
>>>
>>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>>> ---
>>
>> Oza is doing some restructuring to unify DPC and AER error handling path per
>> feedback from Bjorn. It is almost done. He is testing it. 
>>
>> Can this patch wait until you review his version? I'm thinking this could
>> be something that can be added to his series instead.
> 
> No problem.
>  
>> Coming back to this patch. The interrupt number for DPC and AER could be the
>> same or different. 
> 
> Only if you're talking about root ports. DPC is also a feature of
> switch downstream ports, which don't generate interrupts on AER events
> (they don't have a Root Error Command register).
> 
>> According to the spec, AER errors are always reported
>> regardless of DPC driver presence (see the famous flow chart).
> 
>> If the interrupt IDs are the same for AER and DPC, your patch would introduce
>> double printing for AER errors.
> 
> The AER Uncorrectable Status Register of the detecting port would
> indeed be set with the appropriate status if that's the type of error
> that occured, but when DPC is enabled, the root port never observes an
> ERR_FATAL/ERR_NONFATAL message required for it to get set the Root Error
> Status Register. The Linux AER handler requires the Root Error Status
> be set in order for it to print anything, so I don't think we're at risk
> of double printing with this patch even if the root port is DPC capable.
> 

Coming back to this. Oza posted his patch that integrates DPC into PORTDRV similar
to AER driver. 

"[PATCH v3 0/4] Address error and recovery for AER and DPC"

We are looking for feedback on the series.

The idea in a nut shell is to collect all endpoint error recovery code into a
new file called pcie-err.c and then invoke callbacks before DPC driver shuts down
the currently active driver.

There is however, still one more problem that has not been tacked.

Since the AER status is set when we observe DPC event and nobody is clearing these
we won't observe another DPC event until somebody clears these. We can say that
we are resetting the endpoints as part of the DPC but we are not touching the
switch downstream port or the root port registers.

Somebody still needs to clear these in addition to printing whatever information
is available in the AER registers.

Do you agree?

Sinan
Bjorn Helgaas Jan. 12, 2018, 11:03 p.m. | #4
Hi Keith,

On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> which prevents the AER handler from reporting the details of the
> error. This patch will have the DPC driver get the AER status registers
> from the downstream port that detected the uncorrectable error, and
> print out additional information.
> 
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> index ef71a472592c..aefc4fbdcef0 100644
> --- a/drivers/pci/pcie/pcie-dpc.c
> +++ b/drivers/pci/pcie/pcie-dpc.c
> @@ -44,6 +44,7 @@ struct dpc_dev {
>  	int			cap_pos;
>  	bool			rp;
>  	u32			rp_pio_status;
> +	struct aer_err_info	info;
>  };
>  
>  static const char * const rp_pio_error_string[] = {
> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>  	struct pci_bus *parent = pdev->subordinate;
>  
> +	if (aer_get_device_error_info(pdev, &dpc->info))
> +		aer_print_error(pdev, &dpc->info);

I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
was generated by something below "pdev", and that's where the
interesting AER logging would have been done.

This patch suggests that the DPC port itself would have something
interesting logged in its AER capability.  Is that really true?  I see
in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
Reason and the Error Source ID from the Message, but I don't see
anything about logging anything in its AER registers.

>  	pci_lock_rescan_remove();
>  	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
>  					 bus_list) {
> @@ -275,6 +279,10 @@ static irqreturn_t dpc_irq(int irq, void *context)
>  		/* show RP PIO error detail information */
>  		if (reason == 3 && ext_reason == 0)
>  			dpc_process_rp_pio_error(dpc);
> +		if (reason == 2)
> +			dpc->info.severity = AER_FATAL;
> +		else
> +			dpc->info.severity = AER_NONFATAL;
>  
>  		schedule_work(&dpc->work);
>  	}
> -- 
> 2.13.6
>
Keith Busch Jan. 14, 2018, 1:35 a.m. | #5
On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> Hi Keith,
> 
> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > which prevents the AER handler from reporting the details of the
> > error. This patch will have the DPC driver get the AER status registers
> > from the downstream port that detected the uncorrectable error, and
> > print out additional information.
> > 
> > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > ---
> >  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> >  1 file changed, 8 insertions(+)
> > 
> > diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> > index ef71a472592c..aefc4fbdcef0 100644
> > --- a/drivers/pci/pcie/pcie-dpc.c
> > +++ b/drivers/pci/pcie/pcie-dpc.c
> > @@ -44,6 +44,7 @@ struct dpc_dev {
> >  	int			cap_pos;
> >  	bool			rp;
> >  	u32			rp_pio_status;
> > +	struct aer_err_info	info;
> >  };
> >  
> >  static const char * const rp_pio_error_string[] = {
> > @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> >  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> >  	struct pci_bus *parent = pdev->subordinate;
> >  
> > +	if (aer_get_device_error_info(pdev, &dpc->info))
> > +		aer_print_error(pdev, &dpc->info);
> 
> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> was generated by something below "pdev", and that's where the
> interesting AER logging would have been done.
> 
> This patch suggests that the DPC port itself would have something
> interesting logged in its AER capability.  Is that really true?  I see
> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> Reason and the Error Source ID from the Message, but I don't see
> anything about logging anything in its AER registers.

If the trigger reason is "unmasked uncorrectable error", then the DPC
AER register has useful information. All other reasons would not provide
useful data in the AER register.


I would assume that aer_get_device_error_info would return false if
it's one of other reasons, but I can fix this up to check the trigger
reasoning rather than unconditionally reading the AER status
Sinan Kaya Jan. 15, 2018, 2:32 p.m. | #6
On 1/13/2018 8:35 PM, Keith Busch wrote:
> On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
>> Hi Keith,
>>
>> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
>>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
>>> which prevents the AER handler from reporting the details of the
>>> error. This patch will have the DPC driver get the AER status registers
>>> from the downstream port that detected the uncorrectable error, and
>>> print out additional information.
>>>
>>> Signed-off-by: Keith Busch <keith.busch@intel.com>
>>> ---
>>>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
>>>  1 file changed, 8 insertions(+)
>>>
>>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
>>> index ef71a472592c..aefc4fbdcef0 100644
>>> --- a/drivers/pci/pcie/pcie-dpc.c
>>> +++ b/drivers/pci/pcie/pcie-dpc.c
>>> @@ -44,6 +44,7 @@ struct dpc_dev {
>>>  	int			cap_pos;
>>>  	bool			rp;
>>>  	u32			rp_pio_status;
>>> +	struct aer_err_info	info;
>>>  };
>>>  
>>>  static const char * const rp_pio_error_string[] = {
>>> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
>>>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
>>>  	struct pci_bus *parent = pdev->subordinate;
>>>  
>>> +	if (aer_get_device_error_info(pdev, &dpc->info))
>>> +		aer_print_error(pdev, &dpc->info);
>>
>> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
>> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
>> was generated by something below "pdev", and that's where the
>> interesting AER logging would have been done.
>>
>> This patch suggests that the DPC port itself would have something
>> interesting logged in its AER capability.  Is that really true?  I see
>> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
>> Reason and the Error Source ID from the Message, but I don't see
>> anything about logging anything in its AER registers.
> 
> If the trigger reason is "unmasked uncorrectable error", then the DPC
> AER register has useful information. All other reasons would not provide
> useful data in the AER register.
> 
> 
> I would assume that aer_get_device_error_info would return false if
> it's one of other reasons, but I can fix this up to check the trigger
> reasoning rather than unconditionally reading the AER status
> 

https://pcisig.com/sites/default/files/specification_documents/ECN_DPC_2012-02-09_finalized.pdf

If you look at figure 6-2, AER errors are always logged regardless of the DPC
presence.

That's why, I was suggesting Keith that code should also clear the AER information
when such an event happens. Otherwise, same event won't trigger again.
Keith Busch Jan. 16, 2018, 2:47 a.m. | #7
On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> Since the AER status is set when we observe DPC event and nobody is clearing these
> we won't observe another DPC event until somebody clears these. We can say that
> we are resetting the endpoints as part of the DPC but we are not touching the
> switch downstream port or the root port registers.
> 
> Somebody still needs to clear these in addition to printing whatever information
> is available in the AER registers.
> 
> Do you agree?

We should clear the downstream port's AER uncorrectable status register if
the trigger reason is of that type, but DPC definitely does not require
we clear it in order to observe a new event.
Bjorn Helgaas Jan. 17, 2018, 12:14 a.m. | #8
On Sat, Jan 13, 2018 at 06:35:18PM -0700, Keith Busch wrote:
> On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> > Hi Keith,
> > 
> > On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> > > A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> > > which prevents the AER handler from reporting the details of the
> > > error. This patch will have the DPC driver get the AER status registers
> > > from the downstream port that detected the uncorrectable error, and
> > > print out additional information.
> > > 
> > > Signed-off-by: Keith Busch <keith.busch@intel.com>
> > > ---
> > >  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> > >  1 file changed, 8 insertions(+)
> > > 
> > > diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> > > index ef71a472592c..aefc4fbdcef0 100644
> > > --- a/drivers/pci/pcie/pcie-dpc.c
> > > +++ b/drivers/pci/pcie/pcie-dpc.c
> > > @@ -44,6 +44,7 @@ struct dpc_dev {
> > >  	int			cap_pos;
> > >  	bool			rp;
> > >  	u32			rp_pio_status;
> > > +	struct aer_err_info	info;
> > >  };
> > >  
> > >  static const char * const rp_pio_error_string[] = {
> > > @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> > >  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> > >  	struct pci_bus *parent = pdev->subordinate;
> > >  
> > > +	if (aer_get_device_error_info(pdev, &dpc->info))
> > > +		aer_print_error(pdev, &dpc->info);
> > 
> > I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> > Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> > was generated by something below "pdev", and that's where the
> > interesting AER logging would have been done.
> > 
> > This patch suggests that the DPC port itself would have something
> > interesting logged in its AER capability.  Is that really true?  I see
> > in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> > Reason and the Error Source ID from the Message, but I don't see
> > anything about logging anything in its AER registers.
> 
> If the trigger reason is "unmasked uncorrectable error", then the DPC
> AER register has useful information. All other reasons would not provide
> useful data in the AER register.

OK, that makes sense.  This "unmasked uncorrectable error" case would
be things like the Port receiving an Unsupported Request completion, a
Completion Timeout, Completer Abort, etc.  In that case, the DPC Port
should do its own AER logging before triggering DPC.

> I would assume that aer_get_device_error_info would return false if
> it's one of other reasons, but I can fix this up to check the trigger
> reasoning rather than unconditionally reading the AER status

It might be overkill, but it does niggle at me that in these other
cases, the hardware isn't telling us to read the AER logging, but we
do it anyway.

We currently pass no information to interrupt_event_handler(), so it
has no way of testing the trigger reason.  I guess you could do the
test in dpc_irq() before scheduling interrupt_event_handler()?

That would move it from the work item to the ISR, but I guess it's
essentially the same sort of code as dpc_process_rp_pio_error(), which
we already do in the ISR.

Bjorn
Bjorn Helgaas Jan. 17, 2018, 12:36 a.m. | #9
On Mon, Jan 15, 2018 at 09:32:59AM -0500, Sinan Kaya wrote:
> On 1/13/2018 8:35 PM, Keith Busch wrote:
> > On Fri, Jan 12, 2018 at 05:03:03PM -0600, Bjorn Helgaas wrote:
> >> Hi Keith,
> >>
> >> On Tue, Dec 19, 2017 at 02:06:43PM -0700, Keith Busch wrote:
> >>> A DPC enabled device will suppress sending ERR_FATAL and ERR_NONFATAL,
> >>> which prevents the AER handler from reporting the details of the
> >>> error. This patch will have the DPC driver get the AER status registers
> >>> from the downstream port that detected the uncorrectable error, and
> >>> print out additional information.
> >>>
> >>> Signed-off-by: Keith Busch <keith.busch@intel.com>
> >>> ---
> >>>  drivers/pci/pcie/pcie-dpc.c | 8 ++++++++
> >>>  1 file changed, 8 insertions(+)
> >>>
> >>> diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
> >>> index ef71a472592c..aefc4fbdcef0 100644
> >>> --- a/drivers/pci/pcie/pcie-dpc.c
> >>> +++ b/drivers/pci/pcie/pcie-dpc.c
> >>> @@ -44,6 +44,7 @@ struct dpc_dev {
> >>>  	int			cap_pos;
> >>>  	bool			rp;
> >>>  	u32			rp_pio_status;
> >>> +	struct aer_err_info	info;
> >>>  };
> >>>  
> >>>  static const char * const rp_pio_error_string[] = {
> >>> @@ -111,6 +112,9 @@ static void interrupt_event_handler(struct work_struct *work)
> >>>  	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
> >>>  	struct pci_bus *parent = pdev->subordinate;
> >>>  
> >>> +	if (aer_get_device_error_info(pdev, &dpc->info))
> >>> +		aer_print_error(pdev, &dpc->info);
> >>
> >> I'm confused about this.  "pdev" is the DPC-enabled Port, i.e., the
> >> Port that received an ERR_FATAL or ERR_NONFATAL Message.  The Message
> >> was generated by something below "pdev", and that's where the
> >> interesting AER logging would have been done.
> >>
> >> This patch suggests that the DPC port itself would have something
> >> interesting logged in its AER capability.  Is that really true?  I see
> >> in sec 6.2.10 (PCIe r4.0) that the DPC port should log the Trigger
> >> Reason and the Error Source ID from the Message, but I don't see
> >> anything about logging anything in its AER registers.
> > 
> > If the trigger reason is "unmasked uncorrectable error", then the DPC
> > AER register has useful information. All other reasons would not provide
> > useful data in the AER register.
> > 
> > 
> > I would assume that aer_get_device_error_info would return false if
> > it's one of other reasons, but I can fix this up to check the trigger
> > reasoning rather than unconditionally reading the AER status
> > 
> 
> https://pcisig.com/sites/default/files/specification_documents/ECN_DPC_2012-02-09_finalized.pdf
> 
> If you look at figure 6-2, AER errors are always logged regardless
> of the DPC presence.

Figure 6-2 applies to "Errors from Table 6-2, 6-3, or 6-4".
(I suspect it should also include Table 6-5 because PCIe r4.0,
sec 6.2.4 says Table 6-5 corresponds to AER status bits).

But in any case, those tables don't include receipt of ERR_FATAL or
ERR_NONFATAL messages.  So I don't think a DPC Port should be logging
anything in its own AER registers when it receives those messages.

If a DPC Port receives ERR_FATAL or ERR_NONFATAL while DPC is enabled,
it will trigger DPC and drop the message.  If DPC were not enabled, I
think it would merely forward the message toward the Root Complex
without logging an error itself.
Bjorn Helgaas Jan. 17, 2018, 12:56 a.m. | #10
On Mon, Jan 15, 2018 at 07:47:41PM -0700, Keith Busch wrote:
> On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> > Since the AER status is set when we observe DPC event and nobody
> > is clearing these we won't observe another DPC event until
> > somebody clears these. We can say that we are resetting the
> > endpoints as part of the DPC but we are not touching the switch
> > downstream port or the root port registers.
> > 
> > Somebody still needs to clear these in addition to printing
> > whatever information is available in the AER registers.
> 
> We should clear the downstream port's AER uncorrectable status
> register if the trigger reason is of that type, but DPC definitely
> does not require we clear it in order to observe a new event.

I don't quite understand this.  We don't need to clear the AER status
in order to observe a new DPC event, but I think we *do* need to clear
the AER status in order to log future AER events.  Don't we?

I think Sinan is saying that if a DPC Port observes its own unmasked
uncorrectable error (i.e., not something it learned about by receiving
an ERR_* message), it will set a bit in an AER status register.

Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
an ERR_COR Message, so the AER driver never learns about the error and
never clears the AER status register.  So we'll decode the AER status
(with your current patch), but we don't clear it, so if another error
occurs, the AER logging won't work correctly.

Maybe we should be setting DPC ERR_COR Enable and letting the AER
driver decode and clear the AER info?
Keith Busch Jan. 17, 2018, 1:34 a.m. | #11
On Tue, Jan 16, 2018 at 06:56:00PM -0600, Bjorn Helgaas wrote:
> On Mon, Jan 15, 2018 at 07:47:41PM -0700, Keith Busch wrote:
> > On Wed, Jan 10, 2018 at 10:52:23AM -0500, Sinan Kaya wrote:
> > > Since the AER status is set when we observe DPC event and nobody
> > > is clearing these we won't observe another DPC event until
> > > somebody clears these. We can say that we are resetting the
> > > endpoints as part of the DPC but we are not touching the switch
> > > downstream port or the root port registers.
> > > 
> > > Somebody still needs to clear these in addition to printing
> > > whatever information is available in the AER registers.
> > 
> > We should clear the downstream port's AER uncorrectable status
> > register if the trigger reason is of that type, but DPC definitely
> > does not require we clear it in order to observe a new event.
> 
> I don't quite understand this.  We don't need to clear the AER status
> in order to observe a new DPC event, but I think we *do* need to clear
> the AER status in order to log future AER events.  Don't we?

Absolutely, I totally agree on this. I have a v2 ready to send that calls
pci_cleanup_aer_uncorrect_error_status after printing the error. And
I've also made sure to do this only for the appropriate trigger reason
that you mentioned on the other thread.

> I think Sinan is saying that if a DPC Port observes its own unmasked
> uncorrectable error (i.e., not something it learned about by receiving
> an ERR_* message), it will set a bit in an AER status register.
> 
> Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
> an ERR_COR Message, so the AER driver never learns about the error and
> never clears the AER status register.  So we'll decode the AER status
> (with your current patch), but we don't clear it, so if another error
> occurs, the AER logging won't work correctly.
> 
> Maybe we should be setting DPC ERR_COR Enable and letting the AER
> driver decode and clear the AER info?

Yes, enabling ERR_COR should be no problem and will provide those
platform notification benefits.

That alone will not clear the uncorrectable status, though. The AER
driver will only check correctable status registers on that message. The
spec doesn't appear to define what we should find in the downstream
port's correctable status, so I can't tell if that will actually log
new information or if it's simply to assist platform notification that
something happened.

I'll post the v2 for consideration, and inlcude the ERR_COR enabling. This
does clash a bit with Oza's approach, but it should lower churn for the
near term.
Sinan Kaya Jan. 17, 2018, 1:36 p.m. | #12
On 1/16/2018 7:56 PM, Bjorn Helgaas wrote:
> I think Sinan is saying that if a DPC Port observes its own unmasked
> uncorrectable error (i.e., not something it learned about by receiving
> an ERR_* message), it will set a bit in an AER status register.
> 
> Since we do not set DPC ERR_COR Enable, the DPC Port does not generate
> an ERR_COR Message, so the AER driver never learns about the error and
> never clears the AER status register.  So we'll decode the AER status
> (with your current patch), but we don't clear it, so if another error
> occurs, the AER logging won't work correctly.

Yes, this is what I was meaning. I see that Keith took care of this in
the new series.

Patch

diff --git a/drivers/pci/pcie/pcie-dpc.c b/drivers/pci/pcie/pcie-dpc.c
index ef71a472592c..aefc4fbdcef0 100644
--- a/drivers/pci/pcie/pcie-dpc.c
+++ b/drivers/pci/pcie/pcie-dpc.c
@@ -44,6 +44,7 @@  struct dpc_dev {
 	int			cap_pos;
 	bool			rp;
 	u32			rp_pio_status;
+	struct aer_err_info	info;
 };
 
 static const char * const rp_pio_error_string[] = {
@@ -111,6 +112,9 @@  static void interrupt_event_handler(struct work_struct *work)
 	struct pci_dev *dev, *temp, *pdev = dpc->dev->port;
 	struct pci_bus *parent = pdev->subordinate;
 
+	if (aer_get_device_error_info(pdev, &dpc->info))
+		aer_print_error(pdev, &dpc->info);
+
 	pci_lock_rescan_remove();
 	list_for_each_entry_safe_reverse(dev, temp, &parent->devices,
 					 bus_list) {
@@ -275,6 +279,10 @@  static irqreturn_t dpc_irq(int irq, void *context)
 		/* show RP PIO error detail information */
 		if (reason == 3 && ext_reason == 0)
 			dpc_process_rp_pio_error(dpc);
+		if (reason == 2)
+			dpc->info.severity = AER_FATAL;
+		else
+			dpc->info.severity = AER_NONFATAL;
 
 		schedule_work(&dpc->work);
 	}