diff mbox series

[v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

Message ID 20220311025807.14664-1-sathyanarayanan.kuppuswamy@linux.intel.com
State New
Headers show
Series [v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly | expand

Commit Message

Kuppuswamy Sathyanarayanan March 11, 2022, 2:58 a.m. UTC
Currently the aer_irq() handler returns IRQ_NONE for cases without bits
PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
assumption is incorrect.

Consider a scenario where aer_irq() is triggered for a correctable
error, and while we process the error and before we clear the error
status in "Root Error Status" register, if the same kind of error
is triggered again, since aer_irq() only clears events it saw, the
multi-bit error is left in tact. This will cause the interrupt to fire
again, resulting in entering aer_irq() with just the multi-bit error
logged in the "Root Error Status" register.

Repeated AER recovery test has revealed this condition does happen
and this prevents any new interrupt from being triggered. Allow to
process interrupt even if only multi-correctable (BIT 1) or
multi-uncorrectable bit (BIT 3) is set.

Reported-by: Eric Badger <ebadger@purestorage.com>
Reviewed-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

Comments

Bjorn Helgaas March 13, 2022, 7:52 p.m. UTC | #1
On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> assumption is incorrect.
> 
> Consider a scenario where aer_irq() is triggered for a correctable
> error, and while we process the error and before we clear the error
> status in "Root Error Status" register, if the same kind of error
> is triggered again, since aer_irq() only clears events it saw, the
> multi-bit error is left in tact. This will cause the interrupt to fire
> again, resulting in entering aer_irq() with just the multi-bit error
> logged in the "Root Error Status" register.
> 
> Repeated AER recovery test has revealed this condition does happen
> and this prevents any new interrupt from being triggered. Allow to
> process interrupt even if only multi-correctable (BIT 1) or
> multi-uncorrectable bit (BIT 3) is set.
> 
> Reported-by: Eric Badger <ebadger@purestorage.com>

Is there a bug report with any concrete details (dmesg, lspci, etc)
that we can include here?
Ashok Raj March 13, 2022, 9:43 p.m. UTC | #2
On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > assumption is incorrect.
> > 
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error, and while we process the error and before we clear the error
> > status in "Root Error Status" register, if the same kind of error
> > is triggered again, since aer_irq() only clears events it saw, the
> > multi-bit error is left in tact. This will cause the interrupt to fire
> > again, resulting in entering aer_irq() with just the multi-bit error
> > logged in the "Root Error Status" register.
> > 
> > Repeated AER recovery test has revealed this condition does happen
> > and this prevents any new interrupt from being triggered. Allow to
> > process interrupt even if only multi-correctable (BIT 1) or
> > multi-uncorrectable bit (BIT 3) is set.
> > 
> > Reported-by: Eric Badger <ebadger@purestorage.com>
> 
> Is there a bug report with any concrete details (dmesg, lspci, etc)
> that we can include here?

Eric might have more details to add when he collected numerous logs to get
to the timeline of the problem. The test was to stress the links with an
automated power off, this will result in some eDPC UC error followed by
link down. The recovery worked fine for several cycles and suddenly there
were no more interrupts. A manual rescan on pci would probe and device is
operational again.

The test patch revealed we entered the aer_irq() with just the multi-error
PCI_ERR_ROOT_MULTI_COR_RCV or PCI_ERR_ROOT_MULTI_UNCOR_RCV, then we didn't
clear those bits causing interrupt generation to cease after that.

Cheers,
Ashok
Eric Badger March 14, 2022, 4:21 p.m. UTC | #3
On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > assumption is incorrect.
> > > 
> > > Consider a scenario where aer_irq() is triggered for a correctable
> > > error, and while we process the error and before we clear the error
> > > status in "Root Error Status" register, if the same kind of error
> > > is triggered again, since aer_irq() only clears events it saw, the
> > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > again, resulting in entering aer_irq() with just the multi-bit error
> > > logged in the "Root Error Status" register.
> > > 
> > > Repeated AER recovery test has revealed this condition does happen
> > > and this prevents any new interrupt from being triggered. Allow to
> > > process interrupt even if only multi-correctable (BIT 1) or
> > > multi-uncorrectable bit (BIT 3) is set.
> > > 
> > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > 
> > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > that we can include here?
> 
> Eric might have more details to add when he collected numerous logs to get
> to the timeline of the problem. The test was to stress the links with an
> automated power off, this will result in some eDPC UC error followed by
> link down. The recovery worked fine for several cycles and suddenly there
> were no more interrupts. A manual rescan on pci would probe and device is
> operational again.

The problem was originally discovered while performing a looping hot plug
test. At hot remove time, one or more corrected errors usually appeared:

[256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
[256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
[256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
[256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
[256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
[256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
[256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
[256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
[256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
[256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
[256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
[256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
[256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
[256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed

The problematic case arose when 2 corrected errors arrived in a sequence like this:

1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
   which now has value 0x1.
2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
   Error Status, which now has value 0x3.
4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
5. aer_irq() triggered again due to the second error, but, finding value 0x2
   in Root Error Status, takes no action. Future interrupts are now inhibited.
  
My observation on Intel Icelake is that a new AER interrupt will be generated
when one writes to Root Error Status but other bits remain set. I concluded
this based on testing with ACPI EINJ and a hack like this:

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..5c9bbbe7887b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1196,6 +1196,10 @@ static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
+
+	pci_dbg(pdev->port, "Root Error Status: %04x\n", e_src.status);
+	return IRQ_NONE;
+
 	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
 		return IRQ_NONE;
 
And then running these commands:

    # Prep injection data for a correctable error.
    $ cd /sys/kernel/debug/apei/einj
    $ echo 0x00000040 > error_type
    $ echo 0x4 > flags
    $ echo 0x891000 > param4
    
    # Root Error Status is initially clear
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0000
    
    # Inject one error
    $ echo 1 > error_inject
    
    # Interrupt received
    [  285.526275] pcieport 0000:89:02.0: AER: Root Error Status 0001
    
    # Inject another error
    $ echo 1 > error_inject
    
    # No interrupt received, but "multiple ERR_COR" is now set
    $ setpci -s 89:02.0 ECAP0001+0x30.w
    0003
    
    # Wait for a while, then clear ERR_COR. A new interrupt immediately fires.
    $ setpci -s 89:02.0 ECAP0001+0x30.w=0x1
    [  354.596748] pcieport 0000:89:02.0: AER: Root Error Status 0002


I've tried to track down some different hardware to confirm this behavior, but
haven't found any that can run this test.

My reading of the PCIe 5.0 spec, section "6.2.4.1.2 Interrupt Generation"
doesn't seem to describe the behavior I saw on Icelake.

	If a Root Port or Root Complex Event Collector is enabled for
	edge-triggered interrupt signaling using MSI or MSI-X, an interrupt
	message must be sent every time the logical AND of the following
	conditions transitions from FALSE to TRUE:
	...
    At least one Error Reporting Enable bit in the Root Error Command
    register and its associated error Messages Received bit in the Root
    Error Status register are both set to 1b.

This section of the spec seems to say that, if Root Error Status sees the
sequence of values described above (0x1->0x3->0x2), only one interrupt would
be generated, since there was no FALSE to TRUE transition at 0x3->0x2.  So you
would need something analogous to:

8edf5332c3934 ("PCI: pciehp: Fix MSI interrupt race")

However, this seems not to be the case for Icelake.

Cheers,
Eric
Bjorn Helgaas March 17, 2022, 10:59 p.m. UTC | #4
On Mon, Mar 14, 2022 at 09:21:46AM -0700, Eric Badger wrote:
> On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> > On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > > Currently the aer_irq() handler returns IRQ_NONE for cases without bits
> > > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV are set. But this
> > > > assumption is incorrect.
> > > > 
> > > > Consider a scenario where aer_irq() is triggered for a correctable
> > > > error, and while we process the error and before we clear the error
> > > > status in "Root Error Status" register, if the same kind of error
> > > > is triggered again, since aer_irq() only clears events it saw, the
> > > > multi-bit error is left in tact. This will cause the interrupt to fire
> > > > again, resulting in entering aer_irq() with just the multi-bit error
> > > > logged in the "Root Error Status" register.
> > > > 
> > > > Repeated AER recovery test has revealed this condition does happen
> > > > and this prevents any new interrupt from being triggered. Allow to
> > > > process interrupt even if only multi-correctable (BIT 1) or
> > > > multi-uncorrectable bit (BIT 3) is set.
> > > > 
> > > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > > 
> > > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > > that we can include here?
> > 
> > Eric might have more details to add when he collected numerous logs to get
> > to the timeline of the problem. The test was to stress the links with an
> > automated power off, this will result in some eDPC UC error followed by
> > link down. The recovery worked fine for several cycles and suddenly there
> > were no more interrupts. A manual rescan on pci would probe and device is
> > operational again.
> 
> The problem was originally discovered while performing a looping hot plug
> test. At hot remove time, one or more corrected errors usually appeared:
> 
> [256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
> [256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
> [256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
> [256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
> [256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
> [256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
> [256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
> [256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
> [256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
> [256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
> [256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
> [256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
> [256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed
> 
> The problematic case arose when 2 corrected errors arrived in a sequence like this:
> 
> 1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
>    which now has value 0x1.
> 2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
> 3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
>    Error Status, which now has value 0x3.
> 4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
> 5. aer_irq() triggered again due to the second error, but, finding value 0x2
>    in Root Error Status, takes no action. Future interrupts are now inhibited.

Thanks for the additional details!

After this patch, I guess aer_irq() still reads 0x2
(PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
PCI_ERR_ROOT_MULTI_COR_RCV.

In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
which probably contains either 0 or junk left over from being captured
when PCI_ERR_ROOT_COR_RCV was set.

And aer_irq() will queue an e_src record with status ==
PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
in status, aer_isr_one_error() will do nothing, right?

That might not be *terrible* and is definitely better than not being
able to handle future interrupts.  But we basically threw away the
information that multiple errors occurred, and we queued an e_src
record that occupies space without being used for anything.

Bjorn
Kuppuswamy Sathyanarayanan March 18, 2022, 8:28 p.m. UTC | #5
On 3/17/22 3:59 PM, Bjorn Helgaas wrote:
> Thanks for the additional details!
> 
> After this patch, I guess aer_irq() still reads 0x2
> (PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
> PCI_ERR_ROOT_MULTI_COR_RCV.
> 
> In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
> which probably contains either 0 or junk left over from being captured
> when PCI_ERR_ROOT_COR_RCV was set.
> 
> And aer_irq() will queue an e_src record with status ==
> PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
> in status, aer_isr_one_error() will do nothing, right?
> 
> That might not be*terrible*  and is definitely better than not being
> able to handle future interrupts.  But we basically threw away the
> information that multiple errors occurred, and we queued an e_src
> record that occupies space without being used for anything.

Yes, you are correct.  One other way to minimize this race window is to
clear the Root status register as soon as possible. Maybe we can move
it before source ID read as below.

--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1204,8 +1204,8 @@ static irqreturn_t aer_irq(int irq, void *context)
         if (!(e_src.status & AER_ERR_STATUS_MASK))
                 return IRQ_NONE;

-       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);
         pci_write_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, 
e_src.status);
+       pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id)
diff mbox series

Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..7952e5efd6cf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -101,6 +101,11 @@  struct aer_stats {
 #define ERR_COR_ID(d)			(d & 0xffff)
 #define ERR_UNCOR_ID(d)			(d >> 16)
 
+#define AER_ERR_STATUS_MASK		(PCI_ERR_ROOT_UNCOR_RCV |	\
+					PCI_ERR_ROOT_COR_RCV |		\
+					PCI_ERR_ROOT_MULTI_COR_RCV |	\
+					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
+
 static int pcie_aer_disable;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
 
@@ -1196,7 +1201,7 @@  static irqreturn_t aer_irq(int irq, void *context)
 	struct aer_err_source e_src = {};
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_STATUS, &e_src.status);
-	if (!(e_src.status & (PCI_ERR_ROOT_UNCOR_RCV|PCI_ERR_ROOT_COR_RCV)))
+	if (!(e_src.status & AER_ERR_STATUS_MASK))
 		return IRQ_NONE;
 
 	pci_read_config_dword(rp, aer + PCI_ERR_ROOT_ERR_SRC, &e_src.id);