mbox series

[V2,0/9] Add RCEC handling to PCI/AER

Message ID 20200804194052.193272-1-sean.v.kelley@intel.com
Headers show
Series Add RCEC handling to PCI/AER | expand

Message

Kelley, Sean V Aug. 4, 2020, 7:40 p.m. UTC
From: Sean V Kelley <sean.v.kelley@linux.intel.com>

On the use of FLR on RCiEPs for the fatal case, still interested in more
feedback from the earlier discussion here [1]:

[1] https://lore.kernel.org/linux-pci/C21C050B-48B1-4429-B019-C81F3AB8E843@intel.com/

There is also the question of the absence of an FLR for non-fatal error.
If the device driver tells us that it needs "PCI_ERS_RESULT_NEED_RESET" by
the callback report_normal_detected() then we should try FLR on the device
as well.

On the use of variables with RP centric names such as the attributes
dev_attr_aer_rootport_total_err_..., one concern is the ripple effect on code
churn due to renaming. Open to suggestions, but trying to co-habitate so to
speak RCECs with RPs in the same drivers has trade-offs.

Changes since v1 [2]:

- Make PME capability of RCEC discoverable in get_port_device_capability().
- Replace the check on bnr with <= lastbusn in pcie_walk_rcec().
- Fix comment header for pcie_walk_rcec().
- Fix comment header for pci_walk_dev_affected().
- Fix spurious newline.
- Add sanity checks on dev->rcec.
- Use pci_dbg() in place of pci_info() for discovered RCiEPs.
- Remove AER RCEC AP FOUND message (accidently left in previously).
- Remove the check for RC_END from set_device_error_reporting() since
only Ports and RCECs are being passed.
(Jonathan Cameron)
- Fix the return type for flr_on_rciep().
(reported by lkp on DEC Alpha arch.)

[1] https://lore.kernel.org/linux-pci/20200724172223.145608-1-sean.v.kelley@intel.com/

Root Complex Event Collectors (RCEC) provide support for terminating error
and PME messages from Root Complex Integrated Endpoints (RCiEPs).  An RCEC
resides on a Bus in the Root Complex. Multiple RCECs can in fact reside on
a single bus. An RCEC will explicitly declare supported RCiEPs through the
Root Complex Endpoint Association Extended Capability.

(See PCIe 5.0-1, sections 1.3.2.3 (RCiEP), and 7.9.10 (RCEC Ext. Cap.))

The kernel lacks handling for these RCECs and the error messages received
from their respective associated RCiEPs. More recently, a new CPU
interconnect, Compute eXpress Link (CXL) depends on RCEC capabilities for
purposes of error messaging from CXL 1.1 supported RCiEP devices.

DocLink: https://www.computeexpresslink.org/

This use case is not limited to CXL. Existing hardware today includes
support for RCECs, such as the Denverton microserver product
family. Future hardware will be forthcoming.

(See Intel Document, Order number: 33061-003US)

So services such as AER or PME could be associated with an RCEC driver.
In the case of CXL, if an RCiEP (i.e., CXL 1.1 device) is associated with a
platform's RCEC it shall signal PME and AER error conditions through that
RCEC.

Towards the above use cases, add the missing RCEC class and extend the
PCIe Root Port and service drivers to allow association of RCiEPs to their
respective parent RCEC and facilitate handling of terminating error and PME
messages.


AER Test Results:
1) Inject a correctable error to the RCiEP 0000:e9:00.0
    Run ./aer_inject <a parameter file as below>:
    AER
    PCI_ID 0000:e9:00.0
    COR_STATUS BAD_TLP
    HEADER_LOG 0 1 2 3

    Log:
[   76.155963] pcieport 0000:e8:00.4: aer_inject: Injecting errors 00000040/00000000 into device 0000:e9:00.0
[   76.166966] pcieport 0000:e8:00.4: AER: Corrected error received: 0000:e9:00.0
[   76.175253] pci 0000:e9:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
[   76.185633] pci 0000:e9:00.0:   device [8086:4940] error status/mask=00000040/00002000
[   76.194604] pci 0000:e9:00.0:    [ 6] BadTLP

2) Inject a non-fatal error to the RCiEP 0000:e8:01.0
    Run ./aer_inject <a parameter file as below>:
    AER
    PCI_ID 0000:e8:01.0
    UNCOR_STATUS COMP_ABORT
    HEADER_LOG 0 1 2 3

    Log:
[  117.791854] pcieport 0000:e8:00.4: aer_inject: Injecting errors 00000000/00008000 into device 0000:e8:01.0
[  117.804244] pcieport 0000:e8:00.4: AER: Uncorrected (Non-Fatal) error received: 0000:e8:01.0
[  117.814652] igen6_edac 0000:e8:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Completer ID)
[  117.828511] igen6_edac 0000:e8:01.0:   device [8086:0b25] error status/mask=00008000/00100000
[  117.839189] igen6_edac 0000:e8:01.0:    [15] CmpltAbrt
[  117.847365] igen6_edac 0000:e8:01.0: AER:   TLP Header: 00000000 00000001 00000002 00000003
[  117.857775] igen6_edac 0000:e8:01.0: AER: device recovery successful

3) Inject a fatal error to the RCiEP 0000:ed:01.0
    Run ./aer_inject <a parameter file as below>:
    AER
    PCI_ID 0000:ed:01.0
    UNCOR_STATUS MALF_TLP
    HEADER_LOG 0 1 2 3

    Log:
[  131.511623] pcieport 0000:ed:00.4: aer_inject: Injecting errors 00000000/00040000 into device 0000:ed:01.0
[  131.523259] pcieport 0000:ed:00.4: AER: Uncorrected (Fatal) error received: 0000:ed:01.0
[  131.533842] igen6_edac 0000:ed:01.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
[  131.655618] igen6_edac 0000:ed:01.0: AER: device recovery successful

Jonathan Cameron (1):
  PCI/AER: Extend AER error handling to RCECs

Qiuxu Zhuo (6):
  pci_ids: Add class code and extended capability for RCEC
  PCI: Extend Root Port Driver to support RCEC
  PCI/portdrv: Add pcie_walk_rcec() to walk RCiEPs associated with RCEC
  PCI/AER: Apply function level reset to RCiEP on fatal error
  PCI: Add 'rcec' field to pci_dev for associated RCiEPs
  PCI/AER: Add RCEC AER error injection support

Sean V Kelley (2):
  PCI/AER: Add RCEC AER handling
  PCI/PME: Add RCEC PME handling

 drivers/pci/pcie/aer.c          | 36 +++++++++----
 drivers/pci/pcie/aer_inject.c   |  5 +-
 drivers/pci/pcie/err.c          | 90 +++++++++++++++++++++++++++------
 drivers/pci/pcie/pme.c          | 15 ++++--
 drivers/pci/pcie/portdrv.h      |  2 +
 drivers/pci/pcie/portdrv_core.c | 90 +++++++++++++++++++++++++++++++--
 drivers/pci/pcie/portdrv_pci.c  | 20 +++++++-
 include/linux/pci.h             |  3 ++
 include/linux/pci_ids.h         |  1 +
 include/uapi/linux/pci_regs.h   |  7 +++
 10 files changed, 233 insertions(+), 36 deletions(-)

--
2.27.0

Comments

Bjorn Helgaas Aug. 5, 2020, 6 p.m. UTC | #1
On Tue, Aug 04, 2020 at 12:40:43PM -0700, Sean V Kelley wrote:
> From: Sean V Kelley <sean.v.kelley@linux.intel.com>
> 
> On the use of FLR on RCiEPs for the fatal case, still interested in more
> feedback from the earlier discussion here [1]:
> 
> [1] https://lore.kernel.org/linux-pci/C21C050B-48B1-4429-B019-C81F3AB8E843@intel.com/
> 
> There is also the question of the absence of an FLR for non-fatal error.
> If the device driver tells us that it needs "PCI_ERS_RESULT_NEED_RESET" by
> the callback report_normal_detected() then we should try FLR on the device
> as well.
> 
> On the use of variables with RP centric names such as the attributes
> dev_attr_aer_rootport_total_err_..., one concern is the ripple effect on code
> churn due to renaming. Open to suggestions, but trying to co-habitate so to
> speak RCECs with RPs in the same drivers has trade-offs.
> 
> Changes since v1 [2]:
> 
> - Make PME capability of RCEC discoverable in get_port_device_capability().
> - Replace the check on bnr with <= lastbusn in pcie_walk_rcec().
> - Fix comment header for pcie_walk_rcec().
> - Fix comment header for pci_walk_dev_affected().
> - Fix spurious newline.
> - Add sanity checks on dev->rcec.
> - Use pci_dbg() in place of pci_info() for discovered RCiEPs.
> - Remove AER RCEC AP FOUND message (accidently left in previously).
> - Remove the check for RC_END from set_device_error_reporting() since
> only Ports and RCECs are being passed.
> (Jonathan Cameron)
> - Fix the return type for flr_on_rciep().
> (reported by lkp on DEC Alpha arch.)
> 
> [1] https://lore.kernel.org/linux-pci/20200724172223.145608-1-sean.v.kelley@intel.com/
> 
> Root Complex Event Collectors (RCEC) provide support for terminating error
> and PME messages from Root Complex Integrated Endpoints (RCiEPs).  An RCEC
> resides on a Bus in the Root Complex. Multiple RCECs can in fact reside on
> a single bus. An RCEC will explicitly declare supported RCiEPs through the
> Root Complex Endpoint Association Extended Capability.
> 
> (See PCIe 5.0-1, sections 1.3.2.3 (RCiEP), and 7.9.10 (RCEC Ext. Cap.))
> 
> The kernel lacks handling for these RCECs and the error messages received
> from their respective associated RCiEPs. More recently, a new CPU
> interconnect, Compute eXpress Link (CXL) depends on RCEC capabilities for
> purposes of error messaging from CXL 1.1 supported RCiEP devices.
> 
> DocLink: https://www.computeexpresslink.org/
> 
> This use case is not limited to CXL. Existing hardware today includes
> support for RCECs, such as the Denverton microserver product
> family. Future hardware will be forthcoming.
> 
> (See Intel Document, Order number: 33061-003US)
> 
> So services such as AER or PME could be associated with an RCEC driver.
> In the case of CXL, if an RCiEP (i.e., CXL 1.1 device) is associated with a
> platform's RCEC it shall signal PME and AER error conditions through that
> RCEC.
> 
> Towards the above use cases, add the missing RCEC class and extend the
> PCIe Root Port and service drivers to allow association of RCiEPs to their
> respective parent RCEC and facilitate handling of terminating error and PME
> messages.
> 
> 
> AER Test Results:
> 1) Inject a correctable error to the RCiEP 0000:e9:00.0
>     Run ./aer_inject <a parameter file as below>:
>     AER
>     PCI_ID 0000:e9:00.0
>     COR_STATUS BAD_TLP
>     HEADER_LOG 0 1 2 3
> 
>     Log:
> [   76.155963] pcieport 0000:e8:00.4: aer_inject: Injecting errors 00000040/00000000 into device 0000:e9:00.0
> [   76.166966] pcieport 0000:e8:00.4: AER: Corrected error received: 0000:e9:00.0
> [   76.175253] pci 0000:e9:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
> [   76.185633] pci 0000:e9:00.0:   device [8086:4940] error status/mask=00000040/00002000
> [   76.194604] pci 0000:e9:00.0:    [ 6] BadTLP

If you remove the timestamps, there will be less distraction here.  As
I'm sure you know, the 0/n cover letter text doesn't really go
anywhere except the email archives.  If this is potentially useful in
the future, it should be in the actual patch commit logs.

> 2) Inject a non-fatal error to the RCiEP 0000:e8:01.0
>     Run ./aer_inject <a parameter file as below>:
>     AER
>     PCI_ID 0000:e8:01.0
>     UNCOR_STATUS COMP_ABORT
>     HEADER_LOG 0 1 2 3

I think maybe this could be written in a way that could be cut and
pasted?

>     Log:
> [  117.791854] pcieport 0000:e8:00.4: aer_inject: Injecting errors 00000000/00008000 into device 0000:e8:01.0
> [  117.804244] pcieport 0000:e8:00.4: AER: Uncorrected (Non-Fatal) error received: 0000:e8:01.0
> [  117.814652] igen6_edac 0000:e8:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Completer ID)
> [  117.828511] igen6_edac 0000:e8:01.0:   device [8086:0b25] error status/mask=00008000/00100000
> [  117.839189] igen6_edac 0000:e8:01.0:    [15] CmpltAbrt
> [  117.847365] igen6_edac 0000:e8:01.0: AER:   TLP Header: 00000000 00000001 00000002 00000003
> [  117.857775] igen6_edac 0000:e8:01.0: AER: device recovery successful
> 
> 3) Inject a fatal error to the RCiEP 0000:ed:01.0
>     Run ./aer_inject <a parameter file as below>:
>     AER
>     PCI_ID 0000:ed:01.0
>     UNCOR_STATUS MALF_TLP
>     HEADER_LOG 0 1 2 3
> 
>     Log:
> [  131.511623] pcieport 0000:ed:00.4: aer_inject: Injecting errors 00000000/00040000 into device 0000:ed:01.0
> [  131.523259] pcieport 0000:ed:00.4: AER: Uncorrected (Fatal) error received: 0000:ed:01.0
> [  131.533842] igen6_edac 0000:ed:01.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Inaccessible, (Unregistered Agent ID)
> [  131.655618] igen6_edac 0000:ed:01.0: AER: device recovery successful
> 
> Jonathan Cameron (1):
>   PCI/AER: Extend AER error handling to RCECs
> 
> Qiuxu Zhuo (6):
>   pci_ids: Add class code and extended capability for RCEC
>   PCI: Extend Root Port Driver to support RCEC
>   PCI/portdrv: Add pcie_walk_rcec() to walk RCiEPs associated with RCEC
>   PCI/AER: Apply function level reset to RCiEP on fatal error
>   PCI: Add 'rcec' field to pci_dev for associated RCiEPs
>   PCI/AER: Add RCEC AER error injection support
> 
> Sean V Kelley (2):
>   PCI/AER: Add RCEC AER handling
>   PCI/PME: Add RCEC PME handling
> 

>  drivers/pci/pcie/aer.c          | 36 +++++++++----
>  drivers/pci/pcie/aer_inject.c   |  5 +-
>  drivers/pci/pcie/err.c          | 90 +++++++++++++++++++++++++++------
>  drivers/pci/pcie/pme.c          | 15 ++++--
>  drivers/pci/pcie/portdrv.h      |  2 +
>  drivers/pci/pcie/portdrv_core.c | 90 +++++++++++++++++++++++++++++++--
>  drivers/pci/pcie/portdrv_pci.c  | 20 +++++++-
>  include/linux/pci.h             |  3 ++
>  include/linux/pci_ids.h         |  1 +
>  include/uapi/linux/pci_regs.h   |  7 +++
>  10 files changed, 233 insertions(+), 36 deletions(-)

I always apply patches to topic branches based at my "master" branch
(typically -rc1, so currently v5.8-rc1, but will soon be v5.9-rc1).

If your series doesn't apply there (as this one doesn't), it saves me
time if you tell me where it does apply.  I figured out that this
applies cleanly on top of my pci/error branch, which does make sense.

I'd actually *rather* have patches based on "master", even if I have
to resolve conflicts, because that gives me the flexibility to squash
in fixes and re-merge the topic branches.

Bjorn