[v8,00/14] Add RCEC handling to PCI/AER

Message ID 20201002184735.1229220-1-seanvk.dev@oregontracks.org

Sean V Kelley Oct. 2, 2020, 6:47 p.m. UTC
From: Sean V Kelley <sean.v.kelley@intel.com>

Changes since v7 [1]:

- No functional changes.

- Reword bridge patch.
- Noted testing below for #non-native/no RCEC case
(Jonathan Cameron)

- Separate out pci_walk_bus() into pci_walk_bridge() change.
- Put remaining dev to bridge name changes in the separate patch from v7.
(Bjorn Helgaas)

[1] https://lore.kernel.org/lkml/20200930215820.1113353-1-seanvk.dev@oregontracks.org/

Root Complex Event Collectors (RCEC) provide support for terminating error
and PME messages from Root Complex Integrated Endpoints (RCiEPs).  An RCEC
resides on a Bus in the Root Complex. Multiple RCECs can in fact reside on
a single bus. An RCEC explicitly declares the RCiEPs it supports through
the Root Complex Event Collector Endpoint Association Extended Capability.

(See PCIe 5.0-1, sections 1.3.2.3 (RCiEP), and 7.9.10 (RCEC Ext. Cap.))
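
As a rough illustration of the association described above, here is a
minimal user-space sketch of how an RCiEP could be matched against an RCEC's
Endpoint Association data. It is not the code from this series: the struct,
its field names, and both helper functions are hypothetical. It assumes a
32-bit bitmap in which bit N covers device N on the RCEC's own bus, plus an
optional range of additional buses, with next = 0xff and last = 0x00 taken
to mean "no additional buses".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical, simplified view of an RCEC's Endpoint Association data.
 * Field names are illustrative only and are not the kernel's.
 */
struct rcec_assoc {
	uint8_t  rcec_bus;	/* bus the RCEC itself resides on */
	uint32_t bitmap;	/* bit N set: device N on rcec_bus is associated */
	uint8_t  next_bus;	/* first additional associated bus */
	uint8_t  last_bus;	/* last additional associated bus */
};

/* Assumed encoding for "no additional buses": next = 0xff, last = 0x00. */
static bool rcec_has_bus_range(const struct rcec_assoc *ea)
{
	return !(ea->next_bus == 0xff && ea->last_bus == 0x00);
}

/* Return true if the RCiEP at bus/dev is associated with this RCEC. */
static bool rciep_is_associated(const struct rcec_assoc *ea,
				uint8_t bus, uint8_t dev)
{
	assert(ea);

	/* PCI device numbers are 5 bits wide. */
	if (dev > 31)
		return false;

	/* Same bus as the RCEC: only the bitmap decides. */
	if (bus == ea->rcec_bus)
		return (ea->bitmap >> dev) & 1u;

	/* Other buses: associated only if inside the declared range. */
	if (!rcec_has_bus_range(ea))
		return false;

	return bus >= ea->next_bus && bus <= ea->last_bus;
}
```

With a bitmap of (1 << 3) | (1 << 5) on bus 0 and no bus range, devices 3
and 5 on bus 0 would match and nothing on any other bus would.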

The kernel lacks handling for these RCECs and the error messages received
from their respective associated RCiEPs. More recently, a new CPU
interconnect, Compute Express Link (CXL), depends on RCEC capabilities for
error messaging from RCiEP devices that support CXL 1.1.

DocLink: https://www.computeexpresslink.org/

This use case is not limited to CXL. Existing hardware, such as the
Denverton microserver product family, already includes support for RCECs,
and more hardware is expected to follow.

(See Intel Document, Order number: 33061-003US)

So services such as AER or PME could be associated with an RCEC driver.
In the case of CXL, if an RCiEP (i.e., a CXL 1.1 device) is associated with
a platform's RCEC, it shall signal PME and AER error conditions through
that RCEC.

To support the above use cases, add the missing RCEC class and extend the
PCIe Root Port and service drivers to allow association of RCiEPs with
their respective parent RCEC and to facilitate handling of the error and
PME messages that the RCEC terminates.

Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #non-native/no RCEC


Jonathan Cameron (1):
  PCI/AER: Extend AER error handling to RCECs

Qiuxu Zhuo (5):
  PCI/RCEC: Add RCEC class code and extended capability
  PCI/RCEC: Bind RCEC devices to the Root Port driver
  PCI/AER: Apply function level reset to RCiEP on fatal error
  PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
  PCI/AER: Add RCEC AER error injection support

Sean V Kelley (8):
  PCI/RCEC: Cache RCEC capabilities in pci_init_capabilities()
  PCI/ERR: Rename reset_link() to reset_subordinate_device()
  PCI/ERR: Use "bridge" for clarity in pcie_do_recovery()
  PCI/ERR: Add pci_walk_bridge() to pcie_do_recovery()
  PCI/ERR: Limit AER resets in pcie_do_recovery()
  PCI/RCEC: Add pcie_link_rcec() to associate RCiEPs
  PCI/AER: Add pcie_walk_rcec() to RCEC AER handling
  PCI/PME: Add pcie_walk_rcec() to RCEC PME handling

 drivers/pci/pci.h               |  25 ++++-
 drivers/pci/pcie/Makefile       |   2 +-
 drivers/pci/pcie/aer.c          |  36 ++++--
 drivers/pci/pcie/aer_inject.c   |   5 +-
 drivers/pci/pcie/err.c          | 109 +++++++++++++++----
 drivers/pci/pcie/pme.c          |  15 ++-
 drivers/pci/pcie/portdrv_core.c |   8 +-
 drivers/pci/pcie/portdrv_pci.c  |   8 +-
 drivers/pci/pcie/rcec.c         | 187 ++++++++++++++++++++++++++++++++
 drivers/pci/probe.c             |   2 +
 include/linux/pci.h             |   5 +
 include/linux/pci_ids.h         |   1 +
 include/uapi/linux/pci_regs.h   |   7 ++
 13 files changed, 367 insertions(+), 43 deletions(-)
 create mode 100644 drivers/pci/pcie/rcec.c

--
2.28.0

Bjorn Helgaas Oct. 9, 2020, 3:53 p.m. UTC | #1
On Fri, Oct 02, 2020 at 11:47:21AM -0700, Sean V Kelley wrote:
> From: Sean V Kelley <sean.v.kelley@intel.com>
> 
> Changes since v7 [1]:
> 
> - No functional changes.
> 
> - Reword bridge patch.
> - Noted testing below for #non-native/no RCEC case
> (Jonathan Cameron)
> 
> - Separate out pci_walk_bus() into pci_walk_bridge() change.
> - Put remaining dev to bridge name changes in the separate patch from v7.
> (Bjorn Helgaas)
> 
> [1] https://lore.kernel.org/lkml/20200930215820.1113353-1-seanvk.dev@oregontracks.org/
> 
> Root Complex Event Collectors (RCEC) provide support for terminating error
> and PME messages from Root Complex Integrated Endpoints (RCiEPs).  An RCEC
> resides on a Bus in the Root Complex. Multiple RCECs can in fact reside on
> a single bus. An RCEC will explicitly declare supported RCiEPs through the
> Root Complex Endpoint Association Extended Capability.
> 
> (See PCIe 5.0-1, sections 1.3.2.3 (RCiEP), and 7.9.10 (RCEC Ext. Cap.))
> 
> The kernel lacks handling for these RCECs and the error messages received
> from their respective associated RCiEPs. More recently, a new CPU
> interconnect, Compute eXpress Link (CXL) depends on RCEC capabilities for
> purposes of error messaging from CXL 1.1 supported RCiEP devices.
> 
> DocLink: https://www.computeexpresslink.org/
> 
> This use case is not limited to CXL. Existing hardware today includes
> support for RCECs, such as the Denverton microserver product
> family. Future hardware will be forthcoming.
> 
> (See Intel Document, Order number: 33061-003US)
> 
> So services such as AER or PME could be associated with an RCEC driver.
> In the case of CXL, if an RCiEP (i.e., CXL 1.1 device) is associated with a
> platform's RCEC it shall signal PME and AER error conditions through that
> RCEC.
> 
> Towards the above use cases, add the missing RCEC class and extend the
> PCIe Root Port and service drivers to allow association of RCiEPs to their
> respective parent RCEC and facilitate handling of terminating error and PME
> messages.
> 
> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> #non-native/no RCEC
> 
> 
> Jonathan Cameron (1):
>   PCI/AER: Extend AER error handling to RCECs
> 
> Qiuxu Zhuo (5):
>   PCI/RCEC: Add RCEC class code and extended capability
>   PCI/RCEC: Bind RCEC devices to the Root Port driver
>   PCI/AER: Apply function level reset to RCiEP on fatal error
>   PCI/RCEC: Add RCiEP's linked RCEC to AER/ERR
>   PCI/AER: Add RCEC AER error injection support
> 
> Sean V Kelley (8):
>   PCI/RCEC: Cache RCEC capabilities in pci_init_capabilities()
>   PCI/ERR: Rename reset_link() to reset_subordinate_device()
>   PCI/ERR: Use "bridge" for clarity in pcie_do_recovery()
>   PCI/ERR: Add pci_walk_bridge() to pcie_do_recovery()
>   PCI/ERR: Limit AER resets in pcie_do_recovery()
>   PCI/RCEC: Add pcie_link_rcec() to associate RCiEPs
>   PCI/AER: Add pcie_walk_rcec() to RCEC AER handling
>   PCI/PME: Add pcie_walk_rcec() to RCEC PME handling
> 
>  drivers/pci/pci.h               |  25 ++++-
>  drivers/pci/pcie/Makefile       |   2 +-
>  drivers/pci/pcie/aer.c          |  36 ++++--
>  drivers/pci/pcie/aer_inject.c   |   5 +-
>  drivers/pci/pcie/err.c          | 109 +++++++++++++++----
>  drivers/pci/pcie/pme.c          |  15 ++-
>  drivers/pci/pcie/portdrv_core.c |   8 +-
>  drivers/pci/pcie/portdrv_pci.c  |   8 +-
>  drivers/pci/pcie/rcec.c         | 187 ++++++++++++++++++++++++++++++++
>  drivers/pci/probe.c             |   2 +
>  include/linux/pci.h             |   5 +
>  include/linux/pci_ids.h         |   1 +
>  include/uapi/linux/pci_regs.h   |   7 ++
>  13 files changed, 367 insertions(+), 43 deletions(-)
>  create mode 100644 drivers/pci/pcie/rcec.c

Thank you very much for your work and patience with this series!

Applied to pci/err for v5.10 with the following changes:

  - Make pci_rcec_init() void since return value was unused.

  - Reorder pci_rcec_init() so rcec_ea is filled in before publishing
    in dev->rcec_ea.

  - Split pcie_do_recovery() patches up a little more.  My hope was to
    make the "Use 'bridge' for clarity" patch more of a pure rename
    patch and easier to review.  Not sure I accomplished that.

  - Log messages and uevents with "bridge", not "dev", in
    pcie_do_recovery() to preserve previous behavior.

  - Rename reset_subordinate_devices() to reset_subordinates() for
    brevity.

  - Fix kerneldoc issues (reported with "make W=1").

  - Fix whitespace (lines that didn't use the full width, lines longer
    than 80 columns, etc).

Please let me know if I botched anything.