[RFC] PCI/AER: Enable internal AER errors by default

Message ID 20230209-cxl-pci-aer-v1-1-f9a817fa4016@intel.com
State New

Commit Message

Ira Weiny Feb. 10, 2023, 10:33 p.m. UTC
The CXL driver expects internal error reporting to be enabled via
pci_enable_pcie_error_reporting().  It is likely other drivers expect the same.
Dave submitted a patch to enable the CXL side[1] but the PCI AER registers
still mask errors.

PCIe v6.0 Uncorrectable Mask Register (7.8.4.3) and Correctable Mask
Register (7.8.4.6) default to masking internal errors.  The
Uncorrectable Error Severity Register (7.8.4.4) defaults internal errors
as fatal.

Enable internal errors to be reported via the standard
pci_enable_pcie_error_reporting() call.  Ensure uncorrectable errors are set
non-fatal to limit any impact to other drivers.
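
As a rough illustration of the three register updates described above, the
following userspace sketch models only the internal-error bits.  The bit
values match the kernel's include/uapi/linux/pci_regs.h; struct aer_regs and
the helper names are hypothetical, exist only for this example, and do not
touch real config space.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical in-memory stand-in for the three AER registers involved. */
struct aer_regs {
	uint32_t uncor_mask;	/* Uncorrectable Error Mask (7.8.4.3) */
	uint32_t uncor_sever;	/* Uncorrectable Error Severity (7.8.4.4) */
	uint32_t cor_mask;	/* Correctable Error Mask (7.8.4.6) */
};

/* Reset state of the internal-error bits only: masked, and fatal. */
static struct aer_regs aer_internal_defaults(void)
{
	struct aer_regs r = {
		.uncor_mask  = PCI_ERR_UNC_INTN,
		.uncor_sever = PCI_ERR_UNC_INTN,
		.cor_mask    = PCI_ERR_COR_INTERNAL,
	};
	return r;
}

/* Apply the three read-modify-write steps to the modelled registers. */
static void aer_unmask_internal(struct aer_regs *r)
{
	assert(r);
	r->cor_mask    &= ~PCI_ERR_COR_INTERNAL;	/* report corrected internal errors */
	r->uncor_sever &= ~PCI_ERR_UNC_INTN;		/* severity bit 0 = non-fatal */
	r->uncor_mask  &= ~PCI_ERR_UNC_INTN;		/* report uncorrectable internal errors */
}

/* An uncorrectable internal error is signalled only if its mask bit is clear. */
static int aer_internal_is_reported(const struct aer_regs *r)
{
	return !(r->uncor_mask & PCI_ERR_UNC_INTN);
}
```

Starting from aer_internal_defaults(), aer_internal_is_reported() is false;
after aer_unmask_internal() it is true and the severity bit is clear.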

[1] https://lore.kernel.org/all/167604864163.2392965.5102660329807283871.stgit@djiang5-mobl3.local/

Cc: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Stefan Roese <sr@denx.de>
Cc: "Kuppuswamy Sathyanarayanan" <sathyanarayanan.kuppuswamy@linux.intel.com>
Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
Cc: Oliver O'Halloran <oohall@gmail.com>
Cc: linux-cxl@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-pci@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
---
This is RFC to see if it is acceptable to be part of the standard
pci_enable_pcie_error_reporting() call or perhaps a separate pci core
call should be introduced.  It is anticipated that enabling this error
reporting is what existing drivers are expecting.  The errors are marked
non-fatal therefore it should not adversely affect existing devices.
---
 drivers/pci/pcie/aer.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)


---
base-commit: e5ab7f206ffc873160bd0f1a52cae17ab692a9d1
change-id: 20230209-cxl-pci-aer-18dda61c8239

Best regards,

Comments

Bjorn Helgaas Feb. 13, 2023, 9:38 p.m. UTC | #1
On Fri, Feb 10, 2023 at 02:33:23PM -0800, Ira Weiny wrote:
> The CXL driver expects internal error reporting to be enabled via
> pci_enable_pcie_error_reporting().  It is likely other drivers expect the same.
> Dave submitted a patch to enable the CXL side[1] but the PCI AER registers
> still mask errors.
> 
> PCIe v6.0 Uncorrectable Mask Register (7.8.4.3) and Correctable Mask
> Register (7.8.4.6) default to masking internal errors.  The
> Uncorrectable Error Severity Register (7.8.4.4) defaults internal errors
> as fatal.
> 
> Enable internal errors to be reported via the standard
> pci_enable_pcie_error_reporting() call.  Ensure uncorrectable errors are set
> non-fatal to limit any impact to other drivers.

Do you have any background on why the spec makes these errors masked
by default?  I'm sympathetic to wanting to learn about all the errors
we can, but I'm a little wary if the spec authors thought it was
important to mask these by default.

> [1] https://lore.kernel.org/all/167604864163.2392965.5102660329807283871.stgit@djiang5-mobl3.local/
> 
> Cc: Bjorn Helgaas <helgaas@kernel.org>
> Cc: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Stefan Roese <sr@denx.de>
> Cc: "Kuppuswamy Sathyanarayanan" <sathyanarayanan.kuppuswamy@linux.intel.com>
> Cc: Mahesh J Salgaonkar <mahesh@linux.ibm.com>
> Cc: Oliver O'Halloran <oohall@gmail.com>
> Cc: linux-cxl@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-pci@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Ira Weiny <ira.weiny@intel.com>
> ---
> This is RFC to see if it is acceptable to be part of the standard
> pci_enable_pcie_error_reporting() call or perhaps a separate pci core
> call should be introduced.  It is anticipated that enabling this error
> reporting is what existing drivers are expecting.  The errors are marked
> non-fatal therefore it should not adversely affect existing devices.
> ---
>  drivers/pci/pcie/aer.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 625f7b2cafe4..9d3ed3a5fc23 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -229,11 +229,28 @@ int pcie_aer_is_native(struct pci_dev *dev)
>  
>  int pci_enable_pcie_error_reporting(struct pci_dev *dev)
>  {
> +	int pos_cap_err;
> +	u32 reg;
>  	int rc;
>  
>  	if (!pcie_aer_is_native(dev))
>  		return -EIO;
>  
> +	pos_cap_err = dev->aer_cap;
> +
> +	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
> +	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
> +	reg &= ~PCI_ERR_COR_INTERNAL;
> +	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
> +
> +	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
> +	reg &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
> +
> +	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
> +	reg &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);
> +
>  	rc = pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
>  	return pcibios_err_to_errno(rc);
>  }
> 
> ---
> base-commit: e5ab7f206ffc873160bd0f1a52cae17ab692a9d1
> change-id: 20230209-cxl-pci-aer-18dda61c8239
> 
> Best regards,
> -- 
> Ira Weiny <ira.weiny@intel.com>
>
David Laight Feb. 13, 2023, 10:44 p.m. UTC | #2
From: Bjorn Helgaas
> Sent: 13 February 2023 21:38
> 
> On Fri, Feb 10, 2023 at 02:33:23PM -0800, Ira Weiny wrote:
> > The CXL driver expects internal error reporting to be enabled via
> > pci_enable_pcie_error_reporting().  It is likely other drivers expect the same.
> > Dave submitted a patch to enable the CXL side[1] but the PCI AER registers
> > still mask errors.
> >
> > PCIe v6.0 Uncorrectable Mask Register (7.8.4.3) and Correctable Mask
> > Register (7.8.4.6) default to masking internal errors.  The
> > Uncorrectable Error Severity Register (7.8.4.4) defaults internal errors
> > as fatal.
> >
> > Enable internal errors to be reported via the standard
> > pci_enable_pcie_error_reporting() call.  Ensure uncorrectable errors are set
> > non-fatal to limit any impact to other drivers.
> 
> Do you have any background on why the spec makes these errors masked
> by default?  I'm sympathetic to wanting to learn about all the errors
> we can, but I'm a little wary if the spec authors thought it was
> important to mask these by default.

I'd guess that it is for backwards compatibility with older hardware
and/or software that didn't support error notifications.

Then there are the x86 systems that manage to take the AER
error into some 'board management hardware' which finally
interrupts the kernel with an NMI - and the obvious consequence.
These systems are NEBS? 'qualified' for telecoms use, but take
out a PCIe link and the system crashes.

It is pretty easy to generate a PCIe error.
Any endpoint with two (or more) different sized BARs leaves
a big chunk of PCIe address space that is forwarded by the upstream
bridge but is not responded to.
The requirement to put the MSI-X area in its own BAR pretty much
ensures that such addresses exist.
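
To make the address-hole argument concrete, here is a small userspace
sketch.  It assumes a single endpoint directly below the bridge, BAR sizes
that are powers of two listed largest first, and the 1 MiB granularity of a
PCI-to-PCI bridge memory window.  The function names are made up for
illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bridge memory base/limit registers decode address bits 31:20 (1 MiB). */
#define BRIDGE_WINDOW_GRANULE	((uint64_t)1 << 20)

static uint64_t round_up_to(uint64_t value, uint64_t align)
{
	assert(align != 0);
	return (value + align - 1) / align * align;
}

/*
 * Bytes inside the bridge window that no BAR claims.
 *
 * With power-of-two sizes packed largest first, every BAR stays naturally
 * aligned without internal padding, so the only hole in this model is the
 * tail between the last BAR and the end of the rounded-up window.
 */
static uint64_t unclaimed_bytes(const uint64_t *bar_sizes, size_t nr_bars)
{
	uint64_t claimed = 0;

	for (size_t i = 0; i < nr_bars; i++)
		claimed += bar_sizes[i];

	return round_up_to(claimed, BRIDGE_WINDOW_GRANULE) - claimed;
}
```

For a 1 MiB BAR plus a 4 KiB MSI-X BAR, unclaimed_bytes() gives 1044480
bytes (1020 KiB) that the bridge forwards but nothing responds to.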

(Never mind reprogramming the fpga that is terminating the link.)

	David

Ira Weiny Feb. 15, 2023, 12:08 a.m. UTC | #3
Bjorn Helgaas wrote:
> On Fri, Feb 10, 2023 at 02:33:23PM -0800, Ira Weiny wrote:
> > The CXL driver expects internal error reporting to be enabled via
> > pci_enable_pcie_error_reporting().  It is likely other drivers expect the same.
> > Dave submitted a patch to enable the CXL side[1] but the PCI AER registers
> > still mask errors.
> > 
> > PCIe v6.0 Uncorrectable Mask Register (7.8.4.3) and Correctable Mask
> > Register (7.8.4.6) default to masking internal errors.  The
> > Uncorrectable Error Severity Register (7.8.4.4) defaults internal errors
> > as fatal.
> > 
> > Enable internal errors to be reported via the standard
> > pci_enable_pcie_error_reporting() call.  Ensure uncorrectable errors are set
> > non-fatal to limit any impact to other drivers.
> 
> Do you have any background on why the spec makes these errors masked
> by default?  I'm sympathetic to wanting to learn about all the errors
> we can, but I'm a little wary if the spec authors thought it was
> important to mask these by default.
> 

I don't have any idea of the history.

To me 'internal errors' is a pretty wide net and was likely a catch all
that the authors felt was mostly unneeded.

CXL is different because it further divides the errors.

I've enlisted some help internal to Intel to hopefully find some answers.
But in the event no one knows, it would be safe to go with my alternate
suggestion and add a new PCIe call to enable this specifically for the
drivers who need it.

Ira
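
The alternative mentioned above, a separate opt-in call for the drivers that
need it, might look roughly like the sketch below.  This is a userspace
model, not kernel code: struct fake_pci_dev, the cfg_*() accessors and the
function name are hypothetical, and the -ENODEV return for a missing AER
capability is an assumption of this sketch.  Only the register offsets and
bit values are taken from include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Offsets within the AER capability and bit values, per pci_regs.h. */
#define PCI_ERR_UNCOR_MASK	0x08
#define PCI_ERR_UNCOR_SEVER	0x0c
#define PCI_ERR_COR_MASK	0x14
#define PCI_ERR_UNC_INTN	0x00400000u
#define PCI_ERR_COR_INTERNAL	0x00004000u

/* Hypothetical stand-in for struct pci_dev with simulated config space. */
struct fake_pci_dev {
	uint16_t aer_cap;	/* offset of the AER capability, 0 if absent */
	uint8_t cfg[4096];	/* simulated extended config space */
};

static uint32_t cfg_read32(const struct fake_pci_dev *dev, unsigned int off)
{
	uint32_t val;

	memcpy(&val, &dev->cfg[off], sizeof(val));
	return val;
}

static void cfg_write32(struct fake_pci_dev *dev, unsigned int off, uint32_t val)
{
	memcpy(&dev->cfg[off], &val, sizeof(val));
}

/* Read-modify-write that clears the given bits and leaves the rest alone. */
static void cfg_clear_bits32(struct fake_pci_dev *dev, unsigned int off, uint32_t bits)
{
	cfg_write32(dev, off, cfg_read32(dev, off) & ~bits);
}

/* Opt-in helper: only callers that want internal errors would invoke it. */
static int fake_pci_aer_unmask_internal_errors(struct fake_pci_dev *dev)
{
	unsigned int aer;

	assert(dev);
	if (!dev->aer_cap)
		return -ENODEV;	/* assumption: refuse when AER is absent */

	aer = dev->aer_cap;
	cfg_clear_bits32(dev, aer + PCI_ERR_COR_MASK, PCI_ERR_COR_INTERNAL);
	cfg_clear_bits32(dev, aer + PCI_ERR_UNCOR_SEVER, PCI_ERR_UNC_INTN);
	cfg_clear_bits32(dev, aer + PCI_ERR_UNCOR_MASK, PCI_ERR_UNC_INTN);
	return 0;
}
```

Keeping this separate from pci_enable_pcie_error_reporting() would leave the
mask defaults untouched for every driver that does not ask for the change.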
Patch

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 625f7b2cafe4..9d3ed3a5fc23 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -229,11 +229,28 @@  int pcie_aer_is_native(struct pci_dev *dev)
 
 int pci_enable_pcie_error_reporting(struct pci_dev *dev)
 {
+	int pos_cap_err;
+	u32 reg;
 	int rc;
 
 	if (!pcie_aer_is_native(dev))
 		return -EIO;
 
+	pos_cap_err = dev->aer_cap;
+
+	/* Unmask correctable and uncorrectable (non-fatal) internal errors */
+	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, &reg);
+	reg &= ~PCI_ERR_COR_INTERNAL;
+	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_COR_MASK, reg);
+
+	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, &reg);
+	reg &= ~PCI_ERR_UNC_INTN;
+	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_SEVER, reg);
+
+	pci_read_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, &reg);
+	reg &= ~PCI_ERR_UNC_INTN;
+	pci_write_config_dword(dev, pos_cap_err + PCI_ERR_UNCOR_MASK, reg);
+
 	rc = pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
 	return pcibios_err_to_errno(rc);
 }