[v11,5/8] PCI/AER: Allow clearing Error Status Register in FF mode
diff mbox series

Message ID c778a619d164f5202248543dc5eec99c079fb82e.1577400653.git.sathyanarayanan.kuppuswamy@linux.intel.com
State New
Headers show
Series
  • Add Error Disconnect Recover (EDR) support
Related show

Commit Message

Kuppuswamy, Sathyanarayanan Dec. 27, 2019, 12:39 a.m. UTC
From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

As per PCI firmware specification r3.2 System Firmware Intermediary
(SFI) _OSC and DPC Updates ECR
(https://members.pcisig.com/wg/PCI-SIG/document/13563), sec titled
"DPC Event Handling Implementation Note", page 10, Error Disconnect
Recover (EDR) support allows OS to handle error recovery and clearing
Error Registers even in FF mode. So create exception for FF mode checks
in pci_cleanup_aer_uncorrect_error_status(), pci_aer_clear_fatal_status()
and pci_cleanup_aer_error_status_regs() functions when its being called
from DPC code path.

Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Keith Busch <keith.busch@intel.com>
---
 Documentation/PCI/pcieaer-howto.rst       |  2 +-
 drivers/net/ethernet/intel/ice/ice_main.c |  2 +-
 drivers/ntb/hw/idt/ntb_hw_idt.c           |  2 +-
 drivers/pci/pci.c                         |  2 +-
 drivers/pci/pci.h                         |  2 +-
 drivers/pci/pcie/aer.c                    | 16 ++++++++--------
 drivers/pci/pcie/dpc.c                    |  4 ++--
 drivers/pci/pcie/err.c                    |  2 +-
 drivers/scsi/lpfc/lpfc_attr.c             |  2 +-
 include/linux/aer.h                       | 12 ++++++++----
 10 files changed, 25 insertions(+), 21 deletions(-)

Comments

Bjorn Helgaas Dec. 30, 2019, 11:59 p.m. UTC | #1
[+cc Austin]

On Thu, Dec 26, 2019 at 04:39:11PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> As per PCI firmware specification r3.2 System Firmware Intermediary
> (SFI) _OSC and DPC Updates ECR
> (https://members.pcisig.com/wg/PCI-SIG/document/13563),

What is the state of this ECR?  I see it in the "PCI Express Review
Zone Archive".  I don't know what the usage is of the "Review Zone" vs
the "Review Zone Archive / PCI Express Review Zone Archive".  AFAICS,
it is not listed in any of the "Documents for 60 Day Member Review".

And I think it needs some clarification (for one thing, it needs to
say what the red/blue text means).  I've mentioned some other items to
Austin, but I haven't read it in detail because it seems like it's not
quite baked yet.

E.g., there's language about "it may make sense for an embedded system
OS to own SFI, but it's recommended that general-purpose OSes never
request SFI ownership."  That's useless: Linux is certainly a general
purpose OS, but Linux is also often an embedded OS.  So the ECR
doesn't provide useful guidance about how an OS should decide whether
to request SFI ownership.

Making code changes based on a published spec or ECN is fine,
obviously.  Changes based on an ECR that is well on track to being
accepted, e.g., is in the 60-day review period, are probably OK.  I
don't yet have warm fuzzies about this ECR because I have no idea how
far along it is.

We might be able to justify some of these changes based on other
specs; it just sounds weird to me to say "based on this Engineering
Change Request that might be accepted someday, we must do X".  Anybody
can dream up an ECR that says anything at all, so AFAICT, an ECR is
not at all authoritative.

> sec titled
> "DPC Event Handling Implementation Note", page 10, Error Disconnect
> Recover (EDR) support allows OS to handle error recovery and clearing
> Error Registers even in FF mode. So create exception for FF mode checks
> in pci_cleanup_aer_uncorrect_error_status(), pci_aer_clear_fatal_status()
> and pci_cleanup_aer_error_status_regs() functions when its being called
> from DPC code path.
Kuppuswamy, Sathyanarayanan Dec. 31, 2019, 6:11 p.m. UTC | #2
Hi Bjorn,

On 12/30/19 3:59 PM, Bjorn Helgaas wrote:
> [+cc Austin]
>
> On Thu, Dec 26, 2019 at 04:39:11PM -0800, sathyanarayanan.kuppuswamy@linux.intel.com wrote:
>> From: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> As per PCI firmware specification r3.2 System Firmware Intermediary
>> (SFI) _OSC and DPC Updates ECR
>> (https://members.pcisig.com/wg/PCI-SIG/document/13563),
> What is the state of this ECR?  I see it in the "PCI Express Review
> Zone Archive".  I don't know what the usage is of the "Review Zone" vs
> the "Review Zone Archive / PCI Express Review Zone Archive".  AFAICS,
> it is not listed in any of the "Documents for 60 Day Member Review".
>
> And I think it needs some clarification (for one thing, it needs to
> say what the red/blue text means).  I've mentioned some other items to
> Austin, but I haven't read it in detail because it seems like it's not
> quite baked yet.
>
> E.g., there's language about "it may make sense for an embedded system
> OS to own SFI, but it's recommended that general-purpose OSes never
> request SFI ownership."  That's useless: Linux is certainly a general
> purpose OS, but Linux is also often an embedded OS.  So the ECR
> doesn't provide useful guidance about how an OS should decide whether
> to request SFI ownership.
This ECR has merged three different change proposals (SFI related,
_OSC related updates and update to implementation note of DPC
handling with EDR support) into a single document.  Out of these
three changes, we only care about "DPC implementation note update".

We already have a ECR specification for Error Disconnect Recover (EDR)
support (https://members.pcisig.com/wg/PCI-SIG/document/12888) in published
spec section. But this document has some ambiguous statements / missing 
details
which as  clarified in the implementation note section of mentioned ECR.
>
> Making code changes based on a published spec or ECN is fine,
> obviously.  Changes based on an ECR that is well on track to being
> accepted, e.g., is in the 60-day review period, are probably OK.  I
> don't yet have warm fuzzies about this ECR because I have no idea how
> far along it is.
>
> We might be able to justify some of these changes based on other
> specs; it just sounds weird to me to say "based on this Engineering
> Change Request that might be accepted someday, we must do X".  Anybody
> can dream up an ECR that says anything at all, so AFAICT, an ECR is
> not at all authoritative.
>
>> sec titled
>> "DPC Event Handling Implementation Note", page 10, Error Disconnect
>> Recover (EDR) support allows OS to handle error recovery and clearing
>> Error Registers even in FF mode. So create exception for FF mode checks
>> in pci_cleanup_aer_uncorrect_error_status(), pci_aer_clear_fatal_status()
>> and pci_cleanup_aer_error_status_regs() functions when its being called
>> from DPC code path.

Patch
diff mbox series

diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 18bdefaafd1a..184c966b61cb 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -243,7 +243,7 @@  messages to root port when an error is detected.
 
 ::
 
-  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);`
+  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev, bool ff_check);
 
 pci_cleanup_aer_uncorrect_error_status cleanups the uncorrectable
 error status register.
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 69bff085acf7..8378dbf3500b 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -3491,7 +3491,7 @@  static pci_ers_result_t ice_pci_err_slot_reset(struct pci_dev *pdev)
 			result = PCI_ERS_RESULT_DISCONNECT;
 	}
 
-	err = pci_cleanup_aer_uncorrect_error_status(pdev);
+	err = pci_cleanup_aer_uncorrect_error_status(pdev, 1);
 	if (err)
 		dev_dbg(&pdev->dev,
 			"pci_cleanup_aer_uncorrect_error_status failed, error %d\n",
diff --git a/drivers/ntb/hw/idt/ntb_hw_idt.c b/drivers/ntb/hw/idt/ntb_hw_idt.c
index dcf234680535..9023308be2de 100644
--- a/drivers/ntb/hw/idt/ntb_hw_idt.c
+++ b/drivers/ntb/hw/idt/ntb_hw_idt.c
@@ -2675,7 +2675,7 @@  static int idt_init_pci(struct idt_ntb_dev *ndev)
 	if (ret != 0)
 		dev_warn(&pdev->dev, "PCIe AER capability disabled\n");
 	else /* Cleanup uncorrectable error status before getting to init */
-		pci_cleanup_aer_uncorrect_error_status(pdev);
+		pci_cleanup_aer_uncorrect_error_status(pdev, 1);
 
 	/* First enable the PCI device */
 	ret = pcim_enable_device(pdev);
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e87196cc1a7f..6264244d92bf 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -1499,7 +1499,7 @@  void pci_restore_state(struct pci_dev *dev)
 	pci_restore_rebar_state(dev);
 	pci_restore_dpc_state(dev);
 
-	pci_cleanup_aer_error_status_regs(dev);
+	pci_cleanup_aer_error_status_regs(dev, 1);
 	pci_restore_aer_state(dev);
 
 	pci_restore_config_space(dev);
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a0a53bd05a0b..0b4452f72a9a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -646,7 +646,7 @@  void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
-void pci_aer_clear_fatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev, bool ff_check);
 void pci_aer_clear_device_status(struct pci_dev *dev);
 #else
 static inline void pci_no_aer(void) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1ca86f2e0166..c64a91b347d2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -376,7 +376,7 @@  void pci_aer_clear_device_status(struct pci_dev *dev)
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
 
-int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
+int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status, sev;
@@ -385,7 +385,7 @@  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
-	if (pcie_aer_get_firmware_first(dev))
+	if (ff_check && pcie_aer_get_firmware_first(dev))
 		return -EIO;
 
 	/* Clear status bits for ERR_NONFATAL errors only */
@@ -399,7 +399,7 @@  int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_cleanup_aer_uncorrect_error_status);
 
-void pci_aer_clear_fatal_status(struct pci_dev *dev)
+void pci_aer_clear_fatal_status(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status, sev;
@@ -408,8 +408,8 @@  void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (!pos)
 		return;
 
-	if (pcie_aer_get_firmware_first(dev))
-		return;
+	if (ff_check && pcie_aer_get_firmware_first(dev))
+		return -EIO;
 
 	/* Clear status bits for ERR_FATAL errors only */
 	pci_read_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, &status);
@@ -419,7 +419,7 @@  void pci_aer_clear_fatal_status(struct pci_dev *dev)
 		pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_STATUS, status);
 }
 
-int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
+int pci_cleanup_aer_error_status_regs(struct pci_dev *dev, bool ff_check)
 {
 	int pos;
 	u32 status;
@@ -432,7 +432,7 @@  int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
 	if (!pos)
 		return -EIO;
 
-	if (pcie_aer_get_firmware_first(dev))
+	if (ff_check && pcie_aer_get_firmware_first(dev))
 		return -EIO;
 
 	port_type = pci_pcie_type(dev);
@@ -515,7 +515,7 @@  void pci_aer_init(struct pci_dev *dev)
 	n = pcie_cap_has_rtctl(dev) ? 5 : 4;
 	pci_add_ext_cap_save_buffer(dev, PCI_EXT_CAP_ID_ERR, sizeof(u32) * n);
 
-	pci_cleanup_aer_error_status_regs(dev);
+	pci_cleanup_aer_error_status_regs(dev, 1);
 }
 
 void pci_aer_exit(struct pci_dev *dev)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index b19d707db222..e711a3747f48 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -280,8 +280,8 @@  static void dpc_process_error(struct dpc_dev *dpc)
 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		 aer_get_device_error_info(pdev, &info)) {
 		aer_print_error(pdev, &info);
-		pci_cleanup_aer_uncorrect_error_status(pdev);
-		pci_aer_clear_fatal_status(pdev);
+		pci_cleanup_aer_uncorrect_error_status(pdev, 0);
+		pci_aer_clear_fatal_status(pdev, 0);
 	}
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 53cd9200ec2c..043eab17e45d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -236,7 +236,7 @@  void pcie_do_recovery(struct pci_dev *dev, enum pci_channel_state state,
 	pci_walk_bus(bus, report_resume, &status);
 
 	pci_aer_clear_device_status(dev);
-	pci_cleanup_aer_uncorrect_error_status(dev);
+	pci_cleanup_aer_uncorrect_error_status(dev, 1);
 	pci_info(dev, "AER: Device recovery successful\n");
 	return;
 
diff --git a/drivers/scsi/lpfc/lpfc_attr.c b/drivers/scsi/lpfc/lpfc_attr.c
index 4ff82b36a37a..5b37efd29e20 100644
--- a/drivers/scsi/lpfc/lpfc_attr.c
+++ b/drivers/scsi/lpfc/lpfc_attr.c
@@ -4810,7 +4810,7 @@  lpfc_aer_cleanup_state(struct device *dev, struct device_attribute *attr,
 		return -EINVAL;
 
 	if (phba->hba_flag & HBA_AER_ENABLED)
-		rc = pci_cleanup_aer_uncorrect_error_status(phba->pcidev);
+		rc = pci_cleanup_aer_uncorrect_error_status(phba->pcidev, 1);
 
 	if (rc == 0)
 		return strlen(buf);
diff --git a/include/linux/aer.h b/include/linux/aer.h
index fa19e01f418a..f4f49df500ad 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -44,8 +44,10 @@  struct aer_capability_regs {
 /* PCIe port driver needs this function to enable AER */
 int pci_enable_pcie_error_reporting(struct pci_dev *dev);
 int pci_disable_pcie_error_reporting(struct pci_dev *dev);
-int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev);
-int pci_cleanup_aer_error_status_regs(struct pci_dev *dev);
+int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev,
+					   bool ff_check);
+int pci_cleanup_aer_error_status_regs(struct pci_dev *dev,
+				      bool ff_check);
 void pci_save_aer_state(struct pci_dev *dev);
 void pci_restore_aer_state(struct pci_dev *dev);
 #else
@@ -57,11 +59,13 @@  static inline int pci_disable_pcie_error_reporting(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
-static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev)
+static inline int pci_cleanup_aer_uncorrect_error_status(struct pci_dev *dev,
+							 bool ff_check)
 {
 	return -EINVAL;
 }
-static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev)
+static inline int pci_cleanup_aer_error_status_regs(struct pci_dev *dev,
+						    bool ff_check)
 {
 	return -EINVAL;
 }