diff mbox

pci, Add AER_panic sysfs file

Message ID 1337274270-18785-1-git-send-email-prarit@redhat.com
State Superseded
Headers show

Commit Message

Prarit Bhargava May 17, 2012, 5:04 p.m. UTC
Consider the following case

		[ RP ]
		  |
		  |
	+---------+-----------+
	|	  |	      |
       [H1]      [H2]        [X1]

where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
PCIE AER driver error handling (ie, they have pci_error_handlers defined in
the driver), and X1 is a device with a driver that does not support PCIE
AER driver error handling.

If the Root Port takes an error what currently happens is that the
bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.

In some cases a user may not wish the system to continue because X1 is
an unhardened driver.  In these cases, the system should not do a bus reset,
but rather the system should panic to avoid any further possible data
corruption.

This patch implements an AER_panic sysfs entry for each root port which
a user can modify.  AER_panic = 1, means the system will panic on a
PCIE error which would have normally resulted in a secondary bus reset.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci-sysfs.c       |   42 ++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/pcie/aer/aerdrv.c |    3 ++
 include/linux/pci.h           |    1 +
 3 files changed, 45 insertions(+), 1 deletions(-)

Comments

Prarit Bhargava May 17, 2012, 5:39 p.m. UTC | #1
On 05/17/2012 01:29 PM, Shyam_Iyer@Dell.com wrote:
> 
> 
>> -----Original Message-----
>> From: linux-pci-owner@vger.kernel.org [mailto:linux-pci-
>> owner@vger.kernel.org] On Behalf Of Prarit Bhargava
>> Sent: Thursday, May 17, 2012 1:05 PM
>> To: linux-pci@vger.kernel.org
>> Cc: Prarit Bhargava; Bjorn Helgaas
>> Subject: [PATCH] pci, Add AER_panic sysfs file
>>
>> Consider the following case
>>
>> 		[ RP ]
>> 		  |
>> 		  |
>> 	+---------+-----------+
>> 	|	  |	      |
>>        [H1]      [H2]        [X1]
>>
>> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that
>> support
>> PCIE AER driver error handling (ie, they have pci_error_handlers
>> defined in
>> the driver), and X1 is a device with a driver that does not support
>> PCIE
>> AER driver error handling.
>>
>> If the Root Port takes an error what currently happens is that the
>> bus resets and H1 & H2 call their slot_reset functions.  X1 does
>> nothing.
>>
>> In some cases a user may not wish the system to continue because X1 is
>> an unhardened driver.  In these cases, the system should not do a bus
>> reset,
>> but rather the system should panic to avoid any further possible data
>> corruption.
> 
> Do we neeed to panic for both correctable and uncorrectable errors.. ?
> 
> I thought correctable errors could recover without a bus reset.

Will a bus reset be issued on a correctable error?  I thought the code path was
that the bus reset was issued on the uncorrectable error.

drivers/pci/pcie/aer/aerdrv_core.c: do_recovery()

        if (severity == AER_FATAL) {
                result = reset_link(dev);
                if (result != PCI_ERS_RESULT_RECOVERED)
                        goto failed;
        }

I may not be looking at the right spot of code.  Care to enlighten me? :)

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Don Dutile May 17, 2012, 6:51 p.m. UTC | #2
On 05/17/2012 01:04 PM, Prarit Bhargava wrote:
> Consider the following case
>
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>         [H1]      [H2]        [X1]
>
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
> the driver), and X1 is a device with a driver that does not support PCIE
> AER driver error handling.
>
> If the Root Port takes an error what currently happens is that the
> bus resets and H1&  H2 call their slot_reset functions.  X1 does nothing.
>
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.  In these cases, the system should not do a bus reset,
> but rather the system should panic to avoid any further possible data
> corruption.
>
> This patch implements an AER_panic sysfs entry for each root port which
> a user can modify.  AER_panic = 1, means the system will panic on a
> PCIE error which would have normally resulted in a secondary bus reset.
>
> Signed-off-by: Prarit Bhargava<prarit@redhat.com>
> Cc: Bjorn Helgaas<bhelgaas@google.com>
> ---
>   drivers/pci/pci-sysfs.c       |   42 ++++++++++++++++++++++++++++++++++++++++-
>   drivers/pci/pcie/aer/aerdrv.c |    3 ++
>   include/linux/pci.h           |    1 +
>   3 files changed, 45 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index a55e248..8c6d525 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1135,6 +1135,35 @@ static ssize_t reset_store(struct device *dev,
>
>   static struct device_attribute reset_attr = __ATTR(reset, 0200, NULL, reset_store);
>
> +static ssize_t AER_panic_show(struct device *dev,
> +				 struct device_attribute *attr, char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	return sprintf(buf, "%d\n", pdev->rp_AER_panic);
> +}
> +
> +static ssize_t AER_panic_store(struct device *dev,
> +				  struct device_attribute *attr,
> +				  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	unsigned long val;
> +
> +	if (kstrtoul(buf, 0,&val)<  0)
> +		return -EINVAL;
> +
> +	if ((val>  1) || (val<  0))
> +		return -EINVAL;
> +
> +	pdev->rp_AER_panic = val;
> +
> +	return count;
> +}
> +
> +static struct device_attribute rp_AER_panic_attr =
> +	      __ATTR(AER_panic, 0600, AER_panic_show, AER_panic_store);
> +
>   static int pci_create_capabilities_sysfs(struct pci_dev *dev)
>   {
>   	int retval;
> @@ -1169,8 +1198,16 @@ static int pci_create_capabilities_sysfs(struct pci_dev *dev)
>   			goto error;
>   		dev->reset_fn = 1;
>   	}
> -	return 0;
>
> +	/* PCIE Root Port panic-on-AER allows a user to configure each root
> +	 * port to panic on an AER error instead of issuing a bus reset.
> +	 */
> +	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
> +		retval = device_create_file(&dev->dev,&rp_AER_panic_attr);
> +		if (retval)
> +			goto error;
> +	}
> +	return 0;
>   error:
>   	pcie_aspm_remove_sysfs_dev_files(dev);
>   	if (dev->vpd&&  dev->vpd->attr) {
> @@ -1279,6 +1316,9 @@ static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
>   		device_remove_file(&dev->dev,&reset_attr);
>   		dev->reset_fn = 0;
>   	}
> +
> +	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT)
> +		device_remove_file(&dev->dev,&rp_AER_panic_attr);
>   }
>
>   /**
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 58ad791..dd6b352 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>   	u32 reg32;
>   	int pos;
>
> +	if (dev->rp_AER_panic)
> +		panic("%s: AER detected on Root Port", pci_name(dev));
> +
It'd be more informative if the Root Port D:B:D.F was printed out above,
so one knows where the errors are coming from the system.  More likely than
not, the root is ok, but a device dangling from it is the 'root cause' (all pun intended)
of the error.

>   	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
>
>   	/* Disable Root's interrupt in response to error messages */
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e444f5b..a4e6a5a 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -324,6 +324,7 @@ struct pci_dev {
>   	unsigned int    is_hotplug_bridge:1;
>   	unsigned int    __aer_firmware_first_valid:1;
>   	unsigned int	__aer_firmware_first:1;
> +	unsigned int	rp_AER_panic:1; /* if 1, panic on AER bus reset */
>   	pci_dev_flags_t dev_flags;
>   	atomic_t	enable_cnt;	/* pci_enable_device has been called */
>

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prarit Bhargava May 17, 2012, 6:54 p.m. UTC | #3
>> +    if (dev->rp_AER_panic)
>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>> +
> It'd be more informative if the Root Port D:B:D.F was printed out above,
> so one knows where the errors are coming from the system.  More likely than
> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
> intended)
> of the error.

That's what pci_name() is.

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Don Dutile May 17, 2012, 7:11 p.m. UTC | #4
On 05/17/2012 02:54 PM, Prarit Bhargava wrote:
>
>>> +    if (dev->rp_AER_panic)
>>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>>> +
>> It'd be more informative if the Root Port D:B:D.F was printed out above,
>> so one knows where the errors are coming from the system.  More likely than
>> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
>> intended)
>> of the error.
>
> That's what pci_name() is.
>
> P.

ok. When I find where it's set, and what values it can be (init_name, vs kobject->name)
at what times in the kernel boot.



--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Betty Dall May 17, 2012, 9:32 p.m. UTC | #5
Prarit Bhargava <prarit <at> redhat.com> writes:

>  /**
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 58ad791..dd6b352 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -346,6 +346,9 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  	u32 reg32;
>  	int pos;
> 
> +	if (dev->rp_AER_panic)
> +		panic("%s: AER detected on Root Port", pci_name(dev));
> +
>  	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
> 


I really like this idea. I just wonder if the panic can happen in do_recovery in 
aerdrv_core.c before the broadcast_error_message() is done that invokes all of 
the error_detected callbacks. It would be best to panic as soon as possible to 
increase error containment. 

If we don't move it to before the broadcast_error_message(), I was also looking 
at if it would be appropriate to put the check for this in the 
default_downstream_reset_link() in aerdrv_core.c too. What do you think?

-Betty Dall

--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prarit Bhargava May 17, 2012, 10:16 p.m. UTC | #6
On 05/17/2012 03:11 PM, Don Dutile wrote:
> On 05/17/2012 02:54 PM, Prarit Bhargava wrote:
>>
>>>> +    if (dev->rp_AER_panic)
>>>> +        panic("%s: AER detected on Root Port", pci_name(dev));
>>>> +
>>> It'd be more informative if the Root Port D:B:D.F was printed out above,
>>> so one knows where the errors are coming from the system.  More likely than
>>> not, the root is ok, but a device dangling from it is the 'root cause' (all pun
>>> intended)
>>> of the error.
>>
>> That's what pci_name() is.
>>
>> P.
> 
> ok. When I find where it's set, and what values it can be (init_name, vs
> kobject->name)
> at what times in the kernel boot.

The device name is init'd in the function pci_setup_device().  It's name is not
changed after that AFAICT.

This is a well-used and well-known interface to get the address of a pci_dev in
the form of a string.  pci_name(), by my count, is used 879 times throughout the
linux kernel.

P.

> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Kroah-Hartman May 18, 2012, 4:51 a.m. UTC | #7
On Thu, May 17, 2012 at 01:04:30PM -0400, Prarit Bhargava wrote:
> Consider the following case
> 
> 		[ RP ]
> 		  |
> 		  |
> 	+---------+-----------+
> 	|	  |	      |
>        [H1]      [H2]        [X1]
> 
> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
> the driver), and X1 is a device with a driver that does not support PCIE
> AER driver error handling.
> 
> If the Root Port takes an error what currently happens is that the
> bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.
> 
> In some cases a user may not wish the system to continue because X1 is
> an unhardened driver.  In these cases, the system should not do a bus reset,
> but rather the system should panic to avoid any further possible data
> corruption.
> 
> This patch implements an AER_panic sysfs entry for each root port which
> a user can modify.  AER_panic = 1, means the system will panic on a
> PCIE error which would have normally resulted in a secondary bus reset.

If you add/modify/delete a sysfs file, you have to also have a
corrisponding patch to Documentation/ABI/ in order to keep things sane.
Please do that as part of this patch the next time you submit it.

thanks,

greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Prarit Bhargava May 18, 2012, 10:26 a.m. UTC | #8
On 05/18/2012 12:51 AM, Greg KH wrote:
> On Thu, May 17, 2012 at 01:04:30PM -0400, Prarit Bhargava wrote:
>> Consider the following case
>>
>> 		[ RP ]
>> 		  |
>> 		  |
>> 	+---------+-----------+
>> 	|	  |	      |
>>        [H1]      [H2]        [X1]
>>
>> where RP is a PCIE Root Port, H1 and H2 are devices with drivers that support
>> PCIE AER driver error handling (ie, they have pci_error_handlers defined in
>> the driver), and X1 is a device with a driver that does not support PCIE
>> AER driver error handling.
>>
>> If the Root Port takes an error what currently happens is that the
>> bus resets and H1 & H2 call their slot_reset functions.  X1 does nothing.
>>
>> In some cases a user may not wish the system to continue because X1 is
>> an unhardened driver.  In these cases, the system should not do a bus reset,
>> but rather the system should panic to avoid any further possible data
>> corruption.
>>
>> This patch implements an AER_panic sysfs entry for each root port which
>> a user can modify.  AER_panic = 1, means the system will panic on a
>> PCIE error which would have normally resulted in a secondary bus reset.
> 
> If you add/modify/delete a sysfs file, you have to also have a
> corrisponding patch to Documentation/ABI/ in order to keep things sane.
> Please do that as part of this patch the next time you submit it.

Sorry about that Greg.  I'll submit a [v2] with that modification.

P.
> 
> thanks,
> 
> greg k-h
--
To unsubscribe from this list: send the line "unsubscribe linux-pci" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index a55e248..8c6d525 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1135,6 +1135,35 @@  static ssize_t reset_store(struct device *dev,
 
 static struct device_attribute reset_attr = __ATTR(reset, 0200, NULL, reset_store);
 
+static ssize_t AER_panic_show(struct device *dev,
+				 struct device_attribute *attr, char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	return sprintf(buf, "%d\n", pdev->rp_AER_panic);
+}
+
+static ssize_t AER_panic_store(struct device *dev,
+				  struct device_attribute *attr,
+				  const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	unsigned long val;
+
+	if (kstrtoul(buf, 0, &val) < 0)
+		return -EINVAL;
+
+	if ((val > 1) || (val < 0))
+		return -EINVAL;
+
+	pdev->rp_AER_panic = val;
+
+	return count;
+}
+
+static struct device_attribute rp_AER_panic_attr =
+	      __ATTR(AER_panic, 0600, AER_panic_show, AER_panic_store);
+
 static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 {
 	int retval;
@@ -1169,8 +1198,16 @@  static int pci_create_capabilities_sysfs(struct pci_dev *dev)
 			goto error;
 		dev->reset_fn = 1;
 	}
-	return 0;
 
+	/* PCIE Root Port panic-on-AER allows a user to configure each root
+	 * port to panic on an AER error instead of issuing a bus reset.
+	 */
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT) {
+		retval = device_create_file(&dev->dev, &rp_AER_panic_attr);
+		if (retval)
+			goto error;
+	}
+	return 0;
 error:
 	pcie_aspm_remove_sysfs_dev_files(dev);
 	if (dev->vpd && dev->vpd->attr) {
@@ -1279,6 +1316,9 @@  static void pci_remove_capabilities_sysfs(struct pci_dev *dev)
 		device_remove_file(&dev->dev, &reset_attr);
 		dev->reset_fn = 0;
 	}
+
+	if (dev->pcie_type == PCI_EXP_TYPE_ROOT_PORT)
+		device_remove_file(&dev->dev, &rp_AER_panic_attr);
 }
 
 /**
diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 58ad791..dd6b352 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -346,6 +346,9 @@  static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	u32 reg32;
 	int pos;
 
+	if (dev->rp_AER_panic)
+		panic("%s: AER detected on Root Port", pci_name(dev));
+
 	pos = pci_find_ext_capability(dev, PCI_EXT_CAP_ID_ERR);
 
 	/* Disable Root's interrupt in response to error messages */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e444f5b..a4e6a5a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -324,6 +324,7 @@  struct pci_dev {
 	unsigned int    is_hotplug_bridge:1;
 	unsigned int    __aer_firmware_first_valid:1;
 	unsigned int	__aer_firmware_first:1;
+	unsigned int	rp_AER_panic:1; /* if 1, panic on AER bus reset */
 	pci_dev_flags_t dev_flags;
 	atomic_t	enable_cnt;	/* pci_enable_device has been called */