
[v2,5/5] Documentation/ABI: Add details of PCI AER statistics

Message ID 20180523175808.28030-6-rajatja@google.com
State Changes Requested
Delegated to: Bjorn Helgaas
Series Expose PCIe AER stats via sysfs

Commit Message

Rajat Jain May 23, 2018, 5:58 p.m. UTC
Add the PCI AER statistics details to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
and provide a pointer to it in
Documentation/PCI/pcieaer-howto.txt

Signed-off-by: Rajat Jain <rajatja@google.com>
---
v2: Move the documentation to Documentation/ABI/

 .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
 Documentation/PCI/pcieaer-howto.txt           |   5 +
 2 files changed, 108 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats

Comments

Oza Pawandeep June 17, 2018, 5:24 a.m. UTC | #1
On 2018-05-23 23:28, Rajat Jain wrote:
> Add the PCI AER statistics details to
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> and provide a pointer to it in
> Documentation/PCI/pcieaer-howto.txt
> 
> Signed-off-by: Rajat Jain <rajatja@google.com>
> ---
> v2: Move the documentation to Documentation/ABI/
> 
>  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>  Documentation/PCI/pcieaer-howto.txt           |   5 +
>  2 files changed, 108 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> new file mode 100644
> index 000000000000..f55c389290ac
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> @@ -0,0 +1,103 @@
> +==========================
> +PCIe Device AER statistics
> +==========================
> +These attributes show up under all the devices that are AER capable. These
> +statistical counters indicate the errors "as seen/reported by the device".
> +Note that this may mean that if an end point is causing problems, the AER
> +counters may increment at its link partner (e.g. root port) because the
> +errors will be "seen" / reported by the link partner and not the
> +problematic end point itself (which may report all counters as 0 as it never
> +saw any problems).
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of correctable errors seen and reported by this
> +		PCI device using ERR_COR.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of uncorrectable fatal errors seen and reported
> +		by this PCI device using ERR_FATAL.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of uncorrectable non-fatal errors seen and reported
> +		by this PCI device using ERR_NONFATAL.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Breakdown of correctable errors seen and reported by this
> +		PCI device using ERR_COR. A sample result looks like this:
> +-----------------------------------------
> +Receiver Error = 0x174
> +Bad TLP = 0x19
> +Bad DLLP = 0x3
> +RELAY_NUM Rollover = 0x0
> +Replay Timer Timeout = 0x1
> +Advisory Non-Fatal = 0x0
> +Corrected Internal Error = 0x0
> +Header Log Overflow = 0x0
> +-----------------------------------------
Why hex display? Decimal is easier to read, as these are counters.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Breakdown of uncorrectable errors seen and reported by this
> +		PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
> +		looks like this:
> +-----------------------------------------
> +Undefined = 0x0
> +Data Link Protocol = 0x0
> +Surprise Down Error = 0x0
> +Poisoned TLP = 0x0
> +Flow Control Protocol = 0x0
> +Completion Timeout = 0x0
> +Completer Abort = 0x0
> +Unexpected Completion = 0x0
> +Receiver Overflow = 0x0
> +Malformed TLP = 0x0
> +ECRC = 0x0
> +Unsupported Request = 0x0
> +ACS Violation = 0x0
> +Uncorrectable Internal Error = 0x0
> +MC Blocked TLP = 0x0
> +AtomicOp Egress Blocked = 0x0
> +TLP Prefix Blocked Error = 0x0
> +-----------------------------------------
> +
> +============================
> +PCIe Rootport AER statistics
> +============================
> +These attributes show up under only the rootports that are AER capable. These
> +indicate the number of error messages as "reported to" the rootport. Please note
> +that the rootports also transmit (internally) the ERR_* messages for errors seen
> +by the internal rootport PCI device, so these counters include them and are
> +thus cumulative of all the error messages on the PCI hierarchy originating
> +at that root port.

What about switches and bridges?
Also, can you give some idea of, e.g., the difference between
dev_total_fatal_errs and rootport_total_fatal_errs (assuming that both
belong to the same pci_dev)?

rootport_total_fatal_errs gives me an idea of how many times things
have failed under this pci_dev, which means the number of downstream link
problems. But I am still trying to make sense of how it could be used,
since we don't have BDF information associated with the number of errors
anywhere (except in the AER print messages).

And for dev_total_fatal_errs: as you mentioned above, with a problematic EP,
say the root port will report the error and increment dev_total_fatal_errs.
Does it also increment rootport_total_fatal_errs in that scenario?

> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of ERR_COR messages reported to rootport.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of ERR_FATAL messages reported to rootport.
> +
> +Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
> +Date:		May 2018
> +Kernel Version: 4.17.0
> +Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> +Description:	Total number of ERR_NONFATAL messages reported to rootport.
> diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
> index acd0dddd6bb8..91b6e677cb8c 100644
> --- a/Documentation/PCI/pcieaer-howto.txt
> +++ b/Documentation/PCI/pcieaer-howto.txt
> @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the device who sends
>  the error message to root port. Pls. refer to pci express specs for
>  other fields.
> 
> +2.4 AER Statistics / Counters
> +
> +When PCIe AER errors are captured, the counters / statistics are also exposed
> +in form of sysfs attributes which are documented at
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> 
>  3. Developer Guide
Rajat Jain June 19, 2018, 12:11 a.m. UTC | #2
Hello,

On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>
> On 2018-05-23 23:28, Rajat Jain wrote:
> > Add the PCI AER statistics details to
> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > and provide a pointer to it in
> > Documentation/PCI/pcieaer-howto.txt
> >
> > Signed-off-by: Rajat Jain <rajatja@google.com>
> > ---
> > v2: Move the documentation to Documentation/ABI/
> >
> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
> >  2 files changed, 108 insertions(+)
> >  create mode 100644
> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > new file mode 100644
> > index 000000000000..f55c389290ac
> > --- /dev/null
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > @@ -0,0 +1,103 @@
> > +==========================
> > +PCIe Device AER statistics
> > +==========================
> > +These attributes show up under all the devices that are AER capable.
> > These
> > +statistical counters indicate the errors "as seen/reported by the
> > device".
> > +Note that this may mean that if an end point is causing problems, the
> > AER
> > +counters may increment at its link partner (e.g. root port) because
> > the
> > +errors will be "seen" / reported by the link partner and not the the
> > +problematic end point itself (which may report all counters as 0 as it
> > never
> > +saw any problems).
> > +
> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> > +Date:                May 2018
> > +Kernel Version: 4.17.0
> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> > +Description: Total number of correctable errors seen and reported by
> > this
> > +             PCI device using ERR_COR.
> > +
> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> > +Date:                May 2018
> > +Kernel Version: 4.17.0
> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> > +Description: Total number of uncorrectable fatal errors seen and
> > reported
> > +             by this PCI device using ERR_FATAL.
> > +
> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> > +Date:                May 2018
> > +Kernel Version: 4.17.0
> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> > +Description: Total number of uncorrectable non-fatal errors seen and
> > reported
> > +             by this PCI device using ERR_NONFATAL.
> > +
> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> > +Date:                May 2018
> > +Kernel Version: 4.17.0
> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> > +Description: Breakdown of of correctable errors seen and reported by
> > this
> > +             PCI device using ERR_COR. A sample result looks like this:
> > +-----------------------------------------
> > +Receiver Error = 0x174
> > +Bad TLP = 0x19
> > +Bad DLLP = 0x3
> > +RELAY_NUM Rollover = 0x0
> > +Replay Timer Timeout = 0x1
> > +Advisory Non-Fatal = 0x0
> > +Corrected Internal Error = 0x0
> > +Header Log Overflow = 0x0
> > +-----------------------------------------
> why hex display ? decimal is easy to read as these are counters.

I have no particular preference. Since these can potentially be large
numbers, I just thought that hex might make the output more
concise. I can change to decimal if that is preferable.
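
For anyone consuming these attributes from userspace, the hex-versus-decimal choice need not matter much. The following is only a sketch, assuming the "Name = value" layout shown in the sample output of this v2 patch (a later revision may change it); parse_aer_breakdown is a hypothetical helper name, not part of any kernel tooling.

```python
def parse_aer_breakdown(text):
    """Parse 'Name = value' lines into a dict of integer counters.

    Assumes the breakdown layout from the sample output in this patch.
    """
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.rpartition("=")
        if not sep:
            continue  # skip blank lines and '-----' separators
        # Base 0 lets int() accept both "0x174" and "372".
        counters[name.strip()] = int(value.strip(), 0)
    return counters


sample = """\
Receiver Error = 0x174
Bad TLP = 0x19
Bad DLLP = 0x3
Replay Timer Timeout = 0x1
"""
print(parse_aer_breakdown(sample)["Receiver Error"])  # 372
```

Because int(value, 0) infers the base from the prefix, the same parser would keep working if the output is switched to decimal.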

> > +
> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> > +Date:                May 2018
> > +Kernel Version: 4.17.0
> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> > +Description: Breakdown of of correctable errors seen and reported by
> > this
> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
> > +             looks like this:
> > +-----------------------------------------
> > +Undefined = 0x0
> > +Data Link Protocol = 0x0
> > +Surprise Down Error = 0x0
> > +Poisoned TLP = 0x0
> > +Flow Control Protocol = 0x0
> > +Completion Timeout = 0x0
> > +Completer Abort = 0x0
> > +Unexpected Completion = 0x0
> > +Receiver Overflow = 0x0
> > +Malformed TLP = 0x0
> > +ECRC = 0x0
> > +Unsupported Request = 0x0
> > +ACS Violation = 0x0
> > +Uncorrectable Internal Error = 0x0
> > +MC Blocked TLP = 0x0
> > +AtomicOp Egress Blocked = 0x0
> > +TLP Prefix Blocked Error = 0x0
> > +-----------------------------------------
> > +
> > +============================
> > +PCIe Rootport AER statistics
> > +============================
> > +These attributes showup under only the rootports that are AER capable.
> > These
> > +indicate the number of error messages as "reported to" the rootport.
> > Please note
> > +that the rootports also transmit (internally) the ERR_* messages for
> > errors seen
> > +by the internal rootport PCI device, so these counters includes them
> > and are
> > +thus cumulative of all the error messages on the PCI hierarchy
> > originating
> > +at that root port.
>
> what about switches and bridges ?

What about them? AIUI, the switches forward the ERR_* messages from
downstream devices to the rootport, like they do with standard
messages. They can potentially generate their own ERR_* messages, and
those would be reported no differently than for other end point devices.

> Also Can you give some idea as e.g what is the difference between
> dev_total_fatal_errs and rootport_total_fatal_errs  (assuming that both
> are same pci_dev.

For a pci_dev representing the rootport:

dev_total_fatal_errs = how many times this PCI device *experienced*
a fatal problem on its own (i.e. either link issues while talking to
its link partner, or some internal errors).

rootport_total_fatal_errs = how many times this rootport was
*informed* about a problem (via ERR_* messages) in the PCI hierarchy
that originates at it (can be any link further downstream). This
includes dev_total_fatal_errs as well, because any errors detected
by the rootport are also reported to itself via ERR_* messages. In
reality, this is just the total number of ERR_FATAL messages received
at the rootport. This sysfs attribute will only exist for root ports.
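
To make the relationship concrete, here is a rough sketch of how a userspace tool might separate the root port's own fatal errors from those signalled from further downstream. It assumes the attribute names and the aer_stats/ directory proposed in this v2 patch, assumes each "total" file holds a single integer (the patch does not show their format), and relies on the statement above that the rootport total includes the port's own errors. The function names are hypothetical.

```python
from pathlib import Path


def read_counter(path):
    """Read one counter file; base 0 accepts hex or decimal text."""
    return int(Path(path).read_text().strip(), 0)


def downstream_fatal_messages(aer_stats_dir):
    """Estimate ERR_FATAL messages that came from below a root port.

    Assumption (from the explanation above): rootport_total_fatal_errs
    counts every ERR_FATAL received at the root port, including those
    for errors the root port detected itself (dev_total_fatal_errs).
    """
    stats = Path(aer_stats_dir)
    own = read_counter(stats / "dev_total_fatal_errs")
    received = read_counter(stats / "rootport_total_fatal_errs")
    return received - own
```

It would be called with the aer_stats directory of a root port, e.g. /sys/bus/pci/devices/<dev>/aer_stats. The result says only how many messages came from downstream, not which device sent them.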

>
> rootport_total_fatal_errs gives me an idea that how many times things
> have been failed under this pci_dev ?

Yes, as above.

> which means num of downstream link problems. but I am still trying to
> make sense as how it could be used,
> since we dont have BDF information associated with the number of errors
> anywhere (except these AER print messages)
>

Agreed. That is a limitation. The challenges are more record keeping,
a more complicated sysfs representation, and, given that PCI devices may
come and go, knowing whether it is still the same device before we collate
its stats.
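
Until something like per-BDF accounting exists, one partial workaround is to walk every device and read its own dev_total_* counters, which at least shows which devices detected errors themselves. This is only a sketch under the same assumptions as the patch (aer_stats/ directory, attribute names as proposed here, one integer per file); collect_dev_totals is a hypothetical name.

```python
from pathlib import Path

# Attribute names as proposed in this v2 patch.
TOTALS = ("dev_total_cor_errs", "dev_total_nonfatal_errs", "dev_total_fatal_errs")


def collect_dev_totals(devices_root="/sys/bus/pci/devices"):
    """Map each device name (its BDF) to its own AER totals.

    Devices without an aer_stats directory are skipped, e.g. because
    they are not AER capable or the kernel lacks these attributes.
    """
    result = {}
    for dev in sorted(Path(devices_root).glob("*")):
        stats = dev / "aer_stats"
        if not stats.is_dir():
            continue
        result[dev.name] = {
            name: int((stats / name).read_text().strip(), 0)
            for name in TOTALS
            if (stats / name).exists()
        }
    return result
```

As the patch text notes, a misbehaving end point may still show all zeros here, because its link partner is the one that sees and reports the errors.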

>
> and dev_total_fatal_errs as you mentioned above that problematic EP,
> then say root-port will report it and increment
> dev_total_fatal_errs ++
> does it also increment root-port_total_fatal_errs ++ in above scenario ?

Yes, as above, it will also increment rootport_total_fatal_errs for the
root port of that hierarchy.

Thanks,

Rajat

Rajat Jain June 19, 2018, 12:32 a.m. UTC | #3
Sorry, correction needed in my statement below:

On Mon, Jun 18, 2018 at 5:11 PM, Rajat Jain <rajatja@google.com> wrote:
> Hello,
>
> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>>
>> On 2018-05-23 23:28, Rajat Jain wrote:
>> > Add the PCI AER statistics details to
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > and provide a pointer to it in
>> > Documentation/PCI/pcieaer-howto.txt
>> >
>> > Signed-off-by: Rajat Jain <rajatja@google.com>
>> > ---
>> > v2: Move the documentation to Documentation/ABI/
>> >
>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>> >  2 files changed, 108 insertions(+)
>> >  create mode 100644
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > new file mode 100644
>> > index 000000000000..f55c389290ac
>> > --- /dev/null
>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > @@ -0,0 +1,103 @@
>> > +==========================
>> > +PCIe Device AER statistics
>> > +==========================
>> > +These attributes show up under all the devices that are AER capable.
>> > These
>> > +statistical counters indicate the errors "as seen/reported by the
>> > device".
>> > +Note that this may mean that if an end point is causing problems, the
>> > AER
>> > +counters may increment at its link partner (e.g. root port) because
>> > the
>> > +errors will be "seen" / reported by the link partner and not the the
>> > +problematic end point itself (which may report all counters as 0 as it
>> > never
>> > +saw any problems).
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_COR.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of uncorrectable fatal errors seen and
>> > reported
>> > +             by this PCI device using ERR_FATAL.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of uncorrectable non-fatal errors seen and
>> > reported
>> > +             by this PCI device using ERR_NONFATAL.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_COR. A sample result looks like this:
>> > +-----------------------------------------
>> > +Receiver Error = 0x174
>> > +Bad TLP = 0x19
>> > +Bad DLLP = 0x3
>> > +RELAY_NUM Rollover = 0x0
>> > +Replay Timer Timeout = 0x1
>> > +Advisory Non-Fatal = 0x0
>> > +Corrected Internal Error = 0x0
>> > +Header Log Overflow = 0x0
>> > +-----------------------------------------
>> why hex display ? decimal is easy to read as these are counters.
>
> Have no particular preference. Since these can be potentially large
> numbers, just had a random thought that hex might make it more
> concise. I can change to decimal if that is preferable.
>
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
>> > +             looks like this:
>> > +-----------------------------------------
>> > +Undefined = 0x0
>> > +Data Link Protocol = 0x0
>> > +Surprise Down Error = 0x0
>> > +Poisoned TLP = 0x0
>> > +Flow Control Protocol = 0x0
>> > +Completion Timeout = 0x0
>> > +Completer Abort = 0x0
>> > +Unexpected Completion = 0x0
>> > +Receiver Overflow = 0x0
>> > +Malformed TLP = 0x0
>> > +ECRC = 0x0
>> > +Unsupported Request = 0x0
>> > +ACS Violation = 0x0
>> > +Uncorrectable Internal Error = 0x0
>> > +MC Blocked TLP = 0x0
>> > +AtomicOp Egress Blocked = 0x0
>> > +TLP Prefix Blocked Error = 0x0
>> > +-----------------------------------------
>> > +
>> > +============================
>> > +PCIe Rootport AER statistics
>> > +============================
>> > +These attributes showup under only the rootports that are AER capable.
>> > These
>> > +indicate the number of error messages as "reported to" the rootport.
>> > Please note
>> > +that the rootports also transmit (internally) the ERR_* messages for
>> > errors seen
>> > +by the internal rootport PCI device, so these counters includes them
>> > and are
>> > +thus cumulative of all the error messages on the PCI hierarchy
>> > originating
>> > +at that root port.
>>
>> what about switches and bridges ?
>
> What about them? AIUI, the switches forward the ERR_ messages from
> downstream devices to the rootport, like they do with standard
> messages. They can potentially generate their own ERR_ message and
> that would be reported no different than other end point devices.
>
>> Also Can you give some idea as e.g what is the difference between
>> dev_total_fatal_errs and rootport_total_fatal_errs  (assuming that both
>> are same pci_dev.
>
> For a pci_dev representing the rootport:
>
> dev_total_fatal_errors = how many times this PCI device *experienced*
> a fatal problem on its own (i.e. either link issues while talking to
> its link partner, or some internal errors).
>
> rootport_total_fatal_errors = how many times this rootport was
> *informed* about a problem (via ERR_* messages) in the PCI hierarchy

Read the above sentence as:
"rootport_total_fatal_errs = how many times this rootport was
 *informed* about a FATAL problem (via ERR_FATAL messages) in the PCI hierarchy"


> that originates at it (can be any link further downstream). This
> includes the dev_total_fatal_errors also, because any errors detected
> by the rootport are also "informed" to itself via ERR_* messages. In
> reality, this is just the total number of ERR_FATAL messages received
> at the rootport. This sysfs attribute will only exist for root ports.
>
>>
>> rootport_total_fatal_errs gives me an idea that how many times things
>> have been failed under this pci_dev ?
>
> Yes, as above.
>
>> which means num of downstream link problems. but I am still trying to
>> make sense as how it could be used,
>> since we dont have BDF information associated with the number of errors
>> anywhere (except these AER print messages)
>>
>
> Agree. That is a limitation. The challenges being more record keeping,
> more complicated sysfs representation, and given that PCI devices may
> come and go, how do we know it is the same device before we collate
> their stats etc.
>
>>
>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>> then say root-port will report it and increment
>> dev_total_fatal_errs ++
>> does it also increment root-port_total_fatal_errs ++ in above scenario ?
>
> Yes, as above, it will also root-port_total_fatal_errs++ for the root
> port of that hierarchy.
>
> Thanks,
>
> Rajat
>
Oza Pawandeep June 19, 2018, 6:03 a.m. UTC | #4
On 2018-06-19 05:41, Rajat Jain wrote:
> Hello,
> 
> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>> 
>> On 2018-05-23 23:28, Rajat Jain wrote:
>> > Add the PCI AER statistics details to
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > and provide a pointer to it in
>> > Documentation/PCI/pcieaer-howto.txt
>> >
>> > Signed-off-by: Rajat Jain <rajatja@google.com>
>> > ---
>> > v2: Move the documentation to Documentation/ABI/
>> >
>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>> >  2 files changed, 108 insertions(+)
>> >  create mode 100644
>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > new file mode 100644
>> > index 000000000000..f55c389290ac
>> > --- /dev/null
>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> > @@ -0,0 +1,103 @@
>> > +==========================
>> > +PCIe Device AER statistics
>> > +==========================
>> > +These attributes show up under all the devices that are AER capable.
>> > These
>> > +statistical counters indicate the errors "as seen/reported by the
>> > device".
>> > +Note that this may mean that if an end point is causing problems, the
>> > AER
>> > +counters may increment at its link partner (e.g. root port) because
>> > the
>> > +errors will be "seen" / reported by the link partner and not the the
>> > +problematic end point itself (which may report all counters as 0 as it
>> > never
>> > +saw any problems).
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_COR.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of uncorrectable fatal errors seen and
>> > reported
>> > +             by this PCI device using ERR_FATAL.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of uncorrectable non-fatal errors seen and
>> > reported
>> > +             by this PCI device using ERR_NONFATAL.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_COR. A sample result looks like this:
>> > +-----------------------------------------
>> > +Receiver Error = 0x174
>> > +Bad TLP = 0x19
>> > +Bad DLLP = 0x3
>> > +RELAY_NUM Rollover = 0x0
>> > +Replay Timer Timeout = 0x1
>> > +Advisory Non-Fatal = 0x0
>> > +Corrected Internal Error = 0x0
>> > +Header Log Overflow = 0x0
>> > +-----------------------------------------
>> why hex display ? decimal is easy to read as these are counters.
> 
> Have no particular preference. Since these can be potentially large
> numbers, just had a random thought that hex might make it more
> concise. I can change to decimal if that is preferable.
> 
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Breakdown of of correctable errors seen and reported by
>> > this
>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
>> > +             looks like this:
>> > +-----------------------------------------
>> > +Undefined = 0x0
>> > +Data Link Protocol = 0x0
>> > +Surprise Down Error = 0x0
>> > +Poisoned TLP = 0x0
>> > +Flow Control Protocol = 0x0
>> > +Completion Timeout = 0x0
>> > +Completer Abort = 0x0
>> > +Unexpected Completion = 0x0
>> > +Receiver Overflow = 0x0
>> > +Malformed TLP = 0x0
>> > +ECRC = 0x0
>> > +Unsupported Request = 0x0
>> > +ACS Violation = 0x0
>> > +Uncorrectable Internal Error = 0x0
>> > +MC Blocked TLP = 0x0
>> > +AtomicOp Egress Blocked = 0x0
>> > +TLP Prefix Blocked Error = 0x0
>> > +-----------------------------------------
>> > +
>> > +============================
>> > +PCIe Rootport AER statistics
>> > +============================
>> > +These attributes showup under only the rootports that are AER capable.
>> > These
>> > +indicate the number of error messages as "reported to" the rootport.
>> > Please note
>> > +that the rootports also transmit (internally) the ERR_* messages for
>> > errors seen
>> > +by the internal rootport PCI device, so these counters includes them
>> > and are
>> > +thus cumulative of all the error messages on the PCI hierarchy
>> > originating
>> > +at that root port.
>> 
>> what about switches and bridges ?
> 
> What about them? AIUI, the switches forward the ERR_ messages from
> downstream devices to the rootport, like they do with standard
> messages. They can potentially generate their own ERR_ message and
> that would be reported no different than other end point devices.


Yes, what I meant to ask is this: the ERR_FATAL message coming from an EP
can be contained by a switch, and the error handling code considers the
error contained by the switch irrespective of AER or DPC, so it will think
that the problem could be with the switch/bridge upstream link.

Hence the pci_dev of the switch is where you should increment your
counters. Of course ERR_FATAL would have traversed up to the RP, but that
doesn't mean that you account for the error there.

> 
>> Also Can you give some idea as e.g what is the difference between
>> dev_total_fatal_errs and rootport_total_fatal_errs  (assuming that 
>> both
>> are same pci_dev.
> 
> For a pci_dev representing the rootport:
> 
> dev_total_fatal_errors = how many times this PCI device *experienced*
> a fatal problem on its own (i.e. either link issues while talking to
> its link partner, or some internal errors).
> 
> rootport_total_fatal_errors = how many times this rootport was
> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
> that originates at it (can be any link further downstream). This
> includes the dev_total_fatal_errors also, because any errors detected
> by the rootport are also "informed" to itself via ERR_* messages. In
> reality, this is just the total number of ERR_FATAL messages received
> at the rootport. This sysfs attribute will only exist for root ports.
> 
>> 
>> rootport_total_fatal_errs gives me an idea that how many times things
>> have been failed under this pci_dev ?
> 
> Yes, as above.
> 
>> which means num of downstream link problems. but I am still trying to
>> make sense as how it could be used,
>> since we dont have BDF information associated with the number of 
>> errors
>> anywhere (except these AER print messages)
>> 
> 
> Agree. That is a limitation. The challenges being more record keeping,
> more complicated sysfs representation, and given that PCI devices may
> come and go, how do we know it is the same device before we collate
> their stats etc.
> 
>> 
>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>> then say root-port will report it and increment
>> dev_total_fatal_errs ++
>> does it also increment root-port_total_fatal_errs ++ in above scenario 
>> ?
> 
> Yes, as above, it will also root-port_total_fatal_errs++ for the root
> port of that hierarchy.
> 
> Thanks,
> 
> Rajat
> 
>> 
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of ERR_COR messages reported to rootport.
>> > +
>> > +Where:               /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>> > +
>> > +Where:
>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>> > +Date:                May 2018
>> > +Kernel Version: 4.17.0
>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>> > +Description: Total number of ERR_NONFATAL messages reported to
>> > rootport.
>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>> > b/Documentation/PCI/pcieaer-howto.txt
>> > index acd0dddd6bb8..91b6e677cb8c 100644
>> > --- a/Documentation/PCI/pcieaer-howto.txt
>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>> > device who sends
>> >  the error message to root port. Pls. refer to pci express specs for
>> >  other fields.
>> >
>> > +2.4 AER Statistics / Counters
>> > +
>> > +When PCIe AER errors are captured, the counters / statistics are also
>> > exposed
>> > +in form of sysfs attributes which are documented at
>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>> >
>> >  3. Developer Guide
Rajat Jain June 19, 2018, 4:31 p.m. UTC | #5
On Mon, Jun 18, 2018 at 11:03 PM,  <poza@codeaurora.org> wrote:
> On 2018-06-19 05:41, Rajat Jain wrote:
>>
>> Hello,
>>
>> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>>>
>>>
>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>> > Add the PCI AER statistics details to
>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > and provide a pointer to it in
>>> > Documentation/PCI/pcieaer-howto.txt
>>> >
>>> > Signed-off-by: Rajat Jain <rajatja@google.com>
>>> > ---
>>> > v2: Move the documentation to Documentation/ABI/
>>> >
>>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>>> >  2 files changed, 108 insertions(+)
>>> >  create mode 100644
>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > new file mode 100644
>>> > index 000000000000..f55c389290ac
>>> > --- /dev/null
>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> > @@ -0,0 +1,103 @@
>>> > +==========================
>>> > +PCIe Device AER statistics
>>> > +==========================
>>> > +These attributes show up under all the devices that are AER capable.
>>> > These
>>> > +statistical counters indicate the errors "as seen/reported by the
>>> > device".
>>> > +Note that this may mean that if an end point is causing problems, the
>>> > AER
>>> > +counters may increment at its link partner (e.g. root port) because
>>> > the
>>> > +errors will be "seen" / reported by the link partner and not the the
>>> > +problematic end point itself (which may report all counters as 0 as it
>>> > never
>>> > +saw any problems).
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of correctable errors seen and reported by
>>> > this
>>> > +             PCI device using ERR_COR.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of uncorrectable fatal errors seen and
>>> > reported
>>> > +             by this PCI device using ERR_FATAL.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>> > reported
>>> > +             by this PCI device using ERR_NONFATAL.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Breakdown of of correctable errors seen and reported by
>>> > this
>>> > +             PCI device using ERR_COR. A sample result looks like
>>> > this:
>>> > +-----------------------------------------
>>> > +Receiver Error = 0x174
>>> > +Bad TLP = 0x19
>>> > +Bad DLLP = 0x3
>>> > +RELAY_NUM Rollover = 0x0
>>> > +Replay Timer Timeout = 0x1
>>> > +Advisory Non-Fatal = 0x0
>>> > +Corrected Internal Error = 0x0
>>> > +Header Log Overflow = 0x0
>>> > +-----------------------------------------
>>> why hex display ? decimal is easy to read as these are counters.
>>
>>
>> Have no particular preference. Since these can be potentially large
>> numbers, just had a random thought that hex might make it more
>> concise. I can change to decimal if that is preferable.
>>
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Breakdown of of correctable errors seen and reported by
>>> > this
>>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>> > result
>>> > +             looks like this:
>>> > +-----------------------------------------
>>> > +Undefined = 0x0
>>> > +Data Link Protocol = 0x0
>>> > +Surprise Down Error = 0x0
>>> > +Poisoned TLP = 0x0
>>> > +Flow Control Protocol = 0x0
>>> > +Completion Timeout = 0x0
>>> > +Completer Abort = 0x0
>>> > +Unexpected Completion = 0x0
>>> > +Receiver Overflow = 0x0
>>> > +Malformed TLP = 0x0
>>> > +ECRC = 0x0
>>> > +Unsupported Request = 0x0
>>> > +ACS Violation = 0x0
>>> > +Uncorrectable Internal Error = 0x0
>>> > +MC Blocked TLP = 0x0
>>> > +AtomicOp Egress Blocked = 0x0
>>> > +TLP Prefix Blocked Error = 0x0
>>> > +-----------------------------------------
>>> > +
>>> > +============================
>>> > +PCIe Rootport AER statistics
>>> > +============================
>>> > +These attributes showup under only the rootports that are AER capable.
>>> > These
>>> > +indicate the number of error messages as "reported to" the rootport.
>>> > Please note
>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>> > errors seen
>>> > +by the internal rootport PCI device, so these counters includes them
>>> > and are
>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>> > originating
>>> > +at that root port.
>>>
>>> what about switches and bridges ?
>>
>>
>> What about them? AIUI, the switches forward the ERR_ messages from
>> downstream devices to the rootport, like they do with standard
>> messages. They can potentially generate their own ERR_ message and
>> that would be reported no different than other end point devices.
>
>
>
> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
> contained by switch
> and the error handling code thinks that, the error is contained by switch
> irrespective of
> AER or DPC, and it will think that the problem could be with Switch/bridge
> upstream link.
>
> hence the pci_dev of the switch where you should be increment your counters.
> of course ER_FATAL would have traversed till RP, but that doesnt meant that
> you account the error there.

In this case, for the pci_dev for the rootport:
- rootport_total_fatal_errs will be incremented (since it will get ERR_FATAL)
- dev_total_fatal_errs will not be incremented.

dev_total_fatal_errs will be incremented only for the PCI device
identified by the "Error Source Identification Register" in the PCIe
spec. Does this help clarify?
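
To make the accounting concrete, here is a toy model in Python. It is only
a sketch of the rule described above, not the kernel code: PciDev and
account_fatal() are made-up names for illustration, and the counter names
are the sysfs attribute names proposed in this patch.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PciDev:
    """Hypothetical stand-in for a pci_dev; it only tracks the counters discussed here."""
    name: str
    stats: Counter = field(default_factory=Counter)


def account_fatal(rootport: PciDev, source: PciDev) -> None:
    """Account one ERR_FATAL message that arrived at `rootport`.

    `source` models the device named by the Error Source Identification
    register, i.e. the device that detected and reported the error.
    """
    # The root port received an ERR_FATAL message, whoever sent it.
    rootport.stats["rootport_total_fatal_errs"] += 1
    # Only the reporting device gets its own "dev" counter bumped.
    source.stats["dev_total_fatal_errs"] += 1


rp = PciDev("rootport")
sw = PciDev("switch-downstream-port")
ep = PciDev("endpoint")

account_fatal(rp, sw)  # switch port noticed a problem on its link to the EP
account_fatal(rp, rp)  # root port detected an error on its own link
```

In this model the root port ends up with rootport_total_fatal_errs = 2 and
dev_total_fatal_errs = 1, the switch port with dev_total_fatal_errs = 1,
and the endpoint with all counters at 0.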

>
>
>>
>>> Also Can you give some idea as e.g what is the difference between
>>> dev_total_fatal_errs and rootport_total_fatal_errs  (assuming that both
>>> are same pci_dev.
>>
>>
>> For a pci_dev representing the rootport:
>>
>> dev_total_fatal_errors = how many times this PCI device *experienced*
>> a fatal problem on its own (i.e. either link issues while talking to
>> its link partner, or some internal errors).
>>
>> rootport_total_fatal_errors = how many times this rootport was
>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>> that originates at it (can be any link further downstream). This
>> includes the dev_total_fatal_errors also, because any errors detected
>> by the rootport are also "informed" to itself via ERR_* messages. In
>> reality, this is just the total number of ERR_FATAL messages received
>> at the rootport. This sysfs attribute will only exist for root ports.
>>
>>>
>>> rootport_total_fatal_errs gives me an idea that how many times things
>>> have been failed under this pci_dev ?
>>
>>
>> Yes, as above.
>>
>>> which means num of downstream link problems. but I am still trying to
>>> make sense as how it could be used,
>>> since we dont have BDF information associated with the number of errors
>>> anywhere (except these AER print messages)
>>>
>>
>> Agree. That is a limitation. The challenges being more record keeping,
>> more complicated sysfs representation, and given that PCI devices may
>> come and go, how do we know it is the same device before we collate
>> their stats etc.
>>
>>>
>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>>> then say root-port will report it and increment
>>> dev_total_fatal_errs ++
>>> does it also increment root-port_total_fatal_errs ++ in above scenario ?
>>
>>
>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
>> port of that hierarchy.
>>
>> Thanks,
>>
>> Rajat
>>
>>>
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>> > +
>>> > +Where:
>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>>> > +Date:                May 2018
>>> > +Kernel Version: 4.17.0
>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>> > rootport.
>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>> > b/Documentation/PCI/pcieaer-howto.txt
>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>> > device who sends
>>> >  the error message to root port. Pls. refer to pci express specs for
>>> >  other fields.
>>> >
>>> > +2.4 AER Statistics / Counters
>>> > +
>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>> > exposed
>>> > +in form of sysfs attributes which are documented at
>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>> >
>>> >  3. Developer Guide
Oza Pawandeep June 21, 2018, 9:19 a.m. UTC | #6
On 2018-06-19 22:01, Rajat Jain wrote:
> On Mon, Jun 18, 2018 at 11:03 PM,  <poza@codeaurora.org> wrote:
>> On 2018-06-19 05:41, Rajat Jain wrote:
>>> 
>>> Hello,
>>> 
>>> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
>>>> 
>>>> 
>>>> On 2018-05-23 23:28, Rajat Jain wrote:
>>>> > Add the PCI AER statistics details to
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > and provide a pointer to it in
>>>> > Documentation/PCI/pcieaer-howto.txt
>>>> >
>>>> > Signed-off-by: Rajat Jain <rajatja@google.com>
>>>> > ---
>>>> > v2: Move the documentation to Documentation/ABI/
>>>> >
>>>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
>>>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
>>>> >  2 files changed, 108 insertions(+)
>>>> >  create mode 100644
>>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > new file mode 100644
>>>> > index 000000000000..f55c389290ac
>>>> > --- /dev/null
>>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> > @@ -0,0 +1,103 @@
>>>> > +==========================
>>>> > +PCIe Device AER statistics
>>>> > +==========================
>>>> > +These attributes show up under all the devices that are AER capable.
>>>> > These
>>>> > +statistical counters indicate the errors "as seen/reported by the
>>>> > device".
>>>> > +Note that this may mean that if an end point is causing problems, the
>>>> > AER
>>>> > +counters may increment at its link partner (e.g. root port) because
>>>> > the
>>>> > +errors will be "seen" / reported by the link partner and not the the
>>>> > +problematic end point itself (which may report all counters as 0 as it
>>>> > never
>>>> > +saw any problems).
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of correctable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_COR.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable fatal errors seen and
>>>> > reported
>>>> > +             by this PCI device using ERR_FATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of uncorrectable non-fatal errors seen and
>>>> > reported
>>>> > +             by this PCI device using ERR_NONFATAL.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_COR. A sample result looks like
>>>> > this:
>>>> > +-----------------------------------------
>>>> > +Receiver Error = 0x174
>>>> > +Bad TLP = 0x19
>>>> > +Bad DLLP = 0x3
>>>> > +RELAY_NUM Rollover = 0x0
>>>> > +Replay Timer Timeout = 0x1
>>>> > +Advisory Non-Fatal = 0x0
>>>> > +Corrected Internal Error = 0x0
>>>> > +Header Log Overflow = 0x0
>>>> > +-----------------------------------------
>>>> why hex display ? decimal is easy to read as these are counters.
>>> 
>>> 
>>> Have no particular preference. Since these can be potentially large
>>> numbers, just had a random thought that hex might make it more
>>> concise. I can change to decimal if that is preferable.
>>> 
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Breakdown of of correctable errors seen and reported by
>>>> > this
>>>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample
>>>> > result
>>>> > +             looks like this:
>>>> > +-----------------------------------------
>>>> > +Undefined = 0x0
>>>> > +Data Link Protocol = 0x0
>>>> > +Surprise Down Error = 0x0
>>>> > +Poisoned TLP = 0x0
>>>> > +Flow Control Protocol = 0x0
>>>> > +Completion Timeout = 0x0
>>>> > +Completer Abort = 0x0
>>>> > +Unexpected Completion = 0x0
>>>> > +Receiver Overflow = 0x0
>>>> > +Malformed TLP = 0x0
>>>> > +ECRC = 0x0
>>>> > +Unsupported Request = 0x0
>>>> > +ACS Violation = 0x0
>>>> > +Uncorrectable Internal Error = 0x0
>>>> > +MC Blocked TLP = 0x0
>>>> > +AtomicOp Egress Blocked = 0x0
>>>> > +TLP Prefix Blocked Error = 0x0
>>>> > +-----------------------------------------
>>>> > +
>>>> > +============================
>>>> > +PCIe Rootport AER statistics
>>>> > +============================
>>>> > +These attributes showup under only the rootports that are AER capable.
>>>> > These
>>>> > +indicate the number of error messages as "reported to" the rootport.
>>>> > Please note
>>>> > +that the rootports also transmit (internally) the ERR_* messages for
>>>> > errors seen
>>>> > +by the internal rootport PCI device, so these counters includes them
>>>> > and are
>>>> > +thus cumulative of all the error messages on the PCI hierarchy
>>>> > originating
>>>> > +at that root port.
>>>> 
>>>> what about switches and bridges ?
>>> 
>>> 
>>> What about them? AIUI, the switches forward the ERR_ messages from
>>> downstream devices to the rootport, like they do with standard
>>> messages. They can potentially generate their own ERR_ message and
>>> that would be reported no different than other end point devices.
>> 
>> 
>> 
>> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
>> contained by switch
>> and the error handling code thinks that, the error is contained by 
>> switch
>> irrespective of
>> AER or DPC, and it will think that the problem could be with 
>> Switch/bridge
>> upstream link.
>> 
>> hence the pci_dev of the switch where you should be increment your 
>> counters.
>> of course ER_FATAL would have traversed till RP, but that doesnt meant 
>> that
>> you account the error there.
> 
> In this case, for the pci_dev for the rootport:
> - rootport_total_fatal_errors will be incremented (since it will get 
> ERR_FATAL)
> - dev_total_fatal_errors will not be incremented.

OK, but my confusion is: should you not be incrementing the counter against
the pci_dev of the switch, and not the RP? The problem was with the
upstream link of the EP (e.g. the switch).

> 
> The dev_total_fatal_errors will be incremented only for the pci device
> identified by the "Error Source Identification Register" in the PCIe
> spec. Does this help clarify?

> 
>> 
>> 
>>> 
>>>> Also Can you give some idea as e.g what is the difference between
>>>> dev_total_fatal_errs and rootport_total_fatal_errs  (assuming that 
>>>> both
>>>> are same pci_dev.
>>> 
>>> 
>>> For a pci_dev representing the rootport:
>>> 
>>> dev_total_fatal_errors = how many times this PCI device *experienced*
>>> a fatal problem on its own (i.e. either link issues while talking to
>>> its link partner, or some internal errors).
>>> 
>>> rootport_total_fatal_errors = how many times this rootport was
>>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
>>> that originates at it (can be any link further downstream). This
>>> includes the dev_total_fatal_errors also, because any errors detected
>>> by the rootport are also "informed" to itself via ERR_* messages. In
>>> reality, this is just the total number of ERR_FATAL messages received
>>> at the rootport. This sysfs attribute will only exist for root ports.
>>> 
>>>> 
>>>> rootport_total_fatal_errs gives me an idea that how many times 
>>>> things
>>>> have been failed under this pci_dev ?
>>> 
>>> 
>>> Yes, as above.
>>> 
>>>> which means num of downstream link problems. but I am still trying 
>>>> to
>>>> make sense as how it could be used,
>>>> since we dont have BDF information associated with the number of 
>>>> errors
>>>> anywhere (except these AER print messages)
>>>> 
>>> 
>>> Agree. That is a limitation. The challenges being more record 
>>> keeping,
>>> more complicated sysfs representation, and given that PCI devices may
>>> come and go, how do we know it is the same device before we collate
>>> their stats etc.
>>> 
>>>> 
>>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
>>>> then say root-port will report it and increment
>>>> dev_total_fatal_errs ++
>>>> does it also increment root-port_total_fatal_errs ++ in above 
>>>> scenario ?
>>> 
>>> 
>>> Yes, as above, it will also root-port_total_fatal_errs++ for the root
>>> port of that hierarchy.
>>> 
>>> Thanks,
>>> 
>>> Rajat
>>> 
>>>> 
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_COR messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
>>>> > +
>>>> > +Where:
>>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
>>>> > +Date:                May 2018
>>>> > +Kernel Version: 4.17.0
>>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
>>>> > +Description: Total number of ERR_NONFATAL messages reported to
>>>> > rootport.
>>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
>>>> > b/Documentation/PCI/pcieaer-howto.txt
>>>> > index acd0dddd6bb8..91b6e677cb8c 100644
>>>> > --- a/Documentation/PCI/pcieaer-howto.txt
>>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
>>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
>>>> > device who sends
>>>> >  the error message to root port. Pls. refer to pci express specs for
>>>> >  other fields.
>>>> >
>>>> > +2.4 AER Statistics / Counters
>>>> > +
>>>> > +When PCIe AER errors are captured, the counters / statistics are also
>>>> > exposed
>>>> > +in form of sysfs attributes which are documented at
>>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>>>> >
>>>> >  3. Developer Guide
Rajat Jain June 22, 2018, 12:45 a.m. UTC | #7
On Thu, Jun 21, 2018 at 2:19 AM <poza@codeaurora.org> wrote:
>
> On 2018-06-19 22:01, Rajat Jain wrote:
> > On Mon, Jun 18, 2018 at 11:03 PM,  <poza@codeaurora.org> wrote:
> >> On 2018-06-19 05:41, Rajat Jain wrote:
> >>>
> >>> Hello,
> >>>
> >>> On Sat, Jun 16, 2018 at 10:24 PM <poza@codeaurora.org> wrote:
> >>>>
> >>>>
> >>>> On 2018-05-23 23:28, Rajat Jain wrote:
> >>>> > Add the PCI AER statistics details to
> >>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > and provide a pointer to it in
> >>>> > Documentation/PCI/pcieaer-howto.txt
> >>>> >
> >>>> > Signed-off-by: Rajat Jain <rajatja@google.com>
> >>>> > ---
> >>>> > v2: Move the documentation to Documentation/ABI/
> >>>> >
> >>>> >  .../testing/sysfs-bus-pci-devices-aer_stats   | 103 ++++++++++++++++++
> >>>> >  Documentation/PCI/pcieaer-howto.txt           |   5 +
> >>>> >  2 files changed, 108 insertions(+)
> >>>> >  create mode 100644
> >>>> > Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> >
> >>>> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > new file mode 100644
> >>>> > index 000000000000..f55c389290ac
> >>>> > --- /dev/null
> >>>> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> > @@ -0,0 +1,103 @@
> >>>> > +==========================
> >>>> > +PCIe Device AER statistics
> >>>> > +==========================
> >>>> > +These attributes show up under all the devices that are AER capable.
> >>>> > These
> >>>> > +statistical counters indicate the errors "as seen/reported by the
> >>>> > device".
> >>>> > +Note that this may mean that if an end point is causing problems, the
> >>>> > AER
> >>>> > +counters may increment at its link partner (e.g. root port) because
> >>>> > the
> >>>> > +errors will be "seen" / reported by the link partner and not the the
> >>>> > +problematic end point itself (which may report all counters as 0 as it
> >>>> > never
> >>>> > +saw any problems).
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of correctable errors seen and reported by
> >>>> > this
> >>>> > +             PCI device using ERR_COR.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of uncorrectable fatal errors seen and
> >>>> > reported
> >>>> > +             by this PCI device using ERR_FATAL.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of uncorrectable non-fatal errors seen and
> >>>> > reported
> >>>> > +             by this PCI device using ERR_NONFATAL.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Breakdown of of correctable errors seen and reported by
> >>>> > this
> >>>> > +             PCI device using ERR_COR. A sample result looks like
> >>>> > this:
> >>>> > +-----------------------------------------
> >>>> > +Receiver Error = 0x174
> >>>> > +Bad TLP = 0x19
> >>>> > +Bad DLLP = 0x3
> >>>> > +RELAY_NUM Rollover = 0x0
> >>>> > +Replay Timer Timeout = 0x1
> >>>> > +Advisory Non-Fatal = 0x0
> >>>> > +Corrected Internal Error = 0x0
> >>>> > +Header Log Overflow = 0x0
> >>>> > +-----------------------------------------
> >>>> why hex display ? decimal is easy to read as these are counters.
> >>>
> >>>
> >>> Have no particular preference. Since these can be potentially large
> >>> numbers, just had a random thought that hex might make it more
> >>> concise. I can change to decimal if that is preferable.
> >>>
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Breakdown of of correctable errors seen and reported by
> >>>> > this
> >>>> > +             PCI device using ERR_FATAL or ERR_NONFATAL. A sample
> >>>> > result
> >>>> > +             looks like this:
> >>>> > +-----------------------------------------
> >>>> > +Undefined = 0x0
> >>>> > +Data Link Protocol = 0x0
> >>>> > +Surprise Down Error = 0x0
> >>>> > +Poisoned TLP = 0x0
> >>>> > +Flow Control Protocol = 0x0
> >>>> > +Completion Timeout = 0x0
> >>>> > +Completer Abort = 0x0
> >>>> > +Unexpected Completion = 0x0
> >>>> > +Receiver Overflow = 0x0
> >>>> > +Malformed TLP = 0x0
> >>>> > +ECRC = 0x0
> >>>> > +Unsupported Request = 0x0
> >>>> > +ACS Violation = 0x0
> >>>> > +Uncorrectable Internal Error = 0x0
> >>>> > +MC Blocked TLP = 0x0
> >>>> > +AtomicOp Egress Blocked = 0x0
> >>>> > +TLP Prefix Blocked Error = 0x0
> >>>> > +-----------------------------------------
> >>>> > +
> >>>> > +============================
> >>>> > +PCIe Rootport AER statistics
> >>>> > +============================
> >>>> > +These attributes showup under only the rootports that are AER capable.
> >>>> > These
> >>>> > +indicate the number of error messages as "reported to" the rootport.
> >>>> > Please note
> >>>> > +that the rootports also transmit (internally) the ERR_* messages for
> >>>> > errors seen
> >>>> > +by the internal rootport PCI device, so these counters includes them
> >>>> > and are
> >>>> > +thus cumulative of all the error messages on the PCI hierarchy
> >>>> > originating
> >>>> > +at that root port.
> >>>>
> >>>> what about switches and bridges ?
> >>>
> >>>
> >>> What about them? AIUI, the switches forward the ERR_ messages from
> >>> downstream devices to the rootport, like they do with standard
> >>> messages. They can potentially generate their own ERR_ message and
> >>> that would be reported no different than other end point devices.
> >>
> >>
> >>
> >> yes, what I meant to ask is; the ERR_FATAL msg coming from EP, can be
> >> contained by switch
> >> and the error handling code thinks that, the error is contained by
> >> switch
> >> irrespective of
> >> AER or DPC, and it will think that the problem could be with
> >> Switch/bridge
> >> upstream link.
> >>
> >> hence the pci_dev of the switch where you should be increment your
> >> counters.
> >> of course ER_FATAL would have traversed till RP, but that doesnt meant
> >> that
> >> you account the error there.
> >
> > In this case, for the pci_dev for the rootport:
> > - rootport_total_fatal_errs will be incremented (since it will get
> > ERR_FATAL)
> > - dev_total_fatal_errs will not be incremented.
>
> ok but my confusion is: should you not be incrementing counter against
> pci_dev of switch ? and not the RP ?
> because the problem was with upstream link of the EP (e.g. switch)

The question is who sent the ERR_* message to the rootport? That is
the guy who noticed the problem, and will most likely be the switch
port in your case. It is this guy whose counter shall be incremented.
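
To make the accounting concrete, here is a minimal sketch of the rule
described above. It is a toy model in Python, not the kernel code: the
PciDev class and report_err_fatal() are hypothetical names used only for
illustration, and only the two counter names come from the patch.

```python
from dataclasses import dataclass


@dataclass
class PciDev:
    """Toy stand-in for a pci_dev; only the fatal counters are modelled."""
    name: str
    dev_total_fatal_errs: int = 0
    rootport_total_fatal_errs: int = 0  # only meaningful for a root port


def report_err_fatal(source: PciDev, root_port: PciDev) -> None:
    """Account for one ERR_FATAL message sent by `source` to `root_port`."""
    # The device that noticed the problem and sent the message is charged.
    source.dev_total_fatal_errs += 1
    # The root port counts every ERR_FATAL message it receives.
    root_port.rootport_total_fatal_errs += 1


rp = PciDev("root_port")
sw = PciDev("switch_port")
ep = PciDev("endpoint")

# The switch port notices a problem on its link to the endpoint.
report_err_fatal(source=sw, root_port=rp)
# The root port detects an error itself and "informs" itself.
report_err_fatal(source=rp, root_port=rp)
```

In this model the endpoint's counters stay at 0 even if it caused the
problem, the switch port's dev_total_fatal_errs is 1, and the root port
ends up with dev_total_fatal_errs = 1 and rootport_total_fatal_errs = 2.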

>
> >
> > The dev_total_fatal_errs will be incremented only for the pci device
> > identified by the "Error Source Identification Register" in the PCIe
> > spec. Does this help clarify?
>
> >
> >>
> >>
> >>>
> >>>> Also, can you give some idea of, e.g., what the difference is between
> >>>> dev_total_fatal_errs and rootport_total_fatal_errs (assuming that
> >>>> both are the same pci_dev)?
> >>>
> >>>
> >>> For a pci_dev representing the rootport:
> >>>
> >>> dev_total_fatal_errs = how many times this PCI device *experienced*
> >>> a fatal problem on its own (i.e. either link issues while talking to
> >>> its link partner, or some internal errors).
> >>>
> >>> rootport_total_fatal_errs = how many times this rootport was
> >>> *informed* about a problem (via ERR_* messages) in the PCI hierarchy
> >>> that originates at it (can be any link further downstream). This
> >>> includes the dev_total_fatal_errs also, because any errors detected
> >>> by the rootport are also "informed" to itself via ERR_* messages. In
> >>> reality, this is just the total number of ERR_FATAL messages received
> >>> at the rootport. This sysfs attribute will only exist for root ports.
> >>>
> >>>>
> >>>> rootport_total_fatal_errs gives me an idea of how many times things
> >>>> have failed under this pci_dev ?
> >>>
> >>>
> >>> Yes, as above.
> >>>
> >>>> which means num of downstream link problems. but I am still trying to
> >>>> make sense of how it could be used, since we don't have BDF information
> >>>> associated with the number of errors anywhere (except these AER print
> >>>> messages)
> >>>>
> >>>
> >>> Agree. That is a limitation. The challenges being more record
> >>> keeping,
> >>> more complicated sysfs representation, and given that PCI devices may
> >>> come and go, how do we know it is the same device before we collate
> >>> their stats etc.
> >>>
> >>>>
> >>>> and dev_total_fatal_errs as you mentioned above that problematic EP,
> >>>> then say root-port will report it and increment
> >>>> dev_total_fatal_errs ++
> >>>> does it also increment root-port_total_fatal_errs ++ in above
> >>>> scenario ?
> >>>
> >>>
> >>> Yes, as above, it will also increment rootport_total_fatal_errs for
> >>> the root port of that hierarchy.
> >>>
> >>> Thanks,
> >>>
> >>> Rajat
> >>>
> >>>>
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of ERR_COR messages reported to rootport.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of ERR_FATAL messages reported to rootport.
> >>>> > +
> >>>> > +Where:
> >>>> > /sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
> >>>> > +Date:                May 2018
> >>>> > +Kernel Version: 4.17.0
> >>>> > +Contact:     linux-pci@vger.kernel.org, rajatja@google.com
> >>>> > +Description: Total number of ERR_NONFATAL messages reported to
> >>>> > rootport.
> >>>> > diff --git a/Documentation/PCI/pcieaer-howto.txt
> >>>> > b/Documentation/PCI/pcieaer-howto.txt
> >>>> > index acd0dddd6bb8..91b6e677cb8c 100644
> >>>> > --- a/Documentation/PCI/pcieaer-howto.txt
> >>>> > +++ b/Documentation/PCI/pcieaer-howto.txt
> >>>> > @@ -73,6 +73,11 @@ In the example, 'Requester ID' means the ID of the
> >>>> > device who sends
> >>>> >  the error message to root port. Pls. refer to pci express specs for
> >>>> >  other fields.
> >>>> >
> >>>> > +2.4 AER Statistics / Counters
> >>>> > +
> >>>> > +When PCIe AER errors are captured, the counters / statistics are also
> >>>> > exposed
> >>>> > +in form of sysfs attributes which are documented at
> >>>> > +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> >>>> >
> >>>> >  3. Developer Guide

Patch

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
new file mode 100644
index 000000000000..f55c389290ac
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -0,0 +1,103 @@ 
+==========================
+PCIe Device AER statistics
+==========================
+These attributes show up under all the devices that are AER capable. These
+statistical counters indicate the errors "as seen/reported by the device".
+Note that this may mean that if an end point is causing problems, the AER
+counters may increment at its link partner (e.g. root port) because the
+errors will be "seen" / reported by the link partner and not the
+problematic end point itself (which may report all counters as 0 as it never
+saw any problems).
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_cor_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of correctable errors seen and reported by this
+		PCI device using ERR_COR.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_fatal_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of uncorrectable fatal errors seen and reported
+		by this PCI device using ERR_FATAL.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_total_nonfatal_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of uncorrectable non-fatal errors seen and reported
+		by this PCI device using ERR_NONFATAL.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_correctable
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Breakdown of correctable errors seen and reported by this
+		PCI device using ERR_COR. A sample result looks like this:
+-----------------------------------------
+Receiver Error = 0x174
+Bad TLP = 0x19
+Bad DLLP = 0x3
+RELAY_NUM Rollover = 0x0
+Replay Timer Timeout = 0x1
+Advisory Non-Fatal = 0x0
+Corrected Internal Error = 0x0
+Header Log Overflow = 0x0
+-----------------------------------------
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/dev_breakdown_uncorrectable
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Breakdown of uncorrectable errors seen and reported by this
+		PCI device using ERR_FATAL or ERR_NONFATAL. A sample result
+		looks like this:
+-----------------------------------------
+Undefined = 0x0
+Data Link Protocol = 0x0
+Surprise Down Error = 0x0
+Poisoned TLP = 0x0
+Flow Control Protocol = 0x0
+Completion Timeout = 0x0
+Completer Abort = 0x0
+Unexpected Completion = 0x0
+Receiver Overflow = 0x0
+Malformed TLP = 0x0
+ECRC = 0x0
+Unsupported Request = 0x0
+ACS Violation = 0x0
+Uncorrectable Internal Error = 0x0
+MC Blocked TLP = 0x0
+AtomicOp Egress Blocked = 0x0
+TLP Prefix Blocked Error = 0x0
+-----------------------------------------
+
+============================
+PCIe Rootport AER statistics
+============================
+These attributes show up only under the rootports that are AER capable. These
+indicate the number of error messages as "reported to" the rootport. Please note
+that the rootports also transmit (internally) the ERR_* messages for errors seen
+by the internal rootport PCI device, so these counters include them and are
+thus cumulative of all the error messages on the PCI hierarchy originating
+at that root port.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_cor_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of ERR_COR messages reported to rootport.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_fatal_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of ERR_FATAL messages reported to rootport.
+
+Where:		/sys/bus/pci/devices/<dev>/aer_stats/rootport_total_nonfatal_errs
+Date:		May 2018
+Kernel Version: 4.17.0
+Contact:	linux-pci@vger.kernel.org, rajatja@google.com
+Description:	Total number of ERR_NONFATAL messages reported to rootport.
diff --git a/Documentation/PCI/pcieaer-howto.txt b/Documentation/PCI/pcieaer-howto.txt
index acd0dddd6bb8..91b6e677cb8c 100644
--- a/Documentation/PCI/pcieaer-howto.txt
+++ b/Documentation/PCI/pcieaer-howto.txt
@@ -73,6 +73,11 @@  In the example, 'Requester ID' means the ID of the device who sends
 the error message to root port. Pls. refer to pci express specs for
 other fields.
 
+2.4 AER Statistics / Counters
+
+When PCIe AER errors are captured, the counters / statistics are also exposed
+in the form of sysfs attributes, which are documented at
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
 
 3. Developer Guide