diff mbox series

[RFC,net-next,19/19] devlink: Add Documentation/networking/devlink-health.txt

Message ID 1546266733-9512-20-git-send-email-eranbe@mellanox.com
State RFC, archived
Delegated to: David Miller
Series Devlink health reporting and recovery system

Commit Message

Eran Ben Elisha Dec. 31, 2018, 2:32 p.m. UTC
From: Aya Levin <ayal@mellanox.com>

Add a new file documenting the devlink health mechanism.

Signed-off-by: Aya Levin <ayal@mellanox.com>
---
 Documentation/networking/devlink-health.txt | 86 +++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 Documentation/networking/devlink-health.txt

Comments

Jakub Kicinski Jan. 1, 2019, 1:47 a.m. UTC | #1
On Mon, 31 Dec 2018 16:32:13 +0200, Eran Ben Elisha wrote:
> +Once an error is reported, devlink health performs the following actions:
> +  * A log is sent to the kernel trace events buffer
> +  * Health status and statistics are updated for the reporter instance
> +  * An object dump is taken and saved at the reporter instance (as long as
> +    no other objdump is already stored)
> +  * An auto-recovery attempt is made, depending on:
> +    - Auto-recovery configuration
> +    - Grace period vs. time passed since the last recovery

Would it make sense to store the result of last recovery if it failed?

Eran Ben Elisha Jan. 1, 2019, 10:01 a.m. UTC | #2
On 1/1/2019 3:47 AM, Jakub Kicinski wrote:
> On Mon, 31 Dec 2018 16:32:13 +0200, Eran Ben Elisha wrote:
>> +Once an error is reported, devlink health performs the following actions:
>> +  * A log is sent to the kernel trace events buffer
>> +  * Health status and statistics are updated for the reporter instance
>> +  * An object dump is taken and saved at the reporter instance (as long as
>> +    no other objdump is already stored)
>> +  * An auto-recovery attempt is made, depending on:
>> +    - Auto-recovery configuration
>> +    - Grace period vs. time passed since the last recovery
> 
> Would it make sense to store the result of last recovery if it failed?

We considered it.
After internal discussion we decided that recovery failures should be
reported in the kernel log rather than in the output of the devlink
health show command.
Keep in mind that if a recovery fails, the reporter status is left
as is, since no recovery completed successfully.

>

Patch

diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt
new file mode 100644
index 000000000000..ea8a9cc773a2
--- /dev/null
+++ b/Documentation/networking/devlink-health.txt
@@ -0,0 +1,86 @@ 
+The health mechanism targets real-time alerting, in order to know when
+something bad has happened to a PCI device. Its goals are:
+- Provide alert debug information
+- Self healing
+- If a problem needs vendor support, provide a way to gather all needed
+  debugging information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+A device driver creates a "health reporter" for each error/health type.
+An error/health type can be known/generic (e.g. PCI error, FW error, rx/tx
+error) or unknown (driver specific).
+For each registered health reporter, a driver can issue error/health reports
+asynchronously. All health report handling is done by devlink.
+A device driver can provide specific callbacks for each "health reporter", e.g.
+ - Recovery procedures
+ - Diagnostics and object dump procedures
+ - OOB initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Once an error is reported, devlink health performs the following actions:
+  * A log is sent to the kernel trace events buffer
+  * Health status and statistics are updated for the reporter instance
+  * An object dump is taken and saved at the reporter instance (as long as
+    no other objdump is already stored)
+  * An auto-recovery attempt is made, depending on:
+    - Auto-recovery configuration
+    - Grace period vs. time passed since the last recovery
+
+The user interface:
+The user can access/change each reporter's parameters and driver-specific
+callbacks via devlink, e.g. per error type (per health reporter):
+ - Configure reporter's generic parameters (e.g. disable/enable auto recovery)
+ - Invoke recovery procedure
+ - Run diagnostics
+ - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+  Retrieves status and configuration info per DEV and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+  Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+  Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+  Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
+  Retrieves the last stored objdump. Devlink health
+  saves a single objdump. If an objdump is not already stored by devlink
+  for this reporter, devlink generates a new objdump.
+  Objdump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
+  Clears the last saved objdump file for the specified reporter.
+
+
+                                               netlink
+                                      +--------------------------+
+                                      |                          |
+                                      |            +             |
+                                      |            |             |
+                                      +--------------------------+
+                                                   |request for ops
+                                                   |(diagnose,
+ mlx5_core                             devlink     |recover,
+                                                   |dump)
++--------+                            +--------------------------+
+|        |                            |    reporter|             |
+|        |                            |  +---------v----------+  |
+|        |   ops execution            |  |                    |  |
+|     <----------------------------------+                    |  |
+|        |                            |  |                    |  |
+|        |                            |  + ^------------------+  |
+|        |                            |    | request for ops     |
+|        |                            |    | (recover, dump)     |
+|        |                            |    |                     |
+|        |                            |  +-+------------------+  |
+|        |     health report          |  | health handler     |  |
+|        +------------------------------->                    |  |
+|        |                            |  +--------------------+  |
+|        |     health reporter create |                          |
+|        +---------------------------->                          |
++--------+                            +--------------------------+
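
For readers following the flow described in the document above, here is a
minimal userspace sketch of how a reporter might react to an error report.
It is an illustration under stated assumptions, not the code from this
series: struct reporter, health_report(), objdump_get(), objdump_clear()
and the two recover callbacks are hypothetical names. It also assumes that
the grace period is a minimum time between successful recoveries and that
the first recovery is always allowed; the document does not spell out
either point.

```c
/*
 * Userspace sketch only. Every name here is hypothetical and is not the
 * devlink kernel API proposed by this series.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

struct reporter {
	const char *name;
	bool auto_recover;           /* auto-recovery configuration */
	unsigned long grace_period;  /* assumed: min time between recoveries */
	unsigned long last_recover;  /* time of the last successful recovery */
	bool healthy;
	unsigned int error_count;
	unsigned int recover_count;
	unsigned int dump_id;        /* 0 means no objdump is stored */
	unsigned int dumps_taken;
	int (*recover)(struct reporter *r); /* driver callback, 0 on success */
};

/* Example driver callbacks: one that always succeeds, one that always fails. */
static int recover_ok(struct reporter *r) { (void)r; return 0; }
static int recover_fail(struct reporter *r) { (void)r; return -1; }

/* Return the stored objdump, generating a new one only if none is stored. */
static unsigned int objdump_get(struct reporter *r)
{
	if (r->dump_id == 0)
		r->dump_id = ++r->dumps_taken;
	return r->dump_id;
}

/* Drop the stored objdump so that the next request generates a new one. */
static void objdump_clear(struct reporter *r)
{
	r->dump_id = 0;
}

/* Handle one error report; returns 0 only if auto recovery succeeded. */
static int health_report(struct reporter *r, unsigned long now)
{
	assert(r != NULL);
	/* Stand-in for the kernel trace event. */
	fprintf(stderr, "trace: %s reported an error\n", r->name);
	r->healthy = false;
	r->error_count++;
	objdump_get(r);  /* keeps an already stored dump untouched */

	if (!r->auto_recover || r->recover == NULL)
		return -1;
	if (r->recover_count > 0 && now - r->last_recover < r->grace_period)
		return -1;   /* still inside the grace period */
	if (r->recover(r) != 0)
		return -1;   /* failed recovery: status is left as is */

	r->healthy = true;
	r->recover_count++;
	r->last_recover = now;
	return 0;
}
```

In this sketch a failed recovery leaves the reporter marked unhealthy and
does not advance last_recover, which matches the behaviour described in the
reply above (a failed recover keeps the reporter status as is). Whether a
failed attempt should also count against the grace period is left open here.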