From patchwork Mon Dec 31 14:32:13 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Eran Ben Elisha
X-Patchwork-Id: 1019585
X-Patchwork-Delegate: davem@davemloft.net
From: Eran Ben Elisha
To: netdev@vger.kernel.org, "David S.
 Miller", Jiri Pirko
Cc: Moshe Shemesh, Aya Levin, Eran Ben Elisha, Tal Alon, Ariel Almog
Subject: [PATCH RFC net-next 19/19] devlink: Add Documentation/networking/devlink-health.txt
Date: Mon, 31 Dec 2018 16:32:13 +0200
Message-Id: <1546266733-9512-20-git-send-email-eranbe@mellanox.com>
X-Mailer: git-send-email 1.8.4.3
In-Reply-To: <1546266733-9512-1-git-send-email-eranbe@mellanox.com>
References: <1546266733-9512-1-git-send-email-eranbe@mellanox.com>
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
X-Mailing-List: netdev@vger.kernel.org

From: Aya Levin

This patch adds a new file documenting the devlink health mechanism.

Signed-off-by: Aya Levin
---
 Documentation/networking/devlink-health.txt | 86 +++++++++++++++++++++
 1 file changed, 86 insertions(+)
 create mode 100644 Documentation/networking/devlink-health.txt

diff --git a/Documentation/networking/devlink-health.txt b/Documentation/networking/devlink-health.txt
new file mode 100644
index 000000000000..ea8a9cc773a2
--- /dev/null
+++ b/Documentation/networking/devlink-health.txt
@@ -0,0 +1,86 @@
+The health mechanism is targeted at real-time alerting, in order to know when
+something bad has happened to a PCI device:
+- Provide alert debug information
+- Self-healing
+- If a problem needs vendor support, provide a way to gather all needed
+  debugging information.
+
+The main idea is to unify and centralize driver health reports in the
+generic devlink instance and allow the user to set different
+attributes of the health reporting and recovery procedures.
+
+The devlink health reporter:
+A device driver creates a "health reporter" for each error/health type.
+An error/health type can be known/generic (e.g. PCI, FW or RX/TX error)
+or unknown (driver specific).
+For each registered health reporter a driver can issue error/health reports
+asynchronously. All health report handling is done by devlink.
+A device driver can provide specific callbacks for each "health reporter", e.g.
+  - Recovery procedures
+  - Diagnostics and object dump procedures
+  - OOB initial parameters
+Different parts of the driver can register different types of health reporters
+with different handlers.
+
+Once an error is reported, devlink health performs the following actions:
+  * A log is sent to the kernel trace events buffer
+  * Health status and statistics are updated for the reporter instance
+  * An object dump is taken and saved at the reporter instance (as long as
+    no other objdump is already stored)
+  * An auto-recovery attempt is made, depending on:
+    - Auto-recovery configuration
+    - Grace period vs. time elapsed since the last recovery
+
+The user interface:
+User can access/change each reporter's parameters and driver specific callbacks
+via devlink, e.g. per error type (per health reporter):
+  - Configure reporter's generic parameters (like: disable/enable auto recovery)
+  - Invoke recovery procedure
+  - Run diagnostics
+  - Object dump
+
+The devlink health interface (via netlink):
+DEVLINK_CMD_HEALTH_REPORTER_GET
+  Retrieves status and configuration info per device and reporter.
+DEVLINK_CMD_HEALTH_REPORTER_SET
+  Allows reporter-related configuration setting.
+DEVLINK_CMD_HEALTH_REPORTER_RECOVER
+  Triggers a reporter's recovery procedure.
+DEVLINK_CMD_HEALTH_REPORTER_DIAGNOSE
+  Retrieves diagnostics data from a reporter on a device.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_GET
+  Retrieves the last stored objdump. Devlink health saves a single objdump
+  per reporter. If no objdump is already stored for this reporter, devlink
+  generates a new one.
+  The objdump output is defined by the reporter.
+DEVLINK_CMD_HEALTH_REPORTER_OBJDUMP_CLEAR
+  Clears the last saved objdump file for the specified reporter.
+
+
+                                               netlink
+                                     +--------------------------+
+                                     |                          |
+                                     |            +             |
+                                     |            |             |
+                                     +--------------------------+
+                                                  |request for ops
+                                                  |(diagnose,
+mlx5_core                              devlink    |recover,
+                                                  |dump)
++--------+                           +--------------------------+
+|        |                           |    reporter|             |
+|        |                           |  +---------v----------+  |
+|        |   ops execution           |  |                    |  |
+|    <----------------------------------+                    |  |
+|        |                           |  |                    |  |
+|        |                           |  + ^------------------+  |
+|        |                           |    | request for ops     |
+|        |                           |    | (recover, dump)     |
+|        |                           |    |                     |
+|        |                           |  +-+------------------+  |
+|        |     health report         |  | health handler     |  |
+|        +------------------------------>                    |  |
+|        |                           |  +--------------------+  |
+|        |  health reporter create   |                          |
+|        +--------------------------->                          |
++--------+                           +--------------------------+
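
The report-handling steps and the objdump get/clear semantics described in the document can be sketched as a small userspace model. This is only an illustration, not the devlink implementation: `ReporterModel` and all of its fields and methods are hypothetical names, the trace-event step is omitted, the grace period is assumed to be in seconds, and the rule "skip auto recovery while less than one grace period has elapsed since the last recovery" is one possible reading of the "grace period vs. time passed" condition.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ReporterModel:
    """Hypothetical model of a single health reporter (not a kernel struct)."""

    recover: Optional[Callable[[], bool]] = None  # driver recovery callback
    dump: Optional[Callable[[], str]] = None      # driver objdump callback
    auto_recover: bool = True                     # auto-recovery configuration
    grace_period: float = 0.0                     # assumed unit: seconds
    error_count: int = 0
    recover_count: int = 0
    last_recovery: Optional[float] = None         # time of last recovery attempt
    objdump: Optional[str] = None                 # the single stored objdump

    def report(self, now: float) -> bool:
        """Handle one error report; return True if recovery was attempted."""
        # Update the reporter statistics (the trace event is not modelled).
        self.error_count += 1

        # Take an objdump only if none is already stored.
        if self.objdump is None and self.dump is not None:
            self.objdump = self.dump()

        # Auto recovery depends on the configuration ...
        if not self.auto_recover or self.recover is None:
            return False
        # ... and on the grace period (assumed rule, see the text above).
        if (self.last_recovery is not None
                and now - self.last_recovery < self.grace_period):
            return False

        self.last_recovery = now
        self.recover_count += 1
        self.recover()
        return True

    def objdump_get(self) -> Optional[str]:
        """Return the stored objdump, generating one if none is stored."""
        if self.objdump is None and self.dump is not None:
            self.objdump = self.dump()
        return self.objdump

    def objdump_clear(self) -> None:
        """Drop the stored objdump so that a later error can store a new one."""
        self.objdump = None
```

In this model, two reports that arrive five seconds apart with a ten-second grace period are both counted, but only the first one triggers the recovery callback, and the objdump taken for the first error stays stored until `objdump_clear()` is called.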