[pull,request,net-next,00/15] Mellanox, mlx5 Firmware devlink health and sw reset

Message ID 20190505003207.1353-1-saeedm@mellanox.com
State Changes Requested
Delegated to: David Miller

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-05-04

Message

Saeed Mahameed May 5, 2019, 12:32 a.m. UTC
Hi Dave,

This series adds support for mlx5 firmware devlink health reporters and
software reset.

We plan to follow up this series with a patch that adds mlx5
documentation under Documentation/networking/mlx5.rst, first thing in
the 5.3 kernel release. It will cover all new mlx5 devlink options and
more.

For more information, please see the tag log below.

Please pull and let me know if there is any problem.

Thanks,
Saeed.

---
The following changes since commit a734d1f4c2fc962ef4daa179e216df84a8ec5f84:

  net: openvswitch: return an error instead of doing BUG_ON() (2019-05-04 01:36:36 -0400)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-updates-2019-05-04

for you to fetch changes up to 30d8b932dcebbcb8c5d1991cab5325c2e3faad6d:

  net/mlx5: Report devlink health on FW fatal issues (2019-05-04 17:22:45 -0700)

----------------------------------------------------------------
mlx5-updates-2019-05-04

Mlx5 devlink health fw reporters and sw reset support

This series provides mlx5 firmware reset support and firmware devlink health
reporters.

1) Add CR-Space access and FW Crdump snapshot support via devlink region_snapshot

2) Issue software reset upon FW asserts

3) Add fw and fw_fatal devlink health reporters to follow fw error indications
with dump and recover procedures, and let the user trigger this functionality.

3.1) fw reporter:
The fw reporter implements diagnose and dump callbacks.
It follows symptoms of a fw error, such as a fw syndrome, by triggering a
fw core dump and storing it, along with any other fw trace, in the dump buffer.
The user can trigger the fw reporter diagnose command at any time to check the
current fw status.

3.2) fw_fatal reporter:
The fw_fatal reporter implements dump and recover callbacks.
It follows fatal error indications with a CR-space dump and a recover flow.
The CR-space dump uses the vsc interface, which remains valid even when the FW
command interface is not functional, as is the case in most FW fatal errors.
The CR-space dump is stored as a memory region snapshot so that it can be read
by address.
The recover function runs the recover flow, which reloads the driver and
triggers a fw reset if needed.
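
As a quick summary of 3.1 and 3.2, the split of callbacks between the two
reporters can be written down as a small table. This is only a toy model in
Python, not driver code: the names REPORTER_CALLBACKS and supports() are
made up for illustration, and only the reporter and callback names come from
the text above.

```python
# Toy model of which devlink health callbacks each mlx5 reporter implements,
# as described in 3.1 and 3.2. The dict and helper names are hypothetical.
REPORTER_CALLBACKS = {
    "fw": {"diagnose", "dump"},
    "fw_fatal": {"dump", "recover"},
}


def supports(reporter: str, op: str) -> bool:
    """Return True if the named reporter implements the given callback."""
    return op in REPORTER_CALLBACKS.get(reporter, set())


if __name__ == "__main__":
    for name in sorted(REPORTER_CALLBACKS):
        ops = ", ".join(sorted(REPORTER_CALLBACKS[name]))
        print(f"{name}: {ops}")
```

Under this model only fw answers a diagnose request and only fw_fatal can
recover, while both can produce a dump.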

Command examples and output:
diagnose data:
assert_var[0] 0xfc3fc043
assert_var[1] 0x0001b41c
assert_var[2] 0x00000000
assert_var[3] 0x00000000
assert_var[4] 0x00000000
assert_exit_ptr 0x008033b4
assert_callra 0x0080365c
fw_ver 16.24.1000
hw_id 0x0000020d
irisc_index 0
synd 0x8: unrecoverable hardware error
ext_synd 0x003d
raw fw_ver 0x101803e8
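
The fw_ver and raw fw_ver lines above appear to encode the same version. A
minimal Python sketch of the decoding follows. The bit layout (major in bits
31:24, minor in bits 23:16, sub-minor in bits 15:0) is inferred only from
these two output lines, not taken from the driver source, and decode_fw_ver()
is a hypothetical helper name.

```python
def decode_fw_ver(raw: int) -> str:
    """Split a 32-bit raw fw_ver into major.minor.subminor.

    Assumed layout (inferred from the sample output, not from the driver):
    bits 31:24 major, bits 23:16 minor, bits 15:0 sub-minor.
    """
    major = (raw >> 24) & 0xFF
    minor = (raw >> 16) & 0xFF
    subminor = raw & 0xFFFF
    return f"{major}.{minor}.{subminor}"


if __name__ == "__main__":
    print(decode_fw_ver(0x101803E8))  # prints 16.24.1000
```

With the sample value, 0x10 is 16, 0x18 is 24 and 0x03e8 is 1000, which
matches the fw_ver line.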

dump traces:
   trace: 0000:82:00.1 [0x69cd6c5283e] 0 [0xb8] dump general info GVMI=0x0001
   trace: 0000:82:00.1 [0x69cd6c53bec] 0 [0xb8] GVMI management info, gvmi_management context:
   trace: 0000:82:00.1 [0x69cd6c55eff] 0 [0xb8] [000]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c5657f] 0 [0xb8] [010]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c56608] 0 [0xb8] [020]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c566ff] 0 [0xb8] [030]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c5677f] 0 [0xb8] [040]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c5687f] 0 [0xb8] [050]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c568ff] 0 [0xb8] [060]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c569a5] 0 [0xb8] [070]:  00000000  00000000  00000000  00000000
   trace: 0000:82:00.1 [0x69cd6c57021] 0 [0xb8] CMDIF dbase from IRON: active_dbase_slots = 0x00000000
   trace: 0000:82:00.1 [0x69cd6c58dae] 0 [0xb8] GVMI=0x0001 hw_toc context:
   trace: 0000:82:00.1 [0x69cd6c58e7f] 0 [0xb8] [000]:  00400100  00000000  00000000  fffff000
   trace: 0000:82:00.1 [0x69cd6c58f7f] 0 [0xb8] [010]:  00000000  00000000  00000000  00000000
...
...
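
For post-processing, the trace lines above can be split into fields with a
regular expression. The sketch below is Python and makes assumptions: the
field names (dev, timestamp, lost, event_id, msg) are a guess at what each
column means and are not stated in this tag log, and parse_trace() is a
hypothetical helper.

```python
import re

# One trace line looks like:
#   trace: 0000:82:00.1 [0x69cd6c5283e] 0 [0xb8] dump general info GVMI=0x0001
# Field names below are assumptions made for illustration.
TRACE_RE = re.compile(
    r"trace:\s+(?P<dev>\S+)\s+"
    r"\[(?P<timestamp>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<lost>\d+)\s+"
    r"\[(?P<event_id>0x[0-9a-fA-F]+)\]\s+"
    r"(?P<msg>.*)"
)


def parse_trace(line: str):
    """Return a dict of fields for one trace line, or None if it does not match."""
    m = TRACE_RE.search(line)
    if m is None:
        return None
    return {
        "dev": m.group("dev"),
        "timestamp": int(m.group("timestamp"), 16),
        "lost": int(m.group("lost")),
        "event_id": int(m.group("event_id"), 16),
        "msg": m.group("msg").rstrip(),
    }


SAMPLE_LINE = (
    "   trace: 0000:82:00.1 [0x69cd6c5283e] 0 [0xb8] "
    "dump general info GVMI=0x0001"
)

if __name__ == "__main__":
    print(parse_trace(SAMPLE_LINE))
```

Lines that do not match, such as the "..." elision markers, return None and
can be skipped by the caller.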

devlink_region_name: cr-space snapshot_id: 1

00000000000f0018 e1 03 00 00 fb ae a9 3f

0000000000000000 00 20 00 01 00 00 00 00 03 00 00 00 00 00 00 00
0000000000000010 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 80
0000000000000050 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000060 00 00 00 00 00 00 00 00 00 00 00 00 de 0a 00 00
0000000000000070 0c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000000000000080 00 00 00 00 00 00 00 00 00 00 00 00 00 00 fa 00
0000000000000090 b6 0b 00 00 00 00 00 00 80 c7 fe ff 50 0a 00 00
...
...
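
Since the snapshot is meant to be read by address, a small Python sketch of
doing that on the text output above may help. It is only an illustration:
parse_rows() and read_at() are hypothetical helpers, the sample rows are
copied from the dump above, and no byte order is assumed (bytes are returned
in the order they are printed).

```python
def parse_rows(text: str) -> dict:
    """Map each row's start address to the bytes printed on that row.

    Expects lines of the form '<16 hex digit address> b0 b1 ...'. Blank lines
    and '...' elision markers are skipped.
    """
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 2 or len(parts[0]) != 16:
            continue
        try:
            rows[int(parts[0], 16)] = bytes(int(p, 16) for p in parts[1:])
        except ValueError:
            continue  # not a hex dump row
    return rows


def read_at(rows: dict, address: int, length: int) -> bytes:
    """Return `length` bytes starting at `address`, if one row covers them."""
    for base, data in rows.items():
        offset = address - base
        if 0 <= offset and offset + length <= len(data):
            return data[offset:offset + length]
    raise KeyError(f"address {address:#x} not present in the parsed rows")


# Two rows copied from the sample output above.
SAMPLE = """\
00000000000f0018 e1 03 00 00 fb ae a9 3f
...
0000000000000060 00 00 00 00 00 00 00 00 00 00 00 00 de 0a 00 00
"""

rows = parse_rows(SAMPLE)

if __name__ == "__main__":
    print(read_at(rows, 0x6C, 4).hex())     # prints de0a0000
    print(read_at(rows, 0xF0018, 4).hex())  # prints e1030000
```

A read that spans two rows, or an address that is not in the parsed text,
raises KeyError in this sketch. A fuller tool would merge adjacent rows first.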

----------------------------------------------------------------
Alex Vesker (3):
      net/mlx5: Add Vendor Specific Capability access gateway
      net/mlx5: Add Crdump FW snapshot support
      net/mlx5: Add support for devlink region_snapshot parameter

Eran Ben Elisha (1):
      net/mlx5: Move all devlink related functions calls to devlink.c

Feras Daoud (3):
      net/mlx5: Handle SW reset of FW in error flow
      net/mlx5: Control CR-space access by different PFs
      net/mlx5: Issue SW reset on FW assert

Moshe Shemesh (8):
      net/mlx5: Refactor print health info
      net/mlx5: Create FW devlink health reporter
      net/mlx5: Add core dump register access functions
      net/mlx5: Add support for FW reporter dump
      net/mlx5: Report devlink health on FW issues
      net/mlx5: Add fw fatal devlink health reporter
      net/mlx5: Add support for FW fatal reporter dump
      net/mlx5: Report devlink health on FW fatal issues

 drivers/net/ethernet/mellanox/mlx5/core/Makefile   |   3 +-
 drivers/net/ethernet/mellanox/mlx5/core/devlink.c  |  72 +++
 drivers/net/ethernet/mellanox/mlx5/core/devlink.h  |  12 +
 .../net/ethernet/mellanox/mlx5/core/diag/crdump.c  | 210 ++++++++
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.c   | 143 +++++
 .../ethernet/mellanox/mlx5/core/diag/fw_tracer.h   |  14 +
 .../net/ethernet/mellanox/mlx5/core/en_selftest.c  |   2 +-
 drivers/net/ethernet/mellanox/mlx5/core/health.c   | 575 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/lib/mlx5.h |   6 +
 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c  | 313 +++++++++++
 .../net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h  |  33 ++
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |  19 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |   8 +-
 include/linux/mlx5/device.h                        |  10 +-
 include/linux/mlx5/driver.h                        |  20 +-
 include/linux/mlx5/mlx5_ifc.h                      |  17 +-
 16 files changed, 1357 insertions(+), 100 deletions(-)
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/devlink.h
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/diag/crdump.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.c
 create mode 100644 drivers/net/ethernet/mellanox/mlx5/core/lib/pci_vsc.h