[SRU,J:linux-bluefield,0/7] Devlink backport: Fix mlx5 driver hangs due to mlx5_sf_hw_table_init

Message ID 20231101144951.26198-1-witu@nvidia.com

Message

William Tu Nov. 1, 2023, 2:49 p.m. UTC
Summary:
Machine hangs when loading the OFED 2310 mlx5 driver on BlueField

How to reproduce:
# load the OFED driver

Reason:
BF gets stuck, and the call trace shows "mlx5_sf_hw_table_init+0xf4/0x2d0 [mlx5_core]"

dmesg from minicom:
[ 726.569928] INFO: task systemd-udevd:297 blocked for more than 604 seconds.
[ 726.576895] Tainted: G OE 5.15.0-1029-bluefield #31-Ubuntu
[ 726.584101] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.591913] task:systemd-udevd state:D stack: 0 pid: 297 ppid: 280 flags:0x0000000d
[ 726.600248] Call trace:
[ 726.602680] __switch_to+0xf8/0x150
[ 726.606159] __schedule+0x2b8/0x790
[ 726.609634] schedule+0x64/0x140
[ 726.612850] schedule_preempt_disabled+0x18/0x24
[ 726.617453] __mutex_lock.constprop.0+0x1a0/0x680
[ 726.622141] __mutex_lock_slowpath+0x40/0x90
[ 726.626396] mutex_lock+0x64/0x70
[ 726.629695] devlink_resource_register+0x50/0x1a0
[ 726.634386] mlx5_sf_hw_table_init+0xf4/0x2d0 [mlx5_core]
[ 726.639882] mlx5_init_one_devl_locked+0x1c8/0x784 [mlx5_core]
[ 726.645791] probe_one+0x300/0x5f0 [mlx5_core]
[ 726.650307] local_pci_probe+0x48/0xb4
[ 726.654043] pci_device_probe+0x18c/0x200
[ 726.658039] really_probe+0xd0/0x490
[ 726.661600] __driver_probe_device+0x148/0x190
[ 726.666029] driver_probe_device+0x48/0x180
[ 726.670198] __driver_attach+0x104/0x240
[ 726.674106] bus_for_each_dev+0x78/0xdc
[ 726.677927] driver_attach+0x2c/0x40
[ 726.681486] bus_add_driver+0x154/0x270
[ 726.685307] driver_register+0x80/0x13c
[ 726.689129] __pci_register_driver+0x4c/0x60
[ 726.693386] __init_backport+0xf0/0x1000 [mlx5_core]
[ 726.698425] do_one_initcall+0x4c/0x250
[ 726.702248] do_init_module+0x50/0x260
[ 726.705983] load_module+0x9fc/0xbe0
[ 726.709543] __do_sys_finit_module+0xa8/0x114

How to fix:
This is related to
https://bugs.launchpad.net/ubuntu/+source/linux-bluefield/+bug/2039869
and we need to backport/cherry-pick more patches from the series

Patches are below
Backport: f655dacb59ac net: devlink: remove unused locked functions
Backport: 012ec02ae441 netdevsim: convert driver to use unlocked devlink API during init/fini
Cherry-pick: eb0e9fa2c635 net: devlink: add unlocked variants of devlink_region_create/destroy() functions
SKIP: 72a4c8c94efa mlxsw: convert driver to use unlocked devlink API during init/fini
Backport: 70a2ff89369d net: devlink: add unlocked variants of devlink_dpipe*() functions
Cherry-pick: 755cfa69c4ec net: devlink: add unlocked variants of devlink_sb*() functions
Cherry-pick: c223d6a4bf6d net: devlink: add unlocked variants of devlink_resource*() functions
Cherry-pick: 852e85a704c2 net: devlink: add unlocked variants of devling_trap*() functions
Cherry-pick: e26fde2f5bef net: devlink: avoid false DEADLOCK warning reported by lockdep
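
For context, the "unlocked variants" in the titles above split an API into a function that expects the caller to hold the instance lock and a thin wrapper that takes the lock itself. The sketch below models that split in user-space C with pthreads, under the assumption that this is how the upstream series is structured. The struct and function names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a devlink instance. */
struct instance {
	pthread_mutex_t lock;
	int nresources;
};

/* Unlocked variant: the caller must already hold inst->lock. */
static int resource_register_unlocked(struct instance *inst)
{
	/*
	 * Crude contract check: if trylock succeeds, nobody held the lock.
	 * This only detects "unlocked", not "locked by someone else".
	 */
	if (pthread_mutex_trylock(&inst->lock) == 0) {
		pthread_mutex_unlock(&inst->lock);
		return EPERM;
	}
	inst->nresources++;
	return 0;
}

/* Locked wrapper: for callers that do not hold the lock. */
static int resource_register(struct instance *inst)
{
	int err;

	pthread_mutex_lock(&inst->lock);
	err = resource_register_unlocked(inst);
	pthread_mutex_unlock(&inst->lock);
	return err;
}

/* Init path that holds the lock and therefore uses the unlocked variant. */
static int init_with_lock_held(struct instance *inst)
{
	int err;

	pthread_mutex_lock(&inst->lock);
	err = resource_register_unlocked(inst);
	pthread_mutex_unlock(&inst->lock);
	return err;
}

/* Exercises both paths; returns the resource count, or -1 on error. */
int count_after_both_paths(void)
{
	struct instance inst = { .nresources = 0 };
	int rc = pthread_mutex_init(&inst.lock, NULL);

	assert(rc == 0);
	(void)rc;
	if (resource_register(&inst) != 0)
		return -1;
	if (init_with_lock_held(&inst) != 0)
		return -1;
	pthread_mutex_destroy(&inst.lock);
	return inst.nresources;
}
```

Both paths complete without blocking because the lock is taken exactly once on each of them. A driver whose init path already holds the lock needs the unlocked variant to exist, which is why the extra patches are required here.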

Thanks!

Jiri Pirko (6):
  net: devlink: add unlocked variants of devlink_resource*() functions
  net: devlink: add unlocked variants of devlink_sb*() functions
  net: devlink: add unlocked variants of devlink_dpipe*() functions
  net: devlink: add unlocked variants of devlink_region_create/destroy()
    functions
  netdevsim: convert driver to use unlocked devlink API during init/fini
  net: devlink: remove unused locked functions

Moshe Shemesh (1):
  net: devlink: avoid false DEADLOCK warning reported by lockdep

 drivers/net/netdevsim/dev.c |  92 +++----
 drivers/net/netdevsim/fib.c |  62 ++---
 include/net/devlink.h       |  60 ++--
 net/core/devlink.c          | 534 ++++++++++++++++++++----------------
 4 files changed, 421 insertions(+), 327 deletions(-)

Comments

Bartlomiej Zolnierkiewicz Nov. 2, 2023, 10:52 a.m. UTC | #1
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>

Please also include the BugLink in the cover letter in future submissions.

--
Best regards,
Bartlomiej

Roxana Nicolescu Nov. 2, 2023, 11:09 a.m. UTC | #2
Acked-by: Roxana Nicolescu <roxana.nicolescu@canonical.com>

Bartlomiej Zolnierkiewicz Nov. 2, 2023, 11:43 a.m. UTC | #3
Applied to jammy:linux-bluefield/master-next. Thanks.

--
Best regards,
Bartlomiej
