Message ID | 20231019163916.4338-1-witu@nvidia.com |
---|---|
Series | Devlink backport: fix race and lock issue |
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>

On Thu, Oct 19, 2023 at 6:39 PM William Tu <witu@nvidia.com> wrote:
>
> BugLink: https://bugs.launchpad.net/bugs/2039869
>
> The patch is a follow-up from the previous devlink backport series.
> We've found that devlink reload hangs the system when testing against
> OFED 2307.
>
> [ 1089.747409] INFO: task devlink:8753 blocked for more than 120 seconds.
> [ 1089.760560] Tainted: G OE 5.15.0-1027-bluefield #29-Ubuntu
> [ 1089.775086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ 1089.790829] task:devlink state:D stack: 0 pid: 8753 ppid: 5090 flags:0x00000004
> [ 1089.790838] Call trace:
> [ 1089.790840] __switch_to+0xf8/0x150
> [ 1089.790857] __schedule+0x2b8/0x790
> [ 1089.790865] schedule+0x64/0x140
> [ 1089.790870] schedule_preempt_disabled+0x18/0x24
> [ 1089.790874] __mutex_lock.constprop.0+0x1a0/0x680
> [ 1089.790878] __mutex_lock_slowpath+0x40/0x90
> [ 1089.790883] mutex_lock+0x64/0x70
> [ 1089.790887] devl_lock+0x1c/0x30
> [ 1089.790893] mlx5_detach_device+0x58/0x190 [mlx5_core]
> [ 1089.791055] mlx5_unload_one+0x40/0xe4 [mlx5_core]
> [ 1089.791177] mlx5_devlink_reload_down+0x184/0x270 [mlx5_core]
> [ 1089.791318] devlink_reload+0x214/0x290
>
> Checking the OFED source code, we found this missing devl trap group
> also need to be backported to avoid deadlock.
>
> void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend)
> {
> ...
> #ifdef HAVE_DEVL_PORT_REGISTER
> #ifdef HAVE_DEVL_TRAP_GROUPS_REGISTER
>         devl_assert_locked(priv_to_devlink(dev));
> #else
>         devl_lock(devlink);
> #endif /* HAVE_DEVL_TRAP_GROUPS_REGISTER */
> #endif /* HAVE_DEVL_PORT_REGISTER */
>         mutex_lock(&mlx5_intf_mutex);
> #ifdef HAVE_DEVL_PORT_REGISTER
>
> v2:
> Create new BugLink
>
> Jiri Pirko (1):
>   net: devlink: add unlocked variants of devling_trap*() functions
>
>  include/net/devlink.h |  20 +++++
>  net/core/devlink.c    | 180 ++++++++++++++++++++++++++++++++++--------
>  2 files changed, 168 insertions(+), 32 deletions(-)
>
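The hang in the quoted trace looks like a self-deadlock: the task blocks in devl_lock() while already inside devlink_reload(). A minimal user-space sketch of that pattern follows, assuming (as the trace suggests, but the thread does not state outright) that the reload path already holds the devlink instance lock when it calls into the driver. Everything named demo_* is hypothetical and only stands in for the kernel functions; an error-checking pthread mutex is used so that the relock returns EDEADLK instead of blocking forever as the kernel mutex does.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for PTHREAD_MUTEX_ERRORCHECK */
#endif
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for the devlink instance and its lock. */
struct demo_devlink {
        pthread_mutex_t lock;
        bool held; /* tracked only so the "assert locked" variant can check it */
};

/* Error-checking mutex: a relock by the owner returns EDEADLK. */
static int demo_devlink_init(struct demo_devlink *dl)
{
        pthread_mutexattr_t attr;
        int err = pthread_mutexattr_init(&attr);

        if (err)
                return err;
        err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
        if (!err)
                err = pthread_mutex_init(&dl->lock, &attr);
        pthread_mutexattr_destroy(&attr);
        dl->held = false;
        return err;
}

static int demo_devl_lock(struct demo_devlink *dl)
{
        int err = pthread_mutex_lock(&dl->lock);

        if (!err)
                dl->held = true;
        return err;
}

static void demo_devl_unlock(struct demo_devlink *dl)
{
        dl->held = false;
        pthread_mutex_unlock(&dl->lock);
}

/* Mirrors the #else branch of the quoted snippet: detach takes the lock itself. */
static int demo_detach_takes_lock(struct demo_devlink *dl)
{
        int err = demo_devl_lock(dl);

        if (err)
                return err; /* EDEADLK when the caller already owns the lock */
        demo_devl_unlock(dl);
        return 0;
}

/* Mirrors the devl_assert_locked() branch: the caller must hold the lock. */
static int demo_detach_caller_locked(struct demo_devlink *dl)
{
        assert(dl->held);
        return 0;
}

/* Assumed shape of the reload path: lock, call into the driver, unlock. */
static int demo_reload(struct demo_devlink *dl,
                       int (*detach)(struct demo_devlink *))
{
        int err = demo_devl_lock(dl);

        if (err)
                return err;
        err = detach(dl);
        demo_devl_unlock(dl);
        return err;
}
```

Calling demo_reload() with demo_detach_takes_lock reports EDEADLK, while the caller-locked variant returns 0. That is the same split the HAVE_DEVL_TRAP_GROUPS_REGISTER check selects between in the quoted mlx5_detach_device().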
Acked-by: Thibault Ferrante <thibault.ferrante@canonical.com>
Applied to jammy:linux-bluefield/master-next. Thanks.

--
Best regards,
Bartlomiej