[SRU,J:linux-bluefield,v2,0/1] Devlink backport: fix race and lock issue

Message ID 20231019163916.4338-1-witu@nvidia.com

Message

William Tu Oct. 19, 2023, 4:39 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2039869

This patch is a follow-up to the previous devlink backport series.
While testing against OFED 2307, we found that devlink reload hangs
the system.

[ 1089.747409] INFO: task devlink:8753 blocked for more than 120 seconds.
[ 1089.760560]       Tainted: G           OE     5.15.0-1027-bluefield #29-Ubuntu
[ 1089.775086] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1089.790829] task:devlink         state:D stack:    0 pid: 8753 ppid:  5090 flags:0x00000004
[ 1089.790838] Call trace:
[ 1089.790840]  __switch_to+0xf8/0x150
[ 1089.790857]  __schedule+0x2b8/0x790
[ 1089.790865]  schedule+0x64/0x140
[ 1089.790870]  schedule_preempt_disabled+0x18/0x24
[ 1089.790874]  __mutex_lock.constprop.0+0x1a0/0x680
[ 1089.790878]  __mutex_lock_slowpath+0x40/0x90
[ 1089.790883]  mutex_lock+0x64/0x70
[ 1089.790887]  devl_lock+0x1c/0x30
[ 1089.790893]  mlx5_detach_device+0x58/0x190 [mlx5_core]
[ 1089.791055]  mlx5_unload_one+0x40/0xe4 [mlx5_core]
[ 1089.791177]  mlx5_devlink_reload_down+0x184/0x270 [mlx5_core]
[ 1089.791318]  devlink_reload+0x214/0x290

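Checking the OFED source code, we found that the missing devl trap group
change also needs to be backported to avoid the deadlock. Without
HAVE_DEVL_TRAP_GROUPS_REGISTER, mlx5_detach_device() calls devl_lock()
itself, which matches the blocked devl_lock() in the trace above; with it,
the function only asserts that the caller already holds the lock.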

void mlx5_detach_device(struct mlx5_core_dev *dev, bool suspend)
{
...
#ifdef HAVE_DEVL_PORT_REGISTER
#ifdef HAVE_DEVL_TRAP_GROUPS_REGISTER
        devl_assert_locked(priv_to_devlink(dev));
#else
        devl_lock(devlink);
#endif /* HAVE_DEVL_TRAP_GROUPS_REGISTER */
#endif /* HAVE_DEVL_PORT_REGISTER */
        mutex_lock(&mlx5_intf_mutex);
#ifdef HAVE_DEVL_PORT_REGISTER
...
}
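
For illustration only, here is a minimal user-space sketch of the
locked/unlocked variant pattern discussed above. It is an analogy, not
kernel or OFED code: a pthread error-checking mutex stands in for the
devlink instance lock, and every fake_* name is hypothetical. It assumes
the reload path already holds the lock when it calls into the driver, as
the trace suggests. The error-checking mutex reports a relock by the
owner as EDEADLK, where the kernel mutex would simply hang.

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <pthread.h>

/* Hypothetical stand-in for a devlink instance and its lock. */
struct fake_devlink {
	pthread_mutex_t lock;
	int trap_groups;
};

static int fake_devlink_init(struct fake_devlink *dl)
{
	pthread_mutexattr_t attr;
	int err;

	err = pthread_mutexattr_init(&attr);
	if (err)
		return err;
	/* ERRORCHECK: relocking by the owner returns EDEADLK instead of hanging. */
	err = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
	if (!err)
		err = pthread_mutex_init(&dl->lock, &attr);
	pthread_mutexattr_destroy(&attr);
	dl->trap_groups = 0;
	return err;
}

/* "Unlocked" variant: the caller must already hold dl->lock. */
static void fake_devl_trap_groups_register(struct fake_devlink *dl, int n)
{
	dl->trap_groups += n;
}

/* "Locked" variant: takes the lock itself, then calls the unlocked one. */
static int fake_devlink_trap_groups_register(struct fake_devlink *dl, int n)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err)
		return err; /* EDEADLK if the caller already holds the lock */
	fake_devl_trap_groups_register(dl, n);
	return pthread_mutex_unlock(&dl->lock);
}

/* Models a reload path that holds the lock while calling into the driver. */
static int fake_reload(struct fake_devlink *dl, int use_unlocked_variant)
{
	int err = pthread_mutex_lock(&dl->lock);

	if (err)
		return err;
	if (use_unlocked_variant) {
		fake_devl_trap_groups_register(dl, 1);
		err = 0;
	} else {
		err = fake_devlink_trap_groups_register(dl, 1);
	}
	pthread_mutex_unlock(&dl->lock);
	return err;
}
```

After fake_devlink_init(), fake_reload(&dl, 0) takes the locked variant
from a context that already holds the lock and fails with EDEADLK, while
fake_reload(&dl, 1) uses the unlocked variant and succeeds.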

v2:
Create new BugLink

Jiri Pirko (1):
  net: devlink: add unlocked variants of devlink_trap*() functions

 include/net/devlink.h |  20 +++++
 net/core/devlink.c    | 180 ++++++++++++++++++++++++++++++++++--------
 2 files changed, 168 insertions(+), 32 deletions(-)

Comments

Bartlomiej Zolnierkiewicz Oct. 20, 2023, 10:19 a.m. UTC | #1
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>

Thibault Ferrante Oct. 20, 2023, 2:57 p.m. UTC | #2
Acked-by: Thibault Ferrante <thibault.ferrante@canonical.com>

Bartlomiej Zolnierkiewicz Oct. 23, 2023, 9:22 a.m. UTC | #3
Applied to jammy:linux-bluefield/master-next. Thanks.

--
Best regards,
Bartlomiej
