[SRU,F,0/2] Kernel panic with "refcount_t: underflow" in mlx5 driver (LP: 2019011)

Message ID 20230628100407.153108-1-frank.heimes@canonical.com

Message

Frank Heimes June 28, 2023, 10:04 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2019011

SRU Justification:

[ Impact ]

 * The mlx5 driver is causing a Kernel panic with
   "refcount_t: underflow".

 * This issue occurs during a recovery when the PCI device
   is isolated and thus doesn't respond.

[ Fix ]

 * This issue got solved upstream with
   aaf2e65cac7f aaf2e65cac7f2e1ae729c2fbc849091df9699f96
   "net/mlx5: Fix handling of entry refcount when command
   is not issued to FW" (upstream since 6.1-rc1)

 * Applying aaf2e65cac7f additionally requires a backport of
   b898ce7bccf1 b898ce7bccf13087719c021d829dab607c175246
   "net/mlx5: cmdif, Avoid skipping reclaim pages if FW is
   not accessible" as a prerequisite (upstream since 5.10)

[ Test Plan ]

 * An Ubuntu Server for s390x 20.04 LPAR or z/VM installation
   is needed that has Mellanox cards (RoCE Express 2.1)
   assigned, configured and enabled and that runs a 5.4
   kernel with mlx5 driver.

 * Create some network traffic on one of the RoCE devices
   (interface ens???[d?]) for testing (e.g. with stress-ng).

 * Make sure the module/driver mlx5 is loaded and in use.

 * Trigger a recovery (via the Support Element)
   that will render the adapter (ports) unresponsive
   for a moment and should provoke a similar situation.

 * Alternatively the interface itself can be removed for
   a moment and re-added again (but this may break further
   things on top).

 * Due to the lack of RoCE Express 2.1 hardware,
   the verification will be done by IBM.
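
The "module is loaded and in use" step above can be checked by hand
with lsmod. As a minimal sketch of the same check in C, the helper
below parses text in the /proc/modules format (name, size, use count,
...) and returns the use count of a module. The helper name
module_use_count is made up for illustration, and it assumes the
core module is listed as mlx5_core; this is not part of the patches.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the patch set.
 * Scans text formatted like /proc/modules and returns the use count
 * (third column) of the module called `name`, or -1 if the module is
 * not listed or the line cannot be parsed.
 */
static long module_use_count(const char *text, const char *name)
{
    assert(text != NULL && name != NULL);

    size_t nlen = strlen(name);
    const char *line = text;

    while (line && *line) {
        /* Match the full module name, followed by a space. */
        if (strncmp(line, name, nlen) == 0 && line[nlen] == ' ') {
            unsigned long size;
            long used;

            if (sscanf(line + nlen, " %lu %ld", &size, &used) == 2)
                return used;
            return -1;
        }
        /* Advance to the start of the next line, if any. */
        line = strchr(line, '\n');
        if (line)
            line++;
    }
    return -1;
}
```

A return value greater than zero for "mlx5_core" would indicate that
the module is loaded and referenced; -1 means it is not loaded.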

[ Where problems could occur ]

 * The modifications are limited to the Mellanox mlx5 driver
   only - no other network driver is affected.

 * The prerequisite commit (b898ce7bccf1) can have a bad
   impact on (re-)claiming pages if FW is not accessible,
   which could cause page leaks if done incorrectly.
   But this commit is fairly safe since it has been upstream
   since v5.10.

 * The fix itself (aaf2e65cac7f) mainly changes the
   cmd_work_handler and mlx5_cmd_comp_handler functions
   so that mlx5_cmd_is_down (introduced by b898ce7bccf1)
   is used instead of pci_channel_offline.

 * Actually b898ce7bccf1 started the change from
   pci_channel_offline to mlx5_cmd_is_down,
   but it looks like a few cases
   (in the area of refcount increase/decrease) were missed,
   which are now covered by aaf2e65cac7f.

 * On top of that, the fix ensures that refcounts are always
   properly incremented and decremented to achieve a symmetric
   state for all flows.

 * These changes may have an impact on all cases where the
   mlx5 device is not responding, which can happen in case
   of an offline channel, interface down, reset or recovery.
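
To illustrate why symmetric refcount handling matters, here is a
simplified user-space model in C. It is only a sketch and not the
mlx5 driver code: struct cmd_entry, entry_get, entry_put and
run_command are hypothetical names, and the plain int counter merely
stands in for refcount_t. It assumes that the completion path always
drops one reference, and compares taking the matching reference only
when the device is up with always taking it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a command entry; not the driver's type. */
struct cmd_entry {
    int refcnt;      /* models the entry reference count */
    bool underflow;  /* set when a put is attempted at count zero */
};

static void entry_get(struct cmd_entry *e)
{
    assert(e != NULL);
    e->refcnt++;
}

static void entry_put(struct cmd_entry *e)
{
    assert(e != NULL);
    if (e->refcnt == 0) {
        /* This is the situation reported as "refcount_t: underflow". */
        e->underflow = true;
        return;
    }
    e->refcnt--;
}

/*
 * Models one command submission.
 * device_down: the command is never issued to firmware.
 * always_get:  take the completion reference in every flow (symmetric),
 *              instead of only when the device is up (asymmetric).
 */
static struct cmd_entry run_command(bool device_down, bool always_get)
{
    struct cmd_entry e = { .refcnt = 1, .underflow = false };

    if (always_get || !device_down)
        entry_get(&e);  /* reference held for the completion path */

    entry_put(&e);      /* completion (normal or forced) drops one */
    entry_put(&e);      /* submitter drops its own reference */
    return e;
}
```

In this model, run_command(true, false) ends with the underflow flag
set, while run_command(true, true) ends with a count of zero and no
underflow, which is the symmetric state described above.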

[ Other Info ]

 * A look at the master-next git trees for jammy, kinetic
   and lunar showed that both fixes are already included,
   hence only focal is affected.

Moshe Shemesh (1):
  net/mlx5: Fix handling of entry refcount when command is not issued to
    FW

Saeed Mahameed (1):
  net/mlx5: cmdif, Avoid skipping reclaim pages if FW is not accessible

 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 23 ++++++++++---------
 .../ethernet/mellanox/mlx5/core/pagealloc.c   |  2 +-
 include/linux/mlx5/driver.h                   |  1 +
 3 files changed, 14 insertions(+), 12 deletions(-)

Comments

Tim Gardner July 5, 2023, 5:32 p.m. UTC | #1
Acked-by: Tim Gardner <tim.gardner@canonical.com>
Kleber Souza July 6, 2023, 1:53 p.m. UTC | #2

LGTM.


Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>

Thanks
Roxana Nicolescu July 7, 2023, 1:52 p.m. UTC | #3
Applied to focal:master-next. Thanks!

Roxana