[0/1,SRU,M] Intel E810 transmit hang with bonding enabled

Message ID 20240105110256.1455465-1-robert.malz@canonical.com

Message

Robert Malz Jan. 5, 2024, 11:02 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/2036239

[Impact]
   * The issue causes a transmit hang on E810 ports with bonding enabled.
   * Based on the provided logs, the TX hang can last up to a couple of minutes, but in most scenarios the network recovers after the ice driver performs a PF reset (TX hang handler routine).
   * The issue was originally observed during Tempest tests on a newly created OpenStack cluster, where it prevented certification.
[Fix]
  * Initially, Intel engineers proposed a workaround that disables LAG initialization [1].
    This change was tested in an environment where the issue is easily reproduced.
    After multiple iterations, no reproduction was observed.
  * Shortly after, Intel proposed a patch [2] that disables LAG initialization if the NVM does not expose the proper capabilities.
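
The idea behind the patch in [2] can be sketched in plain user-space C as follows. This is only an illustration of the gating logic described above, not the driver code: the names model_fw_caps, model_pf, sriov_lag and model_lag_init are hypothetical stand-ins and do not claim to match the identifiers used in the ice driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the capabilities reported by the NVM/firmware. */
struct model_fw_caps {
	bool sriov_lag; /* assumed: true only if the NVM advertises SR-IOV LAG */
};

/* Hypothetical stand-in for the per-PF driver state. */
struct model_pf {
	struct model_fw_caps caps;
	bool lag_initialized;
};

/*
 * Initialize LAG only when the firmware advertises the capability.
 * Returns true if LAG was initialized, false if initialization was skipped.
 */
static bool model_lag_init(struct model_pf *pf)
{
	assert(pf != NULL);

	if (!pf->caps.sriov_lag) {
		/* Capability not exposed by the NVM: leave LAG disabled. */
		pf->lag_initialized = false;
		return false;
	}

	pf->lag_initialized = true;
	return true;
}
```

With this model, a PF whose capabilities leave sriov_lag unset never reaches the LAG setup path, which is the behaviour the workaround in [1] forced unconditionally.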
[Test Plan]
  * To reproduce the issue, a cluster of more than 20 nodes with Ceph-based storage was used. The problem sometimes manifested while deploying the cluster, and sometimes after deployment, during the Tempest test run.
  * The issue can appear on a random node, which makes reproduction hard to achieve.
  * Multiple stress tests on a single host with a similar configuration did not trigger a reproduction.
[Where problems could occur]
  * All ice drivers with ice_lag_event_handler registered can expose the issue. This handler is not implemented in 20.04.
  * CVL4.2 and older NVM images for E810 do not expose SRIOV LAG capabilities (CVL4.3 wasn't checked), meaning that an NVM with this capability will be released at some point.
    Although the issue is potentially caused by using features without proper FW support [2], we want to take a closer look once NVMs with proper support are introduced.

  [1] - https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2036239/comments/40
  [2] - https://lists.osuosl.org/pipermail/intel-wired-lan/Week-of-Mon-20231211/038588.html 4d50fcdc2476eef94c14c6761073af5667bb43b6


Dave Ertman (1):
  [SRU][M][PATCH 0/1] ice: alter feature support check for SRIOV and LAG

 drivers/net/ethernet/intel/ice/ice.h          |  2 ++
 .../net/ethernet/intel/ice/ice_adminq_cmd.h   |  3 +++
 drivers/net/ethernet/intel/ice/ice_common.c   |  8 ++++++
 drivers/net/ethernet/intel/ice/ice_lag.c      | 25 +++++++++++++++++++
 drivers/net/ethernet/intel/ice/ice_lib.c      |  2 +-
 drivers/net/ethernet/intel/ice/ice_lib.h      |  1 +
 drivers/net/ethernet/intel/ice/ice_type.h     |  2 ++
 7 files changed, 42 insertions(+), 1 deletion(-)

Comments

Tim Gardner Jan. 8, 2024, 3:04 p.m. UTC | #1
On 1/5/24 4:02 AM, Robert Malz wrote:

This appears to be a duplicate with no explanation.