[SRU,X,0/1] UBUNTU: SAUCE: bnxt_en_bpo: Fix TX timeout during netpoll

Message ID de331296-a914-d8c7-2f25-a180724c5489@canonical.com

Message

Nivedita Singhvi Feb. 22, 2019, 10:20 a.m. UTC
BugLink: http://bugs.launchpad.net/bugs/1814095


[Impact]

The bnxt_en_bpo driver hit TX timeouts, causing network stalls and failures to
send data and heartbeat packets.

The following error was seen once on a 25Gb Broadcom NIC on Xenial, running the
4.4.0-141-generic kernel on an amd64 host under moderate-to-heavy network
traffic:

* The bnxt_en_bpo driver froze on a "TX timed out" error and triggered the
  Netdev Watchdog timer under load.

* From kernel log:
  "NETDEV WATCHDOG: eno2d1 (bnxt_en_bpo): transmit queue 0 timed out"
  See attached kern.log excerpt file for full excerpt of error log.

* Release = Xenial
  Kernel = 4.4.0-141-generic #167
  eno2d1 = Product Name: Broadcom Adv. Dual 25Gb Ethernet

* This caused the driver to reset in order to recover:

  "bnxt_en_bpo 0000:19:00.1 eno2d1: TX timeout detected, starting reset task!"

  driver: bnxt_en_bpo
  version: 1.8.1
  source: ubuntu/bnxt/bnxt.c: bnxt_tx_timeout()

* The loss of connectivity and softirq stall caused other cascading failures
  on the system.

* The bnxt_en_bpo driver is the imported Broadcom driver pulled in to support
  newer Broadcom HW (specific boards), while the bnxt_en module continues to
  support the older HW. The current Linux upstream driver does not compile
  easily with the 4.4 kernel (too many changes).

* This upstream fix to the in-tree bnxt_en driver is a likely solution:
   "bnxt_en: Fix TX timeout during netpoll"
   commit: 73f21c653f930f438d53eed29b5e4c65c8a0f906

  This fix has not been applied to the bnxt_en_bpo driver version, but review of
  the code indicates that it is susceptible to the bug, and the fix would be
  reasonable.


[Test Case]

* Unfortunately, this is not easy to reproduce. Also, it is only seen on
  4.4 kernels with newer Broadcom NICs supported by the bnxt_en_bpo driver.
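
Since the problem cannot be triggered on demand, one way to check whether a
host has hit it is to scan the kernel log for the watchdog signature quoted
under [Impact]. The following is only a minimal sketch: the function name
find_tx_timeouts and the regular expression are illustrative assumptions
modelled on that single quoted log line, and other kernel versions may format
the message differently.

```python
import re

# Pattern modelled on the one watchdog line quoted in this report
# (assumption: other kernels may word or format the message differently).
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(lines, driver="bnxt_en_bpo"):
    """Return (interface, queue) tuples for watchdog timeouts from `driver`."""
    hits = []
    for line in lines:
        match = WATCHDOG_RE.search(line)
        # Keep only timeouts reported against the driver of interest.
        if match and match.group("driver") == driver:
            hits.append((match.group("iface"), int(match.group("queue"))))
    return hits
```

It could be fed the lines of a saved kern.log excerpt, for example
find_tx_timeouts(open(path)) with path pointing at that file. An empty result
only means the signature was not logged, not that the driver is unaffected.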


[Regression Potential]

* The patch is restricted to the bpo driver, with very constrained scope
  - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
  opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
  driver).

* The patch is very small and the backport is fairly minimal and simple.

* The fix has been running on the in-tree driver in upstream mainline as well
  as the Ubuntu Linux in-tree driver. Although the Broadcom driver has a lot of
  lower-level code that is different, this piece is still the same.

Comments

Juerg Haefliger Feb. 26, 2019, 9:37 a.m. UTC | #1
On Fri, 22 Feb 2019 11:20:18 +0100
Nivedita Singhvi <nivedita.singhvi@canonical.com> wrote:

> * The patch is restricted to the bpo driver, with very constrained scope
>   - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
>   opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
>   driver).

4.15 doesn't have this patch.


> * The patch is very small and backport is fairly minimal and simple.
> 
> * The fix has been running on the in-tree driver in upstream mainline as well
>   as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of
>   lower level code that is different, this piece is still the same.

I'm a little reluctant to ACK this given that a) it's an upstream patch that
is applied to an out-of-tree vendor driver and b) the problem can't be
reproduced (or can it)? If I gave you a test kernel with a newer Broadcom
driver, would you be able to do some testing?

The latest Broadcom driver is version 1.9.2 from 01/02/2019 but it doesn't
have the patch that you want.

...Juerg

Nivedita Singhvi Feb. 26, 2019, 11:50 a.m. UTC | #2
On 2/26/19 3:07 PM, Juerg Haefliger wrote:
> On Fri, 22 Feb 2019 11:20:18 +0100
> Nivedita Singhvi <nivedita.singhvi@canonical.com> wrote:
> 
>> * The patch is restricted to the bpo driver, with very constrained scope
>>   - just the newest Broadcom NICs being used by the Xenial 4.4 kernel (as
>>   opposed to the hwe 4.15 etc. kernels, which would have the in-tree fixed
>>   driver).
> 
> 4.15 doesn't have this patch.

That kernel needs patching as well (I had earlier thought that B only needed
the hwe 4.18 kernel). I will submit that against this same bug.

>> * The patch is very small and backport is fairly minimal and simple.
>>
>> * The fix has been running on the in-tree driver in upstream mainline as well
>>   as the Ubuntu Linux in-tree driver, although the Broadcom driver has a lot of
>>   lower level code that is different, this piece is still the same.
> 
> I'm a little reluctant to ACK this given that a) it's an upstream patch that
> is applied to an out-of-tree vendor driver and b) the problem can't be
> reproduced (or can it)? If I gave you a test kernel with a newer Broadcom
> driver, would you be able to do some testing?

Sadly, I don't yet have the specific Broadcom NICs that this driver
supports, so the question is moot for now, and we depend heavily on
users who do have the HW to test it. I would imagine that getting the
full new driver tested will be difficult and take more time.

It is certainly an infrequent to rare occurrence, but I'm reluctant
to quantify that, as we don't have sufficient details about the
workload itself or how long it had been running. The one time the
issue occurred, it had serious consequences, including a Kafka server
outage. As there is a fix and it is upstream, this seemed the safer
option.

There are not yet many people who have the newer Broadcom card
and are also running 4.4 kernels (and hence the bnxt_en_bpo
driver). Hence it's unclear how often this might happen in the
future (more often, I would presume, as more people get
new HW).

> 
> The latest Broadcom driver is version 1.9.2 from 01/02/2019 but it doesn't
> have the patch that you want.

Right, so we'd need to add the same fix to that as well, so it would
not be any safer to pull in the full new Broadcom driver.

I agree that we need a path forward with respect to updating this
driver and I'll ask again if I can get hold of the right HW, but
I suspect that it will not be very soon.

The thinking was that if our options are to a) pull in the full new driver,
b) minimally patch the existing driver, or c) make no change, then (b) would
be the least evil of the options.

Let me know how you want to proceed, and believe me, I fully understand
the reluctance ;).

Brad Figg Feb. 26, 2019, 4:49 p.m. UTC | #3
I've had several discussions with Nivedita and Jay about this patch and I think
it is worth pulling in. The issue being hit cannot be reproduced. Let's go
ahead and take this patch.

Brad
