Message ID | 20190207124422.21952-1-gpiccoli@canonical.com |
---|---|
Series | qlcnic: Firmware aborts/hangs in QLogic NIC |
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
On 2019-02-07 10:44:21, Guilherme G. Piccoli wrote:
> BugLink: http://bugs.launchpad.net/bugs/1815033
>
> [Impact]
>
> * In multi-queue configurations for the qlcnic driver, there is a corner
> case in which TX queue zero is used at the same time for regular data
> transmission by one CPU while another uses the same queue descriptor for
> MAC config.
>
> * When such a "race" indeed happens, it could lead to TX queue zero
> corruption, with the net result being firmware aborts/hangs out of
> nowhere. The following kernel log messages were collected during the
> corruption event:
>
> qlcnic 0000:01:00.0: Pause control frames disabled on all ports
> qlcnic 0000:01:00.0: firmware hang detected
> qlcnic 0000:01:00.0: Dumping hw/fw registers
> PEG_HALT_STATUS1: 0x40001502, PEG_HALT_STATUS2: 0x3de7a0,
> PEG_NET_0_PC: 0x6d268, PEG_NET_1_PC: 0x6d2ac,
> PEG_NET_2_PC: 0x149, PEG_NET_3_PC: 0x6e105,
> PEG_NET_4_PC: 0x1e00b
> [...]
> qlcnic 0000:01:00.0: Detected state change from DEV_NEED_RESET, skipping
> ack check
>
> * The following device is known to suffer from the issue (lspci output),
> although a whole class of devices (named 82XX series by the vendor)
> is susceptible to this:
> 01:00.0 Ethernet controller [0200]: QLogic Corp. cLOM8214 1/10GbE
> Controller [1077:8020]
>
> * The fix is the following patch, present in the mainline kernel as well
> as in supported stable branches:
> c333fa0c4f22 ("qlcnic: fix Tx descriptor corruption on 82xx devices").
> Link to the patch in Linus' tree: http://git.kernel.org/linus/c333fa0c4f22
>
> [Test Case]
>
> * Unfortunately this is not easy to reproduce; we have a user report of
> the issue with a pretty reliable reproducer - the user is running an NFS
> workload on top of the above PCI adapter. His problem goes away with
> the patch proposed here for SRU. His problem happens in both kernels 4.4
> and 4.15, and the patch fixes it for both of them.
> (Notice this is a Bionic-only SRU, since the Ubuntu 4.4 kernel got the
> patch from Greg's supported stable branch).
>
> [Regression Potential]
>
> * The patch scope is restricted to a single driver, and the code itself
> is self-contained - basically a restriction to a specific tx_ring when
> setting filters. There is potential for regressions in this path of
> the driver, which could cause different firmware issues for example,
> but the user testing exhibited great reliability - without the patch
> the issue happens ~6h after machine boot. With the patch the machine
> ran for more than 8 days without issues.
>
> * Also the patch is present in the mainline kernel as well as supported
> stable branches, and is already present in the Ubuntu 4.4 kernel.
>
> Shahed Shaikh (1):
>   qlcnic: fix Tx descriptor corruption on 82xx devices
>
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic.h         |  8 +++++---
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.c |  3 ++-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_83xx_hw.h |  3 ++-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_hw.h      |  3 ++-
>  drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c      | 12 ++++++------
>  5 files changed, 17 insertions(+), 12 deletions(-)
>
> --
> 2.19.2
>
>
> --
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team
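
For reference, the race described under [Impact] can be modelled in plain
userspace C. This is only an illustrative sketch, not the driver code:
struct tx_ring, owner_cpu, init_rings(), pick_ring_racy(), pick_ring_fixed()
and post_filter_cmd() are all hypothetical names, and the "fixed" variant
assumes the patch makes the MAC filter command go through the ring the
caller is already transmitting on, rather than always through ring zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TX_RINGS 4
#define RING_SIZE    8

/* Hypothetical, simplified stand-in for a driver TX ring. */
struct tx_ring {
	uint32_t producer;        /* next free descriptor slot */
	int      owner_cpu;       /* CPU holding this ring's xmit lock, -1 if none */
	uint64_t desc[RING_SIZE]; /* descriptor payloads (here: just a MAC value) */
};

/* Reset every ring to empty and unowned. */
void init_rings(struct tx_ring *rings)
{
	for (size_t i = 0; i < NUM_TX_RINGS; i++) {
		rings[i].producer = 0;
		rings[i].owner_cpu = -1;
		for (size_t j = 0; j < RING_SIZE; j++)
			rings[i].desc[j] = 0;
	}
}

/* Behaviour described in [Impact]: MAC config always targets ring 0,
 * no matter which ring the caller is transmitting on. */
struct tx_ring *pick_ring_racy(struct tx_ring *rings, struct tx_ring *cur)
{
	(void)cur;
	return &rings[0];
}

/* Assumed shape of the fix: reuse the ring the caller already owns. */
struct tx_ring *pick_ring_fixed(struct tx_ring *rings, struct tx_ring *cur)
{
	(void)rings;
	return cur;
}

/* Write a filter command into target. Returns 1 if cpu does not own the
 * ring it is writing to (the write is unsynchronized), else 0. */
int post_filter_cmd(struct tx_ring *target, int cpu, uint64_t mac)
{
	assert(target != NULL);
	int unsynchronized = (target->owner_cpu != cpu);

	target->desc[target->producer] = mac;
	target->producer = (target->producer + 1) % RING_SIZE;
	return unsynchronized;
}
```

With CPU 0 owning ring 0 and CPU 1 owning ring 3, calling
post_filter_cmd(pick_ring_racy(rings, &rings[3]), 1, mac) reports an
unsynchronized write into ring 0, while the pick_ring_fixed() variant stays
on ring 3, which CPU 1 already owns.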