
[v2,net-next,00/11] net: qed/qede: critical hw error handling

Message ID 20200514095727.1361-1-irusskikh@marvell.com

Message

Igor Russkikh May 14, 2020, 9:57 a.m. UTC
FastLinQ devices, being complex systems, may observe various
hardware-level error conditions, both severe and recoverable.

The driver is able to detect and report these, but so far it has only
done trace/dmesg-based reporting.

Here we implement extended hw error detection; a service task
handler captures a dump for later analysis.

I also resubmit a patch from Denis Bolotin on the tx timeout handler,
addressing David's comment regarding the recovery procedure as an extra
reaction to this event.

v2:

Removed the patch with the ethtool dump and udev magic. It's quite isolated;
I'm working on devlink-based logic for this separately.

v1:

https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irusskikh@marvell.com/

Denis Bolotin (1):
  net: qede: Implement ndo_tx_timeout

Igor Russkikh (10):
  net: qed: adding hw_err states and handling
  net: qede: add hw err scheduled handler
  net: qed: invoke err notify on critical areas
  net: qed: critical err reporting to management firmware
  net: qed: cleanup debug related declarations
  net: qed: attention clearing properties
  net: qede: optional hw recovery procedure
  net: qed: introduce critical fan failure handler
  net: qed: introduce critical hardware error handler
  net: qed: fix bad formatting

 drivers/net/ethernet/qlogic/qed/qed.h         |  16 +-
 drivers/net/ethernet/qlogic/qed/qed_debug.c   |  26 +-
 drivers/net/ethernet/qlogic/qed/qed_dev.c     |   4 +-
 drivers/net/ethernet/qlogic/qed/qed_hsi.h     |  49 +++-
 drivers/net/ethernet/qlogic/qed/qed_hw.c      |  42 ++-
 drivers/net/ethernet/qlogic/qed/qed_hw.h      |  15 ++
 drivers/net/ethernet/qlogic/qed/qed_int.c     |  40 ++-
 drivers/net/ethernet/qlogic/qed/qed_int.h     |  11 +
 drivers/net/ethernet/qlogic/qed/qed_main.c    |  34 +++
 drivers/net/ethernet/qlogic/qed/qed_mcp.c     | 254 ++++++++++++++++++
 drivers/net/ethernet/qlogic/qed/qed_mcp.h     |  28 ++
 drivers/net/ethernet/qlogic/qed/qed_spq.c     |  16 +-
 drivers/net/ethernet/qlogic/qede/qede.h       |  14 +-
 .../net/ethernet/qlogic/qede/qede_ethtool.c   |  24 ++
 drivers/net/ethernet/qlogic/qede/qede_main.c  | 147 +++++++++-
 include/linux/qed/qed_if.h                    |  26 +-
 16 files changed, 700 insertions(+), 46 deletions(-)

Comments

Jakub Kicinski May 14, 2020, 7:06 p.m. UTC | #1
On Thu, 14 May 2020 12:57:16 +0300 Igor Russkikh wrote:
> FastLinQ devices, being complex systems, may observe various
> hardware-level error conditions, both severe and recoverable.
> 
> The driver is able to detect and report these, but so far it has only
> done trace/dmesg-based reporting.
> 
> Here we implement extended hw error detection; a service task
> handler captures a dump for later analysis.
> 
> I also resubmit a patch from Denis Bolotin on the tx timeout handler,
> addressing David's comment regarding the recovery procedure as an extra
> reaction to this event.
> 
> v2:
> 
> Removed the patch with the ethtool dump and udev magic. It's quite isolated;
> I'm working on devlink-based logic for this separately.
> 
> v1:
> 
> https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irusskikh@marvell.com/

I'm not 100% happy that the debug data gets reported to the management
FW before the devlink health code is in place. For the Linux community,
I think, having standard Linux interfaces implemented first is the
priority.
Igor Russkikh May 14, 2020, 7:40 p.m. UTC | #2
> I'm not 100% happy that the debug data gets reported to the management
> FW before the devlink health code is in place. For the Linux community,
> I think, having standard Linux interfaces implemented first is the
> priority.

Hi Jakub,

Thanks for the comment. I feel these two are a bit separate. We try to push
important messages to the MFW, not debug data. And all these messages are
also reported via device-level error printouts, so they are not lost.

And for devlink, we will need all the above infrastructure anyway to
eventually implement devlink dumps and other features.

Or maybe I didn't get your point?

Thanks,
  Igor
David Miller May 14, 2020, 8:01 p.m. UTC | #3
From: Igor Russkikh <irusskikh@marvell.com>
Date: Thu, 14 May 2020 12:57:16 +0300

> FastLinQ devices, being complex systems, may observe various
> hardware-level error conditions, both severe and recoverable.
> 
> The driver is able to detect and report these, but so far it has only
> done trace/dmesg-based reporting.
> 
> Here we implement extended hw error detection; a service task
> handler captures a dump for later analysis.
> 
> I also resubmit a patch from Denis Bolotin on the tx timeout handler,
> addressing David's comment regarding the recovery procedure as an extra
> reaction to this event.
> 
> v2:
> 
> Removed the patch with the ethtool dump and udev magic. It's quite isolated;
> I'm working on devlink-based logic for this separately.
> 
> v1:
> 
> https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irusskikh@marvell.com/

I'm only applying this series because I trust that you will actually do the
devlink work, and you will have it done and submitted in a reasonable amount
of time.

Also, patch #4 had trailing empty lines added to a file, which is
warned about by 'git' when I apply your patches.  I fixed it up, but
this is the kind of thing you should have sorted out before you submit
changes to the list.

Thank you.
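The trailing-blank-line problem David points out can be caught before posting with git's own whitespace checks, e.g. `git diff --check`. A throwaway-repo sketch:

```shell
# Demonstrate git's "new blank line at EOF" check in a scratch repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
printf 'line one\n' > file.c
git add file.c
git -c user.name=tester -c user.email=t@example.com commit -q -m init
printf '\n\n' >> file.c                  # append trailing blank lines
git diff --check || echo "whitespace problem detected"
```

Running `git diff --check` (or `git log --check` on already-committed work) before `git format-patch` catches these before a maintainer does.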
Jakub Kicinski May 14, 2020, 8:02 p.m. UTC | #4
On Thu, 14 May 2020 22:40:12 +0300 Igor Russkikh wrote:
> > I'm not 100% happy that the debug data gets reported to the management
> > FW before the devlink health code is in place. For the Linux community,
> > I think, having standard Linux interfaces implemented first is the
> > priority.  
> 
> Hi Jakub,
> 
> Thanks for the comment. I feel these two are a bit separate. We try to push
> important messages to the MFW, not debug data. And all these messages are
> also reported via device-level error printouts, so they are not lost.
> 
> And for devlink, we will need all the above infrastructure anyway to
> eventually implement devlink dumps and other features.
> 
> Or maybe I didn't get your point?

That's fine, I'm just saying - I hope the devlink part doesn't take too
long to implement :)
Igor Russkikh May 14, 2020, 8:09 p.m. UTC | #5
On 14/05/2020 11:02 pm, Jakub Kicinski wrote:

> That's fine, I'm just saying - I hope the devlink part doesn't take too
> long to implement :)
> 

> I'm only applying this series because I trust that you will actually do the
> devlink work, and you will have it done and submitted in a reasonable
> amount of time.

I see. Thanks Jakub, David. Already working hard on the devlink side.

> Also, patch #4 had trailing empty lines added to a file, which is
> warned about by 'git' when I apply your patches.  I fixed it up, but
> this is the kind of thing you should have sorted out before you submit
> changes to the list.

Sorry for that miss, will do so in the future.

Thanks
  Igor
David Miller May 14, 2020, 8:09 p.m. UTC | #6
From: David Miller <davem@davemloft.net>
Date: Thu, 14 May 2020 13:01:59 -0700 (PDT)

> From: Igor Russkikh <irusskikh@marvell.com>
> Date: Thu, 14 May 2020 12:57:16 +0300
> 
>> FastLinQ devices, being complex systems, may observe various
>> hardware-level error conditions, both severe and recoverable.
>> 
>> The driver is able to detect and report these, but so far it has only
>> done trace/dmesg-based reporting.
>> 
>> Here we implement extended hw error detection; a service task
>> handler captures a dump for later analysis.
>> 
>> I also resubmit a patch from Denis Bolotin on the tx timeout handler,
>> addressing David's comment regarding the recovery procedure as an extra
>> reaction to this event.
>> 
>> v2:
>> 
>> Removed the patch with the ethtool dump and udev magic. It's quite isolated;
>> I'm working on devlink-based logic for this separately.
>> 
>> v1:
>> 
>> https://patchwork.ozlabs.org/project/netdev/cover/cover.1588758463.git.irusskikh@marvell.com/
> 
> I'm only applying this series because I trust that you will actually do the
> devlink work, and you will have it done and submitted in a reasonable amount
> of time.
> 
> Also, patch #4 had trailing empty lines added to a file, which is
> warned about by 'git' when I apply your patches.  I fixed it up, but
> this is the kind of thing you should have sorted out before you submit
> changes to the list.

Actually, I had to revert; please fix these warnings (with gcc-10.1.1 on Fedora):

drivers/net/ethernet/qlogic/qed/qed_dev.c: In function ‘qed_llh_add_mac_filter’:
./include/linux/printk.h:303:2: warning: ‘abs_ppfid’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  303 |  printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
      |  ^~~~~~
drivers/net/ethernet/qlogic/qed/qed_dev.c:983:17: note: ‘abs_ppfid’ was declared here
  983 |  u8 filter_idx, abs_ppfid;
      |                 ^~~~~~~~~
David Miller May 14, 2020, 8:22 p.m. UTC | #7
From: David Miller <davem@davemloft.net>
Date: Thu, 14 May 2020 13:09:52 -0700 (PDT)

> Actually, I had to revert; please fix these warnings (with gcc-10.1.1 on Fedora):
> 
> drivers/net/ethernet/qlogic/qed/qed_dev.c: In function ‘qed_llh_add_mac_filter’:
> ./include/linux/printk.h:303:2: warning: ‘abs_ppfid’ may be used uninitialized in this function [-Wmaybe-uninitialized]
>   303 |  printk(KERN_NOTICE pr_fmt(fmt), ##__VA_ARGS__)
>       |  ^~~~~~
> drivers/net/ethernet/qlogic/qed/qed_dev.c:983:17: note: ‘abs_ppfid’ was declared here
>   983 |  u8 filter_idx, abs_ppfid;
>       |                 ^~~~~~~~~

Hmm, this seems to actually be an existing warning, sorry.

I'll reapply this.