mbox series

[00/15] net: taint when the device driver firmware crashes

Message ID 20200509043552.8745-1-mcgrof@kernel.org
Headers show
Series net: taint when the device driver firmware crashes | expand

Message

Luis Chamberlain May 9, 2020, 4:35 a.m. UTC
Device driver firmware can crash, and sometimes, this can leave your
system in a state which makes the device or subsystem completely
useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
of scraping some magical words from the kernel log, which is driver
specific, is much easier. So instead this series provides a helper which
lets drivers annotate this and shows how to use this on networking
drivers.

My methodology for finding when firmware crashes is to git grep for
"crash" and then doing some study of the code to see if this indeed
a place where the firmware crashes. In some places this is quite
obvious.

I'm starting off with networking first, if this gets merged later on I
can focus on the other drivers, but I already have some work done on
other subsytems.

Review, flames, etc are greatly appreciated.

This work, only on networking drivers, can be found on my git tree as well:

https://git.kernel.org/pub/scm/linux/kernel/git/mcgrof/linux-next.git/log/?h=20200509-taint-firmware-net

Luis Chamberlain (15):
  taint: add module firmware crash taint support
  ethernet/839: use new module_firmware_crashed()
  bnx2x: use new module_firmware_crashed()
  bnxt: use new module_firmware_crashed()
  bna: use new module_firmware_crashed()
  liquidio: use new module_firmware_crashed()
  cxgb4: use new module_firmware_crashed()
  ehea: use new module_firmware_crashed()
  qed: use new module_firmware_crashed()
  soc: qcom: ipa: use new module_firmware_crashed()
  wimax/i2400m: use new module_firmware_crashed()
  ath10k: use new module_firmware_crashed()
  ath6kl: use new module_firmware_crashed()
  brcm80211: use new module_firmware_crashed()
  mwl8k: use new module_firmware_crashed()

 drivers/net/ethernet/8390/axnet_cs.c                |  4 +++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c    |  1 +
 drivers/net/ethernet/broadcom/bnxt/bnxt_ethtool.c   |  1 +
 drivers/net/ethernet/brocade/bna/bfa_ioc.c          |  1 +
 drivers/net/ethernet/cavium/liquidio/lio_main.c     |  1 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c     |  1 +
 drivers/net/ethernet/ibm/ehea/ehea_main.c           |  2 ++
 drivers/net/ethernet/qlogic/qed/qed_debug.c         |  3 +++
 drivers/net/ipa/ipa_modem.c                         |  1 +
 drivers/net/wimax/i2400m/rx.c                       |  1 +
 drivers/net/wireless/ath/ath10k/pci.c               |  2 ++
 drivers/net/wireless/ath/ath10k/sdio.c              |  2 ++
 drivers/net/wireless/ath/ath10k/snoc.c              |  1 +
 drivers/net/wireless/ath/ath6kl/hif.c               |  1 +
 .../net/wireless/broadcom/brcm80211/brcmfmac/core.c |  1 +
 drivers/net/wireless/marvell/mwl8k.c                |  1 +
 include/linux/kernel.h                              |  3 ++-
 include/linux/module.h                              | 13 +++++++++++++
 include/trace/events/module.h                       |  3 ++-
 kernel/module.c                                     |  5 +++--
 kernel/panic.c                                      |  1 +
 21 files changed, 44 insertions(+), 5 deletions(-)

Comments

Jakub Kicinski May 9, 2020, 6:35 p.m. UTC | #1
On Sat,  9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
> 
> My methodology for finding when firmware crashes is to git grep for
> "crash" and then doing some study of the code to see if this indeed
> a place where the firmware crashes. In some places this is quite
> obvious.
> 
> I'm starting off with networking first, if this gets merged later on I
> can focus on the other drivers, but I already have some work done on
> other subsytems.
> 
> Review, flames, etc are greatly appreciated.

Tainting itself may be useful, but that's just the first step. I'd much
rather see folks start using the devlink health infrastructure. Devlink
is netlink based, but it's _not_ networking specific (many of its
optional features obviously are, but don't let that mislead you).

With devlink health we get (a) a standard notification on the failure; 
(b) information/state dump in a (somewhat) structured form, which can be
collected & shared with vendors; (c) automatic remediation (usually
device reset of some scope).

Now regarding the tainting - as I said it may be useful, but don't we
have to define what constitutes a "firmware crash"? There are many
failure modes, some perfectly recoverable (e.g. processing queue hang), 
some mere bugs (e.g. device fails to initialize some functions). All of
them may impact the functioning of the system. How do we choose those
that taint?
Shannon Nelson May 10, 2020, 1:01 a.m. UTC | #2
On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>
If the driver is able to detect that the device firmware has come back 
alive, through user intervention or whatever, should there be a way to 
"untaint" the kernel?  Or would you expect it to remain tainted?

sln
Andrew Lunn May 10, 2020, 1:58 a.m. UTC | #3
On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> > 
> If the driver is able to detect that the device firmware has come back
> alive, through user intervention or whatever, should there be a way to
> "untaint" the kernel?  Or would you expect it to remain tainted?

Hi Shannon

In general, you don't want to be able to untained. Say a non-GPL
licenced module is loaded, which taints the kernel. It might then try
to untaint the kernel to hide its.

As for firmware, how much damage can the firmware do as it crashed? If
it is a DMA master, it could of splattered stuff through
memory. Restarting the firmware is not going to reverse the damage it
has done.

    Andrew
Shannon Nelson May 10, 2020, 2:15 a.m. UTC | #4
On 5/9/20 6:58 PM, Andrew Lunn wrote:
> On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
>> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
>>> Device driver firmware can crash, and sometimes, this can leave your
>>> system in a state which makes the device or subsystem completely
>>> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
>>> of scraping some magical words from the kernel log, which is driver
>>> specific, is much easier. So instead this series provides a helper which
>>> lets drivers annotate this and shows how to use this on networking
>>> drivers.
>>>
>> If the driver is able to detect that the device firmware has come back
>> alive, through user intervention or whatever, should there be a way to
>> "untaint" the kernel?  Or would you expect it to remain tainted?
> Hi Shannon
>
> In general, you don't want to be able to untained. Say a non-GPL
> licenced module is loaded, which taints the kernel. It might then try
> to untaint the kernel to hide its.

Yeah, obviously we don't want this to be abuseable.  I was just 
wondering about reversing this particular status if the broken device 
could get itself fixed.

>
> As for firmware, how much damage can the firmware do as it crashed? If
> it is a DMA master, it could of splattered stuff through
> memory. Restarting the firmware is not going to reverse the damage it
> has done.
>
True, and tho' the driver might get the thing restarted, it wouldn't 
necessarily know what kind of damage had ensued.

Carry on,
sln
Luis Chamberlain May 11, 2020, 2:11 p.m. UTC | #5
On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat,  9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> > 
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then doing some study of the code to see if this indeed
> > a place where the firmware crashes. In some places this is quite
> > obvious.
> > 
> > I'm starting off with networking first, if this gets merged later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsytems.
> > 
> > Review, flames, etc are greatly appreciated.
> 
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
> 
> With devlink health we get (a) a standard notification on the failure; 
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).

It indeed sounds very useful!

> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?

Yes indeed, I missed clarifying this in the documentation. I'll do so
in my next respin.

> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang), 
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint? 

Its up to the maintainers of the device driver, what I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal / addition
to to get things back in order again.

  Luis
Luis Chamberlain May 11, 2020, 2:13 p.m. UTC | #6
On Sat, May 09, 2020 at 07:15:23PM -0700, Shannon Nelson wrote:
> On 5/9/20 6:58 PM, Andrew Lunn wrote:
> > On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> > As for firmware, how much damage can the firmware do as it crashed? If
> > it is a DMA master, it could of splattered stuff through
> > memory. Restarting the firmware is not going to reverse the damage it
> > has done.
> > 
> True, and tho' the driver might get the thing restarted, it wouldn't
> necessarily know what kind of damage had ensued.

Indeed, it is those uknowns which we currently assume is just fine, but
in reality can be damaging. Today we just move on with life, but such
information is useful for analysis.

  Luis
Steven Rostedt May 11, 2020, 7:21 p.m. UTC | #7
On Sat, 9 May 2020 18:01:51 -0700
Shannon Nelson <snelson@pensando.io> wrote:

> If the driver is able to detect that the device firmware has come back 
> alive, through user intervention or whatever, should there be a way to 
> "untaint" the kernel?  Or would you expect it to remain tainted?

The only way to untaint a kernel is a reboot. A taint just means "something
happened to this kernel since it was booted". It's used as a hint, and
that's all.

I agree with the other comments in this thread. Use devlink health or
whatever tool to look further into causes. But from what I see here, this
code is "good enough" for a taint.

-- Steve