Message ID | 20200509043552.8745-1-mcgrof@kernel.org |
---|---|
Series | net: taint when the device driver firmware crashes |
On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>
> My methodology for finding when firmware crashes is to git grep for
> "crash" and then doing some study of the code to see if this is indeed
> a place where the firmware crashes. In some places this is quite
> obvious.
>
> I'm starting off with networking first, if this gets merged later on I
> can focus on the other drivers, but I already have some work done on
> other subsystems.
>
> Review, flames, etc are greatly appreciated.

Tainting itself may be useful, but that's just the first step. I'd much
rather see folks start using the devlink health infrastructure. Devlink
is netlink based, but it's _not_ networking specific (many of its
optional features obviously are, but don't let that mislead you).

With devlink health we get (a) a standard notification on the failure;
(b) information/state dump in a (somewhat) structured form, which can be
collected & shared with vendors; (c) automatic remediation (usually
device reset of some scope).

Now regarding the tainting - as I said it may be useful, but don't we
have to define what constitutes a "firmware crash"? There are many
failure modes, some perfectly recoverable (e.g. processing queue hang),
some mere bugs (e.g. device fails to initialize some functions). All of
them may impact the functioning of the system. How do we choose those
that taint?

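For reference, the check the cover letter describes (inspecting /proc/sys/kernel/tainted rather than scraping the kernel log) could look roughly like the sketch below. The file holds a decimal bitmask. The few bit positions in the table are long-standing ones from the kernel's tainted-kernels documentation; the bit this series would use for firmware crashes is not stated in this thread, so it is left as a caller-supplied parameter (`firmware_crash_bit` is a hypothetical name, as are the helper names).

```python
from pathlib import Path

# A small subset of long-standing taint bits; not an exhaustive table.
KNOWN_TAINT_BITS = {
    0: "proprietary module loaded (P)",
    7: "kernel died recently, e.g. oops or BUG (D)",
    9: "kernel issued a warning (W)",
    12: "out-of-tree module loaded (O)",
    13: "unsigned module loaded (E)",
}


def read_taint(path="/proc/sys/kernel/tainted"):
    """Return the current taint bitmask as an integer (Linux only)."""
    return int(Path(path).read_text().strip())


def decode_taint(mask):
    """Describe every set bit, lowest bit first."""
    names = []
    for bit in range(mask.bit_length()):
        if mask & (1 << bit):
            names.append(KNOWN_TAINT_BITS.get(bit, f"bit {bit} (not in this table)"))
    return names


def firmware_crashed(mask, firmware_crash_bit):
    """True if the given bit is set.

    firmware_crash_bit is an assumption supplied by the caller; the
    thread does not say which bit the series assigns.
    """
    return bool(mask & (1 << firmware_crash_bit))
```

On a Linux system, `decode_taint(read_taint())` would list the flags currently set; a monitoring script would call `firmware_crashed()` with whatever bit number the kernel in use actually defines.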
On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> Device driver firmware can crash, and sometimes, this can leave your
> system in a state which makes the device or subsystem completely
> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> of scraping some magical words from the kernel log, which is driver
> specific, is much easier. So instead this series provides a helper which
> lets drivers annotate this and shows how to use this on networking
> drivers.
>

If the driver is able to detect that the device firmware has come back
alive, through user intervention or whatever, should there be a way to
"untaint" the kernel? Or would you expect it to remain tainted?

sln

On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> If the driver is able to detect that the device firmware has come back
> alive, through user intervention or whatever, should there be a way to
> "untaint" the kernel? Or would you expect it to remain tainted?

Hi Shannon

In general, you don't want to be able to untaint. Say a non-GPL
licenced module is loaded, which taints the kernel. It might then try
to untaint the kernel to hide itself.

As for firmware, how much damage can the firmware do as it crashed? If
it is a DMA master, it could have splattered stuff through
memory. Restarting the firmware is not going to reverse the damage it
has done.

Andrew

On 5/9/20 6:58 PM, Andrew Lunn wrote:
> On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
>> On 5/8/20 9:35 PM, Luis Chamberlain wrote:
>>> Device driver firmware can crash, and sometimes, this can leave your
>>> system in a state which makes the device or subsystem completely
>>> useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
>>> of scraping some magical words from the kernel log, which is driver
>>> specific, is much easier. So instead this series provides a helper which
>>> lets drivers annotate this and shows how to use this on networking
>>> drivers.
>>>
>> If the driver is able to detect that the device firmware has come back
>> alive, through user intervention or whatever, should there be a way to
>> "untaint" the kernel? Or would you expect it to remain tainted?
> Hi Shannon
>
> In general, you don't want to be able to untaint. Say a non-GPL
> licenced module is loaded, which taints the kernel. It might then try
> to untaint the kernel to hide itself.

Yeah, obviously we don't want this to be abusable. I was just wondering
about reversing this particular status if the broken device could get
itself fixed.

>
> As for firmware, how much damage can the firmware do as it crashed? If
> it is a DMA master, it could have splattered stuff through
> memory. Restarting the firmware is not going to reverse the damage it
> has done.
>

True, and tho' the driver might get the thing restarted, it wouldn't
necessarily know what kind of damage had ensued.

Carry on,
sln

On Sat, May 09, 2020 at 11:35:46AM -0700, Jakub Kicinski wrote:
> On Sat, 9 May 2020 04:35:37 +0000 Luis Chamberlain wrote:
> > Device driver firmware can crash, and sometimes, this can leave your
> > system in a state which makes the device or subsystem completely
> > useless. Detecting this by inspecting /proc/sys/kernel/tainted instead
> > of scraping some magical words from the kernel log, which is driver
> > specific, is much easier. So instead this series provides a helper which
> > lets drivers annotate this and shows how to use this on networking
> > drivers.
> >
> > My methodology for finding when firmware crashes is to git grep for
> > "crash" and then doing some study of the code to see if this is indeed
> > a place where the firmware crashes. In some places this is quite
> > obvious.
> >
> > I'm starting off with networking first, if this gets merged later on I
> > can focus on the other drivers, but I already have some work done on
> > other subsystems.
> >
> > Review, flames, etc are greatly appreciated.
>
> Tainting itself may be useful, but that's just the first step. I'd much
> rather see folks start using the devlink health infrastructure. Devlink
> is netlink based, but it's _not_ networking specific (many of its
> optional features obviously are, but don't let that mislead you).
>
> With devlink health we get (a) a standard notification on the failure;
> (b) information/state dump in a (somewhat) structured form, which can be
> collected & shared with vendors; (c) automatic remediation (usually
> device reset of some scope).

It indeed sounds very useful!

> Now regarding the tainting - as I said it may be useful, but don't we
> have to define what constitutes a "firmware crash"?

Yes indeed, I missed clarifying this in the documentation. I'll do so in
my next respin.

> There are many
> failure modes, some perfectly recoverable (e.g. processing queue hang),
> some mere bugs (e.g. device fails to initialize some functions). All of
> them may impact the functioning of the system. How do we choose those
> that taint?

It's up to the maintainers of the device driver. What I was aiming for
were those firmware crashes which indeed *can* have an impact on user
experience, and can *even* potentially require a driver removal /
addition to get things back in order again.

Luis

On Sat, May 09, 2020 at 07:15:23PM -0700, Shannon Nelson wrote:
> On 5/9/20 6:58 PM, Andrew Lunn wrote:
> > On Sat, May 09, 2020 at 06:01:51PM -0700, Shannon Nelson wrote:
> > As for firmware, how much damage can the firmware do as it crashed? If
> > it is a DMA master, it could have splattered stuff through
> > memory. Restarting the firmware is not going to reverse the damage it
> > has done.
> >
> True, and tho' the driver might get the thing restarted, it wouldn't
> necessarily know what kind of damage had ensued.

Indeed, it is those unknowns which we currently assume are just fine,
but which in reality can be damaging. Today we just move on with life,
but such information is useful for analysis.

Luis

On Sat, 9 May 2020 18:01:51 -0700 Shannon Nelson <snelson@pensando.io> wrote:
> If the driver is able to detect that the device firmware has come back
> alive, through user intervention or whatever, should there be a way to
> "untaint" the kernel? Or would you expect it to remain tainted?

The only way to untaint a kernel is a reboot.

A taint just means "something happened to this kernel since it was
booted". It's used as a hint, and that's all.

I agree with the other comments in this thread. Use devlink health or
whatever tool to look further into causes. But from what I see here,
this code is "good enough" for a taint.

-- Steve
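
Because a taint only records that something happened since boot and is never cleared, a tool can treat it as a hint by comparing two snapshots of the bitmask and reporting the bits that appeared in between. The following is a minimal sketch of that idea, not anything proposed in the thread; `newly_set_bits` is a hypothetical helper name, and the snapshots are assumed to be integers read from /proc/sys/kernel/tainted at two points in time.

```python
def newly_set_bits(before, after):
    """Return the bit positions set in `after` but not in `before`."""
    gained = after & ~before
    return [bit for bit in range(gained.bit_length()) if (gained >> bit) & 1]


def cleared_bits(before, after):
    """Return bit positions that were set earlier but are not set now.

    Since taints are sticky until reboot, a non-empty result suggests the
    snapshots come from different boots (an inference, not a guarantee).
    """
    lost = before & ~after
    return [bit for bit in range(lost.bit_length()) if (lost >> bit) & 1]
```

A new bit only says that the corresponding event occurred at some point after the first snapshot; finding the cause still means looking at devlink health or the kernel log, as suggested above.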