diff mbox series

[RFC] cfg80211: add new command for reporting wiphy crashes

Message ID 20190920133708.15313-1-zajec5@gmail.com
State Not Applicable
Headers show
Series [RFC] cfg80211: add new command for reporting wiphy crashes | expand

Commit Message

Rafał Miłecki Sept. 20, 2019, 1:37 p.m. UTC
From: Rafał Miłecki <rafal@milecki.pl>

Hardware or firmware instability may result in unusable wiphy. In such
cases usually a hardware reset is needed. To allow a full recovery
kernel has to indicate problem to the user space.

This new nl80211 command lets user space known wiphy has crashed and has
been just recovered. When applicable it should result in supplicant or
authenticator reconfiguring all interfaces.

Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
---
I'd like to use this new cfg80211_crash_report() in brcmfmac after a
successful recovery from a FullMAC firmware crash.

Later on I'd like to modify hostapd to reconfigure wiphy using a
previously used setup.

I'm OpenWrt developer & user and I got annoyed by my devices not auto
recovering after various failures. There are things I cannot fix (hw
failures or closed fw crashes) but I still expect my devices to get
back to operational state as soon as possible on their own.
---
 include/net/cfg80211.h       |  7 +++++++
 include/uapi/linux/nl80211.h |  2 ++
 net/wireless/nl80211.c       | 29 +++++++++++++++++++++++++++++
 3 files changed, 38 insertions(+)

Comments

Jouni Malinen Sept. 20, 2019, 2:01 p.m. UTC | #1
On Fri, Sep 20, 2019 at 03:37:08PM +0200, Rafał Miłecki wrote:
> Hardware or firmware instability may result in unusable wiphy. In such
> cases usually a hardware reset is needed. To allow a full recovery
> kernel has to indicate problem to the user space.

Why? Shouldn't the driver be able to handle this on its own since all
the previous configuration was done through the driver anyway. As far as
I know, there are drivers that do indeed try to do this and handle it
successfully at least for station mode. AP mode may be more complex, but
for that one, I guess it would be fine to drop all associations (and
provide indication of that to user space) and just restart the BSS.

> This new nl80211 command lets user space known wiphy has crashed and has
> been just recovered. When applicable it should result in supplicant or
> authenticator reconfiguring all interfaces.

For me, that is not really "recovered" if some additional
reconfiguration steps are needed.. I'd like to get a more detailed view
on what exactly might need to be reconfigured and how would user space
know what exactly to do. Or would the plan here be that the driver would
not even indicate this crash if it is actually able to internally
recover fully from the firmware restart?

> I'd like to use this new cfg80211_crash_report() in brcmfmac after a
> successful recovery from a FullMAC firmware crash.
> 
> Later on I'd like to modify hostapd to reconfigure wiphy using a
> previously used setup.

So this implies that at least something would need to happen in AP mode.
Do you have a list of items that the driver cannot do on its own and why
it would be better to do them from user space?

> I'm OpenWrt developer & user and I got annoyed by my devices not auto
> recovering after various failures. There are things I cannot fix (hw
> failures or closed fw crashes) but I still expect my devices to get
> back to operational state as soon as possible on their own.

I fully agree with the auto recovery being important thing to cover for
this, but I'm not yet convinced that this needs user space action. Or if
it does, there would need to be more detailed way of indicating what
exactly is needed for user space to do. The proposed change here is just
saying "hey, I crashed and did something to get the hardware/firmware
responding again" which does not really tell much to user space other
than potentially requiring full disable + re-enable for the related
interfaces. And that is something that should not actually be done in
all cases of firmware crashes since there are drivers that handle
recovery in a manner that is in practice completely transparent to user
space.
Jakub Kicinski Sept. 20, 2019, 5:56 p.m. UTC | #2
On Fri, 20 Sep 2019 15:37:08 +0200, Rafał Miłecki wrote:
> From: Rafał Miłecki <rafal@milecki.pl>
> 
> Hardware or firmware instability may result in unusable wiphy. In such
> cases usually a hardware reset is needed. To allow a full recovery
> kernel has to indicate problem to the user space.
> 
> This new nl80211 command lets user space known wiphy has crashed and has
> been just recovered. When applicable it should result in supplicant or
> authenticator reconfiguring all interfaces.
> 
> Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
> ---
> I'd like to use this new cfg80211_crash_report() in brcmfmac after a
> successful recovery from a FullMAC firmware crash.
> 
> Later on I'd like to modify hostapd to reconfigure wiphy using a
> previously used setup.
> 
> I'm OpenWrt developer & user and I got annoyed by my devices not auto
> recovering after various failures. There are things I cannot fix (hw
> failures or closed fw crashes) but I still expect my devices to get
> back to operational state as soon as possible on their own.

Perhaps a slightly larger point, but I think it should be raised - 
is there any chance for reusing debugging, reset and recovery work done
in devlink originally for complex Ethernet devices?

WiFi drivers have been dealing with more complex/FW heavy designs for a
while so maybe you've grow your own interfaces, and maybe they
necessarily need to be 802.11-centric, but I'm a little surprised that:

linux $ git grep devlink -- drivers/net/wireless/
linux $
Rafał Miłecki Sept. 26, 2019, 11:52 a.m. UTC | #3
On 20.09.2019 16:01, Jouni Malinen wrote:
> On Fri, Sep 20, 2019 at 03:37:08PM +0200, Rafał Miłecki wrote:
>> Hardware or firmware instability may result in unusable wiphy. In such
>> cases usually a hardware reset is needed. To allow a full recovery
>> kernel has to indicate problem to the user space.
> 
> Why? Shouldn't the driver be able to handle this on its own since all
> the previous configuration was done through the driver anyway. As far as
> I know, there are drivers that do indeed try to do this and handle it
> successfully at least for station mode. AP mode may be more complex, but
> for that one, I guess it would be fine to drop all associations (and
> provide indication of that to user space) and just restart the BSS.

Indeed my main concert is AP mode. I'm afraid that cfg80211 doesn't
cache all settings, consider e.g. nl80211_start_ap(). It builds
struct cfg80211_ap_settings using info from nl80211 message and
passes it to the driver (rdev_start_ap()). Once it's done it
caches only a small subset of all setup data.

In other words driver doesn't have enough info to recover interfaces
setup.


>> This new nl80211 command lets user space known wiphy has crashed and has
>> been just recovered. When applicable it should result in supplicant or
>> authenticator reconfiguring all interfaces.
> 
> For me, that is not really "recovered" if some additional
> reconfiguration steps are needed.. I'd like to get a more detailed view
> on what exactly might need to be reconfigured and how would user space
> know what exactly to do. Or would the plan here be that the driver would
> not even indicate this crash if it is actually able to internally
> recover fully from the firmware restart?

I meant that hardware has been recovered & is operational again (driver
can talk to it). I expected user space to reconfigure all interfaces
using the same settings that were used on previous run.

If driver were able to recover interfaces setup on its own (with a help
of cfg80211) then user space wouldn't need to be involved.


>> I'd like to use this new cfg80211_crash_report() in brcmfmac after a
>> successful recovery from a FullMAC firmware crash.
>>
>> Later on I'd like to modify hostapd to reconfigure wiphy using a
>> previously used setup.
> 
> So this implies that at least something would need to happen in AP mode.
> Do you have a list of items that the driver cannot do on its own and why
> it would be better to do them from user space?

First of all I was wondering how to handle interfaces creation. After a
firmware crash we have:
1) Interfaces created in Linux
2) No corresponsing interfaces in firmware

Syncing that (re-creating in-firmware firmwares) may be a bit tricky
depending on a driver and hardware. For some cases it could be easier to
delete all interfaces and ask user space to setup wiphy (create required
interfaces) again. I'm not sure if that's acceptable though?

If we agree interfaces should stay and driver simply should configure
firmware properly, then we need all data as explained earlier. struct
cfg80211_ap_settings is not available during runtime. How should we
handle that problem?


>> I'm OpenWrt developer & user and I got annoyed by my devices not auto
>> recovering after various failures. There are things I cannot fix (hw
>> failures or closed fw crashes) but I still expect my devices to get
>> back to operational state as soon as possible on their own.
> 
> I fully agree with the auto recovery being important thing to cover for
> this, but I'm not yet convinced that this needs user space action. Or if
> it does, there would need to be more detailed way of indicating what
> exactly is needed for user space to do. The proposed change here is just
> saying "hey, I crashed and did something to get the hardware/firmware
> responding again" which does not really tell much to user space other
> than potentially requiring full disable + re-enable for the related
> interfaces. And that is something that should not actually be done in
> all cases of firmware crashes since there are drivers that handle
> recovery in a manner that is in practice completely transparent to user
> space.

I was aiming for a brutal force solution: just make user space
interfaces need a full setup just at they were just created.
Rafał Miłecki Sept. 26, 2019, 11:55 a.m. UTC | #4
Resending from my subscribed e-mail

On 20.09.2019 16:01, Jouni Malinen wrote:
> On Fri, Sep 20, 2019 at 03:37:08PM +0200, Rafał Miłecki wrote:
>> Hardware or firmware instability may result in unusable wiphy. In such
>> cases usually a hardware reset is needed. To allow a full recovery
>> kernel has to indicate problem to the user space.
>
> Why? Shouldn't the driver be able to handle this on its own since all
> the previous configuration was done through the driver anyway. As far as
> I know, there are drivers that do indeed try to do this and handle it
> successfully at least for station mode. AP mode may be more complex, but
> for that one, I guess it would be fine to drop all associations (and
> provide indication of that to user space) and just restart the BSS.

Indeed my main concert is AP mode. I'm afraid that cfg80211 doesn't
cache all settings, consider e.g. nl80211_start_ap(). It builds
struct cfg80211_ap_settings using info from nl80211 message and
passes it to the driver (rdev_start_ap()). Once it's done it
caches only a small subset of all setup data.

In other words driver doesn't have enough info to recover interfaces
setup.


>> This new nl80211 command lets user space known wiphy has crashed and has
>> been just recovered. When applicable it should result in supplicant or
>> authenticator reconfiguring all interfaces.
>
> For me, that is not really "recovered" if some additional
> reconfiguration steps are needed.. I'd like to get a more detailed view
> on what exactly might need to be reconfigured and how would user space
> know what exactly to do. Or would the plan here be that the driver would
> not even indicate this crash if it is actually able to internally
> recover fully from the firmware restart?

I meant that hardware has been recovered & is operational again (driver
can talk to it). I expected user space to reconfigure all interfaces
using the same settings that were used on previous run.

If driver were able to recover interfaces setup on its own (with a help
of cfg80211) then user space wouldn't need to be involved.


>> I'd like to use this new cfg80211_crash_report() in brcmfmac after a
>> successful recovery from a FullMAC firmware crash.
>>
>> Later on I'd like to modify hostapd to reconfigure wiphy using a
>> previously used setup.
>
> So this implies that at least something would need to happen in AP mode.
> Do you have a list of items that the driver cannot do on its own and why
> it would be better to do them from user space?

First of all I was wondering how to handle interfaces creation. After a
firmware crash we have:
1) Interfaces created in Linux
2) No corresponsing interfaces in firmware

Syncing that (re-creating in-firmware firmwares) may be a bit tricky
depending on a driver and hardware. For some cases it could be easier to
delete all interfaces and ask user space to setup wiphy (create required
interfaces) again. I'm not sure if that's acceptable though?

If we agree interfaces should stay and driver simply should configure
firmware properly, then we need all data as explained earlier. struct
cfg80211_ap_settings is not available during runtime. How should we
handle that problem?


>> I'm OpenWrt developer & user and I got annoyed by my devices not auto
>> recovering after various failures. There are things I cannot fix (hw
>> failures or closed fw crashes) but I still expect my devices to get
>> back to operational state as soon as possible on their own.
>
> I fully agree with the auto recovery being important thing to cover for
> this, but I'm not yet convinced that this needs user space action. Or if
> it does, there would need to be more detailed way of indicating what
> exactly is needed for user space to do. The proposed change here is just
> saying "hey, I crashed and did something to get the hardware/firmware
> responding again" which does not really tell much to user space other
> than potentially requiring full disable + re-enable for the related
> interfaces. And that is something that should not actually be done in
> all cases of firmware crashes since there are drivers that handle
> recovery in a manner that is in practice completely transparent to user
> space.

I was aiming for a brutal force solution: just make user space
interfaces need a full setup just at they were just created.
Johannes Berg Sept. 26, 2019, 11:55 a.m. UTC | #5
On Thu, 2019-09-26 at 13:52 +0200, Rafał Miłecki wrote:
> 
> Indeed my main concert is AP mode. I'm afraid that cfg80211 doesn't
> cache all settings, consider e.g. nl80211_start_ap(). It builds
> struct cfg80211_ap_settings using info from nl80211 message and
> passes it to the driver (rdev_start_ap()). Once it's done it
> caches only a small subset of all setup data.
> 
> In other words driver doesn't have enough info to recover interfaces
> setup.

So the driver can cache it, just like mac80211.

You can't seriously be suggesting that the driver doesn't *have* enough
information - everything passed through it :)

> I meant that hardware has been recovered & is operational again (driver
> can talk to it). I expected user space to reconfigure all interfaces
> using the same settings that were used on previous run.
> 
> If driver were able to recover interfaces setup on its own (with a help
> of cfg80211) then user space wouldn't need to be involved.

The driver can do it, mac80211 does. It's just a matter of what the
driver will do or not.

> First of all I was wondering how to handle interfaces creation. After a
> firmware crash we have:
> 1) Interfaces created in Linux
> 2) No corresponsing interfaces in firmware

> Syncing that (re-creating in-firmware firmwares) may be a bit tricky
> depending on a driver and hardware.

We do that in mac80211, it works fine. Why would it be tricky?

If something fails, I think we force that interface to go down.

> For some cases it could be easier to
> delete all interfaces and ask user space to setup wiphy (create required
> interfaces) again. I'm not sure if that's acceptable though?
> 
> If we agree interfaces should stay and driver simply should configure
> firmware properly, then we need all data as explained earlier. struct
> cfg80211_ap_settings is not available during runtime. How should we
> handle that problem?

You can cache it in the driver in whatever format makes sense.

> I was aiming for a brutal force solution: just make user space
> interfaces need a full setup just at they were just created.

You can still do that btw, just unregister and re-register the wiphy.

johannes
Rafał Miłecki Sept. 26, 2019, noon UTC | #6
On 26.09.2019 13:55, Johannes Berg wrote:
> On Thu, 2019-09-26 at 13:52 +0200, Rafał Miłecki wrote:
>>
>> Indeed my main concert is AP mode. I'm afraid that cfg80211 doesn't
>> cache all settings, consider e.g. nl80211_start_ap(). It builds
>> struct cfg80211_ap_settings using info from nl80211 message and
>> passes it to the driver (rdev_start_ap()). Once it's done it
>> caches only a small subset of all setup data.
>>
>> In other words driver doesn't have enough info to recover interfaces
>> setup.
> 
> So the driver can cache it, just like mac80211.
> 
> You can't seriously be suggesting that the driver doesn't *have* enough
> information - everything passed through it :)

Precisely: it doesn't store (cache) enough info.


>> I meant that hardware has been recovered & is operational again (driver
>> can talk to it). I expected user space to reconfigure all interfaces
>> using the same settings that were used on previous run.
>>
>> If driver were able to recover interfaces setup on its own (with a help
>> of cfg80211) then user space wouldn't need to be involved.
> 
> The driver can do it, mac80211 does. It's just a matter of what the
> driver will do or not.
> 
>> First of all I was wondering how to handle interfaces creation. After a
>> firmware crash we have:
>> 1) Interfaces created in Linux
>> 2) No corresponsing interfaces in firmware
> 
>> Syncing that (re-creating in-firmware firmwares) may be a bit tricky
>> depending on a driver and hardware.
> 
> We do that in mac80211, it works fine. Why would it be tricky?

In brcmfmac on .add_virtual_intf() we:
1) Send request to the FullMAC firmware
2) Wait for a reply
3) On success we create interface
4) We wake up .add_virtual_intf() handler

I'll need to find a way to skip step 3 in recovery path since interface
on host side already exists.


> If something fails, I think we force that interface to go down.
> 
>> For some cases it could be easier to
>> delete all interfaces and ask user space to setup wiphy (create required
>> interfaces) again. I'm not sure if that's acceptable though?
>>
>> If we agree interfaces should stay and driver simply should configure
>> firmware properly, then we need all data as explained earlier. struct
>> cfg80211_ap_settings is not available during runtime. How should we
>> handle that problem?
> 
> You can cache it in the driver in whatever format makes sense.
> 
>> I was aiming for a brutal force solution: just make user space
>> interfaces need a full setup just at they were just created.
> 
> You can still do that btw, just unregister and re-register the wiphy.

OK, so basically I need to work on caching setup data. Should I try
doing that at my selected driver level (brcmfmac)? Or should I focus on
generic solution (cfg80211) and consider offloading mac80211 if
possible?
Johannes Berg Sept. 26, 2019, 12:04 p.m. UTC | #7
On Thu, 2019-09-26 at 14:00 +0200, Rafał Miłecki wrote:

> > You can't seriously be suggesting that the driver doesn't *have* enough
> > information - everything passed through it :)
> 
> Precisely: it doesn't store (cache) enough info.

But nothing stops it (the driver) from doing so!

> In brcmfmac on .add_virtual_intf() we:
> 1) Send request to the FullMAC firmware
> 2) Wait for a reply
> 3) On success we create interface
> 4) We wake up .add_virtual_intf() handler
> 
> I'll need to find a way to skip step 3 in recovery path since interface
> on host side already exists.

Sure, we skip lots of things in all drivers, look at iwlwifi for example
with IWL_MVM_STATUS_IN_HW_RESTART.

> OK, so basically I need to work on caching setup data. Should I try
> doing that at my selected driver level (brcmfmac)? Or should I focus on
> generic solution (cfg80211) and consider offloading mac80211 if
> possible?

I think doing it generically will not work well, you have different code
paths and different formats, different data that you need etc.

Sometimes there's information cfg80211 doesn't even *have*, because the
driver is responsible for things (e.g. elements). I guess you can try,
but my gut feeling is that it'd simpler in the driver.

Now you can argue that everything passes through cfg80211 so it must
have enough data too (just like I'm arguing the driver certainly has
enough data), but ... it seems to me the cfg80211 is usually more
action-based, where the restore flow needs to keep the *state*, not
replay the series of actions that happened.

johannes
diff mbox series

Patch

diff --git a/include/net/cfg80211.h b/include/net/cfg80211.h
index ff45c3e1abff..668fa27c88cc 100644
--- a/include/net/cfg80211.h
+++ b/include/net/cfg80211.h
@@ -7437,6 +7437,13 @@  void cfg80211_pmsr_complete(struct wireless_dev *wdev,
 bool cfg80211_iftype_allowed(struct wiphy *wiphy, enum nl80211_iftype iftype,
 			     bool is_4addr, u8 check_swif);
 
+/**
+ * cfg80211_crash_report - report crashed wiphy that requires a setup
+ *
+ * @wiphy: the wiphy
+ * @gfp: allocation flags
+ */
+void cfg80211_crash_report(struct wiphy *wiphy, gfp_t gfp);
 
 /* Logging, debugging and troubleshooting/diagnostic helpers. */
 
diff --git a/include/uapi/linux/nl80211.h b/include/uapi/linux/nl80211.h
index beee59c831a7..9e17feb03849 100644
--- a/include/uapi/linux/nl80211.h
+++ b/include/uapi/linux/nl80211.h
@@ -1325,6 +1325,8 @@  enum nl80211_commands {
 
 	NL80211_CMD_PROBE_MESH_LINK,
 
+	NL80211_CMD_CRASH_REPORT,
+
 	/* add new commands above here */
 
 	/* used to define NL80211_CMD_MAX below */
diff --git a/net/wireless/nl80211.c b/net/wireless/nl80211.c
index d21b1581a665..d29785fb0676 100644
--- a/net/wireless/nl80211.c
+++ b/net/wireless/nl80211.c
@@ -16940,6 +16940,35 @@  void cfg80211_update_owe_info_event(struct net_device *netdev,
 }
 EXPORT_SYMBOL(cfg80211_update_owe_info_event);
 
+void cfg80211_crash_report(struct wiphy *wiphy, gfp_t gfp)
+{
+	struct cfg80211_registered_device *rdev = wiphy_to_rdev(wiphy);
+	struct sk_buff *msg;
+	void *hdr;
+
+	msg = nlmsg_new(NLMSG_DEFAULT_SIZE, gfp);
+	if (!msg)
+		return;
+
+	hdr = nl80211hdr_put(msg, 0, 0, 0, NL80211_CMD_CRASH_REPORT);
+	if (!hdr)
+		goto nla_put_failure;
+
+	if (nla_put_u32(msg, NL80211_ATTR_WIPHY, rdev->wiphy_idx))
+		goto nla_put_failure;
+
+	genlmsg_end(msg, hdr);
+
+	genlmsg_multicast_netns(&nl80211_fam, wiphy_net(&rdev->wiphy), msg, 0,
+				NL80211_MCGRP_CONFIG, gfp);
+
+	return;
+
+nla_put_failure:
+	nlmsg_free(msg);
+}
+EXPORT_SYMBOL(cfg80211_crash_report);
+
 /* initialisation/exit functions */
 
 int __init nl80211_init(void)