diff mbox series

[2/6] Rewrite roaming logic

Message ID 20200602001018.83640-2-matthewmwang@chromium.org
State Changes Requested
Headers show
Series [1/6] Use LUT instead of macro | expand

Commit Message

Matthew Wang June 2, 2020, 12:10 a.m. UTC
The roaming logic is currently too aggressive and full of holes. In
particular, the (arbitrary) values we currently assign to the roaming
difficulty are too small; an AP can reasonably move up or down 5dB in
RSSI in an enterprise environment even when the device is static. The
logic needs to be more resilient to these small fluctuations.

The following notes describe each of the changes made and the rationale
behind it.
1. Remove all short-circuiting and allow the roaming logic below to run
   its course. This way we avoid overlooking important factors.
2. Add a to_2ghz flag that indicates if we are moving from a 5ghz AP to
   a 2ghz AP. This should increase the difficulty to roam in that
   direction and mirror the to_5ghz flag.
3. Adjust the min_diff upward across the board. Originally, we would
   roam with a RSSI imporvement of 1dB in the weakest bucket and 5dB in
   the strongest bucket. RSSI is fairly fickle and APs can easily
   fluctuate by more than that in a normal enterprise environment.
   Additionally, there should be more buckets. We currently stop
   differentiating after the RSSI is at least -70dB. This is still
   pretty weak, and the difficulty should continue to increase above
   that. Move the threshold up to 4dB in the weakest bucket and 10dB in
   the strongest bucket, and add two more buckets (up to -60dB).
4. We adjust the roam difficulty up and down based on the estimated
   throughput of the selected AP relative to the estimated throughput of
   the current AP. Small gains or losses in estimated throughput should
   not be rewarded or penalized as heavily as they are now since
   estimated throughput is a hand-wavy measurement to begin with. Relax
   these adjustments.
5. At lower signal levels, we shouldn't care as much about estimated
   throughput gains. In particular, below a certain threshold, we should
   only be chasing RSSI gains in search of a more stable connection. For
   example, it doesn't matter that we gain 10% estimated throughput if
   our RSSI is bad. Introduce an adjust_factor so that we can weight
   estimated throughput differently.

Signed-off-by: Matthew Wang <matthewmwang@chromium.org>
---
 wpa_supplicant/events.c | 103 +++++++++++++++++++---------------------
 1 file changed, 50 insertions(+), 53 deletions(-)

Comments

Jouni Malinen June 19, 2020, 3:29 p.m. UTC | #1
On Mon, Jun 01, 2020 at 05:10:14PM -0700, Matthew Wang wrote:
> The roaming logic is currently too aggressive and full of holes. In
> particular, the (arbitrary) values we currently assign to the roaming
> difficulty are too small; an AP can reasonably move up or down 5dB in
> RSSI in an enterprise environment even when the device is static. The
> logic needs to be more resilient to these small fluctuations.
> 
> The following notes describe each of the changes made and the rationale
> behind it.
> 1. Remove all short-circuiting and allow the roaming logic below to run
>    its course. This way we avoid overlooking important factors.
> 2. Add a to_2ghz flag that indicates if we are moving from a 5ghz AP to
>    a 2ghz AP. This should increase the difficulty to roam in that
>    direction and mirror the to_5ghz flag.
> 3. Adjust the min_diff upward across the board. Originally, we would
>    roam with a RSSI imporvement of 1dB in the weakest bucket and 5dB in
>    the strongest bucket. RSSI is fairly fickle and APs can easily
>    fluctuate by more than that in a normal enterprise environment.
>    Additionally, there should be more buckets. We currently stop
>    differentiating after the RSSI is at least -70dB. This is still
>    pretty weak, and the difficulty should continue to increase above
>    that. Move the threshold up to 4dB in the weakest bucket and 10dB in
>    the strongest bucket, and add two more buckets (up to -60dB).
> 4. We adjust the roam difficulty up and down based on the estimated
>    throughput of the selected AP relative to the estimated throughput of
>    the current AP. Small gains or losses in estimated throughput should
>    not be rewarded or penalized as heavily as they are now since
>    estimated throughput is a hand-wavy measurement to begin with. Relax
>    these adjustments.

This seems to prevent roaming from non-HT to HT AP if those APs have the
same signal strength. This does not sound desirable. Being able to use
HT will likely get better throughput and reduced airtime need. For
example, the prefix_ht20_during_roam test case fails because of these.
Matthew Wang June 23, 2020, 9:41 p.m. UTC | #2
Thanks for the response. I ran the prefer_ht20_during_roam test and it
looks like it expects a roam from a non-HT AP at -30 dBm with 54 Mbps
est_tpt to an HT AP at -30 dBm with 65 Mbps est_tpt. I'm not too clear
on all the phy layer details of HT, but in my opinion, roaming from an
AP at such a strong signal level (-30 dBm) needs a fairly strong
justification, and a 20% increase (54 -> 65) in estimated throughput
is not that strong. That being said, if there are benefits to moving
from non-HT to HT that aren't captured in the {signal, est_tpt} pair,
then I'd advocate for using a flag that behaves similarly to the
to_{2,5}ghz flags, which adjusts the roaming difficulty up/down 2 dBm
depending on the direction of the roam. We can assign some value for
{to,from}_ht flags that subtracts from or adds to the roaming
difficulty depending on how valuable HT vs non-HT is. Does this make
sense?

On Tue, Jun 23, 2020 at 2:30 PM Matthew Wang <matthewmwang@chromium.org> wrote:
>
> Thanks for the response. I ran the prefer_ht20_during_roam test and it
> looks like it expects a roam from a non-HT AP at -30 dBm with 54 Mbps
> est_tpt to an HT AP at -30 dBm with 65 Mbps est_tpt. I'm not too clear
> on all the phy layer details of HT, but in my opinion, roaming from an
> AP at such a strong signal level (-30 dBm) needs a fairly strong
> justification, and a 20% increase (54 -> 65) in estimated throughput
> is not that strong. That being said, if there are benefits to moving
> from non-HT to HT that aren't captured in the {signal, est_tpt} pair,
> then I'd advocate for using a flag that behaves similarly to the
> to_{2,5}ghz flags, which adjusts the roaming difficulty up/down 2 dBm
> depending on the direction of the roam. We can assign some value for
> {to,from}_ht flags that subtracts from or adds to the roaming
> difficulty depending on how valuable HT vs non-HT is. Does this make
> sense?
>
> On Fri, Jun 19, 2020 at 8:31 AM Jouni Malinen <j@w1.fi> wrote:
> >
> > On Mon, Jun 01, 2020 at 05:10:14PM -0700, Matthew Wang wrote:
> > > The roaming logic is currently too aggressive and full of holes. In
> > > particular, the (arbitrary) values we currently assign to the roaming
> > > difficulty are too small; an AP can reasonably move up or down 5dB in
> > > RSSI in an enterprise environment even when the device is static. The
> > > logic needs to be more resilient to these small fluctuations.
> > >
> > > The following notes describe each of the changes made and the rationale
> > > behind it.
> > > 1. Remove all short-circuiting and allow the roaming logic below to run
> > >    its course. This way we avoid overlooking important factors.
> > > 2. Add a to_2ghz flag that indicates if we are moving from a 5ghz AP to
> > >    a 2ghz AP. This should increase the difficulty to roam in that
> > >    direction and mirror the to_5ghz flag.
> > > 3. Adjust the min_diff upward across the board. Originally, we would
> > >    roam with a RSSI imporvement of 1dB in the weakest bucket and 5dB in
> > >    the strongest bucket. RSSI is fairly fickle and APs can easily
> > >    fluctuate by more than that in a normal enterprise environment.
> > >    Additionally, there should be more buckets. We currently stop
> > >    differentiating after the RSSI is at least -70dB. This is still
> > >    pretty weak, and the difficulty should continue to increase above
> > >    that. Move the threshold up to 4dB in the weakest bucket and 10dB in
> > >    the strongest bucket, and add two more buckets (up to -60dB).
> > > 4. We adjust the roam difficulty up and down based on the estimated
> > >    throughput of the selected AP relative to the estimated throughput of
> > >    the current AP. Small gains or losses in estimated throughput should
> > >    not be rewarded or penalized as heavily as they are now since
> > >    estimated throughput is a hand-wavy measurement to begin with. Relax
> > >    these adjustments.
> >
> > This seems to prevent roaming from non-HT to HT AP if those APs have the
> > same signal strength. This does not sound desirable. Being able to use
> > HT will likely get better throughput and reduced airtime need. For
> > example, the prefix_ht20_during_roam test case fails because of these.
> >
> > --
> > Jouni Malinen                                            PGP id EFC895FA
Jouni Malinen Feb. 28, 2021, 9:22 a.m. UTC | #3
On Tue, Jun 23, 2020 at 02:41:19PM -0700, Matthew Wang wrote:
> Thanks for the response. I ran the prefer_ht20_during_roam test and it
> looks like it expects a roam from a non-HT AP at -30 dBm with 54 Mbps
> est_tpt to an HT AP at -30 dBm with 65 Mbps est_tpt. I'm not too clear
> on all the phy layer details of HT, but in my opinion, roaming from an
> AP at such a strong signal level (-30 dBm) needs a fairly strong
> justification, and a 20% increase (54 -> 65) in estimated throughput
> is not that strong. That being said, if there are benefits to moving
> from non-HT to HT that aren't captured in the {signal, est_tpt} pair,
> then I'd advocate for using a flag that behaves similarly to the
> to_{2,5}ghz flags, which adjusts the roaming difficulty up/down 2 dBm
> depending on the direction of the roam. We can assign some value for
> {to,from}_ht flags that subtracts from or adds to the roaming
> difficulty depending on how valuable HT vs non-HT is. Does this make
> sense?

That behavior is not limited to so high signal strength value. Same
happens with all other values and the impact is much more severe in some
cases. For example, with both APs at -86 dBm non-HT AP gets 2 Mbps
est_tpt and HT AP gets 8.666 Mbps. Still, roaming is prevented ("Skip
roam - too small difference in signal level (0 < 4)"). In other words,
even that min_diff -= 2 style adjustment that is given for to_5ghz would
not help here. In fact, it looks like this patch would make it
impossible to roam to an AP that shows better throughput estimate
(regardless of how large a difference) on the same band unless there is
also at least 4 dB increase in the signal level when cur_level < -85.
That does not look reasonable.

And yes, HT does come with possibility of using A-MPDU and A-MSDU
aggregation which can result in significant improvement in throughput
and airtime efficiency and that is in addition to the estimated
throughput values from TX rates. So it might indeed make sense to
add/subtract difficulty for roaming, but still, the main issue I see
here is in that behavior that seems to completely prevent roaming unless
signal level increases significantly regardless of other factors.
Matthew Wang March 2, 2021, 1:22 a.m. UTC | #4
> the main issue I see
here is in that behavior that seems to completely prevent roaming unless
signal level increases significantly regardless of other factors.

This is only the case for the lowest bucket (< -85dBm), as you've
pointed out, and in this case, I would argue that roaming for
estimated throughput gains doesn't actually provide much (if any)
benefit. -86dBm is hardly usable WiFi, and the roam will only add to
the disruption in connectivity, resulting in a decrease in the actual
throughput. Putting the non-HT vs HT question aside for a moment, if
we take two APs that are both at approximately -86 dBm from the
device, according to the original algorithm, any fluctuation in signal
level where the current signal is any worse than the signal of the
other AP will cause a roam, and as I'm sure you're aware, fluctuation
in signal level happens constantly. What we end up seeing is the
device aggressively roaming back and forth for no real signal or
throughput gains. The poor signal level (combined with constant
roaming) far outweighs any gains in estimated throughput.

Even taking into consideration the 2 Mbps non-HT AP vs the 8.666 Mbps
HT AP situation, I think it makes sense to encode some slight bias
towards the HT AP for the reasons you mentioned (A-MPDU and A-MSDU),
but there's no point in roaming to the HT AP if we're just going to
roam away from it because the signal fluctuated a couple dBm.

The point I'm trying to make is that RSSI measurements are moving
targets and can see quite a bit of variance, and the algorithm should
be able to absorb some of this. If we encoded a preference for HT, we
could see something like: at -86 dBm, roam to HT AP when signal is 2
dBm greater, roam to non-HT AP when signal is 6 dBm greater. In this
case, even if two APs have the same "real" RSSI relative to the
device, it's extremely likely, and, in fact, almost inevitable, that
the HT AP will be > 2 dBm greater at some point, and it's much less
likely that the non-HT AP will be > 6 dBm greater at some point. Using
the mac80211_hwsim module to test roaming has this shortcoming - that
the RSSI measurements are too precise and don't actually simulate
real-world environments accurately in this regard.

My arguments up to now have been pretty much entirely conjecture, but
I'd like to offer some (maybe slightly hand-wavy) data. ChromeOS is
one of the largest (if not the largest) consumers of the roaming
functionality in wpa_supplicant since Android offloads roaming to the
chipset. We've gotten many complaints about roaming from enterprise
customers in the past, but ever since we've adopted these roaming
changes nine or so months ago (along with a few more since I first
submitted this patch) in our tree, we've had customers tell us that
their issues were resolved and roaming complaints as a whole have
become more or less non-existent. Additionally, we've since added
integration tests to simulate movement between APs using programmable
attenuators, and we've seen measurable improvements in stability
(device is able to withstand minor fluctuations in signal level
without roaming) and no decreases in mobility (device is still able to
roam before getting deauth-ed for leaving the original AP).

I hope this makes sense, and I'm more than happy to continue
discussing anything you still think needs to be improved or is
unsatisfactory. I think encoding the HT preference is reasonable and
I'll go ahead and do that when I get a chance, but beyond that, I
still believe this is a step in the right direction. Sorry for the
long response, and if you've made it this far, thanks for reading :)

On Sun, Feb 28, 2021 at 1:22 AM Jouni Malinen <j@w1.fi> wrote:
>
> On Tue, Jun 23, 2020 at 02:41:19PM -0700, Matthew Wang wrote:
> > Thanks for the response. I ran the prefer_ht20_during_roam test and it
> > looks like it expects a roam from a non-HT AP at -30 dBm with 54 Mbps
> > est_tpt to an HT AP at -30 dBm with 65 Mbps est_tpt. I'm not too clear
> > on all the phy layer details of HT, but in my opinion, roaming from an
> > AP at such a strong signal level (-30 dBm) needs a fairly strong
> > justification, and a 20% increase (54 -> 65) in estimated throughput
> > is not that strong. That being said, if there are benefits to moving
> > from non-HT to HT that aren't captured in the {signal, est_tpt} pair,
> > then I'd advocate for using a flag that behaves similarly to the
> > to_{2,5}ghz flags, which adjusts the roaming difficulty up/down 2 dBm
> > depending on the direction of the roam. We can assign some value for
> > {to,from}_ht flags that subtracts from or adds to the roaming
> > difficulty depending on how valuable HT vs non-HT is. Does this make
> > sense?
>
> That behavior is not limited to so high signal strength value. Same
> happens with all other values and the impact is much more severe in some
> cases. For example, with both APs at -86 dBm non-HT AP gets 2 Mbps
> est_tpt and HT AP gets 8.666 Mbps. Still, roaming is prevented ("Skip
> roam - too small difference in signal level (0 < 4)"). In other words,
> even that min_diff -= 2 style adjustment that is given for to_5ghz would
> not help here. In fact, it looks like this patch would make it
> impossible to roam to an AP that shows better throughput estimate
> (regardless of how large a difference) on the same band unless there is
> also at least 4 dB increase in the signal level when cur_level < -85.
> That does not look reasonable.
>
> And yes, HT does come with possibility of using A-MPDU and A-MSDU
> aggregation which can result in significant improvement in throughput
> and airtime efficiency and that is in addition to the estimated
> throughput values from TX rates. So it might indeed make sense to
> add/subtract difficulty for roaming, but still, the main issue I see
> here is in that behavior that seems to completely prevent roaming unless
> signal level increases significantly regardless of other factors.
>
> --
> Jouni Malinen                                            PGP id EFC895FA
diff mbox series

Patch

diff --git a/wpa_supplicant/events.c b/wpa_supplicant/events.c
index f0f91892f..d6e4be3c7 100644
--- a/wpa_supplicant/events.c
+++ b/wpa_supplicant/events.c
@@ -1782,8 +1782,10 @@  static int wpa_supplicant_need_to_roam(struct wpa_supplicant *wpa_s,
 	struct wpa_bss *current_bss = NULL;
 #ifndef CONFIG_NO_ROAMING
 	int min_diff, diff;
-	int to_5ghz;
+	int to_2ghz, to_5ghz;
 	int cur_level;
+	int adjust = 0;
+	double adjust_factor, est_ratio;
 	unsigned int cur_est, sel_est;
 	struct wpa_signal_info si;
 	int cur_snr = 0;
@@ -1871,65 +1873,60 @@  static int wpa_supplicant_need_to_roam(struct wpa_supplicant *wpa_s,
 			cur_level, cur_snr, cur_est);
 	}
 
-	if (sel_est > cur_est + 5000) {
-		wpa_dbg(wpa_s, MSG_DEBUG,
-			"Allow reassociation - selected BSS has better estimated throughput");
-		return 1;
-	}
-
+	to_2ghz = current_bss->freq > 4000 && selected->freq < 4000;
 	to_5ghz = selected->freq > 4000 && current_bss->freq < 4000;
 
-	if (cur_level < 0 && cur_level > selected->level + to_5ghz * 2 &&
-	    sel_est < cur_est * 1.2) {
-		wpa_dbg(wpa_s, MSG_DEBUG, "Skip roam - Current BSS has better "
-			"signal level");
-		return 0;
-	}
-
-	if (cur_est > sel_est + 5000) {
-		wpa_dbg(wpa_s, MSG_DEBUG,
-			"Skip roam - Current BSS has better estimated throughput");
-		return 0;
+	/*
+	 * Set the minimum RSSI difference needed to roam (`min_diff`) and the
+	 * estimated throughput weight (`adjust_factor`) based on the RSSI of
+	 * the current AP. At low RSSI (< -70 dBm), we ignore estimated
+	 * throughput gains and only consider RSSI gains. At higher RSSI,
+	 * `adjust_factor` is multiplied by `adjust` (set below according to the
+	 * selected AP's estimated throughput relative to the current AP's
+	 * estimated throughput) to influence the `min_diff`.
+	 */
+	if (cur_level < -85) { /* ..-86 dBm */
+		min_diff = 4;
+		adjust_factor = 0;
+	} else if (cur_level < -80) { /* -85..-81 dBm */
+		min_diff = 5;
+		adjust_factor = 0.2;
+	} else if (cur_level < -75) { /* -80..-76 dBm */
+		min_diff = 6;
+		adjust_factor = 0.4;
+	} else if (cur_level < -70) { /* -75..-71 dBm */
+		min_diff = 7;
+		adjust_factor = 0.6;
+	} else if (cur_level < -65) { /* -70..-66 dBm */
+		min_diff = 8;
+		adjust_factor = 0.8;
+	} else if (cur_level < -60) { /* -65..-61 dBm */
+		min_diff = 9;
+		adjust_factor = 1.0;
+	} else if (cur_level < 0) { /* -60..-1 dBm */
+		min_diff = 10;
+		adjust_factor = 1.2;
+	} else { /* unspecified units (not in dBm) */
+		min_diff = 5;
+		adjust_factor = 1.0;
 	}
 
-	if (cur_snr > GREAT_SNR) {
-		wpa_dbg(wpa_s, MSG_DEBUG,
-			"Skip roam - Current BSS has good SNR (%u > %u)",
-			cur_snr, GREAT_SNR);
-		return 0;
+	sel_est = selected->est_throughput;
+	if (cur_est > sel_est) {
+		adjust = 1;
+		est_ratio = sel_est == 0 ? 2 : (double) cur_est / sel_est;
+	} else {
+		adjust = -1;
+		est_ratio = cur_est == 0 ? 2 : (double) sel_est / cur_est;
 	}
+	if (est_ratio > 2)
+		est_ratio = 2;
+	adjust *= (int) ((est_ratio - 1) * 10);
+	min_diff += adjust * adjust_factor;
 
-	if (cur_level < -85) /* ..-86 dBm */
-		min_diff = 1;
-	else if (cur_level < -80) /* -85..-81 dBm */
-		min_diff = 2;
-	else if (cur_level < -75) /* -80..-76 dBm */
-		min_diff = 3;
-	else if (cur_level < -70) /* -75..-71 dBm */
-		min_diff = 4;
-	else if (cur_level < 0) /* -70..-1 dBm */
-		min_diff = 5;
-	else /* unspecified units (not in dBm) */
-		min_diff = 2;
-
-	if (cur_est > sel_est * 1.5)
-		min_diff += 10;
-	else if (cur_est > sel_est * 1.2)
-		min_diff += 5;
-	else if (cur_est > sel_est * 1.1)
+	if (to_2ghz)
 		min_diff += 2;
-	else if (cur_est > sel_est)
-		min_diff++;
-	else if (sel_est > cur_est * 1.5)
-		min_diff -= 10;
-	else if (sel_est > cur_est * 1.2)
-		min_diff -= 5;
-	else if (sel_est > cur_est * 1.1)
-		min_diff -= 2;
-	else if (sel_est > cur_est)
-		min_diff--;
-
-	if (to_5ghz)
+	else if (to_5ghz)
 		min_diff -= 2;
 	diff = selected->level - cur_level;
 	if (diff < min_diff) {