unicast rekey fundamental flawed (was: connection hangs after wpa_supplicant re-key)

Message ID 27736598-ba37-0d61-009f-a4443355bbbe@web.de
State New

Commit Message

Alexander Wetzel Sept. 27, 2017, 6:16 p.m.
Hello,

>> As above, I can work around the problem by increasing
>> dot11RSNAConfigPMKLifetime in the config file.  I also tried setting
>> "fast_reauth=0" but that did not have an impact.  With
>> "dot11RSNAConfigPMKLifetime=31536000" I've seen a solid connection for
>> multiple days.
>> 
>> Any ideas on how I can further debug/fix this?
>
> Some notes above on what this would take.. Either debug from AP or
> sniffer capture and all the needed keys for analysis.
>
> Using a larger dot11RSNAConfigPMKLifetime value sounds like a reasonable
> workaround for this, though. All it does here is give the AP full
> control on when to force PMK rekeying (i.e., in practice, when to force
> EAP reauthentication).

This seems to be the same issue I ran into in the past and reported/debugged (including wlan captures) here:
https://patchwork.kernel.org/patch/6449291/ and here https://dev.openwrt.org/ticket/18966

The short version is that unicast rekeys are inherently dangerous when the encryption is offloaded to the card and mac80211 from the Linux kernel is used. (Group rekeys are not affected and are fine.) The root of the evil is directly in the IEEE 802.11 spec and was only "fixed" in 802.11-2012. That fix has not been implemented in any wlan stack I'm aware of, though. (At least Windows seems to have code to handle the issue as a special case when connected to a Linux AP using rekeys. There the wlan also freezes, but recovers within ~1s.)

Here is how I currently understand the issue (this can be wrong and/or incomplete):
When the unicast key changes but there is no new key ID to switch over to, we are racing the hardware of the wlan card.
It can happen (in my test environment 100% of the time) that mac80211 hands a frame over to the wlan card for encryption with a pn belonging to the old key, which is still the current key at that moment.
While this packet is queued in the wlan card, the unicast key is updated and installed in the card. The packet with the old pn is then encrypted with the new key and sent out.
The other end receives the packet, decrypts it successfully with the new key and then sets the pn for the new key to the value from the packet. That value is of course way too high, since it belongs to the old key. One or two packets later the correct pn is being used, but the replay protection now drops the packets until we either reach the pn of the old key (pretty unlikely to ever happen) or the key is rolled over again, resetting the highest seen pn to zero. The result is that a rekey only works if the wlan is idle at the critical time, so that no packets are queued when we replace the key.

Switching your wlan card to software encryption prevents the issue on Linux systems, but chances are you have to do that on both the AP and the client to really prevent the freezes, at least when both are running Linux and mac80211. (We then no longer race the wlan hardware, which keeps key and pn from running out of sync.)

I'm currently looking at the issue again and trying to put together an acceptable patch in order to start a new discussion on linux-wireless.
Since that will probably still take some time, I've attached an older but tested interim version of the new kernel patch I'm working on.

The patch will not prevent the broken packets from being sent; it will just detect and handle them on the receiving end for the most probable case (TID=0). Preventing the issue altogether seems to be very hard and expensive, and is for sure still beyond my current understanding and coding skills.

At least in my setup both systems, the AP and the station, must be patched or the wlan freezes during a rekey if a data transfer is ongoing.
Since I normally test with flood ping and therefore have the same packet load in both directions, that's expected.

The patch prints "HACK: -RESCUE- new key packet with old pn mitigated" when it encounters and handles a problematic packet.
Here is a quick sample of what a mitigated wlan freeze looks like with the attached patch:

Sep 10 21:24:21.557801 perry kernel: HACK: virgin key detected, enable HACK code path!
Sep 10 21:24:21.557925 perry kernel: HACK     cnt: 00 00 00 00 00 00
Sep 10 21:24:21.557961 perry kernel: HACK old_cnt: 00 00 00 00 47 69
Sep 10 21:24:21.557986 perry kernel: HACK      pn: 00 00 00 00 47 6b
Sep 10 21:24:21.558016 perry kernel: HACK: -RESCUE- new key packet with old pn mitigated
Sep 10 21:24:21.617804 perry kernel: HACK: virgin key detected, enable HACK code path!
Sep 10 21:24:21.617941 perry kernel: HACK     cnt: 00 00 00 00 00 00
Sep 10 21:24:21.617970 perry kernel: HACK old_cnt: 00 00 00 00 47 6b
Sep 10 21:24:21.618007 perry kernel: HACK      pn: 00 00 00 00 00 01
Sep 10 21:24:21.618034 perry kernel: HACK: Switching key over to normal counter

I hope that helps and makes this really hard to debug issue more widely known...

As it is, only a small percentage of Linux users will be able to tie this to rekeys. And even finding that out does not help much, since there is absolutely nothing in any debug log or even a kernel trace. (I tried all of that before giving up and finally patching wireshark to be able to look at the interesting encrypted packets.) So unless you use one of the patches, you will only be able to see the issue in a wlan capture, and only when looking for it.
 

Alexander Wetzel

Patch

diff -ur linux-4.13.0-gentoo_/net/mac80211/key.c linux-4.13.0-gentoo/net/mac80211/key.c
--- linux-4.13.0-gentoo_/net/mac80211/key.c	2017-09-03 22:56:17.000000000 +0200
+++ linux-4.13.0-gentoo/net/mac80211/key.c	2017-09-10 21:02:23.822346404 +0200
@@ -626,9 +626,21 @@ 
 
 	mutex_lock(&sdata->local->key_mtx);
 
-	if (sta && pairwise)
+	if (sta && pairwise) {
 		old_key = key_mtx_dereference(sdata->local, sta->ptk[idx]);
-	else if (sta)
+		if (old_key)
+			switch (key->conf.cipher) {
+			/* For now we only fix the issue for CCMP */
+			case WLAN_CIPHER_SUITE_CCMP:
+				/* Only TID=0 seems to be relevant, but that assumption may be wrong... */
+				memcpy(&key->u.ccmp.rx_pn_old, old_key->u.ccmp.rx_pn[0], IEEE80211_CCMP_PN_LEN);
+				key->check_pn_old = true;
+				break;
+			}
+		else
+			/* No old key, bypass hack code */
+			key->check_pn_old = false;
+	} else if (sta)
 		old_key = key_mtx_dereference(sdata->local, sta->gtk[idx]);
 	else
 		old_key = key_mtx_dereference(sdata->local, sdata->keys[idx]);
diff -ur linux-4.13.0-gentoo_/net/mac80211/key.h linux-4.13.0-gentoo/net/mac80211/key.h
--- linux-4.13.0-gentoo_/net/mac80211/key.h	2017-09-03 22:56:17.000000000 +0200
+++ linux-4.13.0-gentoo/net/mac80211/key.h	2017-09-10 21:02:54.752438385 +0200
@@ -59,6 +59,7 @@ 
 	struct ieee80211_local *local;
 	struct ieee80211_sub_if_data *sdata;
 	struct sta_info *sta;
+	bool check_pn_old;
 
 	/* for sdata list */
 	struct list_head list;
@@ -88,6 +89,7 @@ 
 			 * Management frames.
 			 */
 			u8 rx_pn[IEEE80211_NUM_TIDS + 1][IEEE80211_CCMP_PN_LEN];
+			u8 rx_pn_old[IEEE80211_CCMP_PN_LEN];
 			struct crypto_aead *tfm;
 			u32 replays; /* dot11RSNAStatsCCMPReplays */
 		} ccmp;
diff -ur linux-4.13.0-gentoo_/net/mac80211/wpa.c linux-4.13.0-gentoo/net/mac80211/wpa.c
--- linux-4.13.0-gentoo_/net/mac80211/wpa.c	2017-09-03 22:56:17.000000000 +0200
+++ linux-4.13.0-gentoo/net/mac80211/wpa.c	2017-09-10 21:08:04.203331545 +0200
@@ -532,6 +532,31 @@ 
 			key->u.ccmp.replays++;
 			return RX_DROP_UNUSABLE;
 		}
+		if (unlikely(key->check_pn_old)) {
+			/* Code only handles TID=0, which seems to be the only relevant TID for the race */
+			if (queue == 0) {
+				printk ("HACK: virgin key detected, enable HACK code path!\n");
+				print_hex_dump_debug("HACK     cnt: ", DUMP_PREFIX_NONE, IEEE80211_CCMP_PN_LEN, 6, key->u.ccmp.rx_pn[queue], IEEE80211_CCMP_PN_LEN, false);
+				print_hex_dump_debug("HACK old_cnt: ", DUMP_PREFIX_NONE, IEEE80211_CCMP_PN_LEN, 6, key->u.ccmp.rx_pn_old, IEEE80211_CCMP_PN_LEN, false);
+				print_hex_dump_debug("HACK      pn: ", DUMP_PREFIX_NONE, IEEE80211_CCMP_PN_LEN, 6, pn, IEEE80211_CCMP_PN_LEN, false);
+
+				if (memcmp(pn, key->u.ccmp.rx_pn_old, IEEE80211_CCMP_PN_LEN) < 0 ||
+				    memcmp(key->u.ccmp.rx_pn[queue], key->u.ccmp.rx_pn_old, IEEE80211_CCMP_PN_LEN) == 0 ) {
+					/* pn is < the pn from old key or rx_pn_old and rx_pn are identical, complete switch to new key */
+					printk ("HACK: Switching key over to normal counter\n");
+					memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN);
+					key->check_pn_old = false;
+				} else {
+					/* This case would freeze the wlan on an unpatched kernel */
+					printk ("HACK: -RESCUE- new key packet with old pn mitigated\n");
+					memcpy(key->u.ccmp.rx_pn_old, pn, IEEE80211_CCMP_PN_LEN);
+				}
+			} else {
+				printk ("HACK: Sanity ERROR - Found a key with check_pn_old set where TID!=0\n");
+			}
+		} else {
+			memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN);
+		}
 
 		if (!(status->flag & RX_FLAG_DECRYPTED)) {
 			u8 aad[2 * AES_BLOCK_SIZE];
@@ -546,8 +571,6 @@ 
 				    skb->data + skb->len - mic_len, mic_len))
 				return RX_DROP_UNUSABLE;
 		}
-
-		memcpy(key->u.ccmp.rx_pn[queue], pn, IEEE80211_CCMP_PN_LEN);
 	}
 
 	/* Remove CCMP header and MIC */