From patchwork Tue Jan 2 14:58:30 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Matthias Schiffer X-Patchwork-Id: 1881644 X-Patchwork-Delegate: mschiffer@universe-factory.net Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@legolas.ozlabs.org Authentication-Results: legolas.ozlabs.org; dkim=pass (2048-bit key; secure) header.d=lists.infradead.org header.i=@lists.infradead.org header.a=rsa-sha256 header.s=bombadil.20210309 header.b=l3sF4a4e; dkim-atps=neutral Authentication-Results: legolas.ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.openwrt.org (client-ip=2607:7c80:54:3::133; helo=bombadil.infradead.org; envelope-from=openwrt-devel-bounces+incoming=patchwork.ozlabs.org@lists.openwrt.org; receiver=patchwork.ozlabs.org) Received: from bombadil.infradead.org (bombadil.infradead.org [IPv6:2607:7c80:54:3::133]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature ECDSA (secp384r1) server-digest SHA384) (No client certificate requested) by legolas.ozlabs.org (Postfix) with ESMTPS id 4T4GLS1Gvrz20LT for ; Wed, 3 Jan 2024 02:02:53 +1100 (AEDT) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=lists.infradead.org; s=bombadil.20210309; h=Sender: Content-Transfer-Encoding:Content-Type:List-Subscribe:List-Help:List-Post: List-Archive:List-Unsubscribe:List-Id:MIME-Version:Message-ID:Date:Subject:Cc :To:From:Reply-To:Content-ID:Content-Description:Resent-Date:Resent-From: Resent-Sender:Resent-To:Resent-Cc:Resent-Message-ID:In-Reply-To:References: List-Owner; bh=p3OTFsFnub85p9hrKvegDkpQSgOcQdMbshFcF221OcU=; b=l3sF4a4e21Rg/a sBSd1SQw2RLFf/lbmGvwekrBbkRUsKlUeMVSTjlTu/O1RYhApMXqfHSiaB4dqjzNRtrYCivUI1EHa GXjYhWN0kgiZV6Tiol2ikNjInAv1QtgTUoo+XIbjuFgZurk6gGcxNQ0VgUYbx58eYcUYwwBAEOKUo pMbBLTy/FlrdG2lrNxPrBGBSzgLe0A591b/Nz0pO6XXTQj0db6CFq4OFy+X3jSdbTEVXfXIEM1NQR nGq+gFxkjXSTyvf1qc5JFrZjufonGCSAelHI0g+vgbbhzoVzKaZekqbNWN8nHgkCfSSf/oJFncP/0 z8elEO9oMrn293rB/4dQ==; Received: from localhost ([::1] helo=bombadil.infradead.org) by bombadil.infradead.org with esmtp (Exim 4.96 #2 (Red Hat Linux)) id 1rKgES-008JM4-33; Tue, 02 Jan 2024 14:59:04 +0000 Received: from orthanc.universe-factory.net ([104.238.176.138]) by bombadil.infradead.org with esmtps (Exim 4.96 #2 (Red Hat Linux)) id 1rKgEN-008JKu-0J for openwrt-devel@lists.openwrt.org; Tue, 02 Jan 2024 14:59:00 +0000 Received: from avalon.lan (unknown [IPv6:2001:19f0:6c01:100::2]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by orthanc.universe-factory.net (Postfix) with ESMTPSA id CB3031F6AD; Tue, 2 Jan 2024 15:58:50 +0100 (CET) From: Matthias Schiffer To: Felix Fietkau Cc: David Bauer , openwrt-devel@lists.openwrt.org, Matthias Schiffer Subject: [PATCH netifd] system-linux: fix race condition in netlink socket error handing Date: Tue, 2 Jan 2024 15:58:30 +0100 Message-ID: <5c8c7673d95d24ac88468a8ab025fc66b9b15aee.1704207273.git.mschiffer@universe-factory.net> X-Mailer: git-send-email 2.43.0 MIME-Version: 1.0 X-CRM114-Version: 20100106-BlameMichelson ( TRE 0.8.0 (BSD) ) MR-646709E3 X-CRM114-CacheID: sfid-20240102_065859_276428_E1241ABD X-CRM114-Status: GOOD ( 16.30 ) X-Spam-Score: -0.0 (/) X-Spam-Report: Spam detection software, running on the system "bombadil.infradead.org", has NOT identified this incoming email as spam. The original message has been attached to this so you can view it or label similar future email. If you have any questions, see the administrator of that system for details. Content preview: The error handling needed for the buffer growth logic relies on uloop_fd's error flag, which is set based on epoll events. Doing so without handling recvmsg's error codes is racy, as an error state ma [...] Content analysis details: (-0.0 points, 5.0 required) pts rule name description ---- ---------------------- -------------------------------------------------- -0.0 SPF_PASS SPF: sender matches SPF record -0.0 SPF_HELO_PASS SPF: HELO matches SPF record X-BeenThere: openwrt-devel@lists.openwrt.org X-Mailman-Version: 2.1.34 Precedence: list List-Id: OpenWrt Development List List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Sender: "openwrt-devel" Errors-To: openwrt-devel-bounces+incoming=patchwork.ozlabs.org@lists.openwrt.org The error handling needed for the buffer growth logic relies on uloop_fd's error flag, which is set based on epoll events. Doing so without handling recvmsg's error codes is racy, as an error state may be set between receiving epoll events and the next recvmsg, but calling recvmsg clears the error state. To fix this, add handling for errors returned by nl_recvmsgs_default() and nl_recv(); checking for u->error and retrieving the error status using getsockopt() becomes redundant. We have observed this issue on Gluon (recent OpenWrt 23.05); on some devices with DSA switches, the bridge interface's carrier-on event would consistenly get lost during boot due to insufficient buffer space (see [1]). We have bisected the issue to netifd commit 516ab774cc16 ("system-linux: fix race condition on bringing up wireless devices"), but that commit only uncovered the preexisting bug by switching from getting the carrier state from sysfs to using the netlink messages in cb_rtnl_event(). I suspect that other recent issues about netifd missing a carrier state change like [2] may have the same underlying cause. [1] https://github.com/freifunk-gluon/gluon/issues/3130 [2] https://github.com/openwrt/openwrt/issues/13863 Signed-off-by: Matthias Schiffer --- system-linux.c | 38 +++++++++++++------------------------- 1 file changed, 13 insertions(+), 25 deletions(-) diff --git a/system-linux.c b/system-linux.c index e7945e3330a8..4463a2a8282a 100644 --- a/system-linux.c +++ b/system-linux.c @@ -169,19 +169,14 @@ static void handler_nl_event(struct uloop_fd *u, unsigned int events) { struct event_socket *ev = container_of(u, struct event_socket, uloop); - int err; - socklen_t errlen = sizeof(err); + int ret; - if (!u->error) { - nl_recvmsgs_default(ev->sock); + ret = nl_recvmsgs_default(ev->sock); + if (ret >= 0) return; - } - - if (getsockopt(u->fd, SOL_SOCKET, SO_ERROR, (void *)&err, &errlen)) - goto abort; - switch(err) { - case ENOBUFS: + switch (-ret) { + case NLE_NOMEM: /* Increase rx buffer size on netlink socket */ ev->bufsize *= 2; if (nl_socket_set_buffer_size(ev->sock, ev->bufsize, 0)) @@ -195,7 +190,6 @@ handler_nl_event(struct uloop_fd *u, unsigned int events) default: goto abort; } - u->error = false; return; abort: @@ -797,24 +791,19 @@ handle_hotplug_event(struct uloop_fd *u, unsigned int events) struct sockaddr_nl nla; unsigned char *buf = NULL; int size; - int err; - socklen_t errlen = sizeof(err); - if (!u->error) { - while ((size = nl_recv(ev->sock, &nla, &buf, NULL)) > 0) { - if (nla.nl_pid == 0) - handle_hotplug_msg((char *) buf, size); + while ((size = nl_recv(ev->sock, &nla, &buf, NULL)) > 0) { + if (nla.nl_pid == 0) + handle_hotplug_msg((char *) buf, size); - free(buf); - } - return; + free(buf); } - if (getsockopt(u->fd, SOL_SOCKET, SO_ERROR, (void *)&err, &errlen)) - goto abort; + switch (-size) { + case 0: + return; - switch(err) { - case ENOBUFS: + case NLE_NOMEM: /* Increase rx buffer size on netlink socket */ ev->bufsize *= 2; if (nl_socket_set_buffer_size(ev->sock, ev->bufsize, 0)) @@ -824,7 +813,6 @@ handle_hotplug_event(struct uloop_fd *u, unsigned int events) default: goto abort; } - u->error = false; return; abort: