From patchwork Sat Nov 7 19:37:55 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 1396227 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=qOfrOK/2; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CT6xG4Hmpz9sTv for ; Sun, 8 Nov 2020 06:38:10 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728546AbgKGTiK (ORCPT ); Sat, 7 Nov 2020 14:38:10 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32838 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725836AbgKGTiJ (ORCPT ); Sat, 7 Nov 2020 14:38:09 -0500 Received: from mail-oi1-x244.google.com (mail-oi1-x244.google.com [IPv6:2607:f8b0:4864:20::244]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 8BDD5C0613CF; Sat, 7 Nov 2020 11:38:09 -0800 (PST) Received: by mail-oi1-x244.google.com with SMTP id m143so5407232oig.7; Sat, 07 Nov 2020 11:38:09 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=NPVTKJWmUSvOJP9rbtvMF3MMcTo6hjjAwGFYmud3CuY=; b=qOfrOK/2tjZM8MaWG6SWiJfebBPU6LXs8nOF7SAKUFIlSkz9zEJ+HuX0HUxaZIsfnK KnBITDKgdmfrxlL3948tPiWNjwAs70n/rWx5pLOwLsbtBpYGPc/K1HTMQcP2tZmgDfWs AVmEz77datlf04tCJrQCBEv1gxmhUXsvcPPZpCH98pDoa6ZSnpEKIqMBx4Ss+lHHGERj zDsleUStKMrGTJoSht+m3YkGwTV12Vkb4AbntAnRajhtpPm+tm1dM8EsA2ijVquEeARl 3MmtOYCqSPHWjgoZZhnZmbkCQGvzJKSlyVj9ht0A4zkjHXndJLXVlzsUnK9CxAJbH3/1 NN4Q== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=NPVTKJWmUSvOJP9rbtvMF3MMcTo6hjjAwGFYmud3CuY=; b=lR4X3WLrjIVEyeH5XsRO0z9SGCvA1NC47SdhFgAwxpOeCDe0cM4UkwyyRHIUofZAlA 91pIGoCTYKEKD2lJWhlGBaYu3FbUXrrpRxy7YihF3n+qOytzUwF6aIpX4kJCWkeulCqE enkxC8nN0qGLaW/Q2U25NEXmeBG95lK7RdQcFOdxqKwPSsZTlfJYNcXuoq8j5t2nJ+WV f9cf4x4FgBU4yTSUqLXBI+FCtgvLeQXmd9s/i5DK4bRKxxIPdE8ehOpFQuxM5munYcp7 N25Gz6AgPqcoqQWLI1uutydac7+HO0c/4IqlCEv1X/4Py6Dx9vrjGxoarN8hLbMdzqCg RoVg== X-Gm-Message-State: AOAM53082H5FW71HHAq2iqkXQ1XgDlS/wNTh7ws0irjYnD1kfo3/zHHY yivWvL1aAfvuorwzTAWCMPB/F59xtUYRvQ== X-Google-Smtp-Source: ABdhPJzZUjVmCRkf6CSW4Q8oDRCjx2duacPPnJZinIlFsrUhF1foKa2T3fNV6ord91rwTbCxVHdt1Q== X-Received: by 2002:aca:3e86:: with SMTP id l128mr4510950oia.133.1604777889029; Sat, 07 Nov 2020 11:38:09 -0800 (PST) Received: from [127.0.1.1] ([184.63.162.180]) by smtp.gmail.com with ESMTPSA id j7sm1259489oie.44.2020.11.07.11.38.01 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Nov 2020 11:38:08 -0800 (PST) Subject: [bpf PATCH 1/5] bpf, sockmap: fix partial copy_page_to_iter so progress can still be made From: John Fastabend To: ast@kernel.org, daniel@iogearbox.net, jakub@cloudflare.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org Date: Sat, 07 Nov 2020 11:37:55 -0800 Message-ID: <160477787531.608263.10144789972668918015.stgit@john-XPS-13-9370> In-Reply-To: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> References: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> User-Agent: StGit/0.23-36-gc01b MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org If copy_page_to_iter() fails or even partially completes, but with fewer bytes copied than expected we currently reset sg.start and return EFAULT. This proves problematic if we already copied data into the user buffer before we return an error. Because we leave the copied data in the user buffer and fail to unwind the scatterlist so kernel side believes data has been copied and user side believes data has _not_ been received. Expected behavior should be to return number of bytes copied and then on the next read we need to return the error assuming its still there. This can happen if we have a copy length spanning multiple scatterlist elements and one or more complete before the error is hit. The error is rare enough though that my normal testing with server side programs, such as nginx, httpd, envoy, etc., I have never seen this. The only reliable way to reproduce that I've found is to stream movies over my browser for a day or so and wait for it to hang. Not very scientific, but with a few extra WARN_ON()s in the code the bug was obvious. When we review the errors from copy_page_to_iter() it seems we are hitting a page fault from copy_page_to_iter_iovec() where the code checks fault_in_pages_writeable(buf, copy) where buf is the user buffer. It also seems typical server applications don't hit this case. The other way to try and reproduce this is run the sockmap selftest tool test_sockmap with data verification enabled, but it doesn't reproduce the fault. Perhaps we can trigger this case artificially somehow from the test tools. I haven't sorted out a way to do that yet though. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend --- net/ipv4/tcp_bpf.c | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 37f4cb2bba5c..3709d679436e 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -15,8 +15,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, { struct iov_iter *iter = &msg->msg_iter; int peek = flags & MSG_PEEK; - int i, ret, copied = 0; struct sk_msg *msg_rx; + int i, copied = 0; msg_rx = list_first_entry_or_null(&psock->ingress_msg, struct sk_msg, list); @@ -37,10 +37,9 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, page = sg_page(sge); if (copied + copy > len) copy = len - copied; - ret = copy_page_to_iter(page, sge->offset, copy, iter); - if (ret != copy) { - msg_rx->sg.start = i; - return -EFAULT; + copy = copy_page_to_iter(page, sge->offset, copy, iter); + if (!copy) { + return copied ? copied : -EFAULT; } copied += copy; @@ -56,6 +55,11 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, put_page(page); } } else { + /* Lets not optimize peek case if copy_page_to_iter + * didn't copy the entire length lets just break. + */ + if (copy != sge->length) + goto out; sk_msg_iter_var_next(i); } @@ -82,6 +86,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, struct sk_msg, list); } +out: return copied; } EXPORT_SYMBOL_GPL(__tcp_bpf_recvmsg); From patchwork Sat Nov 7 19:38:15 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 1396228 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=g5J8Dz/H; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CT6xd6TLJz9sTv for ; Sun, 8 Nov 2020 06:38:29 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728651AbgKGTi3 (ORCPT ); Sat, 7 Nov 2020 14:38:29 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32890 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725836AbgKGTi2 (ORCPT ); Sat, 7 Nov 2020 14:38:28 -0500 Received: from mail-oi1-x244.google.com (mail-oi1-x244.google.com [IPv6:2607:f8b0:4864:20::244]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C62B1C0613CF; Sat, 7 Nov 2020 11:38:28 -0800 (PST) Received: by mail-oi1-x244.google.com with SMTP id t16so5389391oie.11; Sat, 07 Nov 2020 11:38:28 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=lu8xFLzKq5EcwF/rmpzYgaF3GRt/mG2wCMUgnyFW2F4=; b=g5J8Dz/HsGZtjd0AFilkNvenRhKE/uv0CLpDugjcbQKLK8o3WKw5ow7SpIhN6lvu3C ZuUcxmqC0D7k7vPpEGQsSn3axH2mnycVFxW+29tIAkeH1UWLTsxTUwseVLXunKqUM5AK bdUng6sV/8hwrFTGB4eV6knoOmU6YwLDQdtYgq0SNOInPO5hTC4FfK7I8sKKvHvdvTLT 9ncFd1C6rM718vYKb5SfvZnDtUc3pQAJBrAn3wjtgr7Q7mX26VxW+49ie+gGxQr6kos+ UKzbrASiDVCDdDWXnmJDOE7LwL0XvnIKd4cQ0e5rzGhHQtS+UI7jdt6NMsPioNIOWOqE K64A== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=lu8xFLzKq5EcwF/rmpzYgaF3GRt/mG2wCMUgnyFW2F4=; b=D72o1ApxNLrpkxV6S7w2NKGyAoIBKpMqvA0/Xbl0/RjFcgmtzSMCED6PKW7PoOIZLv Oj5y7XMh7xUrhvesti3sPjkrEuIK8ZjAZfvlj/WiOWCqjY12RsxcUndGMzjJSR/uYbdb AhDC8p5JbYv7k+znvYmx11kSGgA2q7aeCnvjr2VJlwi03DWMfCIcWIFOqHUU4bZ5j5Xr joNmjiIjhoSSoBJ+i0n26EMCuGyqnA3JSYHgX4070x501w89PG/qgvStZVHcKQaMS3mx R0b5Rswap12qnG4FWo4ZAPvwAsNye05+HVjnAleh5Hmi+RGN1tsZ12CP1uS83uhhlI6N DS9Q== X-Gm-Message-State: AOAM53237m42zgRiffH3zwzh1NxQGn5Ei2FNCr+uCSViKicdWiHxUzrj EwcJGkN0MXx1Vao5BGFLoj4= X-Google-Smtp-Source: ABdhPJy7qPlnp5/EvqHFs1UtAVy5avBgFP1x8B47wxkipA/CHg7aiTtRnKD53ItrFJAaUZOT/0FQ6A== X-Received: by 2002:aca:c70b:: with SMTP id x11mr4979894oif.58.1604777908240; Sat, 07 Nov 2020 11:38:28 -0800 (PST) Received: from [127.0.1.1] ([184.63.162.180]) by smtp.gmail.com with ESMTPSA id w70sm1235588oiw.29.2020.11.07.11.38.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Nov 2020 11:38:27 -0800 (PST) Subject: [bpf PATCH 2/5] bpf, sockmap: Ensure SO_RCVBUF memory is observed on ingress redirect From: John Fastabend To: ast@kernel.org, daniel@iogearbox.net, jakub@cloudflare.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org Date: Sat, 07 Nov 2020 11:38:15 -0800 Message-ID: <160477789531.608263.11406385279476664711.stgit@john-XPS-13-9370> In-Reply-To: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> References: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> User-Agent: StGit/0.23-36-gc01b MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org Fix sockmap sk_skb programs so that they observe sk_rcvbuf limits. This allows users to tune SO_RCVBUF and sockmap will honor them. We can refactor the if(charge) case out in later patches. But, keep this fix to the point. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Suggested-by: Jakub Sitnicki Signed-off-by: John Fastabend --- net/core/skmsg.c | 20 ++++++++++++++++---- net/ipv4/tcp_bpf.c | 3 ++- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 654182ecf87b..fe44280c033e 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -170,10 +170,12 @@ static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i, struct scatterlist *sge = sk_msg_elem(msg, i); u32 len = sge->length; - if (charge) - sk_mem_uncharge(sk, len); - if (!msg->skb) + /* When the skb owns the memory we free it from consume_skb path. */ + if (!msg->skb) { + if (charge) + sk_mem_uncharge(sk, len); put_page(sg_page(sge)); + } memset(sge, 0, sizeof(*sge)); return len; } @@ -403,6 +405,9 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) int copied = 0, num_sge; struct sk_msg *msg; + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + return -EAGAIN; + msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); if (unlikely(!msg)) return -EAGAIN; @@ -418,7 +423,14 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) return num_sge; } - sk_mem_charge(sk, skb->len); + /* This will transition ownership of the data from the socket where + * the BPF program was run initiating the redirect to the socket + * we will eventually receive this data on. The data will be released + * from skb_consume found in __tcp_bpf_recvmsg() after its been copied + * into user buffers. + */ + skb_set_owner_r(skb, sk); + copied = skb->len; msg->sg.start = 0; msg->sg.size = copied; diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c index 3709d679436e..30802fa99aa1 100644 --- a/net/ipv4/tcp_bpf.c +++ b/net/ipv4/tcp_bpf.c @@ -46,7 +46,8 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock, if (likely(!peek)) { sge->offset += copy; sge->length -= copy; - sk_mem_uncharge(sk, copy); + if (!msg_rx->skb) + sk_mem_uncharge(sk, copy); msg_rx->sg.size -= copy; if (!sge->length) { From patchwork Sat Nov 7 19:38:34 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 1396231 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=LpQAHeY0; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CT6y30QnDz9sTv for ; Sun, 8 Nov 2020 06:38:51 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728652AbgKGTiu (ORCPT ); Sat, 7 Nov 2020 14:38:50 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32946 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725836AbgKGTiu (ORCPT ); Sat, 7 Nov 2020 14:38:50 -0500 Received: from mail-ot1-x343.google.com (mail-ot1-x343.google.com [IPv6:2607:f8b0:4864:20::343]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 82366C0613CF; Sat, 7 Nov 2020 11:38:48 -0800 (PST) Received: by mail-ot1-x343.google.com with SMTP id j14so4684384ots.1; Sat, 07 Nov 2020 11:38:48 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=LuYW2qeLzVXpkHlbGr56D0fo6TLa6zXCgpJReHVU/Qc=; b=LpQAHeY0nLu2CK+5LZgRauz9xEh1lQY5OHKH6vnN19mo31dzxe5zwGQw7OCz4rEHoG IeyhBLGZRfLFe3GZXsz9xTuW6O7Q4+g4Nv+rt4YQRXI2dlJznjaJFfdpB8iY1HkW8AD0 8oqkaz6FMEu77ylS34IeU+yH3tRx/490JJQODqtyKViRe2sKDzu00hLnGKR6RtQLjFtA oWGg+5y7mpbVg7Ux0SUJJhUFyw+bQ0lmsFPeFOuPogIX1lwyfMWYrBH9hR1tJlEvKyMi yQTUb99jKvwextI7RqfjXv8b41vzo4j+sxmxGofp4m7aOs9uk5of7U0P5hNfYLRdcWN2 RvwQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=LuYW2qeLzVXpkHlbGr56D0fo6TLa6zXCgpJReHVU/Qc=; b=Rs067k/XZxGaQHil4QMOnk3vncLyAi9J6TlGoQqkpyVCeURNxi2X3XV2Z9B7c6rvD6 YC110ZoI0xMnrbo+KBCOIyT5GYsDRz8WyDmqGb7CvUaalU8gDr/eopAWWqkJBwQsF3UD +eVRUTOSC5KNd+kWBeozmdgKYmhGj3LCgtP3o784s12t8mQ7rJfnGjH86p2o6Oqn0EOl ch02fpFN5iiP4TIGnBxq9GK8FS8HtjtimLM0Kw9ot1pC3+LeJ28SgXzN84CTp4hxhMDB EQ2kVrf1X8sZqGkZ4TdJSM1KkCmocrogmrceQCpvxtEr0Yv0KhO2uhEMG1stV96INuTU 5DPA== X-Gm-Message-State: AOAM533LPuEQ3TZplZGYlpdtdy53JkLnm3I6AO/oFjnfyGqbkDytEWaz rna+sUf0JeNlYWpUnetRfPY= X-Google-Smtp-Source: ABdhPJy1MjIq5MZWQZHHBe6A2qKAokuMwgigkCLh14Iy4hkp+eI5apegM12NUOgWePabJFcIYKzpXw== X-Received: by 2002:a9d:3b84:: with SMTP id k4mr5603552otc.4.1604777927870; Sat, 07 Nov 2020 11:38:47 -0800 (PST) Received: from [127.0.1.1] ([184.63.162.180]) by smtp.gmail.com with ESMTPSA id r206sm1187825oih.14.2020.11.07.11.38.40 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Nov 2020 11:38:47 -0800 (PST) Subject: [bpf PATCH 3/5] bpf, sockmap: Avoid returning unneeded EAGAIN when redirecting to self From: John Fastabend To: ast@kernel.org, daniel@iogearbox.net, jakub@cloudflare.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org Date: Sat, 07 Nov 2020 11:38:34 -0800 Message-ID: <160477791482.608263.14389359214124051944.stgit@john-XPS-13-9370> In-Reply-To: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> References: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> User-Agent: StGit/0.23-36-gc01b MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org If a socket redirects to itself and it is under memory pressure it is possible to get a socket stuck so that recv() returns EAGAIN and the socket can not advance for some time. This happens because when redirecting a skb to the same socket we received the skb on we first check if it is OK to enqueue the skb on the receiving socket by checking memory limits. But, if the skb is itself the object holding the memory needed to enqueue the skb we will keep retrying from kernel side and always fail with EAGAIN. Then userspace will get a recv() EAGAIN error if there are no skbs in the psock ingress queue. This will continue until either some skbs get kfree'd causing the memory pressure to reduce far enough that we can enqueue the pending packet or the socket is destroyed. In some cases its possible to get a socket stuck for a noticable amount of time if the socket is only receiving skbs from sk_skb verdict programs. To reproduce I make the socket memory limits ridiculously low so sockets are always under memory pressure. More often though if under memory pressure it looks like a spurious EAGAIN error on user space side causing userspace to retry and typically enough has moved on the memory side that it works. To fix skip memory checks and skb_orphan if receiving on the same sock as already assigned. For SK_PASS cases this is easy, its always the same socket so we can just omit the orphan/set_owner pair. For backlog cases we need to check skb->sk and decide if the orphan and set_owner pair are needed. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend --- net/core/skmsg.c | 72 ++++++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 53 insertions(+), 19 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index fe44280c033e..580252e532da 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -399,38 +399,38 @@ int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from, } EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter); -static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) +static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, + struct sk_buff *skb) { - struct sock *sk = psock->sk; - int copied = 0, num_sge; struct sk_msg *msg; if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - return -EAGAIN; + return NULL; + + if (!sk_rmem_schedule(sk, skb, skb->len)) + return NULL; msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); if (unlikely(!msg)) - return -EAGAIN; - if (!sk_rmem_schedule(sk, skb, skb->len)) { - kfree(msg); - return -EAGAIN; - } + return NULL; sk_msg_init(msg); - num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len); + return msg; +} + +static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, + struct sk_psock *psock, + struct sock *sk, + struct sk_msg *msg) +{ + int num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len); + int copied; + if (unlikely(num_sge < 0)) { kfree(msg); return num_sge; } - /* This will transition ownership of the data from the socket where - * the BPF program was run initiating the redirect to the socket - * we will eventually receive this data on. The data will be released - * from skb_consume found in __tcp_bpf_recvmsg() after its been copied - * into user buffers. - */ - skb_set_owner_r(skb, sk); - copied = skb->len; msg->sg.start = 0; msg->sg.size = copied; @@ -442,6 +442,40 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) return copied; } +static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) +{ + struct sock *sk = psock->sk; + struct sk_msg *msg; + + msg = sk_psock_create_ingress_msg(sk, skb); + if (!msg) + return -EAGAIN; + + /* This will transition ownership of the data from the socket where + * the BPF program was run initiating the redirect to the socket + * we will eventually receive this data on. The data will be released + * from skb_consume found in __tcp_bpf_recvmsg() after its been copied + * into user buffers. + */ + skb_set_owner_r(skb, sk); + return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); +} + +/* Puts an skb on the ingress queue of the socket already assigned to the + * skb. In this case we do not need to check memory limits or skb_set_owner_r + * because the skb is already accounted for here. + */ +static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb) +{ + struct sk_msg *msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); + struct sock *sk = psock->sk; + + if (unlikely(!msg)) + return -EAGAIN; + sk_msg_init(msg); + return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); +} + static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb, u32 off, u32 len, bool ingress) { @@ -801,7 +835,7 @@ static void sk_psock_verdict_apply(struct sk_psock *psock, * retrying later from workqueue. */ if (skb_queue_empty(&psock->ingress_skb)) { - err = sk_psock_skb_ingress(psock, skb); + err = sk_psock_skb_ingress_self(psock, skb); } if (err < 0) { skb_queue_tail(&psock->ingress_skb, skb); From patchwork Sat Nov 7 19:38:54 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 1396232 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=bn7rWu//; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CT6yM6vTHz9sVC for ; Sun, 8 Nov 2020 06:39:07 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728669AbgKGTjH (ORCPT ); Sat, 7 Nov 2020 14:39:07 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:32992 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725836AbgKGTjH (ORCPT ); Sat, 7 Nov 2020 14:39:07 -0500 Received: from mail-ot1-x341.google.com (mail-ot1-x341.google.com [IPv6:2607:f8b0:4864:20::341]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DA4F3C0613CF; Sat, 7 Nov 2020 11:39:06 -0800 (PST) Received: by mail-ot1-x341.google.com with SMTP id k3so4646611otp.12; Sat, 07 Nov 2020 11:39:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=7SSO8P7QNjSDnyLJqh8IxVjW+CSfAgn4F8/JaQHnAL4=; b=bn7rWu//83wK0urzZf+yCJenGPDafaWYhy1Y8Mvdb+CQWTzGTyKrvDJ0ItkalDxfDO 9LLF4MAveDC2iL8iYvrqCyjPJlYDALMh+QzvOr+HjFKjjAqFmpCxqo7fQdx+zBqRDXPr N7XV4NucpbCKfXtkkR3Lz6M3ugSDTHMfsr16Cys2ka8jJfa/72CTLPh+4BN2UuioZe5o g0hdXHGH8FCQ3ojjHo8DLYJ+II3A/leT/VXe54Vg4A7aNGiCVbv3ruYdVtvvP1Tm6T0W zU7OFJTAa9RIB1RPKdDBBXze/CHuPvy84RA3IdRreViePbELGFmhIBcKDgfT/PSbTzmu thNA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=7SSO8P7QNjSDnyLJqh8IxVjW+CSfAgn4F8/JaQHnAL4=; b=NqhV/ztanwC1POrll4MrzUYyBDj8MkMIRXPT9Wgq/KCVzXdUium3nEWYmSofXu74gs +F8OUC1w/Srm98jhHS0NfjPB3uV+iMn1gKW6V/uQEM8vkaHMU+FYGLtgcqal6HrA4jno IUUHydD2NaMU+vmqWIx8p82iXiUg54ue4gwbxtXpfAkT5uh1F9PBPgZe0fAvz8+PLfRX UDcaA/QgKRvF2FiBmoZ4FwvnTuTMi4F7yKEMqTPvJ2IJtELI5/wECBvtiItL0DQwK661 pdSoGI9po4QNw6n0wWX/fa7cgduZrFQXCyz/WUGyOO0aBoxLodnRvVU8kLOZCCEEEGBz C3bw== X-Gm-Message-State: AOAM533uzKPY7qSiMLY0mDXTtLzZ4/WA3xa2VORf/dlfcdHPs02NyEzZ W3PKjUuT+Ox1yIGERN8NPG4= X-Google-Smtp-Source: ABdhPJyK7FIQDOg6PuDArIPCEmh0BvsvMgx6X48ITd/6wo8Xeap9PlK3A/JIpyO5BviKjOghbBATuw== X-Received: by 2002:a9d:4c18:: with SMTP id l24mr5034347otf.291.1604777946315; Sat, 07 Nov 2020 11:39:06 -0800 (PST) Received: from [127.0.1.1] ([184.63.162.180]) by smtp.gmail.com with ESMTPSA id k15sm1206555oor.11.2020.11.07.11.38.59 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Nov 2020 11:39:05 -0800 (PST) Subject: [bpf PATCH 4/5] bpf, sockmap: Handle memory acct if skb_verdict prog redirects to self From: John Fastabend To: ast@kernel.org, daniel@iogearbox.net, jakub@cloudflare.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org Date: Sat, 07 Nov 2020 11:38:54 -0800 Message-ID: <160477793403.608263.17626285322866020367.stgit@john-XPS-13-9370> In-Reply-To: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> References: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> User-Agent: StGit/0.23-36-gc01b MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org If the skb_verdict_prog redirects an skb knowingly to itself, fix your BPF program this is not optimal and an abuse of the API please use SK_PASS. That said there may be cases, such as socket load balancing, where picking the socket is hashed based or otherwise picks the same socket it was received on in some rare cases. If this happens we don't want to confuse userspace giving them an EAGAIN error if we can avoid it. To avoid double accounting in these cases. At the moment even if the skb has already been charged against the sockets rcvbuf and forward alloc we check it again and do set_owner_r() causing it to be orphaned and recharged. For one this is useless work, but more importantly we can have a case where the skb could be put on the ingress queue, but because we are under memory pressure we return EAGAIN. The trouble here is the skb has already been accounted for so any rcvbuf checks include the memory associated with the packet already. This rolls up and can result in unecessary EAGAIN errors in userspace read() calls. Fix by doing an unlikely check and skipping checks if skb->sk == sk. Fixes: 51199405f9672 ("bpf: skb_verdict, support SK_PASS on RX BPF path") Signed-off-by: John Fastabend --- net/core/skmsg.c | 17 +++++++++++------ 1 file changed, 11 insertions(+), 6 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 580252e532da..59c36a672256 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -404,11 +404,13 @@ static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk, { struct sk_msg *msg; - if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) - return NULL; + if (likely(skb->sk != sk)) { + if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) + return NULL; - if (!sk_rmem_schedule(sk, skb, skb->len)) - return NULL; + if (!sk_rmem_schedule(sk, skb, skb->len)) + return NULL; + } msg = kzalloc(sizeof(*msg), __GFP_NOWARN | GFP_ATOMIC); if (unlikely(!msg)) @@ -455,9 +457,12 @@ static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb) * the BPF program was run initiating the redirect to the socket * we will eventually receive this data on. The data will be released * from skb_consume found in __tcp_bpf_recvmsg() after its been copied - * into user buffers. + * into user buffers. If we are receiving on the same sock skb->sk is + * already assigned, skip memory accounting and owner transition seeing + * it already set correctly. */ - skb_set_owner_r(skb, sk); + if (likely(skb->sk != sk)) + skb_set_owner_r(skb, sk); return sk_psock_skb_ingress_enqueue(skb, psock, sk, msg); } From patchwork Sat Nov 7 19:39:12 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: John Fastabend X-Patchwork-Id: 1396233 X-Patchwork-Delegate: bpf@iogearbox.net Return-Path: X-Original-To: patchwork-incoming-netdev@ozlabs.org Delivered-To: patchwork-incoming-netdev@ozlabs.org Authentication-Results: ozlabs.org; spf=pass (sender SPF authorized) smtp.mailfrom=vger.kernel.org (client-ip=23.128.96.18; helo=vger.kernel.org; envelope-from=netdev-owner@vger.kernel.org; receiver=) Authentication-Results: ozlabs.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: ozlabs.org; dkim=pass (2048-bit key; unprotected) header.d=gmail.com header.i=@gmail.com header.a=rsa-sha256 header.s=20161025 header.b=LkwWflC+; dkim-atps=neutral Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by ozlabs.org (Postfix) with ESMTP id 4CT6yk4k5dz9sVC for ; Sun, 8 Nov 2020 06:39:26 +1100 (AEDT) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1728680AbgKGTj0 (ORCPT ); Sat, 7 Nov 2020 14:39:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:33044 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725836AbgKGTjZ (ORCPT ); Sat, 7 Nov 2020 14:39:25 -0500 Received: from mail-ot1-x342.google.com (mail-ot1-x342.google.com [IPv6:2607:f8b0:4864:20::342]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A92CDC0613CF; Sat, 7 Nov 2020 11:39:25 -0800 (PST) Received: by mail-ot1-x342.google.com with SMTP id n15so4671950otl.8; Sat, 07 Nov 2020 11:39:25 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=subject:from:to:cc:date:message-id:in-reply-to:references :user-agent:mime-version:content-transfer-encoding; bh=umgOFSJKneTT4hVW+UHmrcTgF4QshDfext+F1vpx4yE=; b=LkwWflC+i3Igs2WVRfu3mUGpxMJBmKVzpAfGt6qd43ryZRsh+IMOIE5z8KO8o14uzs d1NDfluOkv1g05OooeqUXqWiF2R8flp8v5quuFFWe2O6gTgCMq9UwXy6au/N+Q/BbPNr 4vaCveIR5kveXvHcS2QFM+fkM4TLIg2cRBZykweCr8J2wWbiHiKIl2Ue+IK/CyD9Tko7 F3FBz29Z0JaKSmUjEAFlg03k3F4XEM5UwW5BFVfGWoFh8DXMjU50RW9UayjZ+9/WIKxr U9ZJAdk7dnf04XHe7sQ3opEmwJngzCcw+l/Ac0uemG1qMNadxwsTxKdxyh9tXX88+xiB kqbA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:subject:from:to:cc:date:message-id:in-reply-to :references:user-agent:mime-version:content-transfer-encoding; bh=umgOFSJKneTT4hVW+UHmrcTgF4QshDfext+F1vpx4yE=; b=prxkPYozxBqMdp6wPcjOz6Elajy1pkwRNY0wXNNNu7l5UeT7GniaCzpgxw//iBoGh6 WAv68aZDZCfNgtS0S0PeeUGgErUVOeuKksnTJs+m1ko/2aJ923nojEg8x4SX+/1Cc3Dc pC33QF+yLAfo+R3MDKAboIL2CemYud5/JqsHdwklvrbiL3Sr7y6LsphnJSU/nXzkZKbW pjxr7a6RDw7Bm/BQvNGMalEyfqjf4y6ZV88faxX6bCIANaV4HAAcHN4Xqx5iOMsglM41 7v1oiAPUveh/qOpW/GeQ2QxiukaHTHjimVCGAAGimb6fCVJXeTwHzOXJqlAxGlfSpvc8 O8ZA== X-Gm-Message-State: AOAM533QRKh8dnI4svy0dQ00GydBQrdEivzAAUUA0+H+L+ixHXyE7l+Q fzpJRAnULlCmuNoxRiqurro= X-Google-Smtp-Source: ABdhPJwgVaKh1amjH7VOKqvRSrC2rBZkR8YlWq223q35VVOWaHrnBEGBlrvSrJMMWsCZr/t1uOHY2g== X-Received: by 2002:a05:6830:1e70:: with SMTP id m16mr5098361otr.51.1604777965014; Sat, 07 Nov 2020 11:39:25 -0800 (PST) Received: from [127.0.1.1] ([184.63.162.180]) by smtp.gmail.com with ESMTPSA id 85sm1252346oie.30.2020.11.07.11.39.17 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 07 Nov 2020 11:39:24 -0800 (PST) Subject: [bpf PATCH 5/5] bpf, sockmap: Avoid failures from skb_to_sgvec when skb has frag_list From: John Fastabend To: ast@kernel.org, daniel@iogearbox.net, jakub@cloudflare.com Cc: john.fastabend@gmail.com, bpf@vger.kernel.org, netdev@vger.kernel.org Date: Sat, 07 Nov 2020 11:39:12 -0800 Message-ID: <160477795249.608263.2308084426369232846.stgit@john-XPS-13-9370> In-Reply-To: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> References: <160477770483.608263.6057216691957042088.stgit@john-XPS-13-9370> User-Agent: StGit/0.23-36-gc01b MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org When skb has a frag_list its possible for skb_to_sgvec() to fail. This happens when the scatterlist has fewer elements to store pages than would be needed for the initial skb plus any of its frags. This case appears rare, but is possible when running an RX parser/verdict programs exposed to the internet. Currently, when this happens we throw an error, break the pipe, and kfree the msg. This effectively breaks the application or forces it to do a retry. Lets catch this case and handle it by doing an skb_linearize() on any skb we receive with frags. At this point skb_to_sgvec should not fail because the failing conditions would require frags to be in place. Fixes: 604326b41a6fb ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: John Fastabend --- net/core/skmsg.c | 11 +++++++++-- 1 file changed, 9 insertions(+), 2 deletions(-) diff --git a/net/core/skmsg.c b/net/core/skmsg.c index 59c36a672256..b2063ae5648c 100644 --- a/net/core/skmsg.c +++ b/net/core/skmsg.c @@ -425,9 +425,16 @@ static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb, struct sock *sk, struct sk_msg *msg) { - int num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len); - int copied; + int num_sge, copied; + /* skb linearize may fail with ENOMEM, but lets simply try again + * later if this happens. Under memory pressure we don't want to + * drop the skb. We need to linearize the skb so that the mapping + * in skb_to_sgvec can not error. + */ + if (skb_linearize(skb)) + return -EAGAIN; + num_sge = skb_to_sgvec(skb, msg->sg.data, 0, skb->len); if (unlikely(num_sge < 0)) { kfree(msg); return num_sge;