From patchwork Thu Nov 19 12:38:21 2020
X-Patchwork-Submitter: Paolo Abeni <pabeni@redhat.com>
X-Patchwork-Id: 1402995
X-Patchwork-Delegate: matthieu.baerts@tessares.net
From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:21 +0100
Subject: [MPTCP] [PATCH v5 1/6] mptcp: open code mptcp variant for lock_sock

This allows invoking an additional callback under the socket spin lock.
It will be used by the next patches to avoid additional spin lock
contention.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
 net/core/sock.c      |  2 +-
 net/mptcp/protocol.h | 15 +++++++++++++++
 2 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 727ea1cc633c..89ee8582bf01 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -2486,7 +2486,7 @@ bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag)
 }
 EXPORT_SYMBOL(sk_page_frag_refill);
 
-static void __lock_sock(struct sock *sk)
+void __lock_sock(struct sock *sk)
 	__releases(&sk->sk_lock.slock)
 	__acquires(&sk->sk_lock.slock)
 {
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 82d5626323b1..e9f553993322 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -253,6 +253,21 @@ struct mptcp_sock {
 	} rcvq_space;
 };
 
+void __lock_sock(struct sock *sk);
+
+#define mptcp_lock_sock(___sk, cb) do {					\
+	struct sock *__sk = (___sk); /* silence macro reuse warning */	\
+	might_sleep();							\
+	spin_lock_bh(&__sk->sk_lock.slock);				\
+	if (__sk->sk_lock.owned)					\
+		__lock_sock(__sk);					\
+	cb;								\
+	__sk->sk_lock.owned = 1;					\
+	spin_unlock(&__sk->sk_lock.slock);				\
+	mutex_acquire(&__sk->sk_lock.dep_map, 0, 0, _RET_IP_);		\
+	local_bh_enable();						\
+} while (0)
+
 #define mptcp_for_each_subflow(__msk, __subflow)			\
 	list_for_each_entry(__subflow, &((__msk)->conn_list), node)
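
For context: the follow-up patches in this series pass the extra work as
the "cb" argument, e.g. mptcp_lock_sock(sk, __mptcp_wmem_reserve(sk, len))
in mptcp_sendmsg() and mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk))
in mptcp_recvmsg(). Below is a minimal userspace sketch of the locking
pattern only - hypothetical model_* names, pthread primitives and a
busy-wait in place of the kernel's sleeping __lock_sock(), so it is not
the kernel implementation - showing that the callback runs with exclusive
access, after any previous owner has released the lock and before the
socket is marked owned:

/* Userspace model of mptcp_lock_sock(): the callback executes under the
 * spinlock that guards the "owned" flag, so it cannot race with the
 * data path.
 */
#include <pthread.h>
#include <sched.h>
#include <stdbool.h>

struct model_sock {
	pthread_spinlock_t slock;	/* models sk->sk_lock.slock */
	bool owned;			/* models sk->sk_lock.owned */
};

/* models __lock_sock(): wait until the current owner drops the lock */
static void model_wait_for_owner(struct model_sock *sk)
{
	while (sk->owned) {
		pthread_spin_unlock(&sk->slock);
		sched_yield();
		pthread_spin_lock(&sk->slock);
	}
}

/* models mptcp_lock_sock(sk, cb) */
static void model_lock_sock(struct model_sock *sk,
			    void (*cb)(struct model_sock *))
{
	pthread_spin_lock(&sk->slock);
	if (sk->owned)
		model_wait_for_owner(sk);
	if (cb)
		cb(sk);		/* e.g. reserve wmem, splice the rx queue */
	sk->owned = true;	/* we now own the socket */
	pthread_spin_unlock(&sk->slock);
}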

From patchwork Thu Nov 19 12:38:22 2020
X-Patchwork-Submitter: Paolo Abeni <pabeni@redhat.com>
X-Patchwork-Id: 1402997
X-Patchwork-Delegate: matthieu.baerts@tessares.net
From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:22 +0100
Subject: [MPTCP] [PATCH v5 2/6] mptcp: implement wmem reservation

This leverages the previous commit to reserve the wmem required for
the sendmsg() operation when the msk socket lock is first acquired.
Some heuristic is used to get a reasonable over-estimate of the whole
memory required. If we can't forward-allocate that amount, fall back
to a reasonable small chunk; if even that fails, enter the
wait-for-memory path.

When sendmsg() needs more memory it looks at wmem_reserved first and,
if that is exhausted, moves more space from sk_forward_alloc.

The reserved memory is a transient state and is released at the next
socket unlock via the release_cb().

Overall this will simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
 - rename wmem_alloc -> wmem_reserved. This now tracks a transient
   value, which should not really be exposed to user-space.
- use mptcp_lock_sock()/release_cb() to update wmem_reserved --- net/mptcp/protocol.c | 92 ++++++++++++++++++++++++++++++++++++++++---- net/mptcp/protocol.h | 1 + 2 files changed, 86 insertions(+), 7 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index d11d1a437f41..44751ea02e54 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -859,6 +859,81 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk, df->data_seq + df->data_len == msk->write_seq; } +static int mptcp_wmem_with_overhead(int size) +{ + return size + ((sizeof(struct mptcp_data_frag) * size) >> PAGE_SHIFT); +} + +static void __mptcp_wmem_reserve(struct sock *sk, int size) +{ + int amount = mptcp_wmem_with_overhead(size); + struct mptcp_sock *msk = mptcp_sk(sk); + + WARN_ON_ONCE(msk->wmem_reserved); + if (amount <= sk->sk_forward_alloc) + goto reserve; + + /* under memory pressure try to reserve at most a single page + * otherwise try to reserve the full estimate and fallback + * to a single page before entering the error path + */ + if ((tcp_under_memory_pressure(sk) && amount > PAGE_SIZE) || + !sk_wmem_schedule(sk, amount)) { + if (amount <= PAGE_SIZE) + goto nomem; + + amount = PAGE_SIZE; + if (!sk_wmem_schedule(sk, amount)) + goto nomem; + } + +reserve: + msk->wmem_reserved = amount; + sk->sk_forward_alloc -= amount; + return; + +nomem: + /* we will wait for memory on next allocation */ + msk->wmem_reserved = -1; +} + +static void __mptcp_update_wmem(struct sock *sk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + if (!msk->wmem_reserved) + return; + + if (msk->wmem_reserved < 0) + msk->wmem_reserved = 0; + if (msk->wmem_reserved > 0) { + sk->sk_forward_alloc += msk->wmem_reserved; + msk->wmem_reserved = 0; + } +} + +static bool mptcp_wmem_alloc(struct sock *sk, int size) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + /* check for pre-existing error condition */ + if (msk->wmem_reserved < 0) + return false; + + if (msk->wmem_reserved >= size) + goto account; + + if (!sk_wmem_schedule(sk, size)) + return false; + + sk->sk_forward_alloc -= size; + msk->wmem_reserved += size; + +account: + msk->wmem_reserved -= size; + return true; +} + static void dfrag_uncharge(struct sock *sk, int len) { sk_mem_uncharge(sk, len); @@ -916,7 +991,7 @@ static void mptcp_clean_una(struct sock *sk) } out: - if (cleaned) + if (cleaned && tcp_under_memory_pressure(sk)) sk_mem_reclaim_partial(sk); } @@ -1292,7 +1367,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (msg->msg_flags & ~(MSG_MORE | MSG_DONTWAIT | MSG_NOSIGNAL)) return -EOPNOTSUPP; - lock_sock(sk); + mptcp_lock_sock(sk, __mptcp_wmem_reserve(sk, len)); timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); @@ -1341,11 +1416,12 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) offset = dfrag->offset + dfrag->data_len; psize = pfrag->size - offset; psize = min_t(size_t, psize, msg_data_left(msg)); - if (!sk_wmem_schedule(sk, psize + frag_truesize)) + if (!mptcp_wmem_alloc(sk, psize + frag_truesize)) goto wait_for_memory; if (copy_page_from_iter(dfrag->page, offset, psize, &msg->msg_iter) != psize) { + msk->wmem_reserved += psize + frag_truesize; ret = -EFAULT; goto out; } @@ -1361,7 +1437,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) * Note: we charge such data both to sk and ssk */ sk_wmem_queued_add(sk, frag_truesize); - sk->sk_forward_alloc -= frag_truesize; if (!dfrag_collapsed) { get_page(dfrag->page); list_add_tail(&dfrag->list, 
&msk->rtx_queue); @@ -1982,6 +2057,7 @@ static int __mptcp_init_sock(struct sock *sk) INIT_WORK(&msk->work, mptcp_worker); msk->out_of_order_queue = RB_ROOT; msk->first_pending = NULL; + msk->wmem_reserved = 0; msk->ack_hint = NULL; msk->first = NULL; @@ -2180,6 +2256,7 @@ static void __mptcp_destroy_sock(struct sock *sk) sk->sk_prot->destroy(sk); + WARN_ON_ONCE(msk->wmem_reserved); sk_stream_kill_queues(sk); xfrm_sk_free_policy(sk); sk_refcnt_debug_release(sk); @@ -2527,13 +2604,14 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname, #define MPTCP_DEFERRED_ALL (TCPF_WRITE_TIMER_DEFERRED) -/* this is very alike tcp_release_cb() but we must handle differently a - * different set of events - */ +/* processes deferred events and flush wmem */ static void mptcp_release_cb(struct sock *sk) { unsigned long flags, nflags; + /* clear any wmem reservation and errors */ + __mptcp_update_wmem(sk); + do { flags = sk->sk_tsq_flags; if (!(flags & MPTCP_DEFERRED_ALL)) diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index e9f553993322..4d7d14ea0fb0 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -218,6 +218,7 @@ struct mptcp_sock { u64 ack_seq; u64 rcv_wnd_sent; u64 rcv_data_fin_seq; + int wmem_reserved; struct sock *last_snd; int snd_burst; int old_wspace; From patchwork Thu Nov 19 12:38:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 1402996 X-Patchwork-Delegate: matthieu.baerts@tessares.net Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.01.org (client-ip=198.145.21.10; helo=ml01.01.org; envelope-from=mptcp-bounces@lists.01.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=Fegjd4UP; dkim-atps=neutral Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4CcK3z5y5Vz9sVD for ; Thu, 19 Nov 2020 23:38:55 +1100 (AEDT) Received: from ml01.vlan13.01.org (localhost [IPv6:::1]) by ml01.01.org (Postfix) with ESMTP id D078F100EBB96; Thu, 19 Nov 2020 04:38:52 -0800 (PST) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=216.205.24.124; helo=us-smtp-delivery-124.mimecast.com; envelope-from=pabeni@redhat.com; receiver= Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [216.205.24.124]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 0DBA8100EBBB6 for ; Thu, 19 Nov 2020 04:38:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1605789526; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=boWFGqUJIMAEqI7bd72vU5J6yWkOmsf3Ut2Ln2kRdys=; b=Fegjd4UPsv5Hm1x3xl4jSPiSd0jsS395JnK2zSOTSfe7JvhN65DXwUnopqQitUaVBlqC1F 
From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:23 +0100
Subject: [MPTCP] [PATCH v5 3/6] mptcp: protect the rx path with the msk socket spinlock

Such spinlock is currently used only to protect the 'owned' flag inside
the socket lock itself. With this patch, we extend its scope to protect
the whole msk receive path and sk_forward_memory.

Given the above, we can always move data into the msk receive queue (and
OoO queue) from the subflow.

We leverage the previous commit, so that we need to acquire the spinlock
in the tx path only when moving fwd memory.

recvmsg() must now explicitly acquire the socket spinlock when moving
skbs out of sk_receive_queue. To reduce the number of lock operations
required, we use a second rx queue and splice the first into the latter
in mptcp_lock_sock(). Additionally, rmem-allocated memory is bulk-freed
via release_cb().

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
v2 -> v3:
 - use an msk variable to track rmem bulk free
 - use release_cb()/mptcp_lock_sock() to avoid a couple of data lock
   contentions.
v1 -> v2:
 - keep skb->destructor set for skbs in the msk receive queue and
   clear it at dequeue time.
This simplify rmem accounting and reduces the deltas - added {READ,WRITE}_ONCE annotation for ack_hint - drop no-more-needed check in __mptcp_move_skbs(): we are always called under socket lock and can't race anymore with close --- net/mptcp/protocol.c | 142 +++++++++++++++++++++++++++++-------------- net/mptcp/protocol.h | 5 ++ 2 files changed, 103 insertions(+), 44 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 44751ea02e54..3074e0720110 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -448,7 +448,7 @@ static void mptcp_send_ack(struct mptcp_sock *msk, bool force) /* if the hintes ssk is still active, use it */ pick = ssk; - if (ssk == msk->ack_hint) + if (ssk == READ_ONCE(msk->ack_hint)) break; } if (!force && pick) { @@ -600,13 +600,13 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk, break; } } while (more_data_avail); - msk->ack_hint = ssk; + WRITE_ONCE(msk->ack_hint, ssk); *bytes += moved; return done; } -static bool mptcp_ofo_queue(struct mptcp_sock *msk) +static bool __mptcp_ofo_queue(struct mptcp_sock *msk) { struct sock *sk = (struct sock *)msk; struct sk_buff *skb, *tail; @@ -652,34 +652,27 @@ static bool mptcp_ofo_queue(struct mptcp_sock *msk) /* In most cases we will be able to lock the mptcp socket. If its already * owned, we need to defer to the work queue to avoid ABBA deadlock. */ -static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) +static void move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk) { struct sock *sk = (struct sock *)msk; unsigned int moved = 0; - if (READ_ONCE(sk->sk_lock.owned)) - return false; - - if (unlikely(!spin_trylock_bh(&sk->sk_lock.slock))) - return false; - - /* must re-check after taking the lock */ - if (!READ_ONCE(sk->sk_lock.owned)) { - __mptcp_move_skbs_from_subflow(msk, ssk, &moved); - mptcp_ofo_queue(msk); + if (inet_sk_state_load(sk) == TCP_CLOSE) + return; - /* If the moves have caught up with the DATA_FIN sequence number - * it's time to ack the DATA_FIN and change socket state, but - * this is not a good place to change state. Let the workqueue - * do it. - */ - if (mptcp_pending_data_fin(sk, NULL)) - mptcp_schedule_work(sk); - } + mptcp_data_lock(sk); - spin_unlock_bh(&sk->sk_lock.slock); + __mptcp_move_skbs_from_subflow(msk, ssk, &moved); + __mptcp_ofo_queue(msk); - return moved > 0; + /* If the moves have caught up with the DATA_FIN sequence number + * it's time to ack the DATA_FIN and change socket state, but + * this is not a good place to change state. Let the workqueue + * do it. 
+ */ + if (mptcp_pending_data_fin(sk, NULL)) + mptcp_schedule_work(sk); + mptcp_data_unlock(sk); } void mptcp_data_ready(struct sock *sk, struct sock *ssk) @@ -923,17 +916,30 @@ static bool mptcp_wmem_alloc(struct sock *sk, int size) if (msk->wmem_reserved >= size) goto account; - if (!sk_wmem_schedule(sk, size)) + mptcp_data_lock(sk); + if (!sk_wmem_schedule(sk, size)) { + mptcp_data_unlock(sk); return false; + } sk->sk_forward_alloc -= size; msk->wmem_reserved += size; + mptcp_data_unlock(sk); account: msk->wmem_reserved -= size; return true; } +static void mptcp_wmem_uncharge(struct sock *sk, int size) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + if (msk->wmem_reserved < 0) + msk->wmem_reserved = 0; + msk->wmem_reserved += size; +} + static void dfrag_uncharge(struct sock *sk, int len) { sk_mem_uncharge(sk, len); @@ -962,6 +968,7 @@ static void mptcp_clean_una(struct sock *sk) if (__mptcp_check_fallback(msk)) atomic64_set(&msk->snd_una, msk->snd_nxt); + mptcp_data_lock(sk); snd_una = atomic64_read(&msk->snd_una); list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) { @@ -993,6 +1000,7 @@ static void mptcp_clean_una(struct sock *sk) out: if (cleaned && tcp_under_memory_pressure(sk)) sk_mem_reclaim_partial(sk); + mptcp_data_unlock(sk); } static void mptcp_clean_una_wakeup(struct sock *sk) @@ -1421,7 +1429,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) if (copy_page_from_iter(dfrag->page, offset, psize, &msg->msg_iter) != psize) { - msk->wmem_reserved += psize + frag_truesize; + mptcp_wmem_uncharge(sk, psize + frag_truesize); ret = -EFAULT; goto out; } @@ -1487,11 +1495,10 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk, struct msghdr *msg, size_t len) { - struct sock *sk = (struct sock *)msk; struct sk_buff *skb; int copied = 0; - while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) { + while ((skb = skb_peek(&msk->receive_queue)) != NULL) { u32 offset = MPTCP_SKB_CB(skb)->offset; u32 data_len = skb->len - offset; u32 count = min_t(size_t, len - copied, data_len); @@ -1511,7 +1518,10 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk, break; } - __skb_unlink(skb, &sk->sk_receive_queue); + /* we will bulk release the skb memory later */ + skb->destructor = NULL; + msk->rmem_released += skb->truesize; + __skb_unlink(skb, &msk->receive_queue); __kfree_skb(skb); if (copied >= len) @@ -1619,25 +1629,47 @@ static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied) msk->rcvq_space.time = mstamp; } +static void __mptcp_update_rmem(struct sock *sk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + if (!msk->rmem_released) + return; + + atomic_sub(msk->rmem_released, &sk->sk_rmem_alloc); + sk_mem_uncharge(sk, msk->rmem_released); + msk->rmem_released = 0; +} + +static void __mptcp_splice_receive_queue(struct sock *sk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue); +} + static bool __mptcp_move_skbs(struct mptcp_sock *msk, unsigned int rcv) { + struct sock *sk = (struct sock *)msk; unsigned int moved = 0; - bool done; - - /* avoid looping forever below on racing close */ - if (((struct sock *)msk)->sk_state == TCP_CLOSE) - return false; + bool ret, done; __mptcp_flush_join_list(msk); do { struct sock *ssk = mptcp_subflow_recv_lookup(msk); bool slowpath; - if (!ssk) + /* we can have data pending in the subflows only if the msk + * receive buffer was full at subflow_data_ready() time, + * it is an likely slow path + */ + if (likely(!ssk)) break; slowpath = 
lock_sock_fast(ssk); + mptcp_data_lock(sk); done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved); + mptcp_data_unlock(sk); if (moved && rcv) { WRITE_ONCE(msk->rmem_pending, min(rcv, moved)); tcp_cleanup_rbuf(ssk, 1); @@ -1646,11 +1678,19 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk, unsigned int rcv) unlock_sock_fast(ssk, slowpath); } while (!done); - if (mptcp_ofo_queue(msk) || moved > 0) { - mptcp_check_data_fin((struct sock *)msk); - return true; + /* acquire the data lock only if some input data is pending */ + ret = moved > 0; + if (!RB_EMPTY_ROOT(&msk->out_of_order_queue) || + !skb_queue_empty_lockless(&sk->sk_receive_queue)) { + mptcp_data_lock(sk); + __mptcp_update_rmem(sk); + ret |= __mptcp_ofo_queue(msk); + __mptcp_splice_receive_queue(sk); + mptcp_data_unlock(sk); } - return false; + if (ret) + mptcp_check_data_fin((struct sock *)msk); + return !skb_queue_empty(&msk->receive_queue); } static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, @@ -1664,7 +1704,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, if (msg->msg_flags & ~(MSG_WAITALL | MSG_DONTWAIT)) return -EOPNOTSUPP; - lock_sock(sk); + mptcp_lock_sock(sk, __mptcp_splice_receive_queue(sk)); timeo = sock_rcvtimeo(sk, nonblock); len = min_t(size_t, len, INT_MAX); @@ -1683,7 +1723,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, copied += bytes_read; - if (skb_queue_empty(&sk->sk_receive_queue) && + if (skb_queue_empty(&msk->receive_queue) && __mptcp_move_skbs(msk, len - copied)) continue; @@ -1714,8 +1754,14 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, if (test_and_clear_bit(MPTCP_WORK_EOF, &msk->flags)) mptcp_check_for_eof(msk); - if (sk->sk_shutdown & RCV_SHUTDOWN) + if (sk->sk_shutdown & RCV_SHUTDOWN) { + /* race breaker: the shutdown could be after the + * previous receive queue check + */ + if (__mptcp_move_skbs(msk, len - copied)) + continue; break; + } if (sk->sk_state == TCP_CLOSE) { copied = -ENOTCONN; @@ -1737,7 +1783,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, mptcp_wait_data(sk, &timeo); } - if (skb_queue_empty(&sk->sk_receive_queue)) { + if (skb_queue_empty_lockless(&sk->sk_receive_queue) && + skb_queue_empty(&msk->receive_queue)) { /* entire backlog drained, clear DATA_READY. 
*/ clear_bit(MPTCP_DATA_READY, &msk->flags); @@ -1753,7 +1800,7 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, out_err: pr_debug("msk=%p data_ready=%d rx queue empty=%d copied=%d", msk, test_bit(MPTCP_DATA_READY, &msk->flags), - skb_queue_empty(&sk->sk_receive_queue), copied); + skb_queue_empty_lockless(&sk->sk_receive_queue), copied); mptcp_rcv_space_adjust(msk, copied); release_sock(sk); @@ -2055,9 +2102,11 @@ static int __mptcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&msk->join_list); INIT_LIST_HEAD(&msk->rtx_queue); INIT_WORK(&msk->work, mptcp_worker); + __skb_queue_head_init(&msk->receive_queue); msk->out_of_order_queue = RB_ROOT; msk->first_pending = NULL; msk->wmem_reserved = 0; + msk->rmem_released = 0; msk->ack_hint = NULL; msk->first = NULL; @@ -2257,6 +2306,7 @@ static void __mptcp_destroy_sock(struct sock *sk) sk->sk_prot->destroy(sk); WARN_ON_ONCE(msk->wmem_reserved); + WARN_ON_ONCE(msk->rmem_released); sk_stream_kill_queues(sk); xfrm_sk_free_policy(sk); sk_refcnt_debug_release(sk); @@ -2476,6 +2526,9 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err, void mptcp_destroy_common(struct mptcp_sock *msk) { + struct sock *sk = (struct sock *)msk; + + skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue); skb_rbtree_purge(&msk->out_of_order_queue); mptcp_token_destroy(msk); mptcp_pm_free_anno_list(msk); @@ -2611,6 +2664,7 @@ static void mptcp_release_cb(struct sock *sk) /* clear any wmem reservation and errors */ __mptcp_update_wmem(sk); + __mptcp_update_rmem(sk); do { flags = sk->sk_tsq_flags; diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 4d7d14ea0fb0..f49662ffb607 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -227,6 +227,7 @@ struct mptcp_sock { unsigned long timer_ival; u32 token; int rmem_pending; + int rmem_released; unsigned long flags; bool can_ack; bool fully_established; @@ -238,6 +239,7 @@ struct mptcp_sock { struct work_struct work; struct sk_buff *ooo_last_skb; struct rb_root out_of_order_queue; + struct sk_buff_head receive_queue; struct list_head conn_list; struct list_head rtx_queue; struct mptcp_data_frag *first_pending; @@ -269,6 +271,9 @@ void __lock_sock(struct sock *sk); local_bh_enable(); \ } while (0) +#define mptcp_data_lock(sk) spin_lock_bh(&(sk)->sk_lock.slock) +#define mptcp_data_unlock(sk) spin_unlock_bh(&(sk)->sk_lock.slock) + #define mptcp_for_each_subflow(__msk, __subflow) \ list_for_each_entry(__subflow, &((__msk)->conn_list), node) From patchwork Thu Nov 19 12:38:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 1402999 X-Patchwork-Delegate: matthieu.baerts@tessares.net Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.01.org (client-ip=2001:19d0:306:5::1; helo=ml01.01.org; envelope-from=mptcp-bounces@lists.01.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=DzmQs4nL; dkim-atps=neutral Received: from ml01.01.org (ml01.01.org [IPv6:2001:19d0:306:5::1]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature 
From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:24 +0100
Message-Id: <8678c1babccb885a8c6da75c04655a1eff48f2c6.1605789317.git.pabeni@redhat.com>
Subject: [MPTCP] [PATCH v5 4/6] mptcp: allocate TX skbs in msk context

Move the TX skb allocation into mptcp_sendmsg() scope, and tentatively
allocate the skbs based on the goal size of the last used subflow.
Use the ssk tx skb cache to avoid the subflow allocation.

This allows removing the msk skb extension cache and makes the later
patch possible.
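
For context, a rough userspace sketch of the caching scheme described
above - hypothetical tx_cache/buf types and plain malloc-backed buffers
instead of sk_buffs, so not the kernel implementation: fill a
per-connection cache of 'goal'-sized buffers up front (mirroring
mptcp_tx_cache_refill()), let the transmit path consume from it
(mirroring __mptcp_alloc_tx_skb() picking from msk->skb_tx_cache), and
on allocation failure keep a single buffer so the caller can still make
forward progress:

#include <stdbool.h>
#include <stdlib.h>

struct buf {
	struct buf *next;
	size_t size;
};

struct tx_cache {
	struct buf *head;
	int len;
};

/* pre-allocate enough buffers to cover 'needed' payload bytes,
 * 'goal' (> 0) bytes each; on allocation failure trim the cache to at
 * most one buffer and report whether any buffer is left
 */
static bool tx_cache_refill(struct tx_cache *c, size_t needed, size_t goal)
{
	while ((size_t)c->len * goal < needed) {
		struct buf *b = malloc(sizeof(*b) + goal);

		if (!b) {
			while (c->len > 1) {
				struct buf *old = c->head;

				c->head = old->next;
				c->len--;
				free(old);
			}
			return c->len > 0;
		}
		b->size = goal;
		b->next = c->head;
		c->head = b;
		c->len++;
	}
	return true;
}

/* the transmit path takes a pre-allocated buffer instead of allocating */
static struct buf *tx_cache_get(struct tx_cache *c)
{
	struct buf *b = c->head;

	if (b) {
		c->head = b->next;
		c->len--;
	}
	return b;
}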
Signed-off-by: Paolo Abeni --- v2 -> v3: - update __mptcp_wmem_schedule() to take in account skb truesize, too - mptcp_reclaim_partial takes in account wmem_reserved status --- net/mptcp/protocol.c | 248 ++++++++++++++++++++++++++++++++++++------- net/mptcp/protocol.h | 4 +- 2 files changed, 210 insertions(+), 42 deletions(-) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 3074e0720110..cf90426ab725 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -804,16 +804,6 @@ static void mptcp_check_for_eof(struct mptcp_sock *msk) mptcp_close_wake_up(sk); } -static bool mptcp_ext_cache_refill(struct mptcp_sock *msk) -{ - const struct sock *sk = (const struct sock *)msk; - - if (!msk->cached_ext) - msk->cached_ext = __skb_ext_alloc(sk->sk_allocation); - - return !!msk->cached_ext; -} - static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk) { struct mptcp_subflow_context *subflow; @@ -852,14 +842,22 @@ static bool mptcp_frag_can_collapse_to(const struct mptcp_sock *msk, df->data_seq + df->data_len == msk->write_seq; } -static int mptcp_wmem_with_overhead(int size) +static int mptcp_wmem_with_overhead(struct sock *sk, int size) { - return size + ((sizeof(struct mptcp_data_frag) * size) >> PAGE_SHIFT); + struct mptcp_sock *msk = mptcp_sk(sk); + int ret, skbs; + + ret = size + ((sizeof(struct mptcp_data_frag) * size) >> PAGE_SHIFT); + skbs = (msk->tx_pending_data + size) / msk->size_goal_cache; + if (skbs < msk->skb_tx_cache.qlen) + return ret; + + return ret + (skbs - msk->skb_tx_cache.qlen) * SKB_TRUESIZE(MAX_TCP_HEADER); } static void __mptcp_wmem_reserve(struct sock *sk, int size) { - int amount = mptcp_wmem_with_overhead(size); + int amount = mptcp_wmem_with_overhead(sk, size); struct mptcp_sock *msk = mptcp_sk(sk); WARN_ON_ONCE(msk->wmem_reserved); @@ -940,6 +938,25 @@ static void mptcp_wmem_uncharge(struct sock *sk, int size) msk->wmem_reserved += size; } +static void mptcp_mem_reclaim_partial(struct sock *sk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + + /* if we are experiencing a transint allocation error, + * the forward allocation memory has been already + * released + */ + if (msk->wmem_reserved < 0) + return; + + mptcp_data_lock(sk); + sk->sk_forward_alloc += msk->wmem_reserved; + sk_mem_reclaim_partial(sk); + msk->wmem_reserved = sk->sk_forward_alloc; + sk->sk_forward_alloc = 0; + mptcp_data_unlock(sk); +} + static void dfrag_uncharge(struct sock *sk, int len) { sk_mem_uncharge(sk, len); @@ -1016,19 +1033,12 @@ static void mptcp_clean_una_wakeup(struct sock *sk) } } -/* ensure we get enough memory for the frag hdr, beyond some minimal amount of - * data - */ -static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag) +static void mptcp_enter_memory_pressure(struct sock *sk) { struct mptcp_subflow_context *subflow; struct mptcp_sock *msk = mptcp_sk(sk); bool first = true; - if (likely(skb_page_frag_refill(32U + sizeof(struct mptcp_data_frag), - pfrag, sk->sk_allocation))) - return true; - sk_stream_moderate_sndbuf(sk); mptcp_for_each_subflow(msk, subflow) { struct sock *ssk = mptcp_subflow_tcp_sock(subflow); @@ -1038,6 +1048,18 @@ static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag) sk_stream_moderate_sndbuf(ssk); first = false; } +} + +/* ensure we get enough memory for the frag hdr, beyond some minimal amount of + * data + */ +static bool mptcp_page_frag_refill(struct sock *sk, struct page_frag *pfrag) +{ + if (likely(skb_page_frag_refill(32U + sizeof(struct mptcp_data_frag), + pfrag, 
sk->sk_allocation))) + return true; + + mptcp_enter_memory_pressure(sk); return false; } @@ -1084,6 +1106,128 @@ static int mptcp_check_allowed_size(struct mptcp_sock *msk, u64 data_seq, return avail_size; } +static bool __mptcp_add_ext(struct sk_buff *skb, gfp_t gfp) +{ + struct skb_ext *mpext = __skb_ext_alloc(gfp); + + if (!mpext) + return false; + __skb_ext_set(skb, SKB_EXT_MPTCP, mpext); + return true; +} + +static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk) +{ + struct sk_buff *skb; + + skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation); + if (likely(skb)) { + if (likely(__mptcp_add_ext(skb, sk->sk_allocation))) { + skb_reserve(skb, MAX_TCP_HEADER); + skb->reserved_tailroom = skb->end - skb->tail; + return skb; + } + __kfree_skb(skb); + } else { + mptcp_enter_memory_pressure(sk); + } + return NULL; +} + +static bool mptcp_tx_cache_refill(struct sock *sk, int size, + struct sk_buff_head *skbs, int *total_ts) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + struct sk_buff *skb; + int space_needed; + + if (unlikely(tcp_under_memory_pressure(sk))) { + mptcp_mem_reclaim_partial(sk); + + /* under pressure pre-allocate at most a single skb */ + if (msk->skb_tx_cache.qlen) + return true; + space_needed = msk->size_goal_cache; + } else { + space_needed = msk->tx_pending_data + size - + msk->skb_tx_cache.qlen * msk->size_goal_cache; + } + + while (space_needed > 0) { + skb = __mptcp_do_alloc_tx_skb(sk); + if (unlikely(!skb)) { + /* under memory pressure, try to pass the caller a + * single skb to allow forward progress + */ + while (skbs->qlen > 1) { + skb = __skb_dequeue_tail(skbs); + __kfree_skb(skb); + } + return skbs->qlen > 0; + } + + *total_ts += skb->truesize; + __skb_queue_tail(skbs, skb); + space_needed -= msk->size_goal_cache; + } + return true; +} + +static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + struct sk_buff *skb; + + if (ssk->sk_tx_skb_cache) { + skb = ssk->sk_tx_skb_cache; + if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) && + !__mptcp_add_ext(skb, sk->sk_allocation))) + return false; + return true; + } + + skb = skb_peek(&msk->skb_tx_cache); + if (skb) { + if (likely(sk_wmem_schedule(ssk, skb->truesize))) { + skb = __skb_dequeue(&msk->skb_tx_cache); + if (WARN_ON_ONCE(!skb)) + return false; + + mptcp_wmem_uncharge(sk, skb->truesize); + ssk->sk_tx_skb_cache = skb; + return true; + } + + /* over memory limit, no point to try to allocate a new skb */ + return false; + } + + skb = __mptcp_do_alloc_tx_skb(sk); + if (!skb) + return false; + + if (likely(sk_wmem_schedule(ssk, skb->truesize))) { + ssk->sk_tx_skb_cache = skb; + return true; + } + kfree_skb(skb); + return false; +} + +static bool mptcp_must_reclaim_memory(struct sock *sk, struct sock *ssk) +{ + return !ssk->sk_tx_skb_cache && + !skb_peek(&mptcp_sk(sk)->skb_tx_cache) && + tcp_under_memory_pressure(sk); +} + +static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) +{ + if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) + mptcp_mem_reclaim_partial(sk); + return __mptcp_alloc_tx_skb(sk, ssk); +} + static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, struct mptcp_data_frag *dfrag, struct mptcp_sendmsg_info *info) @@ -1095,7 +1239,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, struct sk_buff *skb, *tail; bool can_collapse = false; int avail_size; - size_t ret; + size_t ret = 0; pr_debug("msk=%p ssk=%p sending dfrag at seq=%lld len=%d already sent=%d", msk, ssk, dfrag->data_seq, dfrag->data_len, 
info->sent); @@ -1103,6 +1247,7 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, /* compute send limit */ info->mss_now = tcp_send_mss(ssk, &info->size_goal, info->flags); avail_size = info->size_goal; + msk->size_goal_cache = info->size_goal; skb = tcp_write_queue_tail(ssk); if (skb) { /* Limit the write to the size available in the @@ -1151,8 +1296,11 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, goto out; } - mpext = __skb_ext_set(tail, SKB_EXT_MPTCP, msk->cached_ext); - msk->cached_ext = NULL; + mpext = skb_ext_find(tail, SKB_EXT_MPTCP); + if (WARN_ON_ONCE(!mpext)) { + /* should never reach here, stream corrupted */ + return -EINVAL; + } memset(mpext, 0, sizeof(*mpext)); mpext->data_seq = data_seq; @@ -1225,9 +1373,6 @@ static struct sock *mptcp_subflow_get_send(struct mptcp_sock *msk, sock_owned_by_me((struct sock *)msk); *sndbuf = 0; - if (!mptcp_ext_cache_refill(msk)) - return NULL; - if (__mptcp_check_fallback(msk)) { if (!msk->first) return NULL; @@ -1336,6 +1481,15 @@ static void mptcp_push_pending(struct sock *sk, unsigned int flags) if (ssk != prev_ssk || !prev_ssk) lock_sock(ssk); + /* keep it simple and always provide a new skb for the + * subflow, even if we will not use it when collapsing + * on the pending one + */ + if (!mptcp_alloc_tx_skb(sk, ssk)) { + mptcp_push_release(sk, ssk, &info); + goto out; + } + ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info); if (ret <= 0) { mptcp_push_release(sk, ssk, &info); @@ -1346,6 +1500,7 @@ static void mptcp_push_pending(struct sock *sk, unsigned int flags) dfrag->already_sent += ret; msk->snd_nxt += ret; msk->snd_burst -= ret; + msk->tx_pending_data -= ret; copied += ret; len -= ret; } @@ -1389,8 +1544,9 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) mptcp_clean_una(sk); while (msg_data_left(msg)) { + int total_ts, frag_truesize = 0; struct mptcp_data_frag *dfrag; - int frag_truesize = 0; + struct sk_buff_head skbs; bool dfrag_collapsed; size_t psize, offset; @@ -1424,9 +1580,17 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) offset = dfrag->offset + dfrag->data_len; psize = pfrag->size - offset; psize = min_t(size_t, psize, msg_data_left(msg)); - if (!mptcp_wmem_alloc(sk, psize + frag_truesize)) + total_ts = psize + frag_truesize; + __skb_queue_head_init(&skbs); + if (!mptcp_tx_cache_refill(sk, psize, &skbs, &total_ts)) goto wait_for_memory; + if (!mptcp_wmem_alloc(sk, total_ts)) { + __skb_queue_purge(&skbs); + goto wait_for_memory; + } + + skb_queue_splice_tail(&skbs, &msk->skb_tx_cache); if (copy_page_from_iter(dfrag->page, offset, psize, &msg->msg_iter) != psize) { mptcp_wmem_uncharge(sk, psize + frag_truesize); @@ -1455,8 +1619,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) dfrag->data_seq, dfrag->data_len, dfrag->already_sent, !dfrag_collapsed); - if (!mptcp_ext_cache_refill(msk)) - goto wait_for_memory; continue; wait_for_memory: @@ -1468,8 +1630,10 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) goto out; } - if (copied) + if (copied) { + msk->tx_pending_data += copied; mptcp_push_pending(sk, msg->msg_flags); + } out: release_sock(sk); @@ -2052,9 +2216,6 @@ static void mptcp_worker(struct work_struct *work) if (!dfrag) goto unlock; - if (!mptcp_ext_cache_refill(msk)) - goto reset_unlock; - ssk = mptcp_subflow_get_retrans(msk); if (!ssk) goto reset_unlock; @@ -2065,6 +2226,9 @@ static void mptcp_worker(struct work_struct *work) info.sent = 0; info.limit = 
dfrag->already_sent; while (info.sent < dfrag->already_sent) { + if (!mptcp_alloc_tx_skb(sk, ssk)) + break; + ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info); if (ret <= 0) break; @@ -2072,9 +2236,6 @@ static void mptcp_worker(struct work_struct *work) MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_RETRANSSEGS); copied += ret; info.sent += ret; - - if (!mptcp_ext_cache_refill(msk)) - break; } if (copied) tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle, @@ -2103,10 +2264,13 @@ static int __mptcp_init_sock(struct sock *sk) INIT_LIST_HEAD(&msk->rtx_queue); INIT_WORK(&msk->work, mptcp_worker); __skb_queue_head_init(&msk->receive_queue); + __skb_queue_head_init(&msk->skb_tx_cache); msk->out_of_order_queue = RB_ROOT; msk->first_pending = NULL; msk->wmem_reserved = 0; msk->rmem_released = 0; + msk->tx_pending_data = 0; + msk->size_goal_cache = TCP_BASE_MSS; msk->ack_hint = NULL; msk->first = NULL; @@ -2154,12 +2318,17 @@ static void __mptcp_clear_xmit(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); struct mptcp_data_frag *dtmp, *dfrag; + struct sk_buff *skb; sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer); WRITE_ONCE(msk->first_pending, NULL); list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) dfrag_clear(sk, dfrag); + while ((skb = __skb_dequeue(&msk->skb_tx_cache)) != NULL) { + sk->sk_forward_alloc += skb->truesize; + kfree_skb(skb); + } } static void mptcp_cancel_work(struct sock *sk) @@ -2538,9 +2707,6 @@ static void mptcp_destroy(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); - if (msk->cached_ext) - __skb_ext_put(msk->cached_ext); - mptcp_destroy_common(msk); sk_sockets_allocated_dec(sk); } diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index f49662ffb607..5997d1ca5323 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -240,11 +240,13 @@ struct mptcp_sock { struct sk_buff *ooo_last_skb; struct rb_root out_of_order_queue; struct sk_buff_head receive_queue; + struct sk_buff_head skb_tx_cache; /* this is wmem accounted */ + int tx_pending_data; + int size_goal_cache; struct list_head conn_list; struct list_head rtx_queue; struct mptcp_data_frag *first_pending; struct list_head join_list; - struct skb_ext *cached_ext; /* for the next sendmsg */ struct socket *subflow; /* outgoing connect/listener/!mp_capable */ struct sock *first; struct mptcp_pm_data pm; From patchwork Thu Nov 19 12:38:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 1403000 X-Patchwork-Delegate: matthieu.baerts@tessares.net Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.01.org (client-ip=198.145.21.10; helo=ml01.01.org; envelope-from=mptcp-bounces@lists.01.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=none dis=none) header.from=redhat.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (1024-bit key; unprotected) header.d=redhat.com header.i=@redhat.com header.a=rsa-sha256 header.s=mimecast20190719 header.b=hZ9o9ma9; dkim-atps=neutral Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits)) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 4CcK440qqDz9sTv for ; Thu, 19 Nov 2020 23:38:59 +1100 (AEDT) Received: from ml01.vlan13.01.org 
From: Paolo Abeni <pabeni@redhat.com>
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:25 +0100
Subject: [MPTCP] [PATCH v5 5/6] mptcp: avoid a few atomic ops in the rx path

By extending the data_lock scope in mptcp_incoming_options() we can use
that lock to protect both snd_una and wnd_end.
In the typical case we will have a single atomic op instead of 2 Signed-off-by: Paolo Abeni --- net/mptcp/mptcp_diag.c | 2 +- net/mptcp/options.c | 33 +++++++++++++-------------------- net/mptcp/protocol.c | 34 ++++++++++++++++------------------ net/mptcp/protocol.h | 8 ++++---- 4 files changed, 34 insertions(+), 43 deletions(-) diff --git a/net/mptcp/mptcp_diag.c b/net/mptcp/mptcp_diag.c index 5f390a97f556..b70ae4ba3000 100644 --- a/net/mptcp/mptcp_diag.c +++ b/net/mptcp/mptcp_diag.c @@ -140,7 +140,7 @@ static void mptcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r, info->mptcpi_flags = flags; info->mptcpi_token = READ_ONCE(msk->token); info->mptcpi_write_seq = READ_ONCE(msk->write_seq); - info->mptcpi_snd_una = atomic64_read(&msk->snd_una); + info->mptcpi_snd_una = READ_ONCE(msk->snd_una); info->mptcpi_rcv_nxt = READ_ONCE(msk->ack_seq); unlock_sock_fast(sk, slow); } diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 8a59b3e44599..3986454a0340 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -833,15 +833,17 @@ static void ack_update_msk(struct mptcp_sock *msk, const struct sock *ssk, struct mptcp_options_received *mp_opt) { - u64 new_snd_una, snd_una, old_snd_una = atomic64_read(&msk->snd_una); - u64 new_wnd_end, wnd_end, old_wnd_end = atomic64_read(&msk->wnd_end); - u64 snd_nxt = READ_ONCE(msk->snd_nxt); + u64 new_wnd_end, new_snd_una, snd_nxt = READ_ONCE(msk->snd_nxt); struct sock *sk = (struct sock *)msk; + u64 old_snd_una; + + mptcp_data_lock(sk); /* avoid ack expansion on update conflict, to reduce the risk of * wrongly expanding to a future ack sequence number, which is way * more dangerous than missing an ack */ + old_snd_una = msk->snd_una; new_snd_una = expand_ack(old_snd_una, mp_opt->data_ack, mp_opt->ack64); /* ACK for data not even sent yet? Ignore. 
*/ @@ -850,26 +852,17 @@ static void ack_update_msk(struct mptcp_sock *msk, new_wnd_end = new_snd_una + tcp_sk(ssk)->snd_wnd; - while (after64(new_wnd_end, old_wnd_end)) { - wnd_end = old_wnd_end; - old_wnd_end = atomic64_cmpxchg(&msk->wnd_end, wnd_end, - new_wnd_end); - if (old_wnd_end == wnd_end) { - if (mptcp_send_head(sk)) - mptcp_schedule_work(sk); - break; - } + if (after64(new_wnd_end, msk->wnd_end)) { + msk->wnd_end = new_wnd_end; + if (mptcp_send_head(sk)) + mptcp_schedule_work(sk); } - while (after64(new_snd_una, old_snd_una)) { - snd_una = old_snd_una; - old_snd_una = atomic64_cmpxchg(&msk->snd_una, snd_una, - new_snd_una); - if (old_snd_una == snd_una) { - mptcp_data_acked(sk); - break; - } + if (after64(new_snd_una, old_snd_una)) { + msk->snd_una = new_snd_una; + __mptcp_data_acked(sk); } + mptcp_data_unlock(sk); } bool mptcp_update_rcv_data_fin(struct mptcp_sock *msk, u64 data_fin_seq, bool use_64bit) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index cf90426ab725..313d488aae15 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -72,7 +72,7 @@ static struct socket *__mptcp_nmpc_socket(const struct mptcp_sock *msk) /* Returns end sequence number of the receiver's advertised window */ static u64 mptcp_wnd_end(const struct mptcp_sock *msk) { - return atomic64_read(&msk->wnd_end); + return READ_ONCE(msk->wnd_end); } static bool mptcp_is_tcpsk(struct sock *sk) @@ -370,7 +370,7 @@ static void mptcp_check_data_fin_ack(struct sock *sk) /* Look for an acknowledged DATA_FIN */ if (((1 << sk->sk_state) & (TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK)) && - msk->write_seq == atomic64_read(&msk->snd_una)) { + msk->write_seq == READ_ONCE(msk->snd_una)) { mptcp_stop_timer(sk); WRITE_ONCE(msk->snd_data_fin_enable, 0); @@ -750,7 +750,7 @@ bool mptcp_schedule_work(struct sock *sk) return false; } -void mptcp_data_acked(struct sock *sk) +void __mptcp_data_acked(struct sock *sk) { mptcp_reset_timer(sk); @@ -983,11 +983,11 @@ static void mptcp_clean_una(struct sock *sk) * plain TCP */ if (__mptcp_check_fallback(msk)) - atomic64_set(&msk->snd_una, msk->snd_nxt); + msk->snd_una = READ_ONCE(msk->snd_nxt); - mptcp_data_lock(sk); - snd_una = atomic64_read(&msk->snd_una); + mptcp_data_lock(sk); + snd_una = msk->snd_una; list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) { if (after64(dfrag->data_seq + dfrag->data_len, snd_una)) break; @@ -1268,10 +1268,12 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, /* Zero window and all data acked? Probe. 
*/ avail_size = mptcp_check_allowed_size(msk, data_seq, avail_size); if (avail_size == 0) { - if (skb || atomic64_read(&msk->snd_una) != msk->snd_nxt) + u64 snd_una = READ_ONCE(msk->snd_una); + + if (skb || snd_una != msk->snd_nxt) return 0; zero_window_probe = true; - data_seq = atomic64_read(&msk->snd_una) - 1; + data_seq = snd_una - 1; avail_size = 1; } @@ -1975,12 +1977,8 @@ static void mptcp_retransmit_handler(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); - if (atomic64_read(&msk->snd_una) == READ_ONCE(msk->snd_nxt)) { - mptcp_stop_timer(sk); - } else { - set_bit(MPTCP_WORK_RTX, &msk->flags); - mptcp_schedule_work(sk); - } + set_bit(MPTCP_WORK_RTX, &msk->flags); + mptcp_schedule_work(sk); } static void mptcp_retransmit_timer(struct timer_list *t) @@ -2605,8 +2603,8 @@ struct sock *mptcp_sk_clone(const struct sock *sk, msk->write_seq = subflow_req->idsn + 1; msk->snd_nxt = msk->write_seq; - atomic64_set(&msk->snd_una, msk->write_seq); - atomic64_set(&msk->wnd_end, msk->snd_nxt + req->rsk_rcv_wnd); + msk->snd_una = msk->write_seq; + msk->wnd_end = msk->snd_nxt + req->rsk_rcv_wnd; if (mp_opt->mp_capable) { msk->can_ack = true; @@ -2644,7 +2642,7 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk) if (msk->rcvq_space.space == 0) msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT; - atomic64_set(&msk->wnd_end, msk->snd_nxt + tcp_sk(ssk)->snd_wnd); + WRITE_ONCE(msk->wnd_end, msk->snd_nxt + tcp_sk(ssk)->snd_wnd); } static struct sock *mptcp_accept(struct sock *sk, int flags, int *err, @@ -2902,7 +2900,7 @@ void mptcp_finish_connect(struct sock *ssk) WRITE_ONCE(msk->ack_seq, ack_seq); WRITE_ONCE(msk->rcv_wnd_sent, ack_seq); WRITE_ONCE(msk->can_ack, 1); - atomic64_set(&msk->snd_una, msk->write_seq); + WRITE_ONCE(msk->snd_una, msk->write_seq); mptcp_pm_new_connection(msk, 0); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 5997d1ca5323..76a6a82d2585 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -222,8 +222,8 @@ struct mptcp_sock { struct sock *last_snd; int snd_burst; int old_wspace; - atomic64_t snd_una; - atomic64_t wnd_end; + u64 snd_una; + u64 wnd_end; unsigned long timer_ival; u32 token; int rmem_pending; @@ -323,7 +323,7 @@ static inline struct mptcp_data_frag *mptcp_rtx_tail(const struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); - if (!before64(msk->snd_nxt, atomic64_read(&msk->snd_una))) + if (!before64(msk->snd_nxt, READ_ONCE(msk->snd_una))) return NULL; return list_last_entry(&msk->rtx_queue, struct mptcp_data_frag, list); @@ -497,7 +497,7 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk); void mptcp_data_ready(struct sock *sk, struct sock *ssk); bool mptcp_finish_join(struct sock *sk); bool mptcp_schedule_work(struct sock *sk); -void mptcp_data_acked(struct sock *sk); +void __mptcp_data_acked(struct sock *sk); void mptcp_subflow_eof(struct sock *sk); bool mptcp_update_rcv_data_fin(struct mptcp_sock *msk, u64 data_fin_seq, bool use_64bit); void __mptcp_flush_join_list(struct mptcp_sock *msk); From patchwork Thu Nov 19 12:38:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Paolo Abeni X-Patchwork-Id: 1403001 X-Patchwork-Delegate: matthieu.baerts@tessares.net Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.01.org (client-ip=198.145.21.10; helo=ml01.01.org; 
From: Paolo Abeni
To: mptcp@lists.01.org
Date: Thu, 19 Nov 2020 13:38:26 +0100
Message-Id: <6633b88e37b051c8f7022ff79be22275423ac2fe.1605789317.git.pabeni@redhat.com>
Subject: [MPTCP] [PATCH v5 6/6] mptcp: use mptcp release_cb for delayed tasks.
List-Id: Discussions regarding MPTCP upstreaming

We have some tasks triggered by the subflow receive path which require
access to the msk socket status, specifically: mptcp_clean_una() and
mptcp_push_pending().

We have almost everything in place to defer such tasks to the msk
release_cb when the msk sock is owned.

Since the worker is no longer used to clean the acked data, for fallback
sockets we need to flush the acked data explicitly.

As an added bonus we can move the wake-up code into __mptcp_clean_una(),
greatly simplify mptcp_poll() and move the timer update under the data
lock.

The worker is now used only to process and send DATA_FIN packets and to
do the MPTCP-level retransmissions.

Signed-off-by: Paolo Abeni
---
v4 -> v5:
 - squashed with "mptcp: simplify mptcp_nospace" to avoid flipping several chunks twice
 - added comments for the release_cb locking schema
 - added the missing memory barrier in the wake-up code
 - moved the wake-up code into mptcp_clean_una()
 - subflow_write_space is now empty
 - don't set SOCK_NOSPACE on the subflow anymore - unneeded
 - call mptcp_update_wmem() before sk_mem_reclaim_partial()

v3 -> v4:
 - use release_cb for push_pending, too

v2 -> v3:
 - move clean una to release_cb(), so that we don't need an additional spin lock contention

v1 -> v2:
 - add tx_pending_data accounting in __mptcp_subflow_push_pending()
 - in mptcp_data_acked explicitly schedule the worker if an ack is pending - checking the sk status alone is both too liberal and not enough
 - add the missing call to __mptcp_check_send_data_fin() in mptcp_worker(); this is needed to cope with __mptcp_subflow_push_pending() deferring the ack send to the worker
 - renamed backlog to bl_entry
 - sk_mem_reclaim_partial() -> __mptcp_mem_reclaim_partial() in mptcp_clean_una(). We are under both the data_lock and the socket lock, so we can call the more effective mptcp variant
 - do not try to push pending frames on wnd update if there is no pending data (use the new __mptcp_wnd_updated helper for that)
 - move the write queue cleanup into destroy_common(), so that it takes place after the subflows stop touching msk->sk_forward_alloc and we don't need to acquire the msk socket spin lock to reclaim the xmit memory
 - explicitly init info->flags in __mptcp_subflow_push_pending(), or horrible corruption will happen later
 - explicitly flush pending data for fallback sockets on ack reception. This likely fixes an older issue, which was hidden by the worker spooling the pending data, just more slowly.
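For readers less familiar with the scheme above, the following is a minimal,
self-contained user-space sketch of the release_cb deferral pattern: the
receive path does the work immediately only when the socket is not owned,
otherwise it just records a flag, and the lock owner flushes the pending
work when releasing the socket. All names below (fake_sock, DEFER_CLEAN_UNA,
DEFER_PUSH_PENDING, ...) are illustrative stand-ins, not kernel APIs.

/* Toy model of the release_cb deferral pattern; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

enum {
	DEFER_CLEAN_UNA    = 1 << 0,	/* stand-in for MPTCP_CLEAN_UNA */
	DEFER_PUSH_PENDING = 1 << 1,	/* stand-in for MPTCP_PUSH_PENDING */
};

struct fake_sock {
	bool owned_by_user;		/* stand-in for sock_owned_by_user() */
	unsigned long flags;		/* stand-in for msk->flags */
};

static void clean_una(struct fake_sock *sk)    { (void)sk; puts("clean acked data"); }
static void push_pending(struct fake_sock *sk) { (void)sk; puts("push pending data"); }

/* Receive-path hook: do the work now if possible, else mark it pending. */
static void data_acked(struct fake_sock *sk)
{
	if (!sk->owned_by_user)
		clean_una(sk);
	else
		sk->flags |= DEFER_CLEAN_UNA;
}

/* Lock-owner hook, run when the socket lock is released. */
static void release_cb(struct fake_sock *sk)
{
	if (sk->flags & DEFER_CLEAN_UNA) {
		sk->flags &= ~DEFER_CLEAN_UNA;
		clean_una(sk);
	}
	if (sk->flags & DEFER_PUSH_PENDING) {
		sk->flags &= ~DEFER_PUSH_PENDING;
		push_pending(sk);
	}
}

int main(void)
{
	struct fake_sock sk = { .owned_by_user = true, .flags = 0 };

	data_acked(&sk);	/* socket owned: only records the pending work */
	sk.owned_by_user = false;
	release_cb(&sk);	/* the deferred cleanup actually runs here */
	return 0;
}

In the actual patch the flag updates use atomic set_bit()/test_and_clear_bit()
on msk->flags, and mptcp_release_cb() drops the socket spin lock before
calling mptcp_push_pending(), to avoid an ABBA deadlock with the subflow
socket lock.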
--- net/mptcp/options.c | 18 ++- net/mptcp/protocol.c | 258 ++++++++++++++++++++++++++----------------- net/mptcp/protocol.h | 3 + net/mptcp/subflow.c | 14 +-- 4 files changed, 173 insertions(+), 120 deletions(-) diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 3986454a0340..6b7b4b67f18c 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -830,7 +830,7 @@ static u64 expand_ack(u64 old_ack, u64 cur_ack, bool use_64bit) } static void ack_update_msk(struct mptcp_sock *msk, - const struct sock *ssk, + struct sock *ssk, struct mptcp_options_received *mp_opt) { u64 new_wnd_end, new_snd_una, snd_nxt = READ_ONCE(msk->snd_nxt); @@ -854,8 +854,7 @@ static void ack_update_msk(struct mptcp_sock *msk, if (after64(new_wnd_end, msk->wnd_end)) { msk->wnd_end = new_wnd_end; - if (mptcp_send_head(sk)) - mptcp_schedule_work(sk); + __mptcp_wnd_updated(sk, ssk); } if (after64(new_snd_una, old_snd_una)) { @@ -915,8 +914,19 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb) struct mptcp_options_received mp_opt; struct mptcp_ext *mpext; - if (__mptcp_check_fallback(msk)) + if (__mptcp_check_fallback(msk)) { + /* Keep it simple and unconditionally trigger send data cleanup and + * pending queue spooling. We will need to acquire the data lock + * for more accurate checks, and once the lock is acquired, such + * helpers are cheap. + */ + mptcp_data_lock(subflow->conn); + if (mptcp_send_head(subflow->conn)) + __mptcp_wnd_updated(subflow->conn, sk); + __mptcp_data_acked(subflow->conn); + mptcp_data_unlock(subflow->conn); return; + } mptcp_get_options(skb, &mp_opt); if (!check_fully_established(msk, sk, subflow, skb, &mp_opt)) diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 313d488aae15..083d5916e07c 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -360,17 +360,22 @@ static void mptcp_close_wake_up(struct sock *sk) sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN); } -static void mptcp_check_data_fin_ack(struct sock *sk) +static bool mptcp_pending_data_fin_ack(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); - if (__mptcp_check_fallback(msk)) - return; + return !__mptcp_check_fallback(msk) && + ((1 << sk->sk_state) & + (TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK)) && + msk->write_seq == READ_ONCE(msk->snd_una); +} + +static void mptcp_check_data_fin_ack(struct sock *sk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); /* Look for an acknowledged DATA_FIN */ - if (((1 << sk->sk_state) & - (TCPF_FIN_WAIT1 | TCPF_CLOSING | TCPF_LAST_ACK)) && - msk->write_seq == READ_ONCE(msk->snd_una)) { + if (mptcp_pending_data_fin_ack(sk)) { mptcp_stop_timer(sk); WRITE_ONCE(msk->snd_data_fin_enable, 0); @@ -750,16 +755,6 @@ bool mptcp_schedule_work(struct sock *sk) return false; } -void __mptcp_data_acked(struct sock *sk) -{ - mptcp_reset_timer(sk); - - if ((test_bit(MPTCP_NOSPACE, &mptcp_sk(sk)->flags) || - mptcp_send_head(sk) || - (inet_sk_state_load(sk) != TCP_ESTABLISHED))) - mptcp_schedule_work(sk); -} - void mptcp_subflow_eof(struct sock *sk) { if (!test_and_set_bit(MPTCP_WORK_EOF, &mptcp_sk(sk)->flags)) @@ -972,7 +967,7 @@ static void dfrag_clear(struct sock *sk, struct mptcp_data_frag *dfrag) put_page(dfrag->page); } -static void mptcp_clean_una(struct sock *sk) +static void __mptcp_clean_una(struct sock *sk) { struct mptcp_sock *msk = mptcp_sk(sk); struct mptcp_data_frag *dtmp, *dfrag; @@ -985,8 +980,6 @@ static void mptcp_clean_una(struct sock *sk) if (__mptcp_check_fallback(msk)) msk->snd_una = READ_ONCE(msk->snd_nxt); - - mptcp_data_lock(sk); snd_una = 
msk->snd_una; list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) { if (after64(dfrag->data_seq + dfrag->data_len, snd_una)) @@ -1015,21 +1008,25 @@ static void mptcp_clean_una(struct sock *sk) } out: - if (cleaned && tcp_under_memory_pressure(sk)) - sk_mem_reclaim_partial(sk); - mptcp_data_unlock(sk); -} - -static void mptcp_clean_una_wakeup(struct sock *sk) -{ - struct mptcp_sock *msk = mptcp_sk(sk); + if (cleaned) { + if (tcp_under_memory_pressure(sk)) { + __mptcp_update_wmem(sk); + sk_mem_reclaim_partial(sk); + } - mptcp_clean_una(sk); + if (sk_stream_is_writeable(sk)) { + /* pairs with memory barrier in mptcp_poll */ + smp_mb(); + if (test_and_clear_bit(MPTCP_NOSPACE, &msk->flags)) + sk_stream_write_space(sk); + } + } - /* Only wake up writers if a subflow is ready */ - if (sk_stream_is_writeable(sk)) { - clear_bit(MPTCP_NOSPACE, &msk->flags); - sk_stream_write_space(sk); + if (snd_una == READ_ONCE(msk->snd_nxt)) { + if (msk->timer_ival) + mptcp_stop_timer(sk); + } else { + mptcp_reset_timer(sk); } } @@ -1116,13 +1113,13 @@ static bool __mptcp_add_ext(struct sk_buff *skb, gfp_t gfp) return true; } -static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk) +static struct sk_buff *__mptcp_do_alloc_tx_skb(struct sock *sk, gfp_t gfp) { struct sk_buff *skb; - skb = alloc_skb_fclone(MAX_TCP_HEADER, sk->sk_allocation); + skb = alloc_skb_fclone(MAX_TCP_HEADER, gfp); if (likely(skb)) { - if (likely(__mptcp_add_ext(skb, sk->sk_allocation))) { + if (likely(__mptcp_add_ext(skb, gfp))) { skb_reserve(skb, MAX_TCP_HEADER); skb->reserved_tailroom = skb->end - skb->tail; return skb; @@ -1154,7 +1151,7 @@ static bool mptcp_tx_cache_refill(struct sock *sk, int size, } while (space_needed > 0) { - skb = __mptcp_do_alloc_tx_skb(sk); + skb = __mptcp_do_alloc_tx_skb(sk, sk->sk_allocation); if (unlikely(!skb)) { /* under memory pressure, try to pass the caller a * single skb to allow forward progress @@ -1173,7 +1170,7 @@ static bool mptcp_tx_cache_refill(struct sock *sk, int size, return true; } -static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) +static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk, gfp_t gfp) { struct mptcp_sock *msk = mptcp_sk(sk); struct sk_buff *skb; @@ -1181,7 +1178,7 @@ static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) if (ssk->sk_tx_skb_cache) { skb = ssk->sk_tx_skb_cache; if (unlikely(!skb_ext_find(skb, SKB_EXT_MPTCP) && - !__mptcp_add_ext(skb, sk->sk_allocation))) + !__mptcp_add_ext(skb, gfp))) return false; return true; } @@ -1202,7 +1199,7 @@ static bool __mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) return false; } - skb = __mptcp_do_alloc_tx_skb(sk); + skb = __mptcp_do_alloc_tx_skb(sk, gfp); if (!skb) return false; @@ -1225,7 +1222,7 @@ static bool mptcp_alloc_tx_skb(struct sock *sk, struct sock *ssk) { if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) mptcp_mem_reclaim_partial(sk); - return __mptcp_alloc_tx_skb(sk, ssk); + return __mptcp_alloc_tx_skb(sk, ssk, sk->sk_allocation); } static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, @@ -1326,31 +1323,6 @@ static int mptcp_sendmsg_frag(struct sock *sk, struct sock *ssk, return ret; } -static void mptcp_nospace(struct mptcp_sock *msk) -{ - struct mptcp_subflow_context *subflow; - - set_bit(MPTCP_NOSPACE, &msk->flags); - smp_mb__after_atomic(); /* msk->flags is changed by write_space cb */ - - mptcp_for_each_subflow(msk, subflow) { - struct sock *ssk = mptcp_subflow_tcp_sock(subflow); - bool ssk_writeable = sk_stream_is_writeable(ssk); - struct 
socket *sock = READ_ONCE(ssk->sk_socket); - - if (ssk_writeable || !sock) - continue; - - /* enables ssk->write_space() callbacks */ - set_bit(SOCK_NOSPACE, &sock->flags); - } - - /* mptcp_data_acked() could run just before we set the NOSPACE bit, - * so explicitly check for snd_una value - */ - mptcp_clean_una((struct sock *)msk); -} - #define MPTCP_SEND_BURST_SIZE ((1 << 16) - \ sizeof(struct tcphdr) - \ MAX_TCP_OPTION_SPACE - \ @@ -1515,10 +1487,69 @@ static void mptcp_push_pending(struct sock *sk, unsigned int flags) out: /* start the timer, if it's not pending */ - if (!mptcp_timer_pending(sk)) - mptcp_reset_timer(sk); - if (copied) + if (copied) { + if (!mptcp_timer_pending(sk)) + mptcp_reset_timer(sk); __mptcp_check_send_data_fin(sk); + } +} + +static void __mptcp_subflow_push_pending(struct sock *sk, struct sock *ssk) +{ + struct mptcp_sock *msk = mptcp_sk(sk); + struct mptcp_sendmsg_info info; + struct mptcp_data_frag *dfrag; + int len, copied = 0; + u32 sndbuf; + + info.flags = 0; + while ((dfrag = mptcp_send_head(sk))) { + info.sent = dfrag->already_sent; + info.limit = dfrag->data_len; + len = dfrag->data_len - dfrag->already_sent; + while (len > 0) { + int ret = 0; + + /* do auto tuning */ + if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK) && + sndbuf > READ_ONCE(sk->sk_sndbuf)) + WRITE_ONCE(sk->sk_sndbuf, sndbuf); + + if (unlikely(mptcp_must_reclaim_memory(sk, ssk))) { + __mptcp_update_wmem(sk); + sk_mem_reclaim_partial(sk); + } + if (!__mptcp_alloc_tx_skb(sk, ssk, GFP_ATOMIC)) + goto out; + + ret = mptcp_sendmsg_frag(sk, ssk, dfrag, &info); + if (ret <= 0) + goto out; + + info.sent += ret; + dfrag->already_sent += ret; + msk->snd_nxt += ret; + msk->snd_burst -= ret; + msk->tx_pending_data -= ret; + copied += ret; + len -= ret; + } + WRITE_ONCE(msk->first_pending, mptcp_send_next(sk)); + } + +out: + /* __mptcp_alloc_tx_skb could have released some wmem and we are + * not going to flush it via release_sock() + */ + __mptcp_update_wmem(sk); + if (copied) { + mptcp_set_timeout(sk, ssk); + tcp_push(ssk, 0, info.mss_now, tcp_sk(ssk)->nonagle, + info.size_goal); + if (msk->snd_data_fin_enable && + msk->snd_nxt + 1 == msk->write_seq) + mptcp_schedule_work(sk); + } } static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) @@ -1543,7 +1574,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) } pfrag = sk_page_frag(sk); - mptcp_clean_una(sk); while (msg_data_left(msg)) { int total_ts, frag_truesize = 0; @@ -1563,11 +1593,9 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) dfrag = mptcp_pending_tail(sk); dfrag_collapsed = mptcp_frag_can_collapse_to(msk, pfrag, dfrag); if (!dfrag_collapsed) { - if (!sk_stream_memory_free(sk)) { - mptcp_push_pending(sk, msg->msg_flags); - if (!sk_stream_memory_free(sk)) - goto wait_for_memory; - } + if (!sk_stream_memory_free(sk)) + goto wait_for_memory; + if (!mptcp_page_frag_refill(sk, pfrag)) goto wait_for_memory; @@ -1624,9 +1652,8 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) continue; wait_for_memory: - mptcp_nospace(msk); - if (mptcp_timer_pending(sk)) - mptcp_reset_timer(sk); + set_bit(MPTCP_NOSPACE, &msk->flags); + mptcp_push_pending(sk, msg->msg_flags); ret = sk_stream_wait_memory(sk, &timeo); if (ret) goto out; @@ -2178,21 +2205,18 @@ static void mptcp_worker(struct work_struct *work) if (unlikely(state == TCP_CLOSE)) goto unlock; - mptcp_clean_una_wakeup(sk); mptcp_check_data_fin_ack(sk); __mptcp_flush_join_list(msk); if 
(test_and_clear_bit(MPTCP_WORK_CLOSE_SUBFLOW, &msk->flags)) __mptcp_close_subflow(msk); - if (mptcp_send_head(sk)) - mptcp_push_pending(sk, 0); - if (msk->pm.status) pm_work(msk); if (test_and_clear_bit(MPTCP_WORK_EOF, &msk->flags)) mptcp_check_for_eof(msk); + __mptcp_check_send_data_fin(sk); mptcp_check_data_fin(sk); /* if the msk data is completely acked, or the socket timedout, @@ -2318,8 +2342,6 @@ static void __mptcp_clear_xmit(struct sock *sk) struct mptcp_data_frag *dtmp, *dfrag; struct sk_buff *skb; - sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer); - WRITE_ONCE(msk->first_pending, NULL); list_for_each_entry_safe(dfrag, dtmp, &msk->rtx_queue, list) dfrag_clear(sk, dfrag); @@ -2461,7 +2483,7 @@ static void __mptcp_destroy_sock(struct sock *sk) spin_unlock_bh(&msk->join_list_lock); list_splice_init(&msk->conn_list, &conn_list); - __mptcp_clear_xmit(sk); + sk_stop_timer(sk, &msk->sk.icsk_retransmit_timer); sk_stop_timer(sk, &sk->sk_timer); msk->pm.status = 0; @@ -2695,6 +2717,8 @@ void mptcp_destroy_common(struct mptcp_sock *msk) { struct sock *sk = (struct sock *)msk; + __mptcp_clear_xmit(sk); + skb_queue_splice_tail_init(&sk->sk_receive_queue, &msk->receive_queue); skb_rbtree_purge(&msk->out_of_order_queue); mptcp_token_destroy(msk); @@ -2819,6 +2843,28 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname, return -EOPNOTSUPP; } +void __mptcp_data_acked(struct sock *sk) +{ + if (!sock_owned_by_user(sk)) + __mptcp_clean_una(sk); + else + set_bit(MPTCP_CLEAN_UNA, &mptcp_sk(sk)->flags); + + if (mptcp_pending_data_fin_ack(sk)) + mptcp_schedule_work(sk); +} + +void __mptcp_wnd_updated(struct sock *sk, struct sock *ssk) +{ + if (!mptcp_send_head(sk)) + return; + + if (!sock_owned_by_user(sk)) + __mptcp_subflow_push_pending(sk, ssk); + else + set_bit(MPTCP_PUSH_PENDING, &mptcp_sk(sk)->flags); +} + #define MPTCP_DEFERRED_ALL (TCPF_WRITE_TIMER_DEFERRED) /* processes deferred events and flush wmem */ @@ -2826,6 +2872,25 @@ static void mptcp_release_cb(struct sock *sk) { unsigned long flags, nflags; + /* push_pending may touch wmem_reserved, do it before the later + * cleanup + */ + if (test_and_clear_bit(MPTCP_CLEAN_UNA, &mptcp_sk(sk)->flags)) + __mptcp_clean_una(sk); + if (test_and_clear_bit(MPTCP_PUSH_PENDING, &mptcp_sk(sk)->flags)) { + /* mptcp_push_pending() acquires the subflow socket lock + * + * 1) can't be invoked in atomic scope + * 2) must avoid ABBA deadlock with msk socket spinlock: the RX + * datapath acquires the msk socket spinlock while helding + * the subflow socket lock + */ + + spin_unlock_bh(&sk->sk_lock.slock); + mptcp_push_pending(sk, 0); + spin_lock_bh(&sk->sk_lock.slock); + } + /* clear any wmem reservation and errors */ __mptcp_update_wmem(sk); __mptcp_update_rmem(sk); @@ -3161,24 +3226,9 @@ static __poll_t mptcp_check_readable(struct mptcp_sock *msk) 0; } -static bool __mptcp_check_writeable(struct mptcp_sock *msk) -{ - struct sock *sk = (struct sock *)msk; - bool mptcp_writable; - - mptcp_clean_una(sk); - mptcp_writable = sk_stream_is_writeable(sk); - if (!mptcp_writable) - mptcp_nospace(msk); - - return mptcp_writable; -} - static __poll_t mptcp_check_writeable(struct mptcp_sock *msk) { struct sock *sk = (struct sock *)msk; - __poll_t ret = 0; - bool slow; if (unlikely(sk->sk_shutdown & SEND_SHUTDOWN)) return 0; @@ -3186,12 +3236,12 @@ static __poll_t mptcp_check_writeable(struct mptcp_sock *msk) if (sk_stream_is_writeable(sk)) return EPOLLOUT | EPOLLWRNORM; - slow = lock_sock_fast(sk); - if (__mptcp_check_writeable(msk)) - ret = EPOLLOUT | 
EPOLLWRNORM; + set_bit(MPTCP_NOSPACE, &msk->flags); + smp_mb__after_atomic(); /* msk->flags is changed by write_space cb */ + if (sk_stream_is_writeable(sk)) + return EPOLLOUT | EPOLLWRNORM; - unlock_sock_fast(sk, slow); - return ret; + return 0; } static __poll_t mptcp_poll(struct file *file, struct socket *sock, diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 76a6a82d2585..9dc5ff163e1e 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -91,6 +91,8 @@ #define MPTCP_WORK_EOF 3 #define MPTCP_FALLBACK_DONE 4 #define MPTCP_WORK_CLOSE_SUBFLOW 5 +#define MPTCP_PUSH_PENDING 6 +#define MPTCP_CLEAN_UNA 7 static inline bool before64(__u64 seq1, __u64 seq2) { @@ -497,6 +499,7 @@ void mptcp_rcv_space_init(struct mptcp_sock *msk, const struct sock *ssk); void mptcp_data_ready(struct sock *sk, struct sock *ssk); bool mptcp_finish_join(struct sock *sk); bool mptcp_schedule_work(struct sock *sk); +void __mptcp_wnd_updated(struct sock *sk, struct sock *ssk); void __mptcp_data_acked(struct sock *sk); void mptcp_subflow_eof(struct sock *sk); bool mptcp_update_rcv_data_fin(struct mptcp_sock *msk, u64 data_fin_seq, bool use_64bit); diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 02d757e0ddb7..b6cd42649e5d 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -996,19 +996,9 @@ static void subflow_data_ready(struct sock *sk) mptcp_data_ready(parent, sk); } -static void subflow_write_space(struct sock *sk) +static void subflow_write_space(struct sock *ssk) { - struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); - struct socket *sock = READ_ONCE(sk->sk_socket); - struct sock *parent = subflow->conn; - - if (!sk_stream_is_writeable(sk)) - return; - - if (sock && sk_stream_is_writeable(parent)) - clear_bit(SOCK_NOSPACE, &sock->flags); - - sk_stream_write_space(parent); + /* we take action in __mptcp_clean_una() */ } static struct inet_connection_sock_af_ops *