From patchwork Wed Feb 26 09:14:46 2020
X-Patchwork-Id: 1244827
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Florian Westphal
Subject: [PATCH net-next 1/7] mptcp: add and use mptcp_data_ready helper
Date: Wed, 26 Feb 2020 10:14:46 +0100
Message-Id: <20200226091452.1116-2-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

allows us to schedule the work queue to drain the ssk receive queue in
a followup patch. This is needed to avoid sending all-too-pessimistic
mptcp-level acknowledgements. At this time, the ack_seq is what was
last read by userspace instead of the highest in-sequence number
queued for reading.
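For illustration only (this mirrors the hunk below; it is not extra code
in this patch): the helper centralizes the DATA_READY flag and the
wakeup, giving later patches in this series a single place to hook the
work queue into:

void mptcp_data_ready(struct sock *sk)
{
	struct mptcp_sock *msk = mptcp_sk(sk);

	set_bit(MPTCP_DATA_READY, &msk->flags);
	/* a later patch in this series schedules msk->work here,
	 * so a worker can drain the ssk receive queue
	 */
	sk->sk_data_ready(sk);
}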
Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c |  8 ++++++++
 net/mptcp/protocol.h |  1 +
 net/mptcp/subflow.c  | 14 ++++----------
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index e9aa6807b5be..1d55563e9aca 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -111,6 +111,14 @@ static struct sock *mptcp_subflow_get(const struct mptcp_sock *msk)
 	return NULL;
 }

+void mptcp_data_ready(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	set_bit(MPTCP_DATA_READY, &msk->flags);
+	sk->sk_data_ready(sk);
+}
+
 static bool mptcp_ext_cache_refill(struct mptcp_sock *msk)
 {
 	if (!msk->cached_ext)
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 9f8663b30456..67895a7c1e5b 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -201,6 +201,7 @@ void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);

 void mptcp_finish_connect(struct sock *sk);
+void mptcp_data_ready(struct sock *sk);

 int mptcp_token_new_request(struct request_sock *req);
 void mptcp_token_destroy_request(u32 token);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 65122edf60aa..3dad662840aa 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -554,11 +554,8 @@ static void subflow_data_ready(struct sock *sk)
 		return;
 	}

-	if (mptcp_subflow_data_available(sk)) {
-		set_bit(MPTCP_DATA_READY, &mptcp_sk(parent)->flags);
-
-		parent->sk_data_ready(parent);
-	}
+	if (mptcp_subflow_data_available(sk))
+		mptcp_data_ready(parent);
 }

 static void subflow_write_space(struct sock *sk)
@@ -690,11 +687,8 @@ static void subflow_state_change(struct sock *sk)
 	 * a fin packet carrying a DSS can be unnoticed if we don't trigger
 	 * the data available machinery here.
 	 */
-	if (parent && subflow->mp_capable && mptcp_subflow_data_available(sk)) {
-		set_bit(MPTCP_DATA_READY, &mptcp_sk(parent)->flags);
-
-		parent->sk_data_ready(parent);
-	}
+	if (parent && subflow->mp_capable && mptcp_subflow_data_available(sk))
+		mptcp_data_ready(parent);

 	if (parent && !(parent->sk_shutdown & RCV_SHUTDOWN) &&
 	    !subflow->rx_eof && subflow_is_done(sk)) {

From patchwork Wed Feb 26 09:14:47 2020
X-Patchwork-Id: 1244828
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Paolo Abeni, Florian Westphal
Subject: [PATCH net-next 2/7] mptcp: add work queue skeleton
Date: Wed, 26 Feb 2020 10:14:47 +0100
Message-Id: <20200226091452.1116-3-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

From: Paolo Abeni

Will be extended with functionality in followup patches. Initial user
is moving skbs from the subflow's receive queue to the mptcp-level
receive queue.
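A condensed sketch of the refcounting rule the skeleton establishes
(names as in the diff below; note that schedule_work() on msk->work is
only added by a later patch in this series): whoever queues the work
item holds a socket reference, and either the worker or a successful
cancel drops it:

	/* queueing side: schedule_work() returns false if the item is
	 * already pending, so at most one reference is taken
	 */
	if (schedule_work(&msk->work))
		sock_hold((struct sock *)msk);

	/* worker side: mptcp_worker() drops the reference on exit */
	sock_put(sk);

	/* teardown side: a successful cancel means the worker will
	 * never run, so its reference must be dropped here instead
	 */
	if (cancel_work_sync(&msk->work))
		sock_put(sk);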
Signed-off-by: Paolo Abeni
Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 22 ++++++++++++++++++++++
 net/mptcp/protocol.h |  1 +
 2 files changed, 23 insertions(+)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 1d55563e9aca..cbf184a71ed7 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -551,12 +551,24 @@ static void __mptcp_close_ssk(struct sock *sk, struct sock *ssk,
 	}
 }

+static void mptcp_worker(struct work_struct *work)
+{
+	struct mptcp_sock *msk = container_of(work, struct mptcp_sock, work);
+	struct sock *sk = &msk->sk.icsk_inet.sk;
+
+	lock_sock(sk);
+
+	release_sock(sk);
+	sock_put(sk);
+}
+
 static int __mptcp_init_sock(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);

 	INIT_LIST_HEAD(&msk->conn_list);
 	__set_bit(MPTCP_SEND_SPACE, &msk->flags);
+	INIT_WORK(&msk->work, mptcp_worker);

 	msk->first = NULL;

@@ -571,6 +583,14 @@ static int mptcp_init_sock(struct sock *sk)
 	return __mptcp_init_sock(sk);
 }

+static void mptcp_cancel_work(struct sock *sk)
+{
+	struct mptcp_sock *msk = mptcp_sk(sk);
+
+	if (cancel_work_sync(&msk->work))
+		sock_put(sk);
+}
+
 static void mptcp_subflow_shutdown(struct sock *ssk, int how)
 {
 	lock_sock(ssk);
@@ -616,6 +636,8 @@ static void mptcp_close(struct sock *sk, long timeout)
 		__mptcp_close_ssk(sk, ssk, subflow, timeout);
 	}

+	mptcp_cancel_work(sk);
+
 	sk_common_release(sk);
 }

diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 67895a7c1e5b..6e6e162d25f1 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -70,6 +70,7 @@ struct mptcp_sock {
 	u32		token;
 	unsigned long	flags;
 	bool		can_ack;
+	struct work_struct work;
 	struct list_head conn_list;
 	struct skb_ext	*cached_ext;	/* for the next sendmsg */
 	struct socket	*subflow; /* outgoing connect/listener/!mp_capable */

From patchwork Wed Feb 26 09:14:48 2020
X-Patchwork-Id: 1244829
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Florian Westphal
Subject: [PATCH net-next 3/7] mptcp: update mptcp ack sequence from work queue
Date: Wed, 26 Feb 2020 10:14:48 +0100
Message-Id: <20200226091452.1116-4-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

If userspace is not reading data, all the mptcp-level acks contain the
ack_seq from the last time userspace read data rather than the most
recent in-sequence value. This causes pointless retransmissions for
data that is already queued.

The reason for this is that all the mptcp protocol-level processing
happens at mptcp_recv time.

This adds a work queue to move skbs from the subflow sockets' receive
queues to the mptcp socket's receive queue (which was not used so far).

This allows us to announce the correct mptcp ack sequence in a timely
fashion, even when the application does not call recv() on the mptcp
socket for some time.

We still wake userspace tasks waiting for POLLIN immediately: if the
mptcp-level receive queue is empty (because the work queue is still
pending) it can be filled from in-sequence subflow sockets at recv
time without a need to wait for the worker.

The skb_orphan() when moving skbs from the subflow to the mptcp level
is needed because the destructor (sock_rfree) relies on the skb->sk
(the ssk!) lock being taken.

A followup patch will add the needed rmem accounting for the moved
skbs.

Another problem: if the application behaves as expected and calls
recv() as soon as the mptcp socket becomes readable, the work queue
will only waste cpu cycles. This will also be addressed in followup
patches.

Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 234 ++++++++++++++++++++++++++++++-------
 1 file changed, 165 insertions(+), 69 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index cbf184a71ed7..b4a8517d8eac 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -31,6 +31,12 @@ struct mptcp6_sock {
 };
 #endif

+struct mptcp_skb_cb {
+	u32 offset;
+};
+
+#define MPTCP_SKB_CB(__skb)	((struct mptcp_skb_cb *)&((__skb)->cb[0]))
+
 /* If msk has an initial subflow socket, and the MP_CAPABLE handshake has not
  * completed yet or has failed, return the subflow socket.
  * Otherwise return NULL.
@@ -111,11 +117,88 @@ static struct sock *mptcp_subflow_get(const struct mptcp_sock *msk)
 	return NULL;
 }

+static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
+			     struct sk_buff *skb,
+			     unsigned int offset, size_t copy_len)
+{
+	struct sock *sk = (struct sock *)msk;
+
+	__skb_unlink(skb, &ssk->sk_receive_queue);
+	skb_orphan(skb);
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+
+	msk->ack_seq += copy_len;
+	MPTCP_SKB_CB(skb)->offset = offset;
+}
+
+static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
+					   struct sock *ssk,
+					   unsigned int *bytes)
+{
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	unsigned int moved = 0;
+	bool more_data_avail;
+	struct tcp_sock *tp;
+	bool done = false;
+
+	tp = tcp_sk(ssk);
+	do {
+		u32 map_remaining, offset;
+		u32 seq = tp->copied_seq;
+		struct sk_buff *skb;
+		bool fin;
+
+		/* try to move as much data as available */
+		map_remaining = subflow->map_data_len -
+				mptcp_subflow_get_map_offset(subflow);
+
+		skb = skb_peek(&ssk->sk_receive_queue);
+		if (!skb)
+			break;
+
+		offset = seq - TCP_SKB_CB(skb)->seq;
+		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
+		if (fin) {
+			done = true;
+			seq++;
+		}
+
+		if (offset < skb->len) {
+			size_t len = skb->len - offset;
+
+			if (tp->urg_data)
+				done = true;
+
+			__mptcp_move_skb(msk, ssk, skb, offset, len);
+			seq += len;
+			moved += len;
+
+			if (WARN_ON_ONCE(map_remaining < len))
+				break;
+		} else {
+			WARN_ON_ONCE(!fin);
+			sk_eat_skb(ssk, skb);
+			done = true;
+		}
+
+		WRITE_ONCE(tp->copied_seq, seq);
+		more_data_avail = mptcp_subflow_data_available(ssk);
+	} while (more_data_avail);
+
+	*bytes = moved;
+
+	return done;
+}
+
 void mptcp_data_ready(struct sock *sk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);

 	set_bit(MPTCP_DATA_READY, &msk->flags);
+
+	if (schedule_work(&msk->work))
+		sock_hold((struct sock *)msk);
+
 	sk->sk_data_ready(sk);
 }

@@ -373,19 +456,68 @@ static void mptcp_wait_data(struct sock *sk, long *timeo)
 	remove_wait_queue(sk_sleep(sk), &wait);
 }

+static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk,
+				struct msghdr *msg,
+				size_t len)
+{
+	struct sock *sk = (struct sock *)msk;
+	struct sk_buff *skb;
+	int copied = 0;
+
+	while ((skb = skb_peek(&sk->sk_receive_queue)) != NULL) {
+		u32 offset = MPTCP_SKB_CB(skb)->offset;
+		u32 data_len = skb->len - offset;
+		u32 count = min_t(size_t, len - copied, data_len);
+		int err;
+
+		err = skb_copy_datagram_msg(skb, offset, msg, count);
+		if (unlikely(err < 0)) {
+			if (!copied)
+				return err;
+			break;
+		}
+
+		copied += count;
+
+		if (count < data_len) {
+			MPTCP_SKB_CB(skb)->offset += count;
+			break;
+		}
+
+		__skb_unlink(skb, &sk->sk_receive_queue);
+		__kfree_skb(skb);
+
+		if (copied >= len)
+			break;
+	}
+
+	return copied;
+}
+
+static bool __mptcp_move_skbs(struct mptcp_sock *msk)
+{
+	unsigned int moved = 0;
+	bool done;
+
+	do {
+		struct sock *ssk = mptcp_subflow_recv_lookup(msk);
+
+		if (!ssk)
+			break;
+
+		lock_sock(ssk);
+		done = __mptcp_move_skbs_from_subflow(msk, ssk, &moved);
+		release_sock(ssk);
+	} while (!done);
+
+	return moved > 0;
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
-	struct mptcp_subflow_context *subflow;
-	bool more_data_avail = false;
-	struct mptcp_read_arg arg;
-	read_descriptor_t desc;
-	bool wait_data = false;
 	struct socket *ssock;
-	struct tcp_sock *tp;
-	bool done = false;
-	struct sock *ssk;
 	int copied = 0;
 	int target;
 	long timeo;
@@ -403,65 +535,26 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		return copied;
 	}

-	arg.msg = msg;
-	desc.arg.data = &arg;
-	desc.error = 0;
-
 	timeo = sock_rcvtimeo(sk, nonblock);

 	len = min_t(size_t, len, INT_MAX);
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);

-	while (!done) {
-		u32 map_remaining;
+	while (len > (size_t)copied) {
 		int bytes_read;

-		ssk = mptcp_subflow_recv_lookup(msk);
-		pr_debug("msk=%p ssk=%p", msk, ssk);
-		if (!ssk)
-			goto wait_for_data;
+		bytes_read = __mptcp_recvmsg_mskq(msk, msg, len - copied);
+		if (unlikely(bytes_read < 0)) {
+			if (!copied)
+				copied = bytes_read;
+			goto out_err;
+		}

-		subflow = mptcp_subflow_ctx(ssk);
-		tp = tcp_sk(ssk);
+		copied += bytes_read;

-		lock_sock(ssk);
-		do {
-			/* try to read as much data as available */
-			map_remaining = subflow->map_data_len -
-					mptcp_subflow_get_map_offset(subflow);
-			desc.count = min_t(size_t, len - copied, map_remaining);
-			pr_debug("reading %zu bytes, copied %d", desc.count,
-				 copied);
-			bytes_read = tcp_read_sock(ssk, &desc,
-						   mptcp_read_actor);
-			if (bytes_read < 0) {
-				if (!copied)
-					copied = bytes_read;
-				done = true;
-				goto next;
-			}
-
-			pr_debug("msk ack_seq=%llx -> %llx", msk->ack_seq,
-				 msk->ack_seq + bytes_read);
-			msk->ack_seq += bytes_read;
-			copied += bytes_read;
-			if (copied >= len) {
-				done = true;
-				goto next;
-			}
-			if (tp->urg_data && tp->urg_seq == tp->copied_seq) {
-				pr_err("Urgent data present, cannot proceed");
-				done = true;
-				goto next;
-			}
-next:
-			more_data_avail = mptcp_subflow_data_available(ssk);
-		} while (more_data_avail && !done);
-		release_sock(ssk);
-		continue;
-
-wait_for_data:
-		more_data_avail = false;
+		if (skb_queue_empty(&sk->sk_receive_queue) &&
+		    __mptcp_move_skbs(msk))
+			continue;

 		/* only the master socket status is relevant here. The exit
 		 * conditions mirror closely tcp_recvmsg()
@@ -502,26 +595,25 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		}

 		pr_debug("block timeout %ld", timeo);
-		wait_data = true;
 		mptcp_wait_data(sk, &timeo);
 		if (unlikely(__mptcp_tcp_fallback(msk)))
 			goto fallback;
 	}

-	if (more_data_avail) {
-		if (!test_bit(MPTCP_DATA_READY, &msk->flags))
-			set_bit(MPTCP_DATA_READY, &msk->flags);
-	} else if (!wait_data) {
+	if (skb_queue_empty(&sk->sk_receive_queue)) {
+		/* entire backlog drained, clear DATA_READY. */
 		clear_bit(MPTCP_DATA_READY, &msk->flags);

-		/* .. race-breaker: ssk might get new data after last
-		 * data_available() returns false.
+		/* .. race-breaker: ssk might have gotten new data
+		 * after last __mptcp_move_skbs() returned false.
 		 */
-		ssk = mptcp_subflow_recv_lookup(msk);
-		if (unlikely(ssk))
+		if (unlikely(__mptcp_move_skbs(msk)))
 			set_bit(MPTCP_DATA_READY, &msk->flags);
+	} else if (unlikely(!test_bit(MPTCP_DATA_READY, &msk->flags))) {
+		/* data to read but mptcp_wait_data() cleared DATA_READY */
+		set_bit(MPTCP_DATA_READY, &msk->flags);
 	}
-
+out_err:
 	release_sock(sk);
 	return copied;
 }
@@ -557,7 +649,7 @@ static void mptcp_worker(struct work_struct *work)
 	struct sock *sk = &msk->sk.icsk_inet.sk;

 	lock_sock(sk);
-
+	__mptcp_move_skbs(msk);
 	release_sock(sk);
 	sock_put(sk);
 }
@@ -638,6 +730,8 @@ static void mptcp_close(struct sock *sk, long timeout)

 	mptcp_cancel_work(sk);

+	__skb_queue_purge(&sk->sk_receive_queue);
+
 	sk_common_release(sk);
 }

@@ -1204,6 +1298,8 @@ void mptcp_proto_init(void)
 		panic("Failed to register MPTCP proto.\n");

 	inet_register_protosw(&mptcp_protosw);
+
+	BUILD_BUG_ON(sizeof(struct mptcp_skb_cb) > sizeof_field(struct sk_buff, cb));
 }

 #if IS_ENABLED(CONFIG_MPTCP_IPV6)

From patchwork Wed Feb 26 09:14:49 2020
X-Patchwork-Id: 1244831
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Florian Westphal
Subject: [PATCH net-next 4/7] mptcp: add rmem queue accounting
Date: Wed, 26 Feb 2020 10:14:49 +0100
Message-Id: <20200226091452.1116-5-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

If userspace never drains the receive buffers, we must stop draining
the subflow socket(s) at some point. This adds the needed rmem
accounting for this.

If the threshold is reached, we stop draining the subflows.
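The limit test itself is simple (a sketch; the helper name is
illustrative, the diff below open-codes the same expression). It only
becomes meaningful because skb_set_owner_r() charges each moved skb to
the mptcp socket:

static bool mptcp_rmem_over_limit(const struct sock *sk)
{
	/* sk is the mptcp socket; moved skbs are charged to it */
	return atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf);
}

Once this returns true, the drain loop stops and mptcp_data_ready()
skips scheduling the worker; recvmsg() freeing skbs brings
sk_rmem_alloc back under sk_rcvbuf.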
Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b4a8517d8eac..381d5647a95b 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -124,7 +124,7 @@ static void __mptcp_move_skb(struct mptcp_sock *msk, struct sock *ssk,
 	struct sock *sk = (struct sock *)msk;

 	__skb_unlink(skb, &ssk->sk_receive_queue);
-	skb_orphan(skb);
+	skb_set_owner_r(skb, sk);
 	__skb_queue_tail(&sk->sk_receive_queue, skb);

 	msk->ack_seq += copy_len;
@@ -136,10 +136,16 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 					   unsigned int *bytes)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
+	struct sock *sk = (struct sock *)msk;
 	unsigned int moved = 0;
 	bool more_data_avail;
 	struct tcp_sock *tp;
 	bool done = false;
+	int rcvbuf;
+
+	rcvbuf = max(ssk->sk_rcvbuf, sk->sk_rcvbuf);
+	if (rcvbuf > sk->sk_rcvbuf)
+		sk->sk_rcvbuf = rcvbuf;

 	tp = tcp_sk(ssk);
 	do {
@@ -183,6 +189,11 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,

 		WRITE_ONCE(tp->copied_seq, seq);
 		more_data_avail = mptcp_subflow_data_available(ssk);
+
+		if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf)) {
+			done = true;
+			break;
+		}
 	} while (more_data_avail);

 	*bytes = moved;
@@ -196,9 +207,14 @@ void mptcp_data_ready(struct sock *sk)

 	set_bit(MPTCP_DATA_READY, &msk->flags);

+	/* don't schedule if mptcp sk is (still) over limit */
+	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
+		goto wake;
+
 	if (schedule_work(&msk->work))
 		sock_hold((struct sock *)msk);

+wake:
 	sk->sk_data_ready(sk);
 }

From patchwork Wed Feb 26 09:14:50 2020
X-Patchwork-Id: 1244830
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Florian Westphal
Subject: [PATCH net-next 5/7] mptcp: remove mptcp_read_actor
Date: Wed, 26 Feb 2020 10:14:50 +0100
Message-Id: <20200226091452.1116-6-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

Only used to discard stale data from the subflow, so move it where
needed.
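The discard idiom that remains (a sketch; bytes_to_discard is
illustrative, subflow_read_actor() is the actor added in the diff
below): passing tcp_read_sock() an actor that only decrements
desc->count consumes the bytes without copying them anywhere:

	read_descriptor_t desc = {
		.count = bytes_to_discard,
	};

	/* each actor invocation "reads" up to desc->count bytes;
	 * since nothing is copied, the data is simply dropped
	 */
	tcp_read_sock(ssk, &desc, subflow_read_actor);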
Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 27 ---------------------------
 net/mptcp/protocol.h |  7 -------
 net/mptcp/subflow.c  | 18 +++++++++++++-----
 3 files changed, 13 insertions(+), 39 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 381d5647a95b..b781498e69b4 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -430,33 +430,6 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	return ret;
 }

-int mptcp_read_actor(read_descriptor_t *desc, struct sk_buff *skb,
-		     unsigned int offset, size_t len)
-{
-	struct mptcp_read_arg *arg = desc->arg.data;
-	size_t copy_len;
-
-	copy_len = min(desc->count, len);
-
-	if (likely(arg->msg)) {
-		int err;
-
-		err = skb_copy_datagram_msg(skb, offset, arg->msg, copy_len);
-		if (err) {
-			pr_debug("error path");
-			desc->error = err;
-			return err;
-		}
-	} else {
-		pr_debug("Flushing skb payload");
-	}
-
-	desc->count -= copy_len;
-
-	pr_debug("consumed %zu bytes, %zu left", copy_len, desc->count);
-	return copy_len;
-}
-
 static void mptcp_wait_data(struct sock *sk, long *timeo)
 {
 	DEFINE_WAIT_FUNC(wait, woken_wake_function);
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 6e6e162d25f1..d06170c5f191 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -191,13 +191,6 @@ void mptcp_proto_init(void);
 int mptcp_proto_v6_init(void);
 #endif

-struct mptcp_read_arg {
-	struct msghdr *msg;
-};
-
-int mptcp_read_actor(read_descriptor_t *desc, struct sk_buff *skb,
-		     unsigned int offset, size_t len);
-
 void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 3dad662840aa..37a4767db441 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -408,6 +408,18 @@ static enum mapping_status get_mapping_status(struct sock *ssk)
 	return MAPPING_OK;
 }

+static int subflow_read_actor(read_descriptor_t *desc,
+			      struct sk_buff *skb,
+			      unsigned int offset, size_t len)
+{
+	size_t copy_len = min(desc->count, len);
+
+	desc->count -= copy_len;
+
+	pr_debug("flushed %zu bytes, %zu left", copy_len, desc->count);
+	return copy_len;
+}
+
 static bool subflow_check_data_avail(struct sock *ssk)
 {
 	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(ssk);
@@ -482,16 +494,12 @@ static bool subflow_check_data_avail(struct sock *ssk)
 		pr_debug("discarding %zu bytes, current map len=%d", delta,
 			 map_remaining);
 		if (delta) {
-			struct mptcp_read_arg arg = {
-				.msg = NULL,
-			};
 			read_descriptor_t desc = {
 				.count = delta,
-				.arg.data = &arg,
 			};
 			int ret;

-			ret = tcp_read_sock(ssk, &desc, mptcp_read_actor);
+			ret = tcp_read_sock(ssk, &desc, subflow_read_actor);
 			if (ret < 0) {
 				ssk->sk_err = -ret;
 				goto fatal;

From patchwork Wed Feb 26 09:14:51 2020
X-Patchwork-Id: 1244832
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Florian Westphal
Subject: [PATCH net-next 6/7] mptcp: avoid work queue scheduling if possible
Date: Wed, 26 Feb 2020 10:14:51 +0100
Message-Id: <20200226091452.1116-7-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

We can't lock_sock() the mptcp socket from the subflow data_ready
callback; it would result in an ABBA deadlock with the subflow socket
lock.

We can however grab the spinlock: if that succeeds and the mptcp
socket is not owned at the moment, we can process the new skbs right
away without deferring this to the work queue.

This avoids the schedule_work() and hence the small delay until the
work item is processed.

Signed-off-by: Florian Westphal
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 29 ++++++++++++++++++++++++++++-
 net/mptcp/protocol.h |  2 +-
 net/mptcp/subflow.c  |  4 ++--
 3 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index b781498e69b4..70f20c8eddbd 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -201,12 +201,39 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 	return done;
 }

-void mptcp_data_ready(struct sock *sk)
+/* In most cases we will be able to lock the mptcp socket. If its already
+ * owned, we need to defer to the work queue to avoid ABBA deadlock.
+ */
+static bool move_skbs_to_msk(struct mptcp_sock *msk, struct sock *ssk)
+{
+	struct sock *sk = (struct sock *)msk;
+	unsigned int moved = 0;
+
+	if (READ_ONCE(sk->sk_lock.owned))
+		return false;
+
+	if (unlikely(!spin_trylock_bh(&sk->sk_lock.slock)))
+		return false;
+
+	/* must re-check after taking the lock */
+	if (!READ_ONCE(sk->sk_lock.owned))
+		__mptcp_move_skbs_from_subflow(msk, ssk, &moved);
+
+	spin_unlock_bh(&sk->sk_lock.slock);
+
+	return moved > 0;
+}
+
+void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);

 	set_bit(MPTCP_DATA_READY, &msk->flags);

+	if (atomic_read(&sk->sk_rmem_alloc) < READ_ONCE(sk->sk_rcvbuf) &&
+	    move_skbs_to_msk(msk, ssk))
+		goto wake;
+
 	/* don't schedule if mptcp sk is (still) over limit */
 	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
 		goto wake;
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index d06170c5f191..6c0b2c8ab674 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -195,7 +195,7 @@ void mptcp_get_options(const struct sk_buff *skb,
 		       struct tcp_options_received *opt_rx);

 void mptcp_finish_connect(struct sock *sk);
-void mptcp_data_ready(struct sock *sk);
+void mptcp_data_ready(struct sock *sk, struct sock *ssk);

 int mptcp_token_new_request(struct request_sock *req);
 void mptcp_token_destroy_request(u32 token);
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 37a4767db441..0de2a44bdaa0 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -563,7 +563,7 @@ static void subflow_data_ready(struct sock *sk)
 	}

 	if (mptcp_subflow_data_available(sk))
-		mptcp_data_ready(parent);
+		mptcp_data_ready(parent, sk);
 }

 static void subflow_write_space(struct sock *sk)
@@ -696,7 +696,7 @@ static void subflow_state_change(struct sock *sk)
 	 * the data available machinery here.
 	 */
 	if (parent && subflow->mp_capable && mptcp_subflow_data_available(sk))
-		mptcp_data_ready(parent);
+		mptcp_data_ready(parent, sk);

 	if (parent && !(parent->sk_shutdown & RCV_SHUTDOWN) &&
 	    !subflow->rx_eof && subflow_is_done(sk)) {

From patchwork Wed Feb 26 09:14:52 2020
X-Patchwork-Id: 1244833
From: Florian Westphal
To: netdev@vger.kernel.org
Cc: Paolo Abeni, Florian Westphal
Subject: [PATCH net-next 7/7] mptcp: defer work schedule until mptcp lock is released
Date: Wed, 26 Feb 2020 10:14:52 +0100
Message-Id: <20200226091452.1116-8-fw@strlen.de>
In-Reply-To: <20200226091452.1116-1-fw@strlen.de>
References: <20200226091452.1116-1-fw@strlen.de>

From: Paolo Abeni

Don't schedule the work queue right away; instead, defer this to the
lock release callback.

This has the advantage that it will give the recv path a chance to
complete -- this might have moved all pending packets from the subflow
to the mptcp receive queue, which allows us to avoid the
schedule_work().

Co-developed-by: Florian Westphal
Signed-off-by: Florian Westphal
Signed-off-by: Paolo Abeni
Reviewed-by: Mat Martineau
---
 net/mptcp/protocol.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 70f20c8eddbd..044295707bbf 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -238,9 +238,16 @@ void mptcp_data_ready(struct sock *sk, struct sock *ssk)
 	if (atomic_read(&sk->sk_rmem_alloc) > READ_ONCE(sk->sk_rcvbuf))
 		goto wake;

-	if (schedule_work(&msk->work))
-		sock_hold((struct sock *)msk);
+	/* mptcp socket is owned, release_cb should retry */
+	if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED,
+			      &sk->sk_tsq_flags)) {
+		sock_hold(sk);

+		/* need to try again, its possible release_cb() has already
+		 * been called after the test_and_set_bit() above.
+		 */
+		move_skbs_to_msk(msk, ssk);
+	}
 wake:
 	sk->sk_data_ready(sk);
 }
@@ -941,6 +948,32 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname,
 	return -EOPNOTSUPP;
 }

+#define MPTCP_DEFERRED_ALL TCPF_DELACK_TIMER_DEFERRED
+
+/* this is very alike tcp_release_cb() but we must handle differently a
+ * different set of events
+ */
+static void mptcp_release_cb(struct sock *sk)
+{
+	unsigned long flags, nflags;
+
+	do {
+		flags = sk->sk_tsq_flags;
+		if (!(flags & MPTCP_DEFERRED_ALL))
+			return;
+		nflags = flags & ~MPTCP_DEFERRED_ALL;
+	} while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);
+
+	if (flags & TCPF_DELACK_TIMER_DEFERRED) {
+		struct mptcp_sock *msk = mptcp_sk(sk);
+		struct sock *ssk;
+
+		ssk = mptcp_subflow_recv_lookup(msk);
+		if (!ssk || !schedule_work(&msk->work))
+			__sock_put(sk);
+	}
+}
+
 static int mptcp_get_port(struct sock *sk, unsigned short snum)
 {
 	struct mptcp_sock *msk = mptcp_sk(sk);
@@ -1016,6 +1049,7 @@ static struct proto mptcp_prot = {
 	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
+	.release_cb	= mptcp_release_cb,
 	.hash		= inet_hash,
 	.unhash		= inet_unhash,
 	.get_port	= mptcp_get_port,
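To recap the handshake the last two patches set up (a sketch; names as
in the diffs above): while the msk is owned, mptcp_data_ready() only
marks the deferred event and takes a reference; mptcp_release_cb()
then clears the deferred bits atomically and schedules the worker,
dropping the reference again if the worker is not needed:

	/* data_ready side: msk owned, defer to release_cb */
	if (!test_and_set_bit(TCP_DELACK_TIMER_DEFERRED, &sk->sk_tsq_flags))
		sock_hold(sk);

	/* release_cb side: runs when the owner releases the msk */
	do {
		flags = sk->sk_tsq_flags;
		if (!(flags & MPTCP_DEFERRED_ALL))
			return;
		nflags = flags & ~MPTCP_DEFERRED_ALL;
	} while (cmpxchg(&sk->sk_tsq_flags, flags, nflags) != flags);

	if (!ssk || !schedule_work(&msk->work))
		__sock_put(sk);		/* balances the sock_hold() above */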