From patchwork Fri May 29 16:56:03 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Christoph Paasch X-Patchwork-Id: 1300852 Return-Path: X-Original-To: incoming@patchwork.ozlabs.org Delivered-To: patchwork-incoming@bilbo.ozlabs.org Authentication-Results: ozlabs.org; spf=none (no SPF record) smtp.mailfrom=lists.01.org (client-ip=198.145.21.10; helo=ml01.01.org; envelope-from=mptcp-bounces@lists.01.org; receiver=) Authentication-Results: ozlabs.org; dmarc=fail (p=quarantine dis=none) header.from=apple.com Authentication-Results: ozlabs.org; dkim=fail reason="signature verification failed" (2048-bit key; unprotected) header.d=apple.com header.i=@apple.com header.a=rsa-sha256 header.s=20180706 header.b=o2W4e5iX; dkim-atps=neutral Received: from ml01.01.org (ml01.01.org [198.145.21.10]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (4096 bits) server-digest SHA256) (No client certificate requested) by ozlabs.org (Postfix) with ESMTPS id 49YW190Jm8z9sSF for ; Sat, 30 May 2020 02:56:12 +1000 (AEST) Received: from ml01.vlan13.01.org (localhost [IPv6:::1]) by ml01.01.org (Postfix) with ESMTP id 51A6D12310917; Fri, 29 May 2020 09:51:44 -0700 (PDT) Received-SPF: Pass (mailfrom) identity=mailfrom; client-ip=17.151.62.68; helo=nwk-aaemail-lapp03.apple.com; envelope-from=cpaasch@apple.com; receiver= Received: from nwk-aaemail-lapp03.apple.com (nwk-aaemail-lapp03.apple.com [17.151.62.68]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ml01.01.org (Postfix) with ESMTPS id 153C2122F4E57 for ; Fri, 29 May 2020 09:51:43 -0700 (PDT) Received: from pps.filterd (nwk-aaemail-lapp03.apple.com [127.0.0.1]) by nwk-aaemail-lapp03.apple.com (8.16.0.42/8.16.0.42) with SMTP id 04TGqNXL009534; Fri, 29 May 2020 09:56:08 -0700 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=apple.com; h=date : from : to : cc : subject : message-id : references : mime-version : content-type : in-reply-to; s=20180706; bh=XpuJg3VlUty+M3fbQYIQUTaqFmIo8QboKxRiRr85EWY=; b=o2W4e5iXPZL86YQrNT4bIs3RxDCpCvpA3FlKXvoQPl8OIM3yll5zIKrM/wkYlVuw6zxP 4JlIOUH2M/vNH3i8g3GS6PGJ08jXfKj+bobTG7KvyBXnhQjrbuWHQxAFhJdztBx12nAD aRrg2rtRiL/+S8L6FihxQJ7slf6UsUWjlR1HxxiZM8HveUTvMquXVIAKcj0ZsI3VKAcD hWc4tQt7ZQkAYFC4zQyYouv4mqvd+9Hh5D0seFdBdHA46Zt6pUI6IvUcQ76DRBtPxBk7 NINpdjTT/WDPgIuqc6r7riGUjMFfFf+XmI7ldU3zM/B066oJyITKMo6iQn1/EWznXvW7 RA== Received: from rn-mailsvcp-mta-lapp03.rno.apple.com (rn-mailsvcp-mta-lapp03.rno.apple.com [10.225.203.151]) by nwk-aaemail-lapp03.apple.com with ESMTP id 317khrmpkw-1 (version=TLSv1.2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128 verify=NO); Fri, 29 May 2020 09:56:08 -0700 Received: from rn-mailsvcp-mmp-lapp02.rno.apple.com (rn-mailsvcp-mmp-lapp02.rno.apple.com [17.179.253.15]) by rn-mailsvcp-mta-lapp03.rno.apple.com (Oracle Communications Messaging Server 8.1.0.5.20200312 64bit (built Mar 12 2020)) with ESMTPS id <0QB300PTER1H3KD0@rn-mailsvcp-mta-lapp03.rno.apple.com>; Fri, 29 May 2020 09:56:06 -0700 (PDT) Received: from process_milters-daemon.rn-mailsvcp-mmp-lapp02.rno.apple.com by rn-mailsvcp-mmp-lapp02.rno.apple.com (Oracle Communications Messaging Server 8.1.0.5.20200312 64bit (built Mar 12 2020)) id <0QB300700PK6SM00@rn-mailsvcp-mmp-lapp02.rno.apple.com>; Fri, 29 May 2020 09:56:05 -0700 (PDT) X-Va-A: X-Va-T-CD: 5faae600979887376ed7757753b2aa2e X-Va-E-CD: a62a209f92e2ca4bd4640b338a250ca5 X-Va-R-CD: a846e32603d5d1a5aa588c6a179738e7 X-Va-CD: 0 X-Va-ID: 0c508cde-b956-490b-ac7e-ec3720a87d13 X-V-A: X-V-T-CD: 5faae600979887376ed7757753b2aa2e X-V-E-CD: a62a209f92e2ca4bd4640b338a250ca5 X-V-R-CD: a846e32603d5d1a5aa588c6a179738e7 X-V-CD: 0 X-V-ID: 8b74820d-240a-4ace-a866-4b744f63f8fe X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.687 definitions=2020-05-29_08:2020-05-28,2020-05-29 signatures=0 Received: from localhost ([17.235.42.40]) by rn-mailsvcp-mmp-lapp02.rno.apple.com (Oracle Communications Messaging Server 8.1.0.5.20200312 64bit (built Mar 12 2020)) with ESMTPSA id <0QB3004TRR1FVT10@rn-mailsvcp-mmp-lapp02.rno.apple.com>; Fri, 29 May 2020 09:56:03 -0700 (PDT) Date: Fri, 29 May 2020 09:56:03 -0700 From: Christoph Paasch To: Florian Westphal Message-id: <20200529165603.GN64606@MacBook-Pro-64.local> References: <20200529091728.2679-1-fw@strlen.de> MIME-version: 1.0 Content-disposition: inline In-reply-to: <20200529091728.2679-1-fw@strlen.de> User-Agent: Mutt/1.12.2 (2019-09-21) X-Proofpoint-Virus-Version: vendor=fsecure engine=2.50.10434:6.0.216,18.0.687 definitions=2020-05-29_08:2020-05-28,2020-05-29 signatures=0 Message-ID-Hash: EUQQPUMDZZXQHB2FM4MOGBUVFKZT7RK6 X-Message-ID-Hash: EUQQPUMDZZXQHB2FM4MOGBUVFKZT7RK6 X-MailFrom: cpaasch@apple.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; suspicious-header CC: mptcp@lists.01.org X-Mailman-Version: 3.1.1 Precedence: list Subject: [MPTCP] Re: [PATCH v2 mptcp-next] mptcp: add receive buffer auto-tuning List-Id: Discussions regarding MPTCP upstreaming Archived-At: List-Archive: List-Help: List-Post: List-Subscribe: List-Unsubscribe: On 05/29/20 - 11:17, Florian Westphal wrote: > When mptcp is used, userspace doesn't read from the tcp (subflow) > socket but from the parent (mptcp) socket receive queue. > > skbs are moved from the subflow socket to the mptcp rx queue either from > 'data_ready' callback (if mptcp socket can be locked), a work queue, or > the socket receive function. > > This means tcp_rcv_space_adjust() is never called and thus no receive > buffer size auto-tuning is done. > > An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the > function that moves skbs from subflow to mptcp socket. > While this enabled autotuning, it also meant tuning was done even if > userspace was reading the mptcp socket very slowly. > > This adds mptcp_rcv_space_adjust() and calls it after userspace has > read data from the mptcp socket rx queue. > > Its very similar to tcp_rcv_space_adjust, with two differences: > > 1. The rtt estimate is the largest one observed on a subflow > 2. The rcvbuf size and window clamp of all subflows is adjusted > to the mptcp-level rcvbuf. > > Otherwise, we get spurious drops at tcp (subflow) socket level if > the skbs are not moved to the mptcp socket fast enough and reduced > throughput.. > > Before: > time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap > [..] > ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40562ms) [ OK ] > ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5415ms) [ OK ] > ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5413ms) [ OK ] > ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41331ms) [ OK ] > ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5415ms) [ OK ] > ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5714ms) [ OK ] > Time: 846 seconds > > After: > ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ] > ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5429ms) [ OK ] > ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5418ms) [ OK ] > ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5423ms) [ OK ] > ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5715ms) [ OK ] > ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5415ms) [ OK ] > Time: 275 seconds > > Signed-off-by: Florian Westphal > --- > changes in v2: > - cache last rtt_us value used > - don't store seq value > - reset 'copied' to 0 when starting > new measurement to simplify adjust function. > - make sure space.space is not inited to 0, else div-by-0 occurs > > net/mptcp/protocol.c | 124 ++++++++++++++++++++++++++++++++++++++++--- > net/mptcp/protocol.h | 6 +++ > 2 files changed, 123 insertions(+), 7 deletions(-) > > diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c > index b2c8b57e7942..3827e4004877 100644 > --- a/net/mptcp/protocol.c > +++ b/net/mptcp/protocol.c > @@ -207,13 +207,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk, > return false; > } > > - if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > - int rcvbuf = max(ssk->sk_rcvbuf, sk->sk_rcvbuf); > - > - if (rcvbuf > sk->sk_rcvbuf) > - sk->sk_rcvbuf = rcvbuf; > - } > - > tp = tcp_sk(ssk); > do { > u32 map_remaining, offset; > @@ -928,6 +921,100 @@ static int __mptcp_recvmsg_mskq(struct mptcp_sock *msk, > return copied; > } > > +/* receive buffer autotuning. See tcp_rcv_space_adjust for more information. > + * > + * Only difference: Use highest rtt estimate of the subflows in use. > + */ > +static void mptcp_rcv_space_adjust(struct mptcp_sock *msk, int copied) > +{ > + struct mptcp_subflow_context *subflow; > + struct sock *sk = (struct sock *)msk; > + u32 time, advmss = 1; > + u64 rtt_us, mstamp; > + > + sock_owned_by_me(sk); > + > + if (copied <= 0) > + return; > + > + msk->rcvq_space.copied += copied; > + > + mstamp = div_u64(tcp_clock_ns(), NSEC_PER_USEC); > + time = tcp_stamp_us_delta(mstamp, msk->rcvq_space.time); > + > + rtt_us = msk->rcvq_space.rtt_us; > + if (rtt_us && time < (rtt_us >> 3)) > + return; > + > + rtt_us = 0; > + mptcp_for_each_subflow(msk, subflow) { > + const struct tcp_sock *tp; > + u64 sf_rtt_us; > + u32 sf_advmss; > + > + tp = tcp_sk(mptcp_subflow_tcp_sock(subflow)); > + > + sf_rtt_us = READ_ONCE(tp->rcv_rtt_est.rtt_us); > + sf_advmss = READ_ONCE(tp->advmss); > + > + rtt_us = max(sf_rtt_us, rtt_us); > + advmss = max(sf_advmss, advmss); > + } > + > + msk->rcvq_space.rtt_us = rtt_us; > + if (time < (rtt_us >> 3) || rtt_us == 0) > + return; > + > + if (msk->rcvq_space.copied <= msk->rcvq_space.space) > + goto new_measure; > + > + if (sock_net(sk)->ipv4.sysctl_tcp_moderate_rcvbuf && > + !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) { > + int rcvmem, rcvbuf; > + u64 rcvwin, grow; > + > + rcvwin = ((u64)msk->rcvq_space.copied << 1) + 16 * advmss; > + > + grow = rcvwin *(msk->rcvq_space.copied - msk->rcvq_space.space); > + > + do_div(grow, msk->rcvq_space.space); > + rcvwin += (grow << 1); > + > + rcvmem = SKB_TRUESIZE(advmss + MAX_TCP_HEADER); > + while (tcp_win_from_space(sk, rcvmem) < advmss) > + rcvmem += 128; > + > + do_div(rcvwin, advmss); > + rcvbuf = min_t(u64, rcvwin * rcvmem, > + sock_net(sk)->ipv4.sysctl_tcp_rmem[2]); > + > + if (rcvbuf > sk->sk_rcvbuf) { > + u32 window_clamp; > + > + window_clamp = tcp_win_from_space(sk, rcvbuf); > + WRITE_ONCE(sk->sk_rcvbuf, rcvbuf); > + > + /* Make subflows follow along. If we do not do this, we > + * get drops at subflow level if skbs can't be moved to > + * the mptcp rx queue fast enough (announced rcv_win can > + * exceed ssk->sk_rcvbuf). > + */ > + mptcp_for_each_subflow(msk, subflow) { > + struct sock *ssk; > + > + ssk = mptcp_subflow_tcp_sock(subflow); > + WRITE_ONCE(ssk->sk_rcvbuf, rcvbuf); > + tcp_sk(ssk)->window_clamp = window_clamp; > + } > + } > + } > + > + msk->rcvq_space.space = msk->rcvq_space.copied; > +new_measure: > + msk->rcvq_space.copied = 0; > + msk->rcvq_space.time = mstamp; > +} > + > static bool __mptcp_move_skbs(struct mptcp_sock *msk) > { > unsigned int moved = 0; > @@ -1050,6 +1137,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > set_bit(MPTCP_DATA_READY, &msk->flags); > } > out_err: > + mptcp_rcv_space_adjust(msk, copied); > + > release_sock(sk); > return copied; > } > @@ -1280,6 +1369,7 @@ static int mptcp_init_sock(struct sock *sk) > return ret; > > sk_sockets_allocated_inc(sk); > + sk->sk_rcvbuf = sock_net(sk)->ipv4.sysctl_tcp_rmem[1]; > sk->sk_sndbuf = sock_net(sk)->ipv4.sysctl_tcp_wmem[2]; > > return 0; > @@ -1475,6 +1565,23 @@ struct sock *mptcp_sk_clone(const struct sock *sk, > return nsk; > } > > +static void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk) > +{ > + struct tcp_sock *tp = tcp_sk(ssk); > + > + msk->rcvq_space.copied = 0; > + msk->rcvq_space.rtt_us = 0; > + > + tcp_mstamp_refresh(tp); > + msk->rcvq_space.time = tp->tcp_mstamp; > + > + /* initial rcv_space offering made to peer */ > + msk->rcvq_space.space = min_t(u32, tp->rcv_wnd, > + TCP_INIT_CWND * tp->advmss); > + if (msk->rcvq_space.space == 0) > + msk->rcvq_space.space = TCP_INIT_CWND * TCP_MSS_DEFAULT; > +} > + > static struct sock *mptcp_accept(struct sock *sk, int flags, int *err, > bool kern) > { > @@ -1524,6 +1631,7 @@ static struct sock *mptcp_accept(struct sock *sk, int flags, int *err, > list_add(&subflow->node, &msk->conn_list); > inet_sk_state_store(newsk, TCP_ESTABLISHED); > > + mptcp_rcv_space_init(msk, ssk); > bh_unlock_sock(new_mptcp_sock); > > __MPTCP_INC_STATS(sock_net(sk), MPTCP_MIB_MPCAPABLEPASSIVEACK); > @@ -1678,6 +1786,8 @@ void mptcp_finish_connect(struct sock *ssk) > atomic64_set(&msk->snd_una, msk->write_seq); > > mptcp_pm_new_connection(msk, 0); > + > + mptcp_rcv_space_init(msk, ssk); > } Need to also handle the fallback to TCP-case. Otherwise there is a divide-by-0: server login: [ 1998.037445] divide error: 0000 [#1] SMP KASAN PTI [ 1998.038321] CPU: 3 PID: 1671 Comm: http Kdump: loaded Not tainted 5.7.0-rc6.mptcp #43 [ 1998.039599] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 1998.041526] RIP: 0010:mptcp_recvmsg+0x6ae/0xa40 [ 1998.042370] Code: 44 89 e0 89 da 4c 8b 44 24 08 45 8d b4 24 40 03 00 00 c1 e0 04 48 8d 0c 50 89 d8 49 8d b8 60 04 00 00 31 d2 29 e8 48 0f af c1 <48> f7 f5 4c 8d 2c 41 e8 66 6b 6a ff 4c 8b 44 24 08 41 8b a8 64 [ 1998.045394] RSP: 0018:ffff8881174bf968 EFLAGS: 00010206 [ 1998.046237] RAX: 0000000140764000 RBX: 000000000000b500 RCX: 000000000001c540 [ 1998.047397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8291b460 [ 1998.048564] RBP: 0000000000000000 R08: ffffffff8291b000 R09: ffffffff82f1d0c8 [ 1998.049724] R10: ffffffff82f1d0c3 R11: fffffbfff05e3a18 R12: 00000000000005b4 [ 1998.050872] R13: 000000000022b368 R14: 00000000000008f4 R15: ffff888118740000 [ 1998.052059] FS: 00007f081dd1ea40(0000) GS:ffff88811b980000(0000) knlGS:0000000000000000 [ 1998.053357] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1998.054189] CR2: 000055b5364a7000 CR3: 0000000115d5e000 CR4: 00000000000006e0 [ 1998.055231] Call Trace: [ 1998.057574] inet_recvmsg+0x207/0x220 [ 1998.058832] sock_read_iter+0x1fe/0x230 [ 1998.060656] new_sync_read+0x33a/0x350 [ 1998.063934] vfs_read+0xbc/0x1b0 [ 1998.064466] ksys_read+0x11b/0x150 [ 1998.066975] do_syscall_64+0xbc/0x790 [ 1998.070940] entry_SYSCALL_64_after_hwframe+0x44/0xa9 This diff fixes it: ======== diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 2820da7c787a..a5b36a4c04f7 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -1533,7 +1533,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk, return nsk; } -static void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk) +void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk) { struct tcp_sock *tp = tcp_sk(ssk); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index e53410830ec3..cef677ebfd09 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -383,6 +383,7 @@ struct sock *mptcp_sk_clone(const struct sock *sk, void mptcp_get_options(const struct sk_buff *skb, struct mptcp_options_received *mp_opt); +void mptcp_rcv_space_init(struct mptcp_sock *msk, struct sock *ssk); void mptcp_finish_connect(struct sock *sk); void mptcp_data_ready(struct sock *sk, struct sock *ssk); bool mptcp_finish_join(struct sock *sk); diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index 7222503b9583..0bf2b9d575fb 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -259,8 +259,10 @@ static void subflow_finish_connect(struct sock *sk, const struct sk_buff *skb) MPTCP_MIB_MPCAPABLEACTIVEFALLBACK); } - if (mptcp_check_fallback(sk)) + if (mptcp_check_fallback(sk)) { + mptcp_rcv_space_init(mptcp_sk(parent), sk); return; + } if (subflow->mp_capable) { pr_debug("subflow=%p, remote_key=%llu", mptcp_subflow_ctx(sk),