Message ID: 20200525181508.13492-3-fw@strlen.de
State: Changes Requested
Delegated to: David Miller
Series: mptcp: adjust tcp rcvspace on rx
Hello,

On Mon, May 25, 2020 at 3:19 PM Florian Westphal <fw@strlen.de> wrote:
>
> From: Paolo Abeni <pabeni@redhat.com>
>
> Place receive window tuning in the recvmsg path.
> This makes sure the size is only increased when userspace consumes data.
>
> Previously we would grow the sk receive buffer towards tcp_rmem[2], now we
> do so only if userspace reads data.
>
> Simply adjust the msk rcvbuf size to the largest receive buffer of any of
> the existing subflows.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> This patch is new in v2.
>
>  net/mptcp/protocol.c | 32 ++++++++++++++++++++++++--------
>  1 file changed, 24 insertions(+), 8 deletions(-)
>
> diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
> index dbb86cbb9e77..89a35c3fc499 100644
> --- a/net/mptcp/protocol.c
> +++ b/net/mptcp/protocol.c
> @@ -190,13 +190,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
>  		return false;
>  	}
>
> -	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
> -		int rcvbuf = max(ssk->sk_rcvbuf, sk->sk_rcvbuf);
> -
> -		if (rcvbuf > sk->sk_rcvbuf)
> -			sk->sk_rcvbuf = rcvbuf;
> -	}
> -
>  	tp = tcp_sk(ssk);
>  	do {
>  		u32 map_remaining, offset;
> @@ -933,6 +926,25 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
>  	return moved > 0;
>  }
>
> +static void mptcp_rcv_space_adjust(struct mptcp_sock *msk)
> +{
> +	const struct mptcp_subflow_context *subflow;
> +	struct sock *sk = (struct sock *)msk;
> +	const struct sock *ssk;
> +	int rcvbuf = 0;
> +
> +	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
> +		return;
> +
> +	mptcp_for_each_subflow(msk, subflow) {
> +		ssk = mptcp_subflow_tcp_sock(subflow);
> +		rcvbuf = max(ssk->sk_rcvbuf, rcvbuf);

tcp_rcv_space_adjust is called even when the app is not yet reading,
thus wouldn't this mean that we still end up with an ever-growing
window?

E.g., imagine an app that does not read at all at the beginning.
The call to tcp_rcv_space_adjust in patch 1/2 will make the subflow's
window grow. Now, the app comes and reads one byte. Then, the window at
the MPTCP level will jump regardless of how much the app actually read.

I think what is needed is to size the MPTCP-level window to 2x the
amount of data read by the application within an RTT (the maximum RTT
among all the active subflows). That way, if an app reads 1 byte a
second, the window will remain low, while for a bulk transfer it will
allow all subflows to receive at full speed [1].

Or do you think that kind of tuning can be done in a follow-up patch?


Christoph

[1]: https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final125.pdf -> Section 4.2
Christoph Paasch <christoph.paasch@gmail.com> wrote:
> tcp_rcv_space_adjust is called even when the app is not yet reading,
> thus wouldn't this mean that we still end up with an ever-growing
> window?

The window is based on the available mptcp sk recvbuf. When data is
moved from the ssk to the mptcp sk, the skb truesize is charged to the
mptcp rmem.

> E.g., imagine an app that does not read at all at the beginning. The
> call to tcp_rcv_space_adjust in patch 1/2 will make the subflow's
> window grow. Now, the app comes and reads one byte. Then, the window
> at MPTCP-level will jump regardless of how much the app actually read.

Yes, the rcvbuf size will jump, regardless of how much the app actually
read.

> I think what is needed is to size the MPTCP-level window to 2 x the
> amount of data read by the application within an RTT (maximum RTT
> among all the active subflows). That way if an app reads 1 byte a
> second, the window will remain low. While for a bulk-transfer it will
> allow all subflows to receive at full speed [1].

Sounds like the idea to move skbs to the msk was a bad one?

Sorry, I don't see how I can make this work. Even deferring
tcp_rcv_space_adjust() until recv() time won't work, given the data has
already been pulled to the mptcp socket.

Not calling tcp_rcv_space_adjust() at all might work, but that would
require something else entirely. We would still have to adjust the
subflow rcvbuf size in this case, else we may announce a window that is
larger than the memory limit of the ssk (and we would end up dropping
data at the tcp level if the worker can't move the skbs fast enough).

> Or do you think that kind of tuning can be done in a follow-up patch?

This sounds completely different, so I don't think that makes sense.
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index dbb86cbb9e77..89a35c3fc499 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -190,13 +190,6 @@ static bool __mptcp_move_skbs_from_subflow(struct mptcp_sock *msk,
 		return false;
 	}
 
-	if (!(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
-		int rcvbuf = max(ssk->sk_rcvbuf, sk->sk_rcvbuf);
-
-		if (rcvbuf > sk->sk_rcvbuf)
-			sk->sk_rcvbuf = rcvbuf;
-	}
-
 	tp = tcp_sk(ssk);
 	do {
 		u32 map_remaining, offset;
@@ -933,6 +926,25 @@ static bool __mptcp_move_skbs(struct mptcp_sock *msk)
 	return moved > 0;
 }
 
+static void mptcp_rcv_space_adjust(struct mptcp_sock *msk)
+{
+	const struct mptcp_subflow_context *subflow;
+	struct sock *sk = (struct sock *)msk;
+	const struct sock *ssk;
+	int rcvbuf = 0;
+
+	if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+		return;
+
+	mptcp_for_each_subflow(msk, subflow) {
+		ssk = mptcp_subflow_tcp_sock(subflow);
+		rcvbuf = max(ssk->sk_rcvbuf, rcvbuf);
+	}
+
+	if (rcvbuf)
+		sk->sk_rcvbuf = rcvbuf;
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
@@ -962,6 +974,8 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 	target = sock_rcvlowat(sk, flags & MSG_WAITALL, len);
 	__mptcp_flush_join_list(msk);
 
+	mptcp_rcv_space_adjust(msk);
+
 	while (len > (size_t)copied) {
 		int bytes_read;
 
@@ -975,8 +989,10 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 		copied += bytes_read;
 
 		if (skb_queue_empty(&sk->sk_receive_queue) &&
-		    __mptcp_move_skbs(msk))
+		    __mptcp_move_skbs(msk)) {
+			mptcp_rcv_space_adjust(msk);
 			continue;
+		}
 
 		/* only the master socket status is relevant here. The exit
 		 * conditions mirror closely tcp_recvmsg()