Message ID | 20191106045011.10071-1-mathew.j.martineau@linux.intel.com |
---|---|
State | Deferred, archived |
Delegated to: | Paolo Abeni |
Headers | show |
Series | mptcp: Add DATA_FIN transmission and handling | expand |
Mat Martineau <mathew.j.martineau@linux.intel.com> wrote: > * A DATA_FIN that arrives on a packet without data payload (unless it's > a TCP FIN) are discarded by tcp_data_queue() and not propagated to > the MPTCP socket. This is detected in mptcp_incoming_options() but > not yet handled. The skb with the DATA_FIN could be cloned and > placed in the subflow error queue, or a workqueue could handle it. I have a pending patch that adds an ack workqueue to deal with the case where we get an mptcp ack for an mptcp socket that has exceeded its wmem limit, it can be reused for this too. I'll send it soon.
Hi, Thanks for doing this! I have quite a few comments below... As a general note, this patch looks quite complex, what about splitting it in a [small] series? I think it would be nice: - introducing the state machine in a separate patch, - possibly another one to deal with state update on fallback, - possibly a separate patch to fix status update on listen - data fin handling code On Tue, 2019-11-05 at 20:50 -0800, Mat Martineau wrote: > @@ -632,6 +663,18 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb, > > mpext->data_fin = mp_opt->data_fin; > > + if (mp_opt->data_fin && > + TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) { > + /* The DATA_FIN is on a packet that will be discarded by > + * tcp_data_queue() and will not get propagated to the MPTCP > + * socket. > + * > + * Use workqueue or subflow error queue? > + */ > + pr_warn("Ignored DATA_FIN"); > + } > + > + > if (msk) > mptcp_pm_fully_established(msk); > } I think we don't need a workqueue here: subflow could simply set (WRITE_ONCE) one/some msk level field[s]. I don't think we need to deal with subflows concurrently writing different values there - it will be a protocol violation, the stream will be corrupted anyway. [...] > @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) > lock_sock(ssk); > mptcp_clean_una(sk); > timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); > + > + if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { > + ret = sk_stream_wait_connect(sk, &timeo); > + if (ret != 0) > + goto put_out_subflow; > + } > + > while (msg_data_left(msg)) { > ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now, > &size_goal); I think waiting is not needed here?!? we could just bail out with ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located before selecting the egress subflow: it should be outside the ssk lock. [...] > @@ -403,6 +410,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) > mptcp_reset_timer(sk); > } > > +put_out_subflow: > release_sock(ssk); > > out: > @@ -464,6 +472,41 @@ static void mptcp_wait_data(struct sock *sk, long *timeo) > remove_wait_queue(sk_sleep(sk), &wait); > } > > +static int mptcp_close_state(int oldstate, bool fin, bool ack) > +{ > + switch (oldstate) { > + case TCP_FIN_WAIT1: > + if (fin && ack) > + return TCP_TIME_WAIT; > + else if (fin) > + return TCP_FIN_WAIT2; > + else > + return TCP_CLOSING; As we are in 'TCP_FIN_WAIT1', we sent a DATA_FIN, I think we should: - move to TCP_FIN_WAIT2 when the fin is acked (snd_una == data_fin_seq) - move to TCP_CLOSING when we receive a DATA_FIN is the 'ack' argument true when there aren't any unacked data at the MPTCP level?!? > + case TCP_FIN_WAIT2: > + if (fin) > + return TCP_TIME_WAIT; > + else > + return oldstate; When receiving the data fin in TCP_FIN_WAIT2 status, I think we can move directly to TCP_CLOSED and let the subflow handle the time wait. It's not clear to me if the 'fin' argument refears to 'ingress' or 'egress' fin (or has different meaning according to the current sk state). > + case TCP_CLOSING: > + if (ack) > + return TCP_TIME_WAIT; > + else > + return oldstate; I think here we have to check (snd_una == data_fin_seq), and move directly to TCP_CLOSED - the subflows will handle the time wait > + case TCP_LAST_ACK: > + if (ack) > + return TCP_CLOSE; > + else > + return oldstate; > + case TCP_ESTABLISHED: > + return TCP_CLOSE_WAIT; I think we should enter TCP_CLOSE_WAIT iff we received a data fin, and it looks like we need two additional cases: - TCP_CLOSE_WAIT -> TCP_LAST_ACK, when we send data fin - TCP_LAST_ACK -> TCP_CLOSED when our fin is acked Finally I think we can avoid the TCP_TIME_WAIT at all, and we could handle some additional corner cases: - all subflows are closed (due to TCP reset) -> move msk state to TCP_CLOSE - what if we receive TCP_FIN on all subflows, but no MPTCP data fin ? [...] > @@ -545,6 +588,20 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, > continue; > } > } > + > + if (!subflow->map_valid && subflow->incoming_data_fin) { > + int newstate; > + > + // TODO: Send on all subflows? > + newstate = mptcp_close_state(sk->sk_state, true, false); > + msk->ack_seq++; > + subflow->incoming_data_fin = 0; > + subflow->send_data_fin = 1; > + subflow->data_fin_seq = msk->write_seq; The above means that the msk socket will send DATA_FIN after receiving one, regardless of shutdown(), right? I think that is not required ?!? [...] > @@ -747,11 +804,11 @@ static void mptcp_close(struct sock *sk, long timeout) > struct mptcp_sock *msk = mptcp_sk(sk); > struct socket *ssk = NULL; > > - mptcp_token_destroy(msk->token); > - inet_sk_state_store(sk, TCP_CLOSE); > - > lock_sock(sk); > > + inet_sk_state_store(sk, TCP_CLOSE); > + mptcp_token_destroy(msk->token); > + Don't we need to apply the above state machine here? and possibly use sk_stream_wait_close()? (with appropriate states handling, we can possibly re-use as it) > if (msk->subflow) { > ssk = msk->subflow; > msk->subflow = NULL; [...] > @@ -1262,6 +1330,18 @@ static int mptcp_shutdown(struct socket *sock, int how) > pr_debug("sk=%p, how=%d", msk, how); > > lock_sock(sock->sk); > + > + if (sock->sk->sk_state == TCP_ESTABLISHED) { > + inet_sk_state_store(sock->sk, TCP_FIN_WAIT1); > + } else if (sock->sk->sk_state == TCP_CLOSE_WAIT) { > + inet_sk_state_store(sock->sk, TCP_LAST_ACK); > + } else if (sock->sk->sk_state != TCP_CLOSE) { > + pr_warn("Shutdown from unexpected state %d", > + sock->sk->sk_state); > + release_sock(sock->sk); > + return -EALREADY; > + } > + I think we can re-use the state machine for the above transition? Shoul we need to check for: (how & SEND_SHUTDOWN) before doinf any mptcp level action? Should we apply the state changes only if !fallback, and ev copying the fallback socket status ? > ssock = __mptcp_fallback_get_ref(msk); > if (ssock) { > release_sock(sock->sk); > @@ -1276,6 +1356,8 @@ static int mptcp_shutdown(struct socket *sock, int how) > > tcp_socket = mptcp_subflow_tcp_socket(subflow); > pr_debug("conn_list->subflow=%p", subflow); > + subflow->send_data_fin = 1; > + subflow->data_fin_seq = msk->write_seq; > ret = kernel_sock_shutdown(tcp_socket, how); > } > release_sock(sock->sk); I think here we must avoid shutting down the subflow (but likely we can shut them down when msk transition to TCP_FIN_WAIT2. Additionally we can consider calling tcp_close() on all subflows when msk transition to TCP_CLOSE state > @@ -1303,7 +1385,7 @@ void mptcp_proto_init(void) > mptcp_stream_ops.accept = mptcp_stream_accept; > mptcp_stream_ops.getname = mptcp_getname; > mptcp_stream_ops.listen = mptcp_listen; > - mptcp_stream_ops.shutdown = mptcp_shutdown; > + mptcp_stream_ops.shutdown = mptcp_stream_shutdown; > > if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL)) > panic("Failed to allocate MPTCP pcpu counter\n"); > diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h > index 8fe5f9383d38..cf90014974ea 100644 > --- a/net/mptcp/protocol.h > +++ b/net/mptcp/protocol.h > @@ -217,9 +217,12 @@ struct mptcp_subflow_context { > map_valid : 1, > backup : 1, > data_avail : 1, > - rx_eof : 1; > + rx_eof : 1, > + incoming_data_fin : 1, > + send_data_fin : 1; > u32 remote_nonce; > u64 thmac; > + u64 data_fin_seq; > u32 local_nonce; > u32 remote_token; > u8 hmac[MPTCPOPT_HMAC_LEN]; Possibly we can move this field to msk? and possibly we don't need send_data_fin? if we have a dss, we send data fin iff dss map matches msk data_fin_seq if this is not a data packet (no dss) we send the data_fin (once ???) if msk->write_seq == data_fin_seq. Cheers, Paolo
On Wed, 6 Nov 2019, Paolo Abeni wrote: > Hi, > > Thanks for doing this! > > I have quite a few comments below... As a general note, this patch > looks quite complex, what about splitting it in a [small] series? I > think it would be nice: > > - introducing the state machine in a separate patch, > - possibly another one to deal with state update on fallback, > - possibly a separate patch to fix status update on listen > - data fin handling code > > On Tue, 2019-11-05 at 20:50 -0800, Mat Martineau wrote: >> @@ -632,6 +663,18 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb, >> >> mpext->data_fin = mp_opt->data_fin; >> >> + if (mp_opt->data_fin && >> + TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) { >> + /* The DATA_FIN is on a packet that will be discarded by >> + * tcp_data_queue() and will not get propagated to the MPTCP >> + * socket. >> + * >> + * Use workqueue or subflow error queue? >> + */ >> + pr_warn("Ignored DATA_FIN"); >> + } >> + >> + >> if (msk) >> mptcp_pm_fully_established(msk); >> } > > I think we don't need a workqueue here: subflow could simply set > (WRITE_ONCE) one/some msk level field[s]. > > I don't think we need to deal with subflows concurrently writing > different values there - it will be a protocol violation, the stream > will be corrupted anyway. The goal of the workqueue would be to process the DATA_FIN (and ack it if the recv side is caught up) even when there's no pending send/recv from userspace to wake up. That means the subflow error queue wouldn't work either. > > [...] >> @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) >> lock_sock(ssk); >> mptcp_clean_una(sk); >> timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); >> + >> + if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { >> + ret = sk_stream_wait_connect(sk, &timeo); >> + if (ret != 0) >> + goto put_out_subflow; >> + } >> + >> while (msg_data_left(msg)) { >> ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now, >> &size_goal); > > I think waiting is not needed here?!? we could just bail out with > ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located > before selecting the egress subflow: it should be outside the ssk lock. I agree that the sk state check and possible waiting need to happen outside the ssk lock, thank you for spotting that. Waiting isn't needed for the states at the end of the connection, but I saw that TCP sendmsg/do_tcp_sendpages verified the connection state prior to sending in this way. What if I move this check very early in the function, before the fallback sock_sendmsg()? If there's an attempt to send before the handshake is done, we don't want to be waiting in the regular TCP sendmsg(). > > [...] >> @@ -403,6 +410,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) >> mptcp_reset_timer(sk); >> } >> >> +put_out_subflow: >> release_sock(ssk); >> >> out: >> @@ -464,6 +472,41 @@ static void mptcp_wait_data(struct sock *sk, long *timeo) >> remove_wait_queue(sk_sleep(sk), &wait); >> } >> >> +static int mptcp_close_state(int oldstate, bool fin, bool ack) >> +{ >> + switch (oldstate) { >> + case TCP_FIN_WAIT1: >> + if (fin && ack) >> + return TCP_TIME_WAIT; >> + else if (fin) >> + return TCP_FIN_WAIT2; >> + else >> + return TCP_CLOSING; > > As we are in 'TCP_FIN_WAIT1', we sent a DATA_FIN, I think we should: > > - move to TCP_FIN_WAIT2 when the fin is acked (snd_una == data_fin_seq) > - move to TCP_CLOSING when we receive a DATA_FIN > You're correct, that's what I was supposed to do based on the state diagram in RFC6824 appendix C. > is the 'ack' argument true when there aren't any unacked data at the > MPTCP level?!? It will be up to the caller to set 'ack' to true only when the sequence number of the DATA_FIN is acked by the peer. That isn't implemented yet. > >> + case TCP_FIN_WAIT2: >> + if (fin) >> + return TCP_TIME_WAIT; >> + else >> + return oldstate; > > When receiving the data fin in TCP_FIN_WAIT2 status, I think we can > move directly to TCP_CLOSED and let the subflow handle the time wait. The appendix C diagram transitions from TIME_WAIT to CLOSED when all the subflows have closed, but the state isn't explained further in RFC 6824. In RFC 6824bis, there is another mention of TIME_WAIT (section 3.3.3) and break-before-make - but this doesn't make sense to me since no DATA_FINs have been exchanged in a break-before-make scenario and the MPTCP-level connection would still be ESTABLISHED. Since the DATA_FIN has been acked by the time we are exiting FIN_WAIT2 or CLOSING, there isn't anything to keep track of at the MPTCP level for retransmission purposes. I think you're right that we don't need TIME_WAIT. > > It's not clear to me if the 'fin' argument refears to 'ingress' or > 'egress' fin (or has different meaning according to the current sk > state). The 'fin' and 'ack' arguments are referring to incoming DATA_FIN/DATA_ACK. I can modify those names to clarify. > >> + case TCP_CLOSING: >> + if (ack) >> + return TCP_TIME_WAIT; >> + else >> + return oldstate; > > I think here we have to check (snd_una == data_fin_seq), and move > directly to TCP_CLOSED - the subflows will handle the time wait I was planning to have the caller check that. > >> + case TCP_LAST_ACK: >> + if (ack) >> + return TCP_CLOSE; >> + else >> + return oldstate; > >> + case TCP_ESTABLISHED: >> + return TCP_CLOSE_WAIT; > > I think we should enter TCP_CLOSE_WAIT iff we received a data fin, and > it looks like we need two additional cases: > - TCP_CLOSE_WAIT -> TCP_LAST_ACK, when we send data fin I handled that transition in mptcp_stream_shutdown, but it should be here for clarity. > - TCP_LAST_ACK -> TCP_CLOSED when our fin is acked That's what is above in the TCP_LAST_ACK case. > > Finally I think we can avoid the TCP_TIME_WAIT at all, and we could > handle some additional corner cases: > - all subflows are closed (due to TCP reset) -> move msk state to > TCP_CLOSE > - what if we receive TCP_FIN on all subflows, but no MPTCP data fin ? Either of these could be a break-before-make scenario, but for now we do need to change the MPTCP state to TCP_CLOSE when all subflows are gone. > > [...] >> @@ -545,6 +588,20 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, >> continue; >> } >> } >> + >> + if (!subflow->map_valid && subflow->incoming_data_fin) { >> + int newstate; >> + >> + // TODO: Send on all subflows? >> + newstate = mptcp_close_state(sk->sk_state, true, false); >> + msk->ack_seq++; >> + subflow->incoming_data_fin = 0; >> + subflow->send_data_fin = 1; >> + subflow->data_fin_seq = msk->write_seq; > > The above means that the msk socket will send DATA_FIN after receiving > one, regardless of shutdown(), right? I think that is not required ?!? You're right - need to check newstate here and only send DATA_FIN in CLOSE_WAIT. > > [...] >> @@ -747,11 +804,11 @@ static void mptcp_close(struct sock *sk, long timeout) >> struct mptcp_sock *msk = mptcp_sk(sk); >> struct socket *ssk = NULL; >> >> - mptcp_token_destroy(msk->token); >> - inet_sk_state_store(sk, TCP_CLOSE); >> - >> lock_sock(sk); >> >> + inet_sk_state_store(sk, TCP_CLOSE); >> + mptcp_token_destroy(msk->token); >> + > > Don't we need to apply the above state machine here? and possibly use > sk_stream_wait_close()? (with appropriate states handling, we can > possibly re-use as it) Yes, need the state machine here. > >> if (msk->subflow) { >> ssk = msk->subflow; >> msk->subflow = NULL; > > [...] >> @@ -1262,6 +1330,18 @@ static int mptcp_shutdown(struct socket *sock, int how) >> pr_debug("sk=%p, how=%d", msk, how); >> >> lock_sock(sock->sk); >> + >> + if (sock->sk->sk_state == TCP_ESTABLISHED) { >> + inet_sk_state_store(sock->sk, TCP_FIN_WAIT1); >> + } else if (sock->sk->sk_state == TCP_CLOSE_WAIT) { >> + inet_sk_state_store(sock->sk, TCP_LAST_ACK); >> + } else if (sock->sk->sk_state != TCP_CLOSE) { >> + pr_warn("Shutdown from unexpected state %d", >> + sock->sk->sk_state); >> + release_sock(sock->sk); >> + return -EALREADY; >> + } >> + > > I think we can re-use the state machine for the above transition? > Shoul we need to check for: (how & SEND_SHUTDOWN) before doinf any > mptcp level action? > Should we apply the state changes only if !fallback, and ev copying the > fallback socket status ? Yes, yes, and yes. > > >> ssock = __mptcp_fallback_get_ref(msk); >> if (ssock) { >> release_sock(sock->sk); >> @@ -1276,6 +1356,8 @@ static int mptcp_shutdown(struct socket *sock, int how) >> >> tcp_socket = mptcp_subflow_tcp_socket(subflow); >> pr_debug("conn_list->subflow=%p", subflow); >> + subflow->send_data_fin = 1; >> + subflow->data_fin_seq = msk->write_seq; >> ret = kernel_sock_shutdown(tcp_socket, how); >> } >> release_sock(sock->sk); > > I think here we must avoid shutting down the subflow (but likely we can > shut them down when msk transition to TCP_FIN_WAIT2. > > Additionally we can consider calling tcp_close() on all subflows when > msk transition to TCP_CLOSE state Ok, both of those are part of the state diagram but not reflected in the code. > >> @@ -1303,7 +1385,7 @@ void mptcp_proto_init(void) >> mptcp_stream_ops.accept = mptcp_stream_accept; >> mptcp_stream_ops.getname = mptcp_getname; >> mptcp_stream_ops.listen = mptcp_listen; >> - mptcp_stream_ops.shutdown = mptcp_shutdown; >> + mptcp_stream_ops.shutdown = mptcp_stream_shutdown; >> >> if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL)) >> panic("Failed to allocate MPTCP pcpu counter\n"); >> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h >> index 8fe5f9383d38..cf90014974ea 100644 >> --- a/net/mptcp/protocol.h >> +++ b/net/mptcp/protocol.h >> @@ -217,9 +217,12 @@ struct mptcp_subflow_context { >> map_valid : 1, >> backup : 1, >> data_avail : 1, >> - rx_eof : 1; >> + rx_eof : 1, >> + incoming_data_fin : 1, >> + send_data_fin : 1; >> u32 remote_nonce; >> u64 thmac; >> + u64 data_fin_seq; >> u32 local_nonce; >> u32 remote_token; >> u8 hmac[MPTCPOPT_HMAC_LEN]; > > Possibly we can move this field to msk? and possibly we don't need > send_data_fin? Yeah, data_fin_seq can be moved up to msk. > > if we have a dss, we send data fin iff dss map matches msk data_fin_seq > if this is not a data packet (no dss) we send the data_fin (once ???) > if msk->write_seq == data_fin_seq. It's that "once" part that made me add send_data_fin. mptcp_write_data_fin() has this logic. Thanks for your review! -- Mat Martineau Intel
On Wed, 2019-11-06 at 16:37 -0800, Mat Martineau wrote: > On Wed, 6 Nov 2019, Paolo Abeni wrote: > > I think we don't need a workqueue here: subflow could simply set > > (WRITE_ONCE) one/some msk level field[s]. > > > > I don't think we need to deal with subflows concurrently writing > > different values there - it will be a protocol violation, the stream > > will be corrupted anyway. > > The goal of the workqueue would be to process the DATA_FIN (and ack it if > the recv side is caught up) even when there's no pending send/recv from > userspace to wake up. That means the subflow error queue wouldn't work > either. I mean: I think we can update msk->incoming_data_fin from mptcp_incoming_options() even without locking, just using WRITE_ONCE(). Additionally we could use cmpxchg(), if we want detect cuncurrent modification from multiple subflows, but that looks more a debug features than a functional requirement to me. > > [...] > > > @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) > > > lock_sock(ssk); > > > mptcp_clean_una(sk); > > > timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); > > > + > > > + if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { > > > + ret = sk_stream_wait_connect(sk, &timeo); > > > + if (ret != 0) > > > + goto put_out_subflow; > > > + } > > > + > > > while (msg_data_left(msg)) { > > > ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now, > > > &size_goal); > > > > I think waiting is not needed here?!? we could just bail out with > > ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located > > before selecting the egress subflow: it should be outside the ssk lock. > > I agree that the sk state check and possible waiting need to happen > outside the ssk lock, thank you for spotting that. > > Waiting isn't needed for the states at the end of the connection, but I > saw that TCP sendmsg/do_tcp_sendpages verified the connection state prior > to sending in this way. Uhm... overall this looks a bugfix, perhaps it could be placed into it's own patch for review's sake... > What if I move this check very early in the > function, before the fallback sock_sendmsg()? If there's an attempt to > send before the handshake is done, we don't want to be waiting in the > regular TCP sendmsg(). I think you are right, otherwise we use the subflow as fallback, even if the MP_CAPABLE handshake is still running and will complete successfully. I additionally think we need something similar even for recvmsg(), for similar reasons Side note: with v1 MP_CAPABLE we will a similar problem, e.g. we can start recvmsg() when the MP_CAPABLE handshake is not completed, but the tcp connection is established (because v1 has a forth MP_CAPABLE message). In theat scenario we will use the subflow as a MPTCP enabled one - and block with no data. Later the MP_CAPABLE handshake can complete even unsuccessfully, and we have to fallback to plain TCP, while still waiting in the MPTCP- specific part of mptcp_recvmsg() - will be fun ;) Cheers, Paolo
Hi Mat, On 06/11/2019 05:50, Mat Martineau wrote: > Send and process received DATA_FIN options. This requires addition of a > state machine to the MPTCP socket to handle the closing process. DATA_FIN > is sent along with a DSS mapping for the final data transmitted, or on a > bare ACK if there is no data to transmit. > > This is still a work in progress, including these areas: > > * Need to finish handling of DSS_ACKs for the state machine. > > * Need to double-check whether it's correct to bypass inet_shutdown > as the current code does. > > * A DATA_FIN that arrives on a packet without data payload (unless it's > a TCP FIN) are discarded by tcp_data_queue() and not propagated to > the MPTCP socket. This is detected in mptcp_incoming_options() but > not yet handled. The skb with the DATA_FIN could be cloned and > placed in the subflow error queue, or a workqueue could handle it. > > Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Yesterday at the meeting we talked a bit about this patch. Here is what we said: - DATA_FIN: - Mat is working on it - It seems still to be too big for the "initial submission" with the minimal feature - because sending the DATA_FIN might be required for some implementations (not closing the MPTCP connections), it could be good to at least send the DATA_FIN, without the full support. - *Mat*: maybe for the initial submission, part 2, where only 1 subflow is supported, could we send the DATA_FIN just before closing the connection? With the minimal support. What do you think about that? Cheers and take care, Matt
On Fri, 8 Nov 2019, Matthieu Baerts wrote: > Hi Mat, > > On 06/11/2019 05:50, Mat Martineau wrote: >> Send and process received DATA_FIN options. This requires addition of a >> state machine to the MPTCP socket to handle the closing process. DATA_FIN >> is sent along with a DSS mapping for the final data transmitted, or on a >> bare ACK if there is no data to transmit. >> >> This is still a work in progress, including these areas: >> >> * Need to finish handling of DSS_ACKs for the state machine. >> >> * Need to double-check whether it's correct to bypass inet_shutdown >> as the current code does. >> >> * A DATA_FIN that arrives on a packet without data payload (unless it's >> a TCP FIN) are discarded by tcp_data_queue() and not propagated to >> the MPTCP socket. This is detected in mptcp_incoming_options() but >> not yet handled. The skb with the DATA_FIN could be cloned and >> placed in the subflow error queue, or a workqueue could handle it. >> >> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> > > Yesterday at the meeting we talked a bit about this patch. Here is what we > said: > > > - DATA_FIN: > - Mat is working on it > - It seems still to be too big for the "initial submission" with the > minimal feature > - because sending the DATA_FIN might be required for some > implementations (not closing the MPTCP connections), it could be good to at > least send the DATA_FIN, without the full support. > > - *Mat*: maybe for the initial submission, part 2, where only 1 > subflow is supported, could we send the DATA_FIN just before closing the > connection? With the minimal support. > > > What do you think about that? > For the single-subflow case, that's a good point. It will probably work well enough to set DATA_FIN whenever FIN is sent. I'll focus on that first and then see if there's also a single-subflow shortcut for acking the peer's DATA_FIN. -- Mat Martineau Intel
diff --git a/net/mptcp/options.c b/net/mptcp/options.c index 1b08e1193991..4f67826a4977 100644 --- a/net/mptcp/options.c +++ b/net/mptcp/options.c @@ -384,18 +384,43 @@ static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size, return false; } +static void mptcp_write_data_fin(struct mptcp_subflow_context *subflow, + struct mptcp_ext *ext) +{ + struct mptcp_sock *msk = mptcp_sk(subflow->conn); + + WARN_ONCE(!subflow->conn, "MPTCP: Missing MPTCP socket information"); + + /* Only send DATA_FIN if all data has been sent or this is the last + * mapping. + */ + if (!ext->use_map) { + if (subflow->data_fin_seq == msk->write_seq) { + ext->use_map = 1; + ext->dsn64 = 1; + ext->data_seq = subflow->data_fin_seq; + ext->subflow_seq = 0; + ext->data_len = 1; + ext->data_fin = 1; + } + } else if (subflow->data_fin_seq == ext->data_seq + ext->data_len) { + ext->data_fin = 1; + } +} + static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb, unsigned int *size, unsigned int remaining, struct mptcp_out_options *opts) { + struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk); unsigned int dss_size = 0; struct mptcp_ext *mpext; unsigned int ack_size; mpext = skb ? mptcp_get_ext(skb) : NULL; - if (!skb || (mpext && mpext->use_map)) { + if (!skb || (mpext && mpext->use_map) || subflow->send_data_fin) { unsigned int map_size; map_size = TCPOLEN_MPTCP_DSS_BASE + TCPOLEN_MPTCP_DSS_MAP64; @@ -405,6 +430,9 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb, dss_size = map_size; if (mpext) opts->ext_copy = *mpext; + + if (subflow->send_data_fin && skb) + mptcp_write_data_fin(subflow, &opts->ext_copy); } else { opts->ext_copy.use_map = 0; WARN_ONCE(1, "MPTCP: Map dropped"); @@ -422,7 +450,7 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb, dss_size += ack_size; - msk = mptcp_sk(mptcp_subflow_ctx(sk)->conn); + msk = mptcp_sk(subflow->conn); if (msk) { opts->ext_copy.data_ack = msk->ack_seq; } else { @@ -441,6 +469,9 @@ static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb, if (!dss_size) return false; + if (opts->ext_copy.data_fin) + subflow->send_data_fin = 0; + *size = ALIGN(dss_size, 4); return true; } @@ -632,6 +663,18 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb, mpext->data_fin = mp_opt->data_fin; + if (mp_opt->data_fin && + TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) { + /* The DATA_FIN is on a packet that will be discarded by + * tcp_data_queue() and will not get propagated to the MPTCP + * socket. + * + * Use workqueue or subflow error queue? + */ + pr_warn("Ignored DATA_FIN"); + } + + if (msk) mptcp_pm_fully_established(msk); } @@ -727,10 +770,11 @@ void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts) * support for optional 32-bit mappings later. */ flags |= MPTCP_DSS_HAS_MAP | MPTCP_DSS_DSN64; - if (mpext->data_fin) - flags |= MPTCP_DSS_DATA_FIN; } + if (mpext->data_fin) + flags |= MPTCP_DSS_DATA_FIN; + *ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags); if (mpext->use_ack) { diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c index 696f1d03a61d..6d3b4a0a8dc1 100644 --- a/net/mptcp/protocol.c +++ b/net/mptcp/protocol.c @@ -101,7 +101,7 @@ static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk) sock_owned_by_me(sk); mptcp_for_each_subflow(msk, subflow) { - if (subflow->data_avail) + if (subflow->data_avail || subflow->incoming_data_fin) return mptcp_subflow_tcp_socket(subflow)->sk; receivers += !subflow->rx_eof; @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) lock_sock(ssk); mptcp_clean_una(sk); timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT); + + if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) { + ret = sk_stream_wait_connect(sk, &timeo); + if (ret != 0) + goto put_out_subflow; + } + while (msg_data_left(msg)) { ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now, &size_goal); @@ -403,6 +410,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len) mptcp_reset_timer(sk); } +put_out_subflow: release_sock(ssk); out: @@ -464,6 +472,41 @@ static void mptcp_wait_data(struct sock *sk, long *timeo) remove_wait_queue(sk_sleep(sk), &wait); } +static int mptcp_close_state(int oldstate, bool fin, bool ack) +{ + switch (oldstate) { + case TCP_FIN_WAIT1: + if (fin && ack) + return TCP_TIME_WAIT; + else if (fin) + return TCP_FIN_WAIT2; + else + return TCP_CLOSING; + case TCP_FIN_WAIT2: + if (fin) + return TCP_TIME_WAIT; + else + return oldstate; + case TCP_CLOSING: + if (ack) + return TCP_TIME_WAIT; + else + return oldstate; + case TCP_TIME_WAIT: + return TCP_CLOSE; + case TCP_LAST_ACK: + if (ack) + return TCP_CLOSE; + else + return oldstate; + case TCP_ESTABLISHED: + return TCP_CLOSE_WAIT; + default: + pr_debug("Unexpected state with DATA_FIN: %d", oldstate); + return oldstate; + } +} + static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, int nonblock, int flags, int *addr_len) { @@ -545,6 +588,20 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len, continue; } } + + if (!subflow->map_valid && subflow->incoming_data_fin) { + int newstate; + + // TODO: Send on all subflows? + newstate = mptcp_close_state(sk->sk_state, true, false); + msk->ack_seq++; + subflow->incoming_data_fin = 0; + subflow->send_data_fin = 1; + subflow->data_fin_seq = msk->write_seq; + inet_sk_state_store(sk, newstate); + tcp_send_ack(ssk); + } + release_sock(ssk); continue; @@ -747,11 +804,11 @@ static void mptcp_close(struct sock *sk, long timeout) struct mptcp_sock *msk = mptcp_sk(sk); struct socket *ssk = NULL; - mptcp_token_destroy(msk->token); - inet_sk_state_store(sk, TCP_CLOSE); - lock_sock(sk); + inet_sk_state_store(sk, TCP_CLOSE); + mptcp_token_destroy(msk->token); + if (msk->subflow) { ssk = msk->subflow; msk->subflow = NULL; @@ -764,6 +821,8 @@ static void mptcp_close(struct sock *sk, long timeout) list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) { pr_debug("conn_list->subflow=%p", subflow); + subflow->send_data_fin = 1; + subflow->data_fin_seq = msk->write_seq; sock_release(mptcp_subflow_tcp_socket(subflow)); } @@ -930,6 +989,11 @@ static int mptcp_getsockopt(struct sock *sk, int level, int optname, return -EOPNOTSUPP; } +static void mptcp_shutdown(struct sock *sock, int how) +{ + pr_err("Unexpected MPTCP shutdown call sock=%p how=%d", sock, how); +} + #define MPTCP_DEFERRED_ALL TCPF_WRITE_TIMER_DEFERRED /* this is very alike tcp_release_cb() but we must handle differently a @@ -1055,7 +1119,7 @@ static struct proto mptcp_prot = { .accept = mptcp_accept, .setsockopt = mptcp_setsockopt, .getsockopt = mptcp_getsockopt, - .shutdown = tcp_shutdown, + .shutdown = mptcp_shutdown, .destroy = mptcp_destroy, .sendmsg = mptcp_sendmsg, .recvmsg = mptcp_recvmsg, @@ -1199,6 +1263,10 @@ static int mptcp_listen(struct socket *sock, int backlog) err = inet_listen(ssock, backlog); sock_put(ssock->sk); + + if (!err) + inet_sk_state_store(sock->sk, TCP_LISTEN); + return err; } @@ -1252,7 +1320,7 @@ static __poll_t mptcp_poll(struct file *file, struct socket *sock, return ret; } -static int mptcp_shutdown(struct socket *sock, int how) +static int mptcp_stream_shutdown(struct socket *sock, int how) { struct mptcp_sock *msk = mptcp_sk(sock->sk); struct mptcp_subflow_context *subflow; @@ -1262,6 +1330,18 @@ static int mptcp_shutdown(struct socket *sock, int how) pr_debug("sk=%p, how=%d", msk, how); lock_sock(sock->sk); + + if (sock->sk->sk_state == TCP_ESTABLISHED) { + inet_sk_state_store(sock->sk, TCP_FIN_WAIT1); + } else if (sock->sk->sk_state == TCP_CLOSE_WAIT) { + inet_sk_state_store(sock->sk, TCP_LAST_ACK); + } else if (sock->sk->sk_state != TCP_CLOSE) { + pr_warn("Shutdown from unexpected state %d", + sock->sk->sk_state); + release_sock(sock->sk); + return -EALREADY; + } + ssock = __mptcp_fallback_get_ref(msk); if (ssock) { release_sock(sock->sk); @@ -1276,6 +1356,8 @@ static int mptcp_shutdown(struct socket *sock, int how) tcp_socket = mptcp_subflow_tcp_socket(subflow); pr_debug("conn_list->subflow=%p", subflow); + subflow->send_data_fin = 1; + subflow->data_fin_seq = msk->write_seq; ret = kernel_sock_shutdown(tcp_socket, how); } release_sock(sock->sk); @@ -1303,7 +1385,7 @@ void mptcp_proto_init(void) mptcp_stream_ops.accept = mptcp_stream_accept; mptcp_stream_ops.getname = mptcp_getname; mptcp_stream_ops.listen = mptcp_listen; - mptcp_stream_ops.shutdown = mptcp_shutdown; + mptcp_stream_ops.shutdown = mptcp_stream_shutdown; if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL)) panic("Failed to allocate MPTCP pcpu counter\n"); diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h index 8fe5f9383d38..cf90014974ea 100644 --- a/net/mptcp/protocol.h +++ b/net/mptcp/protocol.h @@ -217,9 +217,12 @@ struct mptcp_subflow_context { map_valid : 1, backup : 1, data_avail : 1, - rx_eof : 1; + rx_eof : 1, + incoming_data_fin : 1, + send_data_fin : 1; u32 remote_nonce; u64 thmac; + u64 data_fin_seq; u32 local_nonce; u32 remote_token; u8 hmac[MPTCPOPT_HMAC_LEN]; diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c index f08c4b713978..e39b71094b12 100644 --- a/net/mptcp/subflow.c +++ b/net/mptcp/subflow.c @@ -315,7 +315,7 @@ enum mapping_status { MAPPING_OK, MAPPING_INVALID, MAPPING_EMPTY, - MAPPING_DATA_FIN + MAPPING_BARE_DATA_FIN }; static u64 expand_seq(u64 old_seq, u16 old_data_len, u64 seq) @@ -406,16 +406,13 @@ static enum mapping_status get_mapping_status(struct sock *ssk) pr_err("Infinite mapping not handled"); MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPRX); return MAPPING_INVALID; - } else if (mpext->subflow_seq == 0 && - mpext->data_fin == 1) { - if (WARN_ON_ONCE(mpext->data_len != 1)) - return false; + } - /* do not try hard to handle this any better, till we have - * real data_fin support - */ - pr_debug("DATA_FIN with no payload"); - return MAPPING_DATA_FIN; + if (mpext->data_fin && mpext->subflow_seq == 0 && + mpext->data_len == 1) { + subflow->map_valid = 0; + subflow->incoming_data_fin = 1; + return MAPPING_BARE_DATA_FIN; } if (!mpext->dsn64) { @@ -450,6 +447,7 @@ static enum mapping_status get_mapping_status(struct sock *ssk) subflow->map_seq = map_seq; subflow->map_subflow_seq = mpext->subflow_seq; subflow->map_data_len = mpext->data_len; + subflow->incoming_data_fin = mpext->data_fin; subflow->map_valid = 1; pr_debug("new map seq=%llu subflow_seq=%u data_len=%u", subflow->map_seq, subflow->map_subflow_seq,
Send and process received DATA_FIN options. This requires addition of a state machine to the MPTCP socket to handle the closing process. DATA_FIN is sent along with a DSS mapping for the final data transmitted, or on a bare ACK if there is no data to transmit. This is still a work in progress, including these areas: * Need to finish handling of DSS_ACKs for the state machine. * Need to double-check whether it's correct to bypass inet_shutdown as the current code does. * A DATA_FIN that arrives on a packet without data payload (unless it's a TCP FIN) are discarded by tcp_data_queue() and not propagated to the MPTCP socket. This is detected in mptcp_incoming_options() but not yet handled. The skb with the DATA_FIN could be cloned and placed in the subflow error queue, or a workqueue could handle it. Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> --- net/mptcp/options.c | 52 ++++++++++++++++++++++-- net/mptcp/protocol.c | 96 ++++++++++++++++++++++++++++++++++++++++---- net/mptcp/protocol.h | 5 ++- net/mptcp/subflow.c | 18 ++++----- 4 files changed, 149 insertions(+), 22 deletions(-)