mptcp: Add DATA_FIN transmission and handling

Message ID 20191106045011.10071-1-mathew.j.martineau@linux.intel.com
State Deferred, archived
Delegated to: Paolo Abeni
Series: mptcp: Add DATA_FIN transmission and handling

Commit Message

Mat Martineau Nov. 6, 2019, 4:50 a.m. UTC
Send and process received DATA_FIN options. This requires addition of a
state machine to the MPTCP socket to handle the closing process. DATA_FIN
is sent along with a DSS mapping for the final data transmitted, or on a
bare ACK if there is no data to transmit.

This is still a work in progress, including these areas:

 * Need to finish handling of DSS_ACKs for the state machine.

 * Need to double-check whether it's correct to bypass inet_shutdown
   as the current code does.

 * A DATA_FIN that arrives on a packet without data payload (unless it's
   a TCP FIN) is discarded by tcp_data_queue() and not propagated to
   the MPTCP socket. This is detected in mptcp_incoming_options() but
   not yet handled. The skb with the DATA_FIN could be cloned and
   placed in the subflow error queue, or a workqueue could handle it.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
---
 net/mptcp/options.c  | 52 ++++++++++++++++++++++--
 net/mptcp/protocol.c | 96 ++++++++++++++++++++++++++++++++++++++++----
 net/mptcp/protocol.h |  5 ++-
 net/mptcp/subflow.c  | 18 ++++-----
 4 files changed, 149 insertions(+), 22 deletions(-)

Comments

Florian Westphal Nov. 6, 2019, 9:29 a.m. UTC | #1
Mat Martineau <mathew.j.martineau@linux.intel.com> wrote:
>  * A DATA_FIN that arrives on a packet without data payload (unless it's
>    a TCP FIN) are discarded by tcp_data_queue() and not propagated to
>    the MPTCP socket. This is detected in mptcp_incoming_options() but
>    not yet handled. The skb with the DATA_FIN could be cloned and
>    placed in the subflow error queue, or a workqueue could handle it.

I have a pending patch that adds an ack workqueue to deal with the
case where we get an mptcp ack for an mptcp socket that has exceeded
its wmem limit; it can be reused for this too.

I'll send it soon.
Paolo Abeni Nov. 6, 2019, 10:58 a.m. UTC | #2
Hi,

Thanks for doing this!

I have quite a few comments below... As a general note, this patch
looks quite complex, what about splitting it into a [small] series? I
think it would be nice:

- introducing the state machine in a separate patch, 
- possibly another one to deal with state update on fallback, 
- possibly a separate patch to fix status update on listen
- data fin handling code

On Tue, 2019-11-05 at 20:50 -0800, Mat Martineau wrote:
> @@ -632,6 +663,18 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
>  
>  	mpext->data_fin = mp_opt->data_fin;
>  
> +	if (mp_opt->data_fin &&
> +	    TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
> +		/* The DATA_FIN is on a packet that will be discarded by
> +		 * tcp_data_queue() and will not get propagated to the MPTCP
> +		 * socket.
> +		 *
> +		 * Use workqueue or subflow error queue?
> +		 */
> +		pr_warn("Ignored DATA_FIN");
> +	}
> +
> +
>  	if (msk)
>  		mptcp_pm_fully_established(msk);
>  }

I think we don't need a workqueue here: subflow could simply set
(WRITE_ONCE) one/some msk level field[s].

I don't think we need to deal with subflows concurrently writing
different values there - it will be a protocol violation, the stream
will be corrupted anyway.
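
I.e. something like this in mptcp_incoming_options() (rough sketch, the
msk fields and mp_opt->data_seq are just placeholders here):

	if (mp_opt->data_fin && msk &&
	    TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
		/* record the bare DATA_FIN at the msk level; the msk acts
		 * on it the next time it runs, no locking needed
		 */
		WRITE_ONCE(msk->rcv_data_fin_seq, mp_opt->data_seq);
		WRITE_ONCE(msk->rcv_data_fin, 1);
	}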

[...]
> @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  	lock_sock(ssk);
>  	mptcp_clean_una(sk);
>  	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> +
> +	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
> +		ret = sk_stream_wait_connect(sk, &timeo);
> +		if (ret != 0)
> +			goto put_out_subflow;
> +	}
> +
>  	while (msg_data_left(msg)) {
>  		ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now,
>  					 &size_goal);

 I think waiting is not needed here?!? we could just bail out with
ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located 
before selecting the egress subflow: it should be outside the ssk lock.
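
I.e. something like (rough sketch, re-using the existing 'out' label):

	/* before selecting the egress subflow / taking the ssk lock;
	 * TCP_CLOSE_WAIT presumably still has to be allowed to send
	 */
	if (sk->sk_state >= TCP_FIN_WAIT1 && sk->sk_state != TCP_CLOSE_WAIT) {
		ret = -ENOTCONN;
		goto out;
	}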

[...]
> @@ -403,6 +410,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>  			mptcp_reset_timer(sk);
>  	}
>  
> +put_out_subflow:
>  	release_sock(ssk);
>  
>  out:
> @@ -464,6 +472,41 @@ static void mptcp_wait_data(struct sock *sk, long *timeo)
>  	remove_wait_queue(sk_sleep(sk), &wait);
>  }
>  
> +static int mptcp_close_state(int oldstate, bool fin, bool ack)
> +{
> +	switch (oldstate) {
> +	case TCP_FIN_WAIT1:
> +		if (fin && ack)
> +			return TCP_TIME_WAIT;
> +		else if (fin)
> +			return TCP_FIN_WAIT2;
> +		else
> +			return TCP_CLOSING;

As we are in 'TCP_FIN_WAIT1', we sent a DATA_FIN, I think we should:

- move to TCP_FIN_WAIT2 when the fin is acked (snd_una == data_fin_seq)
- move to TCP_CLOSING when we receive a DATA_FIN

is the 'ack' argument true when there aren't any unacked data at the
MPTCP level?!?

> +	case TCP_FIN_WAIT2:
> +		if (fin)
> +			return TCP_TIME_WAIT;
> +		else
> +			return oldstate;

When receiving the data fin in TCP_FIN_WAIT2 status, I think we can
move directly to TCP_CLOSED and let the subflow handle the time wait.

It's not clear to me if the 'fin' argument refers to 'ingress' or
'egress' fin (or has different meaning according to the current sk
state).

> +	case TCP_CLOSING:
> +		if (ack)
> +			return TCP_TIME_WAIT;
> +		else
> +			return oldstate;

I think here we have to check (snd_una == data_fin_seq), and move
directly to TCP_CLOSED - the subflows will handle the time wait

> +	case TCP_LAST_ACK:
> +		if (ack)
> +			return TCP_CLOSE;
> +		else
> +			return oldstate;

> +	case TCP_ESTABLISHED:
> +		return TCP_CLOSE_WAIT;

I think we should enter TCP_CLOSE_WAIT iff we received a data fin, and
it looks like we need two additional cases: 
- TCP_CLOSE_WAIT -> TCP_LAST_ACK, when we send data fin
- TCP_LAST_ACK -> TCP_CLOSED when our fin is acked

Finally I think we can avoid TCP_TIME_WAIT altogether, and we could
handle some additional corner cases:
- all subflows are closed (due to TCP reset) -> move msk state to
TCP_CLOSE
- what if we receive TCP_FIN on all subflows, but no MPTCP data fin ?
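
Putting the above together, a rough (untested) sketch of the transitions
I have in mind - 'rx_fin' meaning we received a DATA_FIN, 'fin_acked'
meaning our own DATA_FIN was acked (snd_una == data_fin_seq):

static int mptcp_close_state(int oldstate, bool rx_fin, bool fin_acked)
{
	switch (oldstate) {
	case TCP_ESTABLISHED:
		return rx_fin ? TCP_CLOSE_WAIT : oldstate;
	case TCP_FIN_WAIT1:
		if (rx_fin && fin_acked)
			return TCP_CLOSE;	/* no msk-level TIME_WAIT */
		if (fin_acked)
			return TCP_FIN_WAIT2;
		if (rx_fin)
			return TCP_CLOSING;
		return oldstate;
	case TCP_FIN_WAIT2:
		/* the subflows handle the time wait */
		return rx_fin ? TCP_CLOSE : oldstate;
	case TCP_CLOSING:
	case TCP_LAST_ACK:
		return fin_acked ? TCP_CLOSE : oldstate;
	case TCP_CLOSE_WAIT:
		/* -> TCP_LAST_ACK is driven by sending our own DATA_FIN,
		 * not by an incoming event, so the caller handles it
		 */
		return oldstate;
	default:
		return oldstate;
	}
}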

[...]
> @@ -545,6 +588,20 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>  				continue;
>  			}
>  		}
> +
> +		if (!subflow->map_valid && subflow->incoming_data_fin) {
> +			int newstate;
> +
> +			// TODO: Send on all subflows?
> +			newstate = mptcp_close_state(sk->sk_state, true, false);
> +			msk->ack_seq++;
> +			subflow->incoming_data_fin = 0;
> +			subflow->send_data_fin = 1;
> +			subflow->data_fin_seq = msk->write_seq;

The above means that the msk socket will send DATA_FIN after receiving
one, regardless of shutdown(), right? I think that is not required ?!?

[...]
> @@ -747,11 +804,11 @@ static void mptcp_close(struct sock *sk, long timeout)
>  	struct mptcp_sock *msk = mptcp_sk(sk);
>  	struct socket *ssk = NULL;
>  
> -	mptcp_token_destroy(msk->token);
> -	inet_sk_state_store(sk, TCP_CLOSE);
> -
>  	lock_sock(sk);
>  
> +	inet_sk_state_store(sk, TCP_CLOSE);
> +	mptcp_token_destroy(msk->token);
> +

Don't we need to apply the above state machine here? And possibly use
sk_stream_wait_close()? (with appropriate state handling, we can possibly
re-use it as is)

>  	if (msk->subflow) {
>  		ssk = msk->subflow;
>  		msk->subflow = NULL;
 
[...]
> @@ -1262,6 +1330,18 @@ static int mptcp_shutdown(struct socket *sock, int how)
>  	pr_debug("sk=%p, how=%d", msk, how);
>  
>  	lock_sock(sock->sk);
> +
> +	if (sock->sk->sk_state == TCP_ESTABLISHED) {
> +		inet_sk_state_store(sock->sk, TCP_FIN_WAIT1);
> +	} else if (sock->sk->sk_state == TCP_CLOSE_WAIT) {
> +		inet_sk_state_store(sock->sk, TCP_LAST_ACK);
> +	} else if (sock->sk->sk_state != TCP_CLOSE) {
> +		pr_warn("Shutdown from unexpected state %d",
> +			sock->sk->sk_state);
> +		release_sock(sock->sk);
> +		return -EALREADY;
> +	}
> +

I think we can re-use the state machine for the above transition? 
Should we check for (how & SEND_SHUTDOWN) before doing any MPTCP-level
action?
Should we apply the state changes only if !fallback, and possibly copy the
fallback socket status?


>  	ssock = __mptcp_fallback_get_ref(msk);
>  	if (ssock) {
>  		release_sock(sock->sk);
> @@ -1276,6 +1356,8 @@ static int mptcp_shutdown(struct socket *sock, int how)
>  
>  		tcp_socket = mptcp_subflow_tcp_socket(subflow);
>  		pr_debug("conn_list->subflow=%p", subflow);
> +		subflow->send_data_fin = 1;
> +		subflow->data_fin_seq = msk->write_seq;
>  		ret = kernel_sock_shutdown(tcp_socket, how);
>  	}
>  	release_sock(sock->sk);

I think here we must avoid shutting down the subflows (but likely we can
shut them down when the msk transitions to TCP_FIN_WAIT2).

Additionally we can consider calling tcp_close() on all subflows when the
msk transitions to the TCP_CLOSE state.
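
Something like (sketch):

	/* propagate the shutdown to the subflows only once the msk has
	 * reached TCP_FIN_WAIT2; actually closing them (as mptcp_close()
	 * does) can then wait until the msk hits TCP_CLOSE
	 */
	if (sk->sk_state == TCP_FIN_WAIT2) {
		mptcp_for_each_subflow(msk, subflow)
			kernel_sock_shutdown(mptcp_subflow_tcp_socket(subflow),
					     SHUT_WR);
	}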

> @@ -1303,7 +1385,7 @@ void mptcp_proto_init(void)
>  	mptcp_stream_ops.accept = mptcp_stream_accept;
>  	mptcp_stream_ops.getname = mptcp_getname;
>  	mptcp_stream_ops.listen = mptcp_listen;
> -	mptcp_stream_ops.shutdown = mptcp_shutdown;
> +	mptcp_stream_ops.shutdown = mptcp_stream_shutdown;
>  
>  	if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
>  		panic("Failed to allocate MPTCP pcpu counter\n");
> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
> index 8fe5f9383d38..cf90014974ea 100644
> --- a/net/mptcp/protocol.h
> +++ b/net/mptcp/protocol.h
> @@ -217,9 +217,12 @@ struct mptcp_subflow_context {
>  		map_valid : 1,
>  		backup : 1,
>  		data_avail : 1,
> -		rx_eof : 1;
> +		rx_eof : 1,
> +		incoming_data_fin : 1,
> +		send_data_fin : 1;
>  	u32	remote_nonce;
>  	u64	thmac;
> +	u64	data_fin_seq;
>  	u32	local_nonce;
>  	u32	remote_token;
>  	u8	hmac[MPTCPOPT_HMAC_LEN];

Possibly we can move this field to msk? and possibly we don't need
send_data_fin?

If we have a DSS, we send the data fin iff the DSS map matches msk
data_fin_seq; if this is not a data packet (no DSS), we send the data_fin
(once ???) if msk->write_seq == data_fin_seq.
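
Roughly (sketch, assuming data_fin_seq moved into the msk):

	/* DSS with a mapping: piggy-back the DATA_FIN on the last mapping */
	if (ext->use_map) {
		if (msk->data_fin_seq == ext->data_seq + ext->data_len)
			ext->data_fin = 1;
	/* no data left: emit the DATA_FIN on its own, exactly once */
	} else if (msk->write_seq == msk->data_fin_seq) {
		ext->use_map = 1;
		ext->dsn64 = 1;
		ext->data_seq = msk->data_fin_seq;
		ext->subflow_seq = 0;
		ext->data_len = 1;
		ext->data_fin = 1;
	}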

Cheers,

Paolo
Mat Martineau Nov. 7, 2019, 12:37 a.m. UTC | #3
On Wed, 6 Nov 2019, Paolo Abeni wrote:

> Hi,
>
> Thanks for doing this!
>
> I have quite a few comments below... As a general note, this patch
> looks quite complex, what about splitting it in a [small] series? I
> think it would be nice:
>
> - introducing the state machine in a separate patch,
> - possibly another one to deal with state update on fallback,
> - possibly a separate patch to fix status update on listen
> - data fin handling code
>
> On Tue, 2019-11-05 at 20:50 -0800, Mat Martineau wrote:
>> @@ -632,6 +663,18 @@ void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
>>
>>  	mpext->data_fin = mp_opt->data_fin;
>>
>> +	if (mp_opt->data_fin &&
>> +	    TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
>> +		/* The DATA_FIN is on a packet that will be discarded by
>> +		 * tcp_data_queue() and will not get propagated to the MPTCP
>> +		 * socket.
>> +		 *
>> +		 * Use workqueue or subflow error queue?
>> +		 */
>> +		pr_warn("Ignored DATA_FIN");
>> +	}
>> +
>> +
>>  	if (msk)
>>  		mptcp_pm_fully_established(msk);
>>  }
>
> I think we don't need a workqueue here: subflow could simply set
> (WRITE_ONCE) one/some msk level field[s].
>
> I don't think we need to deal with subflows concurrently writing
> different values there - it will be a protocol violation, the stream
> will be corrupted anyway.

The goal of the workqueue would be to process the DATA_FIN (and ack it if 
the recv side is caught up) even when there's no pending send/recv from 
userspace to wake up. That means the subflow error queue wouldn't work 
either.

>
> [...]
>> @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>  	lock_sock(ssk);
>>  	mptcp_clean_una(sk);
>>  	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
>> +
>> +	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
>> +		ret = sk_stream_wait_connect(sk, &timeo);
>> +		if (ret != 0)
>> +			goto put_out_subflow;
>> +	}
>> +
>>  	while (msg_data_left(msg)) {
>>  		ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now,
>>  					 &size_goal);
>
> I think waiting is not needed here?!? we could just bail out with
> ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located
> before selecting the egress subflow: it should be outside the ssk lock.

I agree that the sk state check and possible waiting need to happen 
outside the ssk lock, thank you for spotting that.

Waiting isn't needed for the states at the end of the connection, but I 
saw that TCP sendmsg/do_tcp_sendpages verified the connection state prior 
to sending in this way. What if I move this check very early in the 
function, before the fallback sock_sendmsg()? If there's an attempt to 
send before the handshake is done, we don't want to be waiting in the 
regular TCP sendmsg().
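
Something like this at the top of mptcp_sendmsg() (rough sketch):

	lock_sock(sk);

	/* check the msk state before considering fallback, so that a send
	 * attempted while the MP_CAPABLE handshake is still in progress
	 * waits here rather than inside the plain TCP sendmsg()
	 */
	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
		timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
		ret = sk_stream_wait_connect(sk, &timeo);
		if (ret)
			goto out;
	}

	ssock = __mptcp_fallback_get_ref(msk);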


>
> [...]
>> @@ -403,6 +410,7 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
>>  			mptcp_reset_timer(sk);
>>  	}
>>
>> +put_out_subflow:
>>  	release_sock(ssk);
>>
>>  out:
>> @@ -464,6 +472,41 @@ static void mptcp_wait_data(struct sock *sk, long *timeo)
>>  	remove_wait_queue(sk_sleep(sk), &wait);
>>  }
>>
>> +static int mptcp_close_state(int oldstate, bool fin, bool ack)
>> +{
>> +	switch (oldstate) {
>> +	case TCP_FIN_WAIT1:
>> +		if (fin && ack)
>> +			return TCP_TIME_WAIT;
>> +		else if (fin)
>> +			return TCP_FIN_WAIT2;
>> +		else
>> +			return TCP_CLOSING;
>
> As we are in 'TCP_FIN_WAIT1', we sent a DATA_FIN, I think we should:
>
> - move to TCP_FIN_WAIT2 when the fin is acked (snd_una == data_fin_seq)
> - move to TCP_CLOSING when we receive a DATA_FIN
>

You're correct, that's what I was supposed to do based on the state 
diagram in RFC6824 appendix C.

> is the 'ack' argument true when there aren't any unacked data at the
> MPTCP level?!?

It will be up to the caller to set 'ack' to true only when the sequence 
number of the DATA_FIN is acked by the peer. That isn't implemented yet.
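
The caller-side check would then be something like (sketch; exact field
names and the off-by-one for the DATA_FIN's own sequence space still to
be sorted out):

	bool data_fin_acked = (msk->snd_una == msk->data_fin_seq);

	inet_sk_state_store(sk, mptcp_close_state(sk->sk_state, false,
						  data_fin_acked));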

>
>> +	case TCP_FIN_WAIT2:
>> +		if (fin)
>> +			return TCP_TIME_WAIT;
>> +		else
>> +			return oldstate;
>
> When receiving the data fin in TCP_FIN_WAIT2 status, I think we can
> move directly to TCP_CLOSED and let the subflow handle the time wait.

The appendix C diagram transitions from TIME_WAIT to CLOSED when all the 
subflows have closed, but the state isn't explained further in RFC 6824. 
In RFC 6824bis, there is another mention of TIME_WAIT (section 3.3.3) and 
break-before-make - but this doesn't make sense to me since no DATA_FINs 
have been exchanged in a break-before-make scenario and the MPTCP-level 
connection would still be ESTABLISHED.

Since the DATA_FIN has been acked by the time we are exiting FIN_WAIT2 or 
CLOSING, there isn't anything to keep track of at the MPTCP level for 
retransmission purposes. I think you're right that we don't need 
TIME_WAIT.

>
> It's not clear to me if the 'fin' argument refears to 'ingress' or
> 'egress' fin (or has different meaning according to the current sk
> state).

The 'fin' and 'ack' arguments are referring to incoming DATA_FIN/DATA_ACK. 
I can modify those names to clarify.

>
>> +	case TCP_CLOSING:
>> +		if (ack)
>> +			return TCP_TIME_WAIT;
>> +		else
>> +			return oldstate;
>
> I think here we have to check (snd_una == data_fin_seq), and move
> directly to TCP_CLOSED - the subflows will handle the time wait

I was planning to have the caller check that.

>
>> +	case TCP_LAST_ACK:
>> +		if (ack)
>> +			return TCP_CLOSE;
>> +		else
>> +			return oldstate;
>
>> +	case TCP_ESTABLISHED:
>> +		return TCP_CLOSE_WAIT;
>
> I think we should enter TCP_CLOSE_WAIT iff we received a data fin, and
> it looks like we need two additional cases:
> - TCP_CLOSE_WAIT -> TCP_LAST_ACK, when we send data fin

I handled that transition in mptcp_stream_shutdown, but it should be here 
for clarity.

> - TCP_LAST_ACK -> TCP_CLOSED when our fin is acked

That's what is above in the TCP_LAST_ACK case.

>
> Finally I think we can avoid the TCP_TIME_WAIT at all, and we could
> handle some additional corner cases:
> - all subflows are closed (due to TCP reset) -> move msk state to
> TCP_CLOSE
> - what if we receive TCP_FIN on all subflows, but no MPTCP data fin ?

Either of these could be a break-before-make scenario, but for now we do 
need to change the MPTCP state to TCP_CLOSE when all subflows are gone.

>
> [...]
>> @@ -545,6 +588,20 @@ static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
>>  				continue;
>>  			}
>>  		}
>> +
>> +		if (!subflow->map_valid && subflow->incoming_data_fin) {
>> +			int newstate;
>> +
>> +			// TODO: Send on all subflows?
>> +			newstate = mptcp_close_state(sk->sk_state, true, false);
>> +			msk->ack_seq++;
>> +			subflow->incoming_data_fin = 0;
>> +			subflow->send_data_fin = 1;
>> +			subflow->data_fin_seq = msk->write_seq;
>
> The above means that the msk socket will send DATA_FIN after receiving
> one, regardless of shutdown(), right? I think that is not required ?!?

You're right - need to check newstate here and only send DATA_FIN in 
CLOSE_WAIT.

>
> [...]
>> @@ -747,11 +804,11 @@ static void mptcp_close(struct sock *sk, long timeout)
>>  	struct mptcp_sock *msk = mptcp_sk(sk);
>>  	struct socket *ssk = NULL;
>>
>> -	mptcp_token_destroy(msk->token);
>> -	inet_sk_state_store(sk, TCP_CLOSE);
>> -
>>  	lock_sock(sk);
>>
>> +	inet_sk_state_store(sk, TCP_CLOSE);
>> +	mptcp_token_destroy(msk->token);
>> +
>
> Don't we need to apply the above state machine here? and possibly use
> sk_stream_wait_close()? (with appropriate states handling, we can
> possibly re-use as it)

Yes, need the state machine here.

>
>>  	if (msk->subflow) {
>>  		ssk = msk->subflow;
>>  		msk->subflow = NULL;
>
> [...]
>> @@ -1262,6 +1330,18 @@ static int mptcp_shutdown(struct socket *sock, int how)
>>  	pr_debug("sk=%p, how=%d", msk, how);
>>
>>  	lock_sock(sock->sk);
>> +
>> +	if (sock->sk->sk_state == TCP_ESTABLISHED) {
>> +		inet_sk_state_store(sock->sk, TCP_FIN_WAIT1);
>> +	} else if (sock->sk->sk_state == TCP_CLOSE_WAIT) {
>> +		inet_sk_state_store(sock->sk, TCP_LAST_ACK);
>> +	} else if (sock->sk->sk_state != TCP_CLOSE) {
>> +		pr_warn("Shutdown from unexpected state %d",
>> +			sock->sk->sk_state);
>> +		release_sock(sock->sk);
>> +		return -EALREADY;
>> +	}
>> +
>
> I think we can re-use the state machine for the above transition?
> Shoul we need to check for: (how & SEND_SHUTDOWN) before doinf any
> mptcp level action?
> Should we apply the state changes only if !fallback, and ev copying the
> fallback socket status ?

Yes, yes, and yes.

>
>
>>  	ssock = __mptcp_fallback_get_ref(msk);
>>  	if (ssock) {
>>  		release_sock(sock->sk);
>> @@ -1276,6 +1356,8 @@ static int mptcp_shutdown(struct socket *sock, int how)
>>
>>  		tcp_socket = mptcp_subflow_tcp_socket(subflow);
>>  		pr_debug("conn_list->subflow=%p", subflow);
>> +		subflow->send_data_fin = 1;
>> +		subflow->data_fin_seq = msk->write_seq;
>>  		ret = kernel_sock_shutdown(tcp_socket, how);
>>  	}
>>  	release_sock(sock->sk);
>
> I think here we must avoid shutting down the subflow (but likely we can
> shut them down when msk transition to TCP_FIN_WAIT2.
>
> Additionally we can consider calling tcp_close() on all subflows when
> msk transition to TCP_CLOSE state

Ok, both of those are part of the state diagram but not reflected in the 
code.

>
>> @@ -1303,7 +1385,7 @@ void mptcp_proto_init(void)
>>  	mptcp_stream_ops.accept = mptcp_stream_accept;
>>  	mptcp_stream_ops.getname = mptcp_getname;
>>  	mptcp_stream_ops.listen = mptcp_listen;
>> -	mptcp_stream_ops.shutdown = mptcp_shutdown;
>> +	mptcp_stream_ops.shutdown = mptcp_stream_shutdown;
>>
>>  	if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
>>  		panic("Failed to allocate MPTCP pcpu counter\n");
>> diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
>> index 8fe5f9383d38..cf90014974ea 100644
>> --- a/net/mptcp/protocol.h
>> +++ b/net/mptcp/protocol.h
>> @@ -217,9 +217,12 @@ struct mptcp_subflow_context {
>>  		map_valid : 1,
>>  		backup : 1,
>>  		data_avail : 1,
>> -		rx_eof : 1;
>> +		rx_eof : 1,
>> +		incoming_data_fin : 1,
>> +		send_data_fin : 1;
>>  	u32	remote_nonce;
>>  	u64	thmac;
>> +	u64	data_fin_seq;
>>  	u32	local_nonce;
>>  	u32	remote_token;
>>  	u8	hmac[MPTCPOPT_HMAC_LEN];
>
> Possibly we can move this field to msk? and possibly we don't need
> send_data_fin?

Yeah, data_fin_seq can be moved up to msk.

>
> if we have a dss, we send data fin iff dss map matches msk data_fin_seq
> if this is not a data packet (no dss) we send the data_fin (once ???)
> if msk->write_seq == data_fin_seq.

It's that "once" part that made me add send_data_fin. 
mptcp_write_data_fin() has this logic.

Thanks for your review!

--
Mat Martineau
Intel
Paolo Abeni Nov. 7, 2019, 8:42 a.m. UTC | #4
On Wed, 2019-11-06 at 16:37 -0800, Mat Martineau wrote:
> On Wed, 6 Nov 2019, Paolo Abeni wrote:
> > I think we don't need a workqueue here: subflow could simply set
> > (WRITE_ONCE) one/some msk level field[s].
> > 
> > I don't think we need to deal with subflows concurrently writing
> > different values there - it will be a protocol violation, the stream
> > will be corrupted anyway.
> 
> The goal of the workqueue would be to process the DATA_FIN (and ack it if 
> the recv side is caught up) even when there's no pending send/recv from 
> userspace to wake up. That means the subflow error queue wouldn't work 
> either.

I mean: I think we can update msk->incoming_data_fin from
mptcp_incoming_options() even without locking, just using WRITE_ONCE().

Additionally we could use cmpxchg(), if we want to detect concurrent
modification from multiple subflows, but that looks more like a debug
feature than a functional requirement to me.
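
E.g. (purely illustrative, the msk field is a placeholder):

	/* debug-ish: notice a second subflow signalling a DATA_FIN */
	if (cmpxchg(&msk->rcv_data_fin, 0, 1))
		pr_warn_once("DATA_FIN already signalled by another subflow");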

> > [...]
> > > @@ -383,6 +383,13 @@ static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
> > >  	lock_sock(ssk);
> > >  	mptcp_clean_una(sk);
> > >  	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
> > > +
> > > +	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
> > > +		ret = sk_stream_wait_connect(sk, &timeo);
> > > +		if (ret != 0)
> > > +			goto put_out_subflow;
> > > +	}
> > > +
> > >  	while (msg_data_left(msg)) {
> > >  		ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now,
> > >  					 &size_goal);
> > 
> > I think waiting is not needed here?!? we could just bail out with
> > ENOTCONN if state is >= TCP_FIN_WAIT1. Such check could be located
> > before selecting the egress subflow: it should be outside the ssk lock.
> 
> I agree that the sk state check and possible waiting need to happen 
> outside the ssk lock, thank you for spotting that.
> 
> Waiting isn't needed for the states at the end of the connection, but I 
> saw that TCP sendmsg/do_tcp_sendpages verified the connection state prior 
> to sending in this way. 

Uhm... overall this looks like a bugfix, perhaps it could be placed into
its own patch for review's sake...

> What if I move this check very early in the 
> function, before the fallback sock_sendmsg()? If there's an attempt to 
> send before the handshake is done, we don't want to be waiting in the 
> regular TCP sendmsg().

I think you are right, otherwise we use the subflow as fallback, even
if the MP_CAPABLE handshake is still running and will complete
successfully. 

I additionally think we need something similar even for recvmsg(), for
similar reasons.
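
I.e. something along the lines of (sketch, error label is a placeholder):

	/* in mptcp_recvmsg(), before taking the fallback path */
	if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
		timeo = sock_rcvtimeo(sk, nonblock);
		ret = sk_stream_wait_connect(sk, &timeo);
		if (ret)
			goto out;
	}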

Side note: with v1 MP_CAPABLE we will have a similar problem, e.g. we can
start recvmsg() when the MP_CAPABLE handshake is not completed, but the
TCP connection is established (because v1 has a fourth MP_CAPABLE
message). In that scenario we will use the subflow as an MPTCP-enabled
one - and block with no data.

Later the MP_CAPABLE handshake can complete even unsuccessfully, and we
have to fallback to plain TCP, while still waiting in the MPTCP-
specific part of mptcp_recvmsg() - will be fun ;)

Cheers,

Paolo
Matthieu Baerts Nov. 8, 2019, 3:02 p.m. UTC | #5
Hi Mat,

On 06/11/2019 05:50, Mat Martineau wrote:
> Send and process received DATA_FIN options. This requires addition of a
> state machine to the MPTCP socket to handle the closing process. DATA_FIN
> is sent along with a DSS mapping for the final data transmitted, or on a
> bare ACK if there is no data to transmit.
> 
> This is still a work in progress, including these areas:
> 
>   * Need to finish handling of DSS_ACKs for the state machine.
> 
>   * Need to double-check whether it's correct to bypass inet_shutdown
>     as the current code does.
> 
>   * A DATA_FIN that arrives on a packet without data payload (unless it's
>     a TCP FIN) are discarded by tcp_data_queue() and not propagated to
>     the MPTCP socket. This is detected in mptcp_incoming_options() but
>     not yet handled. The skb with the DATA_FIN could be cloned and
>     placed in the subflow error queue, or a workqueue could handle it.
> 
> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>

Yesterday at the meeting we talked a bit about this patch. Here is what 
we said:


     - DATA_FIN:
         - Mat is working on it
         - It seems still to be too big for the "initial submission"
           with the minimal feature
         - because sending the DATA_FIN might be required for some
           implementations (not closing the MPTCP connections), it could
           be good to at least send the DATA_FIN, without the full support.

         - *Mat*: maybe for the initial submission, part 2, where only 1
           subflow is supported, could we send the DATA_FIN just before
           closing the connection? With the minimal support.


What do you think about that?

Cheers and take care,
Matt
Mat Martineau Nov. 8, 2019, 7:08 p.m. UTC | #6
On Fri, 8 Nov 2019, Matthieu Baerts wrote:

> Hi Mat,
>
> On 06/11/2019 05:50, Mat Martineau wrote:
>> Send and process received DATA_FIN options. This requires addition of a
>> state machine to the MPTCP socket to handle the closing process. DATA_FIN
>> is sent along with a DSS mapping for the final data transmitted, or on a
>> bare ACK if there is no data to transmit.
>> 
>> This is still a work in progress, including these areas:
>>
>>   * Need to finish handling of DSS_ACKs for the state machine.
>>
>>   * Need to double-check whether it's correct to bypass inet_shutdown
>>     as the current code does.
>>
>>   * A DATA_FIN that arrives on a packet without data payload (unless it's
>>     a TCP FIN) are discarded by tcp_data_queue() and not propagated to
>>     the MPTCP socket. This is detected in mptcp_incoming_options() but
>>     not yet handled. The skb with the DATA_FIN could be cloned and
>>     placed in the subflow error queue, or a workqueue could handle it.
>> 
>> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
>
> Yesterday at the meeting we talked a bit about this patch. Here is what we 
> said:
>
>
>    - DATA_FIN:
>        - Mat is working on it
>        - It seems still to be too big for the "initial submission" with the 
> minimal feature
>        - because sending the DATA_FIN might be required for some 
> implementations (not closing the MPTCP connections), it could be good to at 
> least send the DATA_FIN, without the full support.
>
>        - *Mat*: maybe for the initial submission, part 2, where only 1 
> subflow is supported, could we send the DATA_FIN just before closing the 
> connection? With the minimal support.
>
>
> What do you think about that?
>

For the single-subflow case, that's a good point. It will probably work 
well enough to set DATA_FIN whenever FIN is sent. I'll focus on that first 
and then see if there's also a single-subflow shortcut for acking the 
peer's DATA_FIN.
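
In the option-writing path that shortcut could be as small as something
like this (sketch only, exact hook point to be confirmed):

	/* if the (only) subflow is sending its FIN, mark the DATA_FIN in
	 * the DSS carried by the same skb
	 */
	if (skb && (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN))
		opts->ext_copy.data_fin = 1;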

--
Mat Martineau
Intel

Patch

diff --git a/net/mptcp/options.c b/net/mptcp/options.c
index 1b08e1193991..4f67826a4977 100644
--- a/net/mptcp/options.c
+++ b/net/mptcp/options.c
@@ -384,18 +384,43 @@  static bool mptcp_established_options_mp(struct sock *sk, unsigned int *size,
 	return false;
 }
 
+static void mptcp_write_data_fin(struct mptcp_subflow_context *subflow,
+				 struct mptcp_ext *ext)
+{
+	struct mptcp_sock *msk = mptcp_sk(subflow->conn);
+
+	WARN_ONCE(!subflow->conn, "MPTCP: Missing MPTCP socket information");
+
+	/* Only send DATA_FIN if all data has been sent or this is the last
+	 * mapping.
+	 */
+	if (!ext->use_map) {
+		if (subflow->data_fin_seq == msk->write_seq) {
+			ext->use_map = 1;
+			ext->dsn64 = 1;
+			ext->data_seq = subflow->data_fin_seq;
+			ext->subflow_seq = 0;
+			ext->data_len = 1;
+			ext->data_fin = 1;
+		}
+	} else if (subflow->data_fin_seq == ext->data_seq + ext->data_len) {
+		ext->data_fin = 1;
+	}
+}
+
 static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 					  unsigned int *size,
 					  unsigned int remaining,
 					  struct mptcp_out_options *opts)
 {
+	struct mptcp_subflow_context *subflow = mptcp_subflow_ctx(sk);
 	unsigned int dss_size = 0;
 	struct mptcp_ext *mpext;
 	unsigned int ack_size;
 
 	mpext = skb ? mptcp_get_ext(skb) : NULL;
 
-	if (!skb || (mpext && mpext->use_map)) {
+	if (!skb || (mpext && mpext->use_map) || subflow->send_data_fin) {
 		unsigned int map_size;
 
 		map_size = TCPOLEN_MPTCP_DSS_BASE + TCPOLEN_MPTCP_DSS_MAP64;
@@ -405,6 +430,9 @@  static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 			dss_size = map_size;
 			if (mpext)
 				opts->ext_copy = *mpext;
+
+			if (subflow->send_data_fin && skb)
+				mptcp_write_data_fin(subflow, &opts->ext_copy);
 		} else {
 			opts->ext_copy.use_map = 0;
 			WARN_ONCE(1, "MPTCP: Map dropped");
@@ -422,7 +450,7 @@  static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 
 		dss_size += ack_size;
 
-		msk = mptcp_sk(mptcp_subflow_ctx(sk)->conn);
+		msk = mptcp_sk(subflow->conn);
 		if (msk) {
 			opts->ext_copy.data_ack = msk->ack_seq;
 		} else {
@@ -441,6 +469,9 @@  static bool mptcp_established_options_dss(struct sock *sk, struct sk_buff *skb,
 	if (!dss_size)
 		return false;
 
+	if (opts->ext_copy.data_fin)
+		subflow->send_data_fin = 0;
+
 	*size = ALIGN(dss_size, 4);
 	return true;
 }
@@ -632,6 +663,18 @@  void mptcp_incoming_options(struct sock *sk, struct sk_buff *skb,
 
 	mpext->data_fin = mp_opt->data_fin;
 
+	if (mp_opt->data_fin &&
+	    TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		/* The DATA_FIN is on a packet that will be discarded by
+		 * tcp_data_queue() and will not get propagated to the MPTCP
+		 * socket.
+		 *
+		 * Use workqueue or subflow error queue?
+		 */
+		pr_warn("Ignored DATA_FIN");
+	}
+
+
 	if (msk)
 		mptcp_pm_fully_established(msk);
 }
@@ -727,10 +770,11 @@  void mptcp_write_options(__be32 *ptr, struct mptcp_out_options *opts)
 			 * support for optional 32-bit mappings later.
 			 */
 			flags |= MPTCP_DSS_HAS_MAP | MPTCP_DSS_DSN64;
-			if (mpext->data_fin)
-				flags |= MPTCP_DSS_DATA_FIN;
 		}
 
+		if (mpext->data_fin)
+			flags |= MPTCP_DSS_DATA_FIN;
+
 		*ptr++ = mptcp_option(MPTCPOPT_DSS, len, 0, flags);
 
 		if (mpext->use_ack) {
diff --git a/net/mptcp/protocol.c b/net/mptcp/protocol.c
index 696f1d03a61d..6d3b4a0a8dc1 100644
--- a/net/mptcp/protocol.c
+++ b/net/mptcp/protocol.c
@@ -101,7 +101,7 @@  static struct sock *mptcp_subflow_recv_lookup(const struct mptcp_sock *msk)
 	sock_owned_by_me(sk);
 
 	mptcp_for_each_subflow(msk, subflow) {
-		if (subflow->data_avail)
+		if (subflow->data_avail || subflow->incoming_data_fin)
 			return mptcp_subflow_tcp_socket(subflow)->sk;
 
 		receivers += !subflow->rx_eof;
@@ -383,6 +383,13 @@  static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 	lock_sock(ssk);
 	mptcp_clean_una(sk);
 	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);
+
+	if ((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) {
+		ret = sk_stream_wait_connect(sk, &timeo);
+		if (ret != 0)
+			goto put_out_subflow;
+	}
+
 	while (msg_data_left(msg)) {
 		ret = mptcp_sendmsg_frag(sk, ssk, msg, NULL, &timeo, &mss_now,
 					 &size_goal);
@@ -403,6 +410,7 @@  static int mptcp_sendmsg(struct sock *sk, struct msghdr *msg, size_t len)
 			mptcp_reset_timer(sk);
 	}
 
+put_out_subflow:
 	release_sock(ssk);
 
 out:
@@ -464,6 +472,41 @@  static void mptcp_wait_data(struct sock *sk, long *timeo)
 	remove_wait_queue(sk_sleep(sk), &wait);
 }
 
+static int mptcp_close_state(int oldstate, bool fin, bool ack)
+{
+	switch (oldstate) {
+	case TCP_FIN_WAIT1:
+		if (fin && ack)
+			return TCP_TIME_WAIT;
+		else if (fin)
+			return TCP_FIN_WAIT2;
+		else
+			return TCP_CLOSING;
+	case TCP_FIN_WAIT2:
+		if (fin)
+			return TCP_TIME_WAIT;
+		else
+			return oldstate;
+	case TCP_CLOSING:
+		if (ack)
+			return TCP_TIME_WAIT;
+		else
+			return oldstate;
+	case TCP_TIME_WAIT:
+		return TCP_CLOSE;
+	case TCP_LAST_ACK:
+		if (ack)
+			return TCP_CLOSE;
+		else
+			return oldstate;
+	case TCP_ESTABLISHED:
+		return TCP_CLOSE_WAIT;
+	default:
+		pr_debug("Unexpected state with DATA_FIN: %d", oldstate);
+		return oldstate;
+	}
+}
+
 static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 			 int nonblock, int flags, int *addr_len)
 {
@@ -545,6 +588,20 @@  static int mptcp_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
 				continue;
 			}
 		}
+
+		if (!subflow->map_valid && subflow->incoming_data_fin) {
+			int newstate;
+
+			// TODO: Send on all subflows?
+			newstate = mptcp_close_state(sk->sk_state, true, false);
+			msk->ack_seq++;
+			subflow->incoming_data_fin = 0;
+			subflow->send_data_fin = 1;
+			subflow->data_fin_seq = msk->write_seq;
+			inet_sk_state_store(sk, newstate);
+			tcp_send_ack(ssk);
+		}
+
 		release_sock(ssk);
 		continue;
 
@@ -747,11 +804,11 @@  static void mptcp_close(struct sock *sk, long timeout)
 	struct mptcp_sock *msk = mptcp_sk(sk);
 	struct socket *ssk = NULL;
 
-	mptcp_token_destroy(msk->token);
-	inet_sk_state_store(sk, TCP_CLOSE);
-
 	lock_sock(sk);
 
+	inet_sk_state_store(sk, TCP_CLOSE);
+	mptcp_token_destroy(msk->token);
+
 	if (msk->subflow) {
 		ssk = msk->subflow;
 		msk->subflow = NULL;
@@ -764,6 +821,8 @@  static void mptcp_close(struct sock *sk, long timeout)
 
 	list_for_each_entry_safe(subflow, tmp, &msk->conn_list, node) {
 		pr_debug("conn_list->subflow=%p", subflow);
+		subflow->send_data_fin = 1;
+		subflow->data_fin_seq = msk->write_seq;
 		sock_release(mptcp_subflow_tcp_socket(subflow));
 	}
 
@@ -930,6 +989,11 @@  static int mptcp_getsockopt(struct sock *sk, int level, int optname,
 	return -EOPNOTSUPP;
 }
 
+static void mptcp_shutdown(struct sock *sock, int how)
+{
+	pr_err("Unexpected MPTCP shutdown call sock=%p how=%d", sock, how);
+}
+
 #define MPTCP_DEFERRED_ALL TCPF_WRITE_TIMER_DEFERRED
 
 /* this is very alike tcp_release_cb() but we must handle differently a
@@ -1055,7 +1119,7 @@  static struct proto mptcp_prot = {
 	.accept		= mptcp_accept,
 	.setsockopt	= mptcp_setsockopt,
 	.getsockopt	= mptcp_getsockopt,
-	.shutdown	= tcp_shutdown,
+	.shutdown	= mptcp_shutdown,
 	.destroy	= mptcp_destroy,
 	.sendmsg	= mptcp_sendmsg,
 	.recvmsg	= mptcp_recvmsg,
@@ -1199,6 +1263,10 @@  static int mptcp_listen(struct socket *sock, int backlog)
 
 	err = inet_listen(ssock, backlog);
 	sock_put(ssock->sk);
+
+	if (!err)
+		inet_sk_state_store(sock->sk, TCP_LISTEN);
+
 	return err;
 }
 
@@ -1252,7 +1320,7 @@  static __poll_t mptcp_poll(struct file *file, struct socket *sock,
 	return ret;
 }
 
-static int mptcp_shutdown(struct socket *sock, int how)
+static int mptcp_stream_shutdown(struct socket *sock, int how)
 {
 	struct mptcp_sock *msk = mptcp_sk(sock->sk);
 	struct mptcp_subflow_context *subflow;
@@ -1262,6 +1330,18 @@  static int mptcp_shutdown(struct socket *sock, int how)
 	pr_debug("sk=%p, how=%d", msk, how);
 
 	lock_sock(sock->sk);
+
+	if (sock->sk->sk_state == TCP_ESTABLISHED) {
+		inet_sk_state_store(sock->sk, TCP_FIN_WAIT1);
+	} else if (sock->sk->sk_state == TCP_CLOSE_WAIT) {
+		inet_sk_state_store(sock->sk, TCP_LAST_ACK);
+	} else if (sock->sk->sk_state != TCP_CLOSE) {
+		pr_warn("Shutdown from unexpected state %d",
+			sock->sk->sk_state);
+		release_sock(sock->sk);
+		return -EALREADY;
+	}
+
 	ssock = __mptcp_fallback_get_ref(msk);
 	if (ssock) {
 		release_sock(sock->sk);
@@ -1276,6 +1356,8 @@  static int mptcp_shutdown(struct socket *sock, int how)
 
 		tcp_socket = mptcp_subflow_tcp_socket(subflow);
 		pr_debug("conn_list->subflow=%p", subflow);
+		subflow->send_data_fin = 1;
+		subflow->data_fin_seq = msk->write_seq;
 		ret = kernel_sock_shutdown(tcp_socket, how);
 	}
 	release_sock(sock->sk);
@@ -1303,7 +1385,7 @@  void mptcp_proto_init(void)
 	mptcp_stream_ops.accept = mptcp_stream_accept;
 	mptcp_stream_ops.getname = mptcp_getname;
 	mptcp_stream_ops.listen = mptcp_listen;
-	mptcp_stream_ops.shutdown = mptcp_shutdown;
+	mptcp_stream_ops.shutdown = mptcp_stream_shutdown;
 
 	if (percpu_counter_init(&mptcp_sockets_allocated, 0, GFP_KERNEL))
 		panic("Failed to allocate MPTCP pcpu counter\n");
diff --git a/net/mptcp/protocol.h b/net/mptcp/protocol.h
index 8fe5f9383d38..cf90014974ea 100644
--- a/net/mptcp/protocol.h
+++ b/net/mptcp/protocol.h
@@ -217,9 +217,12 @@  struct mptcp_subflow_context {
 		map_valid : 1,
 		backup : 1,
 		data_avail : 1,
-		rx_eof : 1;
+		rx_eof : 1,
+		incoming_data_fin : 1,
+		send_data_fin : 1;
 	u32	remote_nonce;
 	u64	thmac;
+	u64	data_fin_seq;
 	u32	local_nonce;
 	u32	remote_token;
 	u8	hmac[MPTCPOPT_HMAC_LEN];
diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index f08c4b713978..e39b71094b12 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -315,7 +315,7 @@  enum mapping_status {
 	MAPPING_OK,
 	MAPPING_INVALID,
 	MAPPING_EMPTY,
-	MAPPING_DATA_FIN
+	MAPPING_BARE_DATA_FIN
 };
 
 static u64 expand_seq(u64 old_seq, u16 old_data_len, u64 seq)
@@ -406,16 +406,13 @@  static enum mapping_status get_mapping_status(struct sock *ssk)
 		pr_err("Infinite mapping not handled");
 		MPTCP_INC_STATS(sock_net(ssk), MPTCP_MIB_INFINITEMAPRX);
 		return MAPPING_INVALID;
-	} else if (mpext->subflow_seq == 0 &&
-		   mpext->data_fin == 1) {
-		if (WARN_ON_ONCE(mpext->data_len != 1))
-			return false;
+	}
 
-		/* do not try hard to handle this any better, till we have
-		 * real data_fin support
-		 */
-		pr_debug("DATA_FIN with no payload");
-		return MAPPING_DATA_FIN;
+	if (mpext->data_fin && mpext->subflow_seq == 0 &&
+	    mpext->data_len == 1) {
+		subflow->map_valid = 0;
+		subflow->incoming_data_fin = 1;
+		return MAPPING_BARE_DATA_FIN;
 	}
 
 	if (!mpext->dsn64) {
@@ -450,6 +447,7 @@  static enum mapping_status get_mapping_status(struct sock *ssk)
 	subflow->map_seq = map_seq;
 	subflow->map_subflow_seq = mpext->subflow_seq;
 	subflow->map_data_len = mpext->data_len;
+	subflow->incoming_data_fin = mpext->data_fin;
 	subflow->map_valid = 1;
 	pr_debug("new map seq=%llu subflow_seq=%u data_len=%u",
 		 subflow->map_seq, subflow->map_subflow_seq,