diff mbox

[net-next] tcp: accept RST for rcv_nxt - 1 after receiving a FIN

Message ID 1484678239-19199-1-git-send-email-jbaron@akamai.com
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Jason Baron Jan. 17, 2017, 6:37 p.m. UTC
From: Jason Baron <jbaron@akamai.com>

Using a Mac OSX box as a client connecting to a Linux server, we have found
that when certain applications (such as 'ab'), are abruptly terminated
(via ^C), a FIN is sent followed by a RST packet on tcp connections. The
FIN is accepted by the Linux stack but the RST is sent with the same
sequence number as the FIN, and Linux responds with a challenge ACK per
RFC 5961. The OSX client then sometimes (they are rate-limited) does not
reply with any RST as would be expected on a closed socket.

This results in sockets accumulating on the Linux server left mostly in
the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
This sequence of events can tie up a lot of resources on the Linux server
since there may be a lot of data in write buffers at the time of the RST.
Accepting a RST equal to rcv_nxt - 1, after we have already successfully
processed a FIN, has made a significant difference for us in practice, by
freeing up unneeded resources in a more expedient fashion.

A packetdrill test demonstrating the behavior:

// testing mac osx rst behavior

// Establish a connection
0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 32768 <mss 1460,nop,wscale 10>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,wscale 5>
0.200 < . 1:1(0) ack 1 win 32768
0.200 accept(3, ..., ...) = 4

// Client closes the connection
0.300 < F. 1:1(0) ack 1 win 32768

// now send rst with same sequence
0.300 < R. 1:1(0) ack 1 win 32768

// make sure we are in TCP_CLOSE
0.400 %{
assert tcpi_state == 7
}%

Signed-off-by: Jason Baron <jbaron@akamai.com>
Cc: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/tcp_input.c | 28 +++++++++++++++++++++++++---
 1 file changed, 25 insertions(+), 3 deletions(-)

Comments

Eric Dumazet Jan. 17, 2017, 7:02 p.m. UTC | #1
On Tue, 2017-01-17 at 13:37 -0500, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>
> 
> Using a Mac OSX box as a client connecting to a Linux server, we have found
> that when certain applications (such as 'ab'), are abruptly terminated
> (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
> FIN is accepted by the Linux stack but the RST is sent with the same
> sequence number as the FIN, and Linux responds with a challenge ACK per
> RFC 5961. The OSX client then sometimes (they are rate-limited) does not
> reply with any RST as would be expected on a closed socket.
> 
> This results in sockets accumulating on the Linux server left mostly in
> the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
> This sequence of events can tie up a lot of resources on the Linux server
> since there may be a lot of data in write buffers at the time of the RST.
> Accepting a RST equal to rcv_nxt - 1, after we have already successfully
> processed a FIN, has made a significant difference for us in practice, by
> freeing up unneeded resources in a more expedient fashion.

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks Jason !
Rick Jones Jan. 17, 2017, 7:04 p.m. UTC | #2
On 01/17/2017 10:37 AM, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>
>
> Using a Mac OSX box as a client connecting to a Linux server, we have found
> that when certain applications (such as 'ab'), are abruptly terminated
> (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
> FIN is accepted by the Linux stack but the RST is sent with the same
> sequence number as the FIN, and Linux responds with a challenge ACK per
> RFC 5961. The OSX client then sometimes (they are rate-limited) does not
> reply with any RST as would be expected on a closed socket.
>
> This results in sockets accumulating on the Linux server left mostly in
> the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
> This sequence of events can tie up a lot of resources on the Linux server
> since there may be a lot of data in write buffers at the time of the RST.
> Accepting a RST equal to rcv_nxt - 1, after we have already successfully
> processed a FIN, has made a significant difference for us in practice, by
> freeing up unneeded resources in a more expedient fashion.

Drifting a bit, and it doesn't change the value of dealing with it, but 
out of curiosity, when you say mostly in CLOSE_WAIT, why aren't the 
server-side applications reacting to the read return of zero triggered 
by the arrival of the FIN?

happy benchmarking,

rick jones
Eric Dumazet Jan. 17, 2017, 7:13 p.m. UTC | #3
On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>up unneeded resources in a more expedient fashion.
>
>
> Drifting a bit, and it doesn't change the value of dealing with it, but out
> of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side
> applications reacting to the read return of zero triggered by the arrival of
> the FIN?

Even if the application reacts, and calls close(fd), kernel will still
try to push the data that was queued into socket write queue prior to
receiving the FIN.

By allowing this RST, we can flush the whole data and react much
faster, avoiding locking memory in the kernel for very long time.
Rick Jones Jan. 17, 2017, 7:19 p.m. UTC | #4
On 01/17/2017 11:13 AM, Eric Dumazet wrote:
> On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>> Drifting a bit, and it doesn't change the value of dealing with it, but out
>> of curiosity, when you say mostly in CLOSE_WAIT, why aren't the server-side
>> applications reacting to the read return of zero triggered by the arrival of
>> the FIN?
>
> Even if the application reacts, and calls close(fd), kernel will still
> try to push the data that was queued into socket write queue prior to
> receiving the FIN.
>
> By allowing this RST, we can flush the whole data and react much
> faster, avoiding locking memory in the kernel for very long time.

Understood.  I was just wondering if there is also an application bug here.

happy benchmarking,

rick jones
David Miller Jan. 17, 2017, 8:52 p.m. UTC | #5
From: Jason Baron <jbaron@akamai.com>
Date: Tue, 17 Jan 2017 13:37:19 -0500

> From: Jason Baron <jbaron@akamai.com>
> 
> Using a Mac OSX box as a client connecting to a Linux server, we have found
> that when certain applications (such as 'ab'), are abruptly terminated
> (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
> FIN is accepted by the Linux stack but the RST is sent with the same
> sequence number as the FIN, and Linux responds with a challenge ACK per
> RFC 5961. The OSX client then sometimes (they are rate-limited) does not
> reply with any RST as would be expected on a closed socket.
> 
> This results in sockets accumulating on the Linux server left mostly in
> the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
> This sequence of events can tie up a lot of resources on the Linux server
> since there may be a lot of data in write buffers at the time of the RST.
> Accepting a RST equal to rcv_nxt - 1, after we have already successfully
> processed a FIN, has made a significant difference for us in practice, by
> freeing up unneeded resources in a more expedient fashion.
> 
> A packetdrill test demonstrating the behavior:
 ...
> Signed-off-by: Jason Baron <jbaron@akamai.com>

Applied, thanks Jason.
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a34e9278c07..bfa165cc455a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5199,6 +5199,23 @@  static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX rate limits replies to challenge
+ * ACKs on the closed socket. In addition middleboxes can drop either the
+ * challenge ACK or a subsequent RST.
+ */
+static bool tcp_reset_check(const struct sock *sk, const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely(TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1) &&
+			(1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
+					       TCPF_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5237,20 +5254,25 @@  static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
 
 	/* Step 2: check RST bit */
 	if (th->rst) {
-		/* RFC 5961 3.2 (extend to match against SACK too if available):
-		 * If seq num matches RCV.NXT or the right-most SACK block,
+		/* RFC 5961 3.2 (extend to match against (RCV.NXT - 1) after a
+		 * FIN and SACK too if available):
+		 * If seq num matches RCV.NXT or (RCV.NXT - 1) after a FIN, or
+		 * the right-most SACK block,
 		 * then
 		 *     RESET the connection
 		 * else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];