Message ID: 1484678239-19199-1-git-send-email-jbaron@akamai.com
State: Accepted, archived
Delegated to: David Miller
On Tue, 2017-01-17 at 13:37 -0500, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>
>
> Using a Mac OSX box as a client connecting to a Linux server, we have found
> that when certain applications (such as 'ab') are abruptly terminated
> (via ^C), a FIN is sent followed by a RST packet on tcp connections. The
> FIN is accepted by the Linux stack but the RST is sent with the same
> sequence number as the FIN, and Linux responds with a challenge ACK per
> RFC 5961. The OSX client then sometimes (they are rate-limited) does not
> reply with any RST as would be expected on a closed socket.
>
> This results in sockets accumulating on the Linux server left mostly in
> the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
> This sequence of events can tie up a lot of resources on the Linux server
> since there may be a lot of data in write buffers at the time of the RST.
> Accepting a RST equal to rcv_nxt - 1, after we have already successfully
> processed a FIN, has made a significant difference for us in practice, by
> freeing up unneeded resources in a more expedient fashion.

Acked-by: Eric Dumazet <edumazet@google.com>

Thanks Jason !
On 01/17/2017 10:37 AM, Jason Baron wrote:
> From: Jason Baron <jbaron@akamai.com>
>
> [...]
>
> This results in sockets accumulating on the Linux server left mostly in
> the CLOSE_WAIT state, although LAST_ACK and CLOSING are also possible.
> This sequence of events can tie up a lot of resources on the Linux server
> since there may be a lot of data in write buffers at the time of the RST.
> Accepting a RST equal to rcv_nxt - 1, after we have already successfully
> processed a FIN, has made a significant difference for us in practice, by
> freeing up unneeded resources in a more expedient fashion.

Drifting a bit, and it doesn't change the value of dealing with it, but out
of curiosity: when you say mostly in CLOSE_WAIT, why aren't the server-side
applications reacting to the read return of zero triggered by the arrival
of the FIN?

happy benchmarking,

rick jones
On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jones2@hpe.com> wrote:
> Drifting a bit, and it doesn't change the value of dealing with it, but out
> of curiosity: when you say mostly in CLOSE_WAIT, why aren't the server-side
> applications reacting to the read return of zero triggered by the arrival
> of the FIN?

Even if the application reacts and calls close(fd), the kernel will still
try to push the data that was queued into the socket write queue prior to
receiving the FIN.

By allowing this RST, we can flush the whole data and react much faster,
avoiding locking memory in the kernel for a very long time.
On 01/17/2017 11:13 AM, Eric Dumazet wrote:
> On Tue, Jan 17, 2017 at 11:04 AM, Rick Jones <rick.jones2@hpe.com> wrote:
>> [...]
>
> Even if the application reacts, and calls close(fd), kernel will still
> try to push the data that was queued into socket write queue prior to
> receiving the FIN.
>
> By allowing this RST, we can flush the whole data and react much
> faster, avoiding locking memory in the kernel for very long time.

Understood. I was just wondering if there is also an application bug here.

happy benchmarking,

rick jones
From: Jason Baron <jbaron@akamai.com>
Date: Tue, 17 Jan 2017 13:37:19 -0500

> From: Jason Baron <jbaron@akamai.com>
>
> [...]
>
> A packetdrill test demonstrating the behavior:
...
> Signed-off-by: Jason Baron <jbaron@akamai.com>

Applied, thanks Jason.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 1a34e9278c07..bfa165cc455a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5199,6 +5199,23 @@ static int tcp_copy_to_iovec(struct sock *sk, struct sk_buff *skb, int hlen)
 	return err;
 }
 
+/* Accept RST for rcv_nxt - 1 after a FIN.
+ * When tcp connections are abruptly terminated from Mac OSX (via ^C), a
+ * FIN is sent followed by a RST packet. The RST is sent with the same
+ * sequence number as the FIN, and thus according to RFC 5961 a challenge
+ * ACK should be sent. However, Mac OSX rate limits replies to challenge
+ * ACKs on the closed socket. In addition middleboxes can drop either the
+ * challenge ACK or a subsequent RST.
+ */
+static bool tcp_reset_check(const struct sock *sk, const struct sk_buff *skb)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	return unlikely(TCP_SKB_CB(skb)->seq == (tp->rcv_nxt - 1) &&
+			(1 << sk->sk_state) & (TCPF_CLOSE_WAIT | TCPF_LAST_ACK |
+					       TCPF_CLOSING));
+}
+
 /* Does PAWS and seqno based validation of an incoming segment, flags will
  * play significant role here.
  */
@@ -5237,20 +5254,25 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 						  LINUX_MIB_TCPACKSKIPPEDSEQ,
 						  &tp->last_oow_ack_time))
 				tcp_send_dupack(sk, skb);
+		} else if (tcp_reset_check(sk, skb)) {
+			tcp_reset(sk);
 		}
 		goto discard;
 	}
 
 	/* Step 2: check RST bit */
 	if (th->rst) {
-		/* RFC 5961 3.2 (extend to match against SACK too if available):
-		 * If seq num matches RCV.NXT or the right-most SACK block,
+		/* RFC 5961 3.2 (extend to match against (RCV.NXT - 1) after a
+		 * FIN and SACK too if available):
+		 * If seq num matches RCV.NXT or (RCV.NXT - 1) after a FIN, or
+		 * the right-most SACK block,
 		 * then
 		 *     RESET the connection
 		 * else
 		 *     Send a challenge ACK
 		 */
-		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt) {
+		if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt ||
+		    tcp_reset_check(sk, skb)) {
 			rst_seq_match = true;
 		} else if (tcp_is_sack(tp) && tp->rx_opt.num_sacks > 0) {
 			struct tcp_sack_block *sp = &tp->selective_acks[0];