From patchwork Wed Mar 20 23:33:00 2013 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Yuchung Cheng X-Patchwork-Id: 229523 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id CD8792C00B6 for ; Thu, 21 Mar 2013 10:41:01 +1100 (EST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754631Ab3CTXk6 (ORCPT ); Wed, 20 Mar 2013 19:40:58 -0400 Received: from mail-qe0-f74.google.com ([209.85.128.74]:34147 "EHLO mail-qe0-f74.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754466Ab3CTXk4 (ORCPT ); Wed, 20 Mar 2013 19:40:56 -0400 X-Greylist: delayed 462 seconds by postgrey-1.27 at vger.kernel.org; Wed, 20 Mar 2013 19:40:56 EDT Received: by mail-qe0-f74.google.com with SMTP id 9so212365qea.5 for ; Wed, 20 Mar 2013 16:40:56 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:cc:subject:date:message-id:x-mailer:in-reply-to :references; bh=9vHDp0dV6yfduIXfo+QynmXorKSlMeG7GhM6st694bk=; b=dwR428tWSnPJtLPR2AcGAmNuVlD49EKq5her0aDPiw4NiUuTwd0uxrvdxzrmyC5fGm 8VY63cCX8bfEQ5mI9q5gooIAX1fG9qkTjLWlmPDo4vElUuNmeS6XK+GScyJGr9FAoY9M YfRX5KNu5GvemtK5ufLJ/yY/D65sat3ZMFKqq8FNJhUPbdPZT7MGpniI3rXyEpVT3WuF R8d7MAW1zexMvMcXz/2+4GVn1Rk0ZOKqoWPRV/eWIH+Uut95mDtS+CjVXJNylHY+Pi7i e1Cf5I66Z4xVWFivomHmGPp1SdZbK85WTJWIECHdmN6mgWII+RBenU+GjylFrySZCdq3 RhPg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=x-received:from:to:cc:subject:date:message-id:x-mailer:in-reply-to :references:x-gm-message-state; bh=9vHDp0dV6yfduIXfo+QynmXorKSlMeG7GhM6st694bk=; b=grlxJJEwJD2vtf+A4iIFNtfBijFKupPIzythPlu/ijKiZ3gITQkyedw66Qt1EO8gzG dEfjWp4N+vgmUvSYIodY+YaDqdZWwiRj+3lKrWNENi7k/KIwl9zO4DpnCPp0jQBZns4L M2mueDdXlOlp4BValokdM67CJqwxS79k0FeNFV7jkgEMlTuR1uSG1LK+ACbOx3R7mgUN vTkAf4uWIOfZXE1fy2c5Ot3qxwec/2HBT6rSq4vPhufthNBpEssA5tTfIlWbbq6fN9l7 CseANQ0maMdZ0OfJLRfyjSUAlhW8xeOtJgamcarueRLcZea4PPMEXv++AJGaTgF90tMP 8gOQ== X-Received: by 10.224.72.199 with SMTP id n7mr4501988qaj.5.1363822394022; Wed, 20 Mar 2013 16:33:14 -0700 (PDT) Received: from corp2gmr1-1.hot.corp.google.com (corp2gmr1-1.hot.corp.google.com [172.24.189.92]) by gmr-mx.google.com with ESMTPS id x1si2138235qci.2.2013.03.20.16.33.13 (version=TLSv1.1 cipher=AES128-SHA bits=128/128); Wed, 20 Mar 2013 16:33:14 -0700 (PDT) Received: from blast2.mtv.corp.google.com (blast2.mtv.corp.google.com [172.17.132.164]) by corp2gmr1-1.hot.corp.google.com (Postfix) with ESMTP id AF9E731C243; Wed, 20 Mar 2013 16:33:13 -0700 (PDT) Received: by blast2.mtv.corp.google.com (Postfix, from userid 5463) id 5E7FE220163; Wed, 20 Mar 2013 16:33:13 -0700 (PDT) From: Yuchung Cheng To: davem@davemloft.net, ncardwell@google.com, edumazet@google.com, nanditad@google.com Cc: ilpo.jarvinen@cs.helsinki.fi, netdev@vger.kernel.org, Yuchung Cheng Subject: [PATCH v2 3/3 net-next] tcp: implement RFC5682 F-RTO Date: Wed, 20 Mar 2013 16:33:00 -0700 Message-Id: <1363822380-16687-3-git-send-email-ycheng@google.com> X-Mailer: git-send-email 1.8.1.3 In-Reply-To: <1363822380-16687-1-git-send-email-ycheng@google.com> References: <1363822380-16687-1-git-send-email-ycheng@google.com> X-Gm-Message-State: ALoCoQl09yie7YEeb2xOrUK3fZK4SgohaZloHzfjReJSx5Pz7j84BuQhHqpOSsVBQMzyRMo/ET7G7P7xs8cfDQlv2XtzlwB0IKtB2GBPuWCMc55m2CESnURuzYGb+VUtWIhc+PEx/h+b3L/I4wW64QFoNieJgKe43fFs8ZNLjlI8mmLd2QiXQiSEOUJj7f5/fVNG61LZwNOz Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org This patch implements F-RTO (foward RTO recovery): When the first retransmission after timeout is acknowledged, F-RTO sends new data instead of old data. If the next ACK acknowledges some never-retransmitted data, then the timeout was spurious and the congestion state is reverted. Otherwise if the next ACK selectively acknowledges the new data, then the timeout was genuine and the loss recovery continues. This idea applies to recurring timeouts as well. While F-RTO sends different data during timeout recovery, it does not (and should not) change the congestion control. The implementaion follows the three steps of SACK enhanced algorithm (section 3) in RFC5682. Step 1 is in tcp_enter_loss(). Step 2 and 3 are in tcp_process_loss(). The basic version is not supported because SACK enhanced version also works for non-SACK connections. The new implementation is functionally in parity with the old F-RTO implementation except the one case where it increases undo events: In addition to the RFC algorithm, a spurious timeout may be detected without sending data in step 2, as long as the SACK confirms not all the original data are dropped. When this happens, the sender will undo the cwnd and perhaps enter fast recovery instead. This additional check increases the F-RTO undo events by 5x compared to the prior implementation on Google Web servers, since the sender often does not have new data to send for HTTP. Note F-RTO may detect spurious timeout before Eifel with timestamps does so. Signed-off-by: Yuchung Cheng Acked-by: Eric Dumazet Acked-by: Neal Cardwell --- ChangeLog in v2: - Removed extra while spaces - Allow F-RTO in sack reneging case to detect spurious retransmit(s) - Re-tested after merging with the recent TCP tail loss probe patch Documentation/networking/ip-sysctl.txt | 18 +++------ include/linux/tcp.h | 3 +- net/ipv4/tcp_input.c | 73 ++++++++++++++++++++++++++++------ 3 files changed, 68 insertions(+), 26 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 8a977a0..f98ca63 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -225,19 +225,13 @@ tcp_fin_timeout - INTEGER Default: 60 seconds tcp_frto - INTEGER - Enables Forward RTO-Recovery (F-RTO) defined in RFC4138. + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. F-RTO is an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in wireless environments - where packet loss is typically due to random radio interference - rather than intermediate router congestion. F-RTO is sender-side - only modification. Therefore it does not require any support from - the peer. - - If set to 1, basic version is enabled. 2 enables SACK enhanced - F-RTO if flow uses SACK. The basic version can be used also when - SACK is in use though scenario(s) with it exists where F-RTO - interacts badly with the packet counting of the SACK enabled TCP - flow. + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. diff --git a/include/linux/tcp.h b/include/linux/tcp.h index f5f203b..5adbc33 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -192,7 +192,8 @@ struct tcp_sock { u8 nonagle : 4,/* Disable Nagle algorithm? */ thin_lto : 1,/* Use linear timeouts for thin streams */ thin_dupack : 1,/* Fast retransmit on first dupack */ - repair : 1; + repair : 1, + frto : 1;/* F-RTO (RFC5682) activated in CA_Loss */ u8 repair_queue; u8 do_early_retrans:1,/* Enable RFC5827 early-retransmit */ syn_data:1, /* SYN includes data */ diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 8d821e4..b2b3619 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -107,6 +107,7 @@ int sysctl_tcp_early_retrans __read_mostly = 3; #define FLAG_DATA_SACKED 0x20 /* New SACK. */ #define FLAG_ECE 0x40 /* ECE in this ACK */ #define FLAG_SLOWPATH 0x100 /* Do not skip RFC checks for window update.*/ +#define FLAG_ORIG_SACK_ACKED 0x200 /* Never retransmitted data are (s)acked */ #define FLAG_SND_UNA_ADVANCED 0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */ #define FLAG_DSACKING_ACK 0x800 /* SACK blocks contained D-SACK info */ #define FLAG_SACK_RENEGING 0x2000 /* snd_una advanced to a sacked seq */ @@ -1155,6 +1156,8 @@ static u8 tcp_sacktag_one(struct sock *sk, tcp_highest_sack_seq(tp))) state->reord = min(fack_count, state->reord); + if (!after(end_seq, tp->high_seq)) + state->flag |= FLAG_ORIG_SACK_ACKED; } if (sacked & TCPCB_LOST) { @@ -1835,10 +1838,13 @@ void tcp_enter_loss(struct sock *sk, int how) const struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); struct sk_buff *skb; + bool new_recovery = false; /* Reduce ssthresh if it has not yet been made inside this window. */ - if (icsk->icsk_ca_state <= TCP_CA_Disorder || tp->snd_una == tp->high_seq || + if (icsk->icsk_ca_state <= TCP_CA_Disorder || + !after(tp->high_seq, tp->snd_una) || (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { + new_recovery = true; tp->prior_ssthresh = tcp_current_ssthresh(sk); tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); tcp_ca_event(sk, CA_EVENT_LOSS); @@ -1883,6 +1889,14 @@ void tcp_enter_loss(struct sock *sk, int how) tcp_set_ca_state(sk, TCP_CA_Loss); tp->high_seq = tp->snd_nxt; TCP_ECN_queue_cwr(tp); + + /* F-RTO RFC5682 sec 3.1 step 1: retransmit SND.UNA if no previous + * loss recovery is underway except recurring timeout(s) on + * the same SND.UNA (sec 3.2). Disable F-RTO on path MTU probing + */ + tp->frto = sysctl_tcp_frto && + (new_recovery || icsk->icsk_retransmits) && + !inet_csk(sk)->icsk_mtup.probe_size; } /* If ACK arrived pointing to a remembered SACK, it means that our @@ -2426,12 +2440,12 @@ static int tcp_try_undo_partial(struct sock *sk, int acked) return failed; } -/* Undo during loss recovery after partial ACK. */ -static bool tcp_try_undo_loss(struct sock *sk) +/* Undo during loss recovery after partial ACK or using F-RTO. */ +static bool tcp_try_undo_loss(struct sock *sk, bool frto_undo) { struct tcp_sock *tp = tcp_sk(sk); - if (tcp_may_undo(tp)) { + if (frto_undo || tcp_may_undo(tp)) { struct sk_buff *skb; tcp_for_write_queue(skb, sk) { if (skb == tcp_send_head(sk)) @@ -2445,9 +2459,12 @@ static bool tcp_try_undo_loss(struct sock *sk) tp->lost_out = 0; tcp_undo_cwr(sk, true); NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSSUNDO); + if (frto_undo) + NET_INC_STATS_BH(sock_net(sk), + LINUX_MIB_TCPSPURIOUSRTOS); inet_csk(sk)->icsk_retransmits = 0; tp->undo_marker = 0; - if (tcp_is_sack(tp)) + if (frto_undo || tcp_is_sack(tp)) tcp_set_ca_state(sk, TCP_CA_Open); return true; } @@ -2667,24 +2684,52 @@ static void tcp_enter_recovery(struct sock *sk, bool ece_ack) /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are * recovered or spurious. Otherwise retransmits more on partial ACKs. */ -static void tcp_process_loss(struct sock *sk, int flag) +static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); + bool recovered = !before(tp->snd_una, tp->high_seq); - if (!before(tp->snd_una, tp->high_seq)) { + if (tp->frto) { /* F-RTO RFC5682 sec 3.1 (sack enhanced version). */ + if (flag & FLAG_ORIG_SACK_ACKED) { + /* Step 3.b. A timeout is spurious if not all data are + * lost, i.e., never-retransmitted data are (s)acked. + */ + tcp_try_undo_loss(sk, true); + return; + } + if (after(tp->snd_nxt, tp->high_seq) && + (flag & FLAG_DATA_SACKED || is_dupack)) { + tp->frto = 0; /* Loss was real: 2nd part of step 3.a */ + } else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) { + tp->high_seq = tp->snd_nxt; + __tcp_push_pending_frames(sk, tcp_current_mss(sk), + TCP_NAGLE_OFF); + if (after(tp->snd_nxt, tp->high_seq)) + return; /* Step 2.b */ + tp->frto = 0; + } + } + + if (recovered) { + /* F-RTO RFC5682 sec 3.1 step 2.a and 1st part of step 3.a */ icsk->icsk_retransmits = 0; tcp_try_undo_recovery(sk); return; } - if (flag & FLAG_DATA_ACKED) icsk->icsk_retransmits = 0; - if (tcp_is_reno(tp) && flag & FLAG_SND_UNA_ADVANCED) - tcp_reset_reno_sack(tp); - if (tcp_try_undo_loss(sk)) + if (tcp_is_reno(tp)) { + /* A Reno DUPACK means new data in F-RTO step 2.b above are + * delivered. Lower inflight to clock out (re)tranmissions. + */ + if (after(tp->snd_nxt, tp->high_seq) && is_dupack) + tcp_add_reno_sack(sk); + else if (flag & FLAG_SND_UNA_ADVANCED) + tcp_reset_reno_sack(tp); + } + if (tcp_try_undo_loss(sk, false)) return; - tcp_moderate_cwnd(tp); tcp_xmit_retransmit_queue(sk); } @@ -2764,7 +2809,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, newly_acked_sacked = pkts_acked + tp->sacked_out - prior_sacked; break; case TCP_CA_Loss: - tcp_process_loss(sk, flag); + tcp_process_loss(sk, flag, is_dupack); if (icsk->icsk_ca_state != TCP_CA_Open) return; /* Fall through to processing in Open state. */ @@ -3003,6 +3048,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets, } if (!(sacked & TCPCB_SACKED_ACKED)) reord = min(pkts_acked, reord); + if (!after(scb->end_seq, tp->high_seq)) + flag |= FLAG_ORIG_SACK_ACKED; } if (sacked & TCPCB_SACKED_ACKED)