From patchwork Fri May 17 23:36:30 2013
X-Patchwork-Submitter: Yuchung Cheng
X-Patchwork-Id: 244717
X-Patchwork-Delegate: davem@davemloft.net
From: Yuchung Cheng
To: davem@davemloft.net, ncardwell@google.com, edumazet@google.com,
 nanditad@google.com
Cc: ilpo.jarvinen@cs.helsinki.fi, netdev@vger.kernel.org, Yuchung Cheng
Subject: [PATCH] tcp: remove bad timeout logic in fast recovery
Date: Fri, 17 May 2013 16:36:30 -0700
Message-Id: <1368833790-28841-1-git-send-email-ycheng@google.com>
X-Mailer: git-send-email 1.8.2.1
X-Mailing-List: netdev@vger.kernel.org

tcp_timeout_skbs() was intended to trigger fast recovery on timeout,
but in practice it often causes spurious retransmission storms during
fast recovery. The telltale sign is a fast retransmit above the highest
sacked sequence (SND.FACK).

Currently the RTO timer re-arming (as in RFC6298) offers a nice cushion
against spurious timeouts: when SND.UNA advances, the sender re-arms the
RTO timer and extends the timeout by icsk_rto. The sender does not
subtract the time elapsed since the packet at SND.UNA was sent.

But if the next (DUP)ACK arrives later than ~RTTVAR and triggers
tcp_fastretrans_alert(), then tcp_timeout_skbs() will mark any packet
sent more than icsk_rto ago as lost, including packets above the highest
sacked sequence. Most likely a large part of the scoreboard will be
marked.

If most packets are not lost, then the subsequent DUPACKs with new SACK
blocks will cause the sender to keep spuriously retransmitting packets
beyond SND.FACK. Even if only one packet is lost, the sender may falsely
retransmit almost the entire window.

The situation becomes common in the world of bufferbloat: the RTT
continues to grow as the queue builds up, but RTTVAR remains small and
close to the minimum 200ms. If a data packet is lost and the DUPACK
triggered by the next data packet is slightly delayed, then a spurious
retransmission storm forms.

As the original comment on tcp_timeout_skbs() suggests, the usefulness
of this feature is questionable. It also wastes cycles walking the SACK
scoreboard and is actually harmful because of the false recovery.

It's time to remove this.

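The failure mode described above can be illustrated with a small
userspace sketch. This is not kernel code: struct pkt, pkt_timedout(),
timeout_pkts(), demo_spurious_marks() and all the timing numbers are
hypothetical stand-ins chosen for illustration. Only the age test
(now - when > rto) mirrors the removed tcp_skb_timedout().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified scoreboard entry (not the kernel's sk_buff). */
struct pkt {
	unsigned int when;	/* send time in ms */
	bool sacked;		/* covered by a SACK block */
	bool lost;		/* marked lost by the sender */
};

/* Same age test as the removed tcp_skb_timedout(): sent more than one RTO ago. */
static bool pkt_timedout(unsigned int now, const struct pkt *p, unsigned int rto)
{
	return now - p->when > rto;
}

/*
 * Toy version of the removed walk: if the head has timed out, mark every
 * consecutive timed-out, un-SACKed packet lost. Returns how many of the
 * newly marked packets sit above the highest SACKed index (fack).
 */
static int timeout_pkts(struct pkt *q, size_t n, size_t fack,
			unsigned int now, unsigned int rto)
{
	int beyond_fack = 0;

	if (n == 0 || !pkt_timedout(now, &q[0], rto))
		return 0;

	for (size_t i = 0; i < n && pkt_timedout(now, &q[i], rto); i++) {
		if (q[i].sacked || q[i].lost)
			continue;
		q[i].lost = true;
		if (i > fack)
			beyond_fack++;
	}
	return beyond_fack;
}

/*
 * Assumed scenario: 10 packets sent 1 ms apart from t=0, RTO = 500 ms.
 * Only packet 0 is really lost. The RTO timer was re-armed by an ACK at
 * t=300, so it would fire at t=800. A delayed DUPACK SACKing packet 1
 * arrives at t=510.
 */
static int demo_spurious_marks(void)
{
	struct pkt q[10] = { 0 };
	const unsigned int rto = 500, rearmed_at = 300, now = 510;

	for (size_t i = 0; i < 10; i++)
		q[i].when = (unsigned int)i;
	q[1].sacked = true;

	assert(now < rearmed_at + rto);	/* the real RTO timer has not fired */
	return timeout_pkts(q, 10, 1, now, rto);
}
```

In this toy run demo_spurious_marks() returns 8: the re-armed RTO timer
is still pending, yet every un-SACKed packet is older than one RTO, so
packets 2 through 9 above the highest SACKed packet are marked lost
although only packet 0 was dropped.
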
Change-Id: I8eb2ae80e032dbc07a4d87bf5a68771856a8c04c
Signed-off-by: Yuchung Cheng
---
 include/linux/tcp.h  |  1 -
 include/net/tcp.h    |  1 -
 net/ipv4/tcp_input.c | 65 +--------------------------------------------------
 3 files changed, 1 insertion(+), 66 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5adbc33..472120b 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -246,7 +246,6 @@ struct tcp_sock {
 
 	/* from STCP, retrans queue hinting */
 	struct sk_buff* lost_skb_hint;
-	struct sk_buff *scoreboard_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
 	struct sk_buff_head	out_of_order_queue; /* Out of order segments go here */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 5bba80f..e1c3723 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1193,7 +1193,6 @@ static inline void tcp_mib_init(struct net *net)
 static inline void tcp_clear_retrans_hints_partial(struct tcp_sock *tp)
 {
 	tp->lost_skb_hint = NULL;
-	tp->scoreboard_skb_hint = NULL;
 }
 
 static inline void tcp_clear_all_retrans_hints(struct tcp_sock *tp)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index b358e8c..d7d3694 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1255,8 +1255,6 @@ static bool tcp_shifted_skb(struct sock *sk, struct sk_buff *skb,
 
 	if (skb == tp->retransmit_skb_hint)
 		tp->retransmit_skb_hint = prev;
-	if (skb == tp->scoreboard_skb_hint)
-		tp->scoreboard_skb_hint = prev;
 	if (skb == tp->lost_skb_hint) {
 		tp->lost_skb_hint = prev;
 		tp->lost_cnt_hint -= tcp_skb_pcount(prev);
@@ -1964,20 +1962,6 @@ static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
 	return true;
 }
 
-static inline int tcp_skb_timedout(const struct sock *sk,
-				   const struct sk_buff *skb)
-{
-	return tcp_time_stamp - TCP_SKB_CB(skb)->when > inet_csk(sk)->icsk_rto;
-}
-
-static inline int tcp_head_timedout(const struct sock *sk)
-{
-	const struct tcp_sock *tp = tcp_sk(sk);
-
-	return tp->packets_out &&
-	       tcp_skb_timedout(sk, tcp_write_queue_head(sk));
-}
-
 /* Linux NewReno/SACK/FACK/ECN state machine.
  * --------------------------------------
  *
@@ -2084,12 +2068,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	if (tcp_dupack_heuristics(tp) > tp->reordering)
 		return true;
 
-	/* Trick#3 : when we use RFC2988 timer restart, fast
-	 * retransmit can be triggered by timeout of queue head.
-	 */
-	if (tcp_is_fack(tp) && tcp_head_timedout(sk))
-		return true;
-
 	/* Trick#4: It is still not OK... But will it be useful to delay
 	 * recovery more?
 	 */
@@ -2126,44 +2104,6 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	return false;
 }
 
-/* New heuristics: it is possible only after we switched to restart timer
- * each time when something is ACKed. Hence, we can detect timed out packets
- * during fast retransmit without falling to slow start.
- *
- * Usefulness of this as is very questionable, since we should know which of
- * the segments is the next to timeout which is relatively expensive to find
- * in general case unless we add some data structure just for that. The
- * current approach certainly won't find the right one too often and when it
- * finally does find _something_ it usually marks large part of the window
- * right away (because a retransmission with a larger timestamp blocks the
- * loop from advancing). -ij
- */
-static void tcp_timeout_skbs(struct sock *sk)
-{
-	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb;
-
-	if (!tcp_is_fack(tp) || !tcp_head_timedout(sk))
-		return;
-
-	skb = tp->scoreboard_skb_hint;
-	if (tp->scoreboard_skb_hint == NULL)
-		skb = tcp_write_queue_head(sk);
-
-	tcp_for_write_queue_from(skb, sk) {
-		if (skb == tcp_send_head(sk))
-			break;
-		if (!tcp_skb_timedout(sk, skb))
-			break;
-
-		tcp_skb_mark_lost(tp, skb);
-	}
-
-	tp->scoreboard_skb_hint = skb;
-
-	tcp_verify_left_out(tp);
-}
-
 /* Detect loss in event "A" above by marking head of queue up as lost.
  * For FACK or non-SACK(Reno) senders, the first "packets" number of segments
  * are considered lost. For RFC3517 SACK, a segment is considered lost if it
@@ -2249,8 +2189,6 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 		else if (fast_rexmit)
 			tcp_mark_head_lost(sk, 1, 1);
 	}
-
-	tcp_timeout_skbs(sk);
 }
 
 /* CWND moderation, preventing bursts due to too big ACKs
@@ -2842,7 +2780,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		fast_rexmit = 1;
 	}
 
-	if (do_lost || (tcp_is_fack(tp) && tcp_head_timedout(sk)))
+	if (do_lost)
 		tcp_update_scoreboard(sk, fast_rexmit);
 	tcp_cwnd_reduction(sk, newly_acked_sacked, fast_rexmit);
 	tcp_xmit_retransmit_queue(sk);
@@ -3075,7 +3013,6 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 
 		tcp_unlink_write_queue(skb, sk);
 		sk_wmem_free_skb(sk, skb);
-		tp->scoreboard_skb_hint = NULL;
 		if (skb == tp->retransmit_skb_hint)
 			tp->retransmit_skb_hint = NULL;
 		if (skb == tp->lost_skb_hint)