From patchwork Tue Jun 14 19:18:41 2016 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 8bit X-Patchwork-Submitter: Daniel Metz X-Patchwork-Id: 635507 X-Patchwork-Delegate: davem@davemloft.net Return-Path: X-Original-To: patchwork-incoming@ozlabs.org Delivered-To: patchwork-incoming@ozlabs.org Received: from vger.kernel.org (vger.kernel.org [209.132.180.67]) by ozlabs.org (Postfix) with ESMTP id 3rTfbZ3BD0z9t0f for ; Wed, 15 Jun 2016 05:19:38 +1000 (AEST) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752708AbcFNTT0 (ORCPT ); Tue, 14 Jun 2016 15:19:26 -0400 Received: from mx2.mailbox.org ([80.241.60.215]:49372 "EHLO mx2.mailbox.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751590AbcFNTTY (ORCPT ); Tue, 14 Jun 2016 15:19:24 -0400 Received: from smtp1.mailbox.org (smtp1.mailbox.org [80.241.60.240]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx2.mailbox.org (Postfix) with ESMTPS id 0495E4207D; Tue, 14 Jun 2016 21:19:14 +0200 (CEST) X-Virus-Scanned: amavisd-new at heinlein-support.de Received: from smtp1.mailbox.org ([80.241.60.240]) by gerste.heinlein-support.de (gerste.heinlein-support.de [91.198.250.173]) (amavisd-new, port 10030) with ESMTP id dvnYrwyYPZn3; Tue, 14 Jun 2016 21:19:12 +0200 (CEST) From: Daniel Metz To: netdev@vger.kernel.org Cc: Daniel Metz , Hagen Paul Pfeifer , Daniel Metz , Eric Dumazet , Yuchung Cheng Subject: [PATCH net-next v2] tcp: use RFC6298 compliant TCP RTO calculation Date: Tue, 14 Jun 2016 21:18:41 +0200 Message-Id: <20160614191841.21496-1-dmetz@mytum.de> In-Reply-To: References: MIME-Version: 1.0 Sender: netdev-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: netdev@vger.kernel.org From: Daniel Metz This patch adjusts Linux RTO calculation to be RFC6298 Standard compliant. MinRTO is no longer added to the computed RTO, RTO damping and overestimation are decreased. In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is less. The Linux implementation as a discrepancy to the Standard basically adds the defined MinRTO to the calculated RTO. When comparing both approaches, the Linux calculation seems to perform worse for sender limited TCP flows like Telnet, SSH or constant bit rate encoded transmissions, especially for Round Trip Times (RTT) of 50ms to 800ms. Compared to the Linux implementation the RFC 6298 proposed RTO calculation performs better and more precise in adapting to current network characteristics. Extensive measurements for bulk data did not show a negative impact of the adjusted calculation. Exemplary performance comparison for sender-limited-flows: - Rate: 10Mbit/s - Delay: 200ms, Delay Variation: 10ms - Time between each scheduled segment: 1s - Amount Data Segments: 300 - Mean of 11 runs Mean Response Waiting Time [milliseconds] PER [%] | 0.5 1 1.5 2 3 5 7 10 --------+------------------------------------------------------- old | 206.4 208.6 218.0 218.6 227.2 249.3 274.7 308.2 new | 203.9 206.0 207.0 209.9 217.3 225.6 238.7 259.1 Detailed analysis: https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/ Reasoning for historic design: Sarolahti, P.; Kuznetsov, A. (2002). Congestion Control in Linux TCP. Conference Paper. Proceedings of the FREENIX Track. 2002 USENIX Annual https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf Signed-off-by: Hagen Paul Pfeifer Signed-off-by: Daniel Metz Cc: Eric Dumazet Cc: Yuchung Cheng --- v2: - Using the RFC 6298 compliant implementation, the tcp_sock struct variable u32 mdev_max_us becomes obsolete and consequently is being removed. - Add reference to Kuznetsov paper include/linux/tcp.h | 1 - net/ipv4/tcp_input.c | 74 ++++++++++++-------------------------------------- net/ipv4/tcp_metrics.c | 2 +- 3 files changed, 18 insertions(+), 59 deletions(-) diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 7be9b12..d1790c5 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -231,7 +231,6 @@ struct tcp_sock { /* RTT measurement */ u32 srtt_us; /* smoothed round trip time << 3 in usecs */ u32 mdev_us; /* medium deviation */ - u32 mdev_max_us; /* maximal mdev for the last rtt period */ u32 rttvar_us; /* smoothed mdev_max */ u32 rtt_seq; /* sequence number to update rttvar */ struct rtt_meas { diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 94d4aff..0d53537 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb) /* Called to compute a smoothed rtt estimate. The data fed to this * routine either comes from timestamps, or from segments that were * known _not_ to have been retransmitted [see Karn/Partridge - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88 - * piece by Van Jacobson. + * Proceedings SIGCOMM 87]. * NOTE: the next three routines used to be one big routine. * To save cycles in the RFC 1323 implementation it was better to break * it up into three procedures. -- erics @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us) long m = mrtt_us; /* RTT */ u32 srtt = tp->srtt_us; - /* The following amusing code comes from Jacobson's - * article in SIGCOMM '88. Note that rtt and mdev - * are scaled versions of rtt and mean deviation. - * This is designed to be as fast as possible - * m stands for "measurement". - * - * On a 1990 paper the rto value is changed to: - * RTO = rtt + 4 * mdev - * - * Funny. This algorithm seems to be very broken. - * These formulae increase RTO, when it should be decreased, increase - * too slowly, when it should be increased quickly, decrease too quickly - * etc. I guess in BSD RTO takes ONE value, so that it is absolutely - * does not matter how to _calculate_ it. Seems, it was trap - * that VJ failed to avoid. 8) - */ if (srtt != 0) { - m -= (srtt >> 3); /* m is now error in rtt est */ - srtt += m; /* rtt = 7/8 rtt + 1/8 new */ - if (m < 0) { - m = -m; /* m is now abs(error) */ - m -= (tp->mdev_us >> 2); /* similar update on mdev */ - /* This is similar to one of Eifel findings. - * Eifel blocks mdev updates when rtt decreases. - * This solution is a bit different: we use finer gain - * for mdev in this case (alpha*beta). - * Like Eifel it also prevents growth of rto, - * but also it limits too fast rto decreases, - * happening in pure Eifel. - */ - if (m > 0) - m >>= 3; - } else { - m -= (tp->mdev_us >> 2); /* similar update on mdev */ - } - tp->mdev_us += m; /* mdev = 3/4 mdev + 1/4 new */ - if (tp->mdev_us > tp->mdev_max_us) { - tp->mdev_max_us = tp->mdev_us; - if (tp->mdev_max_us > tp->rttvar_us) - tp->rttvar_us = tp->mdev_max_us; - } - if (after(tp->snd_una, tp->rtt_seq)) { - if (tp->mdev_max_us < tp->rttvar_us) - tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2; + m -= (srtt >> 3); /* m' = m - srtt / 8 = (R' - SRTT) */ + srtt += m; /* srtt = srtt + m’ = srtt + m - srtt / 8 */ + if (m < 0) + m = -m; + m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */ + tp->mdev_us += m; + tp->rttvar_us = tp->mdev_us; + if (after(tp->snd_una, tp->rtt_seq)) tp->rtt_seq = tp->snd_nxt; - tp->mdev_max_us = tcp_rto_min_us(sk); - } } else { /* no previous measure. */ - srtt = m << 3; /* take the measured time to be rtt */ - tp->mdev_us = m << 1; /* make sure rto = 3*rtt */ - tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk)); - tp->mdev_max_us = tp->rttvar_us; + srtt = m << 3; + tp->mdev_us = m << 1; + tp->rttvar_us = tp->mdev_us; tp->rtt_seq = tp->snd_nxt; } tp->srtt_us = max(1U, srtt); @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk) */ static void tcp_set_rto(struct sock *sk) { + const u32 min_rto = tcp_rto_min_us(sk); const struct tcp_sock *tp = tcp_sk(sk); /* Old crap is replaced with new one. 8) * @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk) * is invisible. Actually, Linux-2.4 also generates erratic * ACKs in some circumstances. */ - inet_csk(sk)->icsk_rto = __tcp_set_rto(tp); - + if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto) + inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto); + else + inet_csk(sk)->icsk_rto = __tcp_set_rto(tp); /* 2. Fixups made earlier cannot be right. * If we do not estimate RTO correctly without them, * all the algo is pure shit and should be replaced * with correct one. It is exactly, which we pretend to do. */ - - /* NOTE: clamping at TCP_RTO_MIN is not required, current algo - * guarantees that rto is higher. - */ tcp_bound_rto(sk); } diff --git a/net/ipv4/tcp_metrics.c b/net/ipv4/tcp_metrics.c index b617826..7f59f5b 100644 --- a/net/ipv4/tcp_metrics.c +++ b/net/ipv4/tcp_metrics.c @@ -561,7 +561,7 @@ reset: * retransmission. */ tp->rttvar_us = jiffies_to_usecs(TCP_TIMEOUT_FALLBACK); - tp->mdev_us = tp->mdev_max_us = tp->rttvar_us; + tp->mdev_us = tp->rttvar_us; inet_csk(sk)->icsk_rto = TCP_TIMEOUT_FALLBACK; }