diff mbox

[net-next] tcp: use RFC6298 compliant TCP RTO calculation

Message ID 20160613204529.18826-1-dmetz@mytum.de
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Daniel Metz June 13, 2016, 8:45 p.m. UTC
This patch adjusts Linux RTO calculation to be RFC6298 Standard
compliant. MinRTO is no longer added to the computed RTO, RTO damping
and overestimation are decreased.

In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the
calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is
less. The Linux implementation as a discrepancy to the Standard
basically adds the defined MinRTO to the calculated RTO. When
comparing both approaches, the Linux calculation seems to perform
worse for sender limited TCP flows like Telnet, SSH or constant bit
rate encoded transmissions, especially for Round Trip Times (RTT) of
50ms to 800ms.

Compared to the Linux implementation the RFC 6298 proposed RTO
calculation performs better and more precise in adapting to current
network characteristics. Extensive measurements for bulk data did not
show a negative impact of the adjusted calculation.

Exemplary performance comparison for sender-limited-flows:

- Rate: 10Mbit/s
- Delay: 200ms, Delay Variation: 10ms
- Time between each scheduled segment: 1s
- Amount Data Segments: 300
- Mean of 11 runs

         Mean Response Waiting Time [milliseconds]

PER [%] |   0.5      1    1.5      2      3      5      7     10
--------+-------------------------------------------------------
old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1


Detailed Analysis:

https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/


Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Signed-off-by: Daniel Metz <dmetz@mytum.de>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
---

We removed outdated comments in the code, though more cleanup may required.


 net/ipv4/tcp_input.c | 74 ++++++++++++----------------------------------------
 1 file changed, 17 insertions(+), 57 deletions(-)

Comments

Eric Dumazet June 13, 2016, 9:19 p.m. UTC | #1
On Mon, 2016-06-13 at 22:45 +0200, Daniel Metz wrote:
> This patch adjusts Linux RTO calculation to be RFC6298 Standard
> compliant. MinRTO is no longer added to the computed RTO, RTO damping
> and overestimation are decreased.
> 
> In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the
> calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is
> less. The Linux implementation as a discrepancy to the Standard
> basically adds the defined MinRTO to the calculated RTO. When
> comparing both approaches, the Linux calculation seems to perform
> worse for sender limited TCP flows like Telnet, SSH or constant bit
> rate encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
> 
> Compared to the Linux implementation the RFC 6298 proposed RTO
> calculation performs better and more precise in adapting to current
> network characteristics. Extensive measurements for bulk data did not
> show a negative impact of the adjusted calculation.
> 
> Exemplary performance comparison for sender-limited-flows:
> 
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Amount Data Segments: 300
> - Mean of 11 runs
> 
>          Mean Response Waiting Time [milliseconds]
> 
> PER [%] |   0.5      1    1.5      2      3      5      7     10
> --------+-------------------------------------------------------
> old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
> new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
> 
> 
> Detailed Analysis:
> 
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
> 
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Signed-off-by: Daniel Metz <dmetz@mytum.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
> 
> We removed outdated comments in the code, though more cleanup may required.
> 

It seems you left mdev_max_us ?
Yuchung Cheng June 13, 2016, 10:38 p.m. UTC | #2
On Mon, Jun 13, 2016 at 1:45 PM, Daniel Metz <dmetz@mytum.de> wrote:
> This patch adjusts Linux RTO calculation to be RFC6298 Standard
> compliant. MinRTO is no longer added to the computed RTO, RTO damping
> and overestimation are decreased.
>
> In RFC 6298 Standard TCP Retransmission Timeout (RTO) calculation the
> calculated RTO is rounded up to the Minimum RTO (MinRTO), if it is
> less. The Linux implementation as a discrepancy to the Standard
> basically adds the defined MinRTO to the calculated RTO. When
> comparing both approaches, the Linux calculation seems to perform
> worse for sender limited TCP flows like Telnet, SSH or constant bit
> rate encoded transmissions, especially for Round Trip Times (RTT) of
> 50ms to 800ms.
>
> Compared to the Linux implementation the RFC 6298 proposed RTO
> calculation performs better and more precise in adapting to current
> network characteristics. Extensive measurements for bulk data did not
> show a negative impact of the adjusted calculation.
>
> Exemplary performance comparison for sender-limited-flows:
>
> - Rate: 10Mbit/s
> - Delay: 200ms, Delay Variation: 10ms
> - Time between each scheduled segment: 1s
> - Amount Data Segments: 300
> - Mean of 11 runs
>
>          Mean Response Waiting Time [milliseconds]
>
> PER [%] |   0.5      1    1.5      2      3      5      7     10
> --------+-------------------------------------------------------
> old     | 206.4  208.6  218.0  218.6  227.2  249.3  274.7  308.2
> new     | 203.9  206.0  207.0  209.9  217.3  225.6  238.7  259.1
>
>
> Detailed Analysis:
>
> https://docs.google.com/document/d/1pKmPfnQb6fDK4qpiNVkN8cQyGE4wYDZukcuZfR-BnnM/
Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.

Since this is not a small change, and your patch is only tested on
emulation-based testbed AFAICT, I'd like to try your patch on Google
servers to get more data. But this would take a few days to setup &
collect.

Note that this paper
https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
detailed rationale of current design (section 4). IMO having a "tight"
RTO is less necessary now after TLP. I am also testing a new set of
patches to install a quick reordering timer. But it's worth mentioning
the paper in the commit message.


>
>
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
> Signed-off-by: Daniel Metz <dmetz@mytum.de>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> ---
>
> We removed outdated comments in the code, though more cleanup may required.
>
>
>  net/ipv4/tcp_input.c | 74 ++++++++++++----------------------------------------
>  1 file changed, 17 insertions(+), 57 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index d6c8f4cd0..a0f66f8 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -680,8 +680,7 @@ static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
>  /* Called to compute a smoothed rtt estimate. The data fed to this
>   * routine either comes from timestamps, or from segments that were
>   * known _not_ to have been retransmitted [see Karn/Partridge
> - * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
> - * piece by Van Jacobson.
> + * Proceedings SIGCOMM 87].
>   * NOTE: the next three routines used to be one big routine.
>   * To save cycles in the RFC 1323 implementation it was better to break
>   * it up into three procedures. -- erics
> @@ -692,59 +691,21 @@ static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
>         long m = mrtt_us; /* RTT */
>         u32 srtt = tp->srtt_us;
>
> -       /*      The following amusing code comes from Jacobson's
> -        *      article in SIGCOMM '88.  Note that rtt and mdev
> -        *      are scaled versions of rtt and mean deviation.
> -        *      This is designed to be as fast as possible
> -        *      m stands for "measurement".
> -        *
> -        *      On a 1990 paper the rto value is changed to:
> -        *      RTO = rtt + 4 * mdev
> -        *
> -        * Funny. This algorithm seems to be very broken.
> -        * These formulae increase RTO, when it should be decreased, increase
> -        * too slowly, when it should be increased quickly, decrease too quickly
> -        * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
> -        * does not matter how to _calculate_ it. Seems, it was trap
> -        * that VJ failed to avoid. 8)
> -        */
>         if (srtt != 0) {
> -               m -= (srtt >> 3);       /* m is now error in rtt est */
> -               srtt += m;              /* rtt = 7/8 rtt + 1/8 new */
> -               if (m < 0) {
> -                       m = -m;         /* m is now abs(error) */
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -                       /* This is similar to one of Eifel findings.
> -                        * Eifel blocks mdev updates when rtt decreases.
> -                        * This solution is a bit different: we use finer gain
> -                        * for mdev in this case (alpha*beta).
> -                        * Like Eifel it also prevents growth of rto,
> -                        * but also it limits too fast rto decreases,
> -                        * happening in pure Eifel.
> -                        */
> -                       if (m > 0)
> -                               m >>= 3;
> -               } else {
> -                       m -= (tp->mdev_us >> 2);   /* similar update on mdev */
> -               }
> -               tp->mdev_us += m;               /* mdev = 3/4 mdev + 1/4 new */
> -               if (tp->mdev_us > tp->mdev_max_us) {
> -                       tp->mdev_max_us = tp->mdev_us;
> -                       if (tp->mdev_max_us > tp->rttvar_us)
> -                               tp->rttvar_us = tp->mdev_max_us;
> -               }
> -               if (after(tp->snd_una, tp->rtt_seq)) {
> -                       if (tp->mdev_max_us < tp->rttvar_us)
> -                               tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
> +               m -= (srtt >> 3);        /* m' = m - srtt / 8 = (R' - SRTT) */
> +               srtt += m;               /* srtt = srtt + m’ = srtt + m - srtt / 8 */
> +               if (m < 0)
> +                       m = -m;
> +               m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
> +               tp->mdev_us += m;
> +               tp->rttvar_us = tp->mdev_us;
> +               if (after(tp->snd_una, tp->rtt_seq))
>                         tp->rtt_seq = tp->snd_nxt;
> -                       tp->mdev_max_us = tcp_rto_min_us(sk);
> -               }
>         } else {
>                 /* no previous measure. */
> -               srtt = m << 3;          /* take the measured time to be rtt */
> -               tp->mdev_us = m << 1;   /* make sure rto = 3*rtt */
> -               tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
> -               tp->mdev_max_us = tp->rttvar_us;
> +               srtt = m << 3;
> +               tp->mdev_us = m << 1;
> +               tp->rttvar_us = tp->mdev_us;
>                 tp->rtt_seq = tp->snd_nxt;
>         }
>         tp->srtt_us = max(1U, srtt);
> @@ -798,6 +759,7 @@ static void tcp_update_pacing_rate(struct sock *sk)
>   */
>  static void tcp_set_rto(struct sock *sk)
>  {
> +       const u32 min_rto = tcp_rto_min_us(sk);
>         const struct tcp_sock *tp = tcp_sk(sk);
>         /* Old crap is replaced with new one. 8)
>          *
> @@ -809,17 +771,15 @@ static void tcp_set_rto(struct sock *sk)
>          *    is invisible. Actually, Linux-2.4 also generates erratic
>          *    ACKs in some circumstances.
>          */
> -       inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
> -
> +       if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
> +               inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
> +       else
> +               inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
>         /* 2. Fixups made earlier cannot be right.
>          *    If we do not estimate RTO correctly without them,
>          *    all the algo is pure shit and should be replaced
>          *    with correct one. It is exactly, which we pretend to do.
>          */
> -
> -       /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
> -        * guarantees that rto is higher.
> -        */
>         tcp_bound_rto(sk);
>  }
>
> --
> 2.8.3
>
Hagen Paul Pfeifer June 14, 2016, 6:17 a.m. UTC | #3
* Yuchung Cheng | 2016-06-13 15:38:24 [-0700]:

Hey Eric, Yuchung,

regarding the missed mdev_max_us: internal communication problem. Daniel well
respin a v2 removing the no longer required mdev_max_us.

>Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
>
>Since this is not a small change, and your patch is only tested on
>emulation-based testbed AFAICT, I'd like to try your patch on Google
>servers to get more data. But this would take a few days to setup &
>collect.

Great - no hurry! We tried hard to find any downsides of RFC 6298 so far
without any result. If you have any special & concrete tests in mind: Daniel
will test it!

>Note that this paper
>https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
>detailed rationale of current design (section 4). IMO having a "tight"
>RTO is less necessary now after TLP. I am also testing a new set of
>patches to install a quick reordering timer. But it's worth mentioning
>the paper in the commit message.

We had "difficulties" to find scenarios where the RTO kicks-in. For the
majority of use cases duplicate ACKs triggers TCP retransmission. For bulk
data transmissions almost 100% of retransmissions are triggered by duplicate
ACKs (except connection teardown). TLP will reduce the requirement for RTO
even further, also window probes helps sometimes. The use case we realized was
sender limited, non-continuous flows where a RFC 6298 compliant implementation
is better.

Thank you Yuchung, we will add an reference in v2.

Hagen
Yuchung Cheng June 14, 2016, 5:58 p.m. UTC | #4
On Mon, Jun 13, 2016 at 11:17 PM, Hagen Paul Pfeifer <hagen@jauu.net> wrote:
>
> * Yuchung Cheng | 2016-06-13 15:38:24 [-0700]:
>
> Hey Eric, Yuchung,
>
> regarding the missed mdev_max_us: internal communication problem. Daniel well
> respin a v2 removing the no longer required mdev_max_us.
>
> >Thanks for the patch. I also have long wanted to evaluate Linux's RTO vs RFC's.
> >
> >Since this is not a small change, and your patch is only tested on
> >emulation-based testbed AFAICT, I'd like to try your patch on Google
> >servers to get more data. But this would take a few days to setup &
> >collect.
>
> Great - no hurry! We tried hard to find any downsides of RFC 6298 so far
> without any result. If you have any special & concrete tests in mind: Daniel
> will test it!
>
> >Note that this paper
> >https://www.cs.helsinki.fi/research/iwtcp/papers/linuxtcp.pdf has
> >detailed rationale of current design (section 4). IMO having a "tight"
> >RTO is less necessary now after TLP. I am also testing a new set of
> >patches to install a quick reordering timer. But it's worth mentioning
> >the paper in the commit message.
>
> We had "difficulties" to find scenarios where the RTO kicks-in. For the
> majority of use cases duplicate ACKs triggers TCP retransmission. For bulk
> data transmissions almost 100% of retransmissions are triggered by duplicate
> ACKs (except connection teardown). TLP will reduce the requirement for RTO
> even further, also window probes helps sometimes. The use case we realized was
> sender limited, non-continuous flows where a RFC 6298 compliant implementation
> is better.
>
> Thank you Yuchung, we will add an reference in v2.
Sounds good. I will wait for v2 to do the experiment.
>
> Hagen
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d6c8f4cd0..a0f66f8 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -680,8 +680,7 @@  static void tcp_event_data_recv(struct sock *sk, struct sk_buff *skb)
 /* Called to compute a smoothed rtt estimate. The data fed to this
  * routine either comes from timestamps, or from segments that were
  * known _not_ to have been retransmitted [see Karn/Partridge
- * Proceedings SIGCOMM 87]. The algorithm is from the SIGCOMM 88
- * piece by Van Jacobson.
+ * Proceedings SIGCOMM 87].
  * NOTE: the next three routines used to be one big routine.
  * To save cycles in the RFC 1323 implementation it was better to break
  * it up into three procedures. -- erics
@@ -692,59 +691,21 @@  static void tcp_rtt_estimator(struct sock *sk, long mrtt_us)
 	long m = mrtt_us; /* RTT */
 	u32 srtt = tp->srtt_us;
 
-	/*	The following amusing code comes from Jacobson's
-	 *	article in SIGCOMM '88.  Note that rtt and mdev
-	 *	are scaled versions of rtt and mean deviation.
-	 *	This is designed to be as fast as possible
-	 *	m stands for "measurement".
-	 *
-	 *	On a 1990 paper the rto value is changed to:
-	 *	RTO = rtt + 4 * mdev
-	 *
-	 * Funny. This algorithm seems to be very broken.
-	 * These formulae increase RTO, when it should be decreased, increase
-	 * too slowly, when it should be increased quickly, decrease too quickly
-	 * etc. I guess in BSD RTO takes ONE value, so that it is absolutely
-	 * does not matter how to _calculate_ it. Seems, it was trap
-	 * that VJ failed to avoid. 8)
-	 */
 	if (srtt != 0) {
-		m -= (srtt >> 3);	/* m is now error in rtt est */
-		srtt += m;		/* rtt = 7/8 rtt + 1/8 new */
-		if (m < 0) {
-			m = -m;		/* m is now abs(error) */
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-			/* This is similar to one of Eifel findings.
-			 * Eifel blocks mdev updates when rtt decreases.
-			 * This solution is a bit different: we use finer gain
-			 * for mdev in this case (alpha*beta).
-			 * Like Eifel it also prevents growth of rto,
-			 * but also it limits too fast rto decreases,
-			 * happening in pure Eifel.
-			 */
-			if (m > 0)
-				m >>= 3;
-		} else {
-			m -= (tp->mdev_us >> 2);   /* similar update on mdev */
-		}
-		tp->mdev_us += m;		/* mdev = 3/4 mdev + 1/4 new */
-		if (tp->mdev_us > tp->mdev_max_us) {
-			tp->mdev_max_us = tp->mdev_us;
-			if (tp->mdev_max_us > tp->rttvar_us)
-				tp->rttvar_us = tp->mdev_max_us;
-		}
-		if (after(tp->snd_una, tp->rtt_seq)) {
-			if (tp->mdev_max_us < tp->rttvar_us)
-				tp->rttvar_us -= (tp->rttvar_us - tp->mdev_max_us) >> 2;
+		m -= (srtt >> 3);	 /* m' = m - srtt / 8 = (R' - SRTT) */
+		srtt += m;		 /* srtt = srtt + m’ = srtt + m - srtt / 8 */
+		if (m < 0)
+			m = -m;
+		m -= (tp->mdev_us >> 2); /* m'' = |m'| - mdev / 4 */
+		tp->mdev_us += m;
+		tp->rttvar_us = tp->mdev_us;
+		if (after(tp->snd_una, tp->rtt_seq))
 			tp->rtt_seq = tp->snd_nxt;
-			tp->mdev_max_us = tcp_rto_min_us(sk);
-		}
 	} else {
 		/* no previous measure. */
-		srtt = m << 3;		/* take the measured time to be rtt */
-		tp->mdev_us = m << 1;	/* make sure rto = 3*rtt */
-		tp->rttvar_us = max(tp->mdev_us, tcp_rto_min_us(sk));
-		tp->mdev_max_us = tp->rttvar_us;
+		srtt = m << 3;
+		tp->mdev_us = m << 1;
+		tp->rttvar_us = tp->mdev_us;
 		tp->rtt_seq = tp->snd_nxt;
 	}
 	tp->srtt_us = max(1U, srtt);
@@ -798,6 +759,7 @@  static void tcp_update_pacing_rate(struct sock *sk)
  */
 static void tcp_set_rto(struct sock *sk)
 {
+	const u32 min_rto = tcp_rto_min_us(sk);
 	const struct tcp_sock *tp = tcp_sk(sk);
 	/* Old crap is replaced with new one. 8)
 	 *
@@ -809,17 +771,15 @@  static void tcp_set_rto(struct sock *sk)
 	 *    is invisible. Actually, Linux-2.4 also generates erratic
 	 *    ACKs in some circumstances.
 	 */
-	inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
-
+	if (((tp->srtt_us >> 3) + tp->rttvar_us) < min_rto)
+		inet_csk(sk)->icsk_rto = usecs_to_jiffies(min_rto);
+	else
+		inet_csk(sk)->icsk_rto = __tcp_set_rto(tp);
 	/* 2. Fixups made earlier cannot be right.
 	 *    If we do not estimate RTO correctly without them,
 	 *    all the algo is pure shit and should be replaced
 	 *    with correct one. It is exactly, which we pretend to do.
 	 */
-
-	/* NOTE: clamping at TCP_RTO_MIN is not required, current algo
-	 * guarantees that rto is higher.
-	 */
 	tcp_bound_rto(sk);
 }