Patchwork [2/3] net: TCP thin linear timeouts

login
register
mail settings
Submitter Andreas Petlund
Date Oct. 27, 2009, 4:31 p.m.
Message ID <4AE72079.4030504@simula.no>
Download mbox | patch
Permalink /patch/37003/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Andreas Petlund - Oct. 27, 2009, 4:31 p.m.
This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin.


Signed-off-by: Andreas Petlund <apetlund@simula.no>
---
 include/linux/tcp.h        |    3 +++
 include/net/tcp.h          |    1 +
 net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
 net/ipv4/tcp.c             |    5 +++++
 net/ipv4/tcp_timer.c       |   17 ++++++++++++++++-
 5 files changed, 33 insertions(+), 1 deletions(-)
Eric Dumazet - Oct. 27, 2009, 4:56 p.m.
Andreas Petlund a écrit :
> This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin.
> 

Wont this reduce the session timeout to something very small, ie 15 retransmits, way under the minute ?

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
William Allen Simpson - Oct. 28, 2009, 3:20 a.m.
Sorry to be too picky about the naming, but "rm_expb" really doesn't
mean what is actually done.  Perhaps TCP_THIN_LINEAR_BACKOFF and
sysctl_tcp_thin_linear_backoff?

Also, as debated on some other recent patches, shouldn't the global
sysctl specify the default, and the per socket option specify the
forced override?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Hannemann - Oct. 28, 2009, 12:58 p.m.
Eric Dumazet schrieb:
> Andreas Petlund a écrit :
>> This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin.
>>
> 
> Wont this reduce the session timeout to something very small, ie 15 retransmits, way under the minute ?

The session timeout no longer depends on the actual number of retransmits. Instead its a time interval,
which is roughly equivalent to the time a TCP, performing exponential backoff would need to perform
15 retransmits.

However, addressing the proposal:
I wonder how one can seriously suggest to just skip congestion response during timeout-based
loss recovery? I believe that in a heavily congested scenarios, this would lead to a goodput
goodput disaster... Not to mention that in a heavily congested scenario, suddenly every flow
will become "thin", so this will even amplify the problems. Or did I miss something?

Best regards,
Arnd
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ilpo Järvinen - Oct. 28, 2009, 2:18 p.m.
On Tue, 27 Oct 2009, Andreas Petlund wrote:

> This patch will make TCP use only linear timeouts if the stream is thin. This will help to avoid the very high latencies that thin stream suffer because of exponential backoff. This mechanism is only active if enabled by iocontrol or syscontrol and the stream is identified as thin.
> 
> 
> Signed-off-by: Andreas Petlund <apetlund@simula.no>
> ---
>  include/linux/tcp.h        |    3 +++
>  include/net/tcp.h          |    1 +
>  net/ipv4/sysctl_net_ipv4.c |    8 ++++++++
>  net/ipv4/tcp.c             |    5 +++++
>  net/ipv4/tcp_timer.c       |   17 ++++++++++++++++-
>  5 files changed, 33 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index 61723a7..e64368d 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -96,6 +96,7 @@ enum {
>  #define TCP_QUICKACK		12	/* Block/reenable quick acks */
>  #define TCP_CONGESTION		13	/* Congestion control algorithm */
>  #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
> +#define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
>  
>  #define TCPI_OPT_TIMESTAMPS	1
>  #define TCPI_OPT_SACK		2
> @@ -299,6 +300,8 @@ struct tcp_sock {
>  	u16	advmss;		/* Advertised MSS			*/
>  	u8	frto_counter;	/* Number of new acks after RTO */
>  	u8	nonagle;	/* Disable Nagle algorithm?             */
> +	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
> +		thin_undef : 7;
>  
>  /* RTT measurement */
>  	u32	srtt;		/* smoothed round trip time << 3	*/
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index 7c4482f..412c1bd 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -237,6 +237,7 @@ extern int sysctl_tcp_base_mss;
>  extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
>  extern int sysctl_tcp_max_ssthresh;
> +extern int sysctl_tcp_force_thin_rm_expb;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern struct percpu_counter tcp_sockets_allocated;
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 2dcf04d..7458f37 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -713,6 +713,14 @@ static struct ctl_table ipv4_table[] = {
>  		.proc_handler	= proc_dointvec,
>  	},
>  	{
> +		.ctl_name       = CTL_UNNUMBERED,
> +		.procname       = "tcp_force_thin_rm_expb",
> +		.data           = &sysctl_tcp_force_thin_rm_expb,
> +		.maxlen         = sizeof(int),
> +		.mode           = 0644,
> +		.proc_handler   = proc_dointvec
> +	},
> +	{
>  		.ctl_name	= CTL_UNNUMBERED,
>  		.procname	= "udp_mem",
>  		.data		= &sysctl_udp_mem,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 90b2e06..b4b0931 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2134,6 +2134,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  		}
>  		break;
>  
> +	case TCP_THIN_RM_EXPB:
> +		if (val)
> +			tp->thin_rm_expb = 1;
> +		break;
> +
>  	case TCP_CORK:
>  		/* When set indicates to always queue non-full frames.
>  		 * Later the user clears this option and we transmit
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index cdb2ca7..24d6dc3 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_force_thin_rm_expb __read_mostly;
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);
> @@ -386,7 +387,21 @@ void tcp_retransmit_timer(struct sock *sk)
>  	icsk->icsk_retransmits++;
>  
>  out_reset_timer:
> -	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> +	if ((tp->thin_rm_expb || sysctl_tcp_force_thin_rm_expb) &&
> +	    tcp_stream_is_thin(tp) && sk->sk_state == TCP_ESTABLISHED) {
> +		/* If stream is thin, remove exponential backoff.
> +		 * Since 'icsk_backoff' is used to reset timer, set to 0
> +		 * Recalculate 'icsk_rto' as this might be increased if
> +		 * stream oscillates between thin and thick, thus the old
> +		 * value might already be too high compared to the value
> +		 * set by 'tcp_set_rto' in tcp_input.c which resets the
> +		 * rto without backoff. */
> +		icsk->icsk_backoff = 0;
> +		icsk->icsk_rto = min(((tp->srtt >> 3) + tp->rttvar), TCP_RTO_MAX);

The first part is nowadays done with __tcp_set_rto(tp).
Ilpo Järvinen - Oct. 28, 2009, 2:31 p.m.
On Wed, 28 Oct 2009, Arnd Hannemann wrote:

> Eric Dumazet schrieb:
> > Andreas Petlund a écrit :
> >> This patch will make TCP use only linear timeouts if the stream is 
> >> thin. This will help to avoid the very high latencies that thin 
> >> stream suffer because of exponential backoff. This mechanism is only 
> >> active if enabled by iocontrol or syscontrol and the stream is 
> >> identified as thin. 

...I don't see how high latency is in any connection to stream being 
"thin" or not btw. If all ACKs are lost it usually requires silence for 
the full RTT, which affects a stream regardless of its size. ...If not all 
ACKs are lost, then the dupACK approach in the other patch should cover 
it already.

> However, addressing the proposal:
> I wonder how one can seriously suggest to just skip congestion response 
> during timeout-based loss recovery? I believe that in a heavily 
> congested scenarios, this would lead to a goodput disaster... Not to 
> mention that in a heavily congested scenario, suddenly every flow will 
> become "thin", so this will even amplify the problems. Or did I miss 
> something?

Good point. I suppose such an under-provisioned network can certainly be
there. I have heard that at least some people who remove exponential back 
off apply it later on nth retransmission as very often there really isn't 
such a super heavy congestion scenario but something completely unrelated 
to congestion which causes the RTO.
Andreas Petlund - Oct. 29, 2009, 1:50 p.m.
Den 28. okt. 2009 kl. 04.20 skrev William Allen Simpson:

> Sorry to be too picky about the naming, but "rm_expb" really doesn't
> mean what is actually done.  Perhaps TCP_THIN_LINEAR_BACKOFF and
> sysctl_tcp_thin_linear_backoff?
>

I agree that the name should be changed. As it represents linear  
timeouts and does not back off exponentially, I suggest  
TCP_THIN_LINEAR_TIMEOUTS and sysctl_tcp_thin_linear_timeouts.

> Also, as debated on some other recent patches, shouldn't the global
> sysctl specify the default, and the per socket option specify the
> forced override?

The rationale behind the suggested model is to allow the modifications  
to be forced active in cases where the application that generates the  
thin stream is proprietary, legacy, or that the code for other reasons  
can not be modified (as is the case, for instance, for many game  
clients).
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Petlund - Oct. 29, 2009, 1:51 p.m.
Den 28. okt. 2009 kl. 15.31 skrev Ilpo Järvinen:

> On Wed, 28 Oct 2009, Arnd Hannemann wrote:
>
>> Eric Dumazet schrieb:
>>> Andreas Petlund a écrit :
>>>> This patch will make TCP use only linear timeouts if the stream is
>>>> thin. This will help to avoid the very high latencies that thin
>>>> stream suffer because of exponential backoff. This mechanism is  
>>>> only
>>>> active if enabled by iocontrol or syscontrol and the stream is
>>>> identified as thin.
>
> ...I don't see how high latency is in any connection to stream being
> "thin" or not btw. If all ACKs are lost it usually requires silence  
> for
> the full RTT, which affects a stream regardless of its size. ...If  
> not all
> ACKs are lost, then the dupACK approach in the other patch should  
> cover
> it already.
>

The increased latency that we observed does not arise from lost ACKs,  
but from the lack of enough packets in flight to be able to trigger  
fast retransmits. This effectively limits the retransmission options  
to retransmission by timeout, which again will increase exponentially  
with each subsequent retransmissions. We have also found that the  
"thin" stream patterns are very often generated by applications where  
human interaction is the purpose. Such applications will give a  
degraded experience to the user if such high latencies happen often.  
In-depth discussion of these effects can be found in the papers I  
linked to.
If the application produces less than one packet per RTT, the dupACK- 
modification will be ineffective and any improved latency will be from  
linear timeouts. If the number of packets in flight are 2-4, no fast  
retransmissions may be triggered based on the 3 dupACK scheme, but a  
retransmission upon the first indication of loss will improve  
retransmission latency.

>> However, addressing the proposal:
>> I wonder how one can seriously suggest to just skip congestion  
>> response
>> during timeout-based loss recovery? I believe that in a heavily
>> congested scenarios, this would lead to a goodput disaster... Not to
>> mention that in a heavily congested scenario, suddenly every flow  
>> will
>> become "thin", so this will even amplify the problems. Or did I miss
>> something?
>
> Good point. I suppose such an under-provisioned network can  
> certainly be
> there. I have heard that at least some people who remove exponential  
> back
> off apply it later on nth retransmission as very often there really  
> isn't
> such a super heavy congestion scenario but something completely  
> unrelated
> to congestion which causes the RTO.
>
> -- 
> i.


The removal of exponential backoff on a general basis has been  
investigated and discussed already, for instance here:
http://ccr.sigcomm.org/online/?q=node/416
Such steps are, however considered drastic, and I agree that caution  
must be made to thoroughly investigate the effects of such changes.
The changes introduced by the proposed patches, however, are not  
default behaviour, but an option for applications that suffer from the  
thin-stream TCP increased retransmission latencies. They will, as  
such, not affect all streams. In addition, the changes will only be  
active for streams which are perpetually thin or in the early phase of  
expanding their cwnd. Also, experiments performed on congested  
bottlenecks with tail-drop queues show very little (if any at all)  
effect on goodput for the modified scenario compared to a scenario  
with unmodified TCP streams.

Graphs both for latency-results and fairness tests can be found here:
http://folk.uio.no/apetlund/lktmp/

-AP--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet - Oct. 29, 2009, 2:24 p.m.
Andreas Petlund a écrit :

> 
> The removal of exponential backoff on a general basis has been
> investigated and discussed already, for instance here:
> http://ccr.sigcomm.org/online/?q=node/416
> Such steps are, however considered drastic, and I agree that caution
> must be made to thoroughly investigate the effects of such changes.
> The changes introduced by the proposed patches, however, are not default
> behaviour, but an option for applications that suffer from the
> thin-stream TCP increased retransmission latencies. They will, as such,
> not affect all streams. In addition, the changes will only be active for
> streams which are perpetually thin or in the early phase of expanding
> their cwnd. Also, experiments performed on congested bottlenecks with
> tail-drop queues show very little (if any at all) effect on goodput for
> the modified scenario compared to a scenario with unmodified TCP streams.
> 
> Graphs both for latency-results and fairness tests can be found here:
> http://folk.uio.no/apetlund/lktmp/
> 

There should be a limit to linear timeouts, to say ... no more than 6 retransmits
(eventually tunable), then switch to exponential backoff. Maybe your patch
already implement such heuristic ?

True link collapses do happen, it would be good if not all streams wakeup in the same
second and make recovery very slow.

Thats too easy to accept possibly dangerous features with the excuse of saying
"It wont be used very much", because you cannot predict the future.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Arnd Hannemann - Oct. 29, 2009, 4:11 p.m.
Andreas Petlund schrieb:
> We have found no noticeable degradation of the goodput in a series of
> experiments we have performed in order to map the effects of the
> modifications. Furthermore, the modifications implemented in the patches
> are explicitly enabled only for applications where the developer knows
> that streams will be thin, thus only a small subset of the streams will
> apply the modifications.  
> 
> Graphs presenting results from experiments performed to analyse latency
> and fairness issues can be found here:
> http://folk.uio.no/apetlund/lktmp/

How often did you hit consecutive RTOs in these measurements?
As I see you did a measurement with 512 thick vs. 512 thin streams.
Lets do a hypothetical calculation with only 512 "thin" streams.
Lets further assume the rtt is low, so that RTO is around 200ms.
Assume each segment has 128 Bytes (already very small...).
Assume after a period of normal operation all streams are in
timeout-based loss recovery. (e.g. because destination endpoint
suddenly behaves like a black hole)
As all streams are in timeout-based loss recovery, each stream
will transmit 5 segments each second with your modification.
This would result in a throughput around 512*5*1024bit = 2560 kbit/s
and a goodput of 0 kbit/s (because the receiver is a black hole).
So you can easily saturate a 2 MBit/s link, only with retransmissions.

Unfortunately in Germany an ADSL uplink of 786 kbit/s is still quite
common, and its already called "broadband"...

Regarding the "small subset", why have a global sysctl option, then?
And I think "tcp_stream_is_thin(tp)" will be true for every flow
in the RTO case, at least for consecutive RTOs.

Best regards,
Arnd Hannemann
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones - Oct. 29, 2009, 5:01 p.m.
Just how thin can a thin stream be when a thin stream is found thin? (to the 
cadence of "How much wood could a woodchuck chuck if a woodchuck could chuck wood?")

Does a stream get so thin that a user's send could not be split into four, 
sub-MSS TCP segments?

rick jones
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 61723a7..e64368d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -96,6 +96,7 @@  enum {
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
 #define TCP_MD5SIG		14	/* TCP MD5 Signature (RFC2385) */
+#define TCP_THIN_RM_EXPB        15      /* Remove exp. backoff for thin streams*/
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -299,6 +300,8 @@  struct tcp_sock {
 	u16	advmss;		/* Advertised MSS			*/
 	u8	frto_counter;	/* Number of new acks after RTO */
 	u8	nonagle;	/* Disable Nagle algorithm?             */
+	u8      thin_rm_expb:1, /* Remove exp. backoff for thin streams */
+		thin_undef : 7;
 
 /* RTT measurement */
 	u32	srtt;		/* smoothed round trip time << 3	*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 7c4482f..412c1bd 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -237,6 +237,7 @@  extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
 extern int sysctl_tcp_max_ssthresh;
+extern int sysctl_tcp_force_thin_rm_expb;
 
 extern atomic_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 2dcf04d..7458f37 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -713,6 +713,14 @@  static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec,
 	},
 	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "tcp_force_thin_rm_expb",
+		.data           = &sysctl_tcp_force_thin_rm_expb,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec
+	},
+	{
 		.ctl_name	= CTL_UNNUMBERED,
 		.procname	= "udp_mem",
 		.data		= &sysctl_udp_mem,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 90b2e06..b4b0931 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2134,6 +2134,11 @@  static int do_tcp_setsockopt(struct sock *sk, int level,
 		}
 		break;
 
+	case TCP_THIN_RM_EXPB:
+		if (val)
+			tp->thin_rm_expb = 1;
+		break;
+
 	case TCP_CORK:
 		/* When set indicates to always queue non-full frames.
 		 * Later the user clears this option and we transmit
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index cdb2ca7..24d6dc3 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@  int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
 int sysctl_tcp_orphan_retries __read_mostly;
+int sysctl_tcp_force_thin_rm_expb __read_mostly;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
@@ -386,7 +387,21 @@  void tcp_retransmit_timer(struct sock *sk)
 	icsk->icsk_retransmits++;
 
 out_reset_timer:
-	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
+	if ((tp->thin_rm_expb || sysctl_tcp_force_thin_rm_expb) &&
+	    tcp_stream_is_thin(tp) && sk->sk_state == TCP_ESTABLISHED) {
+		/* If stream is thin, remove exponential backoff.
+		 * Since 'icsk_backoff' is used to reset timer, set to 0
+		 * Recalculate 'icsk_rto' as this might be increased if
+		 * stream oscillates between thin and thick, thus the old
+		 * value might already be too high compared to the value
+		 * set by 'tcp_set_rto' in tcp_input.c which resets the
+		 * rto without backoff. */
+		icsk->icsk_backoff = 0;
+		icsk->icsk_rto = min(((tp->srtt >> 3) + tp->rttvar), TCP_RTO_MAX);
+	} else {
+		/* Use normal backoff */
+		icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
+	}
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
 	if (retransmits_timed_out(sk, sysctl_tcp_retries1 + 1))
 		__sk_dst_reset(sk);