
FYI - TCP/IP thin stream latency

Message ID Pine.LNX.4.64.0810231516220.7072@wrl-59.cs.helsinki.fi
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Ilpo Järvinen Oct. 23, 2008, 12:22 p.m. UTC
On Wed, 22 Oct 2008, David Miller wrote:

> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Thu, 16 Oct 2008 09:55:18 -0700
> 
> >          when retransmitting send full MTU (vs redoing only that send)
> 
> We already do this (partially), see tcp_retransmit_skb():
> 
> 	/* Collapse two adjacent packets if worthwhile and we can. */
> 	if (!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_SYN) &&
> 	    (skb->len < (cur_mss >> 1)) &&
> 	    (!tcp_skb_is_last(sk, skb)) &&
> 	    (tcp_write_queue_next(sk, skb) != tcp_send_head(sk)) &&
> 	    (skb_shinfo(skb)->nr_frags == 0 &&
> 	     skb_shinfo(tcp_write_queue_next(sk, skb))->nr_frags == 0) &&
> 	    (tcp_skb_pcount(skb) == 1 &&
> 	     tcp_skb_pcount(tcp_write_queue_next(sk, skb)) == 1) &&
> 	    (sysctl_tcp_retrans_collapse != 0))
> 		tcp_retrans_try_collapse(sk, skb, cur_mss);
> 
> We'd need to make this thing loop until no further progress can be
> made.

Something along these lines...? It's not that trivial, though, to get
a full MSS worth of contiguous area if SACK is enabled, unless we
make it include already-SACKed areas too.

--
[PATCH not-now :-)] tcp: collapse more than two on retransmission

I had always thought that collapsing only up to two at a time was
an intentional decision to avoid excessive processing if 1-byte
sized skbs are to be combined for a full MTU, and consecutive
retransmissions would make the size of the retransmitted skb
double each round anyway.
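
(Worked example, illustrative arithmetic only: with 1-byte writes and
a 1460-byte MSS, pairwise collapsing doubles the contiguous size each
retransmission round, 1 -> 2 -> 4 -> ..., so a full segment is rebuilt
after about ceil(log2(1460)) = 11 rounds anyway.)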

tcp_skb_is_last check is now provided by the loop.

I tried to make the order of tests a bit more sensible so that
it breaks earlier for what seem to be the more common cases
(in favor of TSO and SG, though the latter is a restriction
which could be made less strict, I think).

Barely compile tested.

Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
---
 net/ipv4/tcp_output.c |   98 ++++++++++++++++++++++++++++++------------------
 1 files changed, 61 insertions(+), 37 deletions(-)

Comments

Stephen Hemminger Oct. 24, 2008, 2:59 p.m. UTC | #1
I was reading "Removing Exponential Backoff from TCP" by Mondal and Kuzmanovic at Northwestern
in ACM CCR. The paper tries to show that nothing breaks by getting rid of backoff. Interestingly,
FreeBSD already doesn't do backoff until after the 5th retry.
David Miller Oct. 28, 2008, 10:48 p.m. UTC | #2
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Thu, 23 Oct 2008 15:22:17 +0300 (EEST)

> [PATCH not-now :-)] tcp: collapse more than two on retransmission
> 
> I had always thought that collapsing only up to two at a time was
> an intentional decision to avoid excessive processing if 1-byte
> sized skbs are to be combined for a full MTU, and consecutive
> retransmissions would make the size of the retransmitted skb
> double each round anyway.
> 
> tcp_skb_is_last check is now provided by the loop.
> 
> I tried to make the order of tests a bit more sensible so that
> it breaks earlier for what seem to be the more common cases
> (in favor of TSO and SG, though the latter is a restriction
> which could be made less strict, I think).
> 
> Barely compile tested.
> 
> Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>

Small error:

> +static int tcp_can_collapse(struct sock *sk, struct sk_buff *skb)
> +{
> +	if (tcp_skb_pcount(skb) > 1)
> +		return 0;
> +	/* Once recombining for SACK completes this check could be made
> +	 * less strict by reusing those parts.
> +	 */
> +	if (skb_shinfo(skb)->nr_frags != 0)
> +		return 0;
> +	if (skb_cloned(skb))
> +		return 0;
> +	if (skb == tcp_send_head(sk))
> +		return 0;
> +	/* Some heuristics for collapsing over SACK'd could be invented */
> +	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
> +		return 0;
> +
> +	return 1;
> +}

For the "skb == tcp_send_head()" test, we're not interested in whether
"skb" is equal, but rather whether next_skb is equal.

But the structure looks fine and when you send me a tested version
of this patch I'll apply it to net-next-2.6, thanks!
David Miller Oct. 29, 2008, 5:13 a.m. UTC | #3
From: Stephen Hemminger <shemminger@vyatta.com>
Date: Fri, 24 Oct 2008 07:59:38 -0700

> I was reading "Removing Exponential Backoff from TCP" by Mondal and
> Kuzmanovic at Northwestern in ACM CCR. The paper tries to show that
> nothing breaks by getting rid of backoff. Interestingly, FreeBSD
> already doesn't do backoff until after the 5th retry.

Cell phone network folks want this (or something like it) too.
Rick Jones Oct. 29, 2008, 6:13 p.m. UTC | #4
David Miller wrote:
> From: Stephen Hemminger <shemminger@vyatta.com>
> Date: Fri, 24 Oct 2008 07:59:38 -0700
> 
> 
>>I was reading "Removing Exponential Backoff from TCP" by Mondal and
>>Kuzmanovic at Northwestern in ACM CCR. The paper tries to show that
>>nothing breaks by getting rid of backoff. Interestingly, FreeBSD
>>already doesn't do backoff until after the 5th retry.
> 
> 
> Cell phone network folks want this (or something like it) too.

Wouldn't one get that by being able to tune TCP_RTO_MAX on some sort
of basis (per destination, perhaps)?
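
There's no such knob today; a purely hypothetical per-socket sketch of
that kind of tuning might look like the following (TCP_RTO_MAX is a
made-up option name here, not an existing sockopt):

	#include <netinet/in.h>
	#include <netinet/tcp.h>
	#include <sys/socket.h>

	#ifndef TCP_RTO_MAX
	#define TCP_RTO_MAX 42	/* hypothetical option number */
	#endif

	/* Hypothetical: cap this connection's retransmission timeout.
	 * On a real kernel setsockopt() would fail with ENOPROTOOPT;
	 * this only illustrates the shape of the suggested knob. */
	static int set_rto_max(int fd, int max_ms)
	{
		return setsockopt(fd, IPPROTO_TCP, TCP_RTO_MAX,
				  &max_ms, sizeof(max_ms));
	}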

rick jones
Douglas Leith Nov. 8, 2008, 11:16 a.m. UTC | #5
I finally managed to make time to read the paper by Mondal and
Kuzmanovic properly.

The basic argument made is that removing RTO backoff (i.e.
disabling the doubling of RTO on each timeout) cannot degrade flow
completion time vs when RTO backoffs are allowed. Therefore, there
is no benefit from doing backoffs.

Although they have a "theorem" on this, unfortunately it seems to be
flawed - see note attached. Basically, their argument relies on the
assumption (amongst other things) that all flows sharing a link are
in lock-step, so that their backoff stage is always the same, which
is clearly not realistic. Without this assumption, it seems quite
possible that the completion time is shorter when backoff is used
than when the RTO is fixed. In fact it's also possible that the
aggregate sum of the completion times of *all* flows is lower with
backoff than without. I've emailed the authors to confirm that they
agree, but it seems pretty clear cut. I should stress that this is
not to say removing RTO backoff is not a reasonable idea, just that
the argument put forward in the paper doesn't seem to work as
stated.

The trade-off here is a complex one, between the delay introduced by
sending packets at a lower rate, i.e. with wider spacing between
packets, vs the delay incurred by increased retransmissions when
packets are sent at a higher rate due to the higher loss rate. I
reckon the paper doesn't really deal with common concerns regarding
congestion collapse or with application-limited low-speed flows
(e.g. games traffic). For example, an RTO of 200ms still seems like
a problem for delay-sensitive games traffic. We have a student
starting soon, so my plan is to ask him to do some more RTO tests to
see if some light can be cast on these.
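
To put rough numbers on the backoff side of that trade-off (my
arithmetic only, assuming a 200ms minimum RTO and per-timeout
doubling):

	#include <stdio.h>

	/* Illustrative only: cumulative delay before the nth retransmission
	 * with exponential backoff vs a fixed 200ms RTO. */
	int main(void)
	{
		unsigned int rto = 200, backoff = 0, fixed = 0;
		int n;

		for (n = 1; n <= 6; n++) {
			backoff += rto << (n - 1);	/* 200, 400, 800, ... ms */
			fixed += rto;			/* 200 ms every round */
			printf("after %d timeouts: backoff %u ms, fixed %u ms\n",
			       n, backoff, fixed);
		}
		return 0;
	}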

Cheers,

Doug
On 24 Oct 2008, at 15:59, Stephen Hemminger wrote:

> I was reading "Removing Exponential Backoff from TCP" by Mondal
> and Kuzmanovic at Northwestern in ACM CCR. The paper tries to show
> that nothing breaks by getting rid of backoff. Interestingly,
> FreeBSD already doesn't do backoff until after the 5th retry.
Ilpo Järvinen Nov. 24, 2008, 2 p.m. UTC | #6
I'll post this rexmit combining again along with the other
tcp changes... Addressing your concerns below.

On Tue, 28 Oct 2008, David Miller wrote:

> Small error:
> 
> > +static int tcp_can_collapse(struct sock *sk, struct sk_buff *skb)
> > +{
> > +	if (tcp_skb_pcount(skb) > 1)
> > +		return 0;
> > +	/* Once recombining for SACK completes this check could be made
> > +	 * less strict by reusing those parts.
> > +	 */
> > +	if (skb_shinfo(skb)->nr_frags != 0)
> > +		return 0;
> > +	if (skb_cloned(skb))
> > +		return 0;
> > +	if (skb == tcp_send_head(sk))
> > +		return 0;
> > +	/* Some heuristics for collapsing over SACK'd could be invented */
> > +	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
> > +		return 0;
> > +
> > +	return 1;
> > +}
> 
> For the "skb == tcp_send_head()" test, we're not interested in whether
> "skb" is equal, but rather whether next_skb is equal.

But that's what's actually happening... We'll terminate the loop
at the skb where it matches, without doing any useful work for it,
so I'm just using a different way of "saying" the same thing as you.
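
Concretely, for a write queue to -> s1 -> s2 -> send_head, the loop
in the patch below goes like this (a walkthrough, not new code,
assuming the space/tailroom/window checks all pass):

	/* iteration 1: skb == to         first == 1, nothing to merge yet
	 * iteration 2: skb == s1         tcp_can_collapse() ok, s1 merged into to
	 * iteration 3: skb == s2         tcp_can_collapse() ok, s2 merged into to
	 * iteration 4: skb == send_head  tcp_can_collapse() returns 0, break
	 *
	 * so the send head is never merged into 'to', which is exactly what
	 * the old "next_skb != tcp_send_head()" test guaranteed.
	 */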

> But the structure looks fine and when you send me a tested version
> of this patch I'll apply it to net-next-2.6, thanks!

Tested along with the other recombining changes I'm posting in a
couple of minutes.

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index e4c5ac9..99ee943 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1766,46 +1766,22 @@  u32 __tcp_select_window(struct sock *sk)
 	return window;
 }
 
-/* Attempt to collapse two adjacent SKB's during retransmission. */
-static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb,
-				     int mss_now)
+/* Collapses two adjacent SKB's during retransmission. */
+static void tcp_collapse_retrans(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *next_skb = tcp_write_queue_next(sk, skb);
 	int skb_size, next_skb_size;
 	u16 flags;
 
-	/* The first test we must make is that neither of these two
-	 * SKB's are still referenced by someone else.
-	 */
-	if (skb_cloned(skb) || skb_cloned(next_skb))
-		return;
-
 	skb_size = skb->len;
 	next_skb_size = next_skb->len;
 	flags = TCP_SKB_CB(skb)->flags;
 
-	/* Also punt if next skb has been SACK'd. */
-	if (TCP_SKB_CB(next_skb)->sacked & TCPCB_SACKED_ACKED)
-		return;
-
-	/* Next skb is out of window. */
-	if (after(TCP_SKB_CB(next_skb)->end_seq, tcp_wnd_end(tp)))
-		return;
-
-	/* Punt if not enough space exists in the first SKB for
-	 * the data in the second, or the total combined payload
-	 * would exceed the MSS.
-	 */
-	if ((next_skb_size > skb_tailroom(skb)) ||
-	    ((skb_size + next_skb_size) > mss_now))
-		return;
-
 	BUG_ON(tcp_skb_pcount(skb) != 1 || tcp_skb_pcount(next_skb) != 1);
 
 	tcp_highest_sack_combine(sk, next_skb, skb);
 
-	/* Ok.	We will be able to collapse the packet. */
 	tcp_unlink_write_queue(next_skb, sk);
 
 	skb_copy_from_linear_data(next_skb, skb_put(skb, next_skb_size),
@@ -1847,6 +1823,64 @@  static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *skb,
 	sk_wmem_free_skb(sk, next_skb);
 }
 
+static int tcp_can_collapse(struct sock *sk, struct sk_buff *skb)
+{
+	if (tcp_skb_pcount(skb) > 1)
+		return 0;
+	/* Once recombining for SACK completes this check could be made
+	 * less strict by reusing those parts.
+	 */
+	if (skb_shinfo(skb)->nr_frags != 0)
+		return 0;
+	if (skb_cloned(skb))
+		return 0;
+	if (skb == tcp_send_head(sk))
+		return 0;
+	/* Some heuristics for collapsing over SACK'd could be invented */
+	if (TCP_SKB_CB(skb)->sacked & TCPCB_SACKED_ACKED)
+		return 0;
+
+	return 1;
+}
+
+static void tcp_retrans_try_collapse(struct sock *sk, struct sk_buff *to,
+				     int space)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct sk_buff *skb = to, *tmp;
+	int first = 1;
+
+	if (!sysctl_tcp_retrans_collapse)
+		return;
+	if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_SYN)
+		return;
+
+	tcp_for_write_queue_from_safe(skb, tmp, sk) {
+		if (!tcp_can_collapse(sk, skb))
+			break;
+
+		space -= skb->len;
+
+		if (first) {
+			first = 0;
+			continue;
+		}
+
+		if (space < 0)
+			break;
+		/* Punt if not enough space exists in the first SKB for
+		 * the data in the second
+		 */
+		if (skb->len > skb_tailroom(to))
+			break;
+
+		if (after(TCP_SKB_CB(skb)->end_seq, tcp_wnd_end(tp)))
+			break;
+
+		tcp_collapse_retrans(sk, to);
+	}
+}
+
 /* Do a simple retransmit without using the backoff mechanisms in
  * tcp_timer. This is used for path mtu discovery.
  * The socket is already locked here.
@@ -1946,17 +1980,7 @@  int tcp_retransmit_skb(struct sock *sk, struct sk_buff *skb)
 			return -ENOMEM; /* We'll try again later. */
 	}
 
-	/* Collapse two adjacent packets if worthwhile and we can. */
-	if (!(TCP_SKB_CB(skb)->flags & TCPCB_FLAG_SYN) &&
-	    (skb->len < (cur_mss >> 1)) &&
-	    (!tcp_skb_is_last(sk, skb)) &&
-	    (tcp_write_queue_next(sk, skb) != tcp_send_head(sk)) &&
-	    (skb_shinfo(skb)->nr_frags == 0 &&
-	     skb_shinfo(tcp_write_queue_next(sk, skb))->nr_frags == 0) &&
-	    (tcp_skb_pcount(skb) == 1 &&
-	     tcp_skb_pcount(tcp_write_queue_next(sk, skb)) == 1) &&
-	    (sysctl_tcp_retrans_collapse != 0))
-		tcp_retrans_try_collapse(sk, skb, cur_mss);
+	tcp_retrans_try_collapse(sk, skb, cur_mss);
 
 	/* Some Solaris stacks overoptimize and ignore the FIN on a
 	 * retransmit when old data is attached.  So strip it off