
tcp: detect loss above high_seq in recovery

Message ID 1325033182-25905-1-git-send-email-ycheng@google.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Yuchung Cheng Dec. 28, 2011, 12:46 a.m. UTC
Correctly implement a loss detection heuristic: New sequences (above
high_seq) sent during the fast recovery are deemed lost when higher
sequences are SACKed.

Current code does not catch these losses, because tcp_mark_head_lost()
does not check packets beyond high_seq. The fix is straightforward:
check packets up to the highest SACKed sequence. In addition, all the
FLAG_DATA_LOST logic is ineffective and redundant and can be removed.

The new sequences sent during fast recovery may be marked as lost
and/or retransmitted. It is possible that these (new) losses are
recovered within the current fast recovery, with the state moving back
to CA_Open without entering another fast recovery / cwnd reduction.
This is especially likely for light losses.  Note RFC 3517 (implicitly)
allows this as well.

Update the loss heuristic comments. The algorithm above is documented
as heuristic B, but it too is redundant, because heuristic A already
covers B.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
---
 include/linux/snmp.h |    1 -
 net/ipv4/proc.c      |    1 -
 net/ipv4/tcp_input.c |   41 +++++++++++++++--------------------------
 3 files changed, 15 insertions(+), 28 deletions(-)
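To make the marking rule concrete, here is a toy userspace sketch (an illustration under assumed names, not the kernel code): packet indices stand in for sequence numbers, REORDERING stands in for tp->reordering, and loss_high is the scan limit the patch raises from high_seq to snd_nxt on SACK connections.

```c
#include <stdbool.h>

#define REORDERING 3  /* stand-in for tp->reordering (dupthresh) */

/* Toy model: a packet is deemed lost when at least REORDERING SACKed
 * packets lie above it; the scan stops at loss_high, which this patch
 * moves from high_seq to snd_nxt for SACK senders. Returns the number
 * of packets marked lost. (Hypothetical helper, not tcp_mark_head_lost.)
 */
int mark_lost(const bool *sacked, int n, int loss_high)
{
    int lost = 0;

    for (int i = 0; i < n && i < loss_high; i++) {
        if (sacked[i])
            continue;
        /* count SACKed packets above packet i */
        int above = 0;
        for (int j = i + 1; j < n; j++)
            if (sacked[j])
                above++;
        if (above >= REORDERING)
            lost++;
    }
    return lost;
}
```

With SACKs arriving for higher sequences, new packets sent during recovery can only be marked lost once the scan limit passes high_seq; stopping at high_seq leaves them unmarked.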

Comments

Ilpo Järvinen Dec. 28, 2011, 3:42 p.m. UTC | #1
On Tue, 27 Dec 2011, Yuchung Cheng wrote:

> Correctly implement a loss detection heuristic: New sequences (above
> high_seq) sent during the fast recovery are deemed lost when higher
> sequences are SACKed.
> 
> Current code does not catch these losses, because tcp_mark_head_lost()
> does not check packets beyond high_seq. The fix is straightforward:
> check packets up to the highest SACKed sequence. In addition, all the
> FLAG_DATA_LOST logic is ineffective and redundant and can be removed.

I agree that FLAG_DATA_LOST never did anything very useful... I've never 
figured out why it was there (perhaps some earlier change broke it 
before I started tracking TCP development).

> The new sequences sent during fast recovery may be marked as lost
> and/or retransmitted. It is possible that these (new) losses are
> recovered within the current fast recovery, with the state moving back
> to CA_Open without entering another fast recovery / cwnd reduction.
> This is especially likely for light losses.  Note RFC 3517 (implicitly)
> allows this as well.

No it doesn't! It does not allow you to avoid the second cwnd reduction; 
they're from different RTTs. What you now claim is that we never need to 
do a cwnd reduction until we visit CA_Open in between. That's very, very 
dangerous if congestion suddenly spikes, because nobody reduces twice, 
so everyone keeps getting losses and continues in recovery forever.

> Update the loss heuristic comments. The algorithm above is documented
> as heuristic B, but it is redundant too because heuristic A already
> covers B.
> 
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> ---
>  include/linux/snmp.h |    1 -
>  net/ipv4/proc.c      |    1 -
>  net/ipv4/tcp_input.c |   41 +++++++++++++++--------------------------
>  3 files changed, 15 insertions(+), 28 deletions(-)
> 
> diff --git a/include/linux/snmp.h b/include/linux/snmp.h
> index e16557a..c1241c42 100644
> --- a/include/linux/snmp.h
> +++ b/include/linux/snmp.h
> @@ -192,7 +192,6 @@ enum
>  	LINUX_MIB_TCPPARTIALUNDO,		/* TCPPartialUndo */
>  	LINUX_MIB_TCPDSACKUNDO,			/* TCPDSACKUndo */
>  	LINUX_MIB_TCPLOSSUNDO,			/* TCPLossUndo */
> -	LINUX_MIB_TCPLOSS,			/* TCPLoss */
>  	LINUX_MIB_TCPLOSTRETRANSMIT,		/* TCPLostRetransmit */
>  	LINUX_MIB_TCPRENOFAILURES,		/* TCPRenoFailures */
>  	LINUX_MIB_TCPSACKFAILURES,		/* TCPSackFailures */
> diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
> index 3569d8e..6afc807 100644
> --- a/net/ipv4/proc.c
> +++ b/net/ipv4/proc.c
> @@ -216,7 +216,6 @@ static const struct snmp_mib snmp4_net_list[] = {
>  	SNMP_MIB_ITEM("TCPPartialUndo", LINUX_MIB_TCPPARTIALUNDO),
>  	SNMP_MIB_ITEM("TCPDSACKUndo", LINUX_MIB_TCPDSACKUNDO),
>  	SNMP_MIB_ITEM("TCPLossUndo", LINUX_MIB_TCPLOSSUNDO),
> -	SNMP_MIB_ITEM("TCPLoss", LINUX_MIB_TCPLOSS),

I don't think you can remove MIBs as they're userspace visible.
David Miller Dec. 28, 2011, 6:07 p.m. UTC | #2
From: "Ilpo Järvinen" <ilpo.jarvinen@helsinki.fi>
Date: Wed, 28 Dec 2011 17:42:06 +0200 (EET)

> I don't think you can remove MIBs as they're userspace visible.

Because of how they are implemented and how applications use them,
this is actually allowed.

All the MIBs are presented first as a series of strings, then as a
series of values.

So applications parse both to report the statistics, and they do get
coded in a way that does not assume the presence or particular
ordering of these MIBs.

Therefore, such a change is safe.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yuchung Cheng Dec. 29, 2011, 6:21 p.m. UTC | #3
Ilpo,
On Wed, Dec 28, 2011 at 7:42 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
>
> On Tue, 27 Dec 2011, Yuchung Cheng wrote:
> > The new sequences sent during fast recovery may be marked as lost
> > and/or retransmitted. It is possible that these (new) losses are
> > recovered within the current fast recovery, with the state moving back
> > to CA_Open without entering another fast recovery / cwnd reduction.
> > This is especially likely for light losses.  Note RFC 3517 (implicitly)
> > allows this as well.
>
> No it doesn't! It does not allow you to avoid the second cwnd reduction;
> they're from different RTTs. What you now claim is that we never need to
> do a cwnd reduction until we visit CA_Open in between. That's very, very
> dangerous if congestion suddenly spikes, because nobody reduces twice,
> so everyone keeps getting losses and continues in recovery forever.

Should TCP perform another CWR for sequences lost after the recovery point?
My colleagues and I all agree with you that it should. But my commit message
did a poor job explaining some subtle deviations in the standard and in
current Linux.

RFC 3517 supports this in principle; however, it may repair such new data
losses within the current recovery: rules 1.b and 1.c in NextSeg() allow
retransmitting any segment below the highest SACKed sequence. Thus it may
repair every loss in the current recovery and never enter another.

As a matter of fact, Linux already does that today by implementing rule 3 in
NextSeg(), aka forward retransmit. But Linux won't mark these packets lost
even if there are >= tp->reordering packets SACKed above them. This change
fixes that.
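For reference, the NextSeg() rule ordering being discussed can be sketched as a toy model (a hypothetical helper over per-packet flags; indices stand in for sequence numbers, and this is neither the RFC pseudocode nor the kernel implementation):

```c
#include <stdbool.h>

/* Toy model of RFC 3517 NextSeg(). Returns the index to retransmit,
 * -2 for "send previously unsent data", or -1 for nothing to send. */
int next_seg(int n, const bool *sacked, const bool *lost,
             const bool *rexmitted, int highest_sacked, bool new_data)
{
    /* Rule 1: smallest unSACKed, not-yet-retransmitted segment below
     * the highest SACKed sequence (1.a, 1.b) that IsLost() deems
     * lost (1.c). */
    for (int i = 0; i < n; i++)
        if (!sacked[i] && !rexmitted[i] && i < highest_sacked && lost[i])
            return i;

    /* Rule 2: otherwise, previously unsent data if available. */
    if (new_data)
        return -2;

    /* Rule 3 ("forward retransmit"): a segment meeting 1.a and 1.b but
     * not 1.c may still be sent -- this is how losses of new data below
     * the highest SACK can be repaired within the current recovery. */
    for (int i = 0; i < n; i++)
        if (!sacked[i] && !rexmitted[i] && i < highest_sacked)
            return i;

    return -1;
}
```

The point of the thread is visible in rule 3: segments not yet marked lost can still be retransmitted below the highest SACKed sequence, which is why a recovery can quietly repair new-data losses.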

How often is the second CWR avoided? It's probably rare, because it requires
new data sent during the recovery RTT and a complete repair of any losses
among it. It's arguably OK to fix small new-data losses during the current
recovery, and both the RFC and Linux seem to do so.

To summarize:
1) this change does NOT forbid TCP from performing further CWR on new losses
2) this change (merely) marks some forward-retransmitted packets LOST
3) the commit message should be explicit that CWR may not always be
    performed, in both my change and the current kernel.


HTH. I will update the commit message if people can make sense of it :)
Ilpo Järvinen Dec. 30, 2011, 11:19 a.m. UTC | #4
On Thu, 29 Dec 2011, Yuchung Cheng wrote:

> Ilpo,
> On Wed, Dec 28, 2011 at 7:42 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> >
> > On Tue, 27 Dec 2011, Yuchung Cheng wrote:
> > > The new sequences sent during fast-recovery maybe marked as lost
> > > and/or retransmitted. It is possible these (new) losses are recovered
> > > within the current fast recovery and the state moves back to CA_Open
> > > without entering another fast recovery / cwnd redunction. This is
> > > especially possible for light losses.  Note RFC 3517 (implicitly)
> > > allows this as well.
> >
> > No it doesn't! It does not allow you to avoid the second cwnd reduction.
> > They're from different RTT. What you now claim is that we never need to
> > do cwnd reduction until we visit CA_Open in between. That's very very
> > dangerous if the congestion suddently spikes because nobody reduces twice
> > causing everyone to get losses and continues in the recovery forever.
> 
> Should TCP perform another CWR for sequences lost after the recovery point?
> My colleagues and I all agree with you that it should.

I agree, and it seems that neither Linux nor RFC 3517 does this right 
(I should have read the RFC more carefully instead of just trusting my own 
guesswork). It seems there's instead this corner case not covered correctly 
in the RFC, probably because hitting it would require a rexmit loss, which 
forces an RTO with standard TCP, or reordering.

> But my commit message did a poor job explaining some subtle deviations 
> in the standard and in current Linux.

Right, I thought you were changing (removing) the second reduction based on 
the commit message, as I didn't have time to go through enough code. But it 
seems that it is a pre-existing issue.

> RFC 3517 supports this in principle; however, it may repair such new data 
> losses within the current recovery: rules 1.b and 1.c in NextSeg() allow 
> retransmitting any segment below the highest SACKed sequence. Thus it may 
> repair every loss in the current recovery and never enter another.

I suppose we should be moving high_seq (RecoveryPoint) to snd_nxt when 
we retransmit something above high_seq, and initiate the additional reduction 
using PRR. It won't accurately measure the next RTT, though, as rexmits 
and new data were mixed during the first recovery RTT, but I suppose it's 
still a good enough approximation.
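One possible reading of that suggestion, as a minimal sketch (the rec_state structure and on_retransmit helper are assumptions for illustration, not existing kernel code): extend the recovery point when a retransmission goes out at or above it, and arm one more reduction.

```c
#include <stdbool.h>

/* Hypothetical state for the sketched fix: when a retransmission is
 * sent for a sequence at or above the current recovery point, move the
 * recovery point to snd_nxt and flag another cwnd reduction (which PRR
 * would then spread over the next RTT). */
struct rec_state {
    unsigned int high_seq;   /* RecoveryPoint */
    unsigned int snd_nxt;
    bool reduce_again;       /* another reduction pending */
};

void on_retransmit(struct rec_state *s, unsigned int seq)
{
    if (seq >= s->high_seq) {
        s->high_seq = s->snd_nxt;  /* extend the recovery point */
        s->reduce_again = true;    /* trigger an additional reduction */
    }
}
```

As Ilpo notes, the "RTT" measured this way is approximate, since retransmissions and new data were interleaved during the first recovery round trip.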

> As a matter of fact, Linux already does that today by implementing rule 3 in
> NextSeg(), aka forward retransmit. But Linux won't mark these packets 
> lost even if there are >= tp->reordering packets SACKed above them. This 
> change fixes that.

I was thinking there used to be a seq < high_seq condition in the rexmit 
loop, but it seems that it either got dropped or I remember wrong. That 
would have prevented forward rexmits past the boundary.

> How often is the second CWR avoided? It's probably rare, because it requires
> new data sent during the recovery RTT and a complete repair of any losses 
> among it. It's arguably OK to fix small new-data losses during the current 
> recovery, and both the RFC and Linux seem to do so.

There's one significant difference here: for the RFC this includes only 
some trivial reordering cases, but with Linux we can handle lost rexmits, 
making it more likely to happen (those would require an RTO and its 
consequences with RFC/standard TCP; if we assume no reordering, cumulative 
ACKs end the previous recovery and the next one immediately gets triggered).

> To summarize:
> 1) this change does NOT forbid TCP performing further CWR on new losses
> 2) this change (merely) marks some forward-retransmitted packets LOST.
> 3) the commit message is to be explicit that CWR may not always be performed,
>     in both my change and current kernel.

I agree with these points. I wonder if you intend to look into fixing the 
lacking-CWR problem, or should I look into that once I get some spare time?

> HTH. I will update commit message if people can make sense of it :)

Please do.
Yuchung Cheng Jan. 3, 2012, 6:54 p.m. UTC | #5
On Fri, Dec 30, 2011 at 3:19 AM, Ilpo Järvinen
<ilpo.jarvinen@helsinki.fi> wrote:
> On Thu, 29 Dec 2011, Yuchung Cheng wrote:
>
>> To summarize:
>> 1) this change does NOT forbid TCP performing further CWR on new losses
>> 2) this change (merely) marks some forward-retransmitted packets LOST.
>> 3) the commit message is to be explicit that CWR may not always be performed,
>>     in both my change and current kernel.
>
> I agree with these points. I wonder if you intend to look to fixing the
> lacking CWR problem or should I look to that once I get some spare time?
>
>> HTH. I will update commit message if people can make sense of it :)
>
> Please do.
Great, I'll update the commit message first, and work on another
feature patch that does CWR if:
a) a new sequence is lost
b) a retransmission is lost
Ilpo Järvinen Jan. 3, 2012, 7:11 p.m. UTC | #6
On Tue, 3 Jan 2012, Yuchung Cheng wrote:

> On Fri, Dec 30, 2011 at 3:19 AM, Ilpo Järvinen
> <ilpo.jarvinen@helsinki.fi> wrote:
> > On Thu, 29 Dec 2011, Yuchung Cheng wrote:
> >
> >> To summarize:
> >> 1) this change does NOT forbid TCP performing further CWR on new losses
> >> 2) this change (merely) marks some forward-retransmitted packets LOST.
> >> 3) the commit message is to be explicit that CWR may not always be performed,
> >>     in both my change and current kernel.
> >
> > I agree with these points. I wonder if you intend to look to fixing the
> > lacking CWR problem or should I look to that once I get some spare time?
> >
> >> HTH. I will update commit message if people can make sense of it :)
> >
> > Please do.
> Great, I'll update the commit message first, and work on another
> feature patch that does CWR if
> a) new sequence is lost
> b) a retransmission is lost

Ok, thanks. I think it's better to do it the way you suggest... i.e., to 
do the CWR on detection rather than (possibly much) later when 
rexmitting.

Patch

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index e16557a..c1241c42 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -192,7 +192,6 @@  enum
 	LINUX_MIB_TCPPARTIALUNDO,		/* TCPPartialUndo */
 	LINUX_MIB_TCPDSACKUNDO,			/* TCPDSACKUndo */
 	LINUX_MIB_TCPLOSSUNDO,			/* TCPLossUndo */
-	LINUX_MIB_TCPLOSS,			/* TCPLoss */
 	LINUX_MIB_TCPLOSTRETRANSMIT,		/* TCPLostRetransmit */
 	LINUX_MIB_TCPRENOFAILURES,		/* TCPRenoFailures */
 	LINUX_MIB_TCPSACKFAILURES,		/* TCPSackFailures */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 3569d8e..6afc807 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -216,7 +216,6 @@  static const struct snmp_mib snmp4_net_list[] = {
 	SNMP_MIB_ITEM("TCPPartialUndo", LINUX_MIB_TCPPARTIALUNDO),
 	SNMP_MIB_ITEM("TCPDSACKUndo", LINUX_MIB_TCPDSACKUNDO),
 	SNMP_MIB_ITEM("TCPLossUndo", LINUX_MIB_TCPLOSSUNDO),
-	SNMP_MIB_ITEM("TCPLoss", LINUX_MIB_TCPLOSS),
 	SNMP_MIB_ITEM("TCPLostRetransmit", LINUX_MIB_TCPLOSTRETRANSMIT),
 	SNMP_MIB_ITEM("TCPRenoFailures", LINUX_MIB_TCPRENOFAILURES),
 	SNMP_MIB_ITEM("TCPSackFailures", LINUX_MIB_TCPSACKFAILURES),
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2877c3e..976034f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -105,7 +105,6 @@  int sysctl_tcp_abc __read_mostly;
 #define FLAG_SYN_ACKED		0x10 /* This ACK acknowledged SYN.		*/
 #define FLAG_DATA_SACKED	0x20 /* New SACK.				*/
 #define FLAG_ECE		0x40 /* ECE in this ACK				*/
-#define FLAG_DATA_LOST		0x80 /* SACK detected data lossage.		*/
 #define FLAG_SLOWPATH		0x100 /* Do not skip RFC checks for window update.*/
 #define FLAG_ONLY_ORIG_SACKED	0x200 /* SACKs only non-rexmit sent before RTO */
 #define FLAG_SND_UNA_ADVANCED	0x400 /* Snd_una was changed (!= FLAG_DATA_ACKED) */
@@ -1040,13 +1039,11 @@  static void tcp_skb_mark_lost_uncond_verify(struct tcp_sock *tp,
  * These 6 states form finite state machine, controlled by the following events:
  * 1. New ACK (+SACK) arrives. (tcp_sacktag_write_queue())
  * 2. Retransmission. (tcp_retransmit_skb(), tcp_xmit_retransmit_queue())
- * 3. Loss detection event of one of three flavors:
+ * 3. Loss detection event of two flavors:
  *	A. Scoreboard estimator decided the packet is lost.
  *	   A'. Reno "three dupacks" marks head of queue lost.
- *	   A''. Its FACK modfication, head until snd.fack is lost.
- *	B. SACK arrives sacking data transmitted after never retransmitted
- *	   hole was sent out.
- *	C. SACK arrives sacking SND.NXT at the moment, when the
+ *	   A''. Its FACK modification, head until snd.fack is lost.
+ *	B. SACK arrives sacking SND.NXT at the moment, when the
  *	   segment was retransmitted.
  * 4. D-SACK added new rule: D-SACK changes any tag to S.
  *
@@ -1153,7 +1150,7 @@  static int tcp_is_sackblock_valid(struct tcp_sock *tp, int is_dsack,
 }
 
 /* Check for lost retransmit. This superb idea is borrowed from "ratehalving".
- * Event "C". Later note: FACK people cheated me again 8), we have to account
+ * Event "B". Later note: FACK people cheated me again 8), we have to account
  * for reordering! Ugly, but should help.
  *
  * Search retransmitted skbs from write_queue that were sent when snd_nxt was
@@ -1844,10 +1841,6 @@  tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 		if (found_dup_sack && ((i + 1) == first_sack_index))
 			next_dup = &sp[i + 1];
 
-		/* Event "B" in the comment above. */
-		if (after(end_seq, tp->high_seq))
-			state.flag |= FLAG_DATA_LOST;
-
 		/* Skip too early cached blocks */
 		while (tcp_sack_cache_ok(tp, cache) &&
 		       !before(start_seq, cache->end_seq))
@@ -2515,8 +2508,11 @@  static void tcp_timeout_skbs(struct sock *sk)
 	tcp_verify_left_out(tp);
 }
 
-/* Mark head of queue up as lost. With RFC3517 SACK, the packets is
- * is against sacked "cnt", otherwise it's against facked "cnt"
+/* Detect loss in event "A" above by marking head of queue up as lost.
+ * For FACK or non-SACK(Reno) senders, the first "packets" number of segments
+ * are considered lost. For RFC3517 SACK, a segment is considered lost if it
+ * has at least tp->reordering SACKed segments above it; "packets" refers to
+ * the maximum SACKed segments to pass before reaching this limit.
  */
 static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)
 {
@@ -2525,6 +2521,8 @@  static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)
 	int cnt, oldcnt;
 	int err;
 	unsigned int mss;
+	/* Use SACK to deduce losses of new sequences sent during recovery */
+	const u32 loss_high = tcp_is_sack(tp) ?  tp->snd_nxt : tp->high_seq;
 
 	WARN_ON(packets > tp->packets_out);
 	if (tp->lost_skb_hint) {
@@ -2546,7 +2544,7 @@  static void tcp_mark_head_lost(struct sock *sk, int packets, int mark_head)
 		tp->lost_skb_hint = skb;
 		tp->lost_cnt_hint = cnt;
 
-		if (after(TCP_SKB_CB(skb)->end_seq, tp->high_seq))
+		if (after(TCP_SKB_CB(skb)->end_seq, loss_high))
 			break;
 
 		oldcnt = cnt;
@@ -3033,19 +3031,10 @@  static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 	if (tcp_check_sack_reneging(sk, flag))
 		return;
 
-	/* C. Process data loss notification, provided it is valid. */
-	if (tcp_is_fack(tp) && (flag & FLAG_DATA_LOST) &&
-	    before(tp->snd_una, tp->high_seq) &&
-	    icsk->icsk_ca_state != TCP_CA_Open &&
-	    tp->fackets_out > tp->reordering) {
-		tcp_mark_head_lost(sk, tp->fackets_out - tp->reordering, 0);
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPLOSS);
-	}
-
-	/* D. Check consistency of the current state. */
+	/* C. Check consistency of the current state. */
 	tcp_verify_left_out(tp);
 
-	/* E. Check state exit conditions. State can be terminated
+	/* D. Check state exit conditions. State can be terminated
 	 *    when high_seq is ACKed. */
 	if (icsk->icsk_ca_state == TCP_CA_Open) {
 		WARN_ON(tp->retrans_out != 0);
@@ -3077,7 +3066,7 @@  static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked,
 		}
 	}
 
-	/* F. Process state. */
+	/* E. Process state. */
 	switch (icsk->icsk_ca_state) {
 	case TCP_CA_Recovery:
 		if (!(flag & FLAG_SND_UNA_ADVANCED)) {