Message ID | 1320775631-16341-1-git-send-email-ncardwell@google.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Headers | show |
On Tue, Nov 8, 2011 at 10:07 AM, Neal Cardwell <ncardwell@google.com> wrote: > Fixes for some issues that prevent DSACKs from allowing TCP senders to > undo cwnd reductions made during fast recovery. > > There were a few related bugs/issues: > > 1) Senders ignored DSACKs after recovery when there were no > outstanding packets (a common scenario for HTTP servers). > > 2) When the ACK field is below snd_una (which can happen when ACKs are > reordered), senders ignored DSACKs (preventing undo) and passed up > chances to send out more packets based on any newly-SACKed packets. > > 3) Senders were overriding cwnd values picked during an undo by > calling tcp_moderate_cwnd() in tcp_try_to_open(). > > The fixes: > > (1) When there are no outstanding packets (the "no_queue" goto label), > use DSACKs to undo congestion window reductions. > > (2) When the ACK field is below snd_una (the "old_ack" goto label), > process any DSACKs and try to send out more packets based on > newly-SACKed packets. > > (3) Don't moderate cwnd in tcp_try_to_open() if we're in TCP_CA_Open, > since doing so is generally unnecessary and specifically would > override a DSACK-based undo of a cwnd reduction made in fast recovery. > > (4) Simplify the congestion avoidance state machine by removing the > behavior where SACK-enabled connections hung around in the > TCP_CA_Disorder state just waiting for DSACKs. Instead, when snd_una > advances to high_seq or beyond we typically move to TCP_CA_Open > immediately and allow an undo in either TCP_CA_Open or TCP_CA_Disorder > if we later receive enough DSACKs. Previously, SACK-enabled > connections hung around in TCP_CA_Disorder state while > snd_una==high_seq, just waiting to accumulate DSACKs and hopefully > undo a cwnd reduction. This could and did lead to the following > unfortunate scenario: if some incoming ACKs advance snd_una beyond > high_seq then we were setting undo_marker to 0 and moving to > TCP_CA_Open, so if (due to reordering in the ACK return path) we > shortly thereafter received a DSACK then we were no longer able to > undo the cwnd reduction. > > Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> FWIW. This undo-fix patch on Google Web servers increased the undos in loss by 46% and in disorder by 17%. It also corrects the SNMP stats TCPTimeouts, TCPRenoFailures, TCPSackFailures by moving state into open, instead of disorder, after recovery. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Yuchung Cheng <ycheng@google.com> Date: Tue, 8 Nov 2011 17:18:34 -0800 > On Tue, Nov 8, 2011 at 10:07 AM, Neal Cardwell <ncardwell@google.com> wrote: >> Fixes for some issues that prevent DSACKs from allowing TCP senders to >> undo cwnd reductions made during fast recovery. ... >> Signed-off-by: Neal Cardwell <ncardwell@google.com> > Acked-by: Yuchung Cheng <ycheng@google.com> > > > FWIW. This undo-fix patch on Google Web servers increased the undos in > loss by 46% and in disorder by 17%. It also corrects the SNMP stats > TCPTimeouts, TCPRenoFailures, TCPSackFailures by moving state into > open, instead of disorder, after recovery. Just wanted to give you guys a heads-up that this patch will take me some time to review properly. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Neal Cardwell <ncardwell@google.com> Date: Tue, 8 Nov 2011 13:07:11 -0500 > (2) When the ACK field is below snd_una (the "old_ack" goto label), > process any DSACKs and try to send out more packets based on > newly-SACKed packets. This seems dubious. An older ACK should not have more uptodate SACK information than a newer ACK. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Nov 14, 2011 at 3:23 PM, David Miller <davem@davemloft.net> wrote: > From: Neal Cardwell <ncardwell@google.com> > Date: Tue, 8 Nov 2011 13:07:11 -0500 > >> (2) When the ACK field is below snd_una (the "old_ack" goto label), >> process any DSACKs and try to send out more packets based on >> newly-SACKed packets. > > This seems dubious. > > An older ACK should not have more uptodate SACK information than a newer > ACK. > I grant that it is a rare corner case, but it is possible if there is reordering and packet loss in both directions. In fact, AFAICT the code path to handle this corner case is already there: AFAICT we call tcp_sacktag_write_queue() in the old_ack code path exactly because it is possible for old ACKs to carry SACK ranges that we haven't seen yet. If this is not possible then it seems we should remove the old_ack code path. Let me try to lay out a specific example: Suppose a sender sends packets 1-12, and there is lots of loss and reordering in both directions. Suppose the receiver sends the following ACKs with SACK blocks, in the following order (using packet numbers rather than sequence numbers): (1) ack 1, sack <3-4> (2) ack 1, sack <3-4 6> (3) ack 1, sack <3-4 6 8> (4) ack 1, sack <3-4 6 8 10> (5) ack 1, sack <6 8 10 12> (6) ack 2, sack <6 8 10 12> Note that starting at ACK (5) the receiver is limited by the fact that TCP options can only hold 4 SACK ranges, so the SACK for 3-4 "falls off" the SACK block set. Now suppose the ACKs are reordered, and some are dropped, so that the sender receives the ACKs in this order: (6) ack 2, sack <6 8 10 12> (1) ack 1, sack <3-4> The arrival of (6) advances snd_una to 2. Now the question is what to do when (1), with an ACK field below snd_una, arrives at the sender. The existing code tags packets 3-4 as SACKed, but does not use this newly SACK-tagged data to send out new data. This commit proposes to use the newly SACK-tagged 3-4 to send out new data. neal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 14 Nov 2011, Neal Cardwell wrote: > On Mon, Nov 14, 2011 at 3:23 PM, David Miller <davem@davemloft.net> wrote: > > From: Neal Cardwell <ncardwell@google.com> > > Date: Tue, 8 Nov 2011 13:07:11 -0500 > > > >> (2) When you need to say "fixes 1), 2), and 3)", please stop for a moment thinking if the fixes really need to be sent as one fix or 3 fixes (a series PATCH x/3, see netdev for examples). It would make review much simpler if unrelated fixes are not put together even if all of them would be needed to get DSACK working as intented. > >> When the ACK field is below snd_una (the "old_ack" goto label), > >> process any DSACKs and try to send out more packets based on > >> newly-SACKed packets. > > > > This seems dubious. Indeed. However, there's nothing wrong in the processing of DSACKs nor SACKs in the given case but we lack call to tcp_fastretrans_alert. But the commit message does not say that, in fact, I failed to understand what 2) tried to say as I knew that DSACK and SACK processing itself is ok and done (it also mixes "DSACKs"/"SACKed" illogically). This commit message must be fixed so that guys who read it later won't get confused too. BUT I'd on the same time split the whole fix to 3 different patches unless there's some very good reason why the changes are interlocked so that the splitting is not possible. > > An older ACK should not have more uptodate SACK information than a newer > > ACK. This is not true with DSACK that are not cumulative. And also the reported SACK state is cumulative up to a limit. > I grant that it is a rare corner case, but it is possible if there is > reordering and packet loss in both directions. > > In fact, AFAICT the code path to handle this corner case is already > there: AFAICT we call tcp_sacktag_write_queue() in the old_ack code > path exactly because it is possible for old ACKs to carry SACK ranges > that we haven't seen yet. I agree. ...And also DSACKs that can only be seen in that particular ACK. > Let me try to lay out a specific example: > > Suppose a sender sends packets 1-12, and there is lots of loss and > reordering in both directions. > > Suppose the receiver sends the following ACKs with SACK blocks, in the > following order (using packet numbers rather than sequence numbers): > > (1) ack 1, sack <3-4> > (2) ack 1, sack <3-4 6> > (3) ack 1, sack <3-4 6 8> > (4) ack 1, sack <3-4 6 8 10> > (5) ack 1, sack <6 8 10 12> > (6) ack 2, sack <6 8 10 12> > > Note that starting at ACK (5) the receiver is limited by the fact that > TCP options can only hold 4 SACK ranges, so the SACK for 3-4 "falls > off" the SACK block set. > > Now suppose the ACKs are reordered, and some are dropped, so that the > sender receives the ACKs in this order: > > (6) ack 2, sack <6 8 10 12> > (1) ack 1, sack <3-4> > > The arrival of (6) advances snd_una to 2. Now the question is what to > do when (1), with an ACK field below snd_una, arrives at the > sender. The existing code tags packets 3-4 as SACKed, but does not use > this newly SACK-tagged data to send out new data. This commit proposes > to use the newly SACK-tagged 3-4 to send out new data. Contrary to your commit message, new data sending is already done outside of tcp_ack() in tcp_data_snd_check? ...It's tcp_fastretrans_alert call that is lacking so no rexmits could be done currently? In addition, this given scenarios was not DSACK related!?! I think your commit message is just bogus discussing (only) DSACK processing for this case.
On Mon, Nov 14, 2011 at 9:05 PM, Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> wrote: > On Mon, 14 Nov 2011, Neal Cardwell wrote: > >> On Mon, Nov 14, 2011 at 3:23 PM, David Miller <davem@davemloft.net> wrote: >> > From: Neal Cardwell <ncardwell@google.com> >> > Date: Tue, 8 Nov 2011 13:07:11 -0500 >> > >> >> (2) > > When you need to say "fixes 1), 2), and 3)", please stop for a moment > thinking if the fixes really need to be sent as one fix or 3 fixes (a > series PATCH x/3, see netdev for examples). It would make review much > simpler if unrelated fixes are not put together even if all of them > would be needed to get DSACK working as intented. I'm happy to split this patch into a few smaller patches. I was on the fence about how fine-grained to make them; I'll make them smaller. >> >> When the ACK field is below snd_una (the "old_ack" goto label), >> >> process any DSACKs and try to send out more packets based on >> >> newly-SACKed packets. >> > >> > This seems dubious. > > Indeed. However, there's nothing wrong in the processing of DSACKs nor > SACKs in the given case but we lack call to tcp_fastretrans_alert. But the > commit message does not say that, in fact, I failed to understand what 2) > tried to say as I knew that DSACK and SACK processing itself is ok and > done (it also mixes "DSACKs"/"SACKed" illogically). This commit message > must be fixed so that guys who read it later won't get confused too. BUT > I'd on the same time split the whole fix to 3 different patches unless > there's some very good reason why the changes are interlocked so that the > splitting is not possible. I will attempt to make the commit messages more clear as I split up the patch into smaller patches. >> > An older ACK should not have more uptodate SACK information than a newer >> > ACK. > > This is not true with DSACK that are not cumulative. And also the > reported SACK state is cumulative up to a limit. > >> I grant that it is a rare corner case, but it is possible if there is >> reordering and packet loss in both directions. >> >> In fact, AFAICT the code path to handle this corner case is already >> there: AFAICT we call tcp_sacktag_write_queue() in the old_ack code >> path exactly because it is possible for old ACKs to carry SACK ranges >> that we haven't seen yet. > > I agree. ...And also DSACKs that can only be seen in that particular ACK. > >> Let me try to lay out a specific example: >> >> Suppose a sender sends packets 1-12, and there is lots of loss and >> reordering in both directions. >> >> Suppose the receiver sends the following ACKs with SACK blocks, in the >> following order (using packet numbers rather than sequence numbers): >> >> (1) ack 1, sack <3-4> >> (2) ack 1, sack <3-4 6> >> (3) ack 1, sack <3-4 6 8> >> (4) ack 1, sack <3-4 6 8 10> >> (5) ack 1, sack <6 8 10 12> >> (6) ack 2, sack <6 8 10 12> >> >> Note that starting at ACK (5) the receiver is limited by the fact that >> TCP options can only hold 4 SACK ranges, so the SACK for 3-4 "falls >> off" the SACK block set. >> >> Now suppose the ACKs are reordered, and some are dropped, so that the >> sender receives the ACKs in this order: >> >> (6) ack 2, sack <6 8 10 12> >> (1) ack 1, sack <3-4> >> >> The arrival of (6) advances snd_una to 2. Now the question is what to >> do when (1), with an ACK field below snd_una, arrives at the >> sender. The existing code tags packets 3-4 as SACKed, but does not use >> this newly SACK-tagged data to send out new data. This commit proposes >> to use the newly SACK-tagged 3-4 to send out new data. > > Contrary to your commit message, new data sending is already done outside > of tcp_ack() in tcp_data_snd_check? If my commit message seemed to be talking about sending new data, then it was only because of a lack of clarity. Where the commit message referred to "more packets" I meant either new/untransmitted data or retransmitted data. I will try to make this more clear. > ...It's tcp_fastretrans_alert call > that is lacking so no rexmits could be done currently? Yes, in terms of missing some opportunities for SACKs to trigger sends, the tcp_fastretrans_alert call is basically all that's lacking. > In addition, this given scenarios was not DSACK related!?! I think your > commit message is just bogus discussing (only) DSACK processing for this > case. Case 2)/(2) in the commit message does mention both DSACK and SACK processing for this case. Or maybe you are referring to one of the other cases? I don' think SACK processing applies to the other cases. In any case, I'm hoping this will all be more clear once I split up the patch into smaller pieces, and do a little wordsmithing on the commit messages. I'll be sending this out soon. Thanks to both of you for taking a look at this. neal -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 15 Nov 2011, Neal Cardwell wrote: > On Mon, Nov 14, 2011 at 9:05 PM, Ilpo Järvinen > <ilpo.jarvinen@helsinki.fi> wrote: > > > > When you need to say "fixes 1), 2), and 3)", please stop for a moment > > thinking if the fixes really need to be sent as one fix or 3 fixes (a > > series PATCH x/3, see netdev for examples). It would make review much > > simpler if unrelated fixes are not put together even if all of them > > would be needed to get DSACK working as intented. > > I'm happy to split this patch into a few smaller patches. I was on the > fence about how fine-grained to make them; I'll make them smaller. Thanks. In fixes smaller is almost always better. You can then state per fix very exactly what is wrong in the code and in what scenario that causes problem to materialize for real. Smallness helps the review enourmously and all questions raised against can then also remain more focused (possibly the rest could be acked/applied immediately, which happens quite often actually). You will be surprised yourself too then, once you've split each patch becomes very obvious and short (and then I'll be more lax about the commit message too as the diff speaks more succesfully for itself ;-)).
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 52b5c2d..78dd38c 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -2858,7 +2858,7 @@ static void tcp_try_keep_open(struct sock *sk) struct tcp_sock *tp = tcp_sk(sk); int state = TCP_CA_Open; - if (tcp_left_out(tp) || tcp_any_retrans_done(sk) || tp->undo_marker) + if (tcp_left_out(tp) || tcp_any_retrans_done(sk)) state = TCP_CA_Disorder; if (inet_csk(sk)->icsk_ca_state != state) { @@ -2881,7 +2881,8 @@ static void tcp_try_to_open(struct sock *sk, int flag) if (inet_csk(sk)->icsk_ca_state != TCP_CA_CWR) { tcp_try_keep_open(sk); - tcp_moderate_cwnd(tp); + if (inet_csk(sk)->icsk_ca_state != TCP_CA_Open) + tcp_moderate_cwnd(tp); } else { tcp_cwnd_down(sk, flag); } @@ -3009,11 +3010,11 @@ static void tcp_update_cwnd_in_recovery(struct sock *sk, int newly_acked_sacked, * tcp_xmit_retransmit_queue(). */ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, - int newly_acked_sacked, int flag) + int newly_acked_sacked, bool is_dupack, + int flag) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); - int is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP)); int do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) && (tcp_fackets_out(tp) > tp->reordering)); int fast_rexmit = 0, mib_idx; @@ -3066,17 +3067,6 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, } break; - case TCP_CA_Disorder: - tcp_try_undo_dsack(sk); - if (!tp->undo_marker || - /* For SACK case do not Open to allow to undo - * catching for all duplicate ACKs. */ - tcp_is_reno(tp) || tp->snd_una != tp->high_seq) { - tp->undo_marker = 0; - tcp_set_ca_state(sk, TCP_CA_Open); - } - break; - case TCP_CA_Recovery: if (tcp_is_reno(tp)) tcp_reset_reno_sack(tp); @@ -3117,7 +3107,7 @@ static void tcp_fastretrans_alert(struct sock *sk, int pkts_acked, tcp_add_reno_sack(sk); } - if (icsk->icsk_ca_state == TCP_CA_Disorder) + if (icsk->icsk_ca_state <= TCP_CA_Disorder) tcp_try_undo_dsack(sk); if (!tcp_time_to_recover(sk)) { @@ -3681,10 +3671,12 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) u32 prior_snd_una = tp->snd_una; u32 ack_seq = TCP_SKB_CB(skb)->seq; u32 ack = TCP_SKB_CB(skb)->ack_seq; + bool is_dupack = false; u32 prior_in_flight; u32 prior_fackets; int prior_packets; int prior_sacked = tp->sacked_out; + int pkts_acked = 0; int newly_acked_sacked = 0; int frto_cwnd = 0; @@ -3757,6 +3749,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) /* See if we can take anything off of the retransmit queue. */ flag |= tcp_clean_rtx_queue(sk, prior_fackets, prior_snd_una); + pkts_acked = prior_packets - tp->packets_out; newly_acked_sacked = (prior_packets - prior_sacked) - (tp->packets_out - tp->sacked_out); @@ -3771,8 +3764,9 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) if ((flag & FLAG_DATA_ACKED) && !frto_cwnd && tcp_may_raise_cwnd(sk, flag)) tcp_cong_avoid(sk, ack, prior_in_flight); - tcp_fastretrans_alert(sk, prior_packets - tp->packets_out, - newly_acked_sacked, flag); + is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP)); + tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked, + is_dupack, flag); } else { if ((flag & FLAG_DATA_ACKED) && !frto_cwnd) tcp_cong_avoid(sk, ack, prior_in_flight); @@ -3784,6 +3778,10 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag) return 1; no_queue: + /* If data was DSACKed, see if we can undo a cwnd reduction. */ + if (flag & FLAG_DSACKING_ACK) + tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked, + is_dupack, flag); /* If this ack opens up a zero window, clear backoff. It was * being used to time the probes, and is probably far higher than * it needs to be for normal retransmission. @@ -3797,10 +3795,14 @@ invalid_ack: return -1; old_ack: + /* If data was SACKed, tag it and see if we should send more data. + * If data was DSACKed, see if we can undo a cwnd reduction. + */ if (TCP_SKB_CB(skb)->sacked) { - tcp_sacktag_write_queue(sk, skb, prior_snd_una); - if (icsk->icsk_ca_state == TCP_CA_Open) - tcp_try_keep_open(sk); + flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una); + newly_acked_sacked = tp->sacked_out - prior_sacked; + tcp_fastretrans_alert(sk, pkts_acked, newly_acked_sacked, + is_dupack, flag); } SOCK_DEBUG(sk, "Ack %u before %u:%u\n", ack, tp->snd_una, tp->snd_nxt);
Fixes for some issues that prevent DSACKs from allowing TCP senders to undo cwnd reductions made during fast recovery. There were a few related bugs/issues: 1) Senders ignored DSACKs after recovery when there were no outstanding packets (a common scenario for HTTP servers). 2) When the ACK field is below snd_una (which can happen when ACKs are reordered), senders ignored DSACKs (preventing undo) and passed up chances to send out more packets based on any newly-SACKed packets. 3) Senders were overriding cwnd values picked during an undo by calling tcp_moderate_cwnd() in tcp_try_to_open(). The fixes: (1) When there are no outstanding packets (the "no_queue" goto label), use DSACKs to undo congestion window reductions. (2) When the ACK field is below snd_una (the "old_ack" goto label), process any DSACKs and try to send out more packets based on newly-SACKed packets. (3) Don't moderate cwnd in tcp_try_to_open() if we're in TCP_CA_Open, since doing so is generally unnecessary and specifically would override a DSACK-based undo of a cwnd reduction made in fast recovery. (4) Simplify the congestion avoidance state machine by removing the behavior where SACK-enabled connections hung around in the TCP_CA_Disorder state just waiting for DSACKs. Instead, when snd_una advances to high_seq or beyond we typically move to TCP_CA_Open immediately and allow an undo in either TCP_CA_Open or TCP_CA_Disorder if we later receive enough DSACKs. Previously, SACK-enabled connections hung around in TCP_CA_Disorder state while snd_una==high_seq, just waiting to accumulate DSACKs and hopefully undo a cwnd reduction. This could and did lead to the following unfortunate scenario: if some incoming ACKs advance snd_una beyond high_seq then we were setting undo_marker to 0 and moving to TCP_CA_Open, so if (due to reordering in the ACK return path) we shortly thereafter received a DSACK then we were no longer able to undo the cwnd reduction. Signed-off-by: Neal Cardwell <ncardwell@google.com> --- net/ipv4/tcp_input.c | 44 +++++++++++++++++++++++--------------------- 1 files changed, 23 insertions(+), 21 deletions(-)