Message ID | 20140616211954.6E12BA3A89@unicorn.suse.cz
---|---
State | Changes Requested, archived
Delegated to: | David Miller
On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote: > RFC 5681 says that ssthresh reduction in response to RTO should > be done only once and should not be repeated until all packets > from the first loss are retransmitted. RFC 6582 (as well as its > predecessor RFC 3782) is even more specific and says that when > loss is detected, one should mark current SND.NXT and ssthresh > shouldn't be reduced again due to a loss until SND.UNA reaches > this remembered value. > > In the Linux implementation, this is done in tcp_enter_loss() but an > additional condition > > (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits) > > allows ssthresh to be further reduced before snd_una reaches the > high_seq (the snd_nxt value at the previous loss) as > icsk_retransmits is reset as soon as snd_una moves forward. As a > result, if a retransmit timeout occurs early in the retransmit > phase, we can adjust snd_ssthresh based on a very low value of > cwnd. This can be especially harmful for reno congestion control > with slow linear cwnd growth in the congestion avoidance phase. > > The patch removes the condition above so that snd_ssthresh is > not reduced again until snd_una reaches high_seq as described in > RFC 5681 and 6582. > > Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Yuchung Cheng <ycheng@google.com> > --- > net/ipv4/tcp_input.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c > index 40661fc..768ba88 100644 > --- a/net/ipv4/tcp_input.c > +++ b/net/ipv4/tcp_input.c > @@ -1917,8 +1917,7 @@ void tcp_enter_loss(struct sock *sk, int how) > > /* Reduce ssthresh if it has not yet been made inside this window. 
*/ > if (icsk->icsk_ca_state <= TCP_CA_Disorder || > - !after(tp->high_seq, tp->snd_una) || > - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { > + !after(tp->high_seq, tp->snd_una)) { > new_recovery = true; > tp->prior_ssthresh = tcp_current_ssthresh(sk); > tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk); > -- > 1.8.4.5 > -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote: > On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote: >> RFC 5681 says that ssthresh reduction in response to RTO should >> be done only once and should not be repeated until all packets >> from the first loss are retransmitted. RFC 6582 (as well as its >> predecessor RFC 3782) is even more specific and says that when >> loss is detected, one should mark current SND.NXT and ssthresh >> shouldn't be reduced again due to a loss until SND.UNA reaches >> this remembered value. >> >> In Linux implementation, this is done in tcp_enter_loss() but an >> additional condition >> >> (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits) >> >> allows to further reduce ssthresh before snd_una reaches the >> high_seq (the snd_nxt value at the previous loss) as >> icsk_retransmits is reset as soon as snd_una moves forward. As a >> result, if a retransmit timeout ouccurs early in the retransmit >> phase, we can adjust snd_ssthresh based on very low value of >> cwnd. This can be especially harmful for reno congestion control >> with slow linear cwnd growth in congestion avoidance phase. >> >> The patch removes the condition above so that snd_ssthresh is >> not reduced again until snd_una reaches high_seq as described in >> RFC 5681 and 6582. >> >> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> AFAICT this commit description is arguing from a mis-reading of the RFCs. RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant to the timeout recovery we're dealing with in tcp_enter_loss(). RFC 5681, Section 4.3 says: Loss in two successive windows of data, or the loss of a retransmission, should be taken as two indications of congestion and, therefore, cwnd (and ssthresh) MUST be lowered twice in this case. 
So if we're in TCP_CA_Loss and snd_una advances (FLAG_DATA_ACKED is set and icsk_retransmits is zero), but snd_una does not advance above high_seq, then if we subsequently suffer an RTO (and call tcp_enter_loss()) then that indicates a retransmission is lost, which this passage from sec 4.3 indicates should be taken as a second indication of congestion. > - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { AFAICT this existing code is a faithful implementation of RFC 5681, Section 7: The treatment of ssthresh on retransmission timeout was clarified. In particular, ssthresh must be set to half the FlightSize on the first retransmission of a given segment and then is held constant on subsequent retransmissions of the same segment. That is, if snd_una advances (FLAG_DATA_ACKED is set and icsk_retransmits is zero), and we subsequently suffer an RTO and call tcp_enter_loss(), then we will be sending a "first retransmission" at the segment pointed to by the new/higher snd_una. So this is the first retransmission of that new segment, so we should reduce ssthresh. And from first principles, the current Linux code and RFCs seem sensible on this matter, AFAICT. Suppose we suffer an RTO, and then over the following RTTs in TCP_CA_Loss we grow cwnd exponentially again. If we suffer another RTO in this cwnd growth process, then it seems like a good idea to remember the reduced ssthresh inferred from this smaller cwnd at which we suffered a loss. So AFAICT the existing code is sensible and complies with the RFC. Now, I agree the linear growth of Reno in such situations is problematic, but I think it's a somewhat separate issue. Or at least if we're going to change the behavior here then we should justify it by using data, and not by reference to RFCs. :-) neal
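Neal's reading of the pre-patch gate can be sketched as a simplified, non-authoritative model of the condition in tcp_enter_loss() (state constants and field names mirror the kernel's, but this is only an illustration, not kernel code):

```python
# Simplified model of the pre-patch ssthresh-reduction gate in
# tcp_enter_loss(); illustration only.

CA_OPEN, CA_DISORDER, CA_CWR, CA_RECOVERY, CA_LOSS = range(5)

def should_reduce_ssthresh(ca_state, high_seq, snd_una, icsk_retransmits):
    first_loss = ca_state <= CA_DISORDER        # no recovery in progress yet
    window_acked = snd_una >= high_seq          # i.e. !after(high_seq, snd_una)
    # The clause the patch removes: icsk_retransmits was reset to 0 because
    # snd_una advanced, so per RFC 5681 sec. 7 the next RTO retransmits a
    # *new* segment for the first time and ssthresh is halved again.
    first_rexmit_of_new_seg = (ca_state == CA_LOSS and icsk_retransmits == 0)
    return first_loss or window_acked or first_rexmit_of_new_seg

# RTO while still below high_seq, but snd_una has moved forward:
assert should_reduce_ssthresh(CA_LOSS, high_seq=1000, snd_una=100,
                              icsk_retransmits=0)
# Same point in recovery, but snd_una has not advanced since the last RTO:
assert not should_reduce_ssthresh(CA_LOSS, high_seq=1000, snd_una=100,
                                  icsk_retransmits=3)
```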
On Mon, Jun 16, 2014 at 4:42 PM, Neal Cardwell <ncardwell@google.com> wrote: > On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote: >> On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote: >>> RFC 5681 says that ssthresh reduction in response to RTO should >>> be done only once and should not be repeated until all packets >>> from the first loss are retransmitted. RFC 6582 (as well as its >>> predecessor RFC 3782) is even more specific and says that when >>> loss is detected, one should mark current SND.NXT and ssthresh >>> shouldn't be reduced again due to a loss until SND.UNA reaches >>> this remembered value. >>> >>> In Linux implementation, this is done in tcp_enter_loss() but an >>> additional condition >>> >>> (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits) >>> >>> allows to further reduce ssthresh before snd_una reaches the >>> high_seq (the snd_nxt value at the previous loss) as >>> icsk_retransmits is reset as soon as snd_una moves forward. As a >>> result, if a retransmit timeout ouccurs early in the retransmit >>> phase, we can adjust snd_ssthresh based on very low value of >>> cwnd. This can be especially harmful for reno congestion control >>> with slow linear cwnd growth in congestion avoidance phase. >>> >>> The patch removes the condition above so that snd_ssthresh is >>> not reduced again until snd_una reaches high_seq as described in >>> RFC 5681 and 6582. >>> >>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> > > AFAICT this commit description is arguing from a mis-reading of the > RFCs. > > RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant > to the timeout recovery we're dealing with in tcp_enter_loss(). > > RFC 5681, Section 4.3 says: > > Loss in two successive windows of data, or the loss of a > retransmission, should be taken as two indications of congestion and, > therefore, cwnd (and ssthresh) MUST be lowered twice in this case. 
> > So if we're in TCP_CA_Loss snd_una advances (FLAG_DATA_ACKED is set > and icsk_retransmits is zero), but snd_una does not advance above > high_seq, then if we subsequently suffer an RTO (and call > tcp_enter_loss()) then that indicates a retransmission is lost, which > this passage from sec 4.3 indicates should be taken as a second > indication of congestion. That's right. I should have checked the RFC more thoroughly. Sorry please ignore my Acked-by. However Linux is inconsistent on the loss of a retransmission. It reduces ssthresh (and cwnd) if this happens on a timeout, but not in fast recovery (tcp_mark_lost_retrans). We should fix that and that should help dealing with traffic policers. > >> - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) { > > AFAICT this existing code is a faithful implementation of, RFC 5681, > Section 7: > > The treatment of ssthresh on retransmission timeout was clarified. > In particular, ssthresh must be set to half the FlightSize on the > first retransmission of a given segment and then is held constant on > subsequent retransmissions of the same segment. > > That is, if snd_una advances (FLAG_DATA_ACKED is set and > icsk_retransmits is zero), if we subsequently suffer an RTO and call > tcp_enter_loss() then we will be sending a "first retransmission" at > the segment pointed to by the new/higher snd_una. So this is the first > retransmission of that new segment, so we should reduce ssthresh. > > And from first principles, the current Linux code and RFCs seem > sensible on this matter, AFAICT. Suppose we suffer an RTO, and then > over the following RTTs in TCP_CA_Loss we grow cwnd exponentially > again. If we suffer another RTO in this cwnd growth process, then it > seems like a good idea to remember the reduced ssthresh inferred from > this smaller cwnd at which we suffered a loss. > > So AFAICT the existing code is sensible and complies with the RFC. 
> > Now, I agree the linear growth of Reno in such situations is > problematic, but I think it's a somewhat separate issue. Or at least > if we're going to change the behavior here then we should justify it > by using data, and not by reference to RFCs. :-) > > neal
On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: > However Linux is inconsistent on the loss of a retransmission. It > reduces ssthresh (and cwnd) if this happens on a timeout, but not in > fast recovery (tcp_mark_lost_retrans). We should fix that and that > should help dealing with traffic policers. Yes, great point! neal
On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: > On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: > > However Linux is inconsistent on the loss of a retransmission. It > > reduces ssthresh (and cwnd) if this happens on a timeout, but not in > > fast recovery (tcp_mark_lost_retrans). We should fix that and that > > should help dealing with traffic policers. > > Yes, great point! Does it mean the patch itself would be acceptable if the reasoning in its commit message was changed? Or would you prefer a different way to unify the two situations? Michal Kubecek
On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote: > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: >> > However Linux is inconsistent on the loss of a retransmission. It >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that >> > should help dealing with traffic policers. >> >> Yes, great point! > > Does it mean the patch itself would be acceptable if the reasoning in > its commit message was changed? Or would you prefer a different way to > unify the two situations? It's the latter but it seems to belong to a different patch (and it'll not solve the problem you are seeing). The idea behind the RFC is that TCP should reduce cwnd and ssthresh across round trips of send, but not within an RTT. Suppose cwnd was 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3 round trips, we time out again. By the design of Reno this should reset cwnd from 8 to 1, and ssthresh from 5 to 2.5. Of course this may not make sense in various cases. But it will be a design bug in the congestion control rather than an implementation bug in the loss recovery. We are seeing many similar issues where non-queue-overflow drops mess up CCs relying on ssthresh :( > > Michal Kubecek
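For reference, the halving rule Linux's Reno actually implements (tcp_reno_ssthresh() in net/ipv4/tcp_cong.c returns max(cwnd/2, 2)) works from the cwnd at the moment of the loss, not from the previous ssthresh, so the numbers in this example come out as follows (a sketch, not kernel code):

```python
# Sketch of Linux's Reno ssthresh rule: on a loss event,
# ssthresh = max(cwnd / 2, 2), taken from the *current* cwnd.

def reno_ssthresh(cwnd):
    return max(cwnd // 2, 2)

ssthresh = reno_ssthresh(10)    # first timeout at cwnd 10
cwnd = 1                        # cwnd collapses to 1 on RTO
assert ssthresh == 5

cwnd = 8                        # a few round trips of regrowth later...
ssthresh = reno_ssthresh(cwnd)  # ...second timeout halves the cwnd of 8
assert ssthresh == 4
```

So the second timeout takes ssthresh from 5 to 4 (half the cwnd of 8), the detail Michal picks up on in the follow-up.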
On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote: > On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote: > > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: > >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: > >> > However Linux is inconsistent on the loss of a retransmission. It > >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in > >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that > >> > should help dealing with traffic policers. > >> > >> Yes, great point! > > > > Does it mean the patch itself would be acceptable if the reasoning in > > its commit message was changed? Or would you prefer a different way to > > unify the two situations? > > It's the latter but it seems to belong to a different patch (and it'll > not solve the problem you are seeing). OK, thank you. I guess we will have to persuade them to move to cubic which handles their problems much better. > The idea behind the RFC is that TCP should reduce cwnd and ssthresh > across round trips of send, but not within an RTT. Suppose cwnd was > 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3 > round trips, we time out again. By the design of Reno this should > reset cwnd from 8 to 1, and ssthresh from 5 to 2.5. Shouldn't that be from 5 to 4? We reduce ssthresh to half of current cwnd, not current ssthresh. BTW, this is exactly the problem our customer is facing: they have a relatively fast line (15 Mb/s) but with big buffers, so that the round-trip times can rise from an unloaded 35 ms up to something like 1.5 s under full load. What happens is this: cwnd initially rises to ~2100, then first drops are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start lets cwnd reach ssthresh but after that, a slow linear growth follows. 
In this state, all in-flight packets are dropped (simulation of what happens on router switchover) so that cwnd is reset to 1 again and ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh). If a packet loss comes shortly after that, cwnd is still very low and ssthresh is reduced to half of that cwnd (i.e. much lower than half of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2, which takes really long to recover from. Michal Kubecek
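The collapse Michal describes can be reproduced with a toy calculation (illustration only; it applies the max(cwnd/2, 2) halving rule, and the intermediate cwnd values are made up to match the reported scenario):

```python
# Toy walk-through of the reported scenario: back-to-back losses while
# cwnd is still small after an RTO drive ssthresh far below its old value.

def rto(cwnd):
    """Return (new_cwnd, new_ssthresh) after a timeout, Reno-style."""
    return 1, max(cwnd // 2, 2)

cwnd = 2100
cwnd, ssthresh = rto(cwnd)   # first drops: ssthresh ~1050
assert ssthresh == 1050

cwnd = 1090                  # slow start to ssthresh, then slow linear growth
cwnd, ssthresh = rto(cwnd)   # switchover drops everything in flight
assert ssthresh == 545       # "something like 530-550"

cwnd = 6                     # another loss only a few RTTs after the RTO
cwnd, ssthresh = rto(cwnd)
assert ssthresh == 3         # ssthresh now tiny; linear reno growth from
                             # here back to ~1000 takes a very long time
```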
Michal Kubecek <mkubecek@suse.cz> wrote: >On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote: >> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote: >> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: >> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: >> >> > However Linux is inconsistent on the loss of a retransmission. It >> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in >> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that >> >> > should help dealing with traffic policers. >> >> >> >> Yes, great point! >> > >> > Does it mean the patch itself would be acceptable if the reasoning in >> > its commit message was changed? Or would you prefer a different way to >> > unify the two situations? >> >> It's the latter but it seems to belong to a different patch (and it'll >> not solve the problem you are seeing). > >OK, thank you. I guess we will have to persuade them to move to cubic >which handles their problems much better. > >> The idea behind the RFC is that TCP should reduce cwnd and ssthresh >> across round trips of send, but not within an RTT. Suppose cwnd was >> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3 >> round trips, we time out again. By the design of Reno this should >> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5. > >Shouldn't that be from 5 to 4? We reduce ssthresh to half of current >cwnd, not current ssthresh. > >BtW, this is exactly the problem our customer is facing: they have >relatively fast line (15 Mb/s) but with big buffers so that the >roundtrip times can rise from unloaded 35 ms up to something like 1.5 s >under full load. > >What happens is this: cwnd initally rises to ~2100 then first drops >are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start >lets cwnd reach ssthresh but after that, a slow linear growth follows. 
>In this state, all in-flight packets are dropped (simulation of what >happens on router switchover) so that cwnd is reset to 1 again and >ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh). >If a packet loss comes shortly after that, cwnd is still very low and >ssthresh is reduced to half of that cwnd (i.e. much lower than to half >of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2 >which takes really long to recover from. I'm also looking into a problem that exhibits very similar TCP characteristics, even down to cwnd and ssthresh values similar to what you cite. In this case, the situation has to do with high RTT (around 80 ms) connections competing with low RTT (1 ms) connections. This case is already using cubic. Essentially, a high RTT connection to the server transfers data in at a reasonable and steady rate until something causes some packets to be lost (in this case, another transfer from a low RTT host to the same server). Some packets are lost, and cwnd drops from ~2200 to ~300 (in stages, first to ~1500, then ~600, then to ~300). The ssthresh starts at around 1100, then drops to ~260, which is the lowest cwnd value. The recovery from the low cwnd situation is very slow; cwnd climbs a bit and then remains essentially flat for around 5 seconds. It then begins to climb until a few packets are lost again, and the cycle repeats. If no further losses occur (if the competing traffic has ceased, for example), recovery from a low cwnd (300 - 750 ish) to the full value (~2200) requires on the order of 20 seconds. The connection exits recovery state fairly quickly, and most of the 20 seconds is spent in open state. -J --- -Jay Vosburgh, jay.vosburgh@canonical.com
On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh <jay.vosburgh@canonical.com> wrote: > Michal Kubecek <mkubecek@suse.cz> wrote: > >>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote: >>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote: >>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: >>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: >>> >> > However Linux is inconsistent on the loss of a retransmission. It >>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in >>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that >>> >> > should help dealing with traffic policers. >>> >> >>> >> Yes, great point! >>> > >>> > Does it mean the patch itself would be acceptable if the reasoning in >>> > its commit message was changed? Or would you prefer a different way to >>> > unify the two situations? >>> >>> It's the latter but it seems to belong to a different patch (and it'll >>> not solve the problem you are seeing). >> >>OK, thank you. I guess we will have to persuade them to move to cubic >>which handles their problems much better. >> >>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh >>> across round trips of send, but not within an RTT. Suppose cwnd was >>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3 >>> round trips, we time out again. By the design of Reno this should >>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5. >> >>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current >>cwnd, not current ssthresh. >> >>BtW, this is exactly the problem our customer is facing: they have >>relatively fast line (15 Mb/s) but with big buffers so that the >>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s >>under full load. >> >>What happens is this: cwnd initally rises to ~2100 then first drops >>are encountered, cwnd is set to 1 and ssthresh to ~1050. 
The slow start >>lets cwnd reach ssthresh but after that, a slow linear growth follows. >>In this state, all in-flight packets are dropped (simulation of what >>happens on router switchover) so that cwnd is reset to 1 again and >>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh). >>If a packet loss comes shortly after that, cwnd is still very low and >>ssthresh is reduced to half of that cwnd (i.e. much lower than to half >>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2 >>which takes really long to recover from. > > I'm also looking into a problem that exhibits very similar TCP > characteristics, even down to cwnd and ssthresh values similar to what > you cite. In this case, the situation has to do with high RTT (around > 80 ms) connections competing with low RTT (1 ms) connections. This case > is already using cubic. > > Essentially, a high RTT connection to the server transfers data > in at a reasonable and steady rate until something causes some packets > to be lost (in this case, another transfer from a low RTT host to the > same server). Some packets are lost, and cwnd drops from ~2200 to ~300 > (in stages, first to ~1500, then ~600, then to ~300, ). The ssthresh > starts at around 1100, then drops to ~260, which is the lowest cwnd > value. > > The recovery from the low cwnd situation is very slow; cwnd > climbs a bit and then remains essentially flat for around 5 seconds. It > then begins to climb until a few packets are lost again, and the cycle > repeats. If no futher losses occur (if the competing traffic has > ceased, for example), recovery from a low cwnd (300 - 750 ish) to the > full value (~2200) requires on the order of 20 seconds. The connection > exits recovery state fairly quickly, and most of the 20 seconds is spent > in open state. Interesting. I'm a little surprised it takes CUBIC so long to re-grow cwnd to the full value. 
Would you be able to provide your kernel version number and post a tcpdump binary packet trace somewhere public? One thing you could try would be to disable CUBIC's "fast convergence" feature: echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence We have noticed that this feature can hurt performance when there is a high rate of random packet drops (packet drops that are not correlated with the sending rate of the flow in question). neal
Neal Cardwell <ncardwell@google.com> wrote: >On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh ><jay.vosburgh@canonical.com> wrote: >> Michal Kubecek <mkubecek@suse.cz> wrote: >> >>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote: >>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote: >>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote: >>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote: >>>> >> > However Linux is inconsistent on the loss of a retransmission. It >>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in >>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that >>>> >> > should help dealing with traffic policers. >>>> >> >>>> >> Yes, great point! >>>> > >>>> > Does it mean the patch itself would be acceptable if the reasoning in >>>> > its commit message was changed? Or would you prefer a different way to >>>> > unify the two situations? >>>> >>>> It's the latter but it seems to belong to a different patch (and it'll >>>> not solve the problem you are seeing). >>> >>>OK, thank you. I guess we will have to persuade them to move to cubic >>>which handles their problems much better. >>> >>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh >>>> across round trips of send, but not within an RTT. Suppose cwnd was >>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3 >>>> round trips, we time out again. By the design of Reno this should >>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5. >>> >>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current >>>cwnd, not current ssthresh. >>> >>>BtW, this is exactly the problem our customer is facing: they have >>>relatively fast line (15 Mb/s) but with big buffers so that the >>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s >>>under full load. 
>>> >>>What happens is this: cwnd initally rises to ~2100 then first drops >>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start >>>lets cwnd reach ssthresh but after that, a slow linear growth follows. >>>In this state, all in-flight packets are dropped (simulation of what >>>happens on router switchover) so that cwnd is reset to 1 again and >>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh). >>>If a packet loss comes shortly after that, cwnd is still very low and >>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half >>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2 >>>which takes really long to recover from. >> >> I'm also looking into a problem that exhibits very similar TCP >> characteristics, even down to cwnd and ssthresh values similar to what >> you cite. In this case, the situation has to do with high RTT (around >> 80 ms) connections competing with low RTT (1 ms) connections. This case >> is already using cubic. >> >> Essentially, a high RTT connection to the server transfers data >> in at a reasonable and steady rate until something causes some packets >> to be lost (in this case, another transfer from a low RTT host to the >> same server). Some packets are lost, and cwnd drops from ~2200 to ~300 >> (in stages, first to ~1500, then ~600, then to ~300, ). The ssthresh >> starts at around 1100, then drops to ~260, which is the lowest cwnd >> value. >> >> The recovery from the low cwnd situation is very slow; cwnd >> climbs a bit and then remains essentially flat for around 5 seconds. It >> then begins to climb until a few packets are lost again, and the cycle >> repeats. If no futher losses occur (if the competing traffic has >> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the >> full value (~2200) requires on the order of 20 seconds. The connection >> exits recovery state fairly quickly, and most of the 20 seconds is spent >> in open state. 
> >Interesting. I'm a little surprised it takes CUBIC so long to re-grow >cwnd to the full value. Would you be able to provide your kernel >version number and post a tcpdump binary packet trace somewhere >public? The kernel I'm using at the moment is an Ubuntu 3.2.0-54 distro kernel, but I've reproduced the problem on Ubuntu distro 3.13 and a mainline 3.15-rc (although in the 3.13/3.15 cases using netem to inject delay). I've been gathering data mostly with systemtap, but I should be able to get some packet captures as well, although not until tomorrow. The test I'm using right now is pretty simple. I have three machines: two, A and B, are separated by about 80 ms RTT; the third machine, C, is about 1 ms from B, so:

A --- 80ms --- B --- 1ms ---- C

On A, I run an "iperf -i 1" to B, and let it max its cwnd, and then on C, run an "iperf -t 1" to B ("-t 1" means only run for one second then exit). The iperf results on A look like this:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3]  3.0- 4.0 sec  13.5 MBytes   113 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.9 MBytes   309 Mbits/sec
[  3]  7.0- 8.0 sec  34.8 MBytes   292 Mbits/sec
[  3]  8.0- 9.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  9.0-10.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 10.0-11.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 11.0-12.0 sec  11.1 MBytes  93.3 Mbits/sec
[  3] 12.0-13.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 13.0-14.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 14.0-15.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 15.0-16.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 16.0-17.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 17.0-18.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 18.0-19.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 19.0-20.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 20.0-21.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 23.0-24.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 25.0-26.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 26.0-27.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 27.0-28.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 28.0-29.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 29.0-30.0 sec  12.9 MBytes   108 Mbits/sec
[  3] 30.0-31.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 31.0-32.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 32.0-33.0 sec  21.8 MBytes   182 Mbits/sec
[  3] 33.0-34.0 sec  21.4 MBytes   179 Mbits/sec
[  3] 34.0-35.0 sec  27.8 MBytes   233 Mbits/sec
[  3] 35.0-36.0 sec  32.6 MBytes   274 Mbits/sec
[  3] 36.0-37.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 37.0-38.0 sec  36.6 MBytes   307 Mbits/sec

The second iperf starts at about time 10. The middle value is 1 second's throughput, so the flat throughput between roughly time 13 and time 23 is the cwnd slow recovery. I've got one graph prepared already that I can post: http://people.canonical.com/~jvosburgh/t-vs-cwnd-ssthresh.jpg This shows cwnd (green) and ssthresh (red) vs. time. In this case, the second (low RTT) iperf started at the first big drop at around time 22 and ran for 30 seconds (its data is not on the graph). The big cwnd drop is actually a series of drops, but that's hard to see at this scale. This graph shows two of the slow recoveries, and was done on a 3.13 kernel using netem to add delay. The cwnd and ssthresh data was captured by systemtap when exiting tcp_ack. >One thing you could try would be to disable CUBIC's "fast convergence" feature: > > echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence > >We have noticed that this feature can hurt performance when there is a >high rate of random packet drops (packet drops that are not correlated >with the sending rate of the flow in question). I ran the above iperf test with this disabled; it does not appear to have any effect. 
	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 2014-06-18 at 00:42 +0200, Michal Kubecek wrote:
> BtW, this is exactly the problem our customer is facing: they have a
> relatively fast line (15 Mb/s) but with big buffers, so that the
> round-trip times can rise from an unloaded 35 ms up to something like
> 1.5 s under full load.

It looks like some cwnd limiting would be nice ;)

BDP is about 45 packets for a 35 ms RTT and a 15 Mb/s link.

> What happens is this: cwnd initially rises to ~2100, then the first
> drops are encountered, cwnd is set to 1 and ssthresh to ~1050. The
> slow start lets cwnd reach ssthresh, but after that a slow linear
> growth follows. In this state, all in-flight packets are dropped
> (simulation of what happens on router switchover), so that cwnd is
> reset to 1 again and ssthresh to something like 530-550 (cwnd was a
> bit higher than ssthresh). If a packet loss comes shortly after that,
> cwnd is still very low and ssthresh is reduced to half of that cwnd
> (i.e. much lower than to half of ssthresh). If unlucky, one can even
> end up with ssthresh reduced to 2, which takes really long to recover
> from.
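[Editorial aside: Eric's 45-packet figure is easy to check. A minimal
back-of-the-envelope sketch; the 1448-byte segment payload is an
assumption (typical MSS with TCP timestamps), not a value from the
thread:]

```python
# BDP sanity check for the scenario quoted above: 15 Mb/s link,
# 35 ms unloaded RTT. Segment size of 1448 bytes is assumed.
link_bps = 15e6        # 15 Mb/s
rtt_s = 0.035          # 35 ms unloaded RTT
mss = 1448             # assumed payload bytes per segment

bdp_bytes = link_bps / 8 * rtt_s          # bytes needed to fill the pipe
bdp_packets = bdp_bytes / mss
print(round(bdp_packets))                 # -> 45, matching Eric's figure

# With the fully loaded 1.5 s RTT Michal mentions, the same arithmetic
# gives ~1942 packets, in the neighborhood of the ~2100 cwnd he observed.
bdp_loaded = link_bps / 8 * 1.5 / mss
print(round(bdp_loaded))                  # -> 1942
```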
On Tue, Jun 17, 2014 at 5:38 PM, Jay Vosburgh
<jay.vosburgh@canonical.com> wrote:
> Michal Kubecek <mkubecek@suse.cz> wrote:
>
>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>> >> > should help dealing with traffic policers.
>>> >>
>>> >> Yes, great point!
>>> >
>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>> > its commit message was changed? Or would you prefer a different way to
>>> > unify the two situations?
>>>
>>> It's the latter but it seems to belong to a different patch (and it'll
>>> not solve the problem you are seeing).
>>
>>OK, thank you. I guess we will have to persuade them to move to cubic
>>which handles their problems much better.
>>
>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>> 10 on the first timeout, so cwnd becomes 1 and ssthresh is 5. Then
>>> after 3 round trips, we time out again. By the design of Reno this
>>> should reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>
>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of the
>>current cwnd, not the current ssthresh.

Oops, yes, it should be 8 to 4.

>>
>>BtW, this is exactly the problem our customer is facing: they have a
>>relatively fast line (15 Mb/s) but with big buffers, so that the
>>round-trip times can rise from an unloaded 35 ms up to something like
>>1.5 s under full load.
>>
>>What happens is this: cwnd initially rises to ~2100, then the first
>>drops are encountered, cwnd is set to 1 and ssthresh to ~1050. The
>>slow start lets cwnd reach ssthresh, but after that a slow linear
>>growth follows. In this state, all in-flight packets are dropped
>>(simulation of what happens on router switchover), so that cwnd is
>>reset to 1 again and ssthresh to something like 530-550 (cwnd was a
>>bit higher than ssthresh). If a packet loss comes shortly after that,
>>cwnd is still very low and ssthresh is reduced to half of that cwnd
>>(i.e. much lower than to half of ssthresh). If unlucky, one can even
>>end up with ssthresh reduced to 2, which takes really long to recover
>>from.
>
> I'm also looking into a problem that exhibits very similar TCP
> characteristics, even down to cwnd and ssthresh values similar to what
> you cite. In this case, the situation has to do with high RTT (around
> 80 ms) connections competing with low RTT (1 ms) connections. This
> case is already using cubic.
>
> Essentially, a high RTT connection to the server transfers data
> in at a reasonable and steady rate until something causes some packets
> to be lost (in this case, another transfer from a low RTT host to the
> same server). Some packets are lost, and cwnd drops from ~2200 to ~300
> (in stages: first to ~1500, then to ~600, then to ~300). The ssthresh
> starts at around 1100, then drops to ~260, which is the lowest cwnd
> value.
>
> The recovery from the low cwnd situation is very slow; cwnd
> climbs a bit and then remains essentially flat for around 5 seconds.
> It then begins to climb until a few packets are lost again, and the
> cycle repeats. If no further losses occur (if the competing traffic
> has ceased, for example), recovery from a low cwnd (300-750 ish) to
> the full value (~2200) requires on the order of 20 seconds. The
> connection exits recovery state fairly quickly, and most of the 20
> seconds is spent in open state.

ssthresh is problematic.
Both cases show the same shortcoming of Reno/Cubic using losses and
ssthresh. If losses are not caused by queue overflows but by link
flaps, bursts, etc., the ssthresh is not indicative of BDP. It's kind
of a random value (>> BDP on BB, << BDP in these cases). TCP throughput
goes south if we hit two losses within a few RTTs, and it's a point of
no return :( Hopefully someone can come up with a more intelligent
control.

Several posts in tcpm also discuss the low ssthresh issues:

http://www.ietf.org/mail-archive/web/tcpm/current/msg08145.html
http://www.ietf.org/mail-archive/web/tcpm/current/msg08778.html

>
> -J
>
> ---
> -Jay Vosburgh, jay.vosburgh@canonical.com
Neal Cardwell <ncardwell@google.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@canonical.com> wrote:
[...]
>> The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.
>> It then begins to climb until a few packets are lost again, and the
>> cycle repeats. If no further losses occur (if the competing traffic
>> has ceased, for example), recovery from a low cwnd (300-750 ish) to
>> the full value (~2200) requires on the order of 20 seconds. The
>> connection exits recovery state fairly quickly, and most of the 20
>> seconds is spent in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	Ok, I ran a test today that demonstrates the slow cwnd growth.
The sending machine is 3.15-rc8 (net-next as of about two weeks ago);
the receiver is Ubuntu 3.13.0-24.

	The test involves adding 40 ms of delay in and out from machine
A with netem, then running iperf from A to B. Once the iperf reaches a
steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
1000 coming from A, then remove the rule after 10 seconds. The behavior
resulting from this closely matches what I see on the real systems.

	I captured packets from both ends, running it twice, the second
time with GSO, GRO and TSO disabled.
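[Editorial aside: the exact drop rule isn't quoted in the thread. A
sketch using iptables' statistic match in nth mode, which drops exactly
one packet out of every N; the interface name and source address below
are assumptions, not values from the thread:]

```shell
# Run on B: drop every 1000th packet arriving from A.
# eth1 and 10.0.0.1 (host A) are assumed names.
iptables -A INPUT -i eth1 -s 10.0.0.1 -p tcp \
        -m statistic --mode nth --every 1000 --packet 0 -j DROP

# ...let the test run for ~10 seconds, then remove the rule:
iptables -D INPUT -i eth1 -s 10.0.0.1 -p tcp \
        -m statistic --mode nth --every 1000 --packet 0 -j DROP
```

(`--mode random --probability 0.001` would give a probabilistic 1-in-1000
drop instead of a deterministic one.)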
	The iperf output is as follows:

[  3]  5.0- 6.0 sec  33.6 MBytes   282 Mbits/sec
[  3]  6.0- 7.0 sec  33.8 MBytes   283 Mbits/sec
[  3]  7.0- 8.0 sec  27.0 MBytes   226 Mbits/sec
[  3]  8.0- 9.0 sec  23.2 MBytes   195 Mbits/sec
[  3]  9.0-10.0 sec  17.4 MBytes   146 Mbits/sec
[  3] 10.0-11.0 sec  13.9 MBytes   116 Mbits/sec
[  3] 11.0-12.0 sec  10.4 MBytes  87.0 Mbits/sec
[  3] 12.0-13.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 13.0-14.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3] 14.0-15.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 15.0-16.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 16.0-17.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 17.0-18.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 18.0-19.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 21.0-22.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3] 22.0-23.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 23.0-24.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 24.0-25.0 sec  4.12 MBytes  34.6 Mbits/sec
[  3] 25.0-26.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 26.0-27.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 27.0-28.0 sec  5.88 MBytes  49.3 Mbits/sec
[  3] 28.0-29.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 29.0-30.0 sec  7.38 MBytes  61.9 Mbits/sec
[  3] 30.0-31.0 sec  10.0 MBytes  83.9 Mbits/sec
[  3] 31.0-32.0 sec  11.6 MBytes  97.5 Mbits/sec
[  3] 32.0-33.0 sec  15.5 MBytes   130 Mbits/sec
[  3] 33.0-34.0 sec  17.2 MBytes   145 Mbits/sec
[  3] 34.0-35.0 sec  20.0 MBytes   168 Mbits/sec
[  3] 35.0-36.0 sec  25.5 MBytes   214 Mbits/sec
[  3] 36.0-37.0 sec  29.8 MBytes   250 Mbits/sec
[  3] 37.0-38.0 sec  32.2 MBytes   271 Mbits/sec
[  3] 38.0-39.0 sec  32.4 MBytes   272 Mbits/sec

	For the above run, the iptables drop rule went in at about time
7, and was removed 10 seconds later, so recovery began at about time
17. The second run is similar, although the exact start times differ.
	The full data (two runs, each with packet capture from both ends
and the iperf output) can be found at:

http://people.canonical.com/~jvosburgh/tcp-slow-recovery.tar.bz2

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
> The test involves adding 40 ms of delay in and out from machine
> A with netem, then running iperf from A to B. Once the iperf reaches a
> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
> 1000 coming from A, then remove the rule after 10 seconds. The behavior
> resulting from this closely matches what I see on the real systems.

Please share the netem setup. Are you sure you do not drop frames on
netem? (Considering you disable GSO/TSO, netem has to be able to store
a lot of packets.)
Eric Dumazet <eric.dumazet@gmail.com> wrote:

>On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
>> The test involves adding 40 ms of delay in and out from machine
>> A with netem, then running iperf from A to B. Once the iperf reaches a
>> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
>> 1000 coming from A, then remove the rule after 10 seconds. The behavior
>> resulting from this closely matches what I see on the real systems.
>
>Please share the netem setup. Are you sure you do not drop frames on
>netem ? (considering you disable GSO/TSO netem has to be able to store a
>lot of packets)

	Reasonably sure; "tc -s qdisc" doesn't show any drops by netem
for these test runs. The data I linked to earlier is one run with
TSO/GSO/GRO enabled and one with TSO/GSO/GRO disabled, and the results
are similar in terms of cwnd recovery time.

	Looking at the packet capture for the TSO/GSO/GRO disabled case,
the time span from the first duplicate ACK to the last is about 9
seconds, which is close to the 10 seconds the iptables drop rule is in
effect; the same time analysis applies to retransmissions from the
sender.

	I've also tested using netem to induce drops, but in this
particular case I used iptables.

	The script I use to set up netem is:

#!/bin/bash

IF=eth1
TC=/usr/local/bin/tc
DELAY=40ms

rmmod ifb
modprobe ifb

ip link set dev ifb0 up

if ${TC} qdisc show dev ${IF} | grep -q ingress; then
	${TC} qdisc del dev ${IF} ingress
fi

${TC} qdisc add dev ${IF} ingress
${TC} qdisc del dev ${IF} root

${TC} filter add dev ${IF} parent ffff: protocol ip \
	u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0

${TC} qdisc add dev ifb0 root netem delay ${DELAY} limit 5000
${TC} qdisc add dev ${IF} root netem delay ${DELAY} limit 5000

	In the past I've watched the tc backlog, and the highest I've
seen is about 900 packets, so the limit of 5000 is probably overkill.
	I'm also not absolutely sure that the 40 ms delay in each
direction is materially different from 80 ms in one direction, but the
real configuration I'm recreating has 40 ms each way.

	The tc qdisc stats after the two runs I did earlier to capture
data look like this:

qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap 1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1905005 bytes 22277 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc netem 8002: dev eth1 root refcnt 2 limit 5000 delay 40.0ms
 Sent 773383636 bytes 510901 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc ingress ffff: dev eth1 parent ffff:fff1 ----------------
 Sent 14852588 bytes 281846 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0
qdisc netem 8001: dev ifb0 root refcnt 2 limit 5000 delay 40.0ms
 Sent 18763686 bytes 281291 pkt (dropped 0, overlimits 0 requeues 0)
 backlog 0b 0p requeues 0

	Lastly, I ran the same test on the actual systems, and the iperf
results are similar to those from my test lab:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  5.12 MBytes  43.0 Mbits/sec
[  3]  3.0- 4.0 sec  13.9 MBytes   116 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  7.0- 8.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  8.0- 9.0 sec  37.0 MBytes   310 Mbits/sec
[  3]  9.0-10.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 10.0-11.0 sec  33.9 MBytes   284 Mbits/sec
[  3] 11.0-12.0 sec  0.00 Bytes    0.00 bits/sec
[  3] 12.0-13.0 sec  0.00 Bytes    0.00 bits/sec
[  3] 13.0-14.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 14.0-15.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 15.0-16.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 16.0-17.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 17.0-18.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 24.0-25.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 25.0-26.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 27.0-28.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 28.0-29.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 29.0-30.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 30.0-31.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 31.0-32.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 32.0-33.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 33.0-34.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 34.0-35.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 35.0-36.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 36.0-37.0 sec  12.8 MBytes   107 Mbits/sec
[  3] 37.0-38.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 38.0-39.0 sec  17.0 MBytes   143 Mbits/sec
[  3] 39.0-40.0 sec  19.4 MBytes   163 Mbits/sec
[  3] 40.0-41.0 sec  23.5 MBytes   197 Mbits/sec
[  3] 41.0-42.0 sec  25.6 MBytes   215 Mbits/sec
[  3] 42.0-43.0 sec  30.2 MBytes   254 Mbits/sec
[  3] 43.0-44.0 sec  34.2 MBytes   287 Mbits/sec
[  3] 44.0-45.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 45.0-46.0 sec  38.8 MBytes   325 Mbits/sec
[  3] 46.0-47.0 sec  36.5 MBytes   306 Mbits/sec

	This result is consistently repeatable. These systems have more
hops between them than my lab systems, but the ping RTT is 80 ms.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..768ba88 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,8 +1917,7 @@ void tcp_enter_loss(struct sock *sk, int how)
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
-	    !after(tp->high_seq, tp->snd_una) ||
-	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
+	    !after(tp->high_seq, tp->snd_una)) {
 		new_recovery = true;
 		tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
RFC 5681 says that ssthresh reduction in response to RTO should be done
only once and should not be repeated until all packets from the first
loss are retransmitted. RFC 6582 (as well as its predecessor, RFC 3782)
is even more specific and says that when loss is detected, one should
mark the current SND.NXT, and ssthresh shouldn't be reduced again due
to a loss until SND.UNA reaches this remembered value.

In the Linux implementation, this is done in tcp_enter_loss(), but an
additional condition

	(icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)

allows ssthresh to be reduced further before snd_una reaches high_seq
(the snd_nxt value at the previous loss), as icsk_retransmits is reset
as soon as snd_una moves forward. As a result, if a retransmit timeout
occurs early in the retransmit phase, we can adjust snd_ssthresh based
on a very low value of cwnd. This can be especially harmful for reno
congestion control, with its slow linear cwnd growth in the congestion
avoidance phase.

This patch removes the condition above so that snd_ssthresh is not
reduced again until snd_una reaches high_seq, as described in RFC 5681
and RFC 6582.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 net/ipv4/tcp_input.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
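[Editorial aside: a toy model of the behavior the patch changes.
Illustrative Python, not kernel code; it simplifies tcp_enter_loss()
down to the one condition at issue (the CA_Disorder clause and real
sequence arithmetic are omitted). With the legacy clause, a second RTO
early in recovery halves ssthresh again from the already-tiny cwnd;
with the patch, no further reduction happens until snd_una passes
high_seq:]

```python
# Toy model of the ssthresh-reduction rule in tcp_enter_loss().
# All field names mirror the kernel's, but this is a sketch, not
# the actual implementation.

def enter_loss(s, legacy_condition):
    """Simulate one RTO; reduce ssthresh only when the rule allows."""
    # RFC 6582: reduce only if snd_una has passed the high_seq recorded
    # at the previous loss (kernel: !after(tp->high_seq, tp->snd_una)).
    not_yet_reduced = s["snd_una"] >= s["high_seq"]
    # The removed clause also fired whenever icsk_retransmits == 0,
    # which is reset each time snd_una advances at all.
    if not_yet_reduced or (legacy_condition and s["retransmits"] == 0):
        s["ssthresh"] = max(s["cwnd"] // 2, 2)
        s["high_seq"] = s["snd_nxt"]
    s["cwnd"] = 1

base = {"cwnd": 10, "snd_una": 0, "snd_nxt": 100,
        "high_seq": 0, "ssthresh": 10, "retransmits": 0}

patched = dict(base)
enter_loss(patched, legacy_condition=False)   # first RTO: ssthresh -> 5
patched.update(cwnd=4, snd_una=40)            # early in recovery, una < high_seq
enter_loss(patched, legacy_condition=False)   # no second reduction

legacy = dict(base)
enter_loss(legacy, legacy_condition=True)     # first RTO: ssthresh -> 5
legacy.update(cwnd=4, snd_una=40)
enter_loss(legacy, legacy_condition=True)     # halves tiny cwnd: ssthresh -> 2

print(patched["ssthresh"], legacy["ssthresh"])   # -> 5 2
```

This is exactly the failure mode Michal describes: under the legacy
condition, back-to-back RTOs can walk ssthresh down to 2.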