
[net] tcp: avoid multiple ssthresh reductions in one retransmit window

Message ID 20140616211954.6E12BA3A89@unicorn.suse.cz
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Michal Kubecek June 16, 2014, 9:19 p.m. UTC
RFC 5681 says that ssthresh reduction in response to RTO should
be done only once and should not be repeated until all packets
from the first loss are retransmitted. RFC 6582 (as well as its
predecessor RFC 3782) is even more specific and says that when
loss is detected, one should mark current SND.NXT and ssthresh
shouldn't be reduced again due to a loss until SND.UNA reaches
this remembered value.

In the Linux implementation, this is done in tcp_enter_loss() but an
additional condition

  (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)

allows ssthresh to be reduced again before snd_una reaches the
high_seq (the snd_nxt value at the previous loss), because
icsk_retransmits is reset as soon as snd_una moves forward. As a
result, if a retransmit timeout occurs early in the retransmit
phase, we can adjust snd_ssthresh based on a very low value of
cwnd. This can be especially harmful for Reno congestion control
with its slow linear cwnd growth in the congestion avoidance phase.

The patch removes the condition above so that snd_ssthresh is
not reduced again until snd_una reaches high_seq as described in
RFC 5681 and 6582.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
---
 net/ipv4/tcp_input.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Comments

Yuchung Cheng June 16, 2014, 10:39 p.m. UTC | #1
On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
> RFC 5681 says that ssthresh reduction in response to RTO should
> be done only once and should not be repeated until all packets
> from the first loss are retransmitted. RFC 6582 (as well as its
> predecessor RFC 3782) is even more specific and says that when
> loss is detected, one should mark current SND.NXT and ssthresh
> shouldn't be reduced again due to a loss until SND.UNA reaches
> this remembered value.
>
> In Linux implementation, this is done in tcp_enter_loss() but an
> additional condition
>
>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>
> allows to further reduce ssthresh before snd_una reaches the
> high_seq (the snd_nxt value at the previous loss) as
> icsk_retransmits is reset as soon as snd_una moves forward. As a
> result, if a retransmit timeout occurs early in the retransmit
> phase, we can adjust snd_ssthresh based on very low value of
> cwnd. This can be especially harmful for reno congestion control
> with slow linear cwnd growth in congestion avoidance phase.
>
> The patch removes the condition above so that snd_ssthresh is
> not reduced again until snd_una reaches high_seq as described in
> RFC 5681 and 6582.
>
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Yuchung Cheng <ycheng@google.com>

> ---
>  net/ipv4/tcp_input.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 40661fc..768ba88 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -1917,8 +1917,7 @@ void tcp_enter_loss(struct sock *sk, int how)
>
>         /* Reduce ssthresh if it has not yet been made inside this window. */
>         if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
> -           !after(tp->high_seq, tp->snd_una) ||
> -           (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
> +           !after(tp->high_seq, tp->snd_una)) {
>                 new_recovery = true;
>                 tp->prior_ssthresh = tcp_current_ssthresh(sk);
>                 tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);
> --
> 1.8.4.5
>
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell June 16, 2014, 11:42 p.m. UTC | #2
On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote:
> On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
>> RFC 5681 says that ssthresh reduction in response to RTO should
>> be done only once and should not be repeated until all packets
>> from the first loss are retransmitted. RFC 6582 (as well as its
>> predecessor RFC 3782) is even more specific and says that when
>> loss is detected, one should mark current SND.NXT and ssthresh
>> shouldn't be reduced again due to a loss until SND.UNA reaches
>> this remembered value.
>>
>> In Linux implementation, this is done in tcp_enter_loss() but an
>> additional condition
>>
>>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>>
>> allows to further reduce ssthresh before snd_una reaches the
>> high_seq (the snd_nxt value at the previous loss) as
>> icsk_retransmits is reset as soon as snd_una moves forward. As a
>> result, if a retransmit timeout occurs early in the retransmit
>> phase, we can adjust snd_ssthresh based on very low value of
>> cwnd. This can be especially harmful for reno congestion control
>> with slow linear cwnd growth in congestion avoidance phase.
>>
>> The patch removes the condition above so that snd_ssthresh is
>> not reduced again until snd_una reaches high_seq as described in
>> RFC 5681 and 6582.
>>
>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

AFAICT this commit description is arguing from a misreading of the
RFCs.

RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant
to the timeout recovery we're dealing with in tcp_enter_loss().

RFC 5681, Section 4.3 says:

   Loss in two successive windows of data, or the loss of a
   retransmission, should be taken as two indications of congestion and,
   therefore, cwnd (and ssthresh) MUST be lowered twice in this case.

So if we're in TCP_CA_Loss and snd_una advances (FLAG_DATA_ACKED is
set and icsk_retransmits is zero), but snd_una does not advance above
high_seq, and we subsequently suffer an RTO (and call
tcp_enter_loss()), then that indicates a retransmission is lost, which
this passage from sec 4.3 indicates should be taken as a second
indication of congestion.

> - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {

AFAICT this existing code is a faithful implementation of RFC 5681,
Section 7:

   The treatment of ssthresh on retransmission timeout was clarified.
   In particular, ssthresh must be set to half the FlightSize on the
   first retransmission of a given segment and then is held constant on
   subsequent retransmissions of the same segment.

That is, if snd_una advances (FLAG_DATA_ACKED is set and
icsk_retransmits is zero) and we subsequently suffer an RTO and call
tcp_enter_loss(), then we will be sending a "first retransmission" of
the segment pointed to by the new/higher snd_una. Since this is the
first retransmission of that new segment, we should reduce ssthresh.

And from first principles, the current Linux code and RFCs seem
sensible on this matter, AFAICT. Suppose we suffer an RTO, and then
over the following RTTs in TCP_CA_Loss we grow cwnd exponentially
again. If we suffer another RTO in this cwnd growth process, then it
seems like a good idea to remember the reduced ssthresh inferred from
this smaller cwnd at which we suffered a loss.

So AFAICT the existing code is sensible and complies with the RFC.

Now, I agree the linear growth of Reno in such situations is
problematic, but I think it's a somewhat separate issue. Or at least
if we're going to change the behavior here then we should justify it
by using data, and not by reference to RFCs. :-)

neal
Yuchung Cheng June 17, 2014, 12:25 a.m. UTC | #3
On Mon, Jun 16, 2014 at 4:42 PM, Neal Cardwell <ncardwell@google.com> wrote:
> On Mon, Jun 16, 2014 at 6:39 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> On Mon, Jun 16, 2014 at 2:19 PM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> RFC 5681 says that ssthresh reduction in response to RTO should
>>> be done only once and should not be repeated until all packets
>>> from the first loss are retransmitted. RFC 6582 (as well as its
>>> predecessor RFC 3782) is even more specific and says that when
>>> loss is detected, one should mark current SND.NXT and ssthresh
>>> shouldn't be reduced again due to a loss until SND.UNA reaches
>>> this remembered value.
>>>
>>> In Linux implementation, this is done in tcp_enter_loss() but an
>>> additional condition
>>>
>>>   (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)
>>>
>>> allows to further reduce ssthresh before snd_una reaches the
>>> high_seq (the snd_nxt value at the previous loss) as
>>> icsk_retransmits is reset as soon as snd_una moves forward. As a
>>> result, if a retransmit timeout occurs early in the retransmit
>>> phase, we can adjust snd_ssthresh based on very low value of
>>> cwnd. This can be especially harmful for reno congestion control
>>> with slow linear cwnd growth in congestion avoidance phase.
>>>
>>> The patch removes the condition above so that snd_ssthresh is
>>> not reduced again until snd_una reaches high_seq as described in
>>> RFC 5681 and 6582.
>>>
>>> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
>
> AFAICT this commit description is arguing from a mis-reading of the
> RFCs.
>
> RFC 6582 and RFC 3782 are only about Fast Recovery, and not relevant
> to the timeout recovery we're dealing with in tcp_enter_loss().
>
> RFC 5681, Section 4.3 says:
>
>    Loss in two successive windows of data, or the loss of a
>    retransmission, should be taken as two indications of congestion and,
>    therefore, cwnd (and ssthresh) MUST be lowered twice in this case.
>
> So if we're in TCP_CA_Loss snd_una advances (FLAG_DATA_ACKED is set
> and icsk_retransmits is zero), but snd_una does not advance above
> high_seq, then if we subsequently suffer an RTO (and call
> tcp_enter_loss()) then that indicates a retransmission is lost, which
> this passage from sec 4.3 indicates should be taken as a second
> indication of congestion.
That's right. I should have checked the RFC more thoroughly. Sorry,
please ignore my Acked-by.

However, Linux is inconsistent on the loss of a retransmission: it
reduces ssthresh (and cwnd) if this happens on a timeout, but not in
fast recovery (tcp_mark_lost_retrans). We should fix that, and it
should help when dealing with traffic policers.


>
>> - (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
>
> AFAICT this existing code is a faithful implementation of, RFC 5681,
> Section 7:
>
>    The treatment of ssthresh on retransmission timeout was clarified.
>    In particular, ssthresh must be set to half the FlightSize on the
>    first retransmission of a given segment and then is held constant on
>    subsequent retransmissions of the same segment.
>
> That is, if snd_una advances (FLAG_DATA_ACKED is set and
> icsk_retransmits is zero), if we subsequently suffer an RTO and call
> tcp_enter_loss() then we will be sending a "first retransmission" at
> the segment pointed to by the new/higher snd_una. So this is the first
> retransmission of that new segment, so we should reduce ssthresh.
>
> And from first principles, the current Linux code and RFCs seem
> sensible on this matter, AFAICT. Suppose we suffer an RTO, and then
> over the following RTTs in TCP_CA_Loss we grow cwnd exponentially
> again. If we suffer another RTO in this cwnd growth process, then it
> seems like a good idea to remember the reduced ssthresh inferred from
> this smaller cwnd at which we suffered a loss.
>
> So AFAICT the existing code is sensible and complies with the RFC.
>
> Now, I agree the linear growth of Reno in such situations is
> problematic, but I think it's a somewhat separate issue. Or at least
> if we're going to change the behavior here then we should justify it
> by using data, and not by reference to RFCs. :-)
>
> neal
Neal Cardwell June 17, 2014, 12:44 a.m. UTC | #4
On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> However Linux is inconsistent on the loss of a retransmission. It
> reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> fast recovery (tcp_mark_lost_retrans). We should fix that and that
> should help dealing with traffic policers.

Yes, great point!

neal
Michal Kubecek June 17, 2014, 12:20 p.m. UTC | #5
On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> > However Linux is inconsistent on the loss of a retransmission. It
> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
> > should help dealing with traffic policers.
> 
> Yes, great point!

Does it mean the patch itself would be acceptable if the reasoning in
its commit message was changed? Or would you prefer a different way to
unify the two situations?

                                                        Michal Kubecek

Yuchung Cheng June 17, 2014, 9:35 p.m. UTC | #6
On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> > However Linux is inconsistent on the loss of a retransmission. It
>> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>> > should help dealing with traffic policers.
>>
>> Yes, great point!
>
> Does it mean the patch itself would be acceptable if the reasoning in
> its commit message was changed? Or would you prefer a different way to
> unify the two situations?

It's the latter, but it seems to belong in a different patch (and it
won't solve the problem you are seeing).

The idea behind the RFC is that TCP should reduce cwnd and ssthresh
across successive round trips of sending, but not within an RTT.
Suppose cwnd was 10 on the first timeout, so cwnd becomes 1 and
ssthresh 5. Then after 3 round trips, we time out again. By the design
of Reno this should reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.

Of course this may not make sense in various cases. But it will be a
design bug in the congestion control rather than an implementation bug
in the loss recovery. We are seeing many similar issues where
non-queue-overflow drops mess up CCs relying on ssthresh :(

>
>                                                         Michal Kubecek
>
Michal Kubecek June 17, 2014, 10:42 p.m. UTC | #7
On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
> >> > However Linux is inconsistent on the loss of a retransmission. It
> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
> >> > should help dealing with traffic policers.
> >>
> >> Yes, great point!
> >
> > Does it mean the patch itself would be acceptable if the reasoning in
> > its commit message was changed? Or would you prefer a different way to
> > unify the two situations?
> 
> It's the latter but it seems to belong to a different patch (and it'll
> not solve the problem you are seeing).

OK, thank you. I guess we will have to persuade them to move to cubic,
which handles their problems much better.

> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
> across round trips of send, but not within an RTT. Suppose cwnd was
> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
> round trips, we time out again. By the design of Reno this should
> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.

Shouldn't that be from 5 to 4? We reduce ssthresh to half of the
current cwnd, not of the current ssthresh.

BTW, this is exactly the problem our customer is facing: they have a
relatively fast line (15 Mb/s) but with big buffers, so that the
round-trip times can rise from an unloaded 35 ms up to something like
1.5 s under full load.

What happens is this: cwnd initially rises to ~2100, then the first
drops are encountered, cwnd is set to 1 and ssthresh to ~1050. Slow
start lets cwnd reach ssthresh but after that, a slow linear growth
follows. In this state, all in-flight packets are dropped (a simulation
of what happens on router switchover) so that cwnd is reset to 1 again
and ssthresh to something like 530-550 (cwnd was a bit higher than
ssthresh). If a packet loss comes shortly after that, cwnd is still
very low and ssthresh is reduced to half of that cwnd (i.e. much lower
than half of ssthresh). If unlucky, one can even end up with ssthresh
reduced to 2, which takes a really long time to recover from.

                                                        Michal Kubecek

Jay Vosburgh June 18, 2014, 12:38 a.m. UTC | #8
Michal Kubecek <mkubecek@suse.cz> wrote:

>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>> >> > However Linux is inconsistent on the loss of a retransmission. It
>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>> >> > should help dealing with traffic policers.
>> >>
>> >> Yes, great point!
>> >
>> > Does it mean the patch itself would be acceptable if the reasoning in
>> > its commit message was changed? Or would you prefer a different way to
>> > unify the two situations?
>> 
>> It's the latter but it seems to belong to a different patch (and it'll
>> not solve the problem you are seeing).
>
>OK, thank you. I guess we will have to persuade them to move to cubic
>which handles their problems much better.
>
>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>> across round trips of send, but not within an RTT. Suppose cwnd was
>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>> round trips, we time out again. By the design of Reno this should
>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>
>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>cwnd, not current ssthresh.
>
>BtW, this is exactly the problem our customer is facing: they have
>relatively fast line (15 Mb/s) but with big buffers so that the
>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>under full load.
>
>What happens is this: cwnd initally rises to ~2100 then first drops
>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>In this state, all in-flight packets are dropped (simulation of what
>happens on router switchover) so that cwnd is reset to 1 again and
>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>If a packet loss comes shortly after that, cwnd is still very low and
>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>which takes really long to recover from.

	I'm also looking into a problem that exhibits very similar TCP
characteristics, even down to cwnd and ssthresh values similar to what
you cite.  In this case, the situation has to do with high RTT (around
80 ms) connections competing with low RTT (1 ms) connections.  This case
is already using cubic.

	Essentially, a high RTT connection to the server transfers data
at a reasonable and steady rate until something causes some packets
to be lost (in this case, another transfer from a low RTT host to the
same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
(in stages: first to ~1500, then to ~600, then to ~300).  The ssthresh
starts at around 1100, then drops to ~260, which is the lowest cwnd
value.

	The recovery from the low cwnd situation is very slow; cwnd
climbs a bit and then remains essentially flat for around 5 seconds.  It
then begins to climb until a few packets are lost again, and the cycle
repeats.  If no further losses occur (if the competing traffic has
ceased, for example), recovery from a low cwnd (300-750 ish) to the
full value (~2200) requires on the order of 20 seconds.  The connection
exits recovery state fairly quickly, and most of the 20 seconds is spent
in open state.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Neal Cardwell June 18, 2014, 12:56 a.m. UTC | #9
On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
<jay.vosburgh@canonical.com> wrote:
> Michal Kubecek <mkubecek@suse.cz> wrote:
>
>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>> >> > should help dealing with traffic policers.
>>> >>
>>> >> Yes, great point!
>>> >
>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>> > its commit message was changed? Or would you prefer a different way to
>>> > unify the two situations?
>>>
>>> It's the latter but it seems to belong to a different patch (and it'll
>>> not solve the problem you are seeing).
>>
>>OK, thank you. I guess we will have to persuade them to move to cubic
>>which handles their problems much better.
>>
>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>> round trips, we time out again. By the design of Reno this should
>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>
>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>cwnd, not current ssthresh.
>>
>>BtW, this is exactly the problem our customer is facing: they have
>>relatively fast line (15 Mb/s) but with big buffers so that the
>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>under full load.
>>
>>What happens is this: cwnd initially rises to ~2100 then first drops
>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>In this state, all in-flight packets are dropped (simulation of what
>>happens on router switchover) so that cwnd is reset to 1 again and
>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>If a packet loss comes shortly after that, cwnd is still very low and
>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>which takes really long to recover from.
>
>         I'm also looking into a problem that exhibits very similar TCP
> characteristics, even down to cwnd and ssthresh values similar to what
> you cite.  In this case, the situation has to do with high RTT (around
> 80 ms) connections competing with low RTT (1 ms) connections.  This case
> is already using cubic.
>
>         Essentially, a high RTT connection to the server transfers data
> in at a reasonable and steady rate until something causes some packets
> to be lost (in this case, another transfer from a low RTT host to the
> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
> (in stages, first to ~1500, then ~600, then to ~300, ).  The ssthresh
> starts at around 1100, then drops to ~260, which is the lowest cwnd
> value.
>
>         The recovery from the low cwnd situation is very slow; cwnd
> climbs a bit and then remains essentially flat for around 5 seconds.  It
> then begins to climb until a few packets are lost again, and the cycle
> repeats.  If no further losses occur (if the competing traffic has
> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
> full value (~2200) requires on the order of 20 seconds.  The connection
> exits recovery state fairly quickly, and most of the 20 seconds is spent
> in open state.

Interesting. I'm a little surprised it takes CUBIC so long to re-grow
cwnd to the full value. Would you be able to provide your kernel
version number and post a tcpdump binary packet trace somewhere
public?

One thing you could try would be to disable CUBIC's "fast convergence" feature:

  echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence

We have noticed that this feature can hurt performance when there is a
high rate of random packet drops (packet drops that are not correlated
with the sending rate of the flow in question).

neal
Jay Vosburgh June 18, 2014, 2 a.m. UTC | #10
Neal Cardwell <ncardwell@google.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@canonical.com> wrote:
>> Michal Kubecek <mkubecek@suse.cz> wrote:
>>
>>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>>> >> > should help dealing with traffic policers.
>>>> >>
>>>> >> Yes, great point!
>>>> >
>>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>>> > its commit message was changed? Or would you prefer a different way to
>>>> > unify the two situations?
>>>>
>>>> It's the latter but it seems to belong to a different patch (and it'll
>>>> not solve the problem you are seeing).
>>>
>>>OK, thank you. I guess we will have to persuade them to move to cubic
>>>which handles their problems much better.
>>>
>>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>>> round trips, we time out again. By the design of Reno this should
>>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>>
>>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>>cwnd, not current ssthresh.
>>>
>>>BtW, this is exactly the problem our customer is facing: they have
>>>relatively fast line (15 Mb/s) but with big buffers so that the
>>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>>under full load.
>>>
>>>What happens is this: cwnd initially rises to ~2100 then first drops
>>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>>In this state, all in-flight packets are dropped (simulation of what
>>>happens on router switchover) so that cwnd is reset to 1 again and
>>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>>If a packet loss comes shortly after that, cwnd is still very low and
>>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>>which takes really long to recover from.
>>
>>         I'm also looking into a problem that exhibits very similar TCP
>> characteristics, even down to cwnd and ssthresh values similar to what
>> you cite.  In this case, the situation has to do with high RTT (around
>> 80 ms) connections competing with low RTT (1 ms) connections.  This case
>> is already using cubic.
>>
>>         Essentially, a high RTT connection to the server transfers data
>> in at a reasonable and steady rate until something causes some packets
>> to be lost (in this case, another transfer from a low RTT host to the
>> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
>> (in stages, first to ~1500, then ~600, then to ~300, ).  The ssthresh
>> starts at around 1100, then drops to ~260, which is the lowest cwnd
>> value.
>>
>>         The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.  It
>> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no further losses occur (if the competing traffic has
>> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
>> full value (~2200) requires on the order of 20 seconds.  The connection
>> exits recovery state fairly quickly, and most of the 20 seconds is spent
>> in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	The kernel I'm using at the moment is an Ubuntu 3.2.0-54 distro
kernel, but I've reproduced the problem on Ubuntu distro 3.13 and a
mainline 3.15-rc (although in the 3.13/3.15 cases using netem to inject
delay).  I've been gathering data mostly with systemtap, but I should be
able to get some packet captures as well, although not until tomorrow.

	The test I'm using right now is pretty simple. I have three
machines: two, A and B, are separated by about 80 ms RTT; the third
machine, C, is about 1 ms from B, so:

	A --- 80ms --- B --- 1ms ---- C

	On A, I run an "iperf -i 1" to B, and let it max its cwnd, and
then on C, run an "iperf -t 1" to B ("-t 1" means only run for one
second then exit).  The iperf results on A look like this:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  4.62 MBytes  38.8 Mbits/sec
[  3]  3.0- 4.0 sec  13.5 MBytes   113 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.9 MBytes   309 Mbits/sec
[  3]  7.0- 8.0 sec  34.8 MBytes   292 Mbits/sec
[  3]  8.0- 9.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  9.0-10.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 10.0-11.0 sec  36.9 MBytes   309 Mbits/sec
[  3] 11.0-12.0 sec  11.1 MBytes  93.3 Mbits/sec
[  3] 12.0-13.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 13.0-14.0 sec  2.88 MBytes  24.1 Mbits/sec
[  3] 14.0-15.0 sec  5.50 MBytes  46.1 Mbits/sec
[  3] 15.0-16.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 16.0-17.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 17.0-18.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 18.0-19.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 19.0-20.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 20.0-21.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 23.0-24.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 24.0-25.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 25.0-26.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 26.0-27.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 27.0-28.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 28.0-29.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 29.0-30.0 sec  12.9 MBytes   108 Mbits/sec
[  3] 30.0-31.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 31.0-32.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 32.0-33.0 sec  21.8 MBytes   182 Mbits/sec
[  3] 33.0-34.0 sec  21.4 MBytes   179 Mbits/sec
[  3] 34.0-35.0 sec  27.8 MBytes   233 Mbits/sec
[  3] 35.0-36.0 sec  32.6 MBytes   274 Mbits/sec
[  3] 36.0-37.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 37.0-38.0 sec  36.6 MBytes   307 Mbits/sec

	The second iperf starts at about time 10.  The middle value is 1
second's throughput, so the flat throughput between roughly time 13 and
time 23 is the slow cwnd recovery.

	I've got one graph prepared already that I can post:

http://people.canonical.com/~jvosburgh/t-vs-cwnd-ssthresh.jpg

	This shows cwnd (green) and ssthresh (red) vs. time.  In this
case, the second (low RTT) iperf started at the first big drop at around
time 22 and ran for 30 seconds (its data is not on the graph).  The big
cwnd drop is actually a series of drops, but that's hard to see at this
scale.  This graph shows two of the slow recoveries, and was done on a
3.13 kernel using netem to add delay.  The cwnd and ssthresh data was
captured by systemtap when exiting tcp_ack.

>One thing you could try would be to disable CUBIC's "fast convergence" feature:
>
>  echo 0 > /sys/module/tcp_cubic/parameters/fast_convergence
>
>We have noticed that this feature can hurt performance when there is a
>high rate of random packet drops (packet drops that are not correlated
>with the sending rate of the flow in question).

	I ran the above iperf results with this disabled; it does not
appear to have any effect.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet June 18, 2014, 7:17 a.m. UTC | #11
On Wed, 2014-06-18 at 00:42 +0200, Michal Kubecek wrote:

> BtW, this is exactly the problem our customer is facing: they have
> relatively fast line (15 Mb/s) but with big buffers so that the
> roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
> under full load.

It looks like some cwnd limiting would be nice ;)

BDP is about 45 packets for a 35ms rtt and 15 Mb/s link.
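
For reference, that BDP figure can be sanity-checked with a quick
calculation (the 1460-byte MSS is an assumption, not stated in the thread):

```python
# Back-of-the-envelope BDP check for the figures quoted above:
# 15 Mb/s link, 35 ms unloaded RTT, assumed 1460-byte MSS.
link_bps = 15e6        # link rate in bits/s
rtt_s = 0.035          # unloaded round-trip time in seconds
mss = 1460             # assumed maximum segment size in bytes

bdp_bytes = link_bps / 8 * rtt_s          # bytes in flight to fill the pipe
bdp_packets = bdp_bytes / mss
print(round(bdp_packets))                 # ~45 packets, matching the estimate
```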

> 
> What happens is this: cwnd initially rises to ~2100 then first drops
> are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
> lets cwnd reach ssthresh but after that, a slow linear growth follows.
> In this state, all in-flight packets are dropped (simulation of what
> happens on router switchover) so that cwnd is reset to 1 again and
> ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
> If a packet loss comes shortly after that, cwnd is still very low and
> ssthresh is reduced to half of that cwnd (i.e. much lower than to half
> of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
> which takes really long to recover from.
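
The sequence of reductions described above can be sketched in a few
lines. This is only a toy model of what tcp_enter_loss() does without
the RFC 5681 once-per-window guard; the function name and the cwnd
values reached between losses are illustrative, taken from the numbers
quoted in this thread:

```python
# Toy model: each RTO-triggered loss sets ssthresh = max(cwnd/2, 2)
# and resets cwnd to 1, with no once-per-window guard.
def rto_loss(cwnd):
    """Return (new_cwnd, new_ssthresh) after an RTO-triggered loss."""
    return 1, max(cwnd // 2, 2)

cwnd = 2100                      # steady state before the first drops
cwnd, ssthresh = rto_loss(cwnd)  # first loss: ssthresh ~1050
print(ssthresh)

cwnd = 1100                      # cwnd regrew slightly past ssthresh
cwnd, ssthresh = rto_loss(cwnd)  # switchover drops all in-flight: ~550
print(ssthresh)

cwnd = 5                         # another loss shortly after, cwnd tiny
cwnd, ssthresh = rto_loss(cwnd)  # ssthresh collapses to 2
print(ssthresh)
```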


Yuchung Cheng June 18, 2014, 4:56 p.m. UTC | #12
On Tue, Jun 17, 2014 at 5:38 PM, Jay Vosburgh
<jay.vosburgh@canonical.com> wrote:
> Michal Kubecek <mkubecek@suse.cz> wrote:
>
>>On Tue, Jun 17, 2014 at 02:35:23PM -0700, Yuchung Cheng wrote:
>>> On Tue, Jun 17, 2014 at 5:20 AM, Michal Kubecek <mkubecek@suse.cz> wrote:
>>> > On Mon, Jun 16, 2014 at 08:44:04PM -0400, Neal Cardwell wrote:
>>> >> On Mon, Jun 16, 2014 at 8:25 PM, Yuchung Cheng <ycheng@google.com> wrote:
>>> >> > However Linux is inconsistent on the loss of a retransmission. It
>>> >> > reduces ssthresh (and cwnd) if this happens on a timeout, but not in
>>> >> > fast recovery (tcp_mark_lost_retrans). We should fix that and that
>>> >> > should help dealing with traffic policers.
>>> >>
>>> >> Yes, great point!
>>> >
>>> > Does it mean the patch itself would be acceptable if the reasoning in
>>> > its commit message was changed? Or would you prefer a different way to
>>> > unify the two situations?
>>>
>>> It's the latter but it seems to belong to a different patch (and it'll
>>> not solve the problem you are seeing).
>>
>>OK, thank you. I guess we will have to persuade them to move to cubic
>>which handles their problems much better.
>>
>>> The idea behind the RFC is that TCP should reduce cwnd and ssthresh
>>> across round trips of send, but not within an RTT. Suppose cwnd was
>>> 10 on first timeout, so cwnd becomes 1 and ssthresh is 5. Then after 3
>>> round trips, we time out again. By the design of Reno this should
>>> reset cwnd from 8 to 1, and ssthresh from 5 to 2.5.
>>
>>Shouldn't that be from 5 to 4? We reduce ssthresh to half of current
>>cwnd, not current ssthresh.
Oops yes it should be 8 to 4.
>>
>>BtW, this is exactly the problem our customer is facing: they have
>>relatively fast line (15 Mb/s) but with big buffers so that the
>>roundtrip times can rise from unloaded 35 ms up to something like 1.5 s
>>under full load.
>>
>>What happens is this: cwnd initially rises to ~2100 then first drops
>>are encountered, cwnd is set to 1 and ssthresh to ~1050. The slow start
>>lets cwnd reach ssthresh but after that, a slow linear growth follows.
>>In this state, all in-flight packets are dropped (simulation of what
>>happens on router switchover) so that cwnd is reset to 1 again and
>>ssthresh to something like 530-550 (cwnd was a bit higher than ssthresh).
>>If a packet loss comes shortly after that, cwnd is still very low and
>>ssthresh is reduced to half of that cwnd (i.e. much lower than to half
>>of ssthresh). If unlucky, one can even end up with ssthresh reduced to 2
>>which takes really long to recover from.
>
>         I'm also looking into a problem that exhibits very similar TCP
> characteristics, even down to cwnd and ssthresh values similar to what
> you cite.  In this case, the situation has to do with high RTT (around
> 80 ms) connections competing with low RTT (1 ms) connections.  This case
> is already using cubic.
>
>         Essentially, a high RTT connection to the server transfers data
> in at a reasonable and steady rate until something causes some packets
> to be lost (in this case, another transfer from a low RTT host to the
> same server).  Some packets are lost, and cwnd drops from ~2200 to ~300
> (in stages, first to ~1500, then ~600, then to ~300).  The ssthresh
> starts at around 1100, then drops to ~260, which is the lowest cwnd
> value.
>
>         The recovery from the low cwnd situation is very slow; cwnd
> climbs a bit and then remains essentially flat for around 5 seconds.  It
> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no further losses occur (if the competing traffic has
> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
> full value (~2200) requires on the order of 20 seconds.  The connection
> exits recovery state fairly quickly, and most of the 20 seconds is spent
> in open state.

ssthresh is problematic. Both cases show the same shortcoming of
Reno/Cubic relying on losses and ssthresh.
If losses are not caused by queue overflows but by link flaps, bursts,
etc., the ssthresh is not indicative of BDP. It's effectively a random
value (>> BDP under bufferbloat, << BDP in these cases). TCP throughput
goes south if we hit two losses within a few RTTs, and it's a point of
no return :( Hopefully someone can come up with a more intelligent
control.

Several posts in tcpm also discuss the low ssthresh issues.

http://www.ietf.org/mail-archive/web/tcpm/current/msg08145.html
http://www.ietf.org/mail-archive/web/tcpm/current/msg08778.html
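
To put a rough number on how costly a collapsed ssthresh is, here is a
simple Reno-style model (congestion avoidance grows cwnd by about one
segment per RTT; CUBIC regrows faster, but the ~20 second recovery
observed above shows the same qualitative effect). The function name
and figures are illustrative, taken from the numbers in this thread:

```python
# Toy model: seconds of linear (one segment per RTT) congestion-
# avoidance growth to climb from ssthresh back to the old cwnd.
def reno_regrow_seconds(ssthresh, target_cwnd, rtt_s):
    return (target_cwnd - ssthresh) * rtt_s

# 80 ms path, cwnd collapsed to ~260, previous operating point ~2200:
print(reno_regrow_seconds(260, 2200, 0.080))   # ~155 s under pure Reno growth
```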

>
>         -J
>
> ---
>         -Jay Vosburgh, jay.vosburgh@canonical.com
Jay Vosburgh June 19, 2014, 1:52 a.m. UTC | #13
Neal Cardwell <ncardwell@google.com> wrote:

>On Tue, Jun 17, 2014 at 8:38 PM, Jay Vosburgh
><jay.vosburgh@canonical.com> wrote:
[...]
>>         The recovery from the low cwnd situation is very slow; cwnd
>> climbs a bit and then remains essentially flat for around 5 seconds.  It
>> then begins to climb until a few packets are lost again, and the cycle
>> repeats.  If no further losses occur (if the competing traffic has
>> ceased, for example), recovery from a low cwnd (300 - 750 ish) to the
>> full value (~2200) requires on the order of 20 seconds.  The connection
>> exits recovery state fairly quickly, and most of the 20 seconds is spent
>> in open state.
>
>Interesting. I'm a little surprised it takes CUBIC so long to re-grow
>cwnd to the full value. Would you be able to provide your kernel
>version number and post a tcpdump binary packet trace somewhere
>public?

	Ok, I ran a test today that demonstrates the slow cwnd growth.
The sending machine is 3.15-rc8 (net-next as of about two weeks ago),
the receiver is Ubuntu 3.13.0-24.

	The test involves adding 40 ms of delay in and out from machine
A with netem, then running iperf from A to B.  Once the iperf reaches a
steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
1000 coming from A, then remove the rule after 10 seconds.  The behavior
resulting from this closely matches what I see on the real systems.

	I captured packets from both ends, running it twice, the second
time with GSO, GRO and TSO disabled.

	The iperf output is as follows:

[  3]  5.0- 6.0 sec  33.6 MBytes   282 Mbits/sec
[  3]  6.0- 7.0 sec  33.8 MBytes   283 Mbits/sec
[  3]  7.0- 8.0 sec  27.0 MBytes   226 Mbits/sec
[  3]  8.0- 9.0 sec  23.2 MBytes   195 Mbits/sec
[  3]  9.0-10.0 sec  17.4 MBytes   146 Mbits/sec
[  3] 10.0-11.0 sec  13.9 MBytes   116 Mbits/sec
[  3] 11.0-12.0 sec  10.4 MBytes  87.0 Mbits/sec
[  3] 12.0-13.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 13.0-14.0 sec  5.75 MBytes  48.2 Mbits/sec
[  3] 14.0-15.0 sec  4.75 MBytes  39.8 Mbits/sec
[  3] 15.0-16.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 16.0-17.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 17.0-18.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 18.0-19.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 21.0-22.0 sec  3.25 MBytes  27.3 Mbits/sec
[  3] 22.0-23.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 23.0-24.0 sec  3.12 MBytes  26.2 Mbits/sec
[  3] 24.0-25.0 sec  4.12 MBytes  34.6 Mbits/sec
[  3] 25.0-26.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 26.0-27.0 sec  4.50 MBytes  37.7 Mbits/sec
[  3] 27.0-28.0 sec  5.88 MBytes  49.3 Mbits/sec
[  3] 28.0-29.0 sec  7.12 MBytes  59.8 Mbits/sec
[  3] 29.0-30.0 sec  7.38 MBytes  61.9 Mbits/sec
[  3] 30.0-31.0 sec  10.0 MBytes  83.9 Mbits/sec
[  3] 31.0-32.0 sec  11.6 MBytes  97.5 Mbits/sec
[  3] 32.0-33.0 sec  15.5 MBytes   130 Mbits/sec
[  3] 33.0-34.0 sec  17.2 MBytes   145 Mbits/sec
[  3] 34.0-35.0 sec  20.0 MBytes   168 Mbits/sec
[  3] 35.0-36.0 sec  25.5 MBytes   214 Mbits/sec
[  3] 36.0-37.0 sec  29.8 MBytes   250 Mbits/sec
[  3] 37.0-38.0 sec  32.2 MBytes   271 Mbits/sec
[  3] 38.0-39.0 sec  32.4 MBytes   272 Mbits/sec

	For the above run, the iptables drop rule went in at about time
7, and was removed 10 seconds later, so recovery began at about time 17.
The second run is similar, although the exact start times differ.

	The full data (two runs, each with packet capture from both ends
and the iperf output) can be found at:

http://people.canonical.com/~jvosburgh/tcp-slow-recovery.tar.bz2

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
Eric Dumazet June 19, 2014, 2:28 a.m. UTC | #14
On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
> 	The test involves adding 40 ms of delay in and out from machine
> A with netem, then running iperf from A to B.  Once the iperf reaches a
> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
> 1000 coming from A, then remove the rule after 10 seconds.  The behavior
> resulting from this closely matches what I see on the real systems.

Please share the netem setup. Are you sure you do not drop frames on
netem? (Considering you disable GSO/TSO, netem has to be able to store a
lot of packets.)




Jay Vosburgh June 19, 2014, 6:05 a.m. UTC | #15
Eric Dumazet <eric.dumazet@gmail.com> wrote:

>On Wed, 2014-06-18 at 18:52 -0700, Jay Vosburgh wrote:
>> 	The test involves adding 40 ms of delay in and out from machine
>> A with netem, then running iperf from A to B.  Once the iperf reaches a
>> steady cwnd, on B, I add an iptables rule to drop 1 packet out of every
>> 1000 coming from A, then remove the rule after 10 seconds.  The behavior
>> resulting from this closely matches what I see on the real systems.
>
>Please share the netem setup. Are you sure you do not drop frames on
>netem ? (considering you disable GSO/TSO netem has to be able to store a
>lot of packets)

	Reasonably sure; the tc -s qdisc doesn't show any drops by netem
for these test runs.  The data I linked to earlier is one run with
TSO/GSO/GRO enabled, and one with TSO/GSO/GRO disabled, and the results
are similar in terms of cwnd recovery time.  Looking at the packet
capture for the TSO/GSO/GRO disabled case, the time span from the first
duplicate ACK to the last is about 9 seconds, which is close to the 10
seconds the iptables drop rule is in effect; the same time analysis
applies to retransmissions from the sender.

	I've also tested with using netem to induce drops, but in this
particular case I used iptables.

	The script I use to set up netem is:

#!/bin/bash
# Add ${DELAY} of netem delay in both directions on ${IF}.  Egress
# delay is applied directly on ${IF}; ingress delay is applied by
# redirecting inbound traffic through an ifb device.

IF=eth1
TC=/usr/local/bin/tc
DELAY=40ms

# Set up the intermediate functional block device for ingress shaping.
rmmod ifb
modprobe ifb
ip link set dev ifb0 up

# (Re)create the ingress qdisc on ${IF}.
if ${TC} qdisc show dev ${IF} | grep -q ingress; then
	${TC} qdisc del dev ${IF} ingress
fi
${TC} qdisc add dev ${IF} ingress

${TC} qdisc del dev ${IF} root

# Redirect all inbound IP traffic to ifb0 and delay it there; delay
# outbound traffic directly on ${IF}.
${TC} filter add dev ${IF} parent ffff: protocol ip \
	u32 match u32 0 0 flowid 1:1 action mirred egress redirect dev ifb0
${TC} qdisc add dev ifb0 root netem delay ${DELAY} limit 5000
${TC} qdisc add dev ${IF} root netem delay ${DELAY} limit 5000

	In the past I've watched the tc backlog, and the highest I've
seen is about 900 packets, so the limit 5000 is probably overkill.

	I'm also not absolutely sure the delay 40ms each direction is
materially different from 80ms in one direction, but the real
configuration I'm recreating is 40ms each way.

	The tc qdisc stats after the two runs I did earlier to capture
data look like this:

qdisc pfifo_fast 0: dev eth0 root refcnt 2 bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 1905005 bytes 22277 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc netem 8002: dev eth1 root refcnt 2 limit 5000 delay 40.0ms
 Sent 773383636 bytes 510901 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc ingress ffff: dev eth1 parent ffff:fff1 ---------------- 
 Sent 14852588 bytes 281846 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 
qdisc netem 8001: dev ifb0 root refcnt 2 limit 5000 delay 40.0ms
 Sent 18763686 bytes 281291 pkt (dropped 0, overlimits 0 requeues 0) 
 backlog 0b 0p requeues 0 

	Lastly, I ran the same test on the actual systems, and the iperf
results are similar to those from my test lab:

[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   896 KBytes  7.34 Mbits/sec
[  3]  1.0- 2.0 sec  1.50 MBytes  12.6 Mbits/sec
[  3]  2.0- 3.0 sec  5.12 MBytes  43.0 Mbits/sec
[  3]  3.0- 4.0 sec  13.9 MBytes   116 Mbits/sec
[  3]  4.0- 5.0 sec  27.8 MBytes   233 Mbits/sec
[  3]  5.0- 6.0 sec  39.0 MBytes   327 Mbits/sec
[  3]  6.0- 7.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  7.0- 8.0 sec  36.8 MBytes   308 Mbits/sec
[  3]  8.0- 9.0 sec  37.0 MBytes   310 Mbits/sec
[  3]  9.0-10.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 10.0-11.0 sec  33.9 MBytes   284 Mbits/sec
[  3] 11.0-12.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 12.0-13.0 sec  0.00 Bytes  0.00 bits/sec
[  3] 13.0-14.0 sec  4.38 MBytes  36.7 Mbits/sec
[  3] 14.0-15.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 15.0-16.0 sec  7.00 MBytes  58.7 Mbits/sec
[  3] 16.0-17.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 17.0-18.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 18.0-19.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 19.0-20.0 sec  4.25 MBytes  35.7 Mbits/sec
[  3] 20.0-21.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 21.0-22.0 sec  6.38 MBytes  53.5 Mbits/sec
[  3] 22.0-23.0 sec  6.50 MBytes  54.5 Mbits/sec
[  3] 23.0-24.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 24.0-25.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 25.0-26.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 26.0-27.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 27.0-28.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 28.0-29.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 29.0-30.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 30.0-31.0 sec  8.50 MBytes  71.3 Mbits/sec
[  3] 31.0-32.0 sec  8.62 MBytes  72.4 Mbits/sec
[  3] 32.0-33.0 sec  8.38 MBytes  70.3 Mbits/sec
[  3] 33.0-34.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 34.0-35.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 35.0-36.0 sec  10.6 MBytes  89.1 Mbits/sec
[  3] 36.0-37.0 sec  12.8 MBytes   107 Mbits/sec
[  3] 37.0-38.0 sec  15.0 MBytes   126 Mbits/sec
[  3] 38.0-39.0 sec  17.0 MBytes   143 Mbits/sec
[  3] 39.0-40.0 sec  19.4 MBytes   163 Mbits/sec
[  3] 40.0-41.0 sec  23.5 MBytes   197 Mbits/sec
[  3] 41.0-42.0 sec  25.6 MBytes   215 Mbits/sec
[  3] 42.0-43.0 sec  30.2 MBytes   254 Mbits/sec
[  3] 43.0-44.0 sec  34.2 MBytes   287 Mbits/sec
[  3] 44.0-45.0 sec  36.6 MBytes   307 Mbits/sec
[  3] 45.0-46.0 sec  38.8 MBytes   325 Mbits/sec
[  3] 46.0-47.0 sec  36.5 MBytes   306 Mbits/sec

	This result is consistently repeatable.  These systems have more
hops between them than my lab systems, but the ping RTT is 80ms.

	-J

---
	-Jay Vosburgh, jay.vosburgh@canonical.com
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 40661fc..768ba88 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1917,8 +1917,7 @@  void tcp_enter_loss(struct sock *sk, int how)
 
 	/* Reduce ssthresh if it has not yet been made inside this window. */
 	if (icsk->icsk_ca_state <= TCP_CA_Disorder ||
-	    !after(tp->high_seq, tp->snd_una) ||
-	    (icsk->icsk_ca_state == TCP_CA_Loss && !icsk->icsk_retransmits)) {
+	    !after(tp->high_seq, tp->snd_una)) {
 		new_recovery = true;
 		tp->prior_ssthresh = tcp_current_ssthresh(sk);
 		tp->snd_ssthresh = icsk->icsk_ca_ops->ssthresh(sk);