[net-next,v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction

Message ID 20180627023403.3395818-1-brakmo@fb.com
State Superseded
Delegated to: David Miller
Headers show
Series
  • [net-next,v2] tcp: force cwnd at least 2 in tcp_cwnd_reduction
Related show

Commit Message

Lawrence Brakmo June 27, 2018, 2:34 a.m.
When using dctcp and doing RPCs, if the last packet of a request is
ECN marked as having seen congestion (CE), the sender can decrease its
cwnd to 1. As a result, it will only send one packet when a new request
is sent. In some instances this results in high tail latencies.

For example, in one setup there are 3 hosts sending to a 4th one, with
each sender having 3 flows (1 stream, 1 1MB back-to-back RPCs and 1 10KB
back-to-back RPCs). The following table shows the 99% and 99.9%
latencies for both Cubic and dctcp

           Cubic 99%  Cubic 99.9%   dctcp 99%    dctcp 99.9%
 1MB RPCs    3.5ms       6.0ms         43ms          208ms
10KB RPCs    1.0ms       2.5ms         53ms          212ms

On 4.11, pcap traces indicate that in some instances the 1st packet of
the RPC is received but no ACK is sent before the packet is
retransmitted. On 4.11 netstat shows TCP timeouts, with some of them
spurious.

On 4.16, we don't see retransmits in netstat but the high tail latencies
are still there. Forcing cwnd to be at least 2 in tcp_cwnd_reduction
fixes the problem with the high tail latencies. The latencies now look
like this:

             dctcp 99%       dctcp 99.9%
 1MB RPCs      3.8ms             4.4ms
10KB RPCs      168us             211us

Another group working with dctcp saw the same issue with production
traffic and it was solved with this patch.

The only issue is if it is safe to always use 2 or if it is better to
use min(2, snd_ssthresh) (which could still trigger the problem).

v2: fixed compiler warning in max function arguments

Signed-off-by: Lawrence Brakmo <brakmo@fb.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Dumazet June 27, 2018, 3:04 p.m. | #1
On 06/26/2018 07:34 PM, Lawrence Brakmo wrote:
> When using dctcp and doing RPCs, if the last packet of a request is
> ECN marked as having seen congestion (CE), the sender can decrease its
> cwnd to 1. As a result, it will only send one packet when a new request
> is sent. In some instances this results in high tail latencies.
> 

>  	}
>  	/* Force a fast retransmit upon entering fast recovery */
>  	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
> -	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
> +	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

Canonical way is to use min_t(), please respin (no need to explain this trivia in changelog)

Thanks.
Eric Dumazet June 27, 2018, 3:06 p.m. | #2
On 06/27/2018 08:04 AM, Eric Dumazet wrote:
> 
> 
> On 06/26/2018 07:34 PM, Lawrence Brakmo wrote:
>> When using dctcp and doing RPCs, if the last packet of a request is
>> ECN marked as having seen congestion (CE), the sender can decrease its
>> cwnd to 1. As a result, it will only send one packet when a new request
>> is sent. In some instances this results in high tail latencies.
>>
> 
>>  	}
>>  	/* Force a fast retransmit upon entering fast recovery */
>>  	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
>> -	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
>> +	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
> 
> Canonical way is to use min_t(), please respin (no need to explain this trivia in changelog)


Well, max_t() here, obviously :)
Neal Cardwell June 27, 2018, 3:24 p.m. | #3
On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo <brakmo@fb.com> wrote:
> The only issue is if it is safe to always use 2 or if it is better to
> use min(2, snd_ssthresh) (which could still trigger the problem).

Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
that should be the same as just 2, since:

(a) RFCs mandate ssthresh should not be below 2, e.g.
https://tools.ietf.org/html/rfc5681 page 7:

 ssthresh = max (FlightSize / 2, 2*SMSS)            (4)

(b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
that constraint, and always have an ssthresh of at least 2.

And if some CC misbehaves and uses a lower ssthresh, then taking
min(2, snd_ssthresh) will trigger problems, as you note.

> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);

AFAICT this does seem like it will make the sender behavior more
aggressive in cases with high loss and/or a very low per-flow
fair-share.

Old:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packet 1
o using ACKs, slow-start upward

New:

o send N packets
o receive SACKs for last 3 packets
o fast retransmit packets 1 and 2
o using ACKs, slow-start upward

In the extreme case, if the available fair share is less than 2
packets, whereas inflight would have oscillated between 1 packet and 2
packets with the existing code, it now seems like with this commit the
inflight will now hover at 2. It seems like this would have
significantly higher losses than we had with the existing code.

This may or may not be OK in practice, but IMHO it is worth mentioning
and discussing.

neal
Yuchung Cheng June 27, 2018, 4:57 p.m. | #4
On Wed, Jun 27, 2018 at 8:24 AM, Neal Cardwell <ncardwell@google.com> wrote:
> On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo <brakmo@fb.com> wrote:
>> The only issue is if it is safe to always use 2 or if it is better to
>> use min(2, snd_ssthresh) (which could still trigger the problem).
>
> Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
> that should be the same as just 2, since:
>
> (a) RFCs mandate ssthresh should not be below 2, e.g.
> https://tools.ietf.org/html/rfc5681 page 7:
>
>  ssthresh = max (FlightSize / 2, 2*SMSS)            (4)
>
> (b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
> that constraint, and always have an ssthresh of at least 2.
>
> And if some CC misbehaves and uses a lower ssthresh, then taking
> min(2, snd_ssthresh) will trigger problems, as you note.
>
>> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
>
> AFAICT this does seem like it will make the sender behavior more
> aggressive in cases with high loss and/or a very low per-flow
> fair-share.
>
> Old:
>
> o send N packets
> o receive SACKs for last 3 packets
> o fast retransmit packet 1
> o using ACKs, slow-start upward
>
> New:
>
> o send N packets
> o receive SACKs for last 3 packets
> o fast retransmit packets 1 and 2
> o using ACKs, slow-start upward
>
> In the extreme case, if the available fair share is less than 2
> packets, whereas inflight would have oscillated between 1 packet and 2
> packets with the existing code, it now seems like with this commit the
> inflight will now hover at 2. It seems like this would have
> significantly higher losses than we had with the existing code.
I share similar concern. Note that this function is used by most
existing congestion control modules beside DCTCP so I am more cautious
of changing this to address DCTCP issue.

One problem that DCTCP paper notices when cwnd = 1 is still too big
when the bottleneck
is shared by many flows (e.g. incast). It specifically suggest
changing the lower-bound of 2 in the spec to 1. (Section 8.2).
https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-judd.pdf

I am curious about the differences you observe in 4.11 and 4.16. I
wasn't aware of any (significant) change in tcp_cwnd_reduction / PRR
algorithm between 4.11 and 4.16. Also the receiver should not delay
ACKs if it has out-of-order packet or receiving CE data packets. This
means the delayed ACK is by tail losses and the last data received
carries no CE mark: seems a less common scenario?

If delayed-ACK is the problem, we probably should fix the receiver to
delay ACK more intelligently, not the sender. weiwan@google.com is
working on it.



>
> This may or may not be OK in practice, but IMHO it is worth mentioning
> and discussing.
>
> neal
Yuchung Cheng June 27, 2018, 10:27 p.m. | #5
On Wed, Jun 27, 2018 at 1:00 PM, Lawrence Brakmo <brakmo@fb.com> wrote:
>
>
> From: <netdev-owner@vger.kernel.org> on behalf of Yuchung Cheng
> <ycheng@google.com>
> Date: Wednesday, June 27, 2018 at 9:59 AM
> To: Neal Cardwell <ncardwell@google.com>
> Cc: Lawrence Brakmo <brakmo@fb.com>, Matt Mathis <mattmathis@google.com>,
> Netdev <netdev@vger.kernel.org>, Kernel Team <Kernel-team@fb.com>, Blake
> Matheny <bmatheny@fb.com>, Alexei Starovoitov <ast@fb.com>, Eric Dumazet
> <eric.dumazet@gmail.com>, Wei Wang <weiwan@google.com>
> Subject: Re: [PATCH net-next v2] tcp: force cwnd at least 2 in
> tcp_cwnd_reduction
>
>
>
> On Wed, Jun 27, 2018 at 8:24 AM, Neal Cardwell <ncardwell@google.com> wrote:
>
> On Tue, Jun 26, 2018 at 10:34 PM Lawrence Brakmo <brakmo@fb.com> wrote:
>
> The only issue is if it is safe to always use 2 or if it is better to
>
> use min(2, snd_ssthresh) (which could still trigger the problem).
>
>
>
> Always using 2 SGTM. I don't think we need min(2, snd_ssthresh), as
>
> that should be the same as just 2, since:
>
>
>
> (a) RFCs mandate ssthresh should not be below 2, e.g.
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__tools.ietf.org_html_rfc5681&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=PZQC-6NGqK6QEVrf1WhlAD3mQt7tK8aqrfQGp93dJy4&s=WH9fXpDBHC31NCixiiMvhOb2UXCuJIxUiY4IXyJTgpc&e=
> page 7:
>
>
>
>   ssthresh = max (FlightSize / 2, 2*SMSS)            (4)
>
>
>
> (b) The main loss-based CCs used in Linux (CUBIC, Reno, DCTCP) respect
>
> that constraint, and always have an ssthresh of at least 2.
>
>
>
> And if some CC misbehaves and uses a lower ssthresh, then taking
>
> min(2, snd_ssthresh) will trigger problems, as you note.
>
>
>
> +       tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
>
>
>
> AFAICT this does seem like it will make the sender behavior more
>
> aggressive in cases with high loss and/or a very low per-flow
>
> fair-share.
>
>
>
> Old:
>
>
>
> o send N packets
>
> o receive SACKs for last 3 packets
>
> o fast retransmit packet 1
>
> o using ACKs, slow-start upward
>
>
>
> New:
>
>
>
> o send N packets
>
> o receive SACKs for last 3 packets
>
> o fast retransmit packets 1 and 2
>
> o using ACKs, slow-start upward
>
>
>
> In the extreme case, if the available fair share is less than 2
>
> packets, whereas inflight would have oscillated between 1 packet and 2
>
> packets with the existing code, it now seems like with this commit the
>
> inflight will now hover at 2. It seems like this would have
>
> significantly higher losses than we had with the existing code.
>
> I share similar concern. Note that this function is used by most
>
> existing congestion control modules beside DCTCP so I am more cautious
>
> of changing this to address DCTCP issue.
>
>
>
> Theoretically it could happen with any CC, but I have only seen it
> consistently with DCTCP.
>
> One option, as I mentioned to Neal, is to use 2 only when cwnd reduction is
> caused by
>
> ECN signal.
>
>
>
> One problem that DCTCP paper notices when cwnd = 1 is still too big
>
> when the bottleneck
>
> is shared by many flows (e.g. incast). It specifically suggest
>
> changing the lower-bound of 2 in the spec to 1. (Section 8.2).
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__www.usenix.org_system_files_conference_nsdi15_nsdi15-2Dpaper-2Djudd.pdf&d=DwIBaQ&c=5VD0RTtNlTh3ycd41b3MUw&r=pq_Mqvzfy-C8ltkgyx1u_g&m=PZQC-6NGqK6QEVrf1WhlAD3mQt7tK8aqrfQGp93dJy4&s=fO8p5khks-sGO_lwEvKicWtFz_q9LI-ecwZcS1nyZ7w&e=
>
>
>
> One could even envision using even lower values than 1, where we send one
> packet
>
> every 2 RTTs. However, I do not think this is a problem that needs to be
> fixed.
>
>
>
> I am curious about the differences you observe in 4.11 and 4.16. I
>
> wasn't aware of any (significant) change in tcp_cwnd_reduction / PRR
>
> algorithm between 4.11 and 4.16. Also the receiver should not delay
>
> ACKs if it has out-of-order packet or receiving CE data packets. This
>
> means the delayed ACK is by tail losses and the last data received
>
> carries no CE mark: seems a less common scenario?
>
>
>
> I have the feeling the differences between 4.11 and 4.16 are due to
> accounting
>
> changes and not change in CC behaviors.
If the difference is not due to change in PRR, I don't think we
should change PRR to address that until we have some evidence.

Could you share some traces?

>
>
>
> The packet and ACK are not lost. I captured pcaps on both hosts and could
> see
>
> cases when the data packet arrived, but the ACK was not sent before the
> packet was
>
> retransmitted. I saw this behavior multiple times. In many cases the ACK did
> not show
>
> up in the pcap within the 40ms one would usually expect with a delayed ACK.
>
> If delayed-ACK is the problem, we probably should fix the receiver to
>
> delay ACK more intelligently, not the sender. weiwan@google.com is
>
> working on it.
>
>
>
> Not sure if it is delayed-ACK or something caused the ACK to not be sent
> since it did
>
> not show up within 40ms as one would usually expect for a delayed ACK.
>
>
>
>
>
>
>
> This may or may not be OK in practice, but IMHO it is worth mentioning
>
> and discussing.
>
>
>
> neal
>
>
Neal Cardwell June 28, 2018, 8:47 p.m. | #6
On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
>
> I just looked at 4.18 traces and the behavior is as follows:
>
>    Host A sends the last packets of the request
>
>    Host B receives them, and the last packet is marked with congestion (CE)
>
>    Host B sends ACKs for packets not marked with congestion
>
>    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
>
>    Host A receives ACKs with no ECE flag
>
>    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
>
>    Host A sends 1st data packet of the next request with TCP flag CWR
>
>    Host B receives the packet (as seen in tcpdump at B), no CE flag
>
>    Host B sends a dup ACK that also has the TCP ECE flag
>
>    Host A RTO timer fires!
>
>    Host A to send the next packet
>
>    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
>
>    Host A send more packets…

Thanks, Larry! This is very interesting. I don't know the cause, but
this reminds me of an issue  Steve Ibanez raised on the netdev list
last December, where he was seeing cases with DCTCP where a CWR packet
would be received and buffered by Host B but not ACKed by Host B. This
was the thread "Re: Linux ECN Handling", starting around December 5. I
have cc-ed Steve.

I wonder if this may somehow be related to the DCTCP logic to rewind
tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
DCTCP notices that the incoming CE bits have been changed while the
receiver thinks it is holding on to a delayed ACK (in
dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
"synthetic" call to tcp_send_ack() somehow has side effects in the
delayed ACK state machine that can cause the connection to forget that
it still needs to fire a delayed ACK, even though it just sent an ACK
just now.

neal
Lawrence Brakmo June 28, 2018, 8:58 p.m. | #7
On 6/28/18, 1:48 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > I just looked at 4.18 traces and the behavior is as follows:
    >
    >    Host A sends the last packets of the request
    >
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >
    >    Host B sends ACKs for packets not marked with congestion
    >
    >    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
    >
    >    Host A receives ACKs with no ECE flag
    >
    >    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
    >
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >
    >    Host A RTO timer fires!
    >
    >    Host A to send the next packet
    >
    >    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
    >
    >    Host A send more packets…
    
    Thanks, Larry! This is very interesting. I don't know the cause, but
    this reminds me of an issue  Steve Ibanez raised on the netdev list
    last December, where he was seeing cases with DCTCP where a CWR packet
    would be received and buffered by Host B but not ACKed by Host B. This
    was the thread "Re: Linux ECN Handling", starting around December 5. I
    have cc-ed Steve.
    
    I wonder if this may somehow be related to the DCTCP logic to rewind
    tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
    DCTCP notices that the incoming CE bits have been changed while the
    receiver thinks it is holding on to a delayed ACK (in
    dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
    "synthetic" call to tcp_send_ack() somehow has side effects in the
    delayed ACK state machine that can cause the connection to forget that
    it still needs to fire a delayed ACK, even though it just sent an ACK
    just now.
    
    neal
    
Just to reiterate that there are 2 distinct issues:

1) When the last data packet of a request is received with the CE flag, it causes a dup ack with ECE flag to be sent after the first data packet of the next request is received.

2) When one of the last (but not the last) data packets of a request is received with the CE flag, it causes a delayed ack to be sent after the first data packet of the next request is received. This delayed ack does acknowledge the data in 1st packet of the reply.
Lawrence Brakmo June 29, 2018, 4:32 a.m. | #8
On 6/28/18, 1:48 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > I just looked at 4.18 traces and the behavior is as follows:
    >
    >    Host A sends the last packets of the request
    >
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >
    >    Host B sends ACKs for packets not marked with congestion
    >
    >    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
    >
    >    Host A receives ACKs with no ECE flag
    >
    >    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
    >
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >
    >    Host A RTO timer fires!
    >
    >    Host A to send the next packet
    >
    >    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
    >
    >    Host A send more packets…
    
    Thanks, Larry! This is very interesting. I don't know the cause, but
    this reminds me of an issue  Steve Ibanez raised on the netdev list
    last December, where he was seeing cases with DCTCP where a CWR packet
    would be received and buffered by Host B but not ACKed by Host B. This
    was the thread "Re: Linux ECN Handling", starting around December 5. I
    have cc-ed Steve.
    
    I wonder if this may somehow be related to the DCTCP logic to rewind
    tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
    DCTCP notices that the incoming CE bits have been changed while the
    receiver thinks it is holding on to a delayed ACK (in
    dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
    "synthetic" call to tcp_send_ack() somehow has side effects in the
    delayed ACK state machine that can cause the connection to forget that
    it still needs to fire a delayed ACK, even though it just sent an ACK
    just now.
    
    neal

Here is a packetdrill script that reproduces the problem:

// Repro bug that does not ack data, not even with delayed-ack

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
0.100 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 5>
0.110 < [ect0] . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

0.200 < [ect0] . 1:1001(1000) ack 1 win 257
0.200 > [ect0] . 1:1(0) ack 1001
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 1:2(1) ack 1001

0.200 < [ect0] . 1001:2001(1000) ack 2 win 257
0.200 write(4, ..., 1) = 1
0.200 > [ect0] P. 2:3(1) ack 2001

0.200 < [ect0] . 2001:3001(1000) ack 3 win 257
0.200 < [ect0] . 3001:4001(1000) ack 3 win 257
0.200 > [ect0] . 3:3(0) ack 4001

0.210 < [ce] P. 4001:4501(500) ack 3 win 257

+0.001 read(4, ..., 4500) = 4500
+0 write(4, ..., 1) = 1
+0 > [ect0] PE. 3:4(1) ack 4501

+0.010 < [ect0] W. 4501:5501(1000) ack 4 win 257
+0  > [ect0] E. 4:4(0) ack 4501   // dup ack sent

+0.311 < [ect0] . 5501:6501(1000) ack 4 win 257  // Long RTO
+0 > [ect0] . 4:4(0) ack 6501     // now acks everything

+0.500 < F. 9501:9501(0) ack 4 win 257
Lawrence Brakmo June 29, 2018, 6:55 p.m. | #9
On 6/28/18, 1:48 PM, "netdev-owner@vger.kernel.org on behalf of Neal Cardwell" <netdev-owner@vger.kernel.org on behalf of ncardwell@google.com> wrote:

    On Thu, Jun 28, 2018 at 4:20 PM Lawrence Brakmo <brakmo@fb.com> wrote:
    >
    > I just looked at 4.18 traces and the behavior is as follows:
    >
    >    Host A sends the last packets of the request
    >
    >    Host B receives them, and the last packet is marked with congestion (CE)
    >
    >    Host B sends ACKs for packets not marked with congestion
    >
    >    Host B sends data packet with reply and ACK for packet marked with congestion (TCP flag ECE)
    >
    >    Host A receives ACKs with no ECE flag
    >
    >    Host A receives data packet with ACK for the last packet of request and has TCP ECE bit set
    >
    >    Host A sends 1st data packet of the next request with TCP flag CWR
    >
    >    Host B receives the packet (as seen in tcpdump at B), no CE flag
    >
    >    Host B sends a dup ACK that also has the TCP ECE flag
    >
    >    Host A RTO timer fires!
    >
    >    Host A to send the next packet
    >
    >    Host A receives an ACK for everything it has sent (i.e. Host B did receive 1st packet of request)
    >
    >    Host A send more packets…
    
    Thanks, Larry! This is very interesting. I don't know the cause, but
    this reminds me of an issue  Steve Ibanez raised on the netdev list
    last December, where he was seeing cases with DCTCP where a CWR packet
    would be received and buffered by Host B but not ACKed by Host B. This
    was the thread "Re: Linux ECN Handling", starting around December 5. I
    have cc-ed Steve.
    
    I wonder if this may somehow be related to the DCTCP logic to rewind
    tp->rcv_nxt and call tcp_send_ack(), and then restore tp->rcv_nxt, if
    DCTCP notices that the incoming CE bits have been changed while the
    receiver thinks it is holding on to a delayed ACK (in
    dctcp_ce_state_0_to_1() and dctcp_ce_state_1_to_0()). I wonder if the
    "synthetic" call to tcp_send_ack() somehow has side effects in the
    delayed ACK state machine that can cause the connection to forget that
    it still needs to fire a delayed ACK, even though it just sent an ACK
    just now.
    
    neal
    
You were right Neal, one of the bugs is related to this and is caused by a lack of state update to DCTCP. DCTCP is first informed that the ACK was delayed but it is not updated when the ACK is sent with a data packet.

I am working on a patch to fix this which hopefully should be out today. Thanks everyone for the great feedback!

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 76ca88f63b70..282bd85322b0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2477,7 +2477,7 @@  void tcp_cwnd_reduction(struct sock *sk, int newly_acked_sacked, int flag)
 	}
 	/* Force a fast retransmit upon entering fast recovery */
 	sndcnt = max(sndcnt, (tp->prr_out ? 0 : 1));
-	tp->snd_cwnd = tcp_packets_in_flight(tp) + sndcnt;
+	tp->snd_cwnd = max((int)tcp_packets_in_flight(tp) + sndcnt, 2);
 }
 
 static inline void tcp_end_cwnd_reduction(struct sock *sk)