diff mbox

[net-next] tcp: Fix CWV being too strict on thin streams

Message ID 8021096.gY5K1ti0Dn@garfield
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

=?UTF-8?q?Bendik=20R=C3=B8nning=20Opstad?= Sept. 22, 2015, 2:29 p.m. UTC
> Thanks for this report!

Thank you for answering!

> When you say "CWND is reduced due to loss", are you talking about RTO
> or Fast Recovery? Do you have any traces you can share that illustrate
> this issue?

The problem is not related to loss recovery, but only occurs in congestion
avoidance state. This is because tcp_is_cwnd_limited(), called from the
congestion controls ".cong_avoid" hook, will only use tp->is_cwnd_limited when
in congestion avoidance.

I've made two traces to compare the effect with and without the patch:
http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/

To ensure that the results are comparable, the traces are produced with an extra
patch applied (shown below) such that both connections are in congestion
avoidance from the beginning.

> Have you verified that this patch fixes the issue you identified? I
> think the fix may be in the wrong place for that scenario, or at least
> incomplete.

Yes, we have tested this and verified that the patch solves the issue in the
test scenarios we've run.

As always (in networking) it can be difficult to reproduce the issue in a
consistent manner, so I will describe the setup we have used.

In our testbed we have a sender host, a bridge and a receiver. The bridge is
configured with 75 ms delay in each direction, giving an RTT of 150 ms.

On bridge:
tc qdisc add dev eth0 handle 1:0 root netem delay 75ms
tc qdisc add dev eth1 handle 1:0 root netem delay 75ms

We have used the tool streamzero (https://bitbucket.org/bendikro/streamzero) to
produce the thin stream traffic. An example where the sender opens a TCP
connection and performs write calls to the socket every 10 ms with 100 bytes per
call for 60 seconds:

On receiver host (10.0.0.22):
./streamzero_srv -p 5010 -A

On sender host:
./streamzero_client -s 10.0.0.22 -p 5010 -v3 -A4 -d 60 -I i:10,S:100

streamzero_client will by default disable Nagle (TCP_NODELAY=1).

Adding loss for a few seconds in netem causes the CWND and ssthresh to
eventually drop to a low value, e.g. 2.

On bridge:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay
75ms loss 4%

Removing loss after a few seconds:
tc qdisc del dev eth1 root && tc qdisc add dev eth1 handle 1:0 root netem delay
75ms

From this point and on the CWND will not grow any further, as shown by the print
statement when applying this patch:



The traces linked to above are run with this patch applied.

> In the scenario you describe, you say "all the buffered data in the
> output queue was sent within the current CWND", which means that there
> is no unsent data, so in tcp_write_xmit() the call to tcp_send_head()
> would return NULL, so we would not enter the while loop, and could not
> set is_cwnd_limited to true.

I'll describe two example scenarios in detail. In both scenarios we are in
congestion avoidance after experiencing loss. Nagle is disabled.

Scenario 1:
Current CWND is 2 and there is currently one packet in flight. The output queue
contains one SKB with unsent data (payload <= 1 * MSS).

In tcp_write_xmit(), tcp_send_head() will return the SKB with unsent data, and
tcp_cwnd_test() will set cwnd_quota to 1, which means that is_cwnd_limited will
remain false. After sending the packet, sent_pkts will be set to 1, and the
while will break because tcp_send_head() returns NULL. tcp_cwnd_validate() will
be called with is_cwnd_limited=false.

When a new data chunk is put on the output queue, and tcp_write_xmit is called,
tcp_send_head() will return the SKB with unsent data. tcp_cwnd_test() will now
set cwnd_quota to 0 because packets in flight == CWND. is_cwnd_limited will
correctly be set to true, however, since no packets were sent,
tcp_cwnd_validate() will never be called.
(Calling tcp_cwnd_validate() if sent_pkts == 0 will not solve the problem in
this scenario as it will not enter the first if test in tcp_cwnd_validate()
where tp->is_cwnd_limited is updated.)


Scenario 2:
CWND is 2 and there is currently one packet in flight. The output queue contains
two SKBs with unsent data.

In tcp_write_xmit(), tcp_send_head() will return the first SKB with unsent data,
and tcp_cwnd_test() will set cwnd_quota to 1, which means that is_cwnd_limited
will remain false. After sending the packet, sent_pkts will be set to 1, and the
while loop continues on to the next (and last) SKB with unsent data returned by
tcp_send_head(). tcp_cwnd_test() will set cwnd_quota to 0 and is_cwnd_limited
will be set to true (and the while breaks). As one packet was sent, and one
packet was held back, tcp_cwnd_validate() will be called with
is_cwnd_limited=true, and tp->is_cwnd_limited will be set to true (as it
should).


In scenario 1, "all the buffered data in the output queue was sent within the
current CWND", and the result is that the CWND does not grow even when the CWND
is fully utilized after the packet is sent. In scenario 2, some of the unsent
data buffered in the output queue was not sent because there was no room in the
CWND, and this results in the CWND growing because tp->is_cwnd_limited was set
to true.

Analysing the traces with the analyseTCP
(https://bitbucket.org/mpg_code/tstools_analysetcp) gives the following results:

# ./analyseTCP -s 10.0.0.15 -f test_net-next.pcap
Processing sent packets...
Using filter: 'tcp && src host 10.0.0.15'
Finished processing sent packets...
Processing acknowledgements...
Finished processing acknowledgements...
STATS FOR CONN: 10.0.0.15:15000 -> 10.0.0.22:5010
  Duration: 61 seconds (0.016944 hours)
  Total packets sent                            :        782
  Total data packets sent                       :        779
  Total pure acks (no payload)                  :          2
  SYN/FIN/RST packets sent                      :      1/1/0
  Number of retransmissions                     :          0
  Number of packets with bundled segments       :          0
  Number of packets with redundant data         :          0
  Number of received acks                       :        779
  Total bytes sent (payload)                    :     555600
  Number of unique bytes                        :     555600
  Number of retransmitted bytes                 :          0
  Redundant bytes (bytes already sent)          :          0 (0.00 %)
  Estimated loss rate based on retransmissions  :       0.00 %
---------------------------------------------------------------
Payload size stats:
  Minimum    payload                            :     100 bytes
  Average    payload                            :     713 bytes
  Maximum    payload                            :    1448 bytes
---------------------------------------------------------------
Latency stats:
  Minimum    latency                            :  150244 usec
  Average    latency                            :  150675 usec
  Maximum    latency                            :  302493 usec
---------------------------------------------------------------
ITT stats:
  Minimum        itt                            :      12 usec
  Average        itt                            :   78383 usec
  Maximum        itt                            :  302704 usec
===============================================================


General info for entire dump:
  10.0.0.15: -> :
  Filename: test_net-next.pcap
  Sent Packet Count     : 782
  Received Packet Count : 0
  ACK Count             : 779
  Sent Bytes Count      : 555600
  Max payload size      : 1448
#


# ./analyseTCP -s 10.0.0.15 -f test_cwv_fix.pcap
Processing sent packets...
Using filter: 'tcp && src host 10.0.0.15'
Finished processing sent packets...
Processing acknowledgements...
Finished processing acknowledgements...
STATS FOR CONN: 10.0.0.15:15000 -> 10.0.0.22:5010
  Duration: 60 seconds (0.016667 hours)
  Total packets sent                            :       4760
  Total data packets sent                       :       4756
  Total pure acks (no payload)                  :          2
  SYN/FIN/RST packets sent                      :      1/1/0
  Number of retransmissions                     :          0
  Number of packets with bundled segments       :          0
  Number of packets with redundant data         :          0
  Number of received acks                       :       4758
  Total bytes sent (payload)                    :     483300
  Number of unique bytes                        :     483300
  Number of retransmitted bytes                 :          0
  Redundant bytes (bytes already sent)          :          0 (0.00 %)
  Estimated loss rate based on retransmissions  :       0.00 %
---------------------------------------------------------------
Payload size stats:
  Minimum    payload                            :     100 bytes
  Average    payload                            :     102 bytes
  Maximum    payload                            :    1400 bytes
---------------------------------------------------------------
Latency stats:
  Minimum    latency                            :  150221 usec
  Average    latency                            :  150340 usec
  Maximum    latency                            :  302022 usec
---------------------------------------------------------------
ITT stats:
  Minimum        itt                            :    1469 usec
  Average        itt                            :   12761 usec
  Maximum        itt                            :  302287 usec
===============================================================


General info for entire dump:
  10.0.0.15: -> :
  Filename: test_cwv_fix.pcap
  Sent Packet Count     : 4760
  Received Packet Count : 0
  ACK Count             : 4758
  Sent Bytes Count      : 483300
  Max payload size      : 1400
#


The key results revealing the effect of the proposed patch:

test_net-next.pcap:
Packets with payload sent: 779
Average payload per packet: 713 bytes
Average time between each packet transmission: 78.3 ms

test_cwv_fix.pcap:
Packets with payload sent: 4756
Average payload per packet: 102 bytes
Average time between each packet transmission: 12.7 ms


> In the scenario you describe, I think we'd need to check for being
> cwnd-limited in tcp_xmit_retransmit_queue().
> 
> How about something like the following patch:
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index d0ad355..8e6a772 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -2145,6 +2145,7 @@ repair:
>                 tcp_cwnd_validate(sk, is_cwnd_limited);
>                 return false;
>         }
> +       tp->is_cwnd_limited |= is_cwnd_limited;
>         return !tp->packets_out && tcp_send_head(sk);
>  }
> 
> @@ -2762,8 +2763,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
>                  * packet to be MSS sized and all the
>                  * packet counting works out.
>                  */
> -               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd)
> +               if (tcp_packets_in_flight(tp) >= tp->snd_cwnd) {
> +                       tp->is_cwnd_limited = true;
>                         return;
> +               }
> 
>                 if (fwd_rexmitting) {
>  begin_fwd:
> 
> neal

This will not fix the problem, as that would only set tp->is_cwnd_limited=true
once after retransmitting. The main issue is that the CWND does not grow for
these types of streams while no loss occurs, even when the CWND is fully
utilized (but only after a loss _has_ occurred to put the connection in
congestion avoidance).

When setting the initial CWND and ssthresh to 10 instead of 2 (with the patch
above to start in congestion avoidance), and running the same test as in the
traces (where the application performs ~15 write calls per RTT), the CWND never
grows past 10. When applying the proposed patch to update tp->is_cwnd_limited
when no packets are transmitted, the CWND grows to 13-14, which produces minimal
sojourn time for the segments written by the application.

Let me know if you need more information or additional test traces.

Bendik

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet Sept. 22, 2015, 2:46 p.m. UTC | #1
On Tue, 2015-09-22 at 16:29 +0200, Bendik Rønning Opstad wrote:
> > Thanks for this report!
> 
> Thank you for answering!
> 
> > When you say "CWND is reduced due to loss", are you talking about RTO
> > or Fast Recovery? Do you have any traces you can share that illustrate
> > this issue?
> 
> The problem is not related to loss recovery, but only occurs in congestion
> avoidance state. This is because tcp_is_cwnd_limited(), called from the
> congestion controls ".cong_avoid" hook, will only use tp->is_cwnd_limited when
> in congestion avoidance.
> 
> I've made two traces to compare the effect with and without the patch:
> http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/
> 
> To ensure that the results are comparable, the traces are produced with an extra
> patch applied (shown below) such that both connections are in congestion
> avoidance from the beginning.
> 
> > Have you verified that this patch fixes the issue you identified? I
> > think the fix may be in the wrong place for that scenario, or at least
> > incomplete.
> 
> Yes, we have tested this and verified that the patch solves the issue in the
> test scenarios we've run.
> 
> As always (in networking) it can be difficult to reproduce the issue in a
> consistent manner, so I will describe the setup we have used.
> 


Ahem.

packetdrill can make this in one script, as you can exactly control the
packets that the 'remote' peer would answer.

No need for complex setup. You should try it, and as a bonus we could
easily reproduce the problem and check the fix.

Let see if we can cook a packetdrill scenario.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 22, 2015, 4:09 p.m. UTC | #2
On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote:

> 
> Ahem.
> 
> packetdrill can make this in one script, as you can exactly control the
> packets that the 'remote' peer would answer.
> 
> No need for complex setup. You should try it, and as a bonus we could
> easily reproduce the problem and check the fix.
> 
> Let see if we can cook a packetdrill scenario.


Following packetdrill skeleton will provide you a connexion with cwnd=1
sshtresh=2  (without patching kernel)

// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

   +0 %{ print "tcpi_snd_cwnd=%d" % tcpi_snd_cwnd }%
   +0 %{ print "tcpi_snd_ssthresh=%d" % tcpi_snd_ssthresh }%


Then we can add a bunch of  +.01 write(4, ..., 100) = 100




--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 22, 2015, 5:11 p.m. UTC | #3
On Tue, 2015-09-22 at 09:09 -0700, Eric Dumazet wrote:
> On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote:
> 
> > 
> > Ahem.
> > 
> > packetdrill can make this in one script, as you can exactly control the
> > packets that the 'remote' peer would answer.
> > 
> > No need for complex setup. You should try it, and as a bonus we could
> > easily reproduce the problem and check the fix.
> > 
> > Let see if we can cook a packetdrill scenario.
> 
> 
> Following packetdrill skeleton will provide you a connexion with cwnd=1
> sshtresh=2  (without patching kernel)
> 

A more complete packetdrill scenario would be

# cat cwv.pkt
// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
+.200 < . 1:1(0) ack 201 win 65535

4.00 write(4, ..., 100) = 100
    +0  > P. 201:301(100) ack 1
4.01 write(4, ..., 100) = 100
	+0 > P. 301:401(100) ack 1
4.02 write(4, ..., 100) = 100
4.03 write(4, ..., 100) = 100
4.04 write(4, ..., 100) = 100
4.05 write(4, ..., 100) = 100
4.06 write(4, ..., 100) = 100
4.07 write(4, ..., 100) = 100
4.08 write(4, ..., 100) = 100
4.09 write(4, ..., 100) = 100
4.10 write(4, ..., 100) = 100
4.11 write(4, ..., 100) = 100
4.12 write(4, ..., 100) = 100
4.13 write(4, ..., 100) = 100
4.14 write(4, ..., 100) = 100
4.15 write(4, ..., 100) = 100
4.16 write(4, ..., 100) = 100
4.17 write(4, ..., 100) = 100
4.18 write(4, ..., 100) = 100
4.19 write(4, ..., 100) = 100

4.20 write(4, ..., 100) = 100
4.20 < . 1:1(0) ack 301 win 65535
4.20 > . 401:1401(1000) ack 1

4.21 write(4, ..., 100) = 100
4.21 < . 1:1(0) ack 401 win 65535
4.21 > P. 1401:2401(1000) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.22 write(4, ..., 100) = 100
4.23 write(4, ..., 100) = 100
4.24 write(4, ..., 100) = 100
4.25 write(4, ..., 100) = 100
4.26 write(4, ..., 100) = 100
4.27 write(4, ..., 100) = 100
4.28 write(4, ..., 100) = 100
4.29 write(4, ..., 100) = 100
4.31 write(4, ..., 100) = 100
4.32 write(4, ..., 100) = 100
4.33 write(4, ..., 100) = 100
4.34 write(4, ..., 100) = 100
4.35 write(4, ..., 100) = 100
4.36 write(4, ..., 100) = 100
4.37 write(4, ..., 100) = 100
4.38 write(4, ..., 100) = 100
4.39 write(4, ..., 100) = 100

4.40 write(4, ..., 100) = 100
4.40 < . 1:1(0) ack 1401 win 65535
4.40 > . 2401:3401(1000) ack 1
4.41 write(4, ..., 100) = 100
4.41 < . 1:1(0) ack 2401 win 65535
4.41 > P. 3401:4301(900) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.42 write(4, ..., 100) = 100
4.43 write(4, ..., 100) = 100
4.44 write(4, ..., 100) = 100
4.45 write(4, ..., 100) = 100
4.46 write(4, ..., 100) = 100
4.47 write(4, ..., 100) = 100
4.48 write(4, ..., 100) = 100
4.49 write(4, ..., 100) = 100
4.50 write(4, ..., 100) = 100
4.51 write(4, ..., 100) = 100
4.52 write(4, ..., 100) = 100
4.53 write(4, ..., 100) = 100
4.54 write(4, ..., 100) = 100
4.55 write(4, ..., 100) = 100
4.56 write(4, ..., 100) = 100
4.57 write(4, ..., 100) = 100
4.58 write(4, ..., 100) = 100
4.59 write(4, ..., 100) = 100
4.60 write(4, ..., 100) = 100
4.60 < . 1:1(0) ack 3401 win 65535
4.60 > . 4301:5301(1000) ack 1
4.61 write(4, ..., 100) = 100
4.61 < . 1:1(0) ack 4301 win 65535
4.61 > P. 5301:6301(1000) ack 1
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.62 write(4, ..., 100) = 100
4.63 write(4, ..., 100) = 100
4.64 write(4, ..., 100) = 100
4.65 write(4, ..., 100) = 100
4.66 write(4, ..., 100) = 100
4.67 write(4, ..., 100) = 100
4.68 write(4, ..., 100) = 100
4.69 write(4, ..., 100) = 100
4.70 write(4, ..., 100) = 100
4.71 write(4, ..., 100) = 100
4.72 write(4, ..., 100) = 100
4.73 write(4, ..., 100) = 100
4.74 write(4, ..., 100) = 100
4.75 write(4, ..., 100) = 100
4.76 write(4, ..., 100) = 100
4.77 write(4, ..., 100) = 100
4.78 write(4, ..., 100) = 100
4.79 write(4, ..., 100) = 100
4.80 write(4, ..., 100) = 100
4.80 < . 1:1(0) ack 5301 win 65535
4.80 > . 6301:7301(1000) ack 1
4.81 write(4, ..., 100) = 100
4.81 < . 1:1(0) ack 6301 win 65535
   +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
4.81 > P. 7301:8301(1000) ack 1
4.82 write(4, ..., 100) = 100
4.83 write(4, ..., 100) = 100
4.84 write(4, ..., 100) = 100
4.85 write(4, ..., 100) = 100
4.86 write(4, ..., 100) = 100
4.87 write(4, ..., 100) = 100
4.88 write(4, ..., 100) = 100
4.89 write(4, ..., 100) = 100
4.90 write(4, ..., 100) = 100
4.91 write(4, ..., 100) = 100
4.92 write(4, ..., 100) = 100
4.93 write(4, ..., 100) = 100
4.94 write(4, ..., 100) = 100
4.95 write(4, ..., 100) = 100
4.96 write(4, ..., 100) = 100
4.97 write(4, ..., 100) = 100
4.98 write(4, ..., 100) = 100
4.99 write(4, ..., 100) = 100
5.00 write(4, ..., 100) = 100
5.00 < . 1:1(0) ack 7301 win 65535
5.00 > . 8301:9301(1000) ack 1
5.01 write(4, ..., 100) = 100
5.01 < . 1:1(0) ack 8301 win 65535
5.01 > P. 9301:10301(1000) ack 1
  +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
5.02 write(4, ..., 100) = 100
5.03 write(4, ..., 100) = 100
5.04 write(4, ..., 100) = 100
5.05 write(4, ..., 100) = 100
5.06 write(4, ..., 100) = 100
5.07 write(4, ..., 100) = 100
5.08 write(4, ..., 100) = 100
5.09 write(4, ..., 100) = 100
5.10 write(4, ..., 100) = 100
5.11 write(4, ..., 100) = 100
5.12 write(4, ..., 100) = 100
5.13 write(4, ..., 100) = 100
5.14 write(4, ..., 100) = 100
5.15 write(4, ..., 100) = 100
5.16 write(4, ..., 100) = 100
5.17 write(4, ..., 100) = 100
5.18 write(4, ..., 100) = 100
5.19 write(4, ..., 100) = 100
5.20 write(4, ..., 100) = 100
5.20 < . 1:1(0) ack 9301 win 65535
5.20 > . 10301:11301(1000) ack 1
5.21 write(4, ..., 100) = 100
5.21 < . 1:1(0) ack 10301 win 65535
5.21 > P. 11301:12301(1000) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }%
5.22 write(4, ..., 100) = 100
5.23 write(4, ..., 100) = 100
5.24 write(4, ..., 100) = 100
5.25 write(4, ..., 100) = 100
5.26 write(4, ..., 100) = 100
5.27 write(4, ..., 100) = 100
5.28 write(4, ..., 100) = 100
5.29 write(4, ..., 100) = 100
5.30 write(4, ..., 100) = 100
5.31 write(4, ..., 100) = 100
5.32 write(4, ..., 100) = 100
5.33 write(4, ..., 100) = 100
5.34 write(4, ..., 100) = 100
5.35 write(4, ..., 100) = 100
5.36 write(4, ..., 100) = 100
5.37 write(4, ..., 100) = 100
5.38 write(4, ..., 100) = 100
5.39 write(4, ..., 100) = 100
5.40 write(4, ..., 100) = 100
5.40 < . 1:1(0) ack 11301 win 65535
5.40 > . 12301:13301(1000) ack 1
5.41 write(4, ..., 100) = 100
5.41 < . 1:1(0) ack 12301 win 65535
5.41 > P. 13301:14301(1000) ack 1
+0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 


Output is :
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7
tcpi_snd_cwnd=1 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2
tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Neal Cardwell Sept. 22, 2015, 6:02 p.m. UTC | #4
> I'll describe two example scenarios in detail. In both scenarios we are in
> congestion avoidance after experiencing loss. Nagle is disabled.

Thanks for the detailed follow-up! And thanks, Eric, for the packetdrill
script!

This looks like an issue of how to deal with the case when we run out of
packets to send and cwnd at exactly the same moment, and whether we consider
that case cwnd-limited.

Previously we did not consider ourselves cwnd-limited in that case, but I think
your "scenario 1", and Eric's packetdrill case, show convincingly that we ought
to consider ourselves cwnd-limited in that case.

I would suggest we try fixing this by making sure that in scenario 1, at the
moment when we fill the cwnd (packets_in_flight == cwnd == 2) we mark ourselves
as cwnd-limited.

More generally, my sense is that we should tweak the is_cwnd_limited code to
shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
flight fill the cwnd".

This is partly justified by the kind of case presented here. Plus, in RFC 2861,
section 3.1, which this code is helping to implement, it says:

   After TCP sends a packet, it also checks to see if that packet filled
   the congestion window.  If so, the sender is network-limited, and
   sets the variable T_prev to the current TCP clock time,

So at the point in Scenario 1 when we send the second packet, and
  packets_in_flight == cwnd == 2
then IMHO we should set
  tp->is_cwnd_limited = true
and thus
  tp->snd_cwnd_stamp = tcp_time_stamp;

So maybe we want to slightly tweak the way tcp_write_xmit() and
tcp_tso_should_defer() treat this boundary condition? Gotta run to a
meeting....

neal

ps: If the tp->is_cwnd_limited and tcp_cwnd_validate() code were only used to
decide whether to increase cwnd, then I think your fix would probably be
sufficient. But it is also used for the code to implement RFC 2861 ("TCP
Congestion Window Validation"). So I don't think just changing
tp->is_cwnd_limited is sufficient.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index d0ad355..947eb57 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2135,6 +2135,8 @@  repair:
 			break;
 	}
 
+	printk("CWND: %d, SSTHRESH: %d, PKTS_OUT: %d\n", tp->snd_cwnd, tp->snd_ssthresh, tp->packets_out);
+
 	if (likely(sent_pkts)) {
 		if (tcp_in_cwnd_reduction(sk))
 			tp->prr_out += sent_pkts;


The following patch avoids the need to induce loss by setting the connection in
congestion avoidance from the beginning:

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a62e9c7..206b32f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5394,6 +5394,7 @@  void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 
 	tcp_init_metrics(sk);
 +
	tp->snd_cwnd = tp->snd_ssthresh = 2;
 	tcp_init_congestion_control(sk);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data