Message ID | 8021096.gY5K1ti0Dn@garfield |
---|---|
State | RFC, archived |
Delegated to: | David Miller |
Headers | show |
On Tue, 2015-09-22 at 16:29 +0200, Bendik Rønning Opstad wrote: > > Thanks for this report! > > Thank you for answering! > > > When you say "CWND is reduced due to loss", are you talking about RTO > > or Fast Recovery? Do you have any traces you can share that illustrate > > this issue? > > The problem is not related to loss recovery, but only occurs in congestion > avoidance state. This is because tcp_is_cwnd_limited(), called from the > congestion controls ".cong_avoid" hook, will only use tp->is_cwnd_limited when > in congestion avoidance. > > I've made two traces to compare the effect with and without the patch: > http://heim.ifi.uio.no/bendiko/traces/CWV_PATCH/ > > To ensure that the results are comparable, the traces are produced with an extra > patch applied (shown below) such that both connections are in congestion > avoidance from the beginning. > > > Have you verified that this patch fixes the issue you identified? I > > think the fix may be in the wrong place for that scenario, or at least > > incomplete. > > Yes, we have tested this and verified that the patch solves the issue in the > test scenarios we've run. > > As always (in networking) it can be difficult to reproduce the issue in a > consistent manner, so I will describe the setup we have used. > Ahem. packetdrill can make this in one script, as you can exactly control the packets that the 'remote' peer would answer. No need for complex setup. You should try it, and as a bonus we could easily reproduce the problem and check the fix. Let see if we can cook a packetdrill scenario. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote: > > Ahem. > > packetdrill can make this in one script, as you can exactly control the > packets that the 'remote' peer would answer. > > No need for complex setup. You should try it, and as a bonus we could > easily reproduce the problem and check the fix. > > Let see if we can cook a packetdrill scenario. Following packetdrill skeleton will provide you a connexion with cwnd=1 sshtresh=2 (without patching kernel) // Establish a connection and send 1 MSS. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.200 < . 1:1(0) ack 1 win 65535 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 100) = 100 +0 > P. 1:101(100) ack 1 +.000 %{ print tcpi_rto }% // TLP +.500~+.505 > P. 1:101(100) ack 1 // RTO +.600~+.605 > P. 1:101(100) ack 1 +.200 < . 1:1(0) ack 101 win 65535 // cwnd should be 2, ssthresh should be 7 2.000 write(4, ..., 100) = 100 +0 > P. 101:201(100) ack 1 // TLP +.500~+.505 > P. 101:201(100) ack 1 // RTO +1.200~+1.210 > P. 101:201(100) ack 1 +0 %{ print "tcpi_snd_cwnd=%d" % tcpi_snd_cwnd }% +0 %{ print "tcpi_snd_ssthresh=%d" % tcpi_snd_ssthresh }% Then we can add a bunch of +.01 write(4, ..., 100) = 100 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, 2015-09-22 at 09:09 -0700, Eric Dumazet wrote: > On Tue, 2015-09-22 at 07:46 -0700, Eric Dumazet wrote: > > > > > Ahem. > > > > packetdrill can make this in one script, as you can exactly control the > > packets that the 'remote' peer would answer. > > > > No need for complex setup. You should try it, and as a bonus we could > > easily reproduce the problem and check the fix. > > > > Let see if we can cook a packetdrill scenario. > > > Following packetdrill skeleton will provide you a connexion with cwnd=1 > sshtresh=2 (without patching kernel) > A more complete packetdrill scenario would be # cat cwv.pkt // Establish a connection and send 1 MSS. 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 +0 bind(3, ..., ...) = 0 +0 listen(3, 1) = 0 +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop> +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK> +.200 < . 1:1(0) ack 1 win 65535 +0 accept(3, ..., ...) = 4 +0 write(4, ..., 100) = 100 +0 > P. 1:101(100) ack 1 +.000 %{ print tcpi_rto }% // TLP +.500~+.505 > P. 1:101(100) ack 1 // RTO +.600~+.605 > P. 1:101(100) ack 1 +.200 < . 1:1(0) ack 101 win 65535 // cwnd should be 2, ssthresh should be 7 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 2.000 write(4, ..., 100) = 100 +0 > P. 101:201(100) ack 1 // TLP +.500~+.505 > P. 101:201(100) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% // RTO +1.200~+1.210 > P. 101:201(100) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% +.200 < . 1:1(0) ack 201 win 65535 4.00 write(4, ..., 100) = 100 +0 > P. 201:301(100) ack 1 4.01 write(4, ..., 100) = 100 +0 > P. 301:401(100) ack 1 4.02 write(4, ..., 100) = 100 4.03 write(4, ..., 100) = 100 4.04 write(4, ..., 100) = 100 4.05 write(4, ..., 100) = 100 4.06 write(4, ..., 100) = 100 4.07 write(4, ..., 100) = 100 4.08 write(4, ..., 100) = 100 4.09 write(4, ..., 100) = 100 4.10 write(4, ..., 100) = 100 4.11 write(4, ..., 100) = 100 4.12 write(4, ..., 100) = 100 4.13 write(4, ..., 100) = 100 4.14 write(4, ..., 100) = 100 4.15 write(4, ..., 100) = 100 4.16 write(4, ..., 100) = 100 4.17 write(4, ..., 100) = 100 4.18 write(4, ..., 100) = 100 4.19 write(4, ..., 100) = 100 4.20 write(4, ..., 100) = 100 4.20 < . 1:1(0) ack 301 win 65535 4.20 > . 401:1401(1000) ack 1 4.21 write(4, ..., 100) = 100 4.21 < . 1:1(0) ack 401 win 65535 4.21 > P. 1401:2401(1000) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 4.22 write(4, ..., 100) = 100 4.23 write(4, ..., 100) = 100 4.24 write(4, ..., 100) = 100 4.25 write(4, ..., 100) = 100 4.26 write(4, ..., 100) = 100 4.27 write(4, ..., 100) = 100 4.28 write(4, ..., 100) = 100 4.29 write(4, ..., 100) = 100 4.31 write(4, ..., 100) = 100 4.32 write(4, ..., 100) = 100 4.33 write(4, ..., 100) = 100 4.34 write(4, ..., 100) = 100 4.35 write(4, ..., 100) = 100 4.36 write(4, ..., 100) = 100 4.37 write(4, ..., 100) = 100 4.38 write(4, ..., 100) = 100 4.39 write(4, ..., 100) = 100 4.40 write(4, ..., 100) = 100 4.40 < . 1:1(0) ack 1401 win 65535 4.40 > . 2401:3401(1000) ack 1 4.41 write(4, ..., 100) = 100 4.41 < . 1:1(0) ack 2401 win 65535 4.41 > P. 3401:4301(900) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 4.42 write(4, ..., 100) = 100 4.43 write(4, ..., 100) = 100 4.44 write(4, ..., 100) = 100 4.45 write(4, ..., 100) = 100 4.46 write(4, ..., 100) = 100 4.47 write(4, ..., 100) = 100 4.48 write(4, ..., 100) = 100 4.49 write(4, ..., 100) = 100 4.50 write(4, ..., 100) = 100 4.51 write(4, ..., 100) = 100 4.52 write(4, ..., 100) = 100 4.53 write(4, ..., 100) = 100 4.54 write(4, ..., 100) = 100 4.55 write(4, ..., 100) = 100 4.56 write(4, ..., 100) = 100 4.57 write(4, ..., 100) = 100 4.58 write(4, ..., 100) = 100 4.59 write(4, ..., 100) = 100 4.60 write(4, ..., 100) = 100 4.60 < . 1:1(0) ack 3401 win 65535 4.60 > . 4301:5301(1000) ack 1 4.61 write(4, ..., 100) = 100 4.61 < . 1:1(0) ack 4301 win 65535 4.61 > P. 5301:6301(1000) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 4.62 write(4, ..., 100) = 100 4.63 write(4, ..., 100) = 100 4.64 write(4, ..., 100) = 100 4.65 write(4, ..., 100) = 100 4.66 write(4, ..., 100) = 100 4.67 write(4, ..., 100) = 100 4.68 write(4, ..., 100) = 100 4.69 write(4, ..., 100) = 100 4.70 write(4, ..., 100) = 100 4.71 write(4, ..., 100) = 100 4.72 write(4, ..., 100) = 100 4.73 write(4, ..., 100) = 100 4.74 write(4, ..., 100) = 100 4.75 write(4, ..., 100) = 100 4.76 write(4, ..., 100) = 100 4.77 write(4, ..., 100) = 100 4.78 write(4, ..., 100) = 100 4.79 write(4, ..., 100) = 100 4.80 write(4, ..., 100) = 100 4.80 < . 1:1(0) ack 5301 win 65535 4.80 > . 6301:7301(1000) ack 1 4.81 write(4, ..., 100) = 100 4.81 < . 1:1(0) ack 6301 win 65535 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 4.81 > P. 7301:8301(1000) ack 1 4.82 write(4, ..., 100) = 100 4.83 write(4, ..., 100) = 100 4.84 write(4, ..., 100) = 100 4.85 write(4, ..., 100) = 100 4.86 write(4, ..., 100) = 100 4.87 write(4, ..., 100) = 100 4.88 write(4, ..., 100) = 100 4.89 write(4, ..., 100) = 100 4.90 write(4, ..., 100) = 100 4.91 write(4, ..., 100) = 100 4.92 write(4, ..., 100) = 100 4.93 write(4, ..., 100) = 100 4.94 write(4, ..., 100) = 100 4.95 write(4, ..., 100) = 100 4.96 write(4, ..., 100) = 100 4.97 write(4, ..., 100) = 100 4.98 write(4, ..., 100) = 100 4.99 write(4, ..., 100) = 100 5.00 write(4, ..., 100) = 100 5.00 < . 1:1(0) ack 7301 win 65535 5.00 > . 8301:9301(1000) ack 1 5.01 write(4, ..., 100) = 100 5.01 < . 1:1(0) ack 8301 win 65535 5.01 > P. 9301:10301(1000) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 5.02 write(4, ..., 100) = 100 5.03 write(4, ..., 100) = 100 5.04 write(4, ..., 100) = 100 5.05 write(4, ..., 100) = 100 5.06 write(4, ..., 100) = 100 5.07 write(4, ..., 100) = 100 5.08 write(4, ..., 100) = 100 5.09 write(4, ..., 100) = 100 5.10 write(4, ..., 100) = 100 5.11 write(4, ..., 100) = 100 5.12 write(4, ..., 100) = 100 5.13 write(4, ..., 100) = 100 5.14 write(4, ..., 100) = 100 5.15 write(4, ..., 100) = 100 5.16 write(4, ..., 100) = 100 5.17 write(4, ..., 100) = 100 5.18 write(4, ..., 100) = 100 5.19 write(4, ..., 100) = 100 5.20 write(4, ..., 100) = 100 5.20 < . 1:1(0) ack 9301 win 65535 5.20 > . 10301:11301(1000) ack 1 5.21 write(4, ..., 100) = 100 5.21 < . 1:1(0) ack 10301 win 65535 5.21 > P. 11301:12301(1000) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% 5.22 write(4, ..., 100) = 100 5.23 write(4, ..., 100) = 100 5.24 write(4, ..., 100) = 100 5.25 write(4, ..., 100) = 100 5.26 write(4, ..., 100) = 100 5.27 write(4, ..., 100) = 100 5.28 write(4, ..., 100) = 100 5.29 write(4, ..., 100) = 100 5.30 write(4, ..., 100) = 100 5.31 write(4, ..., 100) = 100 5.32 write(4, ..., 100) = 100 5.33 write(4, ..., 100) = 100 5.34 write(4, ..., 100) = 100 5.35 write(4, ..., 100) = 100 5.36 write(4, ..., 100) = 100 5.37 write(4, ..., 100) = 100 5.38 write(4, ..., 100) = 100 5.39 write(4, ..., 100) = 100 5.40 write(4, ..., 100) = 100 5.40 < . 1:1(0) ack 11301 win 65535 5.40 > . 12301:13301(1000) ack 1 5.41 write(4, ..., 100) = 100 5.41 < . 1:1(0) ack 12301 win 65535 5.41 > P. 13301:14301(1000) ack 1 +0 %{ print "tcpi_snd_cwnd=%d tcpi_snd_ssthresh=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh) }% Output is : tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=7 tcpi_snd_cwnd=1 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 tcpi_snd_cwnd=2 tcpi_snd_ssthresh=2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> I'll describe two example scenarios in detail. In both scenarios we are in > congestion avoidance after experiencing loss. Nagle is disabled. Thanks for the detailed follow-up! And thanks, Eric, for the packetdrill script! This looks like an issue of how to deal with the case when we run out of packets to send and cwnd at exactly the same moment, and whether we consider that case cwnd-limited. Previously we did not consider ourselves cwnd-limited in that case, but I think your "scenario 1", and Eric's packetdrill case, show convincingly that we ought to consider ourselves cwnd-limited in that case. I would suggest we try fixing this by making sure that in scenario 1, at the moment when we fill the cwnd (packets_in_flight == cwnd == 2) we mark ourselves as cwnd-limited. More generally, my sense is that we should tweak the is_cwnd_limited code to shift from saying "set is_cwnd_limited to true iff the cwnd is known to be limiting transmits" to saying "set is_cwnd_limited to true iff the packets in flight fill the cwnd". This is partly justified by the kind of case presented here. Plus, in RFC 2861, section 3.1, which this code is helping to implement, it says: After TCP sends a packet, it also checks to see if that packet filled the congestion window. If so, the sender is network-limited, and sets the variable T_prev to the current TCP clock time, So at the point in Scenario 1 when we send the second packet, and packets_in_flight == cwnd == 2 then IMHO we should set tp->is_cwnd_limited = true and thus tp->snd_cwnd_stamp = tcp_time_stamp; So maybe we want to slightly tweak the way tcp_write_xmit() and tcp_tso_should_defer() treat this boundary condition? Gotta run to a meeting.... neal ps: If the tp->is_cwnd_limited and tcp_cwnd_validate() code were only used to decide whether to increase cwnd, then I think your fix would probably be sufficient. But it is also used for the code to implement RFC 2861 ("TCP Congestion Window Validation"). So I don't think just changing tp->is_cwnd_limited is sufficient. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index d0ad355..947eb57 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -2135,6 +2135,8 @@ repair: break; } + printk("CWND: %d, SSTHRESH: %d, PKTS_OUT: %d\n", tp->snd_cwnd, tp->snd_ssthresh, tp->packets_out); + if (likely(sent_pkts)) { if (tcp_in_cwnd_reduction(sk)) tp->prr_out += sent_pkts; The following patch avoids the need to induce loss by setting the connection in congestion avoidance from the beginning: diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index a62e9c7..206b32f 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5394,6 +5394,7 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb) tcp_init_metrics(sk); + tp->snd_cwnd = tp->snd_ssthresh = 2; tcp_init_congestion_control(sk); /* Prevent spurious tcp_cwnd_restart() on first data