
[net-next] tcp: Fix CWV being too strict on thin streams

Message ID CADVnQyn7683PoCo=PaP4CUF6CCNf+V7kCA8RdeZgZwk2SSddig@mail.gmail.com
State RFC, archived
Delegated to: David Miller

Commit Message

Neal Cardwell Sept. 22, 2015, 8:04 p.m. UTC
On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
> More generally, my sense is that we should tweak the is_cwnd_limited code to
> shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
> limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
> flight fill the cwnd".

Here is a proposed patch, and a packetdrill test case based on Eric's,
trying to capture the flavor of your Scenario 1. The test shows how
this proposed variant allows the sender (Reno in this case, for
simplicity) to increase cwnd each RTT in congestion avoidance, since
the cwnd is fully utilized, even though we run out of packets to send
and available cwnd at exactly the same moment.

Bendik, does this fix your scenario? How does this strike you as a
potential fix?

-------------------------------------

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cd0b50..57a586f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,

        /* Ok, it looks like it is advisable to defer. */

-       if (cong_win < send_win && cong_win < skb->len)
+       if (cong_win < send_win && cong_win <= skb->len)
                *is_cwnd_limited = true;

        return true;
@@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,

                cwnd_quota = tcp_cwnd_test(tp, skb);
                if (!cwnd_quota) {
-                       is_cwnd_limited = true;
                        if (push_one == 2)
                                /* Force out a loss probe pkt. */
                                cwnd_quota = 1;
@@ -2142,6 +2141,7 @@ repair:
                /* Send one loss probe per tail loss episode. */
                if (push_one != 2)
                        tcp_schedule_loss_probe(sk);
+               is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
                tcp_cwnd_validate(sk, is_cwnd_limited);
                return false;
        }

-------------------------------------
# cat cwv-ndc-upstream-1.pkt

// Keep it simple.
`sysctl net.ipv4.tcp_congestion_control=reno`

// Establish a connection and send 1 MSS.
0     socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < S 0:0(0) win 65535 <mss 1000,sackOK,nop,nop>
   +0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK>
+.200 < . 1:1(0) ack 1 win 65535
   +0 accept(3, ..., ...) = 4

   +0 write(4, ..., 100) = 100
   +0 > P. 1:101(100) ack 1
+.000 %{ print  tcpi_rto }%

// TLP
+.500~+.505 > P. 1:101(100) ack 1
// RTO
+.600~+.605 > P. 1:101(100) ack 1

+.200 < . 1:1(0) ack 101 win 65535
// cwnd should be 2, ssthresh should be 7
+0 %{ print "1: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%
2.000 write(4, ..., 100) = 100
   +0  > P. 101:201(100) ack 1

// TLP
+.500~+.505 > P. 101:201(100) ack 1
+0 %{ print "2: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%
// RTO
+1.200~+1.210 > P. 101:201(100) ack 1

+0 %{ print "3: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%
+.200 < . 1:1(0) ack 201 win 65535

// cwnd should be 2; since we fill it we should raise it to 3:
+0 %{ assert tcpi_snd_cwnd == 2 }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 201:1201(1000) ack 1
+0 %{ print "4: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 1201:2201(1000) ack 1
+0 %{ print "5: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 1201 win 65535
+0 %{ print "6: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 2201 win 65535
+0 %{ print "7: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

// cwnd should be 3; since we fill it we should raise it to 4:
+0 %{ assert tcpi_snd_cwnd == 3 }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 2201:3201(1000) ack 1
+0 %{ print "8: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 3201:4201(1000) ack 1
+0 %{ print "9: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 write(4, ..., 1000) = 1000
   +0 > P. 4201:5201(1000) ack 1
+0 %{ print "9: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 3201 win 65535
+0 %{ print "10: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 4201 win 65535
+0 %{ print "11: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

+.010 < . 1:1(0) ack 5201 win 65535
+0 %{ print "12: cwnd=%d ssthresh=%d out=%d" % (tcpi_snd_cwnd, tcpi_snd_ssthresh, tcpi_unacked) }%

// cwnd should be 4:
+0 %{ assert tcpi_snd_cwnd == 4 }%

-------------------------------------


Output is:

net.ipv4.tcp_congestion_control = reno
602000
1: cwnd=2 ssthresh=5 out=0
2: cwnd=2 ssthresh=5 out=1
3: cwnd=1 ssthresh=2 out=1
4: cwnd=2 ssthresh=2 out=1
5: cwnd=2 ssthresh=2 out=2
6: cwnd=2 ssthresh=2 out=1
7: cwnd=3 ssthresh=2 out=0
8: cwnd=3 ssthresh=2 out=1
9: cwnd=3 ssthresh=2 out=2
9: cwnd=3 ssthresh=2 out=3
10: cwnd=3 ssthresh=2 out=2
11: cwnd=3 ssthresh=2 out=1
12: cwnd=4 ssthresh=2 out=0
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Bendik Rønning Opstad Sept. 23, 2015, 2:46 p.m. UTC | #1
Neal, Eric, sorry for the late replies, but keeping up with your speedy replies
is a full time job :-)

The packetdrill scripts are certainly useful to test this, so thanks for
supplying those!

On Tuesday, September 22, 2015 04:04:37 PM you wrote:
> On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
> > More generally, my sense is that we should tweak the is_cwnd_limited code to
> > shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
> > limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
> > flight fill the cwnd".

I fully agree. This ensures that a failed attempt at transmitting a packet
is not required for the connection to be considered CWND limited.

> Here is a proposed patch, and a packetdrill test case based on Eric's,
> trying to capture the flavor of your Scenario 1. The test shows how
> this proposed variant allows the sender (Reno in this case, for
> simplicity) to increase cwnd each RTT in congestion avoidance, since
> the cwnd is fully utilized, even though we run out of packets to send
> and available cwnd at exactly the same moment.

This is actually very close to a variation of the patch we experimented with,
where we added a test before the call to tcp_cwnd_validate() to set
is_cwnd_limited. However, to avoid the extra check every time one or more
packets are transmitted (which we presumed would be preferable for
performance), we ended up updating the value only when no packets were sent.
If the extra check is not a problem, then updating the value before the call
to tcp_cwnd_validate() is better, and also more correct according to RFC 2861,
as you mentioned.

> Bendik, does this fix your scenario? How does this strike you as a
> potential fix?

This fixes the identified problem and is a good fix as far as our tests show.

Should I resend this as PATCH v2?

> -------------------------------------
> 
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 4cd0b50..57a586f 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
> 
>         /* Ok, it looks like it is advisable to defer. */
> 
> -       if (cong_win < send_win && cong_win < skb->len)
> +       if (cong_win < send_win && cong_win <= skb->len)
>                 *is_cwnd_limited = true;
> 
>         return true;
> @@ -2060,7 +2060,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
> 
>                 cwnd_quota = tcp_cwnd_test(tp, skb);
>                 if (!cwnd_quota) {
> -                       is_cwnd_limited = true;
>                         if (push_one == 2)
>                                 /* Force out a loss probe pkt. */
>                                 cwnd_quota = 1;
> @@ -2142,6 +2141,7 @@ repair:
>                 /* Send one loss probe per tail loss episode. */
>                 if (push_one != 2)
>                         tcp_schedule_loss_probe(sk);
> +               is_cwnd_limited |= (tcp_packets_in_flight(tp) >= tp->snd_cwnd);
>                 tcp_cwnd_validate(sk, is_cwnd_limited);
>                 return false;
>         }
> 

Neal Cardwell Sept. 23, 2015, 3:34 p.m. UTC | #2
On Wed, Sep 23, 2015 at 10:46 AM, Bendik Rønning Opstad
<bro.devel@gmail.com> wrote:
>> On Tue, Sep 22, 2015 at 2:02 PM, Neal Cardwell <ncardwell@google.com> wrote:
>> > More generally, my sense is that we should tweak the is_cwnd_limited code to
>> > shift from saying "set is_cwnd_limited to true iff the cwnd is known to be
>> > limiting transmits" to saying "set is_cwnd_limited to true iff the packets in
>> > flight fill the cwnd".
...
> This fixes the identified problem and is a good fix as far as our tests show.
>
> Should I resend this as PATCH v2?

Yes, please send this latest patch as a v2, with an updated commit
message, to reflect the latest diagnosis/rationale.

Thanks!
neal

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4cd0b50..57a586f 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1827,7 +1827,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,

        /* Ok, it looks like it is advisable to defer. */

-       if (cong_win < send_win && cong_win < skb->len)
+       if (cong_win < send_win && cong_win <= skb->len)
                *is_cwnd_limited = true;