diff mbox

[next,resend] tcp: use zero-window when free_space is low

Message ID 1392292350-28800-1-git-send-email-fw@strlen.de
State Deferred, archived
Delegated to: David Miller
Headers show

Commit Message

Florian Westphal Feb. 13, 2014, 11:52 a.m. UTC
Currently the kernel tries to announce a zero window when free_space
is below the current receiver mss estimate.

When a sender is transmitting small packets and reader consumes data
slowly (or not at all), receiver might be unable to shrink the receive
win because

a) we cannot withdraw already-commited receive window, and,
b) we have to round the current rwin up to a multiple of the wscale
   factor, else we would shrink the current window.

This causes the receive buffer to fill up until the rmem limit is hit.
When this happens, we start dropping packets.

Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
even if socket is not being read from.

As we cannot avoid the "current_win is rounded up to multiple of mss"
issue [we would violate a) above] at least try to prevent the receive buf
growth towards tcp_rmem[2] limit by attempting to move to zero-window
announcement when free_space becomes less than 1/16 of the current
allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
increase our chances to get a zero-window announcement out in time.

Reproducer:
On server:
$ nc -l -p 12345
<suspend it: CTRL-Z>

Client:
#!/usr/bin/env python
import socket
import time

sock = socket.socket()
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
sock.connect(("192.168.4.1", 12345));
while True:
   sock.send('A' * 23)
   time.sleep(0.005)


socket buffer on server-side will grow until tcp_rmem[2] is hit,
at which point the client rexmits data until -EDTIMEOUT:

tcp_data_queue invokes tcp_try_rmem_schedule which will call
tcp_prune_queue which calls tcp_clamp_window().  And that function will
grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].

Cc: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 V1 of this patch was deferred, resending to get discussion going again.
 Changes since v1:
  - add reproducer to commit message

 Unfortunately I couldn't come up with something that has no magic
 ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
 tput limitations in my 'full-mss-sized, steady state' netcat tests.

 Maybe someone has better idea?

Comments

Eric Dumazet Feb. 13, 2014, 2:58 p.m. UTC | #1
On Thu, 2014-02-13 at 12:52 +0100, Florian Westphal wrote:
> Currently the kernel tries to announce a zero window when free_space
> is below the current receiver mss estimate.
> 
> When a sender is transmitting small packets and reader consumes data
> slowly (or not at all), receiver might be unable to shrink the receive
> win because
> 
> a) we cannot withdraw already-commited receive window, and,
> b) we have to round the current rwin up to a multiple of the wscale
>    factor, else we would shrink the current window.
> 
> This causes the receive buffer to fill up until the rmem limit is hit.
> When this happens, we start dropping packets.
> 
> Moreover, tcp_clamp_window may continue to grow sk_rcvbuf towards rmem[2]
> even if socket is not being read from.
> 
> As we cannot avoid the "current_win is rounded up to multiple of mss"
> issue [we would violate a) above] at least try to prevent the receive buf
> growth towards tcp_rmem[2] limit by attempting to move to zero-window
> announcement when free_space becomes less than 1/16 of the current
> allowed receive buffer maximum.  If tcp_rmem[2] is large, this will
> increase our chances to get a zero-window announcement out in time.
> 
> Reproducer:
> On server:
> $ nc -l -p 12345
> <suspend it: CTRL-Z>
> 
> Client:
> #!/usr/bin/env python
> import socket
> import time
> 
> sock = socket.socket()
> sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
> sock.connect(("192.168.4.1", 12345));
> while True:
>    sock.send('A' * 23)
>    time.sleep(0.005)
> 
> 
> socket buffer on server-side will grow until tcp_rmem[2] is hit,
> at which point the client rexmits data until -EDTIMEOUT:
> 
> tcp_data_queue invokes tcp_try_rmem_schedule which will call
> tcp_prune_queue which calls tcp_clamp_window().  And that function will
> grow sk->sk_rcvbuf up until it eventually hits tcp_rmem[2].
> 
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  V1 of this patch was deferred, resending to get discussion going again.
>  Changes since v1:
>   - add reproducer to commit message
> 
>  Unfortunately I couldn't come up with something that has no magic
>  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> 
>  Maybe someone has better idea?

Thanks a lot Florian looking at this.

Do we have one SNMP counter tracking number of time we took the decision
to send a 0 window ?

Would you mind waiting we run our packetdrill tests before acknowledging
this patch, because I suspect this might have some impact ?

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Feb. 13, 2014, 3:34 p.m. UTC | #2
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Do we have one SNMP counter tracking number of time we took the decision
> to send a 0 window ?

No.

> Would you mind waiting we run our packetdrill tests before acknowledging
> this patch, because I suspect this might have some impact ?

Of course not.  I am very happy that you folks have these kinds of tests
and are willing to double-check.  Take all time you need, there is no
need to haste.

Many Thanks Eric.

Do you think it makes sense to add counters for this?

The caveat is that decision to send 0 window doesn't mean we end
up sending one, since we cannot shrink already offered window.

static u16 tcp_select_window(struct sock *sk)
{
    struct tcp_sock *tp = tcp_sk(sk);
    u32 cur_win = tcp_receive_window(tp);
    u32 new_win = __tcp_select_window(sk);

    /* Never shrink the offered window */
    if (new_win < cur_win) {

Would you add SNMP counter for '__tcp_select_window() wants 0 window'
or for 'tcp_select_window() does pick 0 window' ?

[ or even different counters for both ? ]

Cheers,
Florian
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 13, 2014, 4:19 p.m. UTC | #3
On Thu, 2014-02-13 at 16:34 +0100, Florian Westphal wrote:
> Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Would you mind waiting we run our packetdrill tests before acknowledging
> > this patch, because I suspect this might have some impact ?
> 
> Of course not.  I am very happy that you folks have these kinds of tests
> and are willing to double-check.  Take all time you need, there is no
> need to haste.
> 


We have today 348 different packetdrill tests, and this number keeps
increasing ;)

> Many Thanks Eric.
> 
> Do you think it makes sense to add counters for this?
> 
> The caveat is that decision to send 0 window doesn't mean we end
> up sending one, since we cannot shrink already offered window.
> 
> static u16 tcp_select_window(struct sock *sk)
> {
>     struct tcp_sock *tp = tcp_sk(sk);
>     u32 cur_win = tcp_receive_window(tp);
>     u32 new_win = __tcp_select_window(sk);
> 
>     /* Never shrink the offered window */
>     if (new_win < cur_win) {
> 
> Would you add SNMP counter for '__tcp_select_window() wants 0 window'
> or for 'tcp_select_window() does pick 0 window' ?
> 
> [ or even different counters for both ? ]

Ideally, having counters of transitions would be nice.

This would help us to spot regressions in TCP stacks or network drivers,
or wrong application tunings.

One counter tracking number of times a socket went from a non zero
window to a zero window (as you said, I am referring to effective window
being sent)

One counter tracking the reverse.

If it proves being too complex or expensive, a single counter tracking
"win 0" sent to the peers.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Rick Jones Feb. 13, 2014, 5:18 p.m. UTC | #4
On 02/13/2014 03:52 AM, Florian Westphal wrote:
>   Maybe someone has better idea?

Does tcp_rmem have to be a "hard" limit, or could it allowed to be fuzzy?

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 17, 2014, 7:34 p.m. UTC | #5
From: Florian Westphal <fw@strlen.de>
Date: Thu, 13 Feb 2014 12:52:30 +0100

>  V1 of this patch was deferred, resending to get discussion going again.
>  Changes since v1:
>   - add reproducer to commit message
> 
>  Unfortunately I couldn't come up with something that has no magic
>  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> 
>  Maybe someone has better idea?

I know this is going to be frustrating, but I've marked this 'deferred'
again.  Please resubmit after the testing and further discussions have
been worked out.

Thank you.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Feb. 17, 2014, 8:52 p.m. UTC | #6
David Miller <davem@davemloft.net> wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Thu, 13 Feb 2014 12:52:30 +0100
> 
> >  V1 of this patch was deferred, resending to get discussion going again.
> >  Changes since v1:
> >   - add reproducer to commit message
> > 
> >  Unfortunately I couldn't come up with something that has no magic
> >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > 
> >  Maybe someone has better idea?
> 
> I know this is going to be frustrating, but I've marked this 'deferred'
> again.  Please resubmit after the testing and further discussions have
> been worked out.

Thanks for letting me know.  I understand why you are reluctant to just
apply this.

I will submit another patch shortly that introduces snmp counter for zero-window
(what Eric suggested), perhaps it helps him or others to find a better solution
in the scenario.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 18, 2014, 5:30 p.m. UTC | #7
On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
> David Miller <davem@davemloft.net> wrote:
> > From: Florian Westphal <fw@strlen.de>
> > Date: Thu, 13 Feb 2014 12:52:30 +0100
> > 
> > >  V1 of this patch was deferred, resending to get discussion going again.
> > >  Changes since v1:
> > >   - add reproducer to commit message
> > > 
> > >  Unfortunately I couldn't come up with something that has no magic
> > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > > 
> > >  Maybe someone has better idea?
> > 
> > I know this is going to be frustrating, but I've marked this 'deferred'
> > again.  Please resubmit after the testing and further discussions have
> > been worked out.
> 
> Thanks for letting me know.  I understand why you are reluctant to just
> apply this.
> 
> I will submit another patch shortly that introduces snmp counter for zero-window
> (what Eric suggested), perhaps it helps him or others to find a better solution
> in the scenario.

Sorry for the delay guys, I was on vacation.

I started the tests this morning, results in about 2/3 hours.

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Feb. 18, 2014, 11:12 p.m. UTC | #8
On Tue, 2014-02-18 at 09:30 -0800, Eric Dumazet wrote:
> On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
> > David Miller <davem@davemloft.net> wrote:
> > > From: Florian Westphal <fw@strlen.de>
> > > Date: Thu, 13 Feb 2014 12:52:30 +0100
> > > 
> > > >  V1 of this patch was deferred, resending to get discussion going again.
> > > >  Changes since v1:
> > > >   - add reproducer to commit message
> > > > 
> > > >  Unfortunately I couldn't come up with something that has no magic
> > > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
> > > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
> > > > 
> > > >  Maybe someone has better idea?
> > > 
> > > I know this is going to be frustrating, but I've marked this 'deferred'
> > > again.  Please resubmit after the testing and further discussions have
> > > been worked out.
> > 
> > Thanks for letting me know.  I understand why you are reluctant to just
> > apply this.
> > 
> > I will submit another patch shortly that introduces snmp counter for zero-window
> > (what Eric suggested), perhaps it helps him or others to find a better solution
> > in the scenario.
> 
> Sorry for the delay guys, I was on vacation.
> 
> I started the tests this morning, results in about 2/3 hours.

No regression found, feel free to add my :

Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Feb. 19, 2014, 9:36 p.m. UTC | #9
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 18 Feb 2014 15:12:13 -0800

> On Tue, 2014-02-18 at 09:30 -0800, Eric Dumazet wrote:
>> On Mon, 2014-02-17 at 21:52 +0100, Florian Westphal wrote:
>> > David Miller <davem@davemloft.net> wrote:
>> > > From: Florian Westphal <fw@strlen.de>
>> > > Date: Thu, 13 Feb 2014 12:52:30 +0100
>> > > 
>> > > >  V1 of this patch was deferred, resending to get discussion going again.
>> > > >  Changes since v1:
>> > > >   - add reproducer to commit message
>> > > > 
>> > > >  Unfortunately I couldn't come up with something that has no magic
>> > > >  ('allowed >> 4') value.  I chose >>4 (1/16th) because it didn't cause
>> > > >  tput limitations in my 'full-mss-sized, steady state' netcat tests.
>> > > > 
>> > > >  Maybe someone has better idea?
>> > > 
>> > > I know this is going to be frustrating, but I've marked this 'deferred'
>> > > again.  Please resubmit after the testing and further discussions have
>> > > been worked out.
>> > 
>> > Thanks for letting me know.  I understand why you are reluctant to just
>> > apply this.
>> > 
>> > I will submit another patch shortly that introduces snmp counter for zero-window
>> > (what Eric suggested), perhaps it helps him or others to find a better solution
>> > in the scenario.
>> 
>> Sorry for the delay guys, I was on vacation.
>> 
>> I started the tests this morning, results in about 2/3 hours.
> 
> No regression found, feel free to add my :
> 
> Acked-by: Eric Dumazet <edumazet@google.com>
> Tested-by: Eric Dumazet <edumazet@google.com>

Florian please resend with Eric's tags and I'll apply.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2a69f42..fd8d821 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2145,7 +2145,8 @@  u32 __tcp_select_window(struct sock *sk)
 	 */
 	int mss = icsk->icsk_ack.rcv_mss;
 	int free_space = tcp_space(sk);
-	int full_space = min_t(int, tp->window_clamp, tcp_full_space(sk));
+	int allowed_space = tcp_full_space(sk);
+	int full_space = min_t(int, tp->window_clamp, allowed_space);
 	int window;
 
 	if (mss > full_space)
@@ -2158,7 +2159,19 @@  u32 __tcp_select_window(struct sock *sk)
 			tp->rcv_ssthresh = min(tp->rcv_ssthresh,
 					       4U * tp->advmss);
 
-		if (free_space < mss)
+		/* free_space might become our new window, make sure we don't
+		 * increase it due to wscale.
+		 */
+		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+
+		/* if free space is less than mss estimate, or is below 1/16th
+		 * of the maximum allowed, try to move to zero-window, else
+		 * tcp_clamp_window() will grow rcv buf up to tcp_rmem[2], and
+		 * new incoming data is dropped due to memory limits.
+		 * With large window, mss test triggers way too late in order
+		 * to announce zero window in time before rmem limit kicks in.
+		 */
+		if (free_space < (allowed_space >> 4) || free_space < mss)
 			return 0;
 	}