diff mbox

[net-next] tcp: do not increase the rcv window when the FIN has been received

Message ID 20140102224021.GA20358@1wt.eu
State Deferred, archived
Delegated to: David Miller
Headers show

Commit Message

Willy Tarreau Jan. 2, 2014, 10:40 p.m. UTC
In HTTP performance tests it appeared that my client was always sending
an ACK immediately after receiving the FIN from the server and that the
sole purpose of this ACK was to advertise a larger window. This is not
nedeed when the FIN is received since the peer won't send any more data,
and wastes a packet in the direction to the server. Adding a check for
RCV_SHUTDOWN to the condition to send this packet fixes the issue, as
all packet sizes now behave like the short version. BTW that this is
even visible on the loopback.

Before the patch :

A 536-byte payload induces an immediate ACK after the server's FIN is
received :

    $ injectl464 -o 1 -u 1  -G 192.168.0.176:9000/?s=386 -T 1000

    15:08:38.939425 IP 192.168.0.176.35781 > 192.168.0.176.9000: Flags [S], seq 4152512456, win 43690, options [mss 65495,sackOK,TS val 1581485 ecr 0,nop,wscale 7], length 0
    15:08:38.939441 IP 192.168.0.176.9000 > 192.168.0.176.35781: Flags [S.], seq 691229241, ack 4152512457, win 43690, options [mss 65495,sackOK,TS val 1581485 ecr 1581485,nop,wscale 7], length 0
    15:08:38.939457 IP 192.168.0.176.35781 > 192.168.0.176.9000: Flags [.], ack 691229242, win 342, options [nop,nop,TS val 1581485 ecr 1581485], length 0
    15:08:38.939501 IP 192.168.0.176.35781 > 192.168.0.176.9000: Flags [P.], seq 4152512457:4152512584, ack 691229242, win 342, options [nop,nop,TS val 1581485 ecr 1581485], length 127
    15:08:38.939570 IP 192.168.0.176.9000 > 192.168.0.176.35781: Flags [F.], seq 691229242:691229778, ack 4152512584, win 342, options [nop,nop,TS val 1581485 ecr 1581485], length 536
    15:08:38.939595 IP 192.168.0.176.35781 > 192.168.0.176.9000: Flags [.], ack 691229779, win 342, options [nop,nop,TS val 1581485 ecr 1581485], length 0
    15:08:38.940017 IP 192.168.0.176.35781 > 192.168.0.176.9000: Flags [F.], seq 4152512584, ack 691229779, win 342, options [nop,nop,TS val 1581486 ecr 1581485], length 0
    15:08:38.940035 IP 192.168.0.176.9000 > 192.168.0.176.35781: Flags [.], ack 4152512585, win 342, options [nop,nop,TS val 1581486 ecr 1581486], length 0

A smaller payload (535 bytes and below) does not cause this to happen :

    $ injectl464 -o 1 -u 1  -G 192.168.0.176:9000/?s=385 -T 1000

    15:08:43.691785 IP 192.168.0.176.35782 > 192.168.0.176.9000: Flags [S], seq 1414138895, win 43690, options [mss 65495,sackOK,TS val 1582673 ecr 0,nop,wscale 7], length 0
    15:08:43.691810 IP 192.168.0.176.9000 > 192.168.0.176.35782: Flags [S.], seq 1784595528, ack 1414138896, win 43690, options [mss 65495,sackOK,TS val 1582673 ecr 1582673,nop,wscale 7], length 0
    15:08:43.691825 IP 192.168.0.176.35782 > 192.168.0.176.9000: Flags [.], ack 1784595529, win 342, options [nop,nop,TS val 1582673 ecr 1582673], length 0
    15:08:43.691868 IP 192.168.0.176.35782 > 192.168.0.176.9000: Flags [P.], seq 1414138896:1414139023, ack 1784595529, win 342, options [nop,nop,TS val 1582673 ecr 1582673], length 127
    15:08:43.691933 IP 192.168.0.176.9000 > 192.168.0.176.35782: Flags [F.], seq 1784595529:1784596064, ack 1414139023, win 342, options [nop,nop,TS val 1582673 ecr 1582673], length 535
    15:08:43.692034 IP 192.168.0.176.35782 > 192.168.0.176.9000: Flags [F.], seq 1414139023, ack 1784596065, win 342, options [nop,nop,TS val 1582674 ecr 1582673], length 0
    15:08:43.692048 IP 192.168.0.176.9000 > 192.168.0.176.35782: Flags [.], ack 1414139024, win 342, options [nop,nop,TS val 1582674 ecr 1582674], length 0

Signed-off-by: Willy Tarreau <w@1wt.eu>
---
 net/ipv4/tcp.c       | 3 ++-
 net/ipv4/tcp_input.c | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

Comments

David Miller Jan. 4, 2014, 12:58 a.m. UTC | #1
From: Willy Tarreau <w@1wt.eu>
Date: Thu, 2 Jan 2014 23:40:21 +0100

> In HTTP performance tests it appeared that my client was always sending
> an ACK immediately after receiving the FIN from the server and that the
> sole purpose of this ACK was to advertise a larger window.

I guess the question is what behavior do we want here.

Frankly, I think we should always immediately ACK a FIN _unless_ we
already have data pending on the send queue on which to piggyback that
ACK.

The reason is that since we know there will be no more data, delaying
the ACK has none of the useful characteristics.  In fact, sending the
ACK immediately will allow the closing side to release the data in it's
retransmit queue, and thus reclaim memory, more quickly.

I'm not so sure about this change, so I'm marking it deferred.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau Jan. 4, 2014, 8:22 a.m. UTC | #2
Hi David,

On Fri, Jan 03, 2014 at 07:58:10PM -0500, David Miller wrote:
> From: Willy Tarreau <w@1wt.eu>
> Date: Thu, 2 Jan 2014 23:40:21 +0100
> 
> > In HTTP performance tests it appeared that my client was always sending
> > an ACK immediately after receiving the FIN from the server and that the
> > sole purpose of this ACK was to advertise a larger window.
> 
> I guess the question is what behavior do we want here.
> 
> Frankly, I think we should always immediately ACK a FIN _unless_ we
> already have data pending on the send queue on which to piggyback that
> ACK.
> 
> The reason is that since we know there will be no more data, delaying
> the ACK has none of the useful characteristics.  In fact, sending the
> ACK immediately will allow the closing side to release the data in it's
> retransmit queue, and thus reclaim memory, more quickly.

Yes but on the other hand, most often when we receive a FIN, there is
immediate local action. Either data are pending and will serve as an
ACK, or the local endpoint will immediately close as well. Currently,
if the application doesn't react fast enough, the ACK is emitted after
40 ms anyway. So a properly designed application has an opportunity of
40 ms to react quickly and save this ACK. Currently it works only when
the data accompanying the FIN is less than 536 bytes, and my point was
to ensure that larger responses were covered by the same possibility
as well.

> I'm not so sure about this change, so I'm marking it deferred.

OK no problem.

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index c4638e6..1eab74c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1350,7 +1350,8 @@  void tcp_cleanup_rbuf(struct sock *sk, int copied)
 		    * receive. */
 		if (icsk->icsk_ack.blocked ||
 		    /* Once-per-two-segments ACK was not sent by tcp_input.c */
-		    tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss ||
+		    (!(sk->sk_shutdown & RCV_SHUTDOWN) &&
+		     tp->rcv_nxt - tp->rcv_wup > icsk->icsk_ack.rcv_mss) ||
 		    /*
 		     * If this read emptied read buffer, we send ACK, if
 		     * connection is not bidirectional, user drained
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c53b7f3..c6c0420 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4796,6 +4796,8 @@  static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 
 	    /* More than one full frame received... */
 	if (((tp->rcv_nxt - tp->rcv_wup) > inet_csk(sk)->icsk_ack.rcv_mss &&
+	     /* and peer is interested in our window updates */
+	     !(sk->sk_shutdown & RCV_SHUTDOWN) &&
 	     /* ... and right edge of window advances far enough.
 	      * (tcp_recvmsg() will send ACK otherwise). Or...
 	      */