diff mbox

[-next] net: tcp: move to timewait when receiving data post active-close

Message ID 1447859024-1040-1-git-send-email-fw@strlen.de
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Florian Westphal Nov. 18, 2015, 3:03 p.m. UTC
RFC 1122, 4.2.2.13:
 [..] if new data is received after CLOSE is called, its TCP
 SHOULD send a RST to show that data was lost.

 When a connection is closed actively, it MUST linger in
 TIME-WAIT state [..].

We reset a connection, but destroy state immediately.

After discussing this with Hannes, we decided it was preferable
to also move to TW state to avoid immediate port reuse.

packetdrill testcase:

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 29200 <mss 1460>
0.100 > S. 0:0(0) ack 1 <mss 1460>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
// close our side.
0.210 close(4) = 0
// we should expect to see FIN now, sk moves to FIN_WAIT_1
0.210 > F. 1:1(0) ack 1 win 29200
// receive data, but sk already closed -> Reset
0.300 < P. 1:1001(1000) ack 1 win 46
0.300 > R 1:1(0) win 0

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
We got complaint from customer that CLOSED state transition
is RFC violation.

The advantage of current behaviour is that further packets will also
result in RST.

I'm interested if others think its worth changing this now.

Thanks.

Comments

Eric Dumazet Nov. 18, 2015, 3:28 p.m. UTC | #1
On Wed, 2015-11-18 at 16:03 +0100, Florian Westphal wrote:
> RFC 1122, 4.2.2.13:
>  [..] if new data is received after CLOSE is called, its TCP
>  SHOULD send a RST to show that data was lost.
> 
>  When a connection is closed actively, it MUST linger in
>  TIME-WAIT state [..].
> 
> We reset a connection, but destroy state immediately.
> 
> After discussing this with Hannes, we decided it was preferable
> to also move to TW state to avoid immediate port reuse.
> 
> packetdrill testcase:
> 
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 0.000 bind(3, ..., ...) = 0
> 0.000 listen(3, 1) = 0
> 0.100 < S 0:0(0) win 29200 <mss 1460>
> 0.100 > S. 0:0(0) ack 1 <mss 1460>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> // close our side.
> 0.210 close(4) = 0
> // we should expect to see FIN now, sk moves to FIN_WAIT_1
> 0.210 > F. 1:1(0) ack 1 win 29200
> // receive data, but sk already closed -> Reset
> 0.300 < P. 1:1001(1000) ack 1 win 46
> 0.300 > R 1:1(0) win 0

Thanks for using packetdrill ;)

This packetdrill test shows nothing special regarding your patch, it
should work right now with current kernels ???

lpk51:~# tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
lpk51:~# ./packetdrill Florian.pkt 
07:26:49.656548 IP 192.0.2.1.59883 > 192.168.155.177.8080: Flags [S], seq 0, win 29200, options [mss 1460], length 0
07:26:49.656611 IP 192.168.155.177.8080 > 192.0.2.1.59883: Flags [S.], seq 1433468786, ack 1, win 29200, options [mss 1460], length 0
07:26:49.756530 IP 192.0.2.1.59883 > 192.168.155.177.8080: Flags [.], ack 1, win 257, length 0
07:26:49.766529 IP 192.168.155.177.8080 > 192.0.2.1.59883: Flags [F.], seq 1, ack 1, win 29200, length 0
07:26:49.856529 IP 192.0.2.1.59883 > 192.168.155.177.8080: Flags [P.], seq 1:1001, ack 1, win 46, length 1000: HTTP
07:26:49.856560 IP 192.168.155.177.8080 > 192.0.2.1.59883: Flags [R], seq 1433468787, win 0, length 0
07:26:49.856806 IP 192.0.2.1.59883 > 192.168.155.177.8080: Flags [R.], seq 1001, ack 1, win 46, length 0


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Nov. 18, 2015, 3:36 p.m. UTC | #2
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-11-18 at 16:03 +0100, Florian Westphal wrote:
> > RFC 1122, 4.2.2.13:
> >  [..] if new data is received after CLOSE is called, its TCP
> >  SHOULD send a RST to show that data was lost.
> > 
> >  When a connection is closed actively, it MUST linger in
> >  TIME-WAIT state [..].
> > 
> > We reset a connection, but destroy state immediately.
> > 
> > After discussing this with Hannes, we decided it was preferable
> > to also move to TW state to avoid immediate port reuse.
> > 
> > packetdrill testcase:
> > 
> > 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> > 0.000 bind(3, ..., ...) = 0
> > 0.000 listen(3, 1) = 0
> > 0.100 < S 0:0(0) win 29200 <mss 1460>
> > 0.100 > S. 0:0(0) ack 1 <mss 1460>
> > 0.200 < . 1:1(0) ack 1 win 257
> > 0.200 accept(3, ..., ...) = 4
> > // close our side.
> > 0.210 close(4) = 0
> > // we should expect to see FIN now, sk moves to FIN_WAIT_1
> > 0.210 > F. 1:1(0) ack 1 win 29200
> > // receive data, but sk already closed -> Reset
> > 0.300 < P. 1:1001(1000) ack 1 win 46
> > 0.300 > R 1:1(0) win 0


> This packetdrill test shows nothing special regarding your patch, it
> should work right now with current kernels ???

Yes, but we kill the socket.

I should have added

0.400 `ss -nito state time-wait`

as last line...

Before patch: no output
after patch: tw socket shown.

The on-wire behavior doesn't change unless further packets arrive.
Old behaviour: more RST
New behaviour: acks+tw timer restart

Sorry for the confusion.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 18, 2015, 3:46 p.m. UTC | #3
On Wed, 2015-11-18 at 16:36 +0100, Florian Westphal wrote:

> Yes, but we kill the socket.
> 
> I should have added
> 
> 0.400 `ss -nito state time-wait`
> 
> as last line...
> 
> Before patch: no output
> after patch: tw socket shown.
> 
> The on-wire behavior doesn't change unless further packets arrive.
> Old behaviour: more RST
> New behaviour: acks+tw timer restart

Just add few more incoming packets to the packetdrill test then ?

Also, is your customer really _not_ using TCP timestamps ?

This is kind of a requirement for port reuse anyway.

Anyway, having a TIMEWAIT setup after sending a RST makes little sense
to me.

When a RST packet is sent, the remote peer will forget everything about
this previous connection, and another connect() might reuse the tuple
and I do not think we should forbid this. Normal PAWS checks were
invented for a good reason.

RFC 1122, 4.2.2.13 can be interpreted in very different ways.

Please show us real issue your customer has.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Nov. 18, 2015, 3:54 p.m. UTC | #4
On Wed, Nov 18, 2015, at 16:46, Eric Dumazet wrote:
> On Wed, 2015-11-18 at 16:36 +0100, Florian Westphal wrote:
> 
> > Yes, but we kill the socket.
> > 
> > I should have added
> > 
> > 0.400 `ss -nito state time-wait`
> > 
> > as last line...
> > 
> > Before patch: no output
> > after patch: tw socket shown.
> > 
> > The on-wire behavior doesn't change unless further packets arrive.
> > Old behaviour: more RST
> > New behaviour: acks+tw timer restart
> 
> Just add few more incoming packets to the packetdrill test then ?
> 
> Also, is your customer really _not_ using TCP timestamps ?

Windows mostly does not use TCP timestamps. Also we have cases were
security folks tell customers to turn off timestamps as they enable
attackers to guess uptime. :(

> This is kind of a requirement for port reuse anyway.
> 
> Anyway, having a TIMEWAIT setup after sending a RST makes little sense
> to me.
> 
> When a RST packet is sent, the remote peer will forget everything about
> this previous connection, and another connect() might reuse the tuple
> and I do not think we should forbid this. Normal PAWS checks were
> invented for a good reason.

Still, the RST packet can be dropped along the way. So the teardown of
the socket on the other side might not happen.

Bye,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 18, 2015, 5:28 p.m. UTC | #5
On Wed, 2015-11-18 at 16:54 +0100, Hannes Frederic Sowa wrote:

> Still, the RST packet can be dropped along the way. So the teardown of
> the socket on the other side might not happen.

This is why it is better to send RST for every incoming in-excess packet

Try following packetdrill test, before and after your patch :

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0
0.100 < S 0:0(0) win 29200 <mss 1460>
0.100 > S. 0:0(0) ack 1 <mss 1460>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4
// close our side.
0.210 close(4) = 0
// we should expect to see FIN now, sk moves to FIN_WAIT_1
0.210 > F. 1:1(0) ack 1 win 29200

// receive data, but sk already closed -> Reset
+.010 < P. 1:1001(1000) ack 1 win 46
+0    > R 1:1(0) win 0

// Are we properly sending a RST like prior packet did ?
+.010 < P. 1001:2001(1000) ack 1 win 46
+0    > R 1:1(0) win 0



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Florian Westphal Nov. 18, 2015, 5:32 p.m. UTC | #6
Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2015-11-18 at 16:54 +0100, Hannes Frederic Sowa wrote:
> 
> > Still, the RST packet can be dropped along the way. So the teardown of
> > the socket on the other side might not happen.
> 
> This is why it is better to send RST for every incoming in-excess packet

I'm convinced and marked patch as rejected in patchwork.

Thanks Eric!
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hannes Frederic Sowa Nov. 18, 2015, 5:35 p.m. UTC | #7
On Wed, Nov 18, 2015, at 18:28, Eric Dumazet wrote:
> On Wed, 2015-11-18 at 16:54 +0100, Hannes Frederic Sowa wrote:
> 
> > Still, the RST packet can be dropped along the way. So the teardown of
> > the socket on the other side might not happen.
> 
> This is why it is better to send RST for every incoming in-excess packet

I agree it would be better to send a RST than ACK incoming data. Just
thinking out loud, would it make sense to add a sub-state to timewait so
we RST all other packets for TIMEWAIT_LEN duration?

> Try following packetdrill test, before and after your patch :
> 
> 0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> 0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
> 0.000 bind(3, ..., ...) = 0
> 0.000 listen(3, 1) = 0
> 0.100 < S 0:0(0) win 29200 <mss 1460>
> 0.100 > S. 0:0(0) ack 1 <mss 1460>
> 0.200 < . 1:1(0) ack 1 win 257
> 0.200 accept(3, ..., ...) = 4
> // close our side.
> 0.210 close(4) = 0
> // we should expect to see FIN now, sk moves to FIN_WAIT_1
> 0.210 > F. 1:1(0) ack 1 win 29200
> 
> // receive data, but sk already closed -> Reset
> +.010 < P. 1:1001(1000) ack 1 win 46
> +0    > R 1:1(0) win 0
> 
> // Are we properly sending a RST like prior packet did ?
> +.010 < P. 1001:2001(1000) ack 1 win 46
> +0    > R 1:1(0) win 0

Thanks,
Hannes
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Nov. 18, 2015, 6:35 p.m. UTC | #8
On Wed, 2015-11-18 at 18:35 +0100, Hannes Frederic Sowa wrote:
> On Wed, Nov 18, 2015, at 18:28, Eric Dumazet wrote:
> > On Wed, 2015-11-18 at 16:54 +0100, Hannes Frederic Sowa wrote:
> > 
> > > Still, the RST packet can be dropped along the way. So the teardown of
> > > the socket on the other side might not happen.
> > 
> > This is why it is better to send RST for every incoming in-excess packet
> 
> I agree it would be better to send a RST than ACK incoming data. Just
> thinking out loud, would it make sense to add a sub-state to timewait so
> we RST all other packets for TIMEWAIT_LEN duration?

Intuitively this timewait creation looks dangerous to me, maybe from
an attack surface perspective.

An out of sequence packet sends an advisory RST packet, which is already
a hint for building some kinds of attacks.

Now if we also create a timewait socket, and its associated timer, we
also add quite a risk of memory and cpu consumption.

So I am fine you create a timewait, only if you can prove there is no
change in security.

Thanks !


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller Nov. 19, 2015, 5:11 p.m. UTC | #9
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Nov 2015 07:46:31 -0800

> Anyway, having a TIMEWAIT setup after sending a RST makes little
> sense to me.
> 
> When a RST packet is sent, the remote peer will forget everything about
> this previous connection, and another connect() might reuse the tuple
> and I do not think we should forbid this. Normal PAWS checks were
> invented for a good reason.
> 
> RFC 1122, 4.2.2.13 can be interpreted in very different ways.

I think it is clear that once RST is emitted, that connection ID
no longer exists, on both ends.

Sender of RST must not have that matching state, and receiver of
RST must tear things down.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index fdd88c3..3bcdad1 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5904,11 +5904,16 @@  int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			break;
 		}
 
-		if (tp->linger2 < 0 ||
-		    (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
-		     after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt))) {
+		if (tp->linger2 < 0) {
+			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
 			tcp_done(sk);
+			return 1;
+		}
+
+		if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
+		    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
+			tcp_time_wait(sk, TCP_TIME_WAIT, 0);
 			return 1;
 		}
 
@@ -5966,7 +5971,13 @@  int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			if (TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(skb)->seq &&
 			    after(TCP_SKB_CB(skb)->end_seq - th->fin, tp->rcv_nxt)) {
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPABORTONDATA);
-				tcp_reset(sk);
+
+				if (sk->sk_state == TCP_CLOSE_WAIT ||
+				    sk->sk_state == TCP_LAST_ACK)
+					tcp_reset(sk);
+				else
+					tcp_time_wait(sk, TCP_TIME_WAIT, 0);
+
 				return 1;
 			}
 		}