Missing TCP SYN on loopback, retransmits after 1s

Message ID 1322059124.17693.24.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 23, 2011, 2:38 p.m. UTC
On Tuesday, 22 November 2011 at 18:37 -0600, Jesse Young wrote:
> On Tue, 22 Nov 2011 19:23:38 -0500 (EST)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Jesse Young <jlyo@jlyo.org>
> > Date: Tue, 22 Nov 2011 18:13:20 -0600
> >
> > > What's also puzzling, is that I see no packet drop reporting in
> > > $ ifconfig lo
> > > lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 16436  metric 1
> > > inet 127.0.0.1  netmask 255.0.0.0
> > > inet6 ::1  prefixlen 128  scopeid 0x10<host>
> > > loop  txqueuelen 0  (Local Loopback)
> > > RX packets 276411482  bytes 15822880567 (14.7 GiB)
> > > RX errors 0  dropped 0  overruns 0  frame 0
> > > TX packets 276411482 bytes 15822880567 (14.7 GiB)
> > > TX errors 0 dropped 0 overruns 0 carrier 0 collisions
> >
> > The device driver therefore isn't even seeing the packets, they are
> > being dropped elsewhere.
> >
> > Why is this "puzzling"?
> >
> > There's layers upon layers and thousands of places where packets can
> > be dropped between the originating network stack and the actual device
> > driver.
> 
> Maybe puzzling isn't the best word... just some more relevant
> information.  Also, this is the loopback interface, there is no device
> driver, PHY or DLL layer in question here (just the loopback's mock
> driver/PHY/DLL).
> 
> I presume that the drop is occurring in between the NET layer, and the sys
> call interface, do you agree?  Where should I begin looking?
> --

Here is the patch to solve this IPv6 problem, thanks a lot for the
report!

[PATCH] ipv6: tcp: fix tcp_v6_conn_request()

Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
cookies), we can drop a SYN packet reusing a TIME_WAIT socket.

(As a matter of fact we fail to send the SYNACK answer)

As the client resends its SYN packet after a one second timeout, we
accept it, because first packet removed the TIME_WAIT socket before
being dropped.

This probably explains why nobody ever noticed or complained.

Reported-by: Jesse Young <jlyo@jlyo.org>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/tcp_ipv6.c |   13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Nov. 23, 2011, 10:29 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 23 Nov 2011 15:38:44 +0100

> [PATCH] ipv6: tcp: fix tcp_v6_conn_request()
> 
> Since linux 2.6.26 (commit c6aefafb7ec6 : Add IPv6 support to TCP SYN
> cookies), we can drop a SYN packet reusing a TIME_WAIT socket.
> 
> (As a matter of fact we fail to send the SYNACK answer)
> 
> As the client resends its SYN packet after a one second timeout, we
> accept it, because first packet removed the TIME_WAIT socket before
> being dropped.
> 
> This probably explains why nobody ever noticed or complained.
> 
> Reported-by: Jesse Young <jlyo@jlyo.org>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied and queued up for -stable, thanks!
Eric Dumazet March 23, 2013, 4:58 p.m. UTC | #2
On Fri, 2013-03-22 at 19:03 -0700, parasytic@gmail.com wrote:
> Hi List!
> 
> 
> First, I'm sorry for resurrecting an extremely old thread, but I've
> exhausted all other resources. We're experiencing this same "1 second
> retransmit" with ipv4 (including loopback). And the best part is, it
> can be replicated very easily using the 'closed' and 'tcping' tests
> provided by Jesse Young in the initial post. For reference:
> 
> 
> $ git clone git://github.com/jlyo/tcping.git
> $ cd tcping && make
> 
> 
> $ git clone git://github.com/jlyo/closed.git
> $ cd closed && make
> 
> 
> $ ./closed 0.0.0.0
> 
> 
> $ time ./tcping -f -p8009 0.0.0.0
> 
> 
> Results:
> 
> 
>         ...
>         response from 0.0.0.0:8009, seq=1907 time=0.02 ms
>         response from 0.0.0.0:8009, seq=1908 time=0.03 ms
>         response from 0.0.0.0:8009, seq=1909 time=999.11 ms
>         --- 0.0.0.0:8009 ping statistics ---
>         1909 responses, 1910 ok, 0.00% failed
>         round-trip min/avg/max = 0.0/0.6/999.1 ms
>         
>         
>         real    0m1.125s
>         user    0m0.008s
>         sys     0m0.104s
> 
> 
> 
> 
> Packet captures from tcpdump look remarkably similar to what Eric
> Dumazet shared. That eventually led me to this thread.
> 
> 
> This happens on a fresh Ubuntu 12.10 install, and also with our tuning
> parameters. (Includes increasing the syn backlog, open file
> descriptors, TCP memory, max orphans, etc.)  I've also seen the
> problem with other kernels, within EC2 and Azure. I have not been able
> to test with ipv6 yet.
> 
> 
> $ uname -a
> Linux test 3.5.0-21-generic #32-Ubuntu SMP Tue Dec 11 18:51:59 UTC
> 2012 x86_64 x86_64 x86_64 GNU/Linux
> 
> 
> I'm hoping to spark some interest in revisiting this issue (with focus
> on ipv4, this time).
> 
> 
> Thanks everyone!
> Jay
> 
Hi Jay

Not reproducible on current kernels (net-next tree for example)

ip netns add eric
ip netns exec eric ifconfig -a
ip netns exec eric ifconfig lo 127.0.0.1 up
ip netns exec eric ./closed 0.0.0.0 &
ip netns exec eric nstat
ip netns exec eric ./tcping -f -p8009 0.0.0.0
127.0.0.1:40832 Connected...response from 0.0.0.0:8009, seq=32799 time=0.04 ms
 closed
127.0.0.1:40999 Connected...response from 0.0.0.0:8009, seq=32800 time=0.04 ms
 closed
127.0.0.1:42795 Connected...response from 0.0.0.0:8009, seq=32801 time=0.20 ms
 closed
127.0.0.1:43226 Connected...response from 0.0.0.0:8009, seq=32802 time=0.07 ms
 closed
error connecting to host (99): Cannot assign requested address
................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................^C.--- 0.0.0.0:8009 ping statistics ---
33765 responses, 32803 ok, 0.00% failed
round-trip min/avg/max = 0.0/0.0/0.5 ms

# ip netns exec eric nstat
#kernel
IpInReceives                    197087             0.0
IpInDelivers                    197087             0.0
IpOutRequests                   197087             0.0
TcpActiveOpens                  32803              0.0
TcpPassiveOpens                 32803              0.0
TcpInSegs                       197087             0.0
TcpOutSegs                      197084             0.0
TcpRetransSegs                  3                  0.0
TcpOutRsts                      11                 0.0
TcpExtSyncookiesFailed          11                 0.0
TcpExtDelayedACKs               238                0.0
TcpExtDelayedACKLocked          248                0.0
TcpExtTCPPureAcks               65838              0.0
TcpExtTCPTimeouts               3                  0.0
IpExtInOctets                   10773240           0.0
IpExtOutOctets                  10773240           0.0


But yes, on a 3.5.X kernel you might hit a bug somewhere,

since the same sequence gives suspicious TcpExtListenDrops:

# ip netns exec eric nstat 
#kernel
IpInReceives                    49367              0.0
IpInDelivers                    49367              0.0
IpOutRequests                   49367              0.0
TcpActiveOpens                  8184               0.0
TcpPassiveOpens                 8184               0.0
TcpInSegs                       49367              0.0
TcpOutSegs                      49362              0.0
TcpRetransSegs                  5                  0.0
TcpExtDelayedACKs               63                 0.0
TcpExtDelayedACKLocked          32                 0.0
TcpExtListenOverflows           4                  0.0
TcpExtListenDrops               4                  0.0
TcpExtTCPPureAcks               16624              0.0
TcpExtTCPLossUndo               1                  0.0
TcpExtTCPTimeouts               5                  0.0
IpExtInOctets                   2698036            0.0
IpExtOutOctets                  2698036            0.0
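
Comparing the two nstat dumps above by eye comes down to checking whether the listen-queue counters (TcpExtListenOverflows, TcpExtListenDrops) are present and non-zero. A minimal sketch of that check follows, assuming the "Name value rate" line layout shown above. The helper names nstat_lookup and listen_queue_suspect are made up for illustration and are not part of nstat or the kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: find one counter in nstat-style text, where each
 * line looks like "Name  value  rate". Returns 1 and stores the value if
 * the counter is present, 0 otherwise. nstat omits counters that did not
 * change, so "absent" is treated by the caller as zero. */
static int nstat_lookup(const char *text, const char *name,
                        unsigned long long *out)
{
    assert(text != NULL && name != NULL && out != NULL);

    const char *line = text;
    while (*line != '\0') {
        char key[64];
        unsigned long long val;

        /* Header lines such as "#kernel" fail the numeric conversion. */
        if (sscanf(line, "%63s %llu", key, &val) == 2 &&
            strcmp(key, name) == 0) {
            *out = val;
            return 1;
        }

        const char *nl = strchr(line, '\n');
        if (nl == NULL)
            break;
        line = nl + 1;
    }
    return 0;
}

/* Hypothetical helper: non-zero if the dump reports listen-queue
 * overflows or drops, the counters called suspect in this thread. */
static int listen_queue_suspect(const char *text)
{
    unsigned long long overflows = 0, drops = 0;

    (void)nstat_lookup(text, "TcpExtListenOverflows", &overflows);
    (void)nstat_lookup(text, "TcpExtListenDrops", &drops);
    return overflows > 0 || drops > 0;
}
```

Feeding the text of the second dump to listen_queue_suspect() would flag it, while the first dump, which has no ListenOverflows or ListenDrops lines, would not.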


Jason Oster March 23, 2013, 7:17 p.m. UTC | #3
Hello Eric,

On Mar 23, 2013, at 9:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Hi Jay
> 
> Not reproducible on current kernels (net-next tree for example)

Thank you for looking into this so quickly! And it sounds like promising news, too. I'll experiment with newer kernels right away.

Thanks again,
Jay
Jason Oster March 24, 2013, 12:08 a.m. UTC | #4
Hi again,

Resending this; it didn't make it to the netdev list.

On Mar 23, 2013, at 12:17 PM, Jason Oster <parasytic@gmail.com> wrote:

> Hello Eric,
> 
> Thank you for looking into this so quickly! And it sounds like promising news, too. I'll experiment with newer kernels right away.
> 
> Thanks again,
> Jay


FYI net-next may have some changes not available in 3.9.0-rc3, because I can still reproduce it there:

$ uname -a
Linux test 3.9.0-030900rc3-generic #201303171935 SMP Sun Mar 17 23:36:17 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

This is from the Ubuntu kernel mainline PPA (on Raring Ringtail-dev). I haven't built the kernel yet.
Patch

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 36131d1..2dea4bb 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1255,6 +1255,13 @@  static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (!want_cookie || tmp_opt.tstamp_ok)
 		TCP_ECN_create_request(req, tcp_hdr(skb));
 
+	treq->iif = sk->sk_bound_dev_if;
+
+	/* So that link locals have meaning */
+	if (!sk->sk_bound_dev_if &&
+	    ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
+		treq->iif = inet6_iif(skb);
+
 	if (!isn) {
 		struct inet_peer *peer = NULL;
 
@@ -1264,12 +1271,6 @@  static int tcp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 			atomic_inc(&skb->users);
 			treq->pktopts = skb;
 		}
-		treq->iif = sk->sk_bound_dev_if;
-
-		/* So that link locals have meaning */
-		if (!sk->sk_bound_dev_if &&
-		    ipv6_addr_type(&treq->rmt_addr) & IPV6_ADDR_LINKLOCAL)
-			treq->iif = inet6_iif(skb);
 
 		if (want_cookie) {
 			isn = cookie_v6_init_sequence(sk, skb, &req->mss);
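
Reading the diff, the only change is where treq->iif gets assigned: before the patch the assignment sat inside the `if (!isn)` block, so it was skipped whenever the function was entered with a non-zero isn. As far as I can tell, that is the TIME_WAIT reuse case the commit message describes. The following is a simplified user-space model of that ordering, not kernel code. struct model_req, IIF_UNSET and the conn_request_old/conn_request_new names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Marker for "never assigned" in this model only. */
#define IIF_UNSET (-1)

/* Hypothetical stand-in for the request sock; only the field of interest. */
struct model_req {
    int iif; /* interface index later used when answering the SYN */
};

/* The assignment moved by the patch, modelled on the lines in the diff. */
static void model_set_iif(struct model_req *req, int bound_dev_if,
                          bool linklocal, int skb_iif)
{
    req->iif = bound_dev_if;
    /* So that link locals have meaning */
    if (!bound_dev_if && linklocal)
        req->iif = skb_iif;
}

/* Pre-patch ordering: iif is assigned only when isn == 0. */
static void conn_request_old(struct model_req *req, uint32_t isn,
                             int bound_dev_if, bool linklocal, int skb_iif)
{
    assert(req != NULL);
    if (!isn) {
        model_set_iif(req, bound_dev_if, linklocal, skb_iif);
        /* ISN selection would go here; not modelled. */
    }
}

/* Post-patch ordering: iif is assigned before the isn test. */
static void conn_request_new(struct model_req *req, uint32_t isn,
                             int bound_dev_if, bool linklocal, int skb_iif)
{
    assert(req != NULL);
    model_set_iif(req, bound_dev_if, linklocal, skb_iif);
    if (!isn) {
        /* ISN selection would go here; not modelled. */
    }
}
```

Calling both functions with a non-zero isn on a request initialised to IIF_UNSET shows the difference: the old ordering leaves iif untouched, the new one always sets it.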