Patchwork PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )

Submitter Eric Dumazet
Date April 21, 2010, 11:27 a.m.
Message ID <1271849253.7895.1929.camel@edumazet-laptop>
Permalink /patch/50654/
State Superseded
Delegated to: David Miller

Comments

Eric Dumazet - April 21, 2010, 11:27 a.m.
Here is the patch I use now and my test application is now able to open
and connect 1000000 sockets (ulimit -n 1000000)

Trick is bind_conflict() must refuse a socket to bind to a port on a non
null IP if another socket already uses same port on same IP.

Plus the previous patch sent (check a conflict before exiting the search
loop)

What do you think ?
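
To make the rule above concrete, here is a minimal user-space sketch of the address test before and after the patch. It is not kernel code: struct model_sock and both function names are hypothetical, and the model assumes the two sockets already share the same port and a compatible bound device, so only the address and reuse checks remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the few fields the kernel test reads. */
struct model_sock {
	uint32_t rcv_saddr; /* bound IPv4 address, 0 models a wildcard bind */
	bool reuse;         /* models sk_reuse (SO_REUSEADDR) */
	bool listening;     /* models sk_state == TCP_LISTEN */
};

/* Before the patch: two reuse, non-listening sockets never conflict. */
bool conflict_before(const struct model_sock *sk, const struct model_sock *sk2)
{
	assert(sk && sk2);
	if (!sk->reuse || !sk2->reuse || sk2->listening)
		return !sk2->rcv_saddr || !sk->rcv_saddr ||
		       sk2->rcv_saddr == sk->rcv_saddr;
	return false;
}

/* After the patch: the same non-null address is refused even with reuse. */
bool conflict_after(const struct model_sock *sk, const struct model_sock *sk2)
{
	assert(sk && sk2);
	if (!sk->reuse || !sk2->reuse || sk2->listening)
		return !sk2->rcv_saddr || !sk->rcv_saddr ||
		       sk2->rcv_saddr == sk->rcv_saddr;
	/* Both sockets have reuse set and sk2 is not listening. */
	return sk2->rcv_saddr && sk2->rcv_saddr == sk->rcv_saddr;
}
```

In this model, two reuse sockets on the same non-null address are accepted by conflict_before() and refused by conflict_after(), while two reuse sockets on different addresses are accepted by both.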



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
George B. - April 21, 2010, 4:52 p.m.
On Wed, Apr 21, 2010 at 4:27 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)

I believe we hit this very issue yesterday in our test lab.  We had a stress
test running of one of our applications with about a dozen instances
of it running on the box.  Suddenly dns requests began failing with
the complaint that it couldn't make a request out because there were
no sockets.

root@champagne:/proc/sys/net/ipv4> host gh
host: isc_socket_bind: address in use

Netstat showed 61580 total sockets (UDP and TCP) on the address being
used by the above dns request. (local port range 1025 65535).  That
dns request should not have been failing.

I noticed that the number of UDP sockets was close to the maximum
allowed by the port range, but they were across different IP
addresses, no one IP address had too many and there should have been
available ports on all IP addresses.

Further, the number of udp sockets in use seemed to hit the wall at a
little above 64,000 and I never got above that number.

If that is the normal behavior of the kernel, it could be a big
problem for scaling the application.
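
One way to probe the behavior described above is to bind UDP sockets to port 0 on one specific local address and watch which ports the kernel hands out. The sketch below is only an illustration under that assumption: bind_udp_ephemeral() is a hypothetical helper name, it uses nothing but standard BSD socket calls, and it is not the application from the stress test.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Bind a UDP socket to the given local IPv4 address with port 0 so the
 * kernel picks an ephemeral port. Returns the socket descriptor, or -1 on
 * failure; the chosen port is stored in *port_out in host byte order.
 */
int bind_udp_ephemeral(const char *ip, unsigned short *port_out)
{
	struct sockaddr_in addr;
	socklen_t len = sizeof(addr);
	int fd;

	assert(ip && port_out);

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(0); /* 0 asks the kernel to choose the port */
	if (inet_pton(AF_INET, ip, &addr.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
		close(fd);
		return -1;
	}

	*port_out = ntohs(addr.sin_port);
	return fd;
}
```

Calling this in a loop for each local address, keeping the descriptors open and counting successes per address, would show whether bind() starts failing on one address while that address still has free ports.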
Evgeniy Polyakov - April 21, 2010, 6:27 p.m.
On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> Here is the patch I use now and my test application is now able to open
> and connect 1000000 sockets (ulimit -n 1000000)
> 
> Trick is bind_conflict() must refuse a socket to bind to a port on a non
> null IP if another socket already uses same port on same IP.
> 
> Plus the previous patch sent (check a conflict before exiting the search
> loop)
> 
> What do you think ?

Looks good, but do we want to check only reused socket's address there?
What if one of the sockets does not have reuse option turned on, will it
break?
Eric Dumazet - April 21, 2010, 6:43 p.m.
On Wednesday, April 21, 2010 at 22:27 +0400, Evgeniy Polyakov wrote:
> On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > Here is the patch I use now and my test application is now able to open
> > and connect 1000000 sockets (ulimit -n 1000000)
> > 
> > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > null IP if another socket already uses same port on same IP.
> > 
> > Plus the previous patch sent (check a conflict before exiting the search
> > loop)
> > 
> > What do you think ?
> 
> Looks good, but do we want to check only reused socket's address there?
> What if one of the sockets does not have reuse option turned on, will it
> break?
> 

Well, if one socket doesn't have the reuse option turned on, the previous
test already works?

if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
	    sk2_rcv_saddr == sk_rcv_saddr)
		break;
} else if (reuse && sk2->sk_reuse &&
           sk2_rcv_saddr &&
           sk2_rcv_saddr == sk_rcv_saddr)
	break;

I failed to factorize this complex test :(
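
For what it is worth, one possible factoring of the test above is sketched below in user-space C, together with an exhaustive comparison against the test as posted. All function names are hypothetical, the socket fields are replaced by plain parameters, and this is only one candidate rewrite, not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The test as posted, with socket fields passed as plain values. */
bool conflict_as_posted(bool reuse, bool sk2_reuse, bool sk2_listen,
			uint32_t sk_rcv_saddr, uint32_t sk2_rcv_saddr)
{
	if (!reuse || !sk2_reuse || sk2_listen) {
		if (!sk2_rcv_saddr || !sk_rcv_saddr ||
		    sk2_rcv_saddr == sk_rcv_saddr)
			return true;
	} else if (reuse && sk2_reuse &&
		   sk2_rcv_saddr &&
		   sk2_rcv_saddr == sk_rcv_saddr)
		return true;
	return false;
}

/* Candidate factoring: same non-null address always conflicts; a wildcard
 * on either side conflicts only when reuse does not apply. */
bool conflict_factored(bool reuse, bool sk2_reuse, bool sk2_listen,
		       uint32_t sk_rcv_saddr, uint32_t sk2_rcv_saddr)
{
	bool strict = !reuse || !sk2_reuse || sk2_listen;
	bool same_addr = sk2_rcv_saddr && sk2_rcv_saddr == sk_rcv_saddr;
	bool wildcard = !sk2_rcv_saddr || !sk_rcv_saddr;

	return same_addr || (strict && wildcard);
}

/* Compare both forms over every flag combination and a few addresses
 * (0 is the wildcard, 1 and 2 are two distinct non-null addresses). */
int count_mismatches(void)
{
	static const uint32_t addrs[] = { 0, 1, 2 };
	const size_t n = sizeof(addrs) / sizeof(addrs[0]);
	int mismatches = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool reuse = bits & 1, sk2_reuse = bits & 2, listen = bits & 4;

		for (size_t i = 0; i < n; i++)
			for (size_t j = 0; j < n; j++)
				if (conflict_as_posted(reuse, sk2_reuse, listen,
						       addrs[i], addrs[j]) !=
				    conflict_factored(reuse, sk2_reuse, listen,
						      addrs[i], addrs[j]))
					mismatches++;
	}
	return mismatches;
}

/* Aborts if the two forms ever disagree. */
void check_equivalence(void)
{
	assert(count_mismatches() == 0);
}
```

Only equality and zero-ness of the two addresses matter to either form, so three address values cover every case the test can distinguish.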



Evgeniy Polyakov - April 21, 2010, 6:58 p.m.
On Wed, Apr 21, 2010 at 08:43:36PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> On Wednesday, April 21, 2010 at 22:27 +0400, Evgeniy Polyakov wrote:
> > On Wed, Apr 21, 2010 at 01:27:33PM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> > > Here is the patch I use now and my test application is now able to open
> > > and connect 1000000 sockets (ulimit -n 1000000)
> > > 
> > > Trick is bind_conflict() must refuse a socket to bind to a port on a non
> > > null IP if another socket already uses same port on same IP.
> > > 
> > > Plus the previous patch sent (check a conflict before exiting the search
> > > loop)
> > > 
> > > What do you think ?
> > 
> > Looks good, but do we want to check only reused socket's address there?
> > What if one of the sockets does not have reuse option turned on, will it
> > break?
> > 
> 
> Well, if one socket doesn't have the reuse option turned on, the previous
> test already works?
> 
> if (!reuse || !sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
> 	if (!sk2_rcv_saddr || !sk_rcv_saddr ||
> 	    sk2_rcv_saddr == sk_rcv_saddr)
> 		break;
> } else if (reuse && sk2->sk_reuse &&
>            sk2_rcv_saddr &&
>            sk2_rcv_saddr == sk_rcv_saddr)
> 	break;
> 
> I failed to factorize this complex test :(

Damn it, I tried multiple times :)
You are right of course!

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index e0a3e35..78cbc39 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -70,13 +70,17 @@  int inet_csk_bind_conflict(const struct sock *sk,
 		    (!sk->sk_bound_dev_if ||
 		     !sk2->sk_bound_dev_if ||
 		     sk->sk_bound_dev_if == sk2->sk_bound_dev_if)) {
+			const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
+
 			if (!reuse || !sk2->sk_reuse ||
 			    sk2->sk_state == TCP_LISTEN) {
-				const __be32 sk2_rcv_saddr = inet_rcv_saddr(sk2);
 				if (!sk2_rcv_saddr || !sk_rcv_saddr ||
 				    sk2_rcv_saddr == sk_rcv_saddr)
 					break;
-			}
+			} else if (reuse && sk2->sk_reuse &&
+				   sk2_rcv_saddr &&
+				   sk2_rcv_saddr == sk_rcv_saddr)
+				break;
 		}
 	}
 	return node != NULL;
@@ -120,9 +124,11 @@  again:
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
 						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
+							if (!inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
+								spin_unlock(&head->lock);
+								snum = smallest_rover;
+								goto have_snum;
+							}
 						}
 					}
 					goto next;