PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open a huge number of outgoing connections (unable to bind ... )

Message ID 1271808314.7895.614.camel@edumazet-laptop
State Superseded, archived
Delegated to: David Miller

Commit Message

Eric Dumazet April 21, 2010, 12:05 a.m. UTC
On Tuesday, April 20, 2010 at 16:49 -0700, Ben Greear wrote:
> On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
> > sysctl -a | grep local_port_range
> 
> [root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
> net.ipv4.ip_local_port_range = 10000	61000
> 
> I'm explicitly binding to local ports as well as local IPs, btw.
> 

I believe the bsockets 'optimization' is a bug, we should remove it.

This is a stable candidate (2.6.30+)

[PATCH net-next-2.6] tcp: remove bsockets count

Counting the number of bound sockets to avoid a loop is buggy, since we can't
know how many IP addresses are in use. When the threshold is reached, we try
5 random slots and can fail while there are plenty of available ports.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/inet_hashtables.h   |    2 --
 net/ipv4/inet_connection_sock.c |    5 -----
 net/ipv4/inet_hashtables.c      |    5 -----
 3 files changed, 12 deletions(-)
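
To make the failure mode described in the changelog concrete, here is a small,
hypothetical user-space illustration (not part of the patch; the numbers are
invented, and the threshold test simply mirrors the (high - low) + 1 check that
the inet_connection_sock.c hunk below removes): a global count of bound sockets
says nothing about how many (local address, port) pairs are actually still free.

#include <stdio.h>

int main(void)
{
	const int low = 10000, high = 61000;	/* ip_local_port_range from the report */
	const int range = high - low + 1;	/* 51001 ports per local address */
	const int num_addrs = 10;		/* hypothetical number of local IPs */
	const long bsockets = 60000;		/* sockets bound so far (invented) */
	long total_pairs = (long)range * num_addrs;

	printf("port range size   : %d\n", range);
	printf("bound sockets     : %ld\n", bsockets);
	printf("free (addr, port) : %ld of %ld (about %.0f%%)\n",
	       total_pairs - bsockets, total_pairs,
	       100.0 * (total_pairs - bsockets) / total_pairs);

	if (bsockets > range)
		printf("old heuristic: threshold exceeded, stop scanning -> "
		       "bind() can fail despite free ports\n");
	return 0;
}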




Comments

Gaspar Chilingarov April 21, 2010, 12:12 a.m. UTC | #1
2010/4/21 Eric Dumazet <eric.dumazet@gmail.com>:
> On Tuesday, April 20, 2010 at 16:49 -0700, Ben Greear wrote:
>> On 04/20/2010 04:35 PM, Gaspar Chilingarov wrote:
>> > sysctl -a | grep local_port_range
>>
>> [root@ct503-10G-09 ~]# sysctl -a | grep local_port_range
>> net.ipv4.ip_local_port_range = 10000  61000
>>
>> I'm explicitly binding to local ports as well as local IPs, btw.
>>
>
> I believe the bsockets 'optimization' is a bug, we should remove it.
>
> This is a stable candidate (2.6.30+)
>
> [PATCH net-next-2.6] tcp: remove bsockets count
>
> Counting the number of bound sockets to avoid a loop is buggy, since we can't
> know how many IP addresses are in use. When the threshold is reached, we try
> 5 random slots and can fail while there are plenty of available ports.
>

Thank you very much for the patch - I will try it.

In FreeBSD I was able to add about 32 class C networks (8192 IPs) on a
single interface (never tried to do that in Linux yet :) - so you really
never know how many IPs are available. Tens or even up to a hundred IPs
on a single machine are not that unusual in a hosting environment at all.

/Gaspar
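
As background for readers, the explicit bind-before-connect that Ben and Gaspar
describe looks roughly like the sketch below (a minimal, hypothetical example;
the addresses, ports, and helper name are placeholders, not taken from the
thread). It is this bind() call that was reported to fail with "unable to bind"
long before the port space was really exhausted.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open one outgoing TCP connection bound to a chosen local IP and port. */
static int connect_from(const char *local_ip, int local_port,
			const char *dst_ip, int dst_port)
{
	struct sockaddr_in laddr = { 0 }, daddr = { 0 };
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	laddr.sin_family = AF_INET;
	laddr.sin_port = htons(local_port);	/* 0 would let the kernel pick */
	inet_pton(AF_INET, local_ip, &laddr.sin_addr);

	/* The bind() the thread is about: with many local addresses this was
	 * reported to fail even though plenty of (addr, port) pairs were free. */
	if (bind(fd, (struct sockaddr *)&laddr, sizeof(laddr)) < 0) {
		perror("bind");
		close(fd);
		return -1;
	}

	daddr.sin_family = AF_INET;
	daddr.sin_port = htons(dst_port);
	inet_pton(AF_INET, dst_ip, &daddr.sin_addr);

	if (connect(fd, (struct sockaddr *)&daddr, sizeof(daddr)) < 0) {
		perror("connect");
		close(fd);
		return -1;
	}
	return fd;
}

Calling connect_from() in a loop over many local addresses and ports is the
kind of load-generation workload described in the original report.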
David Miller April 21, 2010, 12:14 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 02:05:14 +0200

> [PATCH net-next-2.6] tcp: remove bsockets count
> 
> Counting the number of bound sockets to avoid a loop is buggy, since we can't
> know how many IP addresses are in use. When the threshold is reached, we try
> 5 random slots and can fail while there are plenty of available ports.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Hmmm, yes indeed it seems there are improper assumptions made by
this scheme.

This is a tricky area, so I'll wait for some test results from the
reporter and go over the code a few times myself to make sure
we get this right.
Evgeniy Polyakov April 21, 2010, 12:30 a.m. UTC | #3
On Wed, Apr 21, 2010 at 02:05:14AM +0200, Eric Dumazet (eric.dumazet@gmail.com) wrote:
> I believe the bsockets 'optimization' is a bug, we should remove it.
> 
> This is a stable candidate (2.6.30+)
> 
> [PATCH net-next-2.6] tcp: remove bsockets count
> 
> Counting the number of bound sockets to avoid a loop is buggy, since we can't
> know how many IP addresses are in use. When the threshold is reached, we try
> 5 random slots and can fail while there are plenty of available ports.

To go back to the old exponential bind() times, you would need to revert the
whole original patch, including the magic number 5, not only the bsockets count.

But the actual problem is not in this number, but in deeper logic.
Previously we scanned the whole table; now we have 5 attempts to
find at least one bucket (without a conflict) into which to insert the
new socket. For a large number of addresses it is apparently possible
that all 5 times we will randomly select buckets that conflict.
As a dumb solution we can increase the number of attempts to infinity, or
fall back to a whole-table search after several random attempts, which is a
bit more clever, I think.
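
For what it's worth, a rough user-space sketch of the "few random attempts,
then whole-table search" fallback Evgeniy describes might look like the
following (port_is_free() and the toy bitmap are hypothetical stand-ins for the
kernel's bind-conflict check, not real kernel interfaces). Under the simplified
model he describes, if a fraction f of the range is free, 5 independent random
probes all miss with probability roughly (1 - f)^5, so with only a few ports
left the random-only approach fails most of the time, while the exhaustive
fallback fails only when the range is genuinely full.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define LOW		10000
#define HIGH		61000
#define RANDOM_ATTEMPTS	5

/* Toy stand-in for the kernel's bind-conflict check: one flat "port used"
 * table for a single local address (hypothetical, illustration only). */
static bool port_used[HIGH + 1];

static bool port_is_free(int port)
{
	return !port_used[port];
}

static int pick_port(void)
{
	int range = HIGH - LOW + 1;
	int port, i;

	/* Fast path: a few random probes (the behaviour Evgeniy describes). */
	for (i = 0; i < RANDOM_ATTEMPTS; i++) {
		port = LOW + rand() % range;
		if (port_is_free(port))
			return port;
	}

	/* Slow path: exhaustive scan, so failure is reported only when the
	 * range really is exhausted for this local address. */
	for (port = LOW; port <= HIGH; port++)
		if (port_is_free(port))
			return port;

	return -1;	/* genuinely no free port left */
}

int main(void)
{
	int i;

	/* Mark almost the whole range as used so the random probes are very
	 * likely to miss, then show that the fallback scan still succeeds. */
	for (i = LOW; i <= HIGH - 3; i++)
		port_used[i] = true;

	printf("picked port: %d\n", pick_port());
	return 0;
}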
David Miller April 21, 2010, 2:04 a.m. UTC | #4
From: Evgeniy Polyakov <zbr@ioremap.net>
Date: Wed, 21 Apr 2010 04:30:22 +0400

> To go back to the old exponential bind() times, you would need to revert the
> whole original patch, including the magic number 5, not only the bsockets count.

Indeed, if we keep that '5' thing there, it's still going to fail in much
the same way.

> But the actual problem is not in this number, but in deeper logic.
> Previously we scanned the whole table; now we have 5 attempts to
> find at least one bucket (without a conflict) into which to insert the
> new socket. For a large number of addresses it is apparently possible
> that all 5 times we will randomly select buckets that conflict.
> As a dumb solution we can increase the number of attempts to infinity, or
> fall back to a whole-table search after several random attempts, which is a
> bit more clever, I think.

If the random number generator is not too terrible, just using an infinite
limit would be roughly equivalent.

Patch

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 74358d1..e0f3a05 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -150,8 +150,6 @@  struct inet_hashinfo {
 	 */
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
-
-	atomic_t			bsockets;
 };
 
 static inline struct inet_ehash_bucket *inet_ehash_bucket(
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..0bbfd00 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -119,11 +119,6 @@  again:
 					    (tb->num_owners < smallest_size || smallest_size == -1)) {
 						smallest_size = tb->num_owners;
 						smallest_rover = rover;
-						if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) {
-							spin_unlock(&head->lock);
-							snum = smallest_rover;
-							goto have_snum;
-						}
 					}
 					goto next;
 				}
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 2b79377..4bc921f 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -62,8 +62,6 @@  void inet_bind_hash(struct sock *sk, struct inet_bind_bucket *tb,
 {
 	struct inet_hashinfo *hashinfo = sk->sk_prot->h.hashinfo;
 
-	atomic_inc(&hashinfo->bsockets);
-
 	inet_sk(sk)->inet_num = snum;
 	sk_add_bind_node(sk, &tb->owners);
 	tb->num_owners++;
@@ -81,8 +79,6 @@  static void __inet_put_port(struct sock *sk)
 	struct inet_bind_hashbucket *head = &hashinfo->bhash[bhash];
 	struct inet_bind_bucket *tb;
 
-	atomic_dec(&hashinfo->bsockets);
-
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
 	__sk_del_bind_node(sk);
@@ -551,7 +547,6 @@  void inet_hashinfo_init(struct inet_hashinfo *h)
 {
 	int i;
 
-	atomic_set(&h->bsockets, 0);
 	for (i = 0; i < INET_LHTABLE_SIZE; i++) {
 		spin_lock_init(&h->listening_hash[i].lock);
 		INIT_HLIST_NULLS_HEAD(&h->listening_hash[i].head,