diff mbox

Establishing more than 64K outgoing TCP connections

Message ID 241CE5A1B2AFFA4987A663380D74C0BB3FD69B08A7@MBX73.ad2.softcom.biz
State Rejected, archived
Delegated to: David Miller
Headers show

Commit Message

Yavor Goulishev Oct. 28, 2008, 9:06 p.m. UTC
Hi guys,
I'm sending you a tiny patch for the kernel.
The change allowed me to establish more than 64K outgoing TCP connections from a single Linux box.
Being able to use multihomed client box to load test a server reduced dramatically the number of physical boxes I needed.
The other way to go was virtual machines, but you can't beat the simplicity of the single multihomed client.
Also the patch fixes an important symmetry.
Using the multihome setup is a simple technique to overcome the 64K limit on the server side.
The patch makes it possible on the client side too.

The fix is only for TCP but probably the same issue exists for UDP.
I didn't had the time to investigate.

The patch is against the latest code base 2.6.28-rc2.

--Yavor

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Nov. 2, 2008, 4:15 a.m. UTC | #1
From: Yavor Goulishev <yavor@simplifymedia.com>
Date: Tue, 28 Oct 2008 17:06:33 -0400

> I'm sending you a tiny patch for the kernel.  The change allowed me
> to establish more than 64K outgoing TCP connections from a single
> Linux box.  Being able to use multihomed client box to load test a
> server reduced dramatically the number of physical boxes I needed.
> The other way to go was virtual machines, but you can't beat the
> simplicity of the single multihomed client.  Also the patch fixes an
> important symmetry.  Using the multihome setup is a simple technique
> to overcome the 64K limit on the server side.  The patch makes it
> possible on the client side too.
> 
> The fix is only for TCP but probably the same issue exists for UDP.
> I didn't had the time to investigate.
> 
> The patch is against the latest code base 2.6.28-rc2.

As I mentioned in my initial reply to you, this exact patch
has been proposed in the future and rejected.

The problem is that although it fixes the limitation, it
introduces a new problem.

The current code tries to keep the hash chains as shallow
as possible.  One part of how it achieves this is that it
tries to use up all the available ports using it's
scanning algorithm.  Your change works against this, and
will tend to make the hash chains deeper.

A solution needs to be devised which handles both issues.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 6:12 a.m. UTC | #2
OK,
I see that you want to access the hash table in O(1)  and
you try to get a port number at empty bucket. This is clever.
However having hash table size 65536 and hash function just the port number can
work only for 64K binds, right? If I need more they need to go in the chains.

So may be having a better hash function and configurable table size (/proc/sys/net/...)
can work without limitation and with good performance.
For example:
    hash_fn = (port + local_ip) % size

--Yavor
David Miller Nov. 7, 2008, 6:23 a.m. UTC | #3
From: Yavor Goulishev <yavor@simplifymedia.com>
Date: Fri, 7 Nov 2008 01:12:36 -0500

> I see that you want to access the hash table in O(1)  and
> you try to get a port number at empty bucket. This is clever.
> However having hash table size 65536 and hash function just the port number can
> work only for 64K binds, right? If I need more they need to go in the chains.
> 
> So may be having a better hash function and configurable table size (/proc/sys/net/...)
> can work without limitation and with good performance.
> For example:
>     hash_fn = (port + local_ip) % size

All you have to preserve is that the hash distribution is evenly
spread.  That is what your patch violated.  I'm not asking for O(1)
access, I'm asking for O(num_ports_bound / hash_table_size) access.

Trying to solve all other kinds of issues just distracts from the
core issue.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 6:57 a.m. UTC | #4
I'm suggesting a solution different from my original patch.
To overcome the limit and preserver the current properties you can
just change the hash function and the the table size.
This is just an idea.

The current hash is only the port number.
Not sure why you want linear access in a hash table though:
  >  I'm asking for O(num_ports_bound / hash_table_size) access.
What is this? Searching for unused port number?
It would stay the same if you use "(port + ip) % size".

--Yavor
David Miller Nov. 7, 2008, 7:04 a.m. UTC | #5
From: Yavor Goulishev <yavor@simplifymedia.com>
Date: Fri, 7 Nov 2008 01:57:34 -0500

> The current hash is only the port number.
> Not sure why you want linear access in a hash table though:
>   >  I'm asking for O(num_ports_bound / hash_table_size) access.
> What is this? Searching for unused port number?

Come back when you figure it out then.  I really don't have time to
explain it at the moment, perhaps someone else can.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 7:15 a.m. UTC | #6
Now I know why the kernel stays broken for so long.
Some people don't want to admit their errors and fix the issues.
Or at least explain them so that someone else can fix it for them.
I hope you work on something really important!

--Yavor
David Miller Nov. 7, 2008, 7:30 a.m. UTC | #7
From: Yavor Goulishev <yavor@simplifymedia.com>
Date: Fri, 7 Nov 2008 02:15:15 -0500

> Now I know why the kernel stays broken for so long.
> Some people don't want to admit their errors and fix the issues.
> Or at least explain them so that someone else can fix it for them.
> I hope you work on something really important!

Yes, a locally exploitable DoS that effects every Linux system
out there.

Are you done complaining now?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 7:40 a.m. UTC | #8
It is not complain it is an observation.
And you are proving me right again!
But I'm done. You have it under control I'm sure.

--Yavor
Rémi Denis-Courmont Nov. 7, 2008, 8:36 a.m. UTC | #9
On Sunday 02 November 2008 06:15:41 ext David Miller, you wrote:
> As I mentioned in my initial reply to you, this exact patch
> has been proposed in the future and rejected.
                           ^^^^^^
You've just destroyed the last remaining bits of hope of ever becoming a great 
hacker, for I know I will never be able to know future patches :(
Evgeniy Polyakov Nov. 7, 2008, 8:45 a.m. UTC | #10
On Fri, Nov 07, 2008 at 01:57:34AM -0500, Yavor Goulishev (yavor@simplifymedia.com) wrote:
> I'm suggesting a solution different from my original patch.
> To overcome the limit and preserver the current properties you can
> just change the hash function and the the table size.
> This is just an idea.

You can use bindtodevice and reuseaddr socket options already.

> The current hash is only the port number.
> Not sure why you want linear access in a hash table though:
>   >  I'm asking for O(num_ports_bound / hash_table_size) access.
> What is this? Searching for unused port number?
> It would stay the same if you use "(port + ip) % size".

When sockets (hash results) are evenly distributed over the hash table,
access will be O(num_of_elemets / hash_table_size), with particulary big
table it will be O(1).

If you will add any other fields into hash in inet_csk_get_port(), this
will not change anything, since it will just switch hash table bucket to
be checked. When we found hash bucket for given port, all its entries
are checked to have the same parameters as in provided data. See how
bind_conflict() callback is invoked in inet_csk_get_port().
Yavor Goulishev Nov. 7, 2008, 3:56 p.m. UTC | #11
Hey,
I"m not kernel hacker and I don't want to be.
I stumbled on something and I fixed it.
That's not my focus or interest.
I've spend  a couple of hours on it, just to unblock my project.
That's all.

What was a complete surprise for me was fact that
something so fundamental can be broken for so long.
After wasting some time on your mailing list I know
that it is not technical issue.

--Yavor
Eric Dumazet Nov. 7, 2008, 4:23 p.m. UTC | #12
Please dont top post your messages on this list.

Yavor Goulishev a écrit :
> Hey,
> I"m not kernel hacker and I don't want to be.
> I stumbled on something and I fixed it.
> That's not my focus or interest.
> I've spend  a couple of hours on it, just to unblock my project.
> That's all.
> 
> What was a complete surprise for me was fact that
> something so fundamental can be broken for so long.
> After wasting some time on your mailing list I know
> that it is not technical issue.

Thats the deal.

You post a patch, other people comment on it, and if its good,
patch is accepted. Linux kernel developement is not an easy task.
Not being a "kernel hacker" wont allow you to bypass the rules.

As many people are concerned, its even more important to adopt
a right attitude, because if people are upset, they wont even
consider your patch. Yes, thats a pity.

Your concern is valid, but your patch in its current form
is not *the* solution.

About your idea of having a hash (port , IP), it is not
applicable. If you believe I am wrong, please post
a new patch so that truth can be revealed.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 4:38 p.m. UTC | #13

Evgeniy Polyakov Nov. 7, 2008, 5:25 p.m. UTC | #14
On Fri, Nov 07, 2008 at 11:38:14AM -0500, Yavor Goulishev (yavor@simplifymedia.com) wrote:
> OK, but this is mainly achieved with good hash function and proper table size.
> My original patch wouldn't change the locations(buckets) of the ports. It will just chose different ports.
> The hack for picking a good port number just looks for empty bucket.

Your original patch just added full check into bucket selection
algorithm, so that you could jump into the next bucket early.

> > If you will add any other fields into hash in inet_csk_get_port(), this
> > will not change anything, since it will just switch hash table bucket to
> > be checked. When we found hash bucket for given port, all its entries
> > are checked to have the same parameters as in provided data. See how
> > bind_conflict() callback is invoked in inet_csk_get_port().
> 
> It would change. Switching the bucket allows you to grow the table size and keep O(1).
> Without changing the hash function, growing the table size will not help.
> Currently all binds on the same port will go to the same bucket.

Listen sockets have a priviledge to steal ports, so they can not be
accessed by any other bind, if you will distribute sockets with the
same port over the whole table, listen sockets will not caught this.
David Miller Nov. 7, 2008, 7:54 p.m. UTC | #15
From: Yavor Goulishev <yavor@simplifymedia.com>
Date: Fri, 7 Nov 2008 11:38:14 -0500

> It would change. Switching the bucket allows you to grow the table
> size and keep O(1).  Without changing the hash function, growing the
> table size will not help.  Currently all binds on the same port will
> go to the same bucket.

I hear a lot of talk about a "new hash function" but what in the world are you
going to hash on other than the port?

Listening sockets can bind to just a specific port, with source and
destination address set to "ANY" (ie. wildcard).

Therefore, if you use some idea like using saddr or daddr in the hash
function, then a listening socket bind will have to walk the entire table
to see if any socket exists already bound to that port.

You really lack some fundamental understanding of the code you are trying
to change, and the things that code must cater to.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Yavor Goulishev Nov. 7, 2008, 10:37 p.m. UTC | #16
> Therefore, if you use some idea like using saddr or daddr in the hash
> function, then a listening socket bind will have to walk the entire table
> to see if any socket exists already bound to that port.

It is not the entire table. It is O(number_of_local_interfaces).
In general, if you have more then one interface you either need to flatten the
ip/port space and have a smart way to access it in O(1) or you have the O(number_of_local_interfaces).
The case of ANY can be solved with special ANY interface. So you can get O(1) even there, but the space will grow.

--Yavor


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Evgeniy Polyakov Nov. 7, 2008, 11:04 p.m. UTC | #17
On Fri, Nov 07, 2008 at 05:37:33PM -0500, Yavor Goulishev (yavor@simplifymedia.com) wrote:
> It is not the entire table. It is O(number_of_local_interfaces).
> In general, if you have more then one interface you either need to flatten the
> ip/port space and have a smart way to access it in O(1) or you have the O(number_of_local_interfaces).
> The case of ANY can be solved with special ANY interface. So you can get O(1) even there, but the space will grow.

As was already pointed, you can use SO_BINDTODEVICE for this purpose.
Next solution you may think of is already implemented as SO_REUSEADDR
socket option.
diff mbox

Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index bd1278a..d72fba6 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -109,7 +109,9 @@  int inet_csk_get_port(struct sock *sk, unsigned short snum)
                                        hashinfo->bhash_size)];
                        spin_lock(&head->lock);
                        inet_bind_bucket_for_each(tb, node, &head->chain)
-                               if (tb->ib_net == net && tb->port == rover)
+                               if (tb->ib_net == net &&
+                                   tb->port == rover &&
+                                   inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb))
                                        goto next;
                        break;
                next: