[2/2] socket: add minimum listen queue length sysctl

Message ID 1301077899-16482-2-git-send-email-hagen@jauu.net
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Hagen Paul Pfeifer March 25, 2011, 6:31 p.m. UTC
If a server programmer misjudges the network characteristics, the
backlog parameter passed to listen(2) may not be adequate to utilize the
host's capabilities and may lead to unnecessary SYN retransmissions. An
underestimated backlog value can thus form an artificial limitation.

A listen queue length of 8 is often far too small, but several
server authors do not know about this limitation (from Eric's server
setup):

ss -a | head
State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
LISTEN     0      8                       *:imaps                    *:*
LISTEN     0      8                       *:pop3s                    *:*
LISTEN     0      50                      *:mysql                    *:*
LISTEN     0      8                       *:pop3                     *:*
LISTEN     0      8                       *:imap2                    *:*
LISTEN     0      511                     *:www                      *:*

Until now it was not possible for the system (network) administrator to
increase this value. Instead, a bug report must be filed, the backlog
increased, and a new version released. Even worse, with closed source
software nothing can be done at all.

sysctl_min_syn_backlog provides the ability to increase the minimum
queue length.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/request_sock.h |    1 +
 net/core/request_sock.c    |    6 +++++-
 net/ipv4/af_inet.c         |    2 +-
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 4 files changed, 14 insertions(+), 2 deletions(-)

Comments

Rick Jones March 25, 2011, 8:24 p.m. UTC | #1
On Fri, 2011-03-25 at 19:31 +0100, Hagen Paul Pfeifer wrote:
> If a server programmer misjudges the network characteristics, the
> backlog parameter passed to listen(2) may not be adequate to utilize the
> host's capabilities and may lead to unnecessary SYN retransmissions. An
> underestimated backlog value can thus form an artificial limitation.
>
> A listen queue length of 8 is often far too small, but several
> server authors do not know about this limitation (from Eric's server
> setup):
> 
> ss -a | head
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port
> LISTEN     0      8                       *:imaps                    *:*
> LISTEN     0      8                       *:pop3s                    *:*
> LISTEN     0      50                      *:mysql                    *:*
> LISTEN     0      8                       *:pop3                     *:*
> LISTEN     0      8                       *:imap2                    *:*
> LISTEN     0      511                     *:www                      *:*
> 
> Until now it was not possible for the system (network) administrator to
> increase this value. Instead, a bug report must be filed, the backlog
> increased, and a new version released. Even worse, with closed source
> software nothing can be done at all.

Well, one could LD_PRELOAD something that intercepted listen() calls no?
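
As a rough illustration of that idea, the sketch below shows what such a
preloaded shim might look like. The environment variable
MIN_LISTEN_BACKLOG and the helper effective_backlog() are hypothetical
names invented for this example; only listen(2), dlsym(3) and RTLD_NEXT
are real interfaces. An interposer of this kind only affects dynamically
linked programs.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes RTLD_NEXT in <dlfcn.h> on glibc */
#endif
#include <dlfcn.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Hypothetical knob: the backlog floor is read from this variable. */
#define MIN_BACKLOG_ENV "MIN_LISTEN_BACKLOG"

/* Return the larger of the requested backlog and the configured floor. */
int effective_backlog(int requested, int floor)
{
    return requested < floor ? floor : requested;
}

/* Takes the place of libc's listen() when this object is preloaded. */
int listen(int sockfd, int backlog)
{
    static int (*real_listen)(int, int);
    const char *env = getenv(MIN_BACKLOG_ENV);
    int floor = env ? atoi(env) : 0;    /* unset variable: no floor */

    /* Look up the next definition of listen(), i.e. the one in libc. */
    if (!real_listen)
        real_listen = (int (*)(int, int))dlsym(RTLD_NEXT, "listen");
    if (!real_listen) {
        errno = ENOSYS;
        return -1;
    }

    return real_listen(sockfd, effective_backlog(backlog, floor));
}
```

Built as a shared object (for example with gcc -shared -fPIC, adding -ldl
on older glibc versions) and named in LD_PRELOAD, this would raise a
listen(fd, 8) call to whatever floor the variable specifies while leaving
larger requests untouched.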

> sysctl_min_syn_backlog provides the ability to increase the minimum
> queue length.

Is there already a similar minimum the admin can configure when an
application makes a "too small" explicit setsockopt() call against
SO_SNDBUF or SO_RCVBUF?

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Hagen Paul Pfeifer March 25, 2011, 11:51 p.m. UTC | #2
* Rick Jones | 2011-03-25 13:24:37 [-0700]:

Hello Rick

>Well, one could LD_PRELOAD something that intercepted listen() calls no?

Yes and no: yes for dynamically linked programs, no for statically
linked ones.

Furthermore, for distribution-shipped programs an administrator would not
want to alter the init script or the like. Editing /etc/sysctl.conf is as
simple as ...


>Is there already a similar minimum the admin can configure when an
>application makes a "too small" explicit setsockopt() call against
>SO_SNDBUF or SO_RCVBUF?

net.ipv4.tcp_rmem, net.ipv4.tcp_mem, net.core.rmem_default, ...?

IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
why. If he does not modify the buffer that is fine too, because _we_ tune the
buffers as well as we can - and we are good at this.

But the backlog is different. Often the programmer does _not_ know how to
tune this variable. And often the backlog depends on the target system, on
the network characteristics and the like.

Therefore we give the system administrator the _ability_ to tune the actual
backlog.


Hagen
Rick Jones March 26, 2011, 12:21 a.m. UTC | #3
On Sat, 2011-03-26 at 00:51 +0100, Hagen Paul Pfeifer wrote:
> * Rick Jones | 2011-03-25 13:24:37 [-0700]:
> 
> Hello Rick
> 
> >Well, one could LD_PRELOAD something that intercepted listen() calls no?
> 
> Yes and no: yes for dynamically linked programs, no for statically
> linked ones.
>
> Furthermore, for distribution-shipped programs an administrator would not
> want to alter the init script or the like. Editing /etc/sysctl.conf is as
> simple as ...
>
>
> >Is there already a similar minimum the admin can configure when an
> >application makes a "too small" explicit setsockopt() call against
> >SO_SNDBUF or SO_RCVBUF?
> 
> net.ipv4.tcp_rmem, net.ipv4.tcp_mem, net.core.rmem_default, ...?

I believe (based on my netperf experience) that tcp_rmem and tcp_wmem
aren't consulted when one makes an explicit setsockopt() call against the
SO_*BUF sizes, and that net.core.[rw]mem_default are used by UDP sockets:

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.35-28-generic #49-Ubuntu SMP Tue Mar 1 14:39:03 UTC 2011 x86_64 GNU/Linux
raj@tardy:~/netperf2_trunk$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 4194304
raj@tardy:~/netperf2_trunk$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 4194304
raj@tardy:~/netperf2_trunk$ sysctl net.core.wmem_default
net.core.wmem_default = 126976
raj@tardy:~/netperf2_trunk$ sysctl net.core.rmem_default
net.core.rmem_default = 126976

(lss == local socket send; rsr == remote socket receive) 

src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=16384
LSS_SIZE_END=2679048
RSR_SIZE=87380
RSR_SIZE_END=4194304

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -T udp -m 1024
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=126976
LSS_SIZE_END=126976
RSR_SIZE=126976
RSR_SIZE_END=126976

I believe that net.core.[rw]mem_max are the upper limits (modulo the
2X?) applied when making explicit setsockopt() calls:

raj@tardy:~/netperf2_trunk$ sysctl net.core.rmem_max
net.core.rmem_max = 131071
raj@tardy:~/netperf2_trunk$ sysctl net.core.wmem_max
net.core.wmem_max = 131071
raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -T udp -m 1024 -s 1M -S 1M
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE=262142
RSR_SIZE_END=262142
raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -s 1M -S 1M
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE=262142
RSR_SIZE_END=262142
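
The doubling and clamping visible in these runs (262142 is 2 x 131071,
the net.core.[rw]mem_max value shown above) can also be checked directly
with a small probe. This is a minimal sketch assuming Linux behaviour;
probe_rcvbuf() is a hypothetical helper name, not part of netperf.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Request `requested` bytes of receive buffer on a fresh UDP socket and
 * return the size the kernel reports back, or -1 on any error.
 */
int probe_rcvbuf(int requested)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    int granted = -1;
    socklen_t len = sizeof(granted);

    if (fd < 0)
        return -1;

    /* Explicit request, as an application calling setsockopt() would do. */
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0 ||
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        granted = -1;

    close(fd);
    return granted;
}
```

Calling probe_rcvbuf(1 << 20) on a system configured like the one above
would be expected to return twice the rmem_max limit rather than the
requested megabyte, and probe_rcvbuf(1) to return a small kernel-chosen
floor rather than 1.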

When one asks for single-byte socket buffers, though:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -s 1 -S 1
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=2048
LSS_SIZE_END=2048
RSR_SIZE=256
RSR_SIZE_END=256

One gets values that, at face value, don't seem to be related to sysctl
settings, although perhaps the receive socket buffer size comes from the
min MSS:

raj@tardy:~/netperf2_trunk$ sysctl -a | grep 256
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'fs.binfmt_misc.register'
vm.lowmem_reserve_ratio = 256	256	32
fs.mqueue.queues_max = 256
error: permission denied on key 'net.ipv4.route.flush'
net.ipv4.route.min_adv_mss = 256
error: permission denied on key 'net.ipv6.route.flush'
raj@tardy:~/netperf2_trunk$ sysctl -a | grep 2048
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'fs.binfmt_misc.register'
error: permission denied on key 'net.ipv4.route.flush'
net.core.optmem_max = 20480
net.ipv4.route.redirect_silence = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv6.xfrm6_gc_thresh = 2048
error: permission denied on key 'net.ipv6.route.flush'


> IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
> why. 

I admire your optimism - particularly in the face of all the 10GbE NIC
vendors' suggestions that everyone use 16 MB socket buffers (or at least
set the auto tuning limits to 16 MB).

> If he does not modify the buffer that is fine too, because _we_ tune the
> buffers as well as we can - and we are good at this.

The "bloat" folks might disagree :)

> But the backlog is different. Often the programmer does _not_ know how to
> tune this variable. And often the backlog depends on the target system, on
> the network characteristics and the like.

As do the settings for socket buffer sizes.  So, how is it that the
programmer is educated and intelligent enough to set a minimum socket
buffer size but not a minimum listen queue backlog?

> Therefore we give the system administrator the _ability_ to tune the actual
> backlog.

And, perhaps, do something that flies in the face of what the programmer
was trying to do by limiting how many connections could be queued, and
so change the behaviour for the N+1st connection attempt while the
service was backlogged.

It really is a rather existential "Who's right? The Programmer or the
Administrator" question.  And perhaps I am asking for a (possibly)
foolish consistency.

rick jones

Eric Dumazet March 26, 2011, 7:06 a.m. UTC | #4
On Saturday, 26 March 2011 at 00:51 +0100, Hagen Paul Pfeifer wrote:

> IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
> why. If he does not modify the buffer that is fine too, because _we_ tune the
> buffers as well as we can - and we are good at this.
>
> But the backlog is different. Often the programmer does _not_ know how to
> tune this variable. And often the backlog depends on the target system, on
> the network characteristics and the like.
>
> Therefore we give the system administrator the _ability_ to tune the actual
> backlog.

What you want to tune is not the backlog (the maximum number of
connections ready to be delivered to accept()), but the number of
SYN_RECV half-open connections still waiting for a second packet from
clients.

An application might really want a listen(fd, 1) to accept one
incoming connection, but still be able to survive a SYN flood.

By the way, you are still confused: tcp_max_syn_backlog has nothing to
do with the 'backlog'. As I already mentioned, it is a parameter that
caps the size of the hash table associated with a listener socket.

You can have a hash table with 1024 slots, and still have a backlog of
16384 for example.
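
The distinction can be made concrete with a userspace sketch that mirrors
the arithmetic in the reqsk_queue_alloc() and inet_listen() hunks of the
patch in this thread. The function names syn_table_entries() and
accept_backlog() are hypothetical and exist only for this illustration;
the code is a model of the posted diff, not kernel code, and assumes
inputs well below 2^31.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two >= n, modelling the kernel's roundup_pow_of_two(). */
static uint32_t roundup_pow_of_two_u32(uint32_t n)
{
    uint32_t p = 1;

    assert(n <= (UINT32_C(1) << 31));   /* keep the shift from overflowing */
    while (p < n)
        p <<= 1;
    return p;
}

/* Number of SYN hash table slots, following the patched reqsk_queue_alloc(). */
uint32_t syn_table_entries(uint32_t backlog, uint32_t max_syn_backlog,
                           uint32_t min_syn_backlog)
{
    uint32_t n = backlog < max_syn_backlog ? backlog : max_syn_backlog;
    uint32_t floor = min_syn_backlog > 8 ? min_syn_backlog : 8;

    if (n < floor)
        n = floor;                      /* the floor is applied after the cap */
    return roundup_pow_of_two_u32(n + 1);
}

/* Accept queue limit (sk_max_ack_backlog), following the patched inet_listen(). */
uint32_t accept_backlog(uint32_t backlog, uint32_t min_syn_backlog)
{
    return backlog > min_syn_backlog ? backlog : min_syn_backlog;
}
```

With a cap of 512 and no floor, syn_table_entries(16384, 512, 0) gives
1024 slots while accept_backlog(16384, 0) stays at 16384, which is the
case described above. With a floor of 64, a listen(fd, 1) caller ends up
with accept_backlog(1, 64) == 64, so the proposed sysctl also changes the
accept queue limit that the application asked for.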

Patch

diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 99e6e19..3e8865f 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -89,6 +89,7 @@  static inline void reqsk_free(struct request_sock *req)
 }
 
 extern int sysctl_max_syn_backlog;
+extern int sysctl_min_syn_backlog;
 
 /** struct listen_sock - listen state
  *
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 182236b..0e968b6 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -35,6 +35,9 @@ 
 int sysctl_max_syn_backlog = 256;
 EXPORT_SYMBOL(sysctl_max_syn_backlog);
 
+int sysctl_min_syn_backlog = 0;
+EXPORT_SYMBOL(sysctl_min_syn_backlog);
+
 int reqsk_queue_alloc(struct request_sock_queue *queue,
 		      unsigned int nr_table_entries)
 {
@@ -42,7 +45,8 @@  int reqsk_queue_alloc(struct request_sock_queue *queue,
 	struct listen_sock *lopt;
 
 	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
-	nr_table_entries = max_t(u32, nr_table_entries, 8);
+	nr_table_entries = max_t(u32, nr_table_entries,
+			max_t(u32, 8, sysctl_min_syn_backlog));
 	nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
 	lopt_size += nr_table_entries * sizeof(struct request_sock *);
 	if (lopt_size > PAGE_SIZE)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 807d83c..c580d7c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -213,7 +213,7 @@  int inet_listen(struct socket *sock, int backlog)
 		if (err)
 			goto out;
 	}
-	sk->sk_max_ack_backlog = backlog;
+	sk->sk_max_ack_backlog = max_t(u32, backlog, sysctl_min_syn_backlog);
 	err = 0;
 
 out:
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1a45665..cc03c62 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -298,6 +298,13 @@  static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_min_syn_backlog",
+		.data		= &sysctl_min_syn_backlog,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "ip_local_port_range",
 		.data		= &sysctl_local_ports.range,
 		.maxlen		= sizeof(sysctl_local_ports.range),