Message ID | 1301077899-16482-2-git-send-email-hagen@jauu.net |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
On Fri, 2011-03-25 at 19:31 +0100, Hagen Paul Pfeifer wrote:
> In case a server programmer misjudges the network characteristics, the
> backlog parameter for listen(2) may not be adequate to utilize the host's
> capabilities and may lead to unnecessary SYN retransmissions - thus an
> underestimated backlog value can form an artificial limitation.
>
> A listen queue length of 8 is often way too small, but several
> server authors do not know about this limitation (from Eric's server
> setup):
>
> ss -a | head
> State      Recv-Q Send-Q   Local Address:Port    Peer Address:Port
> LISTEN     0      8                    *:imaps              *:*
> LISTEN     0      8                    *:pop3s              *:*
> LISTEN     0      50                   *:mysql              *:*
> LISTEN     0      8                    *:pop3               *:*
> LISTEN     0      8                    *:imap2              *:*
> LISTEN     0      511                  *:www                *:*
>
> Until now it was not possible for the system (network) administrator to
> increase this value. A bug report must be filed, the backlog increased,
> and a new version released, or even worse: if using closed source
> software you cannot do anything.

Well, one could LD_PRELOAD something that intercepted listen() calls, no?

> sysctl_min_syn_backlog provides the ability to increase the minimum
> queue length.

Is there already a similar minimum the admin can configure when the
application makes a "too small" explicit setsockopt() call against
SO_SNDBUF or SO_RCVBUF?

rick jones

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Rick Jones | 2011-03-25 13:24:37 [-0700]:

Hello Rick

>Well, one could LD_PRELOAD something that intercepted listen() calls, no?

Noes, for dynamically linked programs yes, for statically linked ones no.

Furthermore, for distribution shipped programs an administrator would not
alter the init script or something. Editing /etc/sysctl.conf is as simple
as ...

>Is there already a similar minimum the admin can configure when the
>application makes a "too small" explicit setsockopt() call against
>SO_SNDBUF or SO_RCVBUF?

net.ipv4.tcp_rmem, net.ipv4.tcp_mem, net.core.rmem_default, ...?

IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
why. If he does not modify the buffer it is fine too, because _we_ tune the
buffers as well as we can - and we are good at this.

But, the backlog is different. Often the programmer does _not_ know how to
tune this variable. And, often the backlog depends on the target system, on
the network characteristics and the like.

Therefore we provide the system administrator the _ability_ to tune the actual
backlog.

Hagen
On Sat, 2011-03-26 at 00:51 +0100, Hagen Paul Pfeifer wrote:
> * Rick Jones | 2011-03-25 13:24:37 [-0700]:
>
> Hello Rick
>
> >Well, one could LD_PRELOAD something that intercepted listen() calls, no?
>
> Noes, for dynamically linked programs yes, for statically linked ones no.
>
> Furthermore, for distribution shipped programs an administrator would not
> alter the init script or something. Editing /etc/sysctl.conf is as simple
> as ...
>
> >Is there already a similar minimum the admin can configure when the
> >application makes a "too small" explicit setsockopt() call against
> >SO_SNDBUF or SO_RCVBUF?
>
> net.ipv4.tcp_rmem, net.ipv4.tcp_mem, net.core.rmem_default, ...?

I believe (based on my netperf experience) tcp_rmem and tcp_wmem aren't
consulted when one makes an explicit setsockopt() call against the
SO_*BUF sizes, and the net.core.[rw]mem_default are used by UDP sockets:

raj@tardy:~/netperf2_trunk$ uname -a
Linux tardy 2.6.35-28-generic #49-Ubuntu SMP Tue Mar 1 14:39:03 UTC 2011 x86_64 GNU/Linux
raj@tardy:~/netperf2_trunk$ sysctl net.ipv4.tcp_rmem
net.ipv4.tcp_rmem = 4096 87380 4194304
raj@tardy:~/netperf2_trunk$ sysctl net.ipv4.tcp_wmem
net.ipv4.tcp_wmem = 4096 16384 4194304
raj@tardy:~/netperf2_trunk$ sysctl net.core.wmem_default
net.core.wmem_default = 126976
raj@tardy:~/netperf2_trunk$ sysctl net.core.rmem_default
net.core.rmem_default = 126976

(lss == local socket send; rsr == remote socket receive)

src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=16384
LSS_SIZE_END=2679048
RSR_SIZE=87380
RSR_SIZE_END=4194304

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -T udp -m 1024
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=126976
LSS_SIZE_END=126976
RSR_SIZE=126976
RSR_SIZE_END=126976

I believe that net.core.[rw]mem_max are the upper limits (modulo the 2X?)
applied when making explicit setsockopt() calls:

raj@tardy:~/netperf2_trunk$ sysctl net.core.rmem_max
net.core.rmem_max = 131071
raj@tardy:~/netperf2_trunk$ sysctl net.core.wmem_max
net.core.wmem_max = 131071

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -T udp -m 1024 -s 1M -S 1M
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE=262142
RSR_SIZE_END=262142

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -s 1M -S 1M
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=262142
LSS_SIZE_END=262142
RSR_SIZE=262142
RSR_SIZE_END=262142

When one asks for single-byte socket buffers, though:

raj@tardy:~/netperf2_trunk$ src/netperf -t omni -- -k lss_size,lss_size_end,rsr_size,rsr_size_end -s 1 -S 1
OMNI TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost.localdomain (127.0.0.1) port 0 AF_INET : demo
LSS_SIZE=2048
LSS_SIZE_END=2048
RSR_SIZE=256
RSR_SIZE_END=256

One gets values that at face value don't seem to be related to sysctl
settings. Although perhaps the receive socket size comes from the min mss:

raj@tardy:~/netperf2_trunk$ sysctl -a | grep 256
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'fs.binfmt_misc.register'
vm.lowmem_reserve_ratio = 256 256 32
fs.mqueue.queues_max = 256
error: permission denied on key 'net.ipv4.route.flush'
net.ipv4.route.min_adv_mss = 256
error: permission denied on key 'net.ipv6.route.flush'
raj@tardy:~/netperf2_trunk$ sysctl -a | grep 2048
error: permission denied on key 'kernel.cad_pid'
error: permission denied on key 'fs.binfmt_misc.register'
error: permission denied on key 'net.ipv4.route.flush'
net.core.optmem_max = 20480
net.ipv4.route.redirect_silence = 2048
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv6.xfrm6_gc_thresh = 2048
error: permission denied on key 'net.ipv6.route.flush'

> IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
> why.

I admire your optimism - particularly in the face of all the 10GbE NIC
vendors' suggestions that everyone use 16 MB socket buffers (or at least
set the auto tuning limits to 16 MB).

> If he does not modify the buffer it is fine too, because _we_ tune the
> buffers as well as we can - and we are good at this.

The "bloat" folks might disagree :)

> But, the backlog is different. Often the programmer does _not_ know how to
> tune this variable. And, often the backlog depends on the target system, on
> the network characteristics and the like.

As do the settings for socket buffer sizes. So, how is it that the
programmer is educated and intelligent enough to set a minimum socket
buffer size but not a minimum listen queue backlog?

> Therefore we provide the system administrator the _ability_ to tune the actual
> backlog.

And, perhaps, do something that flies in the face of what the programmer
was trying to do, by limiting how many connections could be queued and so
changing the behaviour for the N+1st connection attempt while service was
backlogged.

It really is a rather existential "Who's right? The Programmer or the
Administrator" question. And perhaps I am asking whether there should be
a (possibly) foolish consistency.

rick jones
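The netperf numbers above are consistent with a simple rule: an explicit
SO_SNDBUF/SO_RCVBUF request is capped at net.core.[rw]mem_max, doubled
(the doubling is described in socket(7)), and then raised to a small fixed
floor. The following is only a user-space model of that rule, not kernel
code. The function name is made up for illustration, and the two floor
values (2048 and 256) are assumptions read off the single-byte test above
rather than taken from a kernel reference.

```c
#include <stdint.h>

/* Floors inferred from the "-s 1 -S 1" run above; treat them as assumptions. */
#define ASSUMED_MIN_SNDBUF 2048u
#define ASSUMED_MIN_RCVBUF 256u

/*
 * Hypothetical model of the size reported back after an explicit
 * SO_SNDBUF/SO_RCVBUF request: cap the request at the *_max sysctl,
 * double it, then apply the floor. Assumes sysctl_max < 2^31 so the
 * doubling cannot overflow.
 */
static uint32_t effective_bufsize(uint32_t requested, uint32_t sysctl_max,
                                  uint32_t floor)
{
    uint32_t capped = requested < sysctl_max ? requested : sysctl_max;
    uint32_t doubled = capped * 2u;

    return doubled > floor ? doubled : floor;
}
```

With rmem_max = wmem_max = 131071 as shown above, a 1M request models to
262142 and a 1-byte request models to 2048 (send) or 256 (receive), which
matches the LSS_SIZE/RSR_SIZE values in the output.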
On Saturday, 26 March 2011 at 00:51 +0100, Hagen Paul Pfeifer wrote:
> IMHO, _if_ a programmer modifies the send or receive buffer he _knows_ exactly
> why. If he does not modify the buffer it is fine too, because _we_ tune the
> buffers as well as we can - and we are good at this.
>
> But, the backlog is different. Often the programmer does _not_ know how to
> tune this variable. And, often the backlog depends on the target system, on
> the network characteristics and the like.
>
> Therefore we provide the system administrator the _ability_ to tune the actual
> backlog.

What you want to tune is not the backlog (the maximum number of
connections ready to be delivered to accept()), but the number of SYN_RECV
half-open connections still waiting for a second packet from the client.

An application might really want a listen(fd, 1) to accept one incoming
connection, but still be able to survive a SYN flood.

By the way, you are still confused: tcp_max_syn_backlog has nothing to do
with the 'backlog'. As I already mentioned, it is a parameter that caps the
size of the hash table associated with a listener socket.

You can have a hash table with 1024 slots and still have a backlog of
16384, for example.
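For reference, the listen(fd, 1) case mentioned above could look like the
sketch below: a listener that deliberately asks for a single pending
connection. The helper name open_listener() and the choice of the loopback
address with an ephemeral port are assumptions made for illustration only.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: open a TCP listener on 127.0.0.1 with a
 * kernel-chosen port and the given backlog. Returns the socket
 * descriptor, or -1 on failure.
 */
static int open_listener(int backlog)
{
    struct sockaddr_in addr;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;              /* let the kernel pick a free port */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Calling open_listener(1) expresses exactly the intent described above. A
system-wide minimum applied to the accept backlog would silently raise that
1, which is the behaviour change being objected to.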
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 99e6e19..3e8865f 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -89,6 +89,7 @@ static inline void reqsk_free(struct request_sock *req)
 }
 
 extern int sysctl_max_syn_backlog;
+extern int sysctl_min_syn_backlog;
 
 /** struct listen_sock - listen state
  *
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 182236b..0e968b6 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -35,6 +35,9 @@
 int sysctl_max_syn_backlog = 256;
 EXPORT_SYMBOL(sysctl_max_syn_backlog);
 
+int sysctl_min_syn_backlog = 0;
+EXPORT_SYMBOL(sysctl_min_syn_backlog);
+
 int reqsk_queue_alloc(struct request_sock_queue *queue,
 		      unsigned int nr_table_entries)
 {
@@ -42,7 +45,8 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
 	struct listen_sock *lopt;
 
 	nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog);
-	nr_table_entries = max_t(u32, nr_table_entries, 8);
+	nr_table_entries = max_t(u32, nr_table_entries,
+			max_t(u32, 8, sysctl_min_syn_backlog));
 	nr_table_entries = roundup_pow_of_two(nr_table_entries + 1);
 	lopt_size += nr_table_entries * sizeof(struct request_sock *);
 	if (lopt_size > PAGE_SIZE)
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 807d83c..c580d7c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -213,7 +213,7 @@ int inet_listen(struct socket *sock, int backlog)
 		if (err)
 			goto out;
 	}
-	sk->sk_max_ack_backlog = backlog;
+	sk->sk_max_ack_backlog = max_t(u32, backlog, sysctl_min_syn_backlog);
 	err = 0;
 
 out:
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1a45665..cc03c62 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -298,6 +298,13 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
+		.procname	= "tcp_min_syn_backlog",
+		.data		= &sysctl_min_syn_backlog,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec
+	},
+	{
 		.procname	= "ip_local_port_range",
 		.data		= &sysctl_local_ports.range,
 		.maxlen		= sizeof(sysctl_local_ports.range),
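To make the effect of the two changed assignments easier to follow, here is
a user-space model of the arithmetic in the patched reqsk_queue_alloc() and
inet_listen(). It is a sketch only: the function names are invented for
illustration, the sysctl values are passed in as plain parameters, and the
power-of-two helper is a simple stand-in for the kernel's
roundup_pow_of_two().

```c
#include <stdint.h>

/* Smallest power of two >= n, valid for 1 <= n <= 2^31 (stand-in helper). */
static uint32_t roundup_pow_of_two_u32(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}

/*
 * Models the patched clamp sequence in reqsk_queue_alloc(): cap at
 * max_syn_backlog first, then raise to max(8, min_syn_backlog), then
 * round (n + 1) up to a power of two.
 */
static uint32_t syn_table_entries(uint32_t requested, uint32_t max_syn_backlog,
                                  uint32_t min_syn_backlog)
{
    uint32_t floor = min_syn_backlog > 8 ? min_syn_backlog : 8;
    uint32_t n = requested < max_syn_backlog ? requested : max_syn_backlog;

    if (n < floor)
        n = floor;
    return roundup_pow_of_two_u32(n + 1);
}

/* Models the patched assignment to sk_max_ack_backlog in inet_listen(). */
static uint32_t accept_backlog(uint32_t requested, uint32_t min_syn_backlog)
{
    return requested > min_syn_backlog ? requested : min_syn_backlog;
}
```

Two things fall out of the model. Because the minimum is applied after the
cap, a tcp_min_syn_backlog larger than the maximum wins: with a cap of 256
and a minimum of 1024, a request of 8 yields 2048 table entries. And
accept_backlog(1, 128) returns 128, so a deliberate listen(fd, 1) would be
overridden by the administrator's setting.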
In case a server programmer misjudges the network characteristics, the
backlog parameter for listen(2) may not be adequate to utilize the host's
capabilities and may lead to unnecessary SYN retransmissions - thus an
underestimated backlog value can form an artificial limitation.

A listen queue length of 8 is often way too small, but several
server authors do not know about this limitation (from Eric's server
setup):

ss -a | head
State      Recv-Q Send-Q   Local Address:Port    Peer Address:Port
LISTEN     0      8                    *:imaps              *:*
LISTEN     0      8                    *:pop3s              *:*
LISTEN     0      50                   *:mysql              *:*
LISTEN     0      8                    *:pop3               *:*
LISTEN     0      8                    *:imap2              *:*
LISTEN     0      511                  *:www                *:*

Until now it was not possible for the system (network) administrator to
increase this value. A bug report must be filed, the backlog increased,
and a new version released, or even worse: if using closed source
software you cannot do anything.

sysctl_min_syn_backlog provides the ability to increase the minimum
queue length.

Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/request_sock.h |    1 +
 net/core/request_sock.c    |    6 +++++-
 net/ipv4/af_inet.c         |    2 +-
 net/ipv4/sysctl_net_ipv4.c |    7 +++++++
 4 files changed, 14 insertions(+), 2 deletions(-)