From patchwork Tue Jan 5 00:20:18 2010
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Simon Horman
X-Patchwork-Id: 42104
X-Patchwork-Delegate: davem@davemloft.net
Return-Path:
X-Original-To: patchwork-incoming@ozlabs.org
Delivered-To: patchwork-incoming@ozlabs.org
Received: from vger.kernel.org (vger.kernel.org [209.132.180.67])
	by ozlabs.org (Postfix) with ESMTP id E67DBB6EF6
	for ; Tue, 5 Jan 2010 11:20:28 +1100 (EST)
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1754005Ab0AEAUW (ORCPT ); Mon, 4 Jan 2010 19:20:22 -0500
Received: (majordomo@vger.kernel.org) by vger.kernel.org
	id S1753900Ab0AEAUW (ORCPT ); Mon, 4 Jan 2010 19:20:22 -0500
Received: from kirsty.vergenet.net ([202.4.237.240]:58054 "EHLO
	kirsty.vergenet.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1752261Ab0AEAUV (ORCPT );
	Mon, 4 Jan 2010 19:20:21 -0500
Received: from yukiko.kent.sydney.vergenet.net
	(124-170-248-2.dyn.iinet.net.au [124.170.248.2])
	by kirsty.vergenet.net (Postfix) with ESMTP id 0ADA724051;
	Tue, 5 Jan 2010 11:20:19 +1100 (EST)
Received: by yukiko.kent.sydney.vergenet.net (Postfix, from userid 7100)
	id E19D0C2863; Tue, 5 Jan 2010 11:20:18 +1100 (EST)
Date: Tue, 5 Jan 2010 11:20:18 +1100
From: Simon Horman
To: netdev@vger.kernel.org, lvs-devel@vger.kernel.org
Cc: "Catalin(ux) M. BOIE" , Mark Bergsma , Joseph Mack NA3T ,
	Graeme Fowler , David Miller , Patrick McHardy
Subject: [PATCH] IPVS: Allow boot time change of hash size.
Message-ID: <20100105002018.GJ2554@verge.net.au>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <4B38FF15.6040803@wikimedia.org>
	<4B38FDC2.9000507@wikimedia.org>
User-Agent: Mutt/1.5.20 (2009-06-14)
Sender: netdev-owner@vger.kernel.org
Precedence: bulk
List-ID:
X-Mailing-List: netdev@vger.kernel.org

From: Catalin(ux) M. BOIE

IPVS: Allow boot time change of hash size.

I was very frustrated that I had to recompile the kernel to change the
hash size, so I created this patch.

If IPVS is built in, you can append ip_vs.conn_tab_bits=?? to the kernel
command line; if you built IPVS as modules, you can add
"options ip_vs conn_tab_bits=??".

To keep everything backward compatible, you can still select the size at
compile time, and that value will be used as the default.

[ horms@verge.net.au: trivial up-port and minor style fixes ]

Signed-off-by: Catalin(ux) M. BOIE
Cc: Mark Bergsma
Signed-off-by: Simon Horman

---

Patrick, please consider applying this.

I'd like to do something dynamic, but this change is an obvious win in the
meantime.

It has been about a year since this patch was originally posted and
subsequently dropped on the basis of insufficient test data.

Mark Bergsma has provided the following test results, which seem to
strongly support the need for larger hash table sizes:

We do however run into the same problem with the default setting (2^12 =
4096 entries), as most of our LVS balancers handle around a million
connections/SLAB entries at any point in time (around 100-150 kpps load).
With only 4096 hash table entries this implies that each entry consists of
a linked list of 256 connections *on average*.

To provide some statistics, I did an oprofile run on a 2.6.31 kernel,
with both the default 4096 table size, and the same kernel recompiled with
IP_VS_CONN_TAB_BITS set to 18 (2^18 = 262144 entries). I built a quick test
setup with a part of Wikimedia/Wikipedia's live traffic mirrored by the
switch to the test host.

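As a rough check of the chain-length figure quoted above, the arithmetic can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, the even spread of connections over buckets is an assumption, and the 8 bytes per hash entry is the figure given in the Kconfig help text touched by this patch.

```c
#include <assert.h>
#include <stddef.h>

/* Number of buckets in a table indexed by `bits` hash bits (2^bits). */
size_t conn_tab_buckets(int bits)
{
	assert(bits >= 0 && bits < 31);	/* keep the shift well defined */
	return (size_t)1 << bits;
}

/* Average chain length, assuming connections spread evenly over buckets. */
double avg_chain_len(size_t conns, int bits)
{
	return (double)conns / (double)conn_tab_buckets(bits);
}

/* Bucket array footprint, assuming 8 bytes per entry as stated in Kconfig. */
size_t conn_tab_bytes(int bits)
{
	return conn_tab_buckets(bits) * 8;
}
```

With 12 bits, one million connections average about 244 per bucket (exactly 256 if "a million" is read as 2^20); with 18 bits the average drops below 4, while the bucket array grows from 32 KiB to 2 MiB under the 8-byte assumption.
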
With the default setting, at ~ 120 kpps packet load we saw a typical %si
CPU usage of around 30-35%, and oprofile reported a hot spot in
ip_vs_conn_in_get:

samples  %        image name  app name  symbol name
1719761  42.3741  ip_vs.ko    ip_vs.ko  ip_vs_conn_in_get
302577    7.4554  bnx2        bnx2      /bnx2
181984    4.4840  vmlinux     vmlinux   __ticket_spin_lock
128636    3.1695  vmlinux     vmlinux   ip_route_input
74345     1.8318  ip_vs.ko    ip_vs.ko  ip_vs_conn_out_get
68482     1.6874  vmlinux     vmlinux   mwait_idle

After loading the recompiled kernel with 2^18 entries, %si CPU usage
dropped in half to around 12-18%, and oprofile looks much healthier, with
only 7% spent in ip_vs_conn_in_get:

samples  %        image name  app name  symbol name
265641   14.4616  bnx2        bnx2      /bnx2
143251    7.7986  vmlinux     vmlinux   __ticket_spin_lock
140661    7.6576  ip_vs.ko    ip_vs.ko  ip_vs_conn_in_get
94364     5.1372  vmlinux     vmlinux   mwait_idle
86267     4.6964  vmlinux     vmlinux   ip_route_input

Index: net-next-2.6/include/net/ip_vs.h
===================================================================
--- net-next-2.6.orig/include/net/ip_vs.h	2009-12-29 10:18:56.000000000 +0900
+++ net-next-2.6/include/net/ip_vs.h	2009-12-29 10:19:13.000000000 +0900
@@ -26,6 +26,11 @@
 #include /* for struct ipv6hdr */
 #include /* for ipv6_addr_copy */
 
+
+/* Connections' size value needed by ip_vs_ctl.c */
+extern int ip_vs_conn_tab_size;
+
+
 struct ip_vs_iphdr {
 	int len;
 	__u8 protocol;
@@ -592,17 +597,6 @@ extern void ip_vs_init_hash_table(struct
  * (from ip_vs_conn.c)
  */
 
-/*
- * IPVS connection entry hash table
- */
-#ifndef CONFIG_IP_VS_TAB_BITS
-#define CONFIG_IP_VS_TAB_BITS	12
-#endif
-
-#define IP_VS_CONN_TAB_BITS	CONFIG_IP_VS_TAB_BITS
-#define IP_VS_CONN_TAB_SIZE	(1 << IP_VS_CONN_TAB_BITS)
-#define IP_VS_CONN_TAB_MASK	(IP_VS_CONN_TAB_SIZE - 1)
-
 enum {
 	IP_VS_DIR_INPUT = 0,
 	IP_VS_DIR_OUTPUT,
Index: net-next-2.6/net/netfilter/ipvs/Kconfig
===================================================================
--- net-next-2.6.orig/net/netfilter/ipvs/Kconfig	2009-12-29 10:18:56.000000000 +0900
+++ net-next-2.6/net/netfilter/ipvs/Kconfig	2009-12-29 10:19:13.000000000 +0900
@@ -68,6 +68,10 @@ config IP_VS_TAB_BITS
 	  each hash entry uses 8 bytes, so you can estimate how much memory is
 	  needed for your box.
 
+	  You can overwrite this number setting conn_tab_bits module parameter
+	  or by appending ip_vs.conn_tab_bits=? to the kernel command line
+	  if IP VS was compiled built-in.
+
 comment "IPVS transport protocol load balancing support"
 
 config IP_VS_PROTO_TCP
Index: net-next-2.6/net/netfilter/ipvs/ip_vs_conn.c
===================================================================
--- net-next-2.6.orig/net/netfilter/ipvs/ip_vs_conn.c	2009-12-29 10:18:56.000000000 +0900
+++ net-next-2.6/net/netfilter/ipvs/ip_vs_conn.c	2009-12-29 10:19:13.000000000 +0900
@@ -40,6 +40,21 @@
 #include
 
+#ifndef CONFIG_IP_VS_TAB_BITS
+#define CONFIG_IP_VS_TAB_BITS	12
+#endif
+
+/*
+ * Connection hash size. Default is what was selected at compile time.
+*/
+int ip_vs_conn_tab_bits = CONFIG_IP_VS_TAB_BITS;
+module_param_named(conn_tab_bits, ip_vs_conn_tab_bits, int, 0444);
+MODULE_PARM_DESC(conn_tab_bits, "Set connections' hash size");
+
+/* size and mask values */
+int ip_vs_conn_tab_size;
+int ip_vs_conn_tab_mask;
+
 /*
  * Connection hash table: for input and output packets lookups of IPVS
  */
@@ -125,11 +140,11 @@ static unsigned int ip_vs_conn_hashkey(i
 	if (af == AF_INET6)
 		return jhash_3words(jhash(addr, 16, ip_vs_conn_rnd),
 				    (__force u32)port, proto, ip_vs_conn_rnd)
-			& IP_VS_CONN_TAB_MASK;
+			& ip_vs_conn_tab_mask;
 #endif
 	return jhash_3words((__force u32)addr->ip, (__force u32)port, proto,
 			    ip_vs_conn_rnd)
-		& IP_VS_CONN_TAB_MASK;
+		& ip_vs_conn_tab_mask;
 }
 
 
@@ -760,7 +775,7 @@ static void *ip_vs_conn_array(struct seq
 	int idx;
 	struct ip_vs_conn *cp;
 
-	for(idx = 0; idx < IP_VS_CONN_TAB_SIZE; idx++) {
+	for (idx = 0; idx < ip_vs_conn_tab_size; idx++) {
 		ct_read_lock_bh(idx);
 		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
 			if (pos-- == 0) {
@@ -797,7 +812,7 @@ static void *ip_vs_conn_seq_next(struct
 	idx = l - ip_vs_conn_tab;
 	ct_read_unlock_bh(idx);
 
-	while (++idx < IP_VS_CONN_TAB_SIZE) {
+	while (++idx < ip_vs_conn_tab_size) {
 		ct_read_lock_bh(idx);
 		list_for_each_entry(cp, &ip_vs_conn_tab[idx], c_list) {
 			seq->private = &ip_vs_conn_tab[idx];
@@ -976,8 +991,8 @@ void ip_vs_random_dropentry(void)
 	/*
 	 * Randomly scan 1/32 of the whole table every second
 	 */
-	for (idx = 0; idx < (IP_VS_CONN_TAB_SIZE>>5); idx++) {
-		unsigned hash = net_random() & IP_VS_CONN_TAB_MASK;
+	for (idx = 0; idx < (ip_vs_conn_tab_size>>5); idx++) {
+		unsigned hash = net_random() & ip_vs_conn_tab_mask;
 
 		/*
 		 * Lock is actually needed in this loop.
@@ -1029,7 +1044,7 @@ static void ip_vs_conn_flush(void)
 	struct ip_vs_conn *cp;
 
   flush_again:
-	for (idx=0; idx
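
The hunks in this patch replace the compile-time constants with the variables ip_vs_conn_tab_size and ip_vs_conn_tab_mask, but the part that initialises them is cut off in this copy. The following is a minimal userspace sketch of how a size and mask could be derived from a bit count, assuming the same power-of-two relationship as the removed macros (1 << bits, and size - 1). The struct, the function names and the bounds check are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for the patch's run-time variables. */
struct conn_tab_geom {
	uint32_t size;	/* number of buckets, 2^bits */
	uint32_t mask;	/* size - 1, folds a hash into a bucket index */
};

/* Derive size and mask from a bit count; the bounds are illustrative only. */
struct conn_tab_geom conn_tab_geom_from_bits(int bits)
{
	struct conn_tab_geom g;

	assert(bits >= 1 && bits <= 30);
	g.size = (uint32_t)1 << bits;
	g.mask = g.size - 1;
	return g;
}

/* Same folding step as "hash & ip_vs_conn_tab_mask" in the patch. */
uint32_t conn_tab_bucket(uint32_t hash, struct conn_tab_geom g)
{
	return hash & g.mask;
}
```

Because the size is always a power of two, the mask keeps exactly the low `bits` bits of the hash, so the bucket index stays below the table size without a modulo operation.
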