Message ID:   20090727182246.GC15823@hmsreliant.think-freely.org
State:        Accepted, archived
Delegated to: David Miller
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 27 Jul 2009 14:22:46 -0400

> Export garbage collector thresholds for xfrm[4|6]_dst_ops
>
> Had a problem reported to me recently in which a high volume of ipsec
> connections on a system eventually began reporting ENOBUFS for new
> connections. It seemed that after about 2000 connections we started
> being unable to create more. A quick look revealed that the xfrm code
> used a dst_ops structure that limited the gc_thresh value to 1024, and
> always dropped route cache entries after 2x the gc_thresh. It seems
> the most direct solution is to export the gc_thresh values in the
> xfrm[4|6] dst_ops as sysctls, like the main routing table does, so
> that higher volumes of connections can be supported. This patch has
> been tested and allows the reporter to increase their ipsec connection
> volume successfully.
>
> Reported-by: Joe Nall <joe@nall.com>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Applied, but this suggests that either:

1) we pick a horrible default

2) our IPSEC machinery holds onto dst entries too tightly, and that's
   the true cause of this problem

I'd like to ask that you investigate this, because with defaults we
should be able to handle IPSEC loads as high as the routing loads we
could handle.

Thanks.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Jul 27, 2009 at 11:37:55AM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 27 Jul 2009 14:22:46 -0400
>
> > Export garbage collector thresholds for xfrm[4|6]_dst_ops
> > [...]
>
> Applied, but this suggests that either:
>
Thanks!

> 1) we pick a horrible default
>
> 2) our IPSEC machinery holds onto dst entries too tightly and that's
> the true cause of this problem
>
> I'd like to ask that you investigate this, because with defaults
> we should be able to handle IPSEC loads as high as the routing
> loads we could handle.
>
I'll gladly look into this further. Compared to the main routing
table, the ipsec default selection is pretty bad: it's statically set
regardless of the size of the main routing table. Looking at the
garbage collection algorithm, we do keep a pretty tight leash on
freeing entries, but I think it's warranted. We create one dst_entry
for each open socket on an ipsec tunnel, and don't release it until
its __refcnt drops to zero. I think that makes sense, since it means
we only keep cache entries for active connections, and clean them up
as soon as they close (e.g. I don't really see the advantage to
unhashing an xfrm cache entry only to recreate it on the next packet
sent). I think the most sensible first step is to dynamically choose
a gc threshold based on the size of memory or the main routing table.
I'll write this up and post later this week after I do some testing.

Thanks!
Neil

> Thanks.
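The "dynamically choose a gc threshold" idea above can be sketched in a few lines. The function name and the exact scaling policy here are illustrative assumptions (not the eventual kernel implementation): it simply derives the threshold from the number of routing-cache hash buckets (2^rt_hash_log) instead of hard-coding 1024, keeping the old value as a floor.

```c
/* Hypothetical sizing policy for the xfrm gc threshold: scale with the
 * main routing cache hash table (2^rt_hash_log buckets) rather than
 * using the fixed 1024 default, and never go below the old default. */
static int xfrm_gc_thresh_guess(unsigned int rt_hash_log)
{
	unsigned int thresh = 1U << rt_hash_log;

	return thresh < 1024 ? 1024 : (int)thresh;
}
```

On a small box with a 2^8-bucket routing hash this leaves the threshold at 1024; on a large-memory machine with a 2^17-bucket hash it would allow 131072 cached xfrm entries before gc pressure kicks in.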
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 27 Jul 2009 15:36:25 -0400

> I think that makes sense, since it means we only keep cache entries
> for active connections, and clean them up as soon as they close
> (e.g. I don't really see the advantage to unhashing an xfrm cache
> entry only to recreate it on the next packet sent).

How is this related to the user's problem?

My impression was that they were forwarding IPSEC traffic when running
up against these limits, and that has no socket-based association at
all.

Making the XFRM GC limits get computed similarly to how the ipv4/ipv6
ones do probably makes sense.
On Jul 27, 2009, at 2:40 PM, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 27 Jul 2009 15:36:25 -0400
>
>> I think that makes sense, since it means we only keep cache entries
>> for active connections, and clean them up as soon as they close
>> (e.g. I don't really see the advantage to unhashing an xfrm cache
>> entry only to recreate it on the next packet sent).
>
> How is this related to the user's problem?
>
> My impression was that they were forwarding IPSEC traffic when
> running up against these limits, and that has no socket based
> association at all.
>
> Making the XFRM GC limits get computed similarly to how the
> ipv4/ipv6 ones do probably makes sense.

The problem was seen serving TCP connections over IPSec when the
server (a 24-core machine w/ 32GB RAM) and the IPSec host were the
same (transport, not tunnel). The problem was originally identified in
an MLS ipsec thin client stress test with 25 clients and 200+ windows
per client. We then duplicated the issue with single level xclocks on
the same hardware. I then duplicated the problem with two bone stock
(not MLS) F10 (and later F11) boxes running ab (the apache benchmark
tool) with 10k requests over 2k concurrent connections. See
https://bugzilla.redhat.com/show_bug.cgi?id=503124

With the gc_thresh raised to 8k, my ab test passes with 5k+
connections and 100k requests. Our test guys ran 11 workstations with
about 2200 concurrent X connections over the weekend and I'm hoping
for some MLS results before we lose the test suite on Wednesday.

joe
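For reference, with the patch applied the 8k setting described above would be applied through the new knobs. The names below come straight from the patch's ctl tables (registered under the ipv4/ipv6 sysctl paths); whether they exist on a given machine of course depends on the kernel carrying this change.

```shell
# Inspect the current xfrm route-cache GC thresholds (default 1024);
# entries start being dropped once the cache exceeds 2x this value
cat /proc/sys/net/ipv4/xfrm4_gc_thresh
cat /proc/sys/net/ipv6/xfrm6_gc_thresh

# Raise both to 8192, as in the ab test above
sysctl -w net.ipv4.xfrm4_gc_thresh=8192
sysctl -w net.ipv6.xfrm6_gc_thresh=8192
```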
From: Joe Nall <joe@nall.com>
Date: Mon, 27 Jul 2009 15:02:20 -0500

> The problem was seen serving TCP connections over IPSec when the
> server (a 24-core machine w/ 32GB RAM) and the IPSec host were the
> same (transport, not tunnel).

Aha, thanks for the clarification, that makes a lot more sense.
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 0071ee6..018ac8b 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -264,6 +264,20 @@ static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
 	.fill_dst =		xfrm4_fill_dst,
 };
 
+static struct ctl_table xfrm4_policy_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "xfrm4_gc_thresh",
+		.data		= &xfrm4_dst_ops.gc_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{ }
+};
+
+static struct ctl_table_header *sysctl_hdr;
+
 static void __init xfrm4_policy_init(void)
 {
 	xfrm_policy_register_afinfo(&xfrm4_policy_afinfo);
@@ -271,6 +285,8 @@ static void __init xfrm4_policy_init(void)
 
 static void __exit xfrm4_policy_fini(void)
 {
+	if (sysctl_hdr)
+		unregister_net_sysctl_table(sysctl_hdr);
 	xfrm_policy_unregister_afinfo(&xfrm4_policy_afinfo);
 }
 
@@ -278,5 +294,7 @@ void __init xfrm4_init(void)
 {
 	xfrm4_state_init();
 	xfrm4_policy_init();
+	sysctl_hdr = register_net_sysctl_table(&init_net, net_ipv4_ctl_path,
+						xfrm4_policy_table);
 }
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 3a3c677..4acc308 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -306,6 +306,20 @@ static void xfrm6_policy_fini(void)
 	xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
 
+static struct ctl_table xfrm6_policy_table[] = {
+	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "xfrm6_gc_thresh",
+		.data		= &xfrm6_dst_ops.gc_thresh,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
+	{ }
+};
+
+static struct ctl_table_header *sysctl_hdr;
+
 int __init xfrm6_init(void)
 {
 	int ret;
@@ -317,6 +331,8 @@ int __init xfrm6_init(void)
 	ret = xfrm6_state_init();
 	if (ret)
 		goto out_policy;
+	sysctl_hdr = register_net_sysctl_table(&init_net, net_ipv6_ctl_path,
+						xfrm6_policy_table);
 out:
 	return ret;
 out_policy:
@@ -326,6 +342,8 @@ out_policy:
 
 void xfrm6_fini(void)
 {
+	if (sysctl_hdr)
+		unregister_net_sysctl_table(sysctl_hdr);
 	//xfrm6_input_fini();
 	xfrm6_policy_fini();
 	xfrm6_state_fini();
Export garbage collector thresholds for xfrm[4|6]_dst_ops

Had a problem reported to me recently in which a high volume of ipsec
connections on a system eventually began reporting ENOBUFS for new
connections. It seemed that after about 2000 connections we started
being unable to create more. A quick look revealed that the xfrm code
used a dst_ops structure that limited the gc_thresh value to 1024, and
always dropped route cache entries after 2x the gc_thresh. It seems
the most direct solution is to export the gc_thresh values in the
xfrm[4|6] dst_ops as sysctls, like the main routing table does, so
that higher volumes of connections can be supported. This patch has
been tested and allows the reporter to increase their ipsec connection
volume successfully.

Reported-by: Joe Nall <joe@nall.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

 ipv4/xfrm4_policy.c |   18 ++++++++++++++++++
 ipv6/xfrm6_policy.c |   18 ++++++++++++++++++
 2 files changed, 36 insertions(+)