
xfrm: export xfrm garbage collector thresholds via sysctl

Message ID 20090727182246.GC15823@hmsreliant.think-freely.org
State Accepted, archived
Delegated to: David Miller

Commit Message

Neil Horman July 27, 2009, 6:22 p.m. UTC
Export garbage collector thresholds for xfrm[4|6]_dst_ops

Had a problem reported to me recently in which a high volume of ipsec
connections on a system began reporting ENOBUFS for new connections eventually.
It seemed that after about 2000 connections we started being unable to create
more.  A quick look revealed that the xfrm code used a dst_ops structure that
limited the gc_thresh value to 1024, and always dropped route cache entries
after 2x the gc_thresh.  It seems the most direct solution is to export the
gc_thresh values in the xfrm[4|6] dst_ops as sysctls, like the main routing
table does, so that higher volumes of connections can be supported.  This patch
has been tested and allows the reporter to increase their ipsec connection
volume successfully.

Reported-by: Joe Nall <joe@nall.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


ipv4/xfrm4_policy.c |   18 ++++++++++++++++++
ipv6/xfrm6_policy.c |   18 ++++++++++++++++++
2 files changed, 36 insertions(+)
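
For illustration, the failure path described above boils down to a check of
this shape (paraphrased from the 2.6.31-era xfrm4 code, not part of this
patch); dst_alloc() refuses new entries whenever this returns nonzero, and
the caller surfaces that as ENOBUFS:

	/* Paraphrased sketch: once the number of cached xfrm dst entries
	 * exceeds twice gc_thresh (statically initialized to 1024),
	 * garbage collection is considered to have failed and the new
	 * allocation is refused. */
	static int xfrm4_garbage_collect(struct dst_ops *ops)
	{
		xfrm4_policy_afinfo.garbage_collect(&init_net);
		return (atomic_read(&ops->entries) > ops->gc_thresh * 2);
	}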


Comments

David Miller July 27, 2009, 6:37 p.m. UTC | #1
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 27 Jul 2009 14:22:46 -0400

> Export garbage collector thresholds for xfrm[4|6]_dst_ops
> 
> Had a problem reported to me recently in which a high volume of ipsec
> connections on a system began reporting ENOBUFS for new connections eventually.
> It seemed that after about 2000 connections we started being unable to create
> more.  A quick look revealed that the xfrm code used a dst_ops structure that
> limited the gc_thresh value to 1024, and always dropped route cache entries
> after 2x the gc_thresh.  It seems the most direct solution is to export the
> gc_thresh values in the xfrm[4|6] dst_ops as sysctls, like the main routing
> table does, so that higher volumes of connections can be supported.  This patch
> has been tested and allows the reporter to increase their ipsec connection
> volume successfully.
> 
> Reported-by: Joe Nall <joe@nall.com>
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>

Applied, but this suggests that either:

1) we pick a horrible default

2) our IPSEC machinery holds onto dst entries too tightly and that's
   the true cause of this problem

I'd like to ask that you investigate this, because with defaults
we should be able to handle IPSEC loads as high as the routing
loads we could handle.

Thanks.
Neil Horman July 27, 2009, 7:36 p.m. UTC | #2
On Mon, Jul 27, 2009 at 11:37:55AM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 27 Jul 2009 14:22:46 -0400
> 
> > Export garbage collector thresholds for xfrm[4|6]_dst_ops
> > 
> > Had a problem reported to me recently in which a high volume of ipsec
> > connections on a system began reporting ENOBUFS for new connections eventually.
> > It seemed that after about 2000 connections we started being unable to create
> > more.  A quick look revealed that the xfrm code used a dst_ops structure that
> > limited the gc_thresh value to 1024, and always dropped route cache entries
> > after 2x the gc_thresh.  It seems the most direct solution is to export the
> > gc_thresh values in the xfrm[4|6] dst_ops as sysctls, like the main routing
> > table does, so that higher volumes of connections can be supported.  This patch
> > has been tested and allows the reporter to increase their ipsec connection
> > volume successfully.
> > 
> > Reported-by: Joe Nall <joe@nall.com>
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> Applied, but this suggests that either:
> 
Thanks!

> 1) we pick a horrible default
> 
> 2) our IPSEC machinery holds onto dst entries too tightly and that's
>    the true cause of this problem
> 
> I'd like to ask that you investigate this, because with defaults
> we should be able to handle IPSEC loads as high as the routing
> loads we could handle.
> 
I'll gladly look into this further.  Compared to the main routing table, the
ipsec default selection is pretty bad.  It's statically set, regardless of
how large the main routing table is.  Looking at the garbage collection
algorithm, we do keep a pretty tight leash on freeing entries, but I think
it's warranted.  We create one dst_entry for each open socket on an ipsec
tunnel, and don't release it until its __refcnt drops to zero.  I think that
makes sense, since it means we only keep cache entries for active
connections, and clean them up as soon as they close (i.e. I don't really
see the advantage of unhashing an xfrm cache entry only to recreate it on
the next packet sent).  I think the most sensible first step is to
dynamically choose a gc threshold based on the size of memory or the main
routing table.  I'll write this up, along the lines of the sketch below, and
post later this week after I do some testing.  Thanks!
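
Roughly the shape I have in mind (hypothetical sketch only -- the patch I
actually post may differ, and rt_max_size is an assumed parameter that the
ipv4 routing code would have to hand us):

	/* Hypothetical sketch, not a posted patch: derive the xfrm4 GC
	 * threshold from the routing cache capacity instead of the static
	 * 1024 default, keeping the old value as a floor.  The sysctl
	 * registration added by the patch above is omitted for brevity. */
	void __init xfrm4_init(int rt_max_size)
	{
		xfrm4_state_init();
		xfrm4_policy_init();
		xfrm4_dst_ops.gc_thresh = max_t(int, 1024, rt_max_size / 2);
	}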
Neil

> Thanks.
> 
David Miller July 27, 2009, 7:40 p.m. UTC | #3
From: Neil Horman <nhorman@tuxdriver.com>
Date: Mon, 27 Jul 2009 15:36:25 -0400

> I think that makes sense, since it means we only keep cache entries
> for active connections, and clean them up as soon as they close
> (i.e. I don't really see the advantage of unhashing an xfrm cache
> entry only to recreate it on the next packet sent).

How is this related to the user's problem?

My impression was that they were forwarding IPSEC traffic when
running up against these limits, and that has no socket-based
association at all.

Making the XFRM GC limits get computed similarly to how the
ipv4/ipv6 ones do probably makes sense.
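
For comparison, the ipv4 side picks its threshold from the route hash table
size computed at boot rather than using a constant; paraphrased from
ip_rt_init() in net/ipv4/route.c of this era:

	/* Paraphrased from ip_rt_init(): the routing cache GC threshold
	 * scales with the hash table size chosen at boot, rather than
	 * being a fixed constant. */
	ipv4_dst_ops.gc_thresh = (rt_hash_mask + 1);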


Joe Nall July 27, 2009, 8:02 p.m. UTC | #4
On Jul 27, 2009, at 2:40 PM, David Miller wrote:

> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 27 Jul 2009 15:36:25 -0400
>
>> I think that makes sense, since it means we only keep cache entries
>> for active connections, and clean them up as soon as they close
>> (i.e. I don't really see the advantage of unhashing an xfrm cache
>> entry only to recreate it on the next packet sent).
>
> How is this related to the user's problem?
>
> My impression was that they were forwarding IPSEC traffic when
> running up against these limits, and that has no socket-based
> association at all.
>
> Making the XFRM GC limits get computed similarly to how the
> ipv4/ipv6 ones do probably makes sense.

The problem was seen serving TCP connections over IPSec when the server
(a 24-core machine with 32GB of RAM) and the IPSec host were the same
machine (transport mode, not tunnel).

The problem was originally identified in an MLS ipsec thin-client stress
test with 25 clients and 200+ windows per client.

We then duplicated the issue with single-level xclocks on the same
hardware.

I then duplicated the problem with two bone-stock (not MLS) F10 (and
later F11) boxes running ab (the Apache benchmark tool) with 10k requests
over 2k concurrent connections. See
https://bugzilla.redhat.com/show_bug.cgi?id=503124

With the gc_thresh raised to 8k, my ab test passes with 5k+ connections
and 100k requests. Our test guys ran 11 workstations with about 2200
concurrent X connections over the weekend, and I'm hoping for some MLS
results before we lose the test suite on Wednesday.
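
(With the patch applied, raising the threshold like that corresponds to
setting the new sysctl, e.g. "sysctl -w net.ipv4.xfrm4_gc_thresh=8192";
the path follows from the ctl path the patch registers under net.ipv4.)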

joe
David Miller July 27, 2009, 8:10 p.m. UTC | #5
From: Joe Nall <joe@nall.com>
Date: Mon, 27 Jul 2009 15:02:20 -0500

> The problem was seen serving TCP connections over IPSec when the
> server (a 24-core machine with 32GB of RAM) and the IPSec host were
> the same machine (transport mode, not tunnel).

Aha, thanks for the clarification, that makes a lot more sense.

Patch

diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 0071ee6..018ac8b 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -264,6 +264,20 @@  static struct xfrm_policy_afinfo xfrm4_policy_afinfo = {
 	.fill_dst =		xfrm4_fill_dst,
 };
 
+static struct ctl_table xfrm4_policy_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "xfrm4_gc_thresh",
+		.data           = &xfrm4_dst_ops.gc_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{ }
+};
+
+static struct ctl_table_header *sysctl_hdr;
+
 static void __init xfrm4_policy_init(void)
 {
 	xfrm_policy_register_afinfo(&xfrm4_policy_afinfo);
@@ -271,6 +285,8 @@  static void __init xfrm4_policy_init(void)
 
 static void __exit xfrm4_policy_fini(void)
 {
+	if (sysctl_hdr)
+		unregister_net_sysctl_table(sysctl_hdr);
 	xfrm_policy_unregister_afinfo(&xfrm4_policy_afinfo);
 }
 
@@ -278,5 +294,7 @@  void __init xfrm4_init(void)
 {
 	xfrm4_state_init();
 	xfrm4_policy_init();
+	sysctl_hdr = register_net_sysctl_table(&init_net, net_ipv4_ctl_path,
+						xfrm4_policy_table);
 }
 
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 3a3c677..4acc308 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -306,6 +306,20 @@  static void xfrm6_policy_fini(void)
 	xfrm_policy_unregister_afinfo(&xfrm6_policy_afinfo);
 }
 
+static struct ctl_table xfrm6_policy_table[] = {
+	{
+		.ctl_name       = CTL_UNNUMBERED,
+		.procname       = "xfrm6_gc_thresh",
+		.data           = &xfrm6_dst_ops.gc_thresh,
+		.maxlen         = sizeof(int),
+		.mode           = 0644,
+		.proc_handler   = proc_dointvec,
+	},
+	{ }
+};
+
+static struct ctl_table_header *sysctl_hdr;
+
 int __init xfrm6_init(void)
 {
 	int ret;
@@ -317,6 +331,8 @@  int __init xfrm6_init(void)
 	ret = xfrm6_state_init();
 	if (ret)
 		goto out_policy;
+	sysctl_hdr = register_net_sysctl_table(&init_net, net_ipv6_ctl_path,
+						xfrm6_policy_table);
 out:
 	return ret;
 out_policy:
@@ -326,6 +342,8 @@  out_policy:
 
 void xfrm6_fini(void)
 {
+	if (sysctl_hdr)
+		unregister_net_sysctl_table(sysctl_hdr);
 	//xfrm6_input_fini();
 	xfrm6_policy_fini();
 	xfrm6_state_fini();