xfrm: implement basic garbage collection for bundles

Message ID 1269087341-7009-1-git-send-email-timo.teras@iki.fi
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Timo Teras March 20, 2010, 12:15 p.m. UTC
The dst core calls garbage collection only from dst_alloc, when
the dst entry threshold is exceeded. The xfrm core currently checks
bundles only on the NETDEV_DOWN event.

Previously this was not a big problem, since the xfrm gc threshold
was small and bundles were generated all the time due to another
bug, so the threshold was regularly exceeded.

Since a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
large gc thresholds (>200000 on machines with a normal amount of
memory), garbage collection does not get triggered under normal
circumstances. This can result in an enormous number of stale
bundles. Furthermore, each of these stale bundles keeps a reference
to ipv4/ipv6 rtable entries which have already been garbage collected
and put on the dst core's "destroy free'd dst's" list. This list can
grow very large, and the dst core's periodic job can bring even a
fast machine to its knees.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
---
 net/xfrm/xfrm_policy.c |   20 +++++++++++++++++++-
 1 files changed, 19 insertions(+), 1 deletions(-)
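
For reference, the only place the dst core triggers garbage collection in
this era is dst_alloc() itself; a simplified sketch of the relevant logic
from net/core/dst.c (field initialization elided):

void *dst_alloc(struct dst_ops *ops)
{
	struct dst_entry *dst;

	/* The sole GC trigger: an allocation that finds the per-family
	 * entry count above gc_thresh.  With a huge threshold this
	 * effectively never fires. */
	if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
		if (ops->gc(ops))
			return NULL;	/* GC could not make room */
	}
	dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
	/* ... initialization elided ... */
	return dst;
}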

Comments

Herbert Xu March 20, 2010, 12:32 p.m. UTC | #1
On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> The dst core calls garbage collection only from dst_alloc, when
> the dst entry threshold is exceeded. The xfrm core currently checks
> bundles only on the NETDEV_DOWN event.
> 
> Previously this was not a big problem, since the xfrm gc threshold
> was small and bundles were generated all the time due to another
> bug, so the threshold was regularly exceeded.
> 
> Since a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
> ("xfrm: select sane defaults for xfrm[4|6] gc_thresh") we can have
> large gc thresholds (>200000 on machines with a normal amount of
> memory), garbage collection does not get triggered under normal
> circumstances. This can result in an enormous number of stale
> bundles. Furthermore, each of these stale bundles keeps a reference
> to ipv4/ipv6 rtable entries which have already been garbage collected
> and put on the dst core's "destroy free'd dst's" list. This list can
> grow very large, and the dst core's periodic job can bring even a
> fast machine to its knees.

So why do we need this larger threshold in the first place? Neil?
Timo Teras March 20, 2010, 12:42 p.m. UTC | #2
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
>> [commit message snipped]
> 
> So why do we need this larger threshold in the first place? Neil?

Actually, it looks like on the IPv6 side the gc_thresh is something
more normal; on the IPv4 side it's insanely big. The 1/2 ratio is not
what the ipv4 rtable uses for its own gc_thresh: it looks like it
uses a 1/16 ratio, which yields a much better value.
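
For a rough illustration with assumed sizes: with a route hash table of
524288 buckets, a 1/2 ratio puts the xfrm gc_thresh at 262144 (consistent
with the ">200000" figure above), while a 1/16 ratio would put it at 32768.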

But even if we set the gc_thresh back to 1024 or a similar size,
it is still a good thing to do some basic gc on xfrm bundles, so
that the underlying rtable dst's can be freed before they end up
on the dst core list.

- Timo

Herbert Xu March 20, 2010, 12:49 p.m. UTC | #3
On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>
> But even if we set the gc_thresh back to 1024 or a similar size,
> it is still a good thing to do some basic gc on xfrm bundles, so
> that the underlying rtable dst's can be freed before they end up
> on the dst core list.

But I'm not sure a timer-based approach really makes sense.

If you're worried about stale entries preventing IPv4/IPv6 rt
entries from being collected, perhaps we should invoke the xfrm
GC process from the IPv4/IPv6 GC function?
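
A minimal sketch of that wiring, purely hypothetical: it assumes the
xfrm4_garbage_collect() dst_ops->gc hook (currently static in
net/ipv4/xfrm4_policy.c) is made callable from the rtable GC path:

static int rt_garbage_collect(struct dst_ops *ops)
{
	/* Proposed: reap stale xfrm bundles first, so the rtable
	 * entries they pin can be freed in the pass below. */
	xfrm4_garbage_collect(&xfrm4_dst_ops);

	/* ... existing rtable GC logic continues here ... */
	return 0;
}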

Cheers,
Timo Teras March 20, 2010, 12:54 p.m. UTC | #4
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:42:02PM +0200, Timo Teräs wrote:
>> But even if we set the gc_thresh back to 1024 or a similar size,
>> it is still a good thing to do some basic gc on xfrm bundles, so
>> that the underlying rtable dst's can be freed before they end up
>> on the dst core list.
> 
> But I'm not sure a timer-based approach really makes sense.
> 
> If you're worried about stale entries preventing IPv4/IPv6 rt
> entries from being collected, perhaps we should invoke the xfrm
> GC process from the IPv4/IPv6 GC function?

Those are more or less timer based too. I guess it would be
better to put it in the dst core's function then. It can see
hanging refs, and call the xfrm gc in that case.

To minimize even that, it would help a lot if xfrm_bundle_ok()
could release (or swap to dummies) the referenced dst->route
and dst->child entries. xfrm_bundle_ok() is mostly called for
all bundles regularly by xfrm_find_bundle() before new ones
are created.
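
Something like this, as a rough sketch (hypothetical helper; a real
version would need to swap in a dummy dst rather than clear the
pointers, since the bundle may still be walked afterwards):

static void xfrm_bundle_drop_routes(struct dst_entry *first)
{
	struct dst_entry *dst;

	/* Walk the bundle chain and drop the references that pin the
	 * underlying rtable entries, so those can be freed right away. */
	for (dst = first; dst; dst = dst->child) {
		if (dst->route) {
			dst_release(dst->route);
			dst->route = NULL;
		}
	}
}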

- Timo

Herbert Xu March 20, 2010, 1:13 p.m. UTC | #5
On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>
> Those are more or less timer based too. I guess it would be
> better to put it in the dst core's function then. It can see
> hanging refs, and call the xfrm gc in that case.

The IPv4 one appears to be invoked from dst_alloc just like
xfrm.

> To minimize even that, it would help a lot if xfrm_bundle_ok()
> could release (or swap to dummies) the referenced dst->route
> and dst->child entries. xfrm_bundle_ok() is mostly called for
> all bundles regularly by xfrm_find_bundle() before new ones
> are created.

Yes, I agree, that's what we should do once your other patch is
applied.

So every top-level xfrm_dst that's not referenced by some user
should sit in the flow cache. When it's needed and we find it
to be stale, we kick it out of the cache and free it along with
the rest of its constituents.
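
Roughly like this (hypothetical sketch: the 'bundle' field and the
helper are made up, only stale_bundle() and the dst refcounting exist
today):

static struct dst_entry *flow_cache_check_bundle(struct flow_cache_entry *fle)
{
	struct xfrm_dst *xdst = fle->bundle;

	if (!xdst)
		return NULL;
	if (stale_bundle(&xdst->u.dst)) {
		/* Stale: unlink from the cache and drop the reference,
		 * freeing the whole chain of constituent dst's with it. */
		fle->bundle = NULL;
		dst_release(&xdst->u.dst);
		return NULL;
	}
	return dst_clone(&xdst->u.dst);
}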

Cheers,
Neil Horman March 20, 2010, 1:26 p.m. UTC | #6
On Sat, Mar 20, 2010 at 08:32:48PM +0800, Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:15:41PM +0200, Timo Teras wrote:
> > [commit message snipped]
> 
> So why do we need this larger threshold in the first place? Neil?
My initial reasoning is here:
http://git.kernel.org/?p=linux/kernel/git/davem/net-next-2.6.git;a=commit;h=a33bc5c15154c835aae26f16e6a3a7d9ad4acb45
I'd had a bug report claiming that IPsec didn't scale to thousands of SAs, and
the reason turned out to be that the xfrm code capped its entry threshold at
1024 entries:
https://bugzilla.redhat.com/show_bug.cgi?id=253053

I exported the gc thresholds via sysctls, and then used the above commit to
select sane values, reasoning that we should be able to support as many tunnels
as routes when possible.

If it's too high, I have no qualms with lowering it, given that those who need
it can bump it up higher via the sysctl.

Regards
Neil

Timo Teras March 20, 2010, 1:34 p.m. UTC | #7
Herbert Xu wrote:
> On Sat, Mar 20, 2010 at 02:54:31PM +0200, Timo Teräs wrote:
>> Those are more or less timer based too. I guess it would be
>> better to put it in the dst core's function then. It can see
>> hanging refs, and call the xfrm gc in that case.
> 
> The IPv4 one appears to be invoked from dst_alloc just like
> xfrm.

Yes. But under normal circumstances that's not used. It also
has a separate timer-based gc that goes through the buckets
periodically. That is, e.g., how expired pmtu routes are kicked
out and how routes with an old genid are purged before we start
to reach the gc_thresh limit. This happens in rt_worker_func(),
which is called every ip_rt_gc_interval (defaults to one minute).
 
>> To minimize even that, it would help a lot if xfrm_bundle_ok()
>> could release (or swap to dummies) the referenced dst->route
>> and dst->child entries. xfrm_bundle_ok() is mostly called for
>> all bundles regularly by xfrm_find_bundle() before new ones
>> are created.
> 
> Yes, I agree, that's what we should do once your other patch is
> applied.
> 
> So every top-level xfrm_dst that's not referenced by some user
> should sit in the flow cache. When it's needed and we find it
> to be stale, we kick it out of the cache and free it along with
> the rest of its constituents.

Right. OK, just to get a road map of what I should do, in which
order, and how.

xfrm gc:
  - should the dst core call it?
  - should xfrm do it periodically?
  - or do we rely on new bundles being created and let
    xfrm[46]_find_bundle do the cleanup?
  - should xfrm_bundle_ok() release the inner dst's right away
    instead of waiting for any of the above to happen?

caching of bundles instead of policies for outgoing traffic:
  - should the flow cache be per-netns?
  - since it will now hold two types of objects, would it make
    sense to have virtual put and get callbacks instead of an
    atomic_t pointer? This way we can BUG_ON() if the refcount
    goes to zero unexpectedly (or call the appropriate destructor)
    and call the real _put function anyway (e.g. dst_release does
    WARN_ON stuff); the _get function can also do additional
    checking of whether the object is valid; this way the flow
    cache resolver output is always a valid object (see the
    sketch after this list)
  - resolver to have "resolve_fast" and "resolve_slow": if fast
    fails, the slow one gets called with BHs enabled so it can
    sleep, since bundle creation needs that.
  - it would probably then be useful to have a dummy xfrm_dst
    for the policy reference when the policy forbids traffic?
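
A rough sketch of what those virtualized cache entries could look like
(all names here are hypothetical, nothing like this exists yet):

/* Each object in the flow cache embeds an ops vector, so the cache
 * can take/put references and validate entries without knowing
 * whether it holds a policy or a bundle. */
struct flow_cache_object;

struct flow_cache_ops {
	struct flow_cache_object *(*get)(struct flow_cache_object *obj);
	int (*check)(struct flow_cache_object *obj);	/* still valid? */
	void (*put)(struct flow_cache_object *obj);	/* may BUG_ON() underflow */
};

struct flow_cache_object {
	const struct flow_cache_ops *ops;
};

The entry would be embedded in struct xfrm_policy or struct xfrm_dst and
recovered with container_of(), with each type supplying its own callbacks.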

- Timo


Patch

diff --git a/net/xfrm/xfrm_policy.c b/net/xfrm/xfrm_policy.c
index 843e066..f3f3d36 100644
--- a/net/xfrm/xfrm_policy.c
+++ b/net/xfrm/xfrm_policy.c
@@ -44,6 +44,9 @@  static struct xfrm_policy_afinfo *xfrm_policy_afinfo[NPROTO];
 
 static struct kmem_cache *xfrm_dst_cache __read_mostly;
 
+static int xfrm_gc_interval __read_mostly = 5 * 60 * HZ;
+static struct delayed_work expires_work;
+
 static HLIST_HEAD(xfrm_policy_gc_list);
 static DEFINE_SPINLOCK(xfrm_policy_gc_lock);
 
@@ -2203,6 +2206,16 @@  static int xfrm_flush_bundles(struct net *net)
 	return 0;
 }
 
+static void xfrm_gc_worker_func(struct work_struct *work)
+{
+	struct net *net;
+
+	for_each_net(net)
+		xfrm_flush_bundles(net);
+
+	schedule_delayed_work(&expires_work, xfrm_gc_interval);
+}
+
 static void xfrm_init_pmtu(struct dst_entry *dst)
 {
 	do {
@@ -2498,8 +2511,13 @@  static int __net_init xfrm_policy_init(struct net *net)
 
 	INIT_LIST_HEAD(&net->xfrm.policy_all);
 	INIT_WORK(&net->xfrm.policy_hash_work, xfrm_hash_resize);
-	if (net_eq(net, &init_net))
+	if (net_eq(net, &init_net)) {
 		register_netdevice_notifier(&xfrm_dev_notifier);
+
+		INIT_DELAYED_WORK_DEFERRABLE(&expires_work, xfrm_gc_worker_func);
+		schedule_delayed_work(&expires_work,
+			net_random() % xfrm_gc_interval + xfrm_gc_interval);
+	}
 	return 0;
 
 out_bydst: