
[stable,3.4,1/2] ipv4: move route garbage collector to work queue

Message ID fc0168a71c064457c18f2563371176380d0012c7.1407796232.git.mleitner@redhat.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Marcelo Leitner Aug. 11, 2014, 10:41 p.m. UTC
Currently the route garbage collector gets called by dst_alloc() if it
has more entries than the threshold. But it's an expensive call that
doesn't really need to be made at that point.

Another issue with the current approach is that it allows the garbage
collector to run with the same start parameters on multiple CPUs at
once, which is not optimal. A system may even soft lockup if the cache
is big enough, as the garbage collectors will be fighting over the hash
chain locks.

This patch thus moves the garbage collector to run asynchronously on a
work queue, much like how rt_check_expire runs.

There is one condition left that allows multiple executions, which is
handled by the next patch.

Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Hannes Frederic Sowa <hannes@redhat.com>
---

Notes:
    Hi,
    
    This set is needed for stable kernels <= 3.4, as the IPv4 route cache
    was removed after that.
    
    Thanks

 net/ipv4/route.c | 43 +++++++++++++++++++++++++++++--------------
 1 file changed, 29 insertions(+), 14 deletions(-)
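
For context, in these kernels dst_alloc() invokes the gc callback
synchronously as soon as the entry count crosses the threshold, roughly
like this (net/core/dst.c, quoted from memory, so treat as approximate):

	if (ops->gc && dst_entries_get_fast(ops) > ops->gc_thresh) {
		if (ops->gc(ops))	/* rt_garbage_collect() for IPv4 */
			return NULL;
	}

This synchronous, caller-visible invocation is what the patch below
defers to a work queue.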

Comments

Hannes Frederic Sowa Aug. 12, 2014, 6:50 p.m. UTC | #1
On Mo, 2014-08-11 at 19:41 -0300, Marcelo Ricardo Leitner wrote:
> Currently the route garbage collector gets called by dst_alloc() if it
> has more entries than the threshold. But it's an expensive call that
> doesn't really need to be made at that point.
> 
> Another issue with the current approach is that it allows the garbage
> collector to run with the same start parameters on multiple CPUs at
> once, which is not optimal. A system may even soft lockup if the cache
> is big enough, as the garbage collectors will be fighting over the hash
> chain locks.
> 
> This patch thus moves the garbage collector to run asynchronously on a
> work queue, much like how rt_check_expire runs.
> 
> There is one condition left that allows multiple executions, which is
> handled by the next patch.
> 
> Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> Cc: Hannes Frederic Sowa <hannes@redhat.com>

Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>

Thanks,
Hannes


Eric Dumazet Aug. 12, 2014, 8:23 p.m. UTC | #2
On Tue, 2014-08-12 at 20:50 +0200, Hannes Frederic Sowa wrote:
> On Mo, 2014-08-11 at 19:41 -0300, Marcelo Ricardo Leitner wrote:
> > Currently the route garbage collector gets called by dst_alloc() if it
> > has more entries than the threshold. But it's an expensive call that
> > doesn't really need to be made at that point.
> > 
> > Another issue with the current approach is that it allows the garbage
> > collector to run with the same start parameters on multiple CPUs at
> > once, which is not optimal. A system may even soft lockup if the cache
> > is big enough, as the garbage collectors will be fighting over the hash
> > chain locks.
> > 
> > This patch thus moves the garbage collector to run asynchronously on a
> > work queue, much like how rt_check_expire runs.
> > 
> > There is one condition left that allows multiple executions, which is
> > handled by the next patch.
> > 
> > Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> > Cc: Hannes Frederic Sowa <hannes@redhat.com>
> 
> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>


This does not look like stable material.

One can always disable the route cache in 3.4 kernels.



Hannes Frederic Sowa Aug. 12, 2014, 9:41 p.m. UTC | #3
Hi Eric,

On Di, 2014-08-12 at 13:23 -0700, Eric Dumazet wrote:
> On Tue, 2014-08-12 at 20:50 +0200, Hannes Frederic Sowa wrote:
> > On Mo, 2014-08-11 at 19:41 -0300, Marcelo Ricardo Leitner wrote:
> > > Currently the route garbage collector gets called by dst_alloc() if it
> > > has more entries than the threshold. But it's an expensive call that
> > > doesn't really need to be made at that point.
> > > 
> > > Another issue with the current approach is that it allows the garbage
> > > collector to run with the same start parameters on multiple CPUs at
> > > once, which is not optimal. A system may even soft lockup if the cache
> > > is big enough, as the garbage collectors will be fighting over the hash
> > > chain locks.
> > > 
> > > This patch thus moves the garbage collector to run asynchronously on a
> > > work queue, much like how rt_check_expire runs.
> > > 
> > > There is one condition left that allows multiple executions, which is
> > > handled by the next patch.
> > > 
> > > Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
> > > Cc: Hannes Frederic Sowa <hannes@redhat.com>
> > 
> > Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> 
> 
> This does not look like stable material.

We hesitated to send those out at first, too.

We had a machine being brought down by production traffic while using
TPROXY. The routing cache, while still having a relatively good hit
ratio, was filled with combinations of source and destination addresses.
Multiple GCs running and trying to grab the same per-chain spin_lock
caused a complete lockup of the machine. That's why we submitted those
patches for review in the end.

> One can always disable the route cache in 3.4 kernels.

Sure, but we didn't like the fact that it is possible to bring down the
machine in the first place.

Thanks,
Hannes


David Miller Aug. 12, 2014, 10:42 p.m. UTC | #4
From: Hannes Frederic Sowa <hannes@redhat.com>
Date: Tue, 12 Aug 2014 23:41:32 +0200

> Hi Eric,
> 
> On Di, 2014-08-12 at 13:23 -0700, Eric Dumazet wrote:
>> On Tue, 2014-08-12 at 20:50 +0200, Hannes Frederic Sowa wrote:
>> > On Mo, 2014-08-11 at 19:41 -0300, Marcelo Ricardo Leitner wrote:
>> > > Currently the route garbage collector gets called by dst_alloc() if it
>> > > has more entries than the threshold. But it's an expensive call that
>> > > doesn't really need to be made at that point.
>> > > 
>> > > Another issue with the current approach is that it allows the garbage
>> > > collector to run with the same start parameters on multiple CPUs at
>> > > once, which is not optimal. A system may even soft lockup if the cache
>> > > is big enough, as the garbage collectors will be fighting over the hash
>> > > chain locks.
>> > > 
>> > > This patch thus moves the garbage collector to run asynchronously on a
>> > > work queue, much like how rt_check_expire runs.
>> > > 
>> > > There is one condition left that allows multiple executions, which is
>> > > handled by the next patch.
>> > > 
>> > > Signed-off-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
>> > > Cc: Hannes Frederic Sowa <hannes@redhat.com>
>> > 
>> > Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
>> 
>> 
>> This does not look like stable material.
> 
> We hesitated to send those out at first, too.
> 
> We had a machine being brought down by production traffic while using
> TPROXY. The routing cache, while still having a relatively good hit
> ratio, was filled with combinations of source and destination addresses.
> Multiple GCs running and trying to grab the same per-chain spin_lock
> caused a complete lockup of the machine. That's why we submitted those
> patches for review in the end.
> 
>> One can always disable the route cache in 3.4 kernels.
> 
> Sure, but we didn't like the fact that it is possible to bring down the
> machine in the first place.

I think I can handle this first patch for 3.4/3.2 -stable; it is very
straightforward and deals with what are actually purely asynchronous
invocations of garbage collection anyway.

Although this needs to be reformatted properly:

+	if (dst_entries_get_fast(&ipv4_dst_ops) >= ip_rt_max_size ||
+		dst_entries_get_slow(&ipv4_dst_ops) >= ip_rt_max_size) {

The second line needs to start at the first column after the opening
parenthesis of the if () statement.
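
Reformatted that way, the condition would read:

	if (dst_entries_get_fast(&ipv4_dst_ops) >= ip_rt_max_size ||
	    dst_entries_get_slow(&ipv4_dst_ops) >= ip_rt_max_size) {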

The second patch, on the other hand, needs some more thought.  It is
changing behavior, in that cases that would have succeeded in the past
will now potentially fail only because the neighbour cache limits were
hit at an unlucky moment (when an async GC was already running).

If this happens from a software interrupt, we'll fail instantly
because attempts starts at 0.
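
(For reference, the retry counter in rt_intern_hash() is initialized
roughly as follows in 3.4, quoted from memory:

	int attempts = !in_softirq();

so a caller in softirq context gets no retry at all.)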

Just bite the bullet and put a spinlock around the GC operation.

The async GCs are normal on a loaded machine, whereas the neighbour
tables filling up is much less so.  I think having the neighbour
overflow path synchronize with async GCs is therefore not going to be
a real problem in practice.

Hannes Frederic Sowa Aug. 12, 2014, 11:11 p.m. UTC | #5
On Wed, Aug 13, 2014, at 00:42, David Miller wrote:
> The second patch, on the other hand, needs some more thought.  It is
> changing behavior, in that cases that would have succeeded in the past
> will now potentially fail only because the neighbour cache limits were
> hit at an unlucky moment (when an async GC was already running).
> 
> If this happens from a software interrupt, we'll fail instantly
> because attempts starts at 0.

Hmhmhm, a truly valid concern.

> Just bite the bullet and put a spinlock around the GC operation.

We had a spinlock around the GC operation at first but were still able
to cause softlockups; I don't remember if a complete lockup happened.

> The async GCs are normal on a loaded machine, whereas the neighbour
> tables filling up is much less so.  I think having the neighbour
> overflow path synchronize with async GCs is therefore not going to be
> a real problem in practice.

Thanks for your comments, we'll look into it. It should be no problem
to conditionally synchronize with the neighbour overflow path.

Thanks,
Hannes
David Miller Aug. 13, 2014, 12:46 a.m. UTC | #6
From: Hannes Frederic Sowa <hannes@stressinduktion.org>
Date: Wed, 13 Aug 2014 01:11:14 +0200

> On Wed, Aug 13, 2014, at 00:42, David Miller wrote:
>> Just bite the bullet and put a spinlock around the GC operation.
> 
> We had a spinlock around the GC operation at first but were still
> able to cause softlockups; I don't remember if a complete lockup
> happened.

This much I understand.

What I'm saying is retain this patch #1, in order to make the normal
GCs run asynchronously via a work queue, but on top of that put a
spinlock around the GC function.

> It should be no problem to conditionally synchronize with the
> neighbour overflow path.

Once you have the spinlock around the GC it should be just fine.

If we hit the neighbour table overflow condition when an async GC is
operating, the direct GC call from the neighbour table overflow path
will just spin on the lock waiting for the async GC to complete.
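
A minimal sketch of that arrangement, with an illustrative lock name
rather than the final patch:

	static DEFINE_SPINLOCK(rt_gc_lock);

	static void __do_rt_garbage_collect(int elasticity, int min_interval)
	{
		spin_lock(&rt_gc_lock);
		/* ... existing scan and eviction of the hash chains ... */
		spin_unlock(&rt_gc_lock);
	}

Both the work queue callback and the neighbour-overflow path then
funnel through the same lock, so at most one GC scan runs at a time.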
Hannes Frederic Sowa Aug. 13, 2014, 1:50 a.m. UTC | #7
Hi David,

On Wed, Aug 13, 2014, at 02:46, David Miller wrote:
> From: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Date: Wed, 13 Aug 2014 01:11:14 +0200
> 
> > On Wed, Aug 13, 2014, at 00:42, David Miller wrote:
> >> Just bite the bullet and put a spinlock around the GC operation.
> > 
> > We had a spinlock around the GC operation at first but were still
> > able to cause softlockups; I don't remember if a complete lockup
> > happened.
> 
> This much I understand.
> 
> What I'm saying is retain this patch #1, in order to make the normal
> GCs run asynchronously via a work queue, but on top of that put a
> spinlock around the GC function.

I understood you correctly. ;)

The spin_lock would need to be _bh protected, as it can be taken from
softirq context as well as from the workqueue. IIRC that was the
problem. I'll follow up with Marcelo to find a solution for this.
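
That is, something along these lines (illustrative only, assuming the
same rt_gc_lock sketch as above):

	spin_lock_bh(&rt_gc_lock);
	/* ... GC scan ... */
	spin_unlock_bh(&rt_gc_lock);

Otherwise a softirq arriving on the same CPU could deadlock against the
lock holder it interrupted.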

Thanks,
Hannes

Patch

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 90bc88bbd2a4f73304671e0182df72f450fa901a..2ad1cdb2c9e45f5ec5a6c1fa9b781657b1c292ec 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -150,6 +150,9 @@  static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
 static int rt_garbage_collect(struct dst_ops *ops);
 
+static void __rt_garbage_collect(struct work_struct *w);
+static DECLARE_WORK(rt_gc_worker, __rt_garbage_collect);
+
 static void ipv4_dst_ifdown(struct dst_entry *dst, struct net_device *dev,
 			    int how)
 {
@@ -977,7 +980,7 @@  static void rt_emergency_hash_rebuild(struct net *net)
    and when load increases it reduces to limit cache size.
  */
 
-static int rt_garbage_collect(struct dst_ops *ops)
+static void __do_rt_garbage_collect(int elasticity, int min_interval)
 {
 	static unsigned long expire = RT_GC_TIMEOUT;
 	static unsigned long last_gc;
@@ -996,7 +999,7 @@  static int rt_garbage_collect(struct dst_ops *ops)
 
 	RT_CACHE_STAT_INC(gc_total);
 
-	if (now - last_gc < ip_rt_gc_min_interval &&
+	if (now - last_gc < min_interval &&
 	    entries < ip_rt_max_size) {
 		RT_CACHE_STAT_INC(gc_ignored);
 		goto out;
@@ -1004,7 +1007,7 @@  static int rt_garbage_collect(struct dst_ops *ops)
 
 	entries = dst_entries_get_slow(&ipv4_dst_ops);
 	/* Calculate number of entries, which we want to expire now. */
-	goal = entries - (ip_rt_gc_elasticity << rt_hash_log);
+	goal = entries - (elasticity << rt_hash_log);
 	if (goal <= 0) {
 		if (equilibrium < ipv4_dst_ops.gc_thresh)
 			equilibrium = ipv4_dst_ops.gc_thresh;
@@ -1021,7 +1024,7 @@  static int rt_garbage_collect(struct dst_ops *ops)
 		equilibrium = entries - goal;
 	}
 
-	if (now - last_gc >= ip_rt_gc_min_interval)
+	if (now - last_gc >= min_interval)
 		last_gc = now;
 
 	if (goal <= 0) {
@@ -1086,15 +1089,33 @@  static int rt_garbage_collect(struct dst_ops *ops)
 	if (net_ratelimit())
 		pr_warn("dst cache overflow\n");
 	RT_CACHE_STAT_INC(gc_dst_overflow);
-	return 1;
+	return;
 
 work_done:
-	expire += ip_rt_gc_min_interval;
+	expire += min_interval;
 	if (expire > ip_rt_gc_timeout ||
 	    dst_entries_get_fast(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh ||
 	    dst_entries_get_slow(&ipv4_dst_ops) < ipv4_dst_ops.gc_thresh)
 		expire = ip_rt_gc_timeout;
-out:	return 0;
+out:	return;
+}
+
+static void __rt_garbage_collect(struct work_struct *w)
+{
+	__do_rt_garbage_collect(ip_rt_gc_elasticity, ip_rt_gc_min_interval);
+}
+
+static int rt_garbage_collect(struct dst_ops *ops)
+{
+	if (!work_pending(&rt_gc_worker))
+		schedule_work(&rt_gc_worker);
+
+	if (dst_entries_get_fast(&ipv4_dst_ops) >= ip_rt_max_size ||
+		dst_entries_get_slow(&ipv4_dst_ops) >= ip_rt_max_size) {
+		RT_CACHE_STAT_INC(gc_dst_overflow);
+		return 1;
+	}
+	return 0;
 }
 
 /*
@@ -1288,13 +1309,7 @@  restart:
 			   it is most likely it holds some neighbour records.
 			 */
 			if (attempts-- > 0) {
-				int saved_elasticity = ip_rt_gc_elasticity;
-				int saved_int = ip_rt_gc_min_interval;
-				ip_rt_gc_elasticity	= 1;
-				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect(&ipv4_dst_ops);
-				ip_rt_gc_min_interval	= saved_int;
-				ip_rt_gc_elasticity	= saved_elasticity;
+				__do_rt_garbage_collect(1, 0);
 				goto restart;
 			}