
[nf,v4] netfilter: conntrack: refine gc worker heuristics

Message ID 1478274898-24605-1-git-send-email-fw@strlen.de
State Accepted
Delegated to: Pablo Neira

Commit Message

Florian Westphal Nov. 4, 2016, 3:54 p.m. UTC
Nicolas Dichtel says:
  After commit b87a2f9199ea ("netfilter: conntrack: add gc worker to
  remove timed-out entries"), netlink conntrack deletion events may be
  sent with a huge delay.

Nicolas further points at this line:

  goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);

and indeed, this isn't optimal at all.  The rationale was to ensure that
we don't block other work items for too long, even if
nf_conntrack_htable_size is huge.  But to guarantee an upper bound on the
time a scan of the full conntrack table takes, each run should always
cover a fixed fraction of the table, so that after every N runs the full
table has been examined at least once.
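
To make the effect of the cap concrete, here is a minimal userspace sketch
comparing the old and new goal computation. The two constants are taken from
the patch context; runs_for_full_pass() is a hypothetical helper added only
for illustration, and the table size used below is an arbitrary example.

```c
#include <assert.h>

/* Constants as they appear in the patch; the cap is removed by this change. */
#define GC_MAX_BUCKETS_DIV	64u
#define GC_MAX_BUCKETS		8192u

/* Old goal: a fraction of the table, capped at GC_MAX_BUCKETS buckets. */
unsigned int old_goal(unsigned int htable_size)
{
	unsigned int frac = htable_size / GC_MAX_BUCKETS_DIV;

	return frac < GC_MAX_BUCKETS ? frac : GC_MAX_BUCKETS;
}

/* New goal: always a fixed fraction of the table, no upper bound. */
unsigned int new_goal(unsigned int htable_size)
{
	return htable_size / GC_MAX_BUCKETS_DIV;
}

/*
 * Hypothetical helper (not in the kernel): number of worker runs needed
 * to visit every bucket once, rounded up.  Assumes goal is nonzero.
 */
unsigned int runs_for_full_pass(unsigned int htable_size, unsigned int goal)
{
	assert(goal > 0);
	return (htable_size + goal - 1) / goal;
}
```

For a table of 1048576 buckets the old formula is capped at 8192 buckets per
run, so a full pass needs 128 runs; without the cap each run covers 16384
buckets and a full pass needs 64 runs regardless of table size.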

We also need to balance this vs. the case where the system is either idle
(i.e., conntrack table (almost) empty) or very busy (i.e. eviction happens
from packet path).

So, after some discussion with Nicolas:

1. want hard guarantee that we scan entire table at least once every X s
-> need to scan fraction of table (get rid of upper bound)

2. don't want to eat cycles on idle or very busy system
-> increase interval if we did not evict any entries

3. don't want to block other worker items for too long
-> make fraction really small, and prefer small scan interval instead

4. Want reasonably short time until we detect a timed-out entry when
system went idle after a burst of traffic, while not doing scans
all the time.
-> Store next gc scan in worker, increasing delays when no eviction
happened and shrinking delay when we see timed out entries.
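
The four points above can be sketched as a small userspace model of the delay
selection. This is an illustration only, not the kernel code: HZ is assumed to
be 1000, ONE_MSEC_TICKS stands in for msecs_to_jiffies(1) under that
assumption, and struct gc_state and gc_next_delay() are hypothetical names.

```c
#include <assert.h>

#define HZ			1000L		/* assumption: 1000 ticks per second */
#define GC_INTERVAL_MAX		(2 * HZ)	/* upper bound of scan intervals */
#define GC_MAX_EVICTS		256u		/* maximum evictions per run */
#define ONE_MSEC_TICKS		1L		/* models msecs_to_jiffies(1) at HZ=1000 */

/* Hypothetical stand-in for the next_gc_run field kept in the worker. */
struct gc_state {
	long next_gc_run;	/* remembered delay, in ticks */
};

/* Update the remembered delay and return the delay before the next run. */
long gc_next_delay(struct gc_state *st, unsigned int scanned,
		   unsigned int expired)
{
	unsigned int ratio = scanned ? expired * 100 / scanned : 0;

	assert(st);

	if (ratio >= 90 || expired == GC_MAX_EVICTS) {
		/* Lots of stale entries: rescan immediately. */
		st->next_gc_run = 0;
		return 0;
	}
	if (expired) {
		/* Some stale entries: halve the remembered delay, rescan soon. */
		st->next_gc_run /= 2;
		return ONE_MSEC_TICKS;
	}
	/* Nothing evicted: back off slowly, up to the maximum interval. */
	if (st->next_gc_run < GC_INTERVAL_MAX)
		st->next_gc_run += ONE_MSEC_TICKS;
	return st->next_gc_run;
}
```

Starting from next_gc_run = GC_INTERVAL_MAX, a run that evicts nothing keeps
the delay at the maximum, a run that evicts a few entries halves the stored
value and reschedules after one tick, and a run where at least 90% of the
scanned entries were expired resets the stored value to zero.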

The old gc interval becomes an upper bound; scans can now happen
every jiffy if stale entries are present.

Longest possible time period until an entry is evicted is now 2 minutes
in worst case (entry expires right after it was deemed 'not expired').
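
The 2 minute figure follows from the divisor and the maximum interval. A
rough sketch of the arithmetic, assuming every run waits the full maximum
interval and the table size is a multiple of the divisor;
worst_case_full_scan_sec() is a hypothetical helper, and scheduling latency
is ignored:

```c
#include <assert.h>

#define GC_MAX_BUCKETS_DIV	64u	/* each run scans 1/64 of the table */
#define GC_INTERVAL_MAX_SEC	2u	/* GC_INTERVAL_MAX is 2 * HZ, i.e. 2 seconds */

/*
 * Hypothetical helper: upper bound, in seconds, for one full pass over
 * the table when every run is delayed by the maximum interval.
 */
unsigned int worst_case_full_scan_sec(unsigned int div,
				      unsigned int interval_max_sec)
{
	assert(div > 0);
	return div * interval_max_sec;
}
```

With the values from the patch this gives 64 * 2 = 128 seconds, which is the
"2 minutes" quoted above.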

Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 v4: fix spelling of Nicolas' first name
     remove useless assignment

 v3:
   get rid of unused variable
   cap max interval at 2 seconds
   use 64 as divisor to give 2 minute worst-case behaviour

Comments

Nicolas Dichtel Nov. 4, 2016, 4:16 p.m. UTC | #1
On 04/11/2016 at 16:54, Florian Westphal wrote:
> Reported-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>


Thank you,
Nicolas
--
To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Pablo Neira Ayuso Nov. 8, 2016, 11:04 p.m. UTC | #2
On Fri, Nov 04, 2016 at 04:54:58PM +0100, Florian Westphal wrote:

Applied, thanks a lot Florian.

Tuning this is the hard part of this new workqueue approach IMO; it
should be fine if we need more iterations to revisit this logic.

Patch

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index df2f5a3901df..0f87e5d21be7 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -76,6 +76,7 @@  struct conntrack_gc_work {
 	struct delayed_work	dwork;
 	u32			last_bucket;
 	bool			exiting;
+	long			next_gc_run;
 };
 
 static __read_mostly struct kmem_cache *nf_conntrack_cachep;
@@ -83,9 +84,11 @@  static __read_mostly spinlock_t nf_conntrack_locks_all_lock;
 static __read_mostly DEFINE_SPINLOCK(nf_conntrack_locks_all_lock);
 static __read_mostly bool nf_conntrack_locks_all;
 
+/* every gc cycle scans at most 1/GC_MAX_BUCKETS_DIV part of table */
 #define GC_MAX_BUCKETS_DIV	64u
-#define GC_MAX_BUCKETS		8192u
-#define GC_INTERVAL		(5 * HZ)
+/* upper bound of scan intervals */
+#define GC_INTERVAL_MAX		(2 * HZ)
+/* maximum conntracks to evict per gc run */
 #define GC_MAX_EVICTS		256u
 
 static struct conntrack_gc_work conntrack_gc_work;
@@ -936,13 +939,13 @@  static noinline int early_drop(struct net *net, unsigned int _hash)
 static void gc_worker(struct work_struct *work)
 {
 	unsigned int i, goal, buckets = 0, expired_count = 0;
-	unsigned long next_run = GC_INTERVAL;
-	unsigned int ratio, scanned = 0;
 	struct conntrack_gc_work *gc_work;
+	unsigned int ratio, scanned = 0;
+	unsigned long next_run;
 
 	gc_work = container_of(work, struct conntrack_gc_work, dwork.work);
 
-	goal = min(nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV, GC_MAX_BUCKETS);
+	goal = nf_conntrack_htable_size / GC_MAX_BUCKETS_DIV;
 	i = gc_work->last_bucket;
 
 	do {
@@ -982,17 +985,47 @@  static void gc_worker(struct work_struct *work)
 	if (gc_work->exiting)
 		return;
 
+	/*
+	 * Eviction will normally happen from the packet path, and not
+	 * from this gc worker.
+	 *
+	 * This worker is only here to reap expired entries when system went
+	 * idle after a busy period.
+	 *
+	 * The heuristics below are supposed to balance conflicting goals:
+	 *
+	 * 1. Minimize time until we notice a stale entry
+	 * 2. Maximize scan intervals to not waste cycles
+	 *
+	 * Normally, expired_count will be 0, this increases the next_run time
+	 * to prioritize 2) above.
+	 *
+	 * As soon as a timed-out entry is found, move towards 1) and increase
+	 * the scan frequency.
+	 * In case we have lots of evictions next scan is done immediately.
+	 */
 	ratio = scanned ? expired_count * 100 / scanned : 0;
-	if (ratio >= 90 || expired_count == GC_MAX_EVICTS)
+	if (ratio >= 90 || expired_count == GC_MAX_EVICTS) {
+		gc_work->next_gc_run = 0;
 		next_run = 0;
+	} else if (expired_count) {
+		gc_work->next_gc_run /= 2U;
+		next_run = msecs_to_jiffies(1);
+	} else {
+		if (gc_work->next_gc_run < GC_INTERVAL_MAX)
+			gc_work->next_gc_run += msecs_to_jiffies(1);
+
+		next_run = gc_work->next_gc_run;
+	}
 
 	gc_work->last_bucket = i;
-	schedule_delayed_work(&gc_work->dwork, next_run);
+	queue_delayed_work(system_long_wq, &gc_work->dwork, next_run);
 }
 
 static void conntrack_gc_work_init(struct conntrack_gc_work *gc_work)
 {
 	INIT_DELAYED_WORK(&gc_work->dwork, gc_worker);
+	gc_work->next_gc_run = GC_INTERVAL_MAX;
 	gc_work->exiting = false;
 }
 
@@ -1885,7 +1918,7 @@  int nf_conntrack_init_start(void)
 	nf_ct_untracked_status_or(IPS_CONFIRMED | IPS_UNTRACKED);
 
 	conntrack_gc_work_init(&conntrack_gc_work);
-	schedule_delayed_work(&conntrack_gc_work.dwork, GC_INTERVAL);
+	queue_delayed_work(system_long_wq, &conntrack_gc_work.dwork, GC_INTERVAL_MAX);
 
 	return 0;