From patchwork Mon May 31 21:21:36 2010
X-Patchwork-Submitter: Eric Dumazet
X-Patchwork-Id: 54134
X-Patchwork-Delegate: davem@davemloft.net
Subject: Re: DDoS attack causing bad effect on conntrack searches
From: Eric Dumazet
To: hawk@comx.dk
Cc: Jesper Dangaard Brouer, paulmck@linux.vnet.ibm.com, Patrick McHardy,
 Changli Gao, Linux Kernel Network Hackers, Netfilter Developers
In-Reply-To: <1272292568.13192.43.camel@jdb-workstation>
Date: Mon, 31 May 2010 23:21:36 +0200
Message-ID: <1275340896.2478.26.camel@edumazet-laptop>
X-Mailing-List: netdev@vger.kernel.org

On Monday, 26 April 2010 at 16:36 +0200, Jesper Dangaard Brouer wrote:
> On Sat, 2010-04-24 at 22:11 +0200, Eric Dumazet wrote:
> >
> > > Monday or Tuesday I'll do a test setup with some old HP380 G4 machines
> > > to see if I can reproduce the DDoS attack scenario, and see if I can
> > > get it into the lookup loop.
> >
> > Theoretically a loop is very unlikely, given that even a single retry is
> > very unlikely.
> >
> > Unless a cpu gets a corrupted value of a 'next' pointer into its cache.
> >
> ...
> >
> > With the same hash bucket size (300,032) and max conntracks (800,000),
> > and after more than 10 hours of testing, not a single lookup was
> > restarted because of a nulls with a wrong value.
>
> So far, I have to agree with you. I have now tested on the same type
> of hardware (although running a 64-bit kernel, off net-next-2.6),
> and the result is the same as yours: I don't see any restarts of the
> loop.
> The test system differs a bit, as it has two physical CPUs and 2M
> cache (and annoyingly the system insists on using HPET as its
> clocksource).
>
> I guess the only explanation would be a bad/sub-optimal hash
> distribution. With 40 kpps and 700,000 'searches' per second, the hash
> bucket list length "only" needs to be 17.5 elements on average, where
> the optimum is 3. With my pktgen DoS test, where I tried to reproduce
> the DoS attack, I only see a skew of 6 elements on average.
>
> > I can set up a test on a 16 cpu machine, multiqueue card too.
>
> I don't think that is necessary. My theory was that it was possible on
> a slower single-queue NIC, where one CPU is 100% busy in the conntrack
> search, and the other CPUs delete the entries (due to early drop and
> call_rcu()). But I guess that is not the case, and RCU works
> perfectly ;-)
>
> > Hmm, I forgot to say I am using net-next-2.6, not your kernel
> > version...
>
> I also did this test using net-next-2.6; perhaps I should try the
> version I use in production...

I had a look at current conntrack and found that the 'unconfirmed' list
might be a candidate for a potential blackhole. That is, if a reader
happens to hit an entry that is moved from a regular hash table slot
'hash' to the unconfirmed list, the reader might scan the whole
unconfirmed list before finding out it is no longer on the wanted hash
chain.

Problem is, this unconfirmed list might be very, very long in case of a
DDoS. It is really not designed to be scanned during a lookup.

So I guess we should stop early if we find an unconfirmed entry?
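As a sanity check on the numbers above, the chain lengths can be reproduced with two one-line helpers (a userspace back-of-the-envelope sketch, not kernel code; the figures are the ones quoted in the thread):

```c
#include <stdio.h>

/* Average entries walked per lookup: rate of the 'searched' counter
 * divided by the packet rate (700,000 / 40,000 = 17.5 in the thread). */
static double avg_chain_walked(double searches_per_sec, double packets_per_sec)
{
	return searches_per_sec / packets_per_sec;
}

/* Expected chain length under a uniform hash: total entries divided by
 * the number of buckets (800,000 / 300,032, roughly 2.67, which is the
 * "optimum is 3" figure). */
static double uniform_chain_len(double entries, double buckets)
{
	return entries / buckets;
}
```

With the thread's figures this gives 17.5 entries walked per lookup against a uniform expectation of about 2.67, which is what makes a sub-optimal hash distribution the likely explanation.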
---
diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..0573641 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 extern unsigned int nf_conntrack_htable_size;
 extern unsigned int nf_conntrack_max;
 
-#define NF_CT_STAT_INC(net, count)	\
+#define NF_CT_STAT_INC(net, count)		\
 	__this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_ADD(net, count, value)	\
+	__this_cpu_add((net)->ct.stat->count, value)
 #define NF_CT_STAT_INC_ATOMIC(net, count)	\
 do {						\
 	local_bh_disable();			\
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..e96d999 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(net, zone, tuple);
+	unsigned int cnt = 0;
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -309,10 +310,19 @@ begin:
 		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
 		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
+			NF_CT_STAT_ADD(net, searched, cnt);
 			local_bh_enable();
 			return h;
 		}
-		NF_CT_STAT_INC(net, searched);
+		/*
+		 * If we find an unconfirmed entry, restart the lookup to
+		 * avoid scanning whole unconfirmed list
+		 */
+		if (unlikely(++cnt > 8 &&
+			     !nf_ct_is_confirmed(nf_ct_tuplehash_to_ctrack(h)))) {
+			NF_CT_STAT_INC(net, search_restart);
+			goto begin;
+		}
 	}
 	/*
 	 * if the nulls value we got at the end of this lookup is
@@ -323,6 +333,7 @@ begin:
 		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
 	}
+	NF_CT_STAT_ADD(net, searched, cnt);
 	local_bh_enable();
 	return NULL;
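The heuristic in the second hunk can be sketched in plain userspace C on an ordinary singly-linked list (hypothetical struct and field names; the real code walks an hlist_nulls chain under RCU and restarts with 'goto begin' where this sketch returns NULL and bumps a counter):

```c
#include <stddef.h>

/* Hypothetical stand-in for a conntrack hash chain entry. */
struct entry {
	struct entry *next;
	int confirmed;		/* mirrors nf_ct_is_confirmed() */
	int key;
};

/*
 * Walk a chain looking for 'key'. Mirroring the patch: once more than
 * 8 nodes have been visited, hitting an unconfirmed entry is taken as
 * a sign the reader wandered onto the (possibly huge) unconfirmed
 * list, so this pass gives up and records a restart.
 */
static struct entry *lookup(struct entry *head, int key, int *restarts)
{
	struct entry *e;
	unsigned int cnt = 0;

	for (e = head; e != NULL; e = e->next) {
		if (e->key == key)
			return e;
		if (++cnt > 8 && !e->confirmed) {
			(*restarts)++;	/* kernel version: goto begin */
			return NULL;
		}
	}
	return NULL;
}
```

A healthy chain (average length around 3, per the discussion above) never reaches the threshold, so lookups over confirmed entries pay nothing extra; only a pathologically long walk through unconfirmed entries triggers the restart path.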