
DDoS attack causing bad effect on conntrack searches

Message ID 1275340896.2478.26.camel@edumazet-laptop
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Eric Dumazet May 31, 2010, 9:21 p.m. UTC
On Monday, April 26, 2010 at 16:36 +0200, Jesper Dangaard Brouer wrote:
> On Sat, 2010-04-24 at 22:11 +0200, Eric Dumazet wrote:
> >  
> > > Monday or Tuesday I'll do a test setup with some old HP380 G4 machines to
> > > see if I can reproduce the DDoS attack scenario.  And see if I can get
> > > it into the lookup loop.
> > 
> > Theoretically a loop is very unlikely, given a single retry is very
> > unlikely too.
> > 
> > Unless a cpu gets in its cache a corrupted value of a 'next' pointer.
> > 
> ...
> >
> > With the same hash bucket size (300.032) and max conntracks (800.000), and
> > after more than 10 hours of testing, not a single lookup was restarted
> > because of a nulls marker with a wrong value.
> 
> So far, I have to agree with you.  I have now tested on the same type
> of hardware (although running a 64-bit kernel, and off net-next-2.6),
> and the result is the same as yours: I don't see any restarts of the
> loop.  The test system differs a bit, as it has two physical CPUs and 2M
> cache (and annoyingly the system insists on using HPET as clocksource).
> 
> I guess the only explanation would be bad/sub-optimal hash distribution.
> With 40 kpps and 700.000 'searches' per second, the hash bucket list
> length "only" needs to be 17.5 elements on average, where the optimum is 3.
> With my pktgen DoS test, where I tried to reproduce the DoS attack, I
> only see a skew of 6 elements on average.
> 
> 
> > I can setup a test on a 16 cpu machine, multiqueue card too.
> 
> I don't think that is necessary.  My theory was that it was possible on a
> slower single-queue NIC, where one CPU is 100% busy in the conntrack search,
> and the other CPUs delete the entries (due to early drop and
> call_rcu()).  But I guess that's not the case, and RCU works perfectly ;-)
> 
> > Hmm, I forgot to say I am using net-next-2.6, not your kernel version...
> 
> I also did this test using net-next-2.6, perhaps I should try the
> version I use in production...
> 
> 


I had a look at the current conntrack code and found the 'unconfirmed' list
might be a candidate for a potential blackhole.

That is, if a reader happens to hit an entry that was moved from a regular
hash table slot 'hash' to the unconfirmed list, the reader might scan the
whole unconfirmed list before finding out it is no longer on the wanted
hash chain.
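For reference, here is a simplified, hand-trimmed sketch of the lockless
lookup pattern __nf_conntrack_find() uses (the function name and the
omission of the zone check and stats are mine):

static struct nf_conntrack_tuple_hash *
lookup_sketch(struct net *net, unsigned int hash,
	      const struct nf_conntrack_tuple *tuple)
{
	struct nf_conntrack_tuple_hash *h;
	struct hlist_nulls_node *n;

begin:
	hlist_nulls_for_each_entry_rcu(h, n, &net->ct.hash[hash], hnnode) {
		if (nf_ct_tuple_equal(tuple, &h->tuple))
			return h;
	}
	/*
	 * The nulls value stored at the end of each chain encodes the
	 * bucket number.  A mismatch means the object we were walking was
	 * freed and reused on another list (SLAB_DESTROY_BY_RCU), so we
	 * restart.  Nothing, however, bounds how many entries of that
	 * other list -- possibly the unconfirmed list -- we traverse
	 * before reaching its end marker.
	 */
	if (get_nulls_value(n) != hash)
		goto begin;
	return NULL;
}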

The problem is this unconfirmed list might be very long in case of a DDoS.
It's really not designed to be scanned during a lookup.

So I guess we should stop early if we find an unconfirmed entry?






Comments

Changli Gao June 1, 2010, 12:28 a.m. UTC | #1
On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> I had a look at current conntrack and found the 'unconfirmed' list was
> maybe a candidate for a potential blackhole.
>
> That is, if a reader happens to hit an entry that is moved from regular
> hash table slot 'hash' to unconfirmed list,

Sorry, but I can't find where we do these things. The unconfirmed list is
used to track the unconfirmed cts, whose corresponding skbs are still
on the path from the first to the last netfilter hook. As soon as the
skbs end their travel in netfilter, the corresponding cts will be
confirmed (moving the ct from the unconfirmed list to the regular hash table).

The unconfirmed list should be small, as network receiving is in BH.
How about implementing the unconfirmed list as a per-cpu variable?
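Something like this untested sketch, with made-up names and the per-netns
handling ignored:

struct ct_pcpu_unconfirmed {
	spinlock_t		lock;
	struct hlist_nulls_head	list;
};

static DEFINE_PER_CPU(struct ct_pcpu_unconfirmed, ct_unconfirmed);

/* Caller runs with BH disabled, as in the conntrack hot path. */
static void ct_add_unconfirmed(struct nf_conn *ct)
{
	struct ct_pcpu_unconfirmed *pcpu = this_cpu_ptr(&ct_unconfirmed);

	/* The entry would also have to remember its owning CPU, so that
	 * confirmation or cleanup running on another CPU can find the
	 * right list to unlink it from.
	 */
	spin_lock(&pcpu->lock);
	hlist_nulls_add_head(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode,
			     &pcpu->list);
	spin_unlock(&pcpu->lock);
}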

> reader might scan whole
> unconfirmed list to find out he is not anymore on the wanted hash chain.
>
> Problem is this unconfirmed list might be very very long in case of
> DDOS. It's really not designed to be scanned during a lookup.
>
> So I guess we should stop early if we find an unconfirmed entry ?
>
>
>
> diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
> index bde095f..0573641 100644
> --- a/include/net/netfilter/nf_conntrack.h
> +++ b/include/net/netfilter/nf_conntrack.h
> @@ -298,8 +298,10 @@ extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
>  extern unsigned int nf_conntrack_htable_size;
>  extern unsigned int nf_conntrack_max;
>
> -#define NF_CT_STAT_INC(net, count)     \
> +#define NF_CT_STAT_INC(net, count)             \
>        __this_cpu_inc((net)->ct.stat->count)
> +#define NF_CT_STAT_ADD(net, count, value)      \
> +       __this_cpu_add((net)->ct.stat->count, value)
>  #define NF_CT_STAT_INC_ATOMIC(net, count)              \
>  do {                                                   \
>        local_bh_disable();                             \
> diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
> index eeeb8bc..e96d999 100644
> --- a/net/netfilter/nf_conntrack_core.c
> +++ b/net/netfilter/nf_conntrack_core.c
> @@ -299,6 +299,7 @@ __nf_conntrack_find(struct net *net, u16 zone,
>        struct nf_conntrack_tuple_hash *h;
>        struct hlist_nulls_node *n;
>        unsigned int hash = hash_conntrack(net, zone, tuple);
> +       unsigned int cnt = 0;
>
>        /* Disable BHs the entire time since we normally need to disable them
>         * at least once for the stats anyway.
> @@ -309,10 +310,19 @@ begin:
>                if (nf_ct_tuple_equal(tuple, &h->tuple) &&
>                    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
>                        NF_CT_STAT_INC(net, found);
> +                       NF_CT_STAT_ADD(net, searched, cnt);
>                        local_bh_enable();
>                        return h;
>                }
> -               NF_CT_STAT_INC(net, searched);
> +               /*
> +                * If we find an unconfirmed entry, restart the lookup to
> +                * avoid scanning whole unconfirmed list
> +                */
> +               if (unlikely(++cnt > 8 &&
> +                            !nf_ct_is_confirmed(nf_ct_tuplehash_to_ctrack(h)))) {
> +                       NF_CT_STAT_INC(net, search_restart);
> +                       goto begin;
> +               }
>        }
>        /*
>         * if the nulls value we got at the end of this lookup is
> @@ -323,6 +333,7 @@ begin:
>                NF_CT_STAT_INC(net, search_restart);
>                goto begin;
>        }
> +       NF_CT_STAT_ADD(net, searched, cnt);
>        local_bh_enable();
>
>        return NULL;
>
>
>
Eric Dumazet June 1, 2010, 5:05 a.m. UTC | #2
On Tuesday, June 1, 2010 at 08:28 +0800, Changli Gao wrote:
> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > I had a look at current conntrack and found the 'unconfirmed' list was
> > maybe a candidate for a potential blackhole.
> >
> > That is, if a reader happens to hit an entry that is moved from regular
> > hash table slot 'hash' to unconfirmed list,
> 
> Sorry, but I can't find where we do this things. unconfirmed list is
> used to track the unconfirmed cts, whose corresponding skbs are still
> in path from the first to the last netfilter hooks. As soon as the
> skbs end their travel in netfilter, the corresponding cts will be
> confirmed(moving ct from unconfirmed list to regular hash table).
> 

So netfilter is a monolithic thing.

When a packet begins its travel into netfilter, do you guarantee that no
other packet can also begin its travel and find an unconfirmed
conntrack?

I wonder why we use atomic ops then to track the confirmed bit :)


> unconfirmed list should be small, as networking receiving is in BH.

So according to you, netfilter/ct runs only in the input path?

So I assume a packet is handled by CPU X, creates a new conntrack
(possibly early-dropping an old entry that was previously in a standard
hash chain), which is then inserted in the unconfirmed list. _You_ guarantee
another CPU Y, handling another packet, possibly sent by a hacker reading
your netdev mails, cannot find the conntrack that was early-dropped?

> How about implementing unconfirmed list as a per cpu variable?

I first implemented such a patch to reduce cache line contention, then I
asked myself: what exactly is an unconfirmed conntrack? Can their
number be unbounded? If yes, we have a problem, even on a two-CPU
machine. Using two lists instead of one won't solve the fundamental
problem.

The real question is why we need this unconfirmed 'list' in the
first place. Is it really a private per-CPU thing? Can you prove this
with respect to lockless lookups, and things like NFQUEUE?

Each conntrack object has two list anchors: one for IP_CT_DIR_ORIGINAL,
one for IP_CT_DIR_REPLY.

The unconfirmed list uses the first anchor. This means another CPU can
definitely find an unconfirmed item in a regular hash chain, since we
don't respect an RCU grace period before re-using an object.

If memory were not a problem, we would probably use a third anchor to
avoid this, or regular RCU instead of the SLAB_DESTROY_BY_RCU variant.
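
For reference, paraphrasing the relevant part of struct nf_conn (field
comments and omissions are mine):

struct nf_conn {
	/* Usage count, lifetime handling (SLAB_DESTROY_BY_RCU slab). */
	struct nf_conntrack ct_general;

	/* The two anchors: tuplehash[IP_CT_DIR_ORIGINAL].hnnode is used
	 * for the unconfirmed list until the entry is confirmed, and
	 * becomes a hash chain anchor afterwards;
	 * tuplehash[IP_CT_DIR_REPLY].hnnode is only ever a hash anchor.
	 */
	struct nf_conntrack_tuple_hash tuplehash[IP_CT_MAX];

	/* IPS_CONFIRMED_BIT and friends live here. */
	unsigned long status;

	/* ... remaining fields omitted ... */
};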



Changli Gao June 1, 2010, 5:48 a.m. UTC | #3
On Tue, Jun 1, 2010 at 1:05 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Tuesday, June 1, 2010 at 08:28 +0800, Changli Gao wrote:
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >
>> > I had a look at current conntrack and found the 'unconfirmed' list was
>> > maybe a candidate for a potential blackhole.
>> >
>> > That is, if a reader happens to hit an entry that is moved from regular
>> > hash table slot 'hash' to unconfirmed list,
>>
>> Sorry, but I can't find where we do this things. unconfirmed list is
>> used to track the unconfirmed cts, whose corresponding skbs are still
>> in path from the first to the last netfilter hooks. As soon as the
>> skbs end their travel in netfilter, the corresponding cts will be
>> confirmed(moving ct from unconfirmed list to regular hash table).
>>
>
> So netfilter is a monolithic thing.
>
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack ?
>
> I wonder why we use atomic ops then to track the confirmed bit :)

Seems there is no need.

>
>
>> unconfirmed list should be small, as networking receiving is in BH.
>
> So according to you, netfilter/ct runs only in input path ?

No, there are other entrances: local out and nf_reinject. If there
aren't any packets queued, since netfilter runs in atomic context, the
number of unconfirmed cts should be small (at most 2 * nr_cpu?).

>
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early droping an old entry that was previously in a standard
> hash chain), inserted in unconfirmed list.
>

Oh, Thanks, I got it.

> _You_ guarantee another CPU
> Y, handling another packet, possibly sent by a hacker reading your
> netdev mails, cannot find the conntrack that was early dropped ?
>
>> How about implementing unconfirmed list as a per cpu variable?
>
> I first implemented such a patch to reduce cache line contention, then I
> asked to myself : What is exactly an unconfirmed conntrack ? Can their
> number be unbounded ? If yes, we have a problem, even on a two cpus
> machine. Using two lists instead of one wont solve the fundamental
> problem.
>
> The real question is, why do we need this unconfirmed 'list' in the
> first place. Is it really a private per cpu thing ?

No, it isn't really private. But most of the time it is accessed locally,
if it is implemented as a per-cpu var.

> Can you prove this,
> in respect of lockless lookups, and things like NFQUEUE ?
>
> Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
>
> Unconfirmed list use the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> dont respect an RCU grace period before re-using an object.
>
> If memory was not a problem, we probably would use a third anchor to
> avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant.
>

Thanks for the explanation.
Patrick McHardy June 1, 2010, 10:18 a.m. UTC | #4
Eric Dumazet wrote:
> On Tuesday, June 1, 2010 at 08:28 +0800, Changli Gao wrote:
>> On Tue, Jun 1, 2010 at 5:21 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> I had a look at current conntrack and found the 'unconfirmed' list was
>>> maybe a candidate for a potential blackhole.
>>>
>>> That is, if a reader happens to hit an entry that is moved from regular
>>> hash table slot 'hash' to unconfirmed list,
>> Sorry, but I can't find where we do this things. unconfirmed list is
>> used to track the unconfirmed cts, whose corresponding skbs are still
>> in path from the first to the last netfilter hooks. As soon as the
>> skbs end their travel in netfilter, the corresponding cts will be
>> confirmed(moving ct from unconfirmed list to regular hash table).
>>
> 
> So netfilter is a monolithic thing.
> 
> When a packet begins its travel into netfilter, you guarantee that no
> other packet can also begin its travel and find an unconfirmed
> conntrack ?

Correct, the unconfirmed list exists only for cleanup.

> I wonder why we use atomic ops then to track the confirmed bit :)

Good question, that looks unnecessary :)

>> unconfirmed list should be small, as networking receiving is in BH.
> 
> So according to you, netfilter/ct runs only in input path ?
> 
> So I assume a packet is handled by CPU X, creates a new conntrack
> (possibly early droping an old entry that was previously in a standard
> hash chain), inserted in unconfirmed list. _You_ guarantee another CPU
> Y, handling another packet, possibly sent by a hacker reading your
> netdev mails, cannot find the conntrack that was early dropped ?
> 
>> How about implementing unconfirmed list as a per cpu variable?
> 
> I first implemented such a patch to reduce cache line contention, then I
> asked to myself : What is exactly an unconfirmed conntrack ? Can their
> number be unbounded ? If yes, we have a problem, even on a two cpus
> machine. Using two lists instead of one wont solve the fundamental
> problem.

If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
added to the unconfirmed list and moved to the hash as soon as the
packet passes POST_ROUTING. This means the number of unconfirmed entries
created by the network is bounded by the number of CPUs due to BH
processing. The number created by locally generated packets is unbounded
on preemptible kernels, however.
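
Roughly, the relevant part of __nf_conntrack_confirm() (heavily trimmed;
timer and refcount handling omitted, hash/repl_hash are computed earlier
in the real function):

	spin_lock_bh(&nf_conntrack_lock);

	/* Remove from the unconfirmed list... */
	hlist_nulls_del_rcu(&ct->tuplehash[IP_CT_DIR_ORIGINAL].hnnode);

	/* ...mark the entry confirmed before it becomes visible... */
	set_bit(IPS_CONFIRMED_BIT, &ct->status);

	/* ...and insert both directions into the regular hash table. */
	__nf_conntrack_hash_insert(ct, hash, repl_hash);

	spin_unlock_bh(&nf_conntrack_lock);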

> The real question is, why do we need this unconfirmed 'list' in the
> first place. Is it really a private per cpu thing ? Can you prove this,
> in respect of lockless lookups, and things like NFQUEUE ? 

It's used for cleaning up conntracks that are not yet in the hash table
on module unload (or a manual flush). It is supposed to be write-only
during regular operation.

> Each conntrack object has two list anchors. One for IP_CT_DIR_ORIGINAL,
> one for IP_CT_DIR_REPLY.
> 
> Unconfirmed list use the first anchor. This means another cpu can
> definitely find an unconfirmed item in a regular hash chain, since we
> dont respect an RCU grace period before re-using an object.
> 
> If memory was not a problem, we probably would use a third anchor to
> avoid this, or regular RCU instead of SLAB_DESTROY_BY_RCU variant.

So I guess we should check the CONFIRMED bit when searching
in the hash table.
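
An untested sketch of that idea, done in nf_conntrack_find_get() rather
than in the chain walk itself (this is not a final patch):

	h = __nf_conntrack_find(net, zone, tuple);
	if (h) {
		ct = nf_ct_tuplehash_to_ctrack(h);
		/* An unconfirmed (or dying) entry found here can only be
		 * a recycled object that never belonged to this hash
		 * chain, so treat it as a miss.
		 */
		if (unlikely(!nf_ct_is_confirmed(ct) ||
			     nf_ct_is_dying(ct) ||
			     !atomic_inc_not_zero(&ct->ct_general.use)))
			h = NULL;
	}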
Eric Dumazet June 1, 2010, 10:31 a.m. UTC | #5
On Tuesday, June 1, 2010 at 12:18 +0200, Patrick McHardy wrote:

> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
> added to the unconfirmed list and moved to the hash as soon as the
> packet passes POST_ROUTING. This means the number of unconfirmed entries
> created by the network is bound by the number of CPUs due to BH
> processing. The number created by locally generated packets is unbound
> in case of preemptible kernels however.
> 

OK, we should have a percpu list then.

BTW, I notice nf_conntrack_untracked is incorrectly annotated
'__read_mostly'.

It can be written very often :(

Shouldn't we special-case it and make it really const?



Patrick McHardy June 1, 2010, 10:41 a.m. UTC | #6
Eric Dumazet wrote:
> On Tuesday, June 1, 2010 at 12:18 +0200, Patrick McHardy wrote:
> 
>> If a new conntrack is created in PRE_ROUTING or LOCAL_OUT, it will be
>> added to the unconfirmed list and moved to the hash as soon as the
>> packet passes POST_ROUTING. This means the number of unconfirmed entries
>> created by the network is bound by the number of CPUs due to BH
>> processing. The number created by locally generated packets is unbound
>> in case of preemptible kernels however.
>>
> 
> OK, we should have a percpu list then.

Yes, that makes sense.

> BTW, I notice nf_conntrack_untracked is incorrectly annotated
> '__read_mostly'.
> 
> It can be written very often :(
> 
> Should'nt we special case it and let be really const ?

That would need quite a bit of special-casing to avoid touching
the reference counts. So far this is completely hidden, so I'd
say it just shouldn't be marked __read_mostly. Alternatively we
can make "untracked" an nfctinfo state.

Patch

diff --git a/include/net/netfilter/nf_conntrack.h b/include/net/netfilter/nf_conntrack.h
index bde095f..0573641 100644
--- a/include/net/netfilter/nf_conntrack.h
+++ b/include/net/netfilter/nf_conntrack.h
@@ -298,8 +298,10 @@  extern int nf_conntrack_set_hashsize(const char *val, struct kernel_param *kp);
 extern unsigned int nf_conntrack_htable_size;
 extern unsigned int nf_conntrack_max;
 
-#define NF_CT_STAT_INC(net, count)	\
+#define NF_CT_STAT_INC(net, count)		\
 	__this_cpu_inc((net)->ct.stat->count)
+#define NF_CT_STAT_ADD(net, count, value)	\
+	__this_cpu_add((net)->ct.stat->count, value)
 #define NF_CT_STAT_INC_ATOMIC(net, count)		\
 do {							\
 	local_bh_disable();				\
diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index eeeb8bc..e96d999 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -299,6 +299,7 @@  __nf_conntrack_find(struct net *net, u16 zone,
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	unsigned int hash = hash_conntrack(net, zone, tuple);
+	unsigned int cnt = 0;
 
 	/* Disable BHs the entire time since we normally need to disable them
 	 * at least once for the stats anyway.
@@ -309,10 +310,19 @@  begin:
 		if (nf_ct_tuple_equal(tuple, &h->tuple) &&
 		    nf_ct_zone(nf_ct_tuplehash_to_ctrack(h)) == zone) {
 			NF_CT_STAT_INC(net, found);
+			NF_CT_STAT_ADD(net, searched, cnt);
 			local_bh_enable();
 			return h;
 		}
-		NF_CT_STAT_INC(net, searched);
+		/*
+		 * If we find an unconfirmed entry, restart the lookup to
+		 * avoid scanning whole unconfirmed list
+		 */
+		if (unlikely(++cnt > 8 &&
+			     !nf_ct_is_confirmed(nf_ct_tuplehash_to_ctrack(h)))) {
+			NF_CT_STAT_INC(net, search_restart);
+			goto begin;
+		}
 	}
 	/*
 	 * if the nulls value we got at the end of this lookup is
@@ -323,6 +333,7 @@  begin:
 		NF_CT_STAT_INC(net, search_restart);
 		goto begin;
 	}
+	NF_CT_STAT_ADD(net, searched, cnt);
 	local_bh_enable();
 
 	return NULL;