
[net-next,2/3] net: fix enforcing of fragment queue hash list depth

Message ID 20130418213732.14296.36026.stgit@dragon
State Superseded, archived
Delegated to: David Miller

Commit Message

Jesper Dangaard Brouer April 18, 2013, 9:38 p.m. UTC
I have found an issue with commit:

 commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
 Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
 Date:   Fri Mar 15 11:32:30 2013 +0000

    inet: limit length of fragment queue hash table bucket lists

There is a connection between the fixed 128 hash depth limit and the
frag mem limit/thresh settings, which limits how high the thresh can
be set.

The 128-element hash depth limit results in bad behaviour if the mem
limit thresholds are increased, via /proc/sys/net ::

 /proc/sys/net/ipv4/ipfrag_high_thresh
 /proc/sys/net/ipv4/ipfrag_low_thresh

If we increase the thresh to something allowing 128 elements in each
bucket, which is not that high given the hash array size of 64
(64*128=8192), e.g.
  big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25821184
  small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
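A quick userspace sanity check of those numbers (a standalone sketch
using the sizes quoted above; not kernel code):

    #include <stdio.h>

    int main(void)
    {
        const long max_elems = 64 * 128; /* INETFRAGS_HASHSZ * max depth */
        const long ipq = 208;            /* per-queue struct overhead */

        printf("big MTU frags: %ld\n", (2944 + ipq) * max_elems); /* 25821184 */
        printf("small frags:   %ld\n", ( 896 + ipq) * max_elems); /*  9043968 */
        return 0;
    }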

The problem with commit 5a3da1fe (inet: limit length of fragment queue
hash table bucket lists) is that, once we hit the limit, we *keep* the
existing frag queues and do not allow new frag queues to be created.
Thus, an attacker can effectively block handling of fragments for 30
sec (as each frag queue has a timeout of 30 sec).

Even without increasing the limit, as Hannes showed, an attacker on
IPv6 can "attack" a specific hash bucket, and thereby block/drop new
fragments that (try to) utilize this bucket.

Summary:
 With the default mem limit/thresh settings this is not a general
problem, but adjusting the thresh limits results in somewhat
unexpected behavior.

Proposed solution:
 IMHO instead of keeping existing frag queues, we should kill one of
the frag queues in the hash instead.

Implementation complications:
 Killing frag queues while only holding the hash bucket lock, and
not the frag queue lock, complicates the implementation, as we race
and can end up (trying to) remove the hash element twice (resulting in
an oops). This has been addressed by using hlist_del_init() and a
hlist_unhashed() check in fq_unlink_hash().
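In outline, the guarded unlink looks like this (a minimal sketch of the
pattern; the full version is in fq_unlink_hash() in the diff below):

    spin_lock(&hb->chain_lock);
    /* The eviction path unlinks with hlist_del_init(), which leaves the
     * node pointing at itself, so whichever path loses the race sees
     * hlist_unhashed() == true here and skips the second hlist_del(),
     * avoiding the double-unlink oops.
     */
    if (!hlist_unhashed(&fq->list))
        hlist_del(&fq->list);
    spin_unlock(&hb->chain_lock);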

Extra:
* Added new sysctl "max_hash_depth" option, to allow users to adjust the hash
  depth along with adjusting the thresh limits.
* Change max hash depth to 32, thus limiting handling to approx 2048 frag queues.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |    9 +---
 net/ipv4/inet_fragment.c                |   64 ++++++++++++++++++++-----------
 net/ipv4/ip_fragment.c                  |   13 +++++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
 net/ipv6/reassembly.c                   |   15 ++++++-
 5 files changed, 68 insertions(+), 38 deletions(-)



Comments

Hannes Frederic Sowa April 19, 2013, 12:52 a.m. UTC | #1
On Thu, Apr 18, 2013 at 11:38:00PM +0200, Jesper Dangaard Brouer wrote:
> -	if (PTR_ERR(q) == -ENOBUFS)
> -		LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg);
> +		LIMIT_NETDEBUG(KERN_WARNING "%s(): Fragment hash bucket"
> +			       " list length grew over limit (len %d),"
> +			       " Dropping another fragment.\n",
> +			       __func__, depth);
> +	}

Simple nice-to-have thingie: Do you see a simple possibility to add the
user (e.g.  ipv6/ipv6-netfilter/ipv4) to the warning?

Eric Dumazet April 19, 2013, 10:11 a.m. UTC | #2
On Thu, 2013-04-18 at 23:38 +0200, Jesper Dangaard Brouer wrote:
> I have found an issue with commit:
> 
>  commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
>  Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
>  Date:   Fri Mar 15 11:32:30 2013 +0000
> 
>     inet: limit length of fragment queue hash table bucket lists
> 
> There is a connection between the fixed 128 hash depth limit and the
> frag mem limit/thresh settings, which limits how high the thresh can
> be set.
> 
> The 128-element hash depth limit results in bad behaviour if the mem
> limit thresholds are increased, via /proc/sys/net ::
> 
>  /proc/sys/net/ipv4/ipfrag_high_thresh
>  /proc/sys/net/ipv4/ipfrag_low_thresh
> 
> If we increase the thresh to something allowing 128 elements in each
> bucket, which is not that high given the hash array size of 64
> (64*128=8192), e.g.
>   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25821184
>   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
> 
> The problem with commit 5a3da1fe (inet: limit length of fragment queue
> hash table bucket lists) is that, once we hit the limit, we *keep* the
> existing frag queues and do not allow new frag queues to be created.
> Thus, an attacker can effectively block handling of fragments for 30
> sec (as each frag queue has a timeout of 30 sec).
> 
> Even without increasing the limit, as Hannes showed, an attacker on
> IPv6 can "attack" a specific hash bucket, and thereby block/drop new
> fragments that (try to) utilize this bucket.
> 
> Summary:
>  With the default mem limit/thresh settings this is not a general
> problem, but adjusting the thresh limits results in somewhat
> unexpected behavior.
> 
> Proposed solution:
>  IMHO instead of keeping existing frag queues, we should kill one of
> the frag queues in the hash instead.


This strategy won't really help against DDoS attacks. No frag will ever complete.

I am not sure it's worth adding extra complexity.

> 
> Implementation complications:
>  Killing frag queues while only holding the hash bucket lock, and
> not the frag queue lock, complicates the implementation, as we race
> and can end up (trying to) remove the hash element twice (resulting in
> an oops). This has been addressed by using hlist_del_init() and a
> hlist_unhashed() check in fq_unlink_hash().
> 
> Extra:
> * Added new sysctl "max_hash_depth" option, to allow users to adjust the hash
>   depth along with adjusting the thresh limits.
> * Change max hash depth to 32, thus limiting handling to approx 2048 frag queues.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
> 
>  include/net/inet_frag.h                 |    9 +---
>  net/ipv4/inet_fragment.c                |   64 ++++++++++++++++++++-----------
>  net/ipv4/ip_fragment.c                  |   13 +++++-
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
>  net/ipv6/reassembly.c                   |   15 ++++++-
>  5 files changed, 68 insertions(+), 38 deletions(-)

Hmm... adding a new sysctl without documentation is a clear sign you'll
be the only user of it.

You are also setting a default limit of 32, which is more likely to hit
the problem than the current 128 value.

We know the real solution is to have a correctly sized hash table, so
why add a temporary sysctl?

As soon as /proc/sys/net/ipv4/ipfrag_high_thresh is changed, a resize
should be attempted.

But the max depth itself should be a reasonable value, and doesn't need
to be tuned IMHO.

The 64-slot hash table was chosen years ago, when machines had 3 orders
of magnitude less RAM than today.

Before hash resizing, I would just bump hash size to something more
reasonable like 1024.

That would allow some admin to set /proc/sys/net/ipv4/ipfrag_high_thresh
to a quite large value.



Eric Dumazet April 19, 2013, 11:14 a.m. UTC | #3
On Fri, 2013-04-19 at 11:41 +0100, David Laight wrote:
> > The 64-slot hash table was chosen years ago, when machines had 3 orders
> > of magnitude less RAM than today.
> > 
> > Before hash resizing, I would just bump hash size to something more
> > reasonable like 1024.
> 
> While that is true of many systems, there are embedded systems
> with much less ram than the typical x86 desktop or server.
> 
> There seem to be quite a lot of large hash tables appearing
> that on many small systems will never contain a significant
> number of entries and just waste precious memory.

Embedded systems have run Linux for ages, and their memory has also
increased by 2 orders of magnitude since 1995.

If they run linux-3.10, using 4096 or 8192 bytes for the hash table is
fine.

We are not going to add yet another ifdef for such a small amount of
RAM.

The code that we will add to do the resize will be larger than this.



Jesper Dangaard Brouer April 19, 2013, 12:19 p.m. UTC | #4
On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:
> On Thu, 2013-04-18 at 23:38 +0200, Jesper Dangaard Brouer wrote:
> > I have found an issue with commit:
> > 
> >  commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
> >  Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >  Date:   Fri Mar 15 11:32:30 2013 +0000
> > 
> >     inet: limit length of fragment queue hash table bucket lists
> > 
> > There is a connection between the fixed 128 hash depth limit and the
> > frag mem limit/thresh settings, which limits how high the thresh can
> > be set.
> > 
> > The 128-element hash depth limit results in bad behaviour if the mem
> > limit thresholds are increased, via /proc/sys/net ::
> > 
> >  /proc/sys/net/ipv4/ipfrag_high_thresh
> >  /proc/sys/net/ipv4/ipfrag_low_thresh
> > 
> > If we increase the thresh to something allowing 128 elements in each
> > bucket, which is not that high given the hash array size of 64
> > (64*128=8192), e.g.
> >   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25821184
> >   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
> > 
> > The problem with commit 5a3da1fe (inet: limit length of fragment queue
> > hash table bucket lists) is that, once we hit the limit, we *keep* the
> > existing frag queues and do not allow new frag queues to be created.
> > Thus, an attacker can effectively block handling of fragments for 30
> > sec (as each frag queue has a timeout of 30 sec).
> > 
> > Even without increasing the limit, as Hannes showed, an attacker on
> > IPv6 can "attack" a specific hash bucket, and thereby block/drop new
> > fragments that (try to) utilize this bucket.
> > 
> > Summary:
> >  With the default mem limit/thresh settings this is not a general
> > problem, but adjusting the thresh limits results in somewhat
> > unexpected behavior.
> > 
> > Proposed solution:
> >  IMHO instead of keeping existing frag queues, we should kill one of
> > the frag queues in the hash instead.
> 
> 
> This strategy won't really help against DDoS attacks. No frag will ever complete.

I beg to differ.

First of all, I also thought keeping fragments was the right choice, BUT
then I realized that we cannot tell good from bad frags apart, and an
attacker will obviously generate more fragments than the real traffic,
thus we end up keeping the attacker's fragments for 30 sec, because the
attacker will never "finish/complete" his fragment queues.
Thus, the real "help" for a DoS attack is to keep fragments.


Second, I have clearly shown, with my performance tests, that frags will
complete under DoS.  

Consider the argument for why frags will be able to complete:

 For an attacker to "push out" the real fragments, she will have to
generate enough packets to fill the entire hash (to be sure), which
is 64 * 32 = 2048 elements, thus generate (somewhere below) 2048
packets, while in the same period the real-fragment sender only needs
to send 3 packets (my DNSSEC use case).
That is the reason it works; do you buy that explanation?


> I am not sure it's worth adding extra complexity.

It's not that complex, and we simply need it, else an attacker can DoS
us very easily by sending a burst every 30 sec.  We do need this change,
else we must revert Hannes' patch and find a completely different
approach to removing the LRU list system.



> > Implementation complications:
> >  Killing frag queues while only holding the hash bucket lock, and
> > not the frag queue lock, complicates the implementation, as we race
> > and can end up (trying to) remove the hash element twice (resulting in
> > an oops). This has been addressed by using hlist_del_init() and a
> > hlist_unhashed() check in fq_unlink_hash().
> > 
> > Extra:
> > * Added new sysctl "max_hash_depth" option, to allow users to adjust the hash
> >   depth along with adjusting the thresh limits.
> > * Change max hash depth to 32, thus limiting handling to approx 2048 frag queues.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> > 
> >  include/net/inet_frag.h                 |    9 +---
> >  net/ipv4/inet_fragment.c                |   64 ++++++++++++++++++++-----------
> >  net/ipv4/ip_fragment.c                  |   13 +++++-
> >  net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
> >  net/ipv6/reassembly.c                   |   15 ++++++-
> >  5 files changed, 68 insertions(+), 38 deletions(-)
> 
> Hmm... adding a new sysctl without documentation is a clear sign you'll
> be the only user of it.

LOL - I know, my next patch contains a note that I need to update the
docs too...


> You are also setting a default limit of 32, which is more likely to hit
> the problem than the current 128 value.
> 
> We know the real solution is to have a correctly sized hash table, so
> why add a temporary sysctl?

No, the real solution is to remove the LRU list system (that is the real
bottleneck).  But this patch (and the next) is a step needed before we
can remove the LRU list cleanup system.


> As soon as /proc/sys/net/ipv4/ipfrag_high_thresh is changed, a resize
> should be attempted.

Perhaps, but the ipfrag_high_thresh memory size does not translate
easily into the number of needed frag queues (hash_size * max_depth),
because the size of a frag queue can range from 1108 bytes to approx
130,000 bytes.


> But the max depth itself should be a reasonable value, and doesn't need
> to be tuned IMHO.

What is a reasonable max depth value for you?



> > The 64-slot hash table was chosen years ago, when machines had 3 orders
> > of magnitude less RAM than today.
> 
> Before hash resizing, I would just bump hash size to something more
> reasonable like 1024.
> 
> That would allow some admin to set /proc/sys/net/ipv4/ipfrag_high_thresh
> to a quite large value.

When removing the LRU system (which is the real bottleneck, see perf
tests in the cover mail) and doing direct hash cleaning, we are trading
in accuracy.

The reason I don't want too big a hash table is the following.

With direct hash cleaning, and even using the mem limit
(ipfrag_high_thresh) in the cleaning loop/selection (see patch-02), we
should at least allow one frag queue to survive (in a hash bucket).
This means that we can violate/exceed the ipfrag_high_thresh a lot
(because the size of a frag queue can range from 1108 bytes to approx
130,000 bytes).
Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
embedded systems is a lot of kernel memory we are permitting a remote
host to "lock-down".


--Jesper


Hannes Frederic Sowa April 19, 2013, 12:45 p.m. UTC | #5
On Fri, Apr 19, 2013 at 02:19:10PM +0200, Jesper Dangaard Brouer wrote:
> I beg to differ.
> 
> First of all, I also thought keeping fragments was the right choice, BUT
> then I realized that we cannot tell good from bad frags apart, and an
> attacker will obviously generate more fragments than the real traffic,
> thus we end up keeping the attacker's fragments for 30 sec, because the
> attacker will never "finish/complete" his fragment queues.
> Thus, the real "help" for a DoS attack is to keep fragments.
>
> Second, I have clearly shown, with my performance tests, that frags will
> complete under DoS.

How did you simulate the DoS? Did you try to fill up specific buckets or did
you try to fill up the whole fragment cache?

> > I am not sure it's worth adding extra complexity.
> 
> It's not that complex, and we simply need it, else an attacker can DoS
> us very easily by sending a burst every 30 sec.  We do need this change,
> else we must revert Hannes' patch and find a completely different
> approach to removing the LRU list system.

I agree that we have to do something to mitigate this. I would also
drop the sysctl; I do think the kernel should take care of that.

Perhaps a way to solve this problem more easily would be to switch back
to using a jenkins hash over all inputs of the hash function, and not
deal with this kind of unfairness regarding the possibility to fill up
one of the buckets.

Perhaps the performance improvements by Jesper (per bucket locking) could
make up for the more expensive hashing. :)

Just some minor things I found while looking at the patch:

> -
> -/* averaged:
> - * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> - *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> - *	       struct frag_queue))
> - */
> -#define INETFRAGS_MAXDEPTH		128
> +#define INETFRAGS_MAX_HASH_DEPTH	32

Perhaps a comment could be added that this is only a default value for
the max hash depth.
 
 
> -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> +		      bool unlink_hash)

This function could be static.

> -	hlist_for_each_entry(q, &hb->chain, list) {
> +	hlist_for_each_entry_safe(q, n, &hb->chain, list) {

Minor, but I think _safe is not needed here.

> +static int zero;
> +

Could be const.

Thanks,

  Hannes

Jesper Dangaard Brouer April 19, 2013, 2:29 p.m. UTC | #6
On Fri, 2013-04-19 at 14:45 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 19, 2013 at 02:19:10PM +0200, Jesper Dangaard Brouer wrote:
> > I beg to differ.
> > 
> > First of all, I also thought keeping fragments was the right choice, BUT
> > then I realized that we cannot tell good from bad frags apart, and an
> > attacker will obviously generate more fragments than the real traffic,
> > thus we end up keeping the attacker's fragments for 30 sec, because the
> > attacker will never "finish/complete" his fragment queues.
> > Thus, the real "help" for a DoS attack is to keep fragments.
> >
> > Second, I have clearly shown, with my performance tests, that frags will
> > complete under DoS.
> 
> How did you simulate the DoS? Did you try to fill up specific buckets or did
> you try to fill up the whole fragment cache?

I'm using trafgen (part of netsniff-ng git download from
https://github.com/borkmann/netsniff-ng).

Simple DoS test setup read (scroll down):
 http://thread.gmane.org/gmane.linux.network/257155

As mentioned last time:
 http://article.gmane.org/gmane.linux.network/263644

I have extended the DoS generator. quote:
 'I have changed the frag DoS generator script to be more
efficient/deadly.  Before it would only hit one RX queue, now it's
sending packets causing multi-queue RX, due to "better" RX hashing.'

I have placed the new "multi-queue" trafgen script/input here:
http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet05_small_frag_multi_queue.txf

Perhaps you can tell me if my trafgen traffic is getting hashed badly?


> > > I am not sure it's worth adding extra complexity.
> > 
> > It's not that complex, and we simply need it, else an attacker can DoS
> > us very easily by sending a burst every 30 sec.  We do need this change,
> > else we must revert Hannes' patch and find a completely different
> > approach to removing the LRU list system.
> 
> I agree that we have to do something to mitigate this. I would also
> drop the sysctl; I do think the kernel should take care of that.
> 
> Perhaps a way to solve this problem more easily would be to switch back
> to using a jenkins hash over all inputs of the hash function, and not
> deal with this kind of unfairness regarding the possibility to fill up
> one of the buckets.
> 
> Perhaps the performance improvements by Jesper (per bucket locking) could
> make up for the more expensive hashing. :)

Well, I don't know.  But we do need some solution to the current code.


> Just some minor things I found while looking at the patch:
> 
> > -
> > -/* averaged:
> > - * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > - *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > - *	       struct frag_queue))
> > - */
> > -#define INETFRAGS_MAXDEPTH		128
> > +#define INETFRAGS_MAX_HASH_DEPTH	32
> 
> Perhaps a comment could be added that this is only a default value for
> the max hash depth.

Yes, I thought of calling it INETFRAGS_MAX_HASH_DEPTH_DEFAULT, but it was
getting too long.  But now you and Eric argue against making this
tunable, so it might be the right "name".

 
> 
> > -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> > +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> > +		      bool unlink_hash)
> 
> This function could be static.

(Sorry to be dumb) could you explain the benefit of doing so?


> > -	hlist_for_each_entry(q, &hb->chain, list) {
> > +	hlist_for_each_entry_safe(q, n, &hb->chain, list) {
> 
> Minor, but I think _safe is not needed here.

Yes, the hash unlink happens after we have exited the for_each loop.


> > +static int zero;
> > +
> 
> Could be const.

Could you also explain the benefit of doing so for a variable?


--Jesper


Eric Dumazet April 19, 2013, 2:42 p.m. UTC | #7
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:

> > This strategy won't really help against DDoS attacks. No frag will ever complete.
> 
> I beg to differ.

We already had a discussion on this topic.


Eric Dumazet April 19, 2013, 2:45 p.m. UTC | #8
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:

> > I am not sure it's worth adding extra complexity.
> 
> It's not that complex, and we simply need it, else an attacker can DoS
> us very easily by sending a burst every 30 sec.  We do need this change,
> else we must revert Hannes' patch and find a completely different
> approach to removing the LRU list system.

It's never-ending stuff.

Fragments are fundamentally not suitable for any workload that can be
attacked by a hostile guy.

The guy will adapt his strategy knowing yours.

That's pretty easy for him; the Linux sources are public.


Eric Dumazet April 19, 2013, 2:45 p.m. UTC | #9
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:

> What is a reasonable max depth value for you?
> 

Current one is OK.


Eric Dumazet April 19, 2013, 2:49 p.m. UTC | #10
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:

> When removing the LRU system (which is the real bottleneck, see perf
> tests in the cover mail) and doing direct hash cleaning, we are trading
> in accuracy.
> 

You are mixing performance issues and correctness.

> The reason I don't want too big a hash table is the following.
> 

> Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
> embedded systems is a lot of kernel memory we are permitting a remote
> host to "lock-down".

That's pretty irrelevant; memory is limited by the total amount of memory
used by fragments, not by hash table size.

Its called /proc/sys/net/ipv4/ipfrag_high_thresh

It seems to me you are spending time on the wrong things.



Hannes Frederic Sowa April 19, 2013, 3:06 p.m. UTC | #11
On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> I'm using trafgen (part of netsniff-ng git download from
> https://github.com/borkmann/netsniff-ng).
> 
> Simple DoS test setup read (scroll down):
>  http://thread.gmane.org/gmane.linux.network/257155
> 
> As mentioned last time:
>  http://article.gmane.org/gmane.linux.network/263644
> 
> I have extended the DoS generator. quote:
>  'I have changed the frag DoS generator script to be more
> efficient/deadly.  Before it would only hit one RX queue, now it's
> sending packets causing multi-queue RX, due to "better" RX hashing.'
> 
> I have placed the new "multi-queue" trafgen script/input here:
> http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet05_small_frag_multi_queue.txf
> 
> Perhaps you can tell me if my trafgen traffic is getting hashed badly?

Thanks!

IPv4 traffic does not have this problem, as all the information
identifying the packet is used directly as input to jhash_3words (the
problem is that the IPv6 addresses are xored first and then fed into the
jhash). So with this setup you would flood the hash uniformly.

(Btw. I made a tool to generate fragments with colliding hashes:
https://gist.github.com/hannes/5116331).
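
For context, the unfairness comes from the address folding in the
optimized IPv6 hash; roughly (a sketch of inet6_hash_frag() and
ipv6_addr_hash() as of commit 279e9f2, reconstructed here for
illustration, not copied verbatim):

    /* Each IPv6 address is XOR-folded down to 32 bits before hashing. */
    static inline u32 ipv6_addr_hash(const struct in6_addr *a)
    {
        return (__force u32)(a->s6_addr32[0] ^ a->s6_addr32[1] ^
                             a->s6_addr32[2] ^ a->s6_addr32[3]);
    }

    unsigned int inet6_hash_frag(__be32 id, const struct in6_addr *saddr,
                                 const struct in6_addr *daddr, u32 rnd)
    {
        u32 c = jhash_3words(ipv6_addr_hash(saddr), ipv6_addr_hash(daddr),
                             (__force u32)id, rnd);
        return c & (INETFRAGS_HASHSZ - 1);
    }

An attacker who flips the same bit in two different words of the source
address keeps the fold, and therefore the bucket, constant regardless of
the random seed.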

> > I agree that we have to do something to mitigate this. I would also
> > drop the sysctl; I do think the kernel should take care of that.
> > 
> > Perhaps a way to solve this problem more easily would be to switch back
> > to using a jenkins hash over all inputs of the hash function, and not
> > deal with this kind of unfairness regarding the possibility to fill up
> > one of the buckets.
> > 
> > Perhaps the performance improvements by Jesper (per bucket locking) could
> > make up for the more expensive hashing. :)
> 
> Well, I don't know.  But we do need some solution to the current code.

You do have fancy 10GigE gear? :)
Could you profile a kernel with 279e9f2 ("ipv6: optimize inet6_hash_frag()")
reverted? This would only affect IPv6 traffic, so you would have to write
an IPv6 txf file. If the numbers show that the effect is not too high, we
could actually stop worrying about per-bucket hash utilization in general
(still, a limit per hash chain would be reasonable, but a global element
limit should trigger first).

This would give us the possibility to evict fragments from the
fragmentation cache as soon as new fragments come in, without caring
about which hash bucket the fragment actually belongs to. This would
solve one of the problems of your RFC patch.

Btw, where is the txf file format actually documented?

> > 
> > > -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> > > +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> > > +		      bool unlink_hash)
> > 
> > This function could be static.
> 
> (Sorry to be dumb) could you explain the benefit of doing so?

The function's namespace will be limited to the current file, and it won't
be exported for linking by other files. It also gives a good hint to
developers that the function is not used externally.
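
An illustrative example (not from the patch):

    /* foo.c */
    static int add_internal(int a, int b) { return a + b; } /* this file only */
    int add_public(int a, int b) { return a + b; }          /* external linkage */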

> > > +static int zero;
> > > +
> > 
> > Could be const.
> 
> Could you also explain the benefit of doing so for a variable?

It is a precaution so that the value won't be changed by any statement in
the file (it does not add overhead, but it throws a compile-time error if
someone accidentally changes the value by mistake).
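
An illustrative example (not from the patch):

    static const int zero;  /* zero-initialized, placed in read-only data */
    /* zero = 1;               would fail: assignment of read-only variable */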

Thanks,

  Hannes

Hannes Frederic Sowa April 19, 2013, 7:44 p.m. UTC | #12
On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> Well, I don't know.  But we do need some solution to the current code.

In <http://article.gmane.org/gmane.linux.network/261361> I said that we
could actually have a list length of about 370. At that time this number
was stable; perhaps you could verify?

I tried to flood the cache with very minimal packets, so this was actually
a hint that I should have resized the hash back then. With the current
fragmentation cache design we could reach optimal behaviour as soon as
the memory limit kicks in and LRU eviction starts before we limit the
fragment queues in the hash chains. The only way to achieve this is to
increase the hash table slots and lower the maximum length limit. I would
propose a limit of about 25-32 and, as Eric said, a hash size of 1024. We
could then test whether we are limited in accepting new fragments by the
memory limit (which would be fine, because LRU eviction kicks in) or by
the chain length (we could recheck the numbers then).

So the chain limit would only kick in if someone tries to exploit the fragment
cache by using the method I demonstrated before (which was the reason I
introduced this limit).

Thanks,

  Hannes

Jesper Dangaard Brouer April 22, 2013, 9:10 a.m. UTC | #13
On Fri, 2013-04-19 at 21:44 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> > Well, I don't know.  But we do need some solution, to the current code.
> 
> In <http://article.gmane.org/gmane.linux.network/261361> I said that we
> could actually have a list length of about 370. At that time this number
> was stable; perhaps you could verify?
> 
> I tried to flood the cache with very minimal packets, so this was actually
> a hint that I should have resized the hash back then. With the current
> fragmentation cache design we could reach optimal behaviour as soon as
> the memory limit kicks in and LRU eviction starts before we limit the
> fragment queues in the hash chains. The only way to achieve this is to
> increase the hash table slots and lower the maximum length limit. I would
> propose a limit of about 25-32 and, as Eric said, a hash size of 1024. We
> could then test whether we are limited in accepting new fragments by the
> memory limit (which would be fine, because LRU eviction kicks in) or by
> the chain length (we could recheck the numbers then).
> 
> So the chain limit would only kick in if someone tries to exploit the fragment
> cache by using the method I demonstrated before (which was the reason I
> introduced this limit).

(To avoid pissing people off) I acknowledge that we should change the
hash size, as it's ridiculously small with 64 entries.

But your mem limit assumptions and hash depth limit assumptions are
broken, because the mem limit is per netns (network namespace).
Thus, starting more netns instances will break these assumptions.

The dangerous part of your change (commit 5a3da1fe) is that you keep the
existing frag queues (and don't allow new frag queues to be created).
The attacker's fragments will never finish (timeout 30 sec), while valid
fragments will complete and "exit" the queue; thus the end result is a
hash bucket filled with the attacker's invalid/incomplete fragments.


Besides, after we have implemented per-hash-bucket locking (in my change,
commit 19952cc4 "net: frag queue per hash bucket locking"), I don't
think it is a big problem that a single hash bucket is being "attacked".

IMHO we should just revert the change (commit 5a3da1fe), and increase
the hash size and fix the hashing for IPv6.


And then I'll find another method for fixing the global LRU list
scalability problem (other than the "direct-hash-cleaning" method).
Hannes Frederic Sowa April 22, 2013, 2:54 p.m. UTC | #14
On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> (To avoid pissing people off) I acknowledge that we should change the
> hash size, as it's ridiculously small with 64 entries.
> 
> But your mem limit assumptions and hash depth limit assumptions are
> broken, because the mem limit is per netns (network namespace).
> Thus, starting more netns instances will break these assumptions.

Oh, I see. :/

At first I thought we should make the fragment hash per namespace, too,
to provide better isolation in case of lxc. But then each chrome tab
would allocate its own fragment cache, too. Hmm... but people using
namespaces have plenty of memory, don't they? We could also provide
an inet_fragment namespace. ;)

> The dangerous part of your change (commit 5a3da1fe) is that you keep the
> existing frag queues (and don't allow new frag queues to be created).
> The attacker's fragments will never finish (timeout 30 sec), while valid
> fragments will complete and "exit" the queue; thus the end result is a
> hash bucket filled with the attacker's invalid/incomplete fragments.

I would not mind if your change gets accepted (I have not completely
reviewed it yet), but I have my doubts whether it is an advantage over
the current solution.

First off, I think an attacker can keep the fragment cache pretty much
filled up with little cost. The current implementation has the grace
period where no new fragments will be accepted after the DoS; this is
solved by your patch. But the change makes it easier for an attacker to
evict "valid" fragments from the cache in the first 30 seconds of the
DoS, too.

I am not sure whether the current fragmentation handling or your solution
performs better in the real world (or whether it actually matters).

Nonetheless it does add a bit more complexity, and a new sysctl which
exposes something the kernel should know how to do better.

> Besides, after we have implemented per-hash-bucket locking (in my change,
> commit 19952cc4 "net: frag queue per hash bucket locking"), I don't
> think it is a big problem that a single hash bucket is being "attacked".

I don't know, I wouldn't say so. The contention point is now the
per-hash-bucket lock, but it should show the same symptoms as before.

In my opinion we should start resizing the hash table irrespective of
the namespace limits (one needs CAP_NET_ADMIN to connect a netns to
the outside world, I think) and try to move forward with patch 3. This
patch 2 would then only be a dependency and would introduce the eviction
strategy you need for patch 3. But the focus should be on the removal of
the LRU cleanup. What do you think?

Thanks,

  Hannes

Jesper Dangaard Brouer April 22, 2013, 5:49 p.m. UTC | #15
On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> > (To avoid pissing people off) I acknowledge that we should change the
> > hash size, as it's ridiculously small with 64 entries.
> > 
> > But your mem limit assumptions and hash depth limit assumptions are
> > broken, because the mem limit is per netns (network namespace).
> > Thus, starting more netns instances will break these assumptions.
> 
> Oh, I see. :/
> 
> At first I thought we should make the fragment hash per namespace, too,
> to provide better isolation in case of lxc. But then each chrome tab
> would allocate its own fragment cache, too. Hmm... but people using
> namespaces have plenty of memory, don't they? We could also provide
> an inet_fragment namespace. ;)

I'm wondering if we could do the opposite: move the mem limit and LRU
list "out of" the netns?
Either way, this would make the relationship between the mem limit and
hash size more sane.



> > The dangerous part of your change (commit 5a3da1fe) is that you keep the
> > existing frag queues (and don't allow new frag queues to be created).
> > The attacker's fragments will never finish (timeout 30 sec), while valid
> > fragments will complete and "exit" the queue; thus the end result is a
> > hash bucket filled with the attacker's invalid/incomplete fragments.
> 
> I would not mind if your change gets accepted (I have not completely
> reviewed it yet), but I have my doubts whether it is an advantage over
> the current solution.
> 
> First off, I think an attacker can keep the fragment cache pretty much
> filled up with little cost. The current implementation has the grace
> period where no new fragments will be accepted after the DoS; this is
> solved by your patch. But the change makes it easier for an attacker to
> evict "valid" fragments from the cache in the first 30 seconds of the
> DoS, too.

The "grace period" is quite harmful (where no new fragments will be
accepted).  Just creating 3 netns (3x 4MB mem limit) we can make all
queues reach 128 entries, resulting in a "grace period" of 30 sec where
no frags are possible. (min frag size is 1108 bytes with my trafgen
script).


> I am not sure whether the current fragmentation handling or your solution
> performs better in the real world (or whether it actually matters).
> 
> Nonetheless it does add a bit more complexity, and a new sysctl which
> exposes something the kernel should know how to do better.

Well, actually I don't like exposing the max_hash_depth sysctl; it was a
wrong idea/move.

I like Eric's idea of resizing the hash based on the max thresh;
unfortunately this does not make sense when the max thresh is per netns
and the hash table is global.

I'm also wondering: is it really worth the complexity of having a depth
limit on this hash table?  Is it that important?  The mem limit should at
some point kick in and save the day anyhow (before, without
per-hash-bucket locking, it might have made sense).



> > Besides, after we have implemented per-hash-bucket locking (in my change,
> > commit 19952cc4 "net: frag queue per hash bucket locking"), I don't
> > think it is a big problem that a single hash bucket is being "attacked".
> 
> I don't know, I wouldn't say so. The contention point is now the
> per-hash-bucket lock, but it should show the same symptoms as before.
> 
> In my opinion we should start resizing the hash table irrespective of
> the namespace limits (one needs CAP_NET_ADMIN to connect a netns to
> the outside world, I think) and try to move forward with patch 3. This
> patch 2 would then only be a dependency and would introduce the eviction
> strategy you need for patch 3. But the focus should be on the removal of
> the LRU cleanup. What do you think?

I agree, increasing the hash table size makes sense, as it's
ridiculously small with 64 entries.

Yes, removal of the (per netns) "global" LRU list should be the real
focus.  This was just a dependency for introducing the eviction strategy
I needed for patch 3.

But the eviction strategy in patch-3 is actually also "broken", because
the mem limit is per netns and we do eviction based on the hash table
shared across netns... I'm thinking of going back to my original idea of
simply doing LRU lists per CPU.

--Jesper



Hannes Frederic Sowa April 23, 2013, 12:20 a.m. UTC | #16
On Mon, Apr 22, 2013 at 06:30:17PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> > On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Besides, after we have implemented per-hash-bucket locking (in my
> > > change, commit 19952cc4 "net: frag queue per hash bucket locking"),
> > > I don't think it is a big problem that a single hash bucket is
> > > being "attacked".
> > 
> > I don't know, I wouldn't say so. The contention point is now the
> > per-hash-bucket lock, but it should show the same symptoms as before.
> 
> No, the contention point is the LRU list lock, not the hash bucket lock.
> If you perf record/profile the code, you can easily miss that it's the
> LRU lock, because it's inlined.  Try to rerun your tests with noinline,
> e.g.:

It depends on the test. Last time I checked with my ipv6 torture test I
had most hits in the inet_frag_find loop (I looked at it after your
per-bucket locks landed in net-next). I think your test setup could
provide more meaningful numbers. If you fill up the whole fragment cache
it is plausible that the contention will shift towards the LRU lock.

On Mon, Apr 22, 2013 at 07:49:12PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> > At first I thought we should make the fragment hash per namespace, too,
> > to provide better isolation in case of lxc. But then each chrome tab
> > would allocate its own fragment cache, too. Hmm... but people using
> > namespaces have plenty of memory, don't they? We could also provide
> > an inet_fragment namespace. ;)
> 
> I'm wondering if we could do the opposite: move the mem limit and LRU
> list "out of" the netns?
> Either way, this would make the relationship between the mem limit and
> hash size more sane.

It seems to be the only sane way currently. If we resized on
cloning/changing per-ns parameters, we would also have to resize the frag
hash on every Google Chrome tab creation/destruction.

Perhaps we should think about adding lazy allocation/resizing of the
fragment cache per-ns (or we do the resize on every rebuild timeout).

> > > The dangerous part of your change (commit 5a3da1fe) is that you keep the
> > > existing frag queues (and don't allow new frag queues to be created).
> > > The attacker's fragments will never finish (timeout 30 sec), while valid
> > > fragments will complete and "exit" the queue; thus the end result is a
> > > hash bucket filled with the attacker's invalid/incomplete fragments.
> > 
> > I would not mind if your change gets accepted (I have not completely
> > reviewed it yet), but I have my doubts whether it is an advantage over
> > the current solution.
> > 
> > First off, I think an attacker can keep the fragment cache pretty much
> > filled up with little cost. The current implementation has the grace
> > period where no new fragments will be accepted after the DoS; this is
> > solved by your patch. But the change makes it easier for an attacker to
> > evict "valid" fragments from the cache in the first 30 seconds of the
> > DoS, too.
> 
> The "grace period" is quite harmful (where no new fragments will be
> accepted).  Just creating 3 netns (3x 4MB mem limit) we can make all
> queues reach 128 entries, resulting in a "grace period" of 30 sec where
> no frags are possible. (min frag size is 1108 bytes with my trafgen
> script).

I see your point. But some other fragments, which would otherwise be
evicted, might still be in the queue and get reassembled correctly. So I
really don't know which is the better behavior here.

But this would be changed as a side-effect of this whole patch set. ;)

> I'm also wondering: is it really worth the complexity of having a depth
> limit on this hash table?  Is it that important?  The mem limit should at
> some point kick in and save the day anyhow (before, without
> per-hash-bucket locking, it might have made sense).

Depends on how we proceed with the ipv6 fragment hashing. If we go back to the
3 rounds of jhash for each fragment this would be possible.

> Yes, removal of the (per netns) "global" LRU list should be the real
> focus.  This was just a dependency for introducing the eviction strategy
> I needed for patch 3.
> 
> But the eviction strategy in patch-3 is actually also "broken", because
> the mem limit is per netns and we do eviction based on the hash table
> shared across netns... I'm thinking of going back to my original idea of
> simply doing LRU lists per CPU.

That depends on how we proceed with the memory limits. If we don't solve
this now, we will have to deal with it again when we think about how to
calculate a decent hash size.

Greetings,

  Hannes

Jesper Dangaard Brouer April 24, 2013, 1:35 p.m. UTC | #17
On Fri, 2013-04-19 at 07:49 -0700, Eric Dumazet wrote:
> On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> 
> > When removing the LRU system (which is the real bottleneck, see perf
> > tests in the cover mail) and doing direct hash cleaning, we are trading
> > in accuracy.
[...]
> > The reason I don't want a too big hash table is the following.
> > 
> > Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
> > embedded systems is a lot of kernel memory we are permitting a remote
> > host to "lock-down".
> 
> That's pretty irrelevant; memory is limited by the total amount of memory
> used by fragments, not by hash table size.
>
> Its called /proc/sys/net/ipv4/ipfrag_high_thresh


I was talking about patch-03, where I do "direct hash cleaning", and
have moved the mem limit "ipfrag_high_thresh" into the hash cleaning
step.

It seems we are talking past each other...

--Jesper


Eric Dumazet April 24, 2013, 3:05 p.m. UTC | #18
On Wed, 2013-04-24 at 15:35 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 07:49 -0700, Eric Dumazet wrote:

> > That's pretty irrelevant; memory is limited by the total amount of memory
> > used by fragments, not by hash table size.
> >
> > Its called /proc/sys/net/ipv4/ipfrag_high_thresh
> 
> 
> I was talking about patch-03, where I do "direct hash cleaning", and
> have moved the mem limit "ipfrag_high_thresh" into the hash cleaning
> step.

I am fine with removing the LRU, as maintaining an LRU is too expensive
and this LRU has little purpose.

Old frags are evicted anyway because they have a timer.

We have to keep a global limit, especially if we increase the hash table
size, because even with perfect hashing and one frag per hash slot we
might consume too much RAM.

We can rename it if you wish, but currently this is ipfrag_high_thresh.
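
For reference, the accounting behind ipfrag_high_thresh at the time is a
per-netns percpu counter, roughly along these lines (a sketch
reconstructed from include/net/inet_frag.h of that era, not an exact
copy):

    static inline int frag_mem_limit(struct netns_frags *nf)
    {
        return percpu_counter_read(&nf->mem);
    }

Reassembly starts refusing/evicting once frag_mem_limit(nf) exceeds
nf->high_thresh (i.e. ipfrag_high_thresh), independent of the hash table
size.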



Patch

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6f41b45..b2cd72a 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -42,13 +42,7 @@  struct inet_frag_queue {
 };
 
 #define INETFRAGS_HASHSZ		64
-
-/* averaged:
- * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
- *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
- *	       struct frag_queue))
- */
-#define INETFRAGS_MAXDEPTH		128
+#define INETFRAGS_MAX_HASH_DEPTH	32
 
 struct inet_frag_bucket {
 	struct hlist_head	chain;
@@ -65,6 +59,7 @@  struct inet_frags {
 	int			secret_interval;
 	struct timer_list	secret_timer;
 	u32			rnd;
+	u32			max_hash_depth;
 	int			qsize;
 
 	unsigned int		(*hashfn)(struct inet_frag_queue *);
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index beec05b..fd9fe5c 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -130,7 +130,8 @@  void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_exit_net);
 
-static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
+static inline void fq_unlink_hash(struct inet_frag_queue *fq,
+				  struct inet_frags *f)
 {
 	struct inet_frag_bucket *hb;
 	unsigned int hash;
@@ -140,24 +141,35 @@  static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 	hb = &f->hash[hash];
 
 	spin_lock(&hb->chain_lock);
-	hlist_del(&fq->list);
+	/* Handle race-condition between direct hash tables cleaning
+	 * in inet_frag_find() and users of inet_frag_kill()
+	 */
+	if (!hlist_unhashed(&fq->list))
+		hlist_del(&fq->list);
 	spin_unlock(&hb->chain_lock);
 
 	read_unlock(&f->lock);
-	inet_frag_lru_del(fq);
 }
 
-void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
+void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
+		      bool unlink_hash)
 {
 	if (del_timer(&fq->timer))
 		atomic_dec(&fq->refcnt);
 
 	if (!(fq->last_in & INET_FRAG_COMPLETE)) {
-		fq_unlink(fq, f);
+		if (unlink_hash)
+			fq_unlink_hash(fq, f);
+		inet_frag_lru_del(fq);
 		atomic_dec(&fq->refcnt);
 		fq->last_in |= INET_FRAG_COMPLETE;
 	}
 }
+
+void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
+{
+	__inet_frag_kill(fq, f, true);
+}
 EXPORT_SYMBOL(inet_frag_kill);
 
 static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
@@ -326,13 +338,14 @@  struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 	__releases(&f->lock)
 {
 	struct inet_frag_bucket *hb;
-	struct inet_frag_queue *q;
+	struct inet_frag_queue *q, *q_evict = NULL;
+	struct hlist_node *n;
 	int depth = 0;
 
 	hb = &f->hash[hash];
 
 	spin_lock(&hb->chain_lock);
-	hlist_for_each_entry(q, &hb->chain, list) {
+	hlist_for_each_entry_safe(q, n, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
 			spin_unlock(&hb->chain_lock);
@@ -340,25 +353,32 @@  struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			return q;
 		}
 		depth++;
+		q_evict = q; /* candidate for eviction */
 	}
+	/* Not found situation */
+	if (depth > f->max_hash_depth) {
+		atomic_inc(&q_evict->refcnt);
+		hlist_del_init(&q_evict->list);
+	} else
+		q_evict = NULL;
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
 
-	if (depth <= INETFRAGS_MAXDEPTH)
-		return inet_frag_create(nf, f, key);
-	else
-		return ERR_PTR(-ENOBUFS);
-}
-EXPORT_SYMBOL(inet_frag_find);
+	if (q_evict) {
+		spin_lock(&q_evict->lock);
+		if (!(q_evict->last_in & INET_FRAG_COMPLETE))
+			__inet_frag_kill(q_evict, f, false);
+		spin_unlock(&q_evict->lock);
 
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix)
-{
-	static const char msg[] = "inet_frag_find: Fragment hash bucket"
-		" list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
-		". Dropping fragment.\n";
+		if (atomic_dec_and_test(&q_evict->refcnt))
+			inet_frag_destroy(q_evict, f, NULL);
 
-	if (PTR_ERR(q) == -ENOBUFS)
-		LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg);
+		LIMIT_NETDEBUG(KERN_WARNING "%s(): Fragment hash bucket"
+			       " list length grew over limit (len %d),"
+			       " Dropping another fragment.\n",
+			       __func__, depth);
+	}
+
+	return inet_frag_create(nf, f, key);
 }
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
+EXPORT_SYMBOL(inet_frag_find);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9385206..4e8489b 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -263,10 +263,8 @@  static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (q == NULL)
 		return NULL;
-	}
 	return container_of(q, struct ipq, q);
 }
 
@@ -748,6 +746,14 @@  static struct ctl_table ip4_frags_ctl_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "ipfrag_max_hash_depth",
+		.data		= &ip4_frags.max_hash_depth,
+		.maxlen		= sizeof(u32),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero
+	},
 	{ }
 };
 
@@ -866,5 +872,6 @@  void __init ipfrag_init(void)
 	ip4_frags.match = ip4_frag_match;
 	ip4_frags.frag_expire = ip_expire;
 	ip4_frags.secret_interval = 10 * 60 * HZ;
+	ip4_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH;
 	inet_frags_init(&ip4_frags);
 }
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index dffdc1a..3713764 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -189,10 +189,8 @@  static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
 	local_bh_enable();
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (q == NULL)
 		return NULL;
-	}
 	return container_of(q, struct frag_queue, q);
 }
 
@@ -681,6 +679,7 @@  int nf_ct_frag6_init(void)
 	nf_frags.match = ip6_frag_match;
 	nf_frags.frag_expire = nf_ct_frag6_expire;
 	nf_frags.secret_interval = 10 * 60 * HZ;
+	nf_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH;
 	inet_frags_init(&nf_frags);
 
 	ret = register_pernet_subsys(&nf_ct_net_ops);
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e6e44ce..45de5eb 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -196,10 +196,8 @@  fq_find(struct net *net, __be32 id, const struct in6_addr *src,
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
 
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (q == NULL)
 		return NULL;
-	}
 	return container_of(q, struct frag_queue, q);
 }
 
@@ -600,6 +598,8 @@  static struct ctl_table ip6_frags_ns_ctl_table[] = {
 	{ }
 };
 
+static int zero;
+
 static struct ctl_table ip6_frags_ctl_table[] = {
 	{
 		.procname	= "ip6frag_secret_interval",
@@ -608,6 +608,14 @@  static struct ctl_table ip6_frags_ctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
 	},
+	{
+		.procname	= "ip6frag_max_hash_depth",
+		.data		= &ip6_frags.max_hash_depth,
+		.maxlen		= sizeof(u32),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero
+	},
 	{ }
 };
 
@@ -734,6 +742,7 @@  int __init ipv6_frag_init(void)
 	ip6_frags.match = ip6_frag_match;
 	ip6_frags.frag_expire = ip6_frag_expire;
 	ip6_frags.secret_interval = 10 * 60 * HZ;
+	ip6_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH;
 	inet_frags_init(&ip6_frags);
 out:
 	return ret;