diff mbox

[net-next,1/4] Revert "inet: limit length of fragment queue hash table bucket lists"

Message ID 20130424154800.16883.4797.stgit@dragon
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Jesper Dangaard Brouer April 24, 2013, 3:48 p.m. UTC
This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.

The problem with commit 5a3da1fe (inet: limit length of fragment queue
hash table bucket lists) is that, once we hit the hash depth limit (of
128), the we *keep* the existing frag queues, not allowing new frag
queues to be created.  Thus, an attacker can effectivly block handling
of fragments for 30 sec (as each frag queue have a timeout of 30 sec)

For this situation to occur the mem limit need to increase (from
default 4MB per netns). This can either happen by 1) creating more
netns (network namespaces) or 2) by manually increasing the mem limits
via proc files:

 /proc/sys/net/ipv4/ipfrag_high_thresh
 /proc/sys/net/ipv4/ipfrag_low_thresh

To be exact, situation occurs when, increasing the thresh to something
allowing 128 elements in each bucket, which is not that high given the
hash array size of 64 (64*128=8192), e.g.
   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968

Thus, with small frags we only need to start >=3 netns instances, for
the situation to be possible.

The reason this is inevitable, is the attackers invalid fragments will
never finish (timeout 30 sec), while valid fragments will complete and
"exit" the queue, thus the end result is hash bucket is filled with
attackers invalid/incomplete fragments.

Fixed conflicts in:
    include/net/inet_frag.h

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |    9 ---------
 net/ipv4/inet_fragment.c                |   20 +-------------------
 net/ipv4/ip_fragment.c                  |   11 +++++++----
 net/ipv6/netfilter/nf_conntrack_reasm.c |   12 ++++++------
 net/ipv6/reassembly.c                   |    8 ++------
 5 files changed, 16 insertions(+), 44 deletions(-)


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet April 25, 2013, midnight UTC | #1
On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> 
> The problem with commit 5a3da1fe (inet: limit length of fragment queue
> hash table bucket lists) is that, once we hit the hash depth limit (of
> 128), the we *keep* the existing frag queues, not allowing new frag
> queues to be created.  Thus, an attacker can effectivly block handling
> of fragments for 30 sec (as each frag queue have a timeout of 30 sec)
> 

I do not think its good to revert this patch. It was a step in right
direction.

Limiting chain length to 128 is good.

An attacker can attack a defrag unit, no matter strategy you use, thats
why fragments are doomed.

The only way to resist to an attack is to have enough memory in defrag
unit to accumulate 30 seconds worth of traffic.

Thats 30GB of memory if you receive 1GB per second, bit counting the
overhead.

If the attacker knows the IP address of the user of your defrag unit,
you are doomed because IP id are 16bits.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer April 25, 2013, 1:10 p.m. UTC | #2
On Wed, 2013-04-24 at 17:00 -0700, Eric Dumazet wrote:
> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > 
> > The problem with commit 5a3da1fe (inet: limit length of fragment queue
> > hash table bucket lists) is that, once we hit the hash depth limit (of
> > 128), the we *keep* the existing frag queues, not allowing new frag
> > queues to be created.  Thus, an attacker can effectivly block handling
> > of fragments for 30 sec (as each frag queue have a timeout of 30 sec)
> > 
> 
> I do not think its good to revert this patch. It was a step in right
> direction.

But in its current state I consider this patch dangerous.

> Limiting chain length to 128 is good.

Yes, its good to have a limit on the hash depth, but with current mem
limit per netns this creates an unfortunate side-effect.  There is a
disconnect between the netns mem limits and the global hash table.

Even with hash size of 1024, this just postpones the problem.
We now need 35 netns instances to reach the point where we block all
fragments for 30 sec.

Given a min frag size of 1108 bytes:
  1024*128*1108 = 145227776 
  145227776/(4*1024*1024) = 34.62500000000000000000

The reason this is inevitable, is the attackers invalid fragments will
never finish (timeout 30 sec), while valid fragments will complete and
"exit" the queue, thus the end result is hash bucket is filled with
attackers invalid/incomplete fragments.  IMHO this is a very dangerous
"feature" to support.

--Jesper



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Laight April 25, 2013, 1:58 p.m. UTC | #3
> The reason this is inevitable, is the attackers invalid fragments will

> never finish (timeout 30 sec), while valid fragments will complete and

> "exit" the queue, thus the end result is hash bucket is filled with

> attackers invalid/incomplete fragments.  IMHO this is a very dangerous

> "feature" to support.


If you are being attacked, the hash buckets will immediately
be filled with invalid packets, no valid packets will enter.
If you delete old entries to add new ones then any valid
packet will soon be flushed out by invalid ones.
The only time you'll manage to assemble anything is if
the valid fragments arrive 'back to back'.

The only hope is to part-decode initial fragments and decide
whether they are valid.

	David
Jesper Dangaard Brouer May 2, 2013, 7:59 a.m. UTC | #4
On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > 
> > The problem with commit 5a3da1fe (inet: limit length of fragment
> > queue hash table bucket lists) is that, once we hit the hash depth
> > limit (of 128), the we *keep* the existing frag queues, not
> > allowing new frag queues to be created.  Thus, an attacker can
> > effectivly block handling of fragments for 30 sec (as each frag
> > queue have a timeout of 30 sec)
> > 
> 
> I do not think its good to revert this patch. It was a step in right
> direction.

We need a revert, because we are too close to the merge window, and
cannot complete the needed "steps" to make this patch safe, sorry.

--Jesper
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet May 2, 2013, 3:16 p.m. UTC | #5
On Thu, 2013-05-02 at 09:59 +0200, Jesper Dangaard Brouer wrote:
> On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> > On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > > 
> > > The problem with commit 5a3da1fe (inet: limit length of fragment
> > > queue hash table bucket lists) is that, once we hit the hash depth
> > > limit (of 128), the we *keep* the existing frag queues, not
> > > allowing new frag queues to be created.  Thus, an attacker can
> > > effectivly block handling of fragments for 30 sec (as each frag
> > > queue have a timeout of 30 sec)
> > > 
> > 
> > I do not think its good to revert this patch. It was a step in right
> > direction.
> 
> We need a revert, because we are too close to the merge window, and
> cannot complete the needed "steps" to make this patch safe, sorry.

Again, a limit of 128 is totally OK. Its in fact too big.

128 cache misses consume 5 us

Allowing a chain being non limited is a more severe bug.

Reverting will allow an attacker to consume all your cpu cycles.

We changed INETFRAGS_HASHSZ to 1024, so 128*1024 max frags is already a
very big limit.

No matter what we do, we need to limit both :

- Memory consumption
- Cpu consumption

For people willing to allow more memory to be used, the only way is to
resize hash table, or using a bigger INETFRAGS_HASHSZ

I do not think there is a hurry, current defrag code is already better
than what we had years ago.


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jesper Dangaard Brouer May 3, 2013, 9:15 a.m. UTC | #6
On Thu, 02 May 2013 08:16:41 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Thu, 2013-05-02 at 09:59 +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 24 Apr 2013 17:00:30 -0700 Eric Dumazet
> > <eric.dumazet@gmail.com> wrote:
> > 
> > > On Wed, 2013-04-24 at 17:48 +0200, Jesper Dangaard Brouer wrote:
> > > > This reverts commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055.
> > > > 
> > > > The problem with commit 5a3da1fe (inet: limit length of fragment
> > > > queue hash table bucket lists) is that, once we hit the hash
> > > > depth limit (of 128), the we *keep* the existing frag queues,
> > > > not allowing new frag queues to be created.  Thus, an attacker
> > > > can effectivly block handling of fragments for 30 sec (as each
> > > > frag queue have a timeout of 30 sec)
> > > > 
> > > 
> > > I do not think its good to revert this patch. It was a step in
> > > right direction.
> > 
> > We need a revert, because we are too close to the merge window, and
> > cannot complete the needed "steps" to make this patch safe, sorry.
> 
[...]
> 
> For people willing to allow more memory to be used, the only way is to
> resize hash table, or using a bigger INETFRAGS_HASHSZ
>
> I do not think there is a hurry, current defrag code is already better
> than what we had years ago.
 
Eric I think we agree that:
 1) we need resizing of hash table based on mem limit
 2) mem limit per netns "blocks" the hash resize patch

Without these two patches/changes, the static 128 depth limit
introduces an undocumented limit on the max mem limit
setting (/proc/sys/net/ipv4/ipfrag_high_thresh).

I think we only disagree on the order of the patches.

But lets keep this, because after we have increased hash
size (INETFRAGS_HASHSZ) to 1024, we have pushed the "undocumented
limit" so-far that is very unlikely to be hit.  We would have to
start >36 netns instances, all being overloaded with small
incomplete fragments at the same time (30 sec time window).

--Jesper
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 6f41b45..eb1d6ee 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,13 +43,6 @@  struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
-/* averaged:
- * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
- *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
- *	       struct frag_queue))
- */
-#define INETFRAGS_MAXDEPTH		128
-
 struct inet_frag_bucket {
 	struct hlist_head	chain;
 	spinlock_t		chain_lock;
@@ -89,8 +82,6 @@  int inet_frag_evictor(struct netns_frags *nf, struct inet_frags *f, bool force);
 struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock);
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix);
 
 static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
 {
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index e97d66a..cabe3d7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -21,7 +21,6 @@ 
 #include <linux/rtnetlink.h>
 #include <linux/slab.h>
 
-#include <net/sock.h>
 #include <net/inet_frag.h>
 #include <net/inet_ecn.h>
 
@@ -327,7 +326,6 @@  struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 {
 	struct inet_frag_bucket *hb;
 	struct inet_frag_queue *q;
-	int depth = 0;
 
 	hb = &f->hash[hash];
 
@@ -339,26 +337,10 @@  struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			read_unlock(&f->lock);
 			return q;
 		}
-		depth++;
 	}
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);
 
-	if (depth <= INETFRAGS_MAXDEPTH)
-		return inet_frag_create(nf, f, key);
-	else
-		return ERR_PTR(-ENOBUFS);
+	return inet_frag_create(nf, f, key);
 }
 EXPORT_SYMBOL(inet_frag_find);
-
-void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q,
-				   const char *prefix)
-{
-	static const char msg[] = "inet_frag_find: Fragment hash bucket"
-		" list length grew over limit " __stringify(INETFRAGS_MAXDEPTH)
-		". Dropping fragment.\n";
-
-	if (PTR_ERR(q) == -ENOBUFS)
-		LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg);
-}
-EXPORT_SYMBOL(inet_frag_maybe_warn_overflow);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 9385206..cda5514 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -263,11 +263,14 @@  static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
-		return NULL;
-	}
+	if (q == NULL)
+		goto out_nomem;
+
 	return container_of(q, struct ipq, q);
+
+out_nomem:
+	LIMIT_NETDEBUG(KERN_ERR pr_fmt("ip_frag_create: no memory left !\n"));
+	return NULL;
 }
 
 /* Is the fragment too far ahead to be part of ipq? */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index dffdc1a..7cfa829 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -14,8 +14,6 @@ 
  * 2 of the License, or (at your option) any later version.
  */
 
-#define pr_fmt(fmt) "IPv6-nf: " fmt
-
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -189,11 +187,13 @@  static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
 	local_bh_enable();
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
-		return NULL;
-	}
+	if (q == NULL)
+		goto oom;
+
 	return container_of(q, struct frag_queue, q);
+
+oom:
+	return NULL;
 }
 
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index e6e44ce..74505c5 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -26,9 +26,6 @@ 
  *	YOSHIFUJI,H. @USAGI	Always remove fragment header to
  *				calculate ICV correctly.
  */
-
-#define pr_fmt(fmt) "IPv6: " fmt
-
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -196,10 +193,9 @@  fq_find(struct net *net, __be32 id, const struct in6_addr *src,
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
 
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
-	if (IS_ERR_OR_NULL(q)) {
-		inet_frag_maybe_warn_overflow(q, pr_fmt());
+	if (q == NULL)
 		return NULL;
-	}
+
 	return container_of(q, struct frag_queue, q);
 }