Message ID: 20130418213732.14296.36026.stgit@dragon
State: Superseded, archived
Delegated to: David Miller
On Thu, Apr 18, 2013 at 11:38:00PM +0200, Jesper Dangaard Brouer wrote:
> -		if (PTR_ERR(q) == -ENOBUFS)
> -			LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg);
> +		LIMIT_NETDEBUG(KERN_WARNING "%s(): Fragment hash bucket"
> +			       " list length grew over limit (len %d),"
> +			       " Dropping another fragment.\n",
> +			       __func__, depth);
> +	}

Simple nice-to-have thingie: Do you see a simple possibility to add the
user (e.g. ipv6/ipv6-netfilter/ipv4) to the warning?
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2013-04-18 at 23:38 +0200, Jesper Dangaard Brouer wrote:
> I have found an issue with commit:
>
>   commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
>   Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
>   Date:   Fri Mar 15 11:32:30 2013 +0000
>
>       inet: limit length of fragment queue hash table bucket lists
>
> There is a connection between the fixed 128 hash depth limit and the
> frag mem limit/thresh settings, which limits how high the thresh can
> be set.
>
> The 128-element hash depth limit results in bad behaviour if the mem
> limit thresholds are increased via /proc/sys/net ::
>
>  /proc/sys/net/ipv4/ipfrag_high_thresh
>  /proc/sys/net/ipv4/ipfrag_low_thresh
>
> If we increase the thresh to something allowing 128 elements in each
> bucket, which is not that high given the hash array size of 64
> (64*128=8192), e.g.
>   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
>   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
>
> The problem with commit 5a3da1fe (inet: limit length of fragment
> queue hash table bucket lists) is that, once we hit the limit, we
> *keep* the existing frag queues, not allowing new frag queues to be
> created. Thus, an attacker can effectively block handling of
> fragments for 30 sec (as each frag queue has a timeout of 30 sec).
>
> Even without increasing the limit, as Hannes showed, an attacker on
> IPv6 can "attack" a specific hash bucket, and via that can block/drop
> new fragments also (trying to) utilize this bucket.
>
> Summary:
> With the default mem limit/thresh settings this is not a general
> problem, but adjusting the thresh limits results in somewhat
> unexpected behavior.
>
> Proposed solution:
> IMHO instead of keeping existing frag queues, we should kill one of
> the frag queues in the hash instead.

This strategy won't really help DDoS attacks. No frag will ever
complete.

I am not sure it's worth adding extra complexity.
> Implementation complications:
> Killing of frag queues while only holding the hash bucket lock, and
> not the frag queue lock, complicates the implementation, as we race
> and can end up (trying to) remove the hash element twice (resulting
> in an oops). This has been addressed by using hlist_del_init() and a
> hlist_unhashed() check in fq_unlink_hash().
>
> Extra:
>  * Added new sysctl "max_hash_depth" option, to allow users to adjust
>    the hash depth along with adjusting the thresh limits.
>  * Change max hash depth to 32, thus limiting handling to approx 2048
>    frag queues.
>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>
>  include/net/inet_frag.h                 |    9 +---
>  net/ipv4/inet_fragment.c                |   64 ++++++++++++++++++++-----------
>  net/ipv4/ip_fragment.c                  |   13 +++++-
>  net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
>  net/ipv6/reassembly.c                   |   15 ++++++-
>  5 files changed, 68 insertions(+), 38 deletions(-)

Hmm... adding a new sysctl without documentation is a clear sign
you'll be the only user of it.

You are also setting a default limit of 32, more likely to hit the
problem than the current 128 value.

We know the real solution is to have a correctly sized hash table, so
why add a temporary sysctl?

As soon as /proc/sys/net/ipv4/ipfrag_high_thresh is changed, a resize
should be attempted. But the max depth itself should be a reasonable
value, and doesn't need to be tuned IMHO.

The 64-slot hash table was chosen years ago, when machines had 3
orders of magnitude less RAM than today.

Before hash resizing, I would just bump the hash size to something
more reasonable, like 1024. That would allow an admin to set
/proc/sys/net/ipv4/ipfrag_high_thresh to a quite large value.
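The bucket-capacity arithmetic in the quoted patch description can be sanity-checked with a standalone sketch. The truesize and per-queue overhead figures below are the numbers quoted in the mail, not values derived from kernel headers:

```c
enum {
    HASHSZ    = 64,   /* INETFRAGS_HASHSZ, the hash array size        */
    MAX_DEPTH = 128,  /* per-bucket depth limit from commit 5a3da1fe  */
    IPQ_SIZE  = 208,  /* approx per-queue overhead quoted in the mail */
};

/* Memory threshold needed before every bucket can fill to MAX_DEPTH. */
static long thresh_to_fill_all_buckets(long truesize)
{
    long max_elems = (long)HASHSZ * MAX_DEPTH;  /* 64 * 128 = 8192 elems */
    return (truesize + IPQ_SIZE) * max_elems;
}
```

For small frags this reproduces the mail's 9043968 bytes exactly, i.e. raising ipfrag_high_thresh to only ~9 MB already lets every bucket reach the old 128-entry depth limit.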
On Fri, 2013-04-19 at 11:41 +0100, David Laight wrote:
> > The 64-slot hash table was chosen years ago, when machines had 3
> > orders of magnitude less RAM than today.
> >
> > Before hash resizing, I would just bump the hash size to something
> > more reasonable, like 1024.
>
> While that is true of many systems, there are embedded systems with
> much less RAM than the typical x86 desktop or server.
>
> There seem to be quite a lot of large hash tables appearing that on
> many small systems will never contain a significant number of entries
> and just waste precious memory.

Embedded systems have run Linux for ages, and their memory has also
increased by 2 orders of magnitude since 1995. If they run linux-3.10,
using 4096 or 8192 bytes for the hash table is fine.

We are not going to add yet another ifdef for such a small amount of
RAM. The code that we will add to do the resize will be larger than
this.
On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:
> On Thu, 2013-04-18 at 23:38 +0200, Jesper Dangaard Brouer wrote:
> > I have found an issue with commit:
> >
> >   commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
> >   Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >   Date:   Fri Mar 15 11:32:30 2013 +0000
> >
> >       inet: limit length of fragment queue hash table bucket lists
> >
> > There is a connection between the fixed 128 hash depth limit and
> > the frag mem limit/thresh settings, which limits how high the
> > thresh can be set.
> >
> > The 128-element hash depth limit results in bad behaviour if the
> > mem limit thresholds are increased via /proc/sys/net ::
> >
> >  /proc/sys/net/ipv4/ipfrag_high_thresh
> >  /proc/sys/net/ipv4/ipfrag_low_thresh
> >
> > If we increase the thresh to something allowing 128 elements in
> > each bucket, which is not that high given the hash array size of 64
> > (64*128=8192), e.g.
> >   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
> >   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
> >
> > The problem with commit 5a3da1fe (inet: limit length of fragment
> > queue hash table bucket lists) is that, once we hit the limit, we
> > *keep* the existing frag queues, not allowing new frag queues to be
> > created. Thus, an attacker can effectively block handling of
> > fragments for 30 sec (as each frag queue has a timeout of 30 sec).
> >
> > Even without increasing the limit, as Hannes showed, an attacker on
> > IPv6 can "attack" a specific hash bucket, and via that can
> > block/drop new fragments also (trying to) utilize this bucket.
> >
> > Summary:
> > With the default mem limit/thresh settings this is not a general
> > problem, but adjusting the thresh limits results in somewhat
> > unexpected behavior.
> >
> > Proposed solution:
> > IMHO instead of keeping existing frag queues, we should kill one of
> > the frag queues in the hash instead.
> This strategy won't really help DDoS attacks. No frag will ever
> complete.

I beg to differ.

First of all, I also thought keeping fragments was the right choice,
BUT then I realized that we cannot tell good from bad frags apart, and
an attacker will obviously generate more fragments than the real
traffic, thus we end up keeping the attacker's fragments for 30 sec,
because the attacker will never "finish/complete" his fragment queues.
Thus, the real "help" for a DoS attack is to keep fragments.

Second, I have clearly shown, with my performance tests, that frags
will complete under DoS.

Consider the argument for why frags will be able to complete: for an
attacker to "push out" the real fragments she will have to generate
enough packets to fill the entire hash queue (to be sure), which is
64 * 32 = 2048 elements, thus generate (somewhere below) 2048 packets.
In the same period the real-fragment sender only needs to send 3
packets (my DNSSEC use case). That is the reason it works; do you buy
that explanation?

> I am not sure its worth adding extra complexity.

It's not that complex, and we simply need it, else an attacker can DoS
us very easily by sending a burst every 30 sec. We do need this
change, else we must revert Hannes' patch and find a completely
different approach to removing the LRU list system.

> > Implementation complications:
> > Killing of frag queues while only holding the hash bucket lock, and
> > not the frag queue lock, complicates the implementation, as we race
> > and can end up (trying to) remove the hash element twice (resulting
> > in an oops). This has been addressed by using hlist_del_init() and
> > a hlist_unhashed() check in fq_unlink_hash().
> >
> > Extra:
> > * Added new sysctl "max_hash_depth" option, to allow users to
> >   adjust the hash depth along with adjusting the thresh limits.
> > * Change max hash depth to 32, thus limiting handling to approx
> >   2048 frag queues.
> > Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > ---
> >
> >  include/net/inet_frag.h                 |    9 +---
> >  net/ipv4/inet_fragment.c                |   64 ++++++++++++++++++++-----------
> >  net/ipv4/ip_fragment.c                  |   13 +++++-
> >  net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
> >  net/ipv6/reassembly.c                   |   15 ++++++-
> >  5 files changed, 68 insertions(+), 38 deletions(-)
>
> Hmm... adding a new sysctl without documentation is a clear sign
> you'll be the only user of it.

LOL - I know, my next patch contains a note that I need to update the
docs too...

> You are also setting a default limit of 32, more likely to hit the
> problem than the current 128 value.
>
> We know the real solution is to have a correctly sized hash table, so
> why add a temporary sysctl?

No, the real solution is to remove the LRU list system (that is the
real bottleneck). But this patch (and the next) is a step needed
before we can remove the LRU list cleanup system.

> As soon as /proc/sys/net/ipv4/ipfrag_high_thresh is changed, a resize
> should be attempted.

Perhaps, but the ipfrag_high_thresh memory size does not translate
easily to the number of needed frag queues (hash_size * max_depth),
because the size of a frag queue can range from 1108 bytes to approx
130,000 bytes.

> But the max depth itself should be a reasonable value, and doesn't
> need to be tuned IMHO.

What is a reasonable max depth value for you?

> The 64-slot hash table was chosen years ago, when machines had 3
> orders of magnitude less RAM than today.
>
> Before hash resizing, I would just bump the hash size to something
> more reasonable, like 1024.
>
> That would allow an admin to set
> /proc/sys/net/ipv4/ipfrag_high_thresh to a quite large value.

When removing the LRU system (which is the real bottleneck, see perf
tests in the cover mail) and doing direct hash cleaning, we are
trading in accuracy. The reason I don't want a too-big hash table is
the following.
With direct hash cleaning, and even using the mem limit
(ipfrag_high_thresh) in the cleaning loop/selection (see patch-02), we
should at least allow one frag queue to survive (in a hash bucket).
This means that we can violate/exceed ipfrag_high_thresh by a lot
(because the size of a frag queue can range from 1108 bytes to approx
130,000 bytes).

Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
embedded systems is a lot of kernel memory we are permitting a remote
host to "lock down".

--Jesper
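The double-unlink race described earlier in the thread (hash removal while holding only the bucket lock) is handled by the fact that hlist_del_init() leaves a removed node "unhashed", so a racing second removal attempt becomes a no-op. Below is a minimal userspace re-implementation of the relevant hlist semantics for illustration; the names mirror the kernel's list helpers, but this is a sketch, not the kernel code:

```c
#include <stddef.h>
#include <stdbool.h>

/* Minimal userspace mock of the kernel hlist primitives involved. */
struct hlist_node { struct hlist_node *next, **pprev; };
struct hlist_head { struct hlist_node *first; };

static void hlist_add_head(struct hlist_node *n, struct hlist_head *h)
{
    n->next = h->first;
    if (h->first)
        h->first->pprev = &n->next;
    h->first = n;
    n->pprev = &h->first;
}

static bool hlist_unhashed(const struct hlist_node *n)
{
    return !n->pprev;   /* NULL pprev marks "not on any list" */
}

static void hlist_del_init(struct hlist_node *n)
{
    if (hlist_unhashed(n))
        return;                      /* already removed: no-op */
    *n->pprev = n->next;
    if (n->next)
        n->next->pprev = n->pprev;
    n->next = NULL;
    n->pprev = NULL;                 /* re-mark node as unhashed */
}

/* The pattern described in the mail: fq_unlink_hash() checks
 * hlist_unhashed() so a racing second unlink is harmless. */
static void fq_unlink_hash(struct hlist_node *n)
{
    if (!hlist_unhashed(n))
        hlist_del_init(n);
}
```

Calling fq_unlink_hash() twice on the same node leaves the chain intact instead of oopsing, which is exactly the property the patch relies on.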
On Fri, Apr 19, 2013 at 02:19:10PM +0200, Jesper Dangaard Brouer wrote:
> I beg to differ.
>
> First of all, I also thought keeping fragments was the right choice,
> BUT then I realized that we cannot tell good from bad frags apart,
> and an attacker will obviously generate more fragments than the real
> traffic, thus we end up keeping the attacker's fragments for 30 sec,
> because the attacker will never "finish/complete" his fragment
> queues. Thus, the real "help" for a DoS attack is to keep fragments.
>
> Second, I have clearly shown, with my performance tests, that frags
> will complete under DoS.

How did you simulate the DoS? Did you try to fill up specific buckets
or did you try to fill up the whole fragment cache?

> > I am not sure its worth adding extra complexity.
>
> It's not that complex, and we simply need it, else an attacker can
> DoS us very easily by sending a burst every 30 sec. We do need this
> change, else we must revert Hannes' patch and find a completely
> different approach to removing the LRU list system.

I agree that we have to do something to mitigate this. I would also
drop the sysctl; I do think the kernel should take care of that.

Perhaps a way to solve this problem more easily would be to switch
back to using a jenkins hash for all inputs of the hash function and
not deal with this kind of unfairness regarding the possibility to
fill up one of the buckets.

Perhaps the performance improvements by Jesper (per bucket locking)
could make up for the more expensive hashing. :)

Just some minor things I found while looking at the patch:

> -
> -/* averaged:
> - *	max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> - *		    rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> - *				struct frag_queue))
> - */
> -#define INETFRAGS_MAXDEPTH	128
> +#define INETFRAGS_MAX_HASH_DEPTH 32

Perhaps a comment could be added, that this is only a default value
for the max hash depth.
> -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> +		      bool unlink_hash)

This function could be static.

> -	hlist_for_each_entry(q, &hb->chain, list) {
> +	hlist_for_each_entry_safe(q, n, &hb->chain, list) {

Minor, but I think _safe is not needed here.

> +static int zero;
> +

Could be const.

Thanks,

Hannes
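For context on the "could be const" remark: the `zero` here is a sysctl bound whose address ends up in a ctl_table's extra1 field, and it is only ever read. A hedged userspace sketch of the idea follows; `clamp_depth` merely stands in for the bounds checking that proc_dointvec_minmax performs with extra1/extra2, and the names are hypothetical:

```c
/* Sysctl bounds are plain ints whose addresses are stored in
 * ctl_table.extra1/extra2; marking them const (as suggested above)
 * turns any accidental write into a compile-time error at no runtime
 * cost. Names below are illustrative, not from the patch. */
static const int zero = 0;
static const int max_hash_depth_max = 1024;

/* Stand-in for proc_dointvec_minmax-style bounds enforcement. */
static int clamp_depth(int val)
{
    if (val < zero)
        return zero;
    if (val > max_hash_depth_max)
        return max_hash_depth_max;
    return val;
}
```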
On Fri, 2013-04-19 at 14:45 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 19, 2013 at 02:19:10PM +0200, Jesper Dangaard Brouer wrote:
> > I beg to differ.
> >
> > First of all, I also thought keeping fragments was the right
> > choice, BUT then I realized that we cannot tell good from bad frags
> > apart, and an attacker will obviously generate more fragments than
> > the real traffic, thus we end up keeping the attacker's fragments
> > for 30 sec, because the attacker will never "finish/complete" his
> > fragment queues. Thus, the real "help" for a DoS attack is to keep
> > fragments.
> >
> > Second, I have clearly shown, with my performance tests, that frags
> > will complete under DoS.
>
> How did you simulate the DoS? Did you try to fill up specific buckets
> or did you try to fill up the whole fragment cache?

I'm using trafgen (part of netsniff-ng, git download from
https://github.com/borkmann/netsniff-ng).

Simple DoS test setup read (scroll down):
 http://thread.gmane.org/gmane.linux.network/257155

As mentioned last time:
 http://article.gmane.org/gmane.linux.network/263644
I have extended the DoS generator. Quote:
 'I have changed the frag DoS generator script to be more
  efficient/deadly. Before it would only hit one RX queue, now it is
  sending packets causing multi-queue RX, due to "better" RX hashing.'

I have placed the new "multi-queue" trafgen script/input here:
 http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet05_small_frag_multi_queue.txf

Perhaps you can tell me if my trafgen traffic is getting hashed badly?

> > > I am not sure its worth adding extra complexity.
> >
> > It's not that complex, and we simply need it, else an attacker can
> > DoS us very easily by sending a burst every 30 sec. We do need this
> > change, else we must revert Hannes' patch and find a completely
> > different approach to removing the LRU list system.
>
> I agree that we have to do something to mitigate this.
> I would also drop the sysctl; I do think the kernel should take care
> of that.
>
> Perhaps a way to solve this problem more easily would be to switch
> back to using a jenkins hash for all inputs of the hash function and
> not deal with this kind of unfairness regarding the possibility to
> fill up one of the buckets.
>
> Perhaps the performance improvements by Jesper (per bucket locking)
> could make up for the more expensive hashing. :)

Well, I don't know. But we do need some solution to the current code.

> Just some minor things I found while looking at the patch:
>
> > -
> > -/* averaged:
> > - *	max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
> > - *		    rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
> > - *				struct frag_queue))
> > - */
> > -#define INETFRAGS_MAXDEPTH	128
> > +#define INETFRAGS_MAX_HASH_DEPTH 32
>
> Perhaps a comment could be added, that this is only a default value
> for the max hash depth.

Yes, I thought of calling it INETFRAGS_MAX_HASH_DEPTH_DEFAULT, but it
was getting too long. But now you and Eric argue against making this
tunable, so it might be the right "name".

> > -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> > +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> > +		      bool unlink_hash)
>
> This function could be static.

(Sorry to be dumb) could you explain the benefit of doing so?

> > -	hlist_for_each_entry(q, &hb->chain, list) {
> > +	hlist_for_each_entry_safe(q, n, &hb->chain, list) {
>
> Minor, but I think _safe is not needed here.

Yes, the hash unlink happens after we have exited the for_each loop.

> > +static int zero;
> > +
>
> Could be const.

Could you also explain the benefit of doing so for a variable?

--Jesper
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:
> > This strategy won't really help DDoS attacks. No frag will ever
> > complete.
>
> I beg to differ.

We already had a discussion on this topic.
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 03:11 -0700, Eric Dumazet wrote:
> > I am not sure its worth adding extra complexity.
>
> It's not that complex, and we simply need it, else an attacker can
> DoS us very easily by sending a burst every 30 sec. We do need this
> change, else we must revert Hannes' patch and find a completely
> different approach to removing the LRU list system.

It's never-ending stuff. Fragments are fundamentally not suitable for
any workload that can be attacked by a hostile guy.

The guy will adapt his strategy knowing yours. That's pretty easy for
him; Linux sources are public.
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> What is a reasonable max depth value for you?

The current one is OK.
On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
> When removing the LRU system (which is the real bottleneck, see perf
> tests in the cover mail) and doing direct hash cleaning, we are
> trading in accuracy.

You are mixing performance issues and correctness.

> The reason I don't want a too-big hash table is the following.
>
> Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
> embedded systems is a lot of kernel memory we are permitting a remote
> host to "lock down".

That's pretty irrelevant; memory is limited by the total amount of
memory used by fragments, not by hash table size. It's called
/proc/sys/net/ipv4/ipfrag_high_thresh

It seems to me you are spending time on the wrong things.
On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> I'm using trafgen (part of netsniff-ng, git download from
> https://github.com/borkmann/netsniff-ng).
>
> Simple DoS test setup read (scroll down):
>  http://thread.gmane.org/gmane.linux.network/257155
>
> As mentioned last time:
>  http://article.gmane.org/gmane.linux.network/263644
> I have extended the DoS generator. Quote:
>  'I have changed the frag DoS generator script to be more
>   efficient/deadly. Before it would only hit one RX queue, now it is
>   sending packets causing multi-queue RX, due to "better" RX
>   hashing.'
>
> I have placed the new "multi-queue" trafgen script/input here:
>  http://people.netfilter.org/hawk/frag_work/trafgen/frag_packet05_small_frag_multi_queue.txf
>
> Perhaps you can tell me if my trafgen traffic is getting hashed
> badly?

Thanks! IPv4 traffic does not have this problem, as all the
information used to identify the packet is fed directly into
jhash_3words (the problem is that IPv6 addresses are xored first and
then fed into the jhash). So with this setup you would flood the hash
uniformly distributed. (Btw. I made a tool to generate fragments with
colliding hashes: https://gist.github.com/hannes/5116331).

> > I agree that we have to do something to mitigate this. I would also
> > drop the sysctl; I do think the kernel should take care of that.
> >
> > Perhaps a way to solve this problem more easily would be to switch
> > back to using a jenkins hash for all inputs of the hash function
> > and not deal with this kind of unfairness regarding the possibility
> > to fill up one of the buckets.
> >
> > Perhaps the performance improvements by Jesper (per bucket locking)
> > could make up for the more expensive hashing. :)
>
> Well, I don't know. But we do need some solution to the current code.

You do have fancy 10GigE gear? :) Could you profile a kernel with
279e9f2 ("ipv6: optimize inet6_hash_frag()") reverted?
This would only affect IPv6 traffic, so you would have to write an
IPv6 txf file. If the numbers show that the effect is not too high, we
could actually stop worrying about per-bucket hash utilization in
general (still, a limit per hash chain would be reasonable, but a
global element limit should catch first). This would give us the
possibility to evict fragments from the fragmentation cache as soon as
new fragments come in, without caring about which hash bucket the
fragment actually belongs to. This would solve one of your problems
with the RFC patch.

Btw, where is the txf file format actually documented?

> > > -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
> > > +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
> > > +		      bool unlink_hash)
> >
> > This function could be static.
>
> (Sorry to be dumb) could you explain the benefit of doing so?

The function's namespace will be limited to the current file and won't
be exported for linking by other files. It also gives a good hint to
developers that the function is not used externally.

> > > +static int zero;
> > > +
> >
> > Could be const.
>
> Could you also explain the benefit of doing so for a variable?

Precaution that it won't be changed by any statement in the file (it
does not add overhead, but throws compile-time errors if someone
accidentally changes the value by mistake).

Thanks,

Hannes
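The IPv6 weakness discussed above comes from the hash input preparation, not the hash function itself: if the 128-bit addresses are xor-folded to 32 bits before hashing, an attacker can pick distinct addresses with identical folds, and no hash applied afterwards can separate them. An illustrative sketch of the folding step (not the kernel's exact inet6_hash_frag() code):

```c
#include <stdint.h>

/* xor-fold a 128-bit IPv6 address (4 x 32-bit words) down to 32 bits,
 * mimicking the kind of input reduction described in the mail. Any
 * two addresses with the same word-xor collide before the jhash even
 * runs, regardless of how good the jhash is. */
static uint32_t xor_fold(const uint32_t addr[4])
{
    return addr[0] ^ addr[1] ^ addr[2] ^ addr[3];
}
```

For example, the distinct addresses {1, 2, 3, 4} and {4, 3, 2, 1} fold to the same 32-bit value, so every hash computed from the folded value collides. Feeding all four words into a jenkins hash (jhash2-style, as Hannes proposes) removes this attacker-controlled collapse at the cost of hashing more input.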
On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> Well, I don't know. But we do need some solution, to the current code.
In <http://article.gmane.org/gmane.linux.network/261361> I said that we could
actually have a list length of about 370. At this time this number was stable,
perhaps you could verify?
I tried to flood the cache with very minimal packets, so this was actually
the hint that I should have resized the hash back then. With the current
fragmentation cache design we could reach optimal behaviour as soon
as the memory limit kicks in and LRU eviction starts before we limit
the fragment queues in the hash chains. The only way to achieve this is to
increase the hash table slots and lower the maximum length limit. I would
propose a limit of about 25-32 and, as Eric said, a hash size of 1024. We could
test if we are limited in accepting new fragments by the memory limit (which would
be fine because LRU eviction kicks in) or by chain length (we could recheck
the numbers then).
So the chain limit would only kick in if someone tries to exploit the fragment
cache by using the method I demonstrated before (which was the reason I
introduced this limit).
Thanks,
Hannes
On Fri, 2013-04-19 at 21:44 +0200, Hannes Frederic Sowa wrote:
> On Fri, Apr 19, 2013 at 04:29:02PM +0200, Jesper Dangaard Brouer wrote:
> > Well, I don't know. But we do need some solution to the current
> > code.
>
> In <http://article.gmane.org/gmane.linux.network/261361> I said that
> we could actually have a list length of about 370. At this time this
> number was stable, perhaps you could verify?
>
> I tried to flood the cache with very minimal packets, so this was
> actually the hint that I should have resized the hash back then. With
> the current fragmentation cache design we could reach optimal
> behaviour as soon as the memory limit kicks in and LRU eviction
> starts before we limit the fragment queues in the hash chains. The
> only way to achieve this is to increase the hash table slots and
> lower the maximum length limit. I would propose a limit of about
> 25-32 and, as Eric said, a hash size of 1024. We could test if we are
> limited in accepting new fragments by the memory limit (which would
> be fine because LRU eviction kicks in) or by chain length (we could
> recheck the numbers then).
>
> So the chain limit would only kick in if someone tries to exploit the
> fragment cache by using the method I demonstrated before (which was
> the reason I introduced this limit).

(To avoid pissing people off) I acknowledge that we should change the
hash size, as it's ridiculously small with 64 entries.

But your mem limit assumption and hash depth limit assumptions are
broken, because the mem limit is per netns (network namespace). Thus,
starting more netns instances will break these assumptions.

The dangerous part of your change (commit 5a3da1fe) is that you keep
the existing frag queues (and don't allow new frag queues to be
created). The attacker's fragments will never finish (timeout 30 sec),
while valid fragments will complete and "exit" the queue; thus the end
result is that the hash bucket is filled with the attacker's
invalid/incomplete fragments.
Besides, after we have implemented per hash bucket locking (in my
change, commit 19952cc4 "net: frag queue per hash bucket locking"), I
don't think it is a big problem that a single hash bucket is being
"attacked".

IMHO we should just revert the change (commit 5a3da1fe), increase the
hash size and fix the hashing for IPv6. And then I'll find another
method for fixing the global LRU list scalability problem (than the
"direct-hash-cleaning" method).
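A shape-only sketch of the per-bucket layout that commit 19952cc4 is described as introducing: each hash chain carries its own lock, so traffic hammering one bucket contends on that bucket's lock instead of serializing the whole table. Pthread mutexes stand in for kernel spinlocks, and the depth counter models the chain-length limit; this is illustrative, not the kernel code:

```c
#include <pthread.h>
#include <stdbool.h>

#define HASHSZ 64

struct frag_bucket {
    pthread_mutex_t lock;   /* per-chain lock; kernel uses a spinlock */
    int             depth;  /* current chain length                   */
};

static struct frag_bucket frag_table[HASHSZ];

static void frag_table_init(void)
{
    for (int i = 0; i < HASHSZ; i++) {
        pthread_mutex_init(&frag_table[i].lock, NULL);
        frag_table[i].depth = 0;
    }
}

/* Chain a new frag queue into its bucket unless the depth limit is
 * hit; only this bucket's lock is taken. */
static bool bucket_insert(unsigned int hash, int max_depth)
{
    struct frag_bucket *hb = &frag_table[hash % HASHSZ];
    bool ok;

    pthread_mutex_lock(&hb->lock);
    ok = hb->depth < max_depth;
    if (ok)
        hb->depth++;            /* real code chains the queue here */
    pthread_mutex_unlock(&hb->lock);
    return ok;
}
```

With a depth limit of 2 here, a bucket accepts two inserts and rejects the third, while other buckets remain unaffected.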
On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> (To avoid pissing people off) I acknowledge that we should change the
> hash size, as it's ridiculously small with 64 entries.
>
> But your mem limit assumption and hash depth limit assumptions are
> broken, because the mem limit is per netns (network namespace).
> Thus, starting more netns instances will break these assumptions.

Oh, I see. :/

At first I thought we should make the fragment hash per namespace too,
to provide better isolation in the case of lxc. But then each chrome
tab would allocate its own fragment cache, too. Hmm... but people
using namespaces have plenty of memory, don't they? We could also
provide an inet_fragment namespace. ;)

> The dangerous part of your change (commit 5a3da1fe) is that you keep
> the existing frag queues (and don't allow new frag queues to be
> created). The attacker's fragments will never finish (timeout 30
> sec), while valid fragments will complete and "exit" the queue; thus
> the end result is that the hash bucket is filled with the attacker's
> invalid/incomplete fragments.

I would not mind if your change gets accepted (I have not completely
reviewed it yet), but I have my doubts whether it is an advantage over
the current solution.

First off, I think an attacker can keep the fragment cache pretty much
filled up at little cost. The current implementation has the grace
period where no new fragments will be accepted after the DoS; this is
solved by your patch. But the change makes it easier for an attacker
to evict "valid" fragments from the cache in the first 30 seconds of
the DoS, too.

I am not sure whether the current fragmentation handling or your
solution performs better in the real world (or if it actually
matters).

Nonetheless it does add a bit more complexity and a new sysctl which
exposes something the kernel should know how to do better.
> Besides, after we have implemented per hash bucket locking (in my
> change, commit 19952cc4 "net: frag queue per hash bucket locking"),
> I don't think it is a big problem that a single hash bucket is being
> "attacked".

I don't know, I wouldn't say so. The contention point is now the per
hash bucket lock, but it should show the same symptoms as before.

In my opinion we should start resizing the hash table irrespective of
the namespace limits (one needs CAP_NET_ADMIN to connect a netns to
the outside world, I think) and try to move forward with patch 3. This
patch 2 would then only be a dependency and would introduce the
eviction strategy you need for patch 3. But the focus should be on the
removal of the LRU cleanup. What do you think?

Thanks,

Hannes
On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> > (To avoid pissing people off) I acknowledge that we should change
> > the hash size, as it's ridiculously small with 64 entries.
> >
> > But your mem limit assumption and hash depth limit assumptions are
> > broken, because the mem limit is per netns (network namespace).
> > Thus, starting more netns instances will break these assumptions.
>
> Oh, I see. :/
>
> At first I thought we should make the fragment hash per namespace
> too, to provide better isolation in the case of lxc. But then each
> chrome tab would allocate its own fragment cache, too. Hmm... but
> people using namespaces have plenty of memory, don't they? We could
> also provide an inet_fragment namespace. ;)

I'm wondering if we could do the opposite: move the mem limit and LRU
list "out of" the netns? Either way, this would make the relationship
of the mem limit and hash size more sane.

> > The dangerous part of your change (commit 5a3da1fe) is that you
> > keep the existing frag queues (and don't allow new frag queues to
> > be created). The attacker's fragments will never finish (timeout 30
> > sec), while valid fragments will complete and "exit" the queue;
> > thus the end result is that the hash bucket is filled with the
> > attacker's invalid/incomplete fragments.
>
> I would not mind if your change gets accepted (I have not completely
> reviewed it yet), but I have my doubts whether it is an advantage
> over the current solution.
>
> First off, I think an attacker can keep the fragment cache pretty
> much filled up at little cost. The current implementation has the
> grace period where no new fragments will be accepted after the DoS;
> this is solved by your patch. But the change makes it easier for an
> attacker to evict "valid" fragments from the cache in the first 30
> seconds of the DoS, too.

The "grace period" is quite harmful (where no new fragments will be
accepted).
Just by creating 3 netns (3x 4MB mem limit) we can make all hash
buckets reach 128 entries, resulting in a "grace period" of 30 sec
where no frags are possible. (The min frag size is 1108 bytes with my
trafgen script.)

> I am not sure whether the current fragmentation handling or your
> solution performs better in the real world (or if it actually matters).
>
> Nonetheless it does add a bit more complexity and a new sysctl which
> exposes something the kernel should know how to do better.

Well, actually I don't like exposing the max_hash_depth sysctl; it was
a wrong idea/move.

I like Eric's idea of resizing the hash based on the max thresh;
unfortunately this does not make sense when the max thresh is per netns
and the hash table is global.

I'm also wondering: is it really worth the complexity of having a depth
limit on this hash table? Is it that important? The mem limit should at
some point kick in and save the day anyhow (before, without per hash
bucket locking, it might have made sense).

> > Besides, after we have implemented per hash bucket locking (in my
> > change commit 19952cc4 "net: frag queue per hash bucket locking").
> > Then, I don't think it is a big problem that a single hash bucket is
> > being "attacked".
>
> I don't know, I wouldn't say so. The contention point is now the per
> hash bucket lock but it should show the same symptoms as before.
>
> In my opinion we should start resizing the hash table irrespective of
> the namespace limits (one needs CAP_NET_ADMIN to connect a netns to
> the outside world, I think) and try to move forward with patch 3. This
> patch 2 would then only be a dependency and would introduce the eviction
> strategy you need for patch 3. But the focus should be on the removal
> of the LRU cleanup. What do you think?

I agree, increasing the hash table's size makes sense, as it's
ridiculously small with 64 entries.

Yes, removal of the (per netns) "global" LRU list should be the real
focus.
This was just a dependency for introducing the eviction strategy I
needed for patch 3.

But the eviction strategy in patch 3 is actually also "broken", because
the mem limit is per netns while we do eviction based on the hash table
shared between netns... I'm thinking of going back to my original idea
of simply doing LRU lists per CPU.

--Jesper
On Mon, Apr 22, 2013 at 06:30:17PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> > On Mon, Apr 22, 2013 at 11:10:34AM +0200, Jesper Dangaard Brouer wrote:
> [...]
> > > Besides, after we have implemented per hash bucket locking (in my
> > > change commit 19952cc4 "net: frag queue per hash bucket locking").
> > > Then, I don't think it is a big problem that a single hash bucket
> > > is being "attacked".
> >
> > I don't know, I wouldn't say so. The contention point is now the per
> > hash bucket lock but it should show the same symptoms as before.
>
> No, the contention point is the LRU list lock, not the hash bucket lock.
> If you perf record/profile the code, you can easily miss that it's the
> LRU lock, because it's inlined. Try to rerun your tests with noinline,
> e.g.:

It depends on the test. Last time I checked with my ipv6 torture test I
had most hits in the inet_frag_find loop (I looked at it after your per
bucket locks landed in net-next). I think your test setup could provide
more meaningful numbers. If you fill up the whole fragment cache, it is
plausible that the contention will shift towards the LRU lock.

On Mon, Apr 22, 2013 at 07:49:12PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 2013-04-22 at 16:54 +0200, Hannes Frederic Sowa wrote:
> > At first I thought we should make the fragment hash per namespace too,
> > to provide better isolation in case of lxc. But then each chrome tab
> > would allocate its own fragment cache, too. Hmm... but people using
> > namespaces have plenty of memory, don't they? We could also provide
> > an inet_fragment namespace. ;)
>
> I'm wondering if we could do the opposite: move the mem limit and LRU
> list "out of" the netns?
> Either way, this would make the relationship of the mem limit and hash
> size more sane.

It seems to be the only sane way currently.
If we resize on cloning/changing per-ns parameters, we would also have
to resize the frag hash on every Google Chrome tab creation/destruction.
Perhaps we should think about adding lazy allocation/resizing of the
fragment cache per-ns (or we do the resize on every rebuild timeout).

> > > The dangerous part of your change (commit 5a3da1fe) is that you keep
> > > the existing frag queues (and don't allow new frag queues to be
> > > created). The attacker's fragments will never finish (timeout 30
> > > sec), while valid fragments will complete and "exit" the queue, thus
> > > the end result is a hash bucket filled with the attacker's
> > > invalid/incomplete fragments.
> >
> > I would not mind if your change gets accepted (I have not completely
> > reviewed it yet), but I have my doubts whether it is an advantage over
> > the current solution.
> >
> > First off, I think an attacker can keep the fragment cache pretty much
> > filled up with little cost. The current implementation has the grace
> > period where no new fragments will be accepted after the DoS; this is
> > solved by your patch. But the change makes it easier for an attacker
> > to evict "valid" fragments from the cache in the first 30 seconds of
> > the DoS, too.
>
> The "grace period" is quite harmful (where no new fragments will be
> accepted). Just creating 3 netns (3x 4MB mem limit) we can make all
> buckets reach 128 entries, resulting in a "grace period" of 30 sec
> where no frags are possible. (min frag size is 1108 bytes with my
> trafgen script).

I see your point. But some other fragments, which would otherwise have
been evicted, would perhaps still be in the queue and be reassembled
correctly. So I really don't know which is the better behavior here.
But this would be changed as a side-effect of this whole patch set. ;)

> I'm also wondering: is it really worth the complexity of having a depth
> limit on this hash table? Is it that important?
> The mem limit should at some point kick in and save the day anyhow
> (before, without per hash bucket locking, it might have made sense).

Depends on how we proceed with the ipv6 fragment hashing. If we go back
to the 3 rounds of jhash for each fragment, this would be possible.

> Yes, removal of the (per netns) "global" LRU list should be the real
> focus. This was just a dependency for introducing the eviction strategy
> I needed for patch 3.
>
> But the eviction strategy in patch 3 is actually also "broken", because
> the mem limit is per netns and we do eviction based on the hash table
> shared between netns... thinking of going back to my original idea of
> simply doing LRU lists per CPU.

That depends on how we proceed with the memory limits. If we don't
solve this now, we will have to deal with it again when we think about
how to calculate a decent hash size.

Greetings,

Hannes
On Fri, 2013-04-19 at 07:49 -0700, Eric Dumazet wrote:
> On Fri, 2013-04-19 at 14:19 +0200, Jesper Dangaard Brouer wrote:
>
> > When removing the LRU system (which is the real bottleneck, see perf
> > tests in cover mail), and doing direct hash cleaning, we are trading
> > in accuracy. [...]
> > The reason I don't want a too big hash table is the following.
> >
> > Worst case 1024 buckets * 130K bytes = 133 MBytes, which on smaller
> > embedded systems is a lot of kernel memory we are permitting a remote
> > host to "lock-down".
>
> That's pretty irrelevant; memory is limited by the total amount of
> memory used by fragments, not by hash table size.
>
> It's called /proc/sys/net/ipv4/ipfrag_high_thresh

I was talking about patch-03, where I do "direct hash cleaning", and
have moved the mem limit "ipfrag_high_thresh" into the hash cleaning
step.

It seems we are talking past each other...

--Jesper
On Wed, 2013-04-24 at 15:35 +0200, Jesper Dangaard Brouer wrote:
> On Fri, 2013-04-19 at 07:49 -0700, Eric Dumazet wrote:
> > That's pretty irrelevant; memory is limited by the total amount of
> > memory used by fragments, not by hash table size.
> >
> > It's called /proc/sys/net/ipv4/ipfrag_high_thresh
>
> I was talking about patch-03, where I do "direct hash cleaning", and
> have moved the mem limit "ipfrag_high_thresh" into the hash cleaning
> step.

I am fine with removing the LRU, as maintaining an LRU is too expensive,
and this LRU has little purpose. Old frags are evicted anyway because
they have a timer.

We have to keep a global limit, especially if we increase the hash
table size, because even with perfect hashing and one frag per hash
slot we might consume too much RAM.

We can rename it if you wish, but currently this is ipfrag_high_thresh.
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h index 6f41b45..b2cd72a 100644 --- a/include/net/inet_frag.h +++ b/include/net/inet_frag.h @@ -42,13 +42,7 @@ struct inet_frag_queue { }; #define INETFRAGS_HASHSZ 64 - -/* averaged: - * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ / - * rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or - * struct frag_queue)) - */ -#define INETFRAGS_MAXDEPTH 128 +#define INETFRAGS_MAX_HASH_DEPTH 32 struct inet_frag_bucket { struct hlist_head chain; @@ -65,6 +59,7 @@ struct inet_frags { int secret_interval; struct timer_list secret_timer; u32 rnd; + u32 max_hash_depth; int qsize; unsigned int (*hashfn)(struct inet_frag_queue *); diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c index beec05b..fd9fe5c 100644 --- a/net/ipv4/inet_fragment.c +++ b/net/ipv4/inet_fragment.c @@ -130,7 +130,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f) } EXPORT_SYMBOL(inet_frags_exit_net); -static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f) +static inline void fq_unlink_hash(struct inet_frag_queue *fq, + struct inet_frags *f) { struct inet_frag_bucket *hb; unsigned int hash; @@ -140,24 +141,35 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f) hb = &f->hash[hash]; spin_lock(&hb->chain_lock); - hlist_del(&fq->list); + /* Handle race-condition between direct hash tables cleaning + * in inet_frag_find() and users of inet_frag_kill() + */ + if (!hlist_unhashed(&fq->list)) + hlist_del(&fq->list); spin_unlock(&hb->chain_lock); read_unlock(&f->lock); - inet_frag_lru_del(fq); } -void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f) +void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f, + bool unlink_hash) { if (del_timer(&fq->timer)) atomic_dec(&fq->refcnt); if (!(fq->last_in & INET_FRAG_COMPLETE)) { - fq_unlink(fq, f); + if (unlink_hash) + fq_unlink_hash(fq, f); + inet_frag_lru_del(fq); 
atomic_dec(&fq->refcnt); fq->last_in |= INET_FRAG_COMPLETE; } } + +void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f) +{ + __inet_frag_kill(fq, f, true); +} EXPORT_SYMBOL(inet_frag_kill); static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f, @@ -326,13 +338,14 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, __releases(&f->lock) { struct inet_frag_bucket *hb; - struct inet_frag_queue *q; + struct inet_frag_queue *q, *q_evict = NULL; + struct hlist_node *n; int depth = 0; hb = &f->hash[hash]; spin_lock(&hb->chain_lock); - hlist_for_each_entry(q, &hb->chain, list) { + hlist_for_each_entry_safe(q, n, &hb->chain, list) { if (q->net == nf && f->match(q, key)) { atomic_inc(&q->refcnt); spin_unlock(&hb->chain_lock); @@ -340,25 +353,32 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf, return q; } depth++; + q_evict = q; /* candidate for eviction */ } + /* Not found situation */ + if (depth > f->max_hash_depth) { + atomic_inc(&q_evict->refcnt); + hlist_del_init(&q_evict->list); + } else + q_evict = NULL; spin_unlock(&hb->chain_lock); read_unlock(&f->lock); - if (depth <= INETFRAGS_MAXDEPTH) - return inet_frag_create(nf, f, key); - else - return ERR_PTR(-ENOBUFS); -} -EXPORT_SYMBOL(inet_frag_find); + if (q_evict) { + spin_lock(&q_evict->lock); + if (!(q_evict->last_in & INET_FRAG_COMPLETE)) + __inet_frag_kill(q_evict, f, false); + spin_unlock(&q_evict->lock); -void inet_frag_maybe_warn_overflow(struct inet_frag_queue *q, - const char *prefix) -{ - static const char msg[] = "inet_frag_find: Fragment hash bucket" - " list length grew over limit " __stringify(INETFRAGS_MAXDEPTH) - ". 
Dropping fragment.\n"; + if (atomic_dec_and_test(&q_evict->refcnt)) + inet_frag_destroy(q_evict, f, NULL); - if (PTR_ERR(q) == -ENOBUFS) - LIMIT_NETDEBUG(KERN_WARNING "%s%s", prefix, msg); + LIMIT_NETDEBUG(KERN_WARNING "%s(): Fragment hash bucket" + " list length grew over limit (len %d)," + " Dropping another fragment.\n", + __func__, depth); + } + + return inet_frag_create(nf, f, key); } -EXPORT_SYMBOL(inet_frag_maybe_warn_overflow); +EXPORT_SYMBOL(inet_frag_find); diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c index 9385206..4e8489b 100644 --- a/net/ipv4/ip_fragment.c +++ b/net/ipv4/ip_fragment.c @@ -263,10 +263,8 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user) hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol); q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash); - if (IS_ERR_OR_NULL(q)) { - inet_frag_maybe_warn_overflow(q, pr_fmt()); + if (q == NULL) return NULL; - } return container_of(q, struct ipq, q); } @@ -748,6 +746,14 @@ static struct ctl_table ip4_frags_ctl_table[] = { .proc_handler = proc_dointvec_minmax, .extra1 = &zero }, + { + .procname = "ipfrag_max_hash_depth", + .data = &ip4_frags.max_hash_depth, + .maxlen = sizeof(u32), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero + }, { } }; @@ -866,5 +872,6 @@ void __init ipfrag_init(void) ip4_frags.match = ip4_frag_match; ip4_frags.frag_expire = ip_expire; ip4_frags.secret_interval = 10 * 60 * HZ; + ip4_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH; inet_frags_init(&ip4_frags); } diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c index dffdc1a..3713764 100644 --- a/net/ipv6/netfilter/nf_conntrack_reasm.c +++ b/net/ipv6/netfilter/nf_conntrack_reasm.c @@ -189,10 +189,8 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id, q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash); local_bh_enable(); - if (IS_ERR_OR_NULL(q)) { - 
inet_frag_maybe_warn_overflow(q, pr_fmt()); + if (q == NULL) return NULL; - } return container_of(q, struct frag_queue, q); } @@ -681,6 +679,7 @@ int nf_ct_frag6_init(void) nf_frags.match = ip6_frag_match; nf_frags.frag_expire = nf_ct_frag6_expire; nf_frags.secret_interval = 10 * 60 * HZ; + nf_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH; inet_frags_init(&nf_frags); ret = register_pernet_subsys(&nf_ct_net_ops); diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c index e6e44ce..45de5eb 100644 --- a/net/ipv6/reassembly.c +++ b/net/ipv6/reassembly.c @@ -196,10 +196,8 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd); q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash); - if (IS_ERR_OR_NULL(q)) { - inet_frag_maybe_warn_overflow(q, pr_fmt()); + if (q == NULL) return NULL; - } return container_of(q, struct frag_queue, q); } @@ -600,6 +598,8 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = { { } }; +static int zero; + static struct ctl_table ip6_frags_ctl_table[] = { { .procname = "ip6frag_secret_interval", @@ -608,6 +608,14 @@ static struct ctl_table ip6_frags_ctl_table[] = { .mode = 0644, .proc_handler = proc_dointvec_jiffies, }, + { + .procname = "ip6frag_max_hash_depth", + .data = &ip6_frags.max_hash_depth, + .maxlen = sizeof(u32), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero + }, { } }; @@ -734,6 +742,7 @@ int __init ipv6_frag_init(void) ip6_frags.match = ip6_frag_match; ip6_frags.frag_expire = ip6_frag_expire; ip6_frags.secret_interval = 10 * 60 * HZ; + ip6_frags.max_hash_depth = INETFRAGS_MAX_HASH_DEPTH; inet_frags_init(&ip6_frags); out: return ret;
I have found an issue with commit:

 commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
 Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
 Date:   Fri Mar 15 11:32:30 2013 +0000

     inet: limit length of fragment queue hash table bucket lists

There is a connection between the fixed 128 hash depth limit and the
frag mem limit/thresh settings, which limits how high the thresh can
be set.

The 128-element hash depth limit results in bad behaviour if the mem
limit thresholds are increased, via /proc/sys/net ::

 /proc/sys/net/ipv4/ipfrag_high_thresh
 /proc/sys/net/ipv4/ipfrag_low_thresh

If we increase the thresh to something allowing 128 elements in each
bucket, which is not that high given the hash array size of 64
(64*128=8192), e.g.
 big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
 small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968

The problem with commit 5a3da1fe (inet: limit length of fragment queue
hash table bucket lists) is that, once we hit the limit, we *keep* the
existing frag queues, not allowing new frag queues to be created.
Thus, an attacker can effectively block handling of fragments for 30
sec (as each frag queue has a timeout of 30 sec).

Even without increasing the limit, as Hannes showed, an attacker on
IPv6 can "attack" a specific hash bucket, and via that, can also
block/drop new fragments trying to utilize this bucket.

Summary:
With the default mem limit/thresh settings this is not a general
problem, but adjusting the thresh limits results in somewhat
unexpected behavior.

Proposed solution:
IMHO, instead of keeping existing frag queues, we should kill one of
the frag queues in the hash bucket instead.

Implementation complications:
Killing frag queues while only holding the hash bucket lock, and not
the frag queue lock, complicates the implementation, as we can race
and end up (trying to) remove the hash element twice (resulting in an
oops).
This has been addressed by using hlist_del_init() and a
hlist_unhashed() check in fq_unlink_hash().

Extra:
 * Added a new sysctl "max_hash_depth" option, to allow users to adjust
   the hash depth along with adjusting the thresh limits.
 * Changed the max hash depth to 32, thus limiting handling to approx.
   2048 frag queues.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 include/net/inet_frag.h                 |    9 +---
 net/ipv4/inet_fragment.c                |   64 ++++++++++++++-----------
 net/ipv4/ip_fragment.c                  |   13 +++++-
 net/ipv6/netfilter/nf_conntrack_reasm.c |    5 +-
 net/ipv6/reassembly.c                   |   15 ++++++-
 5 files changed, 68 insertions(+), 38 deletions(-)