
iptables/tc: page allocation failures question

Message ID 1351942499.21634.1640.camel@edumazet-glaptop
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 3, 2012, 11:34 a.m. UTC
On Sat, 2012-11-03 at 11:27 +0100, Miroslav Kratochvil wrote:
> Hello everyone,
> 
> I've got several linux boxes that do mostly routing and traffic
> shaping stuff. The load isn't dramatic - it's around 100Mbit of
> traffic shaping over a HFSC qdisc with ~10k classes/filters.
> 
> Recently I started seeing messages like this in dmesg:
> 
> iptables: page allocation failure: order:9, mode:0xc0d0
> 
> tc: page allocation failure (....)
> 
> (full messages are attached below)
> 
> I understand this to mean that the kernel couldn't allocate memory to
> execute the given command - it is usually triggered by stuff like 'tc
> class add' or 'iptables -A something'.
> 
> The boxes, on the other hand, still have plenty of free memory
> (alloc+buffers+cache fill around 400MB of the 2 gigs available, swap is
> empty). I guess the problem is that the allocation is constrained by
> something (like GFP_ATOMIC, or that it can only use low memory).
> Is this true? If so, is there some way to avoid such a constraint?
> 
> What also worries me is that once a box starts hitting memory
> allocation failures, I've been unable to make it stop. Even if I
> delete all qdiscs/iptables entries, clear every cache I know about
> and restart most of userspace (which should free a good amount of
> memory), nothing can be added back.
> 
> I'm attaching the dmesg of the failure below. Could anyone provide a
> comment on this, or possibly point me to what can cause this behavior?
> Is there any better debug output that could clarify this?
> 
> Thanks in advance,
> Mirek Kratochvil

You apparently load the xt_recent module with a big ip_list_tot value
(default is 100), so kzalloc() needs an order-9 allocation (2MB of
physically contiguous RAM), and it fails.
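
To make the "order-9" arithmetic concrete, here is a minimal userspace sketch of how an allocation size maps to a page order. It assumes 4 KiB pages; the names sketch_get_order() and sketch_order_bytes() are hypothetical helpers written for this illustration (the first loosely mirrors what the kernel's get_order() computes), not code from the thread.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages, as on typical x86 configurations. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical helper: smallest order such that
 * (SKETCH_PAGE_SIZE << order) >= size.
 */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	assert(size > 0);
	while (span < size) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Bytes of physically contiguous memory an order-N request needs. */
static size_t sketch_order_bytes(unsigned int order)
{
	return SKETCH_PAGE_SIZE << order;
}
```

Under that assumption, sketch_order_bytes(9) is 512 pages, i.e. 2 MiB, and any request larger than 1 MiB and up to 2 MiB rounds up to order 9. Free memory that is fragmented into smaller runs cannot satisfy such a request, however much of it there is in total.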

I guess the following patch should solve your problem.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Miroslav Kratochvil Nov. 3, 2012, 3:24 p.m. UTC | #1
Hi,

Thanks for the patch! I think it will fix the problem. I patched one
of the production boxes and will see if it breaks again; it usually
happens after a day or two.

Anyway, more questions:

- my problem sometimes happens even when there are no big xt_recent
allocations happening (just TC/HFSC). Therefore:

  1] Is it possible that something similarly big gets allocated in
HFSC? I didn't actually find anything that would, so...

  2] Is it possible that fragmentation of the kmalloc/kfree zone
(well, it's 10k filters + 10k classes + filter hash table
infrastructure, and it is still being rewritten/restructured by the
management software...) can cause similar problems?

- is there some decent way to fix this without manually patching all
production kernels? A magic kernel parameter that would convert a
failing kmalloc to vmalloc? A sysctl to prevent exhausting the
memory? Or, at least, something that would reset the failing machine's
memory to a state other than "everything fails"?

Sorry for asking too many questions, but I feel it'd be unwise to let
it behave this way... :]

Thanks,
-mk



Patch

diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 4635c9b..ceebd8b 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -29,6 +29,7 @@ 
 #include <linux/skbuff.h>
 #include <linux/inet.h>
 #include <linux/slab.h>
+#include <linux/vmalloc.h>
 #include <net/net_namespace.h>
 #include <net/netns/generic.h>
 
@@ -310,6 +311,14 @@  out:
 	return ret;
 }
 
+static void recent_table_free(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else
+		kfree(addr);
+}
+
 static int recent_mt_check(const struct xt_mtchk_param *par,
 			   const struct xt_recent_mtinfo_v1 *info)
 {
@@ -322,6 +331,7 @@  static int recent_mt_check(const struct xt_mtchk_param *par,
 #endif
 	unsigned int i;
 	int ret = -EINVAL;
+	size_t sz;
 
 	if (unlikely(!hash_rnd_inited)) {
 		get_random_bytes(&hash_rnd, sizeof(hash_rnd));
@@ -360,8 +370,11 @@  static int recent_mt_check(const struct xt_mtchk_param *par,
 		goto out;
 	}
 
-	t = kzalloc(sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size,
-		    GFP_KERNEL);
+	sz = sizeof(*t) + sizeof(t->iphash[0]) * ip_list_hash_size;
+	if (sz <= PAGE_SIZE)
+		t = kzalloc(sz, GFP_KERNEL);
+	else
+		t = vzalloc(sz);
 	if (t == NULL) {
 		ret = -ENOMEM;
 		goto out;
@@ -377,14 +390,14 @@  static int recent_mt_check(const struct xt_mtchk_param *par,
 	uid = make_kuid(&init_user_ns, ip_list_uid);
 	gid = make_kgid(&init_user_ns, ip_list_gid);
 	if (!uid_valid(uid) || !gid_valid(gid)) {
-		kfree(t);
+		recent_table_free(t);
 		ret = -EINVAL;
 		goto out;
 	}
 	pde = proc_create_data(t->name, ip_list_perms, recent_net->xt_recent,
 		  &recent_mt_fops, t);
 	if (pde == NULL) {
-		kfree(t);
+		recent_table_free(t);
 		ret = -ENOMEM;
 		goto out;
 	}
@@ -434,7 +447,7 @@  static void recent_mt_destroy(const struct xt_mtdtor_param *par)
 		remove_proc_entry(t->name, recent_net->xt_recent);
 #endif
 		recent_table_flush(t);
-		kfree(t);
+		recent_table_free(t);
 	}
 	mutex_unlock(&recent_mutex);
 }
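
As a rough illustration of the size check in the recent_mt_check() hunk above, the following userspace sketch computes the table size as header plus one list head per hash bucket and reports which branch the patch would take. Everything here is an assumption made for illustration: the struct layouts are placeholders rather than the real struct recent_table, 4 KiB pages are assumed, and sketch_table_size() and sketch_uses_vzalloc() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define SKETCH_PAGE_SIZE 4096UL

/* Hypothetical stand-in for the kernel's struct list_head. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Placeholder layout, NOT the real struct recent_table. */
struct sketch_table {
	char name[200];
	unsigned int refcnt;
	unsigned int entries;
	struct sketch_list_head lru_list;
	struct sketch_list_head iphash[];	/* one list head per bucket */
};

/* Same shape as the patch: sizeof(*t) + sizeof(t->iphash[0]) * hash_size */
static size_t sketch_table_size(size_t hash_size)
{
	return sizeof(struct sketch_table) +
	       sizeof(struct sketch_list_head) * hash_size;
}

/* True when the patched code would take the vzalloc() branch. */
static bool sketch_uses_vzalloc(size_t hash_size)
{
	return sketch_table_size(hash_size) > SKETCH_PAGE_SIZE;
}
```

With these placeholder sizes a small table (for example 8 buckets) still fits in one page and stays on the kzalloc() path, while a table with 131072 buckets is far larger than a page and would go through vzalloc(), which does not need physically contiguous pages.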