Message ID | 1444906569-9131-1-git-send-email-kernel@kyup.com |
---|---|
State | Changes Requested |
Delegated to: | Pablo Neira |
Headers | show |
On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote: > Commit 00590fdd5be0 introduced RCU locking in list type and in > doing so introduced a memory allocation in list_set_add, which > results in the following splat: > > BUG: sleeping function called from invalid context at mm/page_alloc.c:2759 > in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset > CPU: 18 PID: 9664 Comm: ipset Tainted: G O 3.12.47-clouder3 #1 > Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 > 0000000000000002 ffff881fd14273c8 ffffffff8163d891 ffff881fcb4264b0 > ffff881fcb4260c0 ffff881fd14273e8 ffffffff810ba5bf ffff881fd1427558 > 0000000000000000 ffff881fd1427568 ffffffff81142b33 ffff881f00000000 > Call Trace: > [<ffffffff8163d891>] dump_stack+0x58/0x7f > [<ffffffff810ba5bf>] __might_sleep+0xdf/0x110 > [<ffffffff81142b33>] __alloc_pages_nodemask+0x243/0xc20 > [<ffffffff81181c6e>] alloc_pages_current+0xbe/0x170 > [<ffffffff81188315>] new_slab+0x295/0x340 > [<ffffffff81189a40>] __slab_alloc+0x2c0/0x5a0 > [<ffffffff8164000c>] ? __schedule+0x2dc/0x760 > [<ffffffff8118a71b>] __kmalloc+0x11b/0x230 > [<ffffffffa02bd0ac>] ? ip_set_get_byname+0xec/0x100 [ip_set] > [<ffffffffa02d23fb>] list_set_uadd+0x16b/0x314 [ip_set_list_set] > [<ffffffff81642148>] ? _raw_write_unlock_bh+0x28/0x30 > [<ffffffffa02d1cfc>] list_set_uadt+0x21c/0x320 [ip_set_list_set] > [<ffffffffa02d2290>] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set] > [<ffffffffa02be242>] call_ad+0x82/0x200 [ip_set] > [<ffffffffa02bb171>] ? find_set_type+0x51/0xa0 [ip_set] > [<ffffffff8133f275>] ? nla_parse+0xf5/0x130 > [<ffffffffa02be8ae>] ip_set_uadd+0x20e/0x2d0 [ip_set] > [<ffffffffa02be013>] ? ip_set_create+0x2a3/0x450 [ip_set] > [<ffffffffa02be6a0>] ? ip_set_udel+0x2e0/0x2e0 [ip_set] > [<ffffffff815b316e>] nfnetlink_rcv_msg+0x31e/0x330 > [<ffffffff815b2e91>] ? nfnetlink_rcv_msg+0x41/0x330 > [<ffffffff815b2e50>] ? nfnl_lock+0x30/0x30 > [<ffffffff815ae179>] netlink_rcv_skb+0xa9/0xd0 > [<ffffffff815b2d45>] nfnetlink_rcv+0x15/0x20 > [<ffffffff815ade5f>] netlink_unicast+0x10f/0x190 > [<ffffffff815aedb0>] netlink_sendmsg+0x2c0/0x660 > [<ffffffff81567f00>] sock_sendmsg+0x90/0xc0 > [<ffffffff81565b03>] ? move_addr_to_user+0xa3/0xc0 > [<ffffffff81568552>] ? ___sys_recvmsg+0x182/0x300 > [<ffffffff81568064>] SYSC_sendto+0x134/0x180 > [<ffffffff811c4e01>] ? mntput+0x21/0x30 > [<ffffffff81572d2f>] ? __kfree_skb+0x3f/0xa0 > [<ffffffff815680be>] SyS_sendto+0xe/0x10 > [<ffffffff816434b2>] system_call_fastpath+0x16/0x1b > > The call chain leading to this is as follow: > call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL). > And since GFP_KERNEL allows initiating direct reclaim thus > potentially sleeping in the allocation path, this leads to the > aforementioned splat. > > To fix it change the allocation type to GFP_ATOMIC, to > correctly reflect that it is occuring in an atomic context. > > Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") > > Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> > Signed-off-by: Nikolay Borisov <kernel@kyup.com> > --- > > Changes since V1: > * Added acked-by > * Fixed patch header > > net/netfilter/ipset/ip_set_list_set.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c > index a1fe537..5a30ce6 100644 > --- a/net/netfilter/ipset/ip_set_list_set.c > +++ b/net/netfilter/ipset/ip_set_list_set.c > @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, > ip_set_timeout_expired(ext_timeout(n, set)))) > n = NULL; > > - e = kzalloc(set->dsize, GFP_KERNEL); > + e = kzalloc(set->dsize, GFP_ATOMIC); > if (!e) > return -ENOMEM; > e->id = d->id; This patch looks very bogus to me. Could we fix the root cause please ? Root cause is that somewhere in this controlling path, an erroneous rcu_read_lock() is used, while it is very probably not needed, as controlling path should be protected by a mutex, which definitely is sane, because it allows us to perform GFP_KERNEL allocations and being preempted. Why are we using rcu_read_lock() in list_set_list() ? This looks as yet another bit of 'let us throw rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it feels so good. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/15/2015 04:32 PM, Eric Dumazet wrote: > On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote: >> Commit 00590fdd5be0 introduced RCU locking in list type and in >> doing so introduced a memory allocation in list_set_add, which >> results in the following splat: >> >> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759 >> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset >> CPU: 18 PID: 9664 Comm: ipset Tainted: G O 3.12.47-clouder3 #1 >> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 >> 0000000000000002 ffff881fd14273c8 ffffffff8163d891 ffff881fcb4264b0 >> ffff881fcb4260c0 ffff881fd14273e8 ffffffff810ba5bf ffff881fd1427558 >> 0000000000000000 ffff881fd1427568 ffffffff81142b33 ffff881f00000000 >> Call Trace: >> [<ffffffff8163d891>] dump_stack+0x58/0x7f >> [<ffffffff810ba5bf>] __might_sleep+0xdf/0x110 >> [<ffffffff81142b33>] __alloc_pages_nodemask+0x243/0xc20 >> [<ffffffff81181c6e>] alloc_pages_current+0xbe/0x170 >> [<ffffffff81188315>] new_slab+0x295/0x340 >> [<ffffffff81189a40>] __slab_alloc+0x2c0/0x5a0 >> [<ffffffff8164000c>] ? __schedule+0x2dc/0x760 >> [<ffffffff8118a71b>] __kmalloc+0x11b/0x230 >> [<ffffffffa02bd0ac>] ? ip_set_get_byname+0xec/0x100 [ip_set] >> [<ffffffffa02d23fb>] list_set_uadd+0x16b/0x314 [ip_set_list_set] >> [<ffffffff81642148>] ? _raw_write_unlock_bh+0x28/0x30 >> [<ffffffffa02d1cfc>] list_set_uadt+0x21c/0x320 [ip_set_list_set] >> [<ffffffffa02d2290>] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set] >> [<ffffffffa02be242>] call_ad+0x82/0x200 [ip_set] >> [<ffffffffa02bb171>] ? find_set_type+0x51/0xa0 [ip_set] >> [<ffffffff8133f275>] ? nla_parse+0xf5/0x130 >> [<ffffffffa02be8ae>] ip_set_uadd+0x20e/0x2d0 [ip_set] >> [<ffffffffa02be013>] ? ip_set_create+0x2a3/0x450 [ip_set] >> [<ffffffffa02be6a0>] ? ip_set_udel+0x2e0/0x2e0 [ip_set] >> [<ffffffff815b316e>] nfnetlink_rcv_msg+0x31e/0x330 >> [<ffffffff815b2e91>] ? nfnetlink_rcv_msg+0x41/0x330 >> [<ffffffff815b2e50>] ? nfnl_lock+0x30/0x30 >> [<ffffffff815ae179>] netlink_rcv_skb+0xa9/0xd0 >> [<ffffffff815b2d45>] nfnetlink_rcv+0x15/0x20 >> [<ffffffff815ade5f>] netlink_unicast+0x10f/0x190 >> [<ffffffff815aedb0>] netlink_sendmsg+0x2c0/0x660 >> [<ffffffff81567f00>] sock_sendmsg+0x90/0xc0 >> [<ffffffff81565b03>] ? move_addr_to_user+0xa3/0xc0 >> [<ffffffff81568552>] ? ___sys_recvmsg+0x182/0x300 >> [<ffffffff81568064>] SYSC_sendto+0x134/0x180 >> [<ffffffff811c4e01>] ? mntput+0x21/0x30 >> [<ffffffff81572d2f>] ? __kfree_skb+0x3f/0xa0 >> [<ffffffff815680be>] SyS_sendto+0xe/0x10 >> [<ffffffff816434b2>] system_call_fastpath+0x16/0x1b >> >> The call chain leading to this is as follow: >> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL). >> And since GFP_KERNEL allows initiating direct reclaim thus >> potentially sleeping in the allocation path, this leads to the >> aforementioned splat. >> >> To fix it change the allocation type to GFP_ATOMIC, to >> correctly reflect that it is occuring in an atomic context. >> >> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") >> >> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> >> Signed-off-by: Nikolay Borisov <kernel@kyup.com> >> --- >> >> Changes since V1: >> * Added acked-by >> * Fixed patch header >> >> net/netfilter/ipset/ip_set_list_set.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c >> index a1fe537..5a30ce6 100644 >> --- a/net/netfilter/ipset/ip_set_list_set.c >> +++ b/net/netfilter/ipset/ip_set_list_set.c >> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, >> ip_set_timeout_expired(ext_timeout(n, set)))) >> n = NULL; >> >> - e = kzalloc(set->dsize, GFP_KERNEL); >> + e = kzalloc(set->dsize, GFP_ATOMIC); >> if (!e) >> return -ENOMEM; >> e->id = d->id; > > This patch looks very bogus to me. > > Could we fix the root cause please ? > > Root cause is that somewhere in this controlling path, an erroneous > rcu_read_lock() is used, while it is very probably not needed, as > controlling path should be protected by a mutex, which definitely is > sane, because it allows us to perform GFP_KERNEL allocations and being > preempted. > > Why are we using rcu_read_lock() in list_set_list() ? > > This looks as yet another bit of 'let us throw > rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it > feels so good. I did check the call paths and there isn't an rcu_read_lock called in list_set_uadt/list_set_uadd. On the contrary, this "write" operation to the list is being serialised in call_ad() via set->lock spin_lock. What am I missing here? > > > > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-10-15 at 16:41 +0300, Nikolay Borisov wrote: > > On 10/15/2015 04:32 PM, Eric Dumazet wrote: > > On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote: > >> Commit 00590fdd5be0 introduced RCU locking in list type and in > >> doing so introduced a memory allocation in list_set_add, which > >> results in the following splat: > >> > >> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759 > >> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset > >> CPU: 18 PID: 9664 Comm: ipset Tainted: G O 3.12.47-clouder3 #1 > >> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 > >> 0000000000000002 ffff881fd14273c8 ffffffff8163d891 ffff881fcb4264b0 > >> ffff881fcb4260c0 ffff881fd14273e8 ffffffff810ba5bf ffff881fd1427558 > >> 0000000000000000 ffff881fd1427568 ffffffff81142b33 ffff881f00000000 > >> Call Trace: > >> [<ffffffff8163d891>] dump_stack+0x58/0x7f > >> [<ffffffff810ba5bf>] __might_sleep+0xdf/0x110 > >> [<ffffffff81142b33>] __alloc_pages_nodemask+0x243/0xc20 > >> [<ffffffff81181c6e>] alloc_pages_current+0xbe/0x170 > >> [<ffffffff81188315>] new_slab+0x295/0x340 > >> [<ffffffff81189a40>] __slab_alloc+0x2c0/0x5a0 > >> [<ffffffff8164000c>] ? __schedule+0x2dc/0x760 > >> [<ffffffff8118a71b>] __kmalloc+0x11b/0x230 > >> [<ffffffffa02bd0ac>] ? ip_set_get_byname+0xec/0x100 [ip_set] > >> [<ffffffffa02d23fb>] list_set_uadd+0x16b/0x314 [ip_set_list_set] > >> [<ffffffff81642148>] ? _raw_write_unlock_bh+0x28/0x30 > >> [<ffffffffa02d1cfc>] list_set_uadt+0x21c/0x320 [ip_set_list_set] > >> [<ffffffffa02d2290>] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set] > >> [<ffffffffa02be242>] call_ad+0x82/0x200 [ip_set] > >> [<ffffffffa02bb171>] ? find_set_type+0x51/0xa0 [ip_set] > >> [<ffffffff8133f275>] ? nla_parse+0xf5/0x130 > >> [<ffffffffa02be8ae>] ip_set_uadd+0x20e/0x2d0 [ip_set] > >> [<ffffffffa02be013>] ? ip_set_create+0x2a3/0x450 [ip_set] > >> [<ffffffffa02be6a0>] ? ip_set_udel+0x2e0/0x2e0 [ip_set] > >> [<ffffffff815b316e>] nfnetlink_rcv_msg+0x31e/0x330 > >> [<ffffffff815b2e91>] ? nfnetlink_rcv_msg+0x41/0x330 > >> [<ffffffff815b2e50>] ? nfnl_lock+0x30/0x30 > >> [<ffffffff815ae179>] netlink_rcv_skb+0xa9/0xd0 > >> [<ffffffff815b2d45>] nfnetlink_rcv+0x15/0x20 > >> [<ffffffff815ade5f>] netlink_unicast+0x10f/0x190 > >> [<ffffffff815aedb0>] netlink_sendmsg+0x2c0/0x660 > >> [<ffffffff81567f00>] sock_sendmsg+0x90/0xc0 > >> [<ffffffff81565b03>] ? move_addr_to_user+0xa3/0xc0 > >> [<ffffffff81568552>] ? ___sys_recvmsg+0x182/0x300 > >> [<ffffffff81568064>] SYSC_sendto+0x134/0x180 > >> [<ffffffff811c4e01>] ? mntput+0x21/0x30 > >> [<ffffffff81572d2f>] ? __kfree_skb+0x3f/0xa0 > >> [<ffffffff815680be>] SyS_sendto+0xe/0x10 > >> [<ffffffff816434b2>] system_call_fastpath+0x16/0x1b > >> > >> The call chain leading to this is as follow: > >> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL). > >> And since GFP_KERNEL allows initiating direct reclaim thus > >> potentially sleeping in the allocation path, this leads to the > >> aforementioned splat. > >> > >> To fix it change the allocation type to GFP_ATOMIC, to > >> correctly reflect that it is occuring in an atomic context. > >> > >> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") > >> > >> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> > >> Signed-off-by: Nikolay Borisov <kernel@kyup.com> > >> --- > >> > >> Changes since V1: > >> * Added acked-by > >> * Fixed patch header > >> > >> net/netfilter/ipset/ip_set_list_set.c | 2 +- > >> 1 file changed, 1 insertion(+), 1 deletion(-) > >> > >> diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c > >> index a1fe537..5a30ce6 100644 > >> --- a/net/netfilter/ipset/ip_set_list_set.c > >> +++ b/net/netfilter/ipset/ip_set_list_set.c > >> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, > >> ip_set_timeout_expired(ext_timeout(n, set)))) > >> n = NULL; > >> > >> - e = kzalloc(set->dsize, GFP_KERNEL); > >> + e = kzalloc(set->dsize, GFP_ATOMIC); > >> if (!e) > >> return -ENOMEM; > >> e->id = d->id; > > > > This patch looks very bogus to me. > > > > Could we fix the root cause please ? > > > > Root cause is that somewhere in this controlling path, an erroneous > > rcu_read_lock() is used, while it is very probably not needed, as > > controlling path should be protected by a mutex, which definitely is > > sane, because it allows us to perform GFP_KERNEL allocations and being > > preempted. > > > > Why are we using rcu_read_lock() in list_set_list() ? > > > > This looks as yet another bit of 'let us throw > > rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it > > feels so good. > > I did check the call paths and there isn't an rcu_read_lock called in > list_set_uadt/list_set_uadd. On the contrary, this "write" operation to > the list is being serialised in call_ad() via set->lock spin_lock. > > What am I missing here? I was not complaining to you, but to Jozsef ;) Looking at commit 00590fdd5be0d7 terse changelog, we have to look at whole commit, and find suspicious rcu_read_lock() Apparently, before this commit, allocations were safe, in process context (and using GFP_KERNEL), but after the commit, we have the unfortunate side effect of having potential burst of allocations done from bh context, using the special reserves dedicated to GFP_ATOMIC true users (processing packets from softirq handlers) I hate when we add more GFP_ATOMIC allocations all over the places, as this increases the risk of depleting memory reserves. Can the spinlock be converted to a mutex ? If not, then instead of putting a stack trace in your changelog, a good explanation would be more useful. Thanks ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 10/15/2015 05:32 PM, Eric Dumazet wrote: > On Thu, 2015-10-15 at 16:41 +0300, Nikolay Borisov wrote: >> >> On 10/15/2015 04:32 PM, Eric Dumazet wrote: >>> On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote: >>>> Commit 00590fdd5be0 introduced RCU locking in list type and in >>>> doing so introduced a memory allocation in list_set_add, which >>>> results in the following splat: >>>> >>>> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759 >>>> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset >>>> CPU: 18 PID: 9664 Comm: ipset Tainted: G O 3.12.47-clouder3 #1 >>>> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 >>>> 0000000000000002 ffff881fd14273c8 ffffffff8163d891 ffff881fcb4264b0 >>>> ffff881fcb4260c0 ffff881fd14273e8 ffffffff810ba5bf ffff881fd1427558 >>>> 0000000000000000 ffff881fd1427568 ffffffff81142b33 ffff881f00000000 >>>> Call Trace: >>>> [<ffffffff8163d891>] dump_stack+0x58/0x7f >>>> [<ffffffff810ba5bf>] __might_sleep+0xdf/0x110 >>>> [<ffffffff81142b33>] __alloc_pages_nodemask+0x243/0xc20 >>>> [<ffffffff81181c6e>] alloc_pages_current+0xbe/0x170 >>>> [<ffffffff81188315>] new_slab+0x295/0x340 >>>> [<ffffffff81189a40>] __slab_alloc+0x2c0/0x5a0 >>>> [<ffffffff8164000c>] ? __schedule+0x2dc/0x760 >>>> [<ffffffff8118a71b>] __kmalloc+0x11b/0x230 >>>> [<ffffffffa02bd0ac>] ? ip_set_get_byname+0xec/0x100 [ip_set] >>>> [<ffffffffa02d23fb>] list_set_uadd+0x16b/0x314 [ip_set_list_set] >>>> [<ffffffff81642148>] ? _raw_write_unlock_bh+0x28/0x30 >>>> [<ffffffffa02d1cfc>] list_set_uadt+0x21c/0x320 [ip_set_list_set] >>>> [<ffffffffa02d2290>] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set] >>>> [<ffffffffa02be242>] call_ad+0x82/0x200 [ip_set] >>>> [<ffffffffa02bb171>] ? find_set_type+0x51/0xa0 [ip_set] >>>> [<ffffffff8133f275>] ? nla_parse+0xf5/0x130 >>>> [<ffffffffa02be8ae>] ip_set_uadd+0x20e/0x2d0 [ip_set] >>>> [<ffffffffa02be013>] ? ip_set_create+0x2a3/0x450 [ip_set] >>>> [<ffffffffa02be6a0>] ? ip_set_udel+0x2e0/0x2e0 [ip_set] >>>> [<ffffffff815b316e>] nfnetlink_rcv_msg+0x31e/0x330 >>>> [<ffffffff815b2e91>] ? nfnetlink_rcv_msg+0x41/0x330 >>>> [<ffffffff815b2e50>] ? nfnl_lock+0x30/0x30 >>>> [<ffffffff815ae179>] netlink_rcv_skb+0xa9/0xd0 >>>> [<ffffffff815b2d45>] nfnetlink_rcv+0x15/0x20 >>>> [<ffffffff815ade5f>] netlink_unicast+0x10f/0x190 >>>> [<ffffffff815aedb0>] netlink_sendmsg+0x2c0/0x660 >>>> [<ffffffff81567f00>] sock_sendmsg+0x90/0xc0 >>>> [<ffffffff81565b03>] ? move_addr_to_user+0xa3/0xc0 >>>> [<ffffffff81568552>] ? ___sys_recvmsg+0x182/0x300 >>>> [<ffffffff81568064>] SYSC_sendto+0x134/0x180 >>>> [<ffffffff811c4e01>] ? mntput+0x21/0x30 >>>> [<ffffffff81572d2f>] ? __kfree_skb+0x3f/0xa0 >>>> [<ffffffff815680be>] SyS_sendto+0xe/0x10 >>>> [<ffffffff816434b2>] system_call_fastpath+0x16/0x1b >>>> >>>> The call chain leading to this is as follow: >>>> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL). >>>> And since GFP_KERNEL allows initiating direct reclaim thus >>>> potentially sleeping in the allocation path, this leads to the >>>> aforementioned splat. >>>> >>>> To fix it change the allocation type to GFP_ATOMIC, to >>>> correctly reflect that it is occuring in an atomic context. >>>> >>>> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") >>>> >>>> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> >>>> Signed-off-by: Nikolay Borisov <kernel@kyup.com> >>>> --- >>>> >>>> Changes since V1: >>>> * Added acked-by >>>> * Fixed patch header >>>> >>>> net/netfilter/ipset/ip_set_list_set.c | 2 +- >>>> 1 file changed, 1 insertion(+), 1 deletion(-) >>>> >>>> diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c >>>> index a1fe537..5a30ce6 100644 >>>> --- a/net/netfilter/ipset/ip_set_list_set.c >>>> +++ b/net/netfilter/ipset/ip_set_list_set.c >>>> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, >>>> ip_set_timeout_expired(ext_timeout(n, set)))) >>>> n = NULL; >>>> >>>> - e = kzalloc(set->dsize, GFP_KERNEL); >>>> + e = kzalloc(set->dsize, GFP_ATOMIC); >>>> if (!e) >>>> return -ENOMEM; >>>> e->id = d->id; >>> >>> This patch looks very bogus to me. >>> >>> Could we fix the root cause please ? >>> >>> Root cause is that somewhere in this controlling path, an erroneous >>> rcu_read_lock() is used, while it is very probably not needed, as >>> controlling path should be protected by a mutex, which definitely is >>> sane, because it allows us to perform GFP_KERNEL allocations and being >>> preempted. >>> >>> Why are we using rcu_read_lock() in list_set_list() ? >>> >>> This looks as yet another bit of 'let us throw >>> rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it >>> feels so good. >> >> I did check the call paths and there isn't an rcu_read_lock called in >> list_set_uadt/list_set_uadd. On the contrary, this "write" operation to >> the list is being serialised in call_ad() via set->lock spin_lock. >> >> What am I missing here? > > I was not complaining to you, but to Jozsef ;) > > Looking at commit 00590fdd5be0d7 terse changelog, we have to look at > whole commit, and find suspicious rcu_read_lock() > > Apparently, before this commit, allocations were safe, in process > context (and using GFP_KERNEL), but after the commit, we have the > unfortunate side effect of having potential burst of allocations > done from bh context, using the special reserves dedicated to GFP_ATOMIC > true users (processing packets from softirq handlers) I guess the reason why a spinlock is used is due to the gc code which is responsible for reaping entries whose timeout has elapsed. All of this is happening in set_cleanup_entries. Given the state of things I don't think using a mutex is a feasible solution unless the gc mechanism is reworked. > > I hate when we add more GFP_ATOMIC allocations all over the places, > as this increases the risk of depleting memory reserves. > > Can the spinlock be converted to a mutex ? If not, then instead of > putting a stack trace in your changelog, a good explanation would be > more useful. > > Thanks ! > > > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Hi, On Thu, 15 Oct 2015, Eric Dumazet wrote: > On Thu, 2015-10-15 at 16:41 +0300, Nikolay Borisov wrote: > > > > On 10/15/2015 04:32 PM, Eric Dumazet wrote: > > > On Thu, 2015-10-15 at 13:56 +0300, Nikolay Borisov wrote: > > >> Commit 00590fdd5be0 introduced RCU locking in list type and in > > >> doing so introduced a memory allocation in list_set_add, which > > >> results in the following splat: > > >> > > >> BUG: sleeping function called from invalid context at mm/page_alloc.c:2759 > > >> in_atomic(): 1, irqs_disabled(): 0, pid: 9664, name: ipset > > >> CPU: 18 PID: 9664 Comm: ipset Tainted: G O 3.12.47-clouder3 #1 > > >> Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1 04/14/2015 > > >> 0000000000000002 ffff881fd14273c8 ffffffff8163d891 ffff881fcb4264b0 > > >> ffff881fcb4260c0 ffff881fd14273e8 ffffffff810ba5bf ffff881fd1427558 > > >> 0000000000000000 ffff881fd1427568 ffffffff81142b33 ffff881f00000000 > > >> Call Trace: > > >> [<ffffffff8163d891>] dump_stack+0x58/0x7f > > >> [<ffffffff810ba5bf>] __might_sleep+0xdf/0x110 > > >> [<ffffffff81142b33>] __alloc_pages_nodemask+0x243/0xc20 > > >> [<ffffffff81181c6e>] alloc_pages_current+0xbe/0x170 > > >> [<ffffffff81188315>] new_slab+0x295/0x340 > > >> [<ffffffff81189a40>] __slab_alloc+0x2c0/0x5a0 > > >> [<ffffffff8164000c>] ? __schedule+0x2dc/0x760 > > >> [<ffffffff8118a71b>] __kmalloc+0x11b/0x230 > > >> [<ffffffffa02bd0ac>] ? ip_set_get_byname+0xec/0x100 [ip_set] > > >> [<ffffffffa02d23fb>] list_set_uadd+0x16b/0x314 [ip_set_list_set] > > >> [<ffffffff81642148>] ? _raw_write_unlock_bh+0x28/0x30 > > >> [<ffffffffa02d1cfc>] list_set_uadt+0x21c/0x320 [ip_set_list_set] > > >> [<ffffffffa02d2290>] ? list_set_create+0x1a0/0x1a0 [ip_set_list_set] > > >> [<ffffffffa02be242>] call_ad+0x82/0x200 [ip_set] > > >> [<ffffffffa02bb171>] ? find_set_type+0x51/0xa0 [ip_set] > > >> [<ffffffff8133f275>] ? nla_parse+0xf5/0x130 > > >> [<ffffffffa02be8ae>] ip_set_uadd+0x20e/0x2d0 [ip_set] > > >> [<ffffffffa02be013>] ? ip_set_create+0x2a3/0x450 [ip_set] > > >> [<ffffffffa02be6a0>] ? ip_set_udel+0x2e0/0x2e0 [ip_set] > > >> [<ffffffff815b316e>] nfnetlink_rcv_msg+0x31e/0x330 > > >> [<ffffffff815b2e91>] ? nfnetlink_rcv_msg+0x41/0x330 > > >> [<ffffffff815b2e50>] ? nfnl_lock+0x30/0x30 > > >> [<ffffffff815ae179>] netlink_rcv_skb+0xa9/0xd0 > > >> [<ffffffff815b2d45>] nfnetlink_rcv+0x15/0x20 > > >> [<ffffffff815ade5f>] netlink_unicast+0x10f/0x190 > > >> [<ffffffff815aedb0>] netlink_sendmsg+0x2c0/0x660 > > >> [<ffffffff81567f00>] sock_sendmsg+0x90/0xc0 > > >> [<ffffffff81565b03>] ? move_addr_to_user+0xa3/0xc0 > > >> [<ffffffff81568552>] ? ___sys_recvmsg+0x182/0x300 > > >> [<ffffffff81568064>] SYSC_sendto+0x134/0x180 > > >> [<ffffffff811c4e01>] ? mntput+0x21/0x30 > > >> [<ffffffff81572d2f>] ? __kfree_skb+0x3f/0xa0 > > >> [<ffffffff815680be>] SyS_sendto+0xe/0x10 > > >> [<ffffffff816434b2>] system_call_fastpath+0x16/0x1b > > >> > > >> The call chain leading to this is as follow: > > >> call_ad -> list_set_uadt -> list_set_uadd -> kzalloc(, GFP_KERNEL). > > >> And since GFP_KERNEL allows initiating direct reclaim thus > > >> potentially sleeping in the allocation path, this leads to the > > >> aforementioned splat. > > >> > > >> To fix it change the allocation type to GFP_ATOMIC, to > > >> correctly reflect that it is occuring in an atomic context. > > >> > > >> Fixes: 00590fdd5be0 ("netfilter: ipset: Introduce RCU locking in list type") > > >> > > >> Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu> > > >> Signed-off-by: Nikolay Borisov <kernel@kyup.com> > > >> --- > > >> > > >> Changes since V1: > > >> * Added acked-by > > >> * Fixed patch header > > >> > > >> net/netfilter/ipset/ip_set_list_set.c | 2 +- > > >> 1 file changed, 1 insertion(+), 1 deletion(-) > > >> > > >> diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c > > >> index a1fe537..5a30ce6 100644 > > >> --- a/net/netfilter/ipset/ip_set_list_set.c > > >> +++ b/net/netfilter/ipset/ip_set_list_set.c > > >> @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, > > >> ip_set_timeout_expired(ext_timeout(n, set)))) > > >> n = NULL; > > >> > > >> - e = kzalloc(set->dsize, GFP_KERNEL); > > >> + e = kzalloc(set->dsize, GFP_ATOMIC); > > >> if (!e) > > >> return -ENOMEM; > > >> e->id = d->id; > > > > > > This patch looks very bogus to me. > > > > > > Could we fix the root cause please ? > > > > > > Root cause is that somewhere in this controlling path, an erroneous > > > rcu_read_lock() is used, while it is very probably not needed, as > > > controlling path should be protected by a mutex, which definitely is > > > sane, because it allows us to perform GFP_KERNEL allocations and being > > > preempted. > > > > > > Why are we using rcu_read_lock() in list_set_list() ? > > > > > > This looks as yet another bit of 'let us throw > > > rcu_read_lock()/rcu_read_unlock() pairs' all over the places because it > > > feels so good. > > > > I did check the call paths and there isn't an rcu_read_lock called in > > list_set_uadt/list_set_uadd. On the contrary, this "write" operation to > > the list is being serialised in call_ad() via set->lock spin_lock. > > > > What am I missing here? > > I was not complaining to you, but to Jozsef ;) > > Looking at commit 00590fdd5be0d7 terse changelog, we have to look at > whole commit, and find suspicious rcu_read_lock() > > Apparently, before this commit, allocations were safe, in process > context (and using GFP_KERNEL), but after the commit, we have the > unfortunate side effect of having potential burst of allocations > done from bh context, using the special reserves dedicated to GFP_ATOMIC > true users (processing packets from softirq handlers) Before commit 00590fdd5be0d7 the elements was stored in a preallocated array and there was no need to call kzalloc() at all in list_set_uadd(). > I hate when we add more GFP_ATOMIC allocations all over the places, > as this increases the risk of depleting memory reserves. I don't like it either - maybe that was the reason I entered GFP_KERNEL instead of thinking about that the code is called inside a spinlock. > Can the spinlock be converted to a mutex ? If not, then instead of > putting a stack trace in your changelog, a good explanation would be > more useful. Nikolay answered this pretty well: we wouldn't need the spinlock at all, because all commands are serialized anyway with the netlink mutex. But the garbage collector is called by a timer and therefore spinlock is used. Best regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences H-1525 Budapest 114, POB. 49, Hungary -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-10-15 at 20:25 +0200, Jozsef Kadlecsik wrote: > Nikolay answered this pretty well: we wouldn't need the spinlock at all, > because all commands are serialized anyway with the netlink mutex. But the > garbage collector is called by a timer and therefore spinlock is used. > Good, please Nikolay, send a v2 of the patch with all these details explained in the changelog, so that we can all agree. If properly explained, no need to add the stack trace which does not really tell us the story. Thanks ! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 15, 2015 at 9:46 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote: > On Thu, 2015-10-15 at 20:25 +0200, Jozsef Kadlecsik wrote: > >> Nikolay answered this pretty well: we wouldn't need the spinlock at all, >> because all commands are serialized anyway with the netlink mutex. But the >> garbage collector is called by a timer and therefore spinlock is used. >> > > Good, please Nikolay, send a v2 of the patch with all these details > explained in the changelog, so that we can all agree. While GFP_ATOMIC does indeed look the correct solution for this particular case I was wondering whether something like (GFP_KERNEL & ~__GFP_WAIT) wouldn't also make the cut without causing sleeping? I guess this is exactly the sort of situation that Mel Gorman's patch can address (marc.info/?l=linux-kernel&m=144283282101953) ? In any case I will send v2 tomorrow. > > If properly explained, no need to add the stack trace which does not > really tell us the story. > > Thanks ! > > > -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 2015-10-15 at 23:20 +0300, Nikolay Borisov wrote: > While GFP_ATOMIC does indeed look the correct solution for this particular > case I was wondering whether something like (GFP_KERNEL & ~__GFP_WAIT) > wouldn't also make the cut without causing sleeping? I guess this is exactly > the sort of situation that Mel Gorman's patch can address > (marc.info/?l=linux-kernel&m=144283282101953) ? This is not applicable here, because the caller would have to find a way to keep trying. I believe one way to handle this problem (in a followup patch) would be to use a work queue for the gc, not a timer. Using a timer for gc is almost always subject to big problems anyway. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Oct 15, 2015 at 01:53:11PM -0700, Eric Dumazet wrote: > On Thu, 2015-10-15 at 23:20 +0300, Nikolay Borisov wrote: > > > While GFP_ATOMIC does indeed look the correct solution for this particular > > case I was wondering whether something like (GFP_KERNEL & ~__GFP_WAIT) > > wouldn't also make the cut without causing sleeping? I guess this is exactly > > the sort of situation that Mel Gorman's patch can address > > (marc.info/?l=linux-kernel&m=144283282101953) ? > > This is not applicable here, because the caller would have to find a way > to keep trying. > > I believe one way to handle this problem (in a followup patch) would be > to use a work queue for the gc, not a timer. > > Using a timer for gc is almost always subject to big problems anyway. Agreed, this is what we're doing in nft_hash. Thanks for feedback Eric! -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, 15 Oct 2015, Eric Dumazet wrote: > On Thu, 2015-10-15 at 23:20 +0300, Nikolay Borisov wrote: > > > While GFP_ATOMIC does indeed look the correct solution for this particular > > case I was wondering whether something like (GFP_KERNEL & ~__GFP_WAIT) > > wouldn't also make the cut without causing sleeping? I guess this is exactly > > the sort of situation that Mel Gorman's patch can address > > (marc.info/?l=linux-kernel&m=144283282101953) ? > > This is not applicable here, because the caller would have to find a way > to keep trying. > > I believe one way to handle this problem (in a followup patch) would be > to use a work queue for the gc, not a timer. > > Using a timer for gc is almost always subject to big problems anyway. Thanks, really! I'll work on to replace the timer based gc with a work queue based one in the next version. Best regards, Jozsef - E-mail : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences H-1525 Budapest 114, POB. 49, Hungary -- To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c index a1fe537..5a30ce6 100644 --- a/net/netfilter/ipset/ip_set_list_set.c +++ b/net/netfilter/ipset/ip_set_list_set.c @@ -297,7 +297,7 @@ list_set_uadd(struct ip_set *set, void *value, const struct ip_set_ext *ext, ip_set_timeout_expired(ext_timeout(n, set)))) n = NULL; - e = kzalloc(set->dsize, GFP_KERNEL); + e = kzalloc(set->dsize, GFP_ATOMIC); if (!e) return -ENOMEM; e->id = d->id;