Message ID | 20200607205229.2389672-3-jakub@cloudflare.com |
---|---|
State | Accepted |
Delegated to: | BPF Maintainers |
Headers | show |
Series | Fixes for sock_hash_free | expand |
Jakub Sitnicki wrote: > We can end up modifying the sockhash bucket list from two CPUs when a > sockhash is being destroyed (sock_hash_free) on one CPU, while a socket > that is in the sockhash is unlinking itself from it on another CPU > it (sock_hash_delete_from_link). > > This results in accessing a list element that is in an undefined state as > reported by KASAN: > > | ================================================================== > | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280 > | Write of size 8 at addr dead000000000122 by task kworker/2:1/95 > | > | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691 > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 > | Workqueue: events bpf_map_free_deferred > | Call Trace: > | dump_stack+0x97/0xe0 > | ? sock_hash_free+0x13c/0x280 > | __kasan_report.cold+0x5/0x40 > | ? mark_lock+0xbc1/0xc00 > | ? sock_hash_free+0x13c/0x280 > | kasan_report+0x38/0x50 > | ? sock_hash_free+0x152/0x280 > | sock_hash_free+0x13c/0x280 > | bpf_map_free_deferred+0xb2/0xd0 > | ? bpf_map_charge_finish+0x50/0x50 > | ? rcu_read_lock_sched_held+0x81/0xb0 > | ? rcu_read_lock_bh_held+0x90/0x90 > | process_one_work+0x59a/0xac0 > | ? lock_release+0x3b0/0x3b0 > | ? pwq_dec_nr_in_flight+0x110/0x110 > | ? rwlock_bug.part.0+0x60/0x60 > | worker_thread+0x7a/0x680 > | ? _raw_spin_unlock_irqrestore+0x4c/0x60 > | kthread+0x1cc/0x220 > | ? process_one_work+0xac0/0xac0 > | ? kthread_create_on_node+0xa0/0xa0 > | ret_from_fork+0x24/0x30 > | ================================================================== > > Fix it by reintroducing spin-lock protected critical section around the > code that removes the elements from the bucket on sockhash free. > > To do that we also need to defer processing of removed elements, until out > of atomic context so that we can unlink the socket from the map when > holding the sock lock. > > Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free") > Reported-by: Eric Dumazet <eric.dumazet@gmail.com> > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> > --- > net/core/sock_map.c | 23 +++++++++++++++++++++-- > 1 file changed, 21 insertions(+), 2 deletions(-) Thanks. Acked-by: John Fastabend <john.fastabend@gmail.com>
On Tue, Jun 9, 2020 at 10:51 AM John Fastabend <john.fastabend@gmail.com> wrote: > > Jakub Sitnicki wrote: > > We can end up modifying the sockhash bucket list from two CPUs when a > > sockhash is being destroyed (sock_hash_free) on one CPU, while a socket > > that is in the sockhash is unlinking itself from it on another CPU > > it (sock_hash_delete_from_link). > > > > This results in accessing a list element that is in an undefined state as > > reported by KASAN: > > > > | ================================================================== > > | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280 > > | Write of size 8 at addr dead000000000122 by task kworker/2:1/95 > > | > > | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691 > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 > > | Workqueue: events bpf_map_free_deferred > > | Call Trace: > > | dump_stack+0x97/0xe0 > > | ? sock_hash_free+0x13c/0x280 > > | __kasan_report.cold+0x5/0x40 > > | ? mark_lock+0xbc1/0xc00 > > | ? sock_hash_free+0x13c/0x280 > > | kasan_report+0x38/0x50 > > | ? sock_hash_free+0x152/0x280 > > | sock_hash_free+0x13c/0x280 > > | bpf_map_free_deferred+0xb2/0xd0 > > | ? bpf_map_charge_finish+0x50/0x50 > > | ? rcu_read_lock_sched_held+0x81/0xb0 > > | ? rcu_read_lock_bh_held+0x90/0x90 > > | process_one_work+0x59a/0xac0 > > | ? lock_release+0x3b0/0x3b0 > > | ? pwq_dec_nr_in_flight+0x110/0x110 > > | ? rwlock_bug.part.0+0x60/0x60 > > | worker_thread+0x7a/0x680 > > | ? _raw_spin_unlock_irqrestore+0x4c/0x60 > > | kthread+0x1cc/0x220 > > | ? process_one_work+0xac0/0xac0 > > | ? kthread_create_on_node+0xa0/0xa0 > > | ret_from_fork+0x24/0x30 > > | ================================================================== > > > > Fix it by reintroducing spin-lock protected critical section around the > > code that removes the elements from the bucket on sockhash free. > > > > To do that we also need to defer processing of removed elements, until out > > of atomic context so that we can unlink the socket from the map when > > holding the sock lock. > > > > Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free") > > Reported-by: Eric Dumazet <eric.dumazet@gmail.com> > > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> > > --- > > net/core/sock_map.c | 23 +++++++++++++++++++++-- > > 1 file changed, 21 insertions(+), 2 deletions(-) > > Thanks. > > Acked-by: John Fastabend <john.fastabend@gmail.com> Applied both to bpf tree. FYI I see this splat: ./test_sockmap # 1/ 6 sockmap::txmsg test passthrough:OK # 2/ 6 sockmap::txmsg test redirect:OK # 3/ 6 sockmap::txmsg test drop:OK # 4/ 6 sockmap::txmsg test ingress redirect:OK [ 19.180397] [ 19.180633] ============================= [ 19.181042] WARNING: suspicious RCU usage [ 19.181517] 5.7.0-07177-g75e68e5bf2c7 #688 Not tainted [ 19.182048] ----------------------------- [ 19.182570] include/linux/skmsg.h:284 suspicious rcu_dereference_check() usage! [ 19.183341] [ 19.183341] other info that might help us debug this: [ 19.183341] [ 19.184215] [ 19.184215] rcu_scheduler_active = 2, debug_locks = 1 [ 19.184875] 1 lock held by test_sockmap/291: [ 19.185356] #0: ffff8881315b5b20 (sk_lock-AF_INET){+.+.}-{0:0}, at: tls_sw_recvmsg+0x128/0x810 [ 19.186315] [ 19.186315] stack backtrace: [ 19.186774] CPU: 0 PID: 291 Comm: test_sockmap Not tainted 5.7.0-07177-g75e68e5bf2c7 #688 [ 19.188497] Call Trace: [ 19.188757] dump_stack+0x71/0xa0 [ 19.189140] sk_psock_skb_redirect.isra.0+0xa6/0xf0 [ 19.189651] sk_psock_tls_strp_read+0x1cc/0x250 [ 19.190142] tls_sw_recvmsg+0x6e4/0x810 [ 19.190584] inet_recvmsg+0x55/0x1d0 [ 19.190963] ____sys_recvmsg+0x73/0x130 [ 19.191365] ? import_iovec+0x27/0xd0 [ 19.191800] ? copy_msghdr_from_user+0x4c/0x70 [ 19.192271] ___sys_recvmsg+0x68/0x90 [ 19.192682] ? __might_fault+0x36/0x80 [ 19.193078] ? __audit_syscall_exit+0x242/0x2b0 [ 19.193576] __sys_recvmsg+0x46/0x80 [ 19.193964] do_syscall_64+0x4a/0x1b0 [ 19.194355] entry_SYSCALL_64_after_hwframe+0x49/0xb3 [ 19.194924] RIP: 0033:0x7fc915cc4490
Alexei Starovoitov wrote: > On Tue, Jun 9, 2020 at 10:51 AM John Fastabend <john.fastabend@gmail.com> wrote: > > > > Jakub Sitnicki wrote: > > > We can end up modifying the sockhash bucket list from two CPUs when a > > > sockhash is being destroyed (sock_hash_free) on one CPU, while a socket > > > that is in the sockhash is unlinking itself from it on another CPU > > > it (sock_hash_delete_from_link). > > > > > > This results in accessing a list element that is in an undefined state as > > > reported by KASAN: > > > > > > | ================================================================== > > > | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280 > > > | Write of size 8 at addr dead000000000122 by task kworker/2:1/95 > > > | > > > | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691 > > > | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 > > > | Workqueue: events bpf_map_free_deferred > > > | Call Trace: > > > | dump_stack+0x97/0xe0 > > > | ? sock_hash_free+0x13c/0x280 > > > | __kasan_report.cold+0x5/0x40 > > > | ? mark_lock+0xbc1/0xc00 > > > | ? sock_hash_free+0x13c/0x280 > > > | kasan_report+0x38/0x50 > > > | ? sock_hash_free+0x152/0x280 > > > | sock_hash_free+0x13c/0x280 > > > | bpf_map_free_deferred+0xb2/0xd0 > > > | ? bpf_map_charge_finish+0x50/0x50 > > > | ? rcu_read_lock_sched_held+0x81/0xb0 > > > | ? rcu_read_lock_bh_held+0x90/0x90 > > > | process_one_work+0x59a/0xac0 > > > | ? lock_release+0x3b0/0x3b0 > > > | ? pwq_dec_nr_in_flight+0x110/0x110 > > > | ? rwlock_bug.part.0+0x60/0x60 > > > | worker_thread+0x7a/0x680 > > > | ? _raw_spin_unlock_irqrestore+0x4c/0x60 > > > | kthread+0x1cc/0x220 > > > | ? process_one_work+0xac0/0xac0 > > > | ? kthread_create_on_node+0xa0/0xa0 > > > | ret_from_fork+0x24/0x30 > > > | ================================================================== > > > > > > Fix it by reintroducing spin-lock protected critical section around the > > > code that removes the elements from the bucket on sockhash free. > > > > > > To do that we also need to defer processing of removed elements, until out > > > of atomic context so that we can unlink the socket from the map when > > > holding the sock lock. > > > > > > Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free") > > > Reported-by: Eric Dumazet <eric.dumazet@gmail.com> > > > Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> > > > --- > > > net/core/sock_map.c | 23 +++++++++++++++++++++-- > > > 1 file changed, 21 insertions(+), 2 deletions(-) > > > > Thanks. > > > > Acked-by: John Fastabend <john.fastabend@gmail.com> > > Applied both to bpf tree. > > FYI I see this splat: > ./test_sockmap > # 1/ 6 sockmap::txmsg test passthrough:OK > # 2/ 6 sockmap::txmsg test redirect:OK > # 3/ 6 sockmap::txmsg test drop:OK > # 4/ 6 sockmap::txmsg test ingress redirect:OK > [ 19.180397] > [ 19.180633] ============================= > [ 19.181042] WARNING: suspicious RCU usage > [ 19.181517] 5.7.0-07177-g75e68e5bf2c7 #688 Not tainted > [ 19.182048] ----------------------------- > [ 19.182570] include/linux/skmsg.h:284 suspicious > rcu_dereference_check() usage! I'll have a fix for this splat shortly thanks.
diff --git a/net/core/sock_map.c b/net/core/sock_map.c index ea46f07a22d8..17a40a947546 100644 --- a/net/core/sock_map.c +++ b/net/core/sock_map.c @@ -1013,6 +1013,7 @@ static void sock_hash_free(struct bpf_map *map) { struct bpf_htab *htab = container_of(map, struct bpf_htab, map); struct bpf_htab_bucket *bucket; + struct hlist_head unlink_list; struct bpf_htab_elem *elem; struct hlist_node *node; int i; @@ -1024,13 +1025,31 @@ static void sock_hash_free(struct bpf_map *map) synchronize_rcu(); for (i = 0; i < htab->buckets_num; i++) { bucket = sock_hash_select_bucket(htab, i); - hlist_for_each_entry_safe(elem, node, &bucket->head, node) { - hlist_del_rcu(&elem->node); + + /* We are racing with sock_hash_delete_from_link to + * enter the spin-lock critical section. Every socket on + * the list is still linked to sockhash. Since link + * exists, psock exists and holds a ref to socket. That + * lets us to grab a socket ref too. + */ + raw_spin_lock_bh(&bucket->lock); + hlist_for_each_entry(elem, &bucket->head, node) + sock_hold(elem->sk); + hlist_move_list(&bucket->head, &unlink_list); + raw_spin_unlock_bh(&bucket->lock); + + /* Process removed entries out of atomic context to + * block for socket lock before deleting the psock's + * link to sockhash. + */ + hlist_for_each_entry_safe(elem, node, &unlink_list, node) { + hlist_del(&elem->node); lock_sock(elem->sk); rcu_read_lock(); sock_map_unref(elem->sk, elem); rcu_read_unlock(); release_sock(elem->sk); + sock_put(elem->sk); sock_hash_free_elem(htab, elem); } }
We can end up modifying the sockhash bucket list from two CPUs when a sockhash is being destroyed (sock_hash_free) on one CPU, while a socket that is in the sockhash is unlinking itself from it on another CPU it (sock_hash_delete_from_link). This results in accessing a list element that is in an undefined state as reported by KASAN: | ================================================================== | BUG: KASAN: wild-memory-access in sock_hash_free+0x13c/0x280 | Write of size 8 at addr dead000000000122 by task kworker/2:1/95 | | CPU: 2 PID: 95 Comm: kworker/2:1 Not tainted 5.7.0-rc7-02961-ge22c35ab0038-dirty #691 | Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014 | Workqueue: events bpf_map_free_deferred | Call Trace: | dump_stack+0x97/0xe0 | ? sock_hash_free+0x13c/0x280 | __kasan_report.cold+0x5/0x40 | ? mark_lock+0xbc1/0xc00 | ? sock_hash_free+0x13c/0x280 | kasan_report+0x38/0x50 | ? sock_hash_free+0x152/0x280 | sock_hash_free+0x13c/0x280 | bpf_map_free_deferred+0xb2/0xd0 | ? bpf_map_charge_finish+0x50/0x50 | ? rcu_read_lock_sched_held+0x81/0xb0 | ? rcu_read_lock_bh_held+0x90/0x90 | process_one_work+0x59a/0xac0 | ? lock_release+0x3b0/0x3b0 | ? pwq_dec_nr_in_flight+0x110/0x110 | ? rwlock_bug.part.0+0x60/0x60 | worker_thread+0x7a/0x680 | ? _raw_spin_unlock_irqrestore+0x4c/0x60 | kthread+0x1cc/0x220 | ? process_one_work+0xac0/0xac0 | ? kthread_create_on_node+0xa0/0xa0 | ret_from_fork+0x24/0x30 | ================================================================== Fix it by reintroducing spin-lock protected critical section around the code that removes the elements from the bucket on sockhash free. To do that we also need to defer processing of removed elements, until out of atomic context so that we can unlink the socket from the map when holding the sock lock. Fixes: 90db6d772f74 ("bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> --- net/core/sock_map.c | 23 +++++++++++++++++++++-- 1 file changed, 21 insertions(+), 2 deletions(-)