[net-next,v4,07/17] net: sched: protect filter_chain list with filter_chain_lock mutex

Message ID 20190211085548.7190-8-vladbu@mellanox.com
State Accepted
Delegated to: David Miller
Series Refactor classifier API to work with chain/classifiers without rtnl lock

Commit Message

Vlad Buslov Feb. 11, 2019, 8:55 a.m. UTC
Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
when accessing filter_chain list, instead of relying on rtnl lock.
Dereference filter_chain with tcf_chain_dereference() lockdep macro to
verify that all users of chain_list have the lock taken.

Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
all necessary code while holding chain lock in order to prevent
invalidation of chain_info structure by potential concurrent change. This
also serializes calls to tcf_chain0_head_change(), which allows head change
callbacks to rely on filter_chain_lock for synchronization instead of rtnl
mutex.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/sch_generic.h |  17 +++++++
 net/sched/cls_api.c       | 111 +++++++++++++++++++++++++++++++++-------------
 net/sched/sch_generic.c   |   6 ++-
 3 files changed, 101 insertions(+), 33 deletions(-)

Comments

Ido Schimmel Feb. 14, 2019, 6:24 p.m. UTC | #1
On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
> when accessing filter_chain list, instead of relying on rtnl lock.
> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
> verify that all users of chain_list have the lock taken.
> 
> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
> all necessary code while holding chain lock in order to prevent
> invalidation of chain_info structure by potential concurrent change. This
> also serializes calls to tcf_chain0_head_change(), which allows head change
> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
> mutex.
> 
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Acked-by: Jiri Pirko <jiri@mellanox.com>

With this sequence [1] I get the following trace [2]. Bisected it to
this patch. Note that second filter is rejected by the device driver
(that's the intention). I guess it is not properly removed from the
filter chain in the error path?

Thanks

[1]
# tc qdisc add dev swp3 clsact
# tc filter add dev swp3 ingress pref 1 matchall skip_sw \
	action mirred egress mirror dev swp7
# tc filter add dev swp3 ingress pref 2 matchall skip_sw \
	action mirred egress mirror dev swp7
RTNETLINK answers: File exists
We have an error talking to the kernel, -1
# tc qdisc del dev swp3 clsact

[2]
[   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
[   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
[   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
[   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
[   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
[   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
[   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
[   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
[   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
[   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
[   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
[   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
[   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
[   70.675633] Call Trace:
[   70.693046]  tcf_block_playback_offloads+0x94/0x230
[   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
[   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
[   70.738739]  tcf_block_offload_unbind+0x22b/0x350
[   70.748898]  __tcf_block_put+0x203/0x630
[   70.769700]  tcf_block_put_ext+0x2f/0x40
[   70.774098]  clsact_destroy+0x7a/0xb0
[   70.782604]  qdisc_destroy+0x11a/0x5c0
[   70.786810]  qdisc_put+0x5a/0x70
[   70.790435]  notify_and_destroy+0x8e/0xa0
[   70.794933]  qdisc_graft+0xbb7/0xef0
[   70.809009]  tc_get_qdisc+0x518/0xa30
[   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
[   70.835510]  netlink_rcv_skb+0x132/0x380
[   70.848419]  netlink_unicast+0x4c0/0x690
[   70.866019]  netlink_sendmsg+0x929/0xe10
[   70.882134]  sock_sendmsg+0xc8/0x110
[   70.886144]  ___sys_sendmsg+0x77a/0x8f0
[   70.924064]  __sys_sendmsg+0xf7/0x250
[   70.955727]  do_syscall_64+0x14d/0x610
Vlad Buslov Feb. 15, 2019, 10:02 a.m. UTC | #2
On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
> On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> when accessing filter_chain list, instead of relying on rtnl lock.
>> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> verify that all users of chain_list have the lock taken.
>>
>> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> all necessary code while holding chain lock in order to prevent
>> invalidation of chain_info structure by potential concurrent change. This
>> also serializes calls to tcf_chain0_head_change(), which allows head change
>> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> mutex.
>>
>> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> Acked-by: Jiri Pirko <jiri@mellanox.com>
>
> With this sequence [1] I get the following trace [2]. Bisected it to
> this patch. Note that second filter is rejected by the device driver
> (that's the intention). I guess it is not properly removed from the
> filter chain in the error path?
>
> Thanks
>
> [1]
> # tc qdisc add dev swp3 clsact
> # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
> 	action mirred egress mirror dev swp7
> RTNETLINK answers: File exists
> We have an error talking to the kernel, -1
> # tc qdisc del dev swp3 clsact
>
> [2]
> [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
> [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
> [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
> [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
> [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
> [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
> e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
> [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
> [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
> [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
> [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
> [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
> [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
> [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
> [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
> [   70.675633] Call Trace:
> [   70.693046]  tcf_block_playback_offloads+0x94/0x230
> [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
> [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
> [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
> [   70.748898]  __tcf_block_put+0x203/0x630
> [   70.769700]  tcf_block_put_ext+0x2f/0x40
> [   70.774098]  clsact_destroy+0x7a/0xb0
> [   70.782604]  qdisc_destroy+0x11a/0x5c0
> [   70.786810]  qdisc_put+0x5a/0x70
> [   70.790435]  notify_and_destroy+0x8e/0xa0
> [   70.794933]  qdisc_graft+0xbb7/0xef0
> [   70.809009]  tc_get_qdisc+0x518/0xa30
> [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
> [   70.835510]  netlink_rcv_skb+0x132/0x380
> [   70.848419]  netlink_unicast+0x4c0/0x690
> [   70.866019]  netlink_sendmsg+0x929/0xe10
> [   70.882134]  sock_sendmsg+0xc8/0x110
> [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
> [   70.924064]  __sys_sendmsg+0xf7/0x250
> [   70.955727]  do_syscall_64+0x14d/0x610

Hi Ido,

Thanks for reporting this.

I looked at the code and the problem seems to be matchall classifier
specific. My implementation of the unlocked cls API assumes that
concurrent insertions are possible and checks for them when deleting an
"empty" tp. Since classifiers don't expose their number of elements, the
only way to test this is to do tp->walk() on them and assume that the
walk callback is called once per filter on every classifier. In your
example a new tp is created for the second filter, the filter insertion
fails, and the number of elements on the newly created tp is checked
with tp->walk() before deleting it. However, the matchall classifier
always calls the tp->walk() callback once, even when it doesn't have a
valid filter (in this case with a NULL filter pointer).

Possible ways to fix this:

1) Check for a NULL filter pointer in the tp->walk() callback and ignore
it when counting filters on the tp - this would work, but I don't like it
because I don't think it is ever correct to call the tp->walk() callback
with a NULL filter pointer.

2) Fix matchall to not call the tp->walk() callback with NULL filter
pointers - my preferred, simple solution (a sketch follows after this list).

3) Extend the tp API with a direct way to count filters by implementing
tp->count - requires a change to every classifier. Or maybe add it as an
optional API that only unlocked classifiers are required to implement,
and just delete rtnl-lock-dependent tp's without checking for concurrent
insertions.
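
A sketch of what option 2 could look like, approximating cls_matchall's
walk; the only functional change is the NULL check, and this is an
untested illustration rather than an actual patch:

static void mall_walk(struct tcf_proto *tp, struct tcf_walker *arg)
{
	struct cls_mall_head *head = rtnl_dereference(tp->root);

	if (arg->count < arg->skip)
		goto skip;

	/* Nothing to report on an empty classifier. */
	if (!head)
		return;

	if (arg->fn(tp, head, arg) < 0)
		arg->stop = 1;
skip:
	arg->count++;
}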

What do you think?

Regards,
Vlad
Ido Schimmel Feb. 15, 2019, 11:30 a.m. UTC | #3
On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
> 
> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
> >> when accessing filter_chain list, instead of relying on rtnl lock.
> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
> >> verify that all users of chain_list have the lock taken.
> >>
> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
> >> all necessary code while holding chain lock in order to prevent
> >> invalidation of chain_info structure by potential concurrent change. This
> >> also serializes calls to tcf_chain0_head_change(), which allows head change
> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
> >> mutex.
> >>
> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
> >
> > With this sequence [1] I get the following trace [2]. Bisected it to
> > this patch. Note that second filter is rejected by the device driver
> > (that's the intention). I guess it is not properly removed from the
> > filter chain in the error path?
> >
> > Thanks
> >
> > [1]
> > # tc qdisc add dev swp3 clsact
> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
> > 	action mirred egress mirror dev swp7
> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
> > 	action mirred egress mirror dev swp7
> > RTNETLINK answers: File exists
> > We have an error talking to the kernel, -1
> > # tc qdisc del dev swp3 clsact
> >
> > [2]
> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
> > [   70.675633] Call Trace:
> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
> > [   70.748898]  __tcf_block_put+0x203/0x630
> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
> > [   70.774098]  clsact_destroy+0x7a/0xb0
> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
> > [   70.786810]  qdisc_put+0x5a/0x70
> > [   70.790435]  notify_and_destroy+0x8e/0xa0
> > [   70.794933]  qdisc_graft+0xbb7/0xef0
> > [   70.809009]  tc_get_qdisc+0x518/0xa30
> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
> > [   70.835510]  netlink_rcv_skb+0x132/0x380
> > [   70.848419]  netlink_unicast+0x4c0/0x690
> > [   70.866019]  netlink_sendmsg+0x929/0xe10
> > [   70.882134]  sock_sendmsg+0xc8/0x110
> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
> > [   70.924064]  __sys_sendmsg+0xf7/0x250
> > [   70.955727]  do_syscall_64+0x14d/0x610
> 
> Hi Ido,
> 
> Thanks for reporting this.
> 
> I looked at the code and problem seems to be matchall classifier
> specific. My implementation of unlocked cls API assumes that concurrent
> insertions are possible and checks for it when deleting "empty" tp.
> Since classifiers don't expose number of elements, the only way to test
> this is to do tp->walk() on them and assume that walk callback is called
> once per filter on every classifier. In your example new tp is created
> for second filter, filter insertion fails, number of elements on newly
> created tp is checked with tp->walk() before deleting it. However,
> matchall classifier always calls the tp->walk() callback once, even when
> it doesn't have a valid filter (in this case with NULL filter pointer).
> 
> Possible ways to fix this:
> 
> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
> when counting filters on tp - will work but I don't like it because I
> don't think it is ever correct to call tp->walk() callback with NULL
> filter pointer.
> 
> 2) Fix matchall to not call tp->walk() callback with NULL filter
> pointers - my preferred simple solution.
> 
> 3) Extend tp API to have direct way to count filters by implementing
> tp->count - requires change to every classifier. Or maybe add it as
> optional API that only unlocked classifiers are required to implement
> and just delete rtnl lock dependent tp's without checking for concurrent
> insertions.
> 
> What do you think?

Since the problem is matchall-specific, it makes sense to fix it in
matchall like you suggested in option #2.

Can you please use this opportunity to audit the other classifiers and
confirm the problem is indeed specific to matchall?

In any case, feel free to send me a patch and I'll test it.

Thanks
Vlad Buslov Feb. 15, 2019, 12:15 p.m. UTC | #4
On Fri 15 Feb 2019 at 11:30, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
>>
>> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
>> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> >> when accessing filter_chain list, instead of relying on rtnl lock.
>> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> >> verify that all users of chain_list have the lock taken.
>> >>
>> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> >> all necessary code while holding chain lock in order to prevent
>> >> invalidation of chain_info structure by potential concurrent change. This
>> >> also serializes calls to tcf_chain0_head_change(), which allows head change
>> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> >> mutex.
>> >>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> >
>> > With this sequence [1] I get the following trace [2]. Bisected it to
>> > this patch. Note that second filter is rejected by the device driver
>> > (that's the intention). I guess it is not properly removed from the
>> > filter chain in the error path?
>> >
>> > Thanks
>> >
>> > [1]
>> > # tc qdisc add dev swp3 clsact
>> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > RTNETLINK answers: File exists
>> > We have an error talking to the kernel, -1
>> > # tc qdisc del dev swp3 clsact
>> >
>> > [2]
>> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
>> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
>> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
>> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
>> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
>> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
>> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
>> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
>> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
>> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
>> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
>> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
>> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
>> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
>> > [   70.675633] Call Trace:
>> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
>> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
>> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
>> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
>> > [   70.748898]  __tcf_block_put+0x203/0x630
>> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
>> > [   70.774098]  clsact_destroy+0x7a/0xb0
>> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
>> > [   70.786810]  qdisc_put+0x5a/0x70
>> > [   70.790435]  notify_and_destroy+0x8e/0xa0
>> > [   70.794933]  qdisc_graft+0xbb7/0xef0
>> > [   70.809009]  tc_get_qdisc+0x518/0xa30
>> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
>> > [   70.835510]  netlink_rcv_skb+0x132/0x380
>> > [   70.848419]  netlink_unicast+0x4c0/0x690
>> > [   70.866019]  netlink_sendmsg+0x929/0xe10
>> > [   70.882134]  sock_sendmsg+0xc8/0x110
>> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
>> > [   70.924064]  __sys_sendmsg+0xf7/0x250
>> > [   70.955727]  do_syscall_64+0x14d/0x610
>>
>> Hi Ido,
>>
>> Thanks for reporting this.
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>>
>> Possible ways to fix this:
>>
>> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
>> when counting filters on tp - will work but I don't like it because I
>> don't think it is ever correct to call tp->walk() callback with NULL
>> filter pointer.
>>
>> 2) Fix matchall to not call tp->walk() callback with NULL filter
>> pointers - my preferred simple solution.
>>
>> 3) Extend tp API to have direct way to count filters by implementing
>> tp->count - requires change to every classifier. Or maybe add it as
>> optional API that only unlocked classifiers are required to implement
>> and just delete rtnl lock dependent tp's without checking for concurrent
>> insertions.
>>
>> What do you think?
>
> Since the problem is matchall-specific, then it makes sense to fix it in
> matchall like you suggested in option #2.
>
> Can you please use this opportunity and audit other classifiers to
> confirm problem is indeed specific to matchall?
>
> In any case, feel free to send me a patch and I'll test it.
>
> Thanks

I've sent you the patch for matchall and will audit all other
classifiers for this issue.

Thanks,
Vlad
Vlad Buslov Feb. 15, 2019, 3:35 p.m. UTC | #5
On Fri 15 Feb 2019 at 11:30, Ido Schimmel <idosch@idosch.org> wrote:
> On Fri, Feb 15, 2019 at 10:02:11AM +0000, Vlad Buslov wrote:
>>
>> On Thu 14 Feb 2019 at 18:24, Ido Schimmel <idosch@idosch.org> wrote:
>> > On Mon, Feb 11, 2019 at 10:55:38AM +0200, Vlad Buslov wrote:
>> >> Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
>> >> when accessing filter_chain list, instead of relying on rtnl lock.
>> >> Dereference filter_chain with tcf_chain_dereference() lockdep macro to
>> >> verify that all users of chain_list have the lock taken.
>> >>
>> >> Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
>> >> all necessary code while holding chain lock in order to prevent
>> >> invalidation of chain_info structure by potential concurrent change. This
>> >> also serializes calls to tcf_chain0_head_change(), which allows head change
>> >> callbacks to rely on filter_chain_lock for synchronization instead of rtnl
>> >> mutex.
>> >>
>> >> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
>> >> Acked-by: Jiri Pirko <jiri@mellanox.com>
>> >
>> > With this sequence [1] I get the following trace [2]. Bisected it to
>> > this patch. Note that second filter is rejected by the device driver
>> > (that's the intention). I guess it is not properly removed from the
>> > filter chain in the error path?
>> >
>> > Thanks
>> >
>> > [1]
>> > # tc qdisc add dev swp3 clsact
>> > # tc filter add dev swp3 ingress pref 1 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > # tc filter add dev swp3 ingress pref 2 matchall skip_sw \
>> > 	action mirred egress mirror dev swp7
>> > RTNETLINK answers: File exists
>> > We have an error talking to the kernel, -1
>> > # tc qdisc del dev swp3 clsact
>> >
>> > [2]
>> > [   70.545131] kasan: GPF could be caused by NULL-ptr deref or user memory access
>> > [   70.553394] general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
>> > [   70.560618] CPU: 2 PID: 2268 Comm: tc Not tainted 5.0.0-rc5-custom-02103-g415d39427317 #1304
>> > [   70.570065] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
>> > [   70.580204] RIP: 0010:mall_reoffload+0x14a/0x760
>> > [   70.585382] Code: c1 0f 85 ba 05 00 00 31 c0 4d 8d 6c 24 34 b9 06 00 00 00 4c 89 ff f3 48 ab 4c 89 ea 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <0f> b6 14 02 4c 89 e8 83
>> > e0 07 83 c0 03 38 d0 7c 08 84 d2 0f 85 bd
>> > [   70.606382] RSP: 0018:ffff888231faefc0 EFLAGS: 00010207
>> > [   70.612235] RAX: dffffc0000000000 RBX: 1ffff110463f5dfe RCX: 0000000000000000
>> > [   70.620220] RDX: 0000000000000006 RSI: 1ffff110463f5e01 RDI: ffff888231faf040
>> > [   70.628206] RBP: ffff8881ef151a00 R08: 0000000000000000 R09: ffffed10463f5dfa
>> > [   70.636192] R10: ffffed10463f5dfa R11: 0000000000000003 R12: 0000000000000000
>> > [   70.644177] R13: 0000000000000034 R14: 0000000000000000 R15: ffff888231faf010
>> > [   70.652163] FS:  00007f46b5bf0800(0000) GS:ffff888236c00000(0000) knlGS:0000000000000000
>> > [   70.661218] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> > [   70.667649] CR2: 0000000001d590a8 CR3: 0000000231c3c000 CR4: 00000000001006e0
>> > [   70.675633] Call Trace:
>> > [   70.693046]  tcf_block_playback_offloads+0x94/0x230
>> > [   70.710617]  __tcf_block_cb_unregister+0xf7/0x2d0
>> > [   70.734143]  mlxsw_sp_setup_tc+0x20f/0x660
>> > [   70.738739]  tcf_block_offload_unbind+0x22b/0x350
>> > [   70.748898]  __tcf_block_put+0x203/0x630
>> > [   70.769700]  tcf_block_put_ext+0x2f/0x40
>> > [   70.774098]  clsact_destroy+0x7a/0xb0
>> > [   70.782604]  qdisc_destroy+0x11a/0x5c0
>> > [   70.786810]  qdisc_put+0x5a/0x70
>> > [   70.790435]  notify_and_destroy+0x8e/0xa0
>> > [   70.794933]  qdisc_graft+0xbb7/0xef0
>> > [   70.809009]  tc_get_qdisc+0x518/0xa30
>> > [   70.821530]  rtnetlink_rcv_msg+0x397/0xa20
>> > [   70.835510]  netlink_rcv_skb+0x132/0x380
>> > [   70.848419]  netlink_unicast+0x4c0/0x690
>> > [   70.866019]  netlink_sendmsg+0x929/0xe10
>> > [   70.882134]  sock_sendmsg+0xc8/0x110
>> > [   70.886144]  ___sys_sendmsg+0x77a/0x8f0
>> > [   70.924064]  __sys_sendmsg+0xf7/0x250
>> > [   70.955727]  do_syscall_64+0x14d/0x610
>>
>> Hi Ido,
>>
>> Thanks for reporting this.
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>>
>> Possible ways to fix this:
>>
>> 1) Check for NULL filter pointer in tp->walk() callback and ignore it
>> when counting filters on tp - will work but I don't like it because I
>> don't think it is ever correct to call tp->walk() callback with NULL
>> filter pointer.
>>
>> 2) Fix matchall to not call tp->walk() callback with NULL filter
>> pointers - my preferred simple solution.
>>
>> 3) Extend tp API to have direct way to count filters by implementing
>> tp->count - requires change to every classifier. Or maybe add it as
>> optional API that only unlocked classifiers are required to implement
>> and just delete rtnl lock dependent tp's without checking for concurrent
>> insertions.
>>
>> What do you think?
>
> Since the problem is matchall-specific, then it makes sense to fix it in
> matchall like you suggested in option #2.
>
> Can you please use this opportunity and audit other classifiers to
> confirm problem is indeed specific to matchall?
>

Turns out cls_cgroup has the same problem.

Another problem that I found in cls_fw and cls_route is that they set
arg->stop when empty. Both have had this code unchanged since it was
initially committed in 2005, so I assume this convention is no longer
relevant, because no other classifier does that (they only set
arg->stop when arg->fn returns a negative value).

I sent fixes for all of those cases.
Cong Wang Feb. 15, 2019, 10:35 p.m. UTC | #6
On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> +#ifdef CONFIG_PROVE_LOCKING
> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
> +{
> +       return lockdep_is_held(&chain->filter_chain_lock);
> +}
> +#else
> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
> +{
> +       return true;
> +}
> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
> +
> +#define tcf_chain_dereference(p, chain)                                        \
> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))


Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
rcu_dereference_protected() should already test CONFIG_PROVE_RCU.

Ditto for tcf_proto_dereference().
Vlad Buslov Feb. 18, 2019, 11:06 a.m. UTC | #7
On Fri 15 Feb 2019 at 22:35, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> +#ifdef CONFIG_PROVE_LOCKING
>> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
>> +{
>> +       return lockdep_is_held(&chain->filter_chain_lock);
>> +}
>> +#else
>> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
>> +{
>> +       return true;
>> +}
>> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
>> +
>> +#define tcf_chain_dereference(p, chain)                                        \
>> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
>
>
> Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
> rcu_dereference_protected() should already test CONFIG_PROVE_RCU.
>
> Ditto for tcf_proto_dereference().

I implemented these macros the same way rtnl_dereference() is
implemented, since they are intended to replace it.
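
For context, the pattern being mirrored lives in include/linux/rtnetlink.h
and looks roughly like this (quoted from memory, so treat it as approximate):

#ifdef CONFIG_PROVE_LOCKING
extern bool lockdep_rtnl_is_held(void);
#else
static inline bool lockdep_rtnl_is_held(void)
{
	return true;
}
#endif /* #ifdef CONFIG_PROVE_LOCKING */

#define rtnl_dereference(p)					\
	rcu_dereference_protected(p, lockdep_rtnl_is_held())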

After removing them I get the following compilation error with
CONFIG_PROVE_LOCKING disabled:

./include/net/sch_generic.h: In function ‘lockdep_tcf_chain_is_locked’:
./include/net/sch_generic.h:404:9: error: implicit declaration of function ‘lockdep_is_held’; did you mean ‘lockdep_rtnl_is_held’? [-Werror=implicit-function-declaration]
  return lockdep_is_held(&chain->filter_chain_lock);
         ^~~~~~~~~~~~~~~
         lockdep_rtnl_is_held
Cong Wang Feb. 18, 2019, 6:31 p.m. UTC | #8
On Mon, Feb 18, 2019 at 3:06 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> On Fri 15 Feb 2019 at 22:35, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Mon, Feb 11, 2019 at 12:56 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >> +#ifdef CONFIG_PROVE_LOCKING
> >> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
> >> +{
> >> +       return lockdep_is_held(&chain->filter_chain_lock);
> >> +}
> >> +#else
> >> +static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
> >> +{
> >> +       return true;
> >> +}
> >> +#endif /* #ifdef CONFIG_PROVE_LOCKING */
> >> +
> >> +#define tcf_chain_dereference(p, chain)                                        \
> >> +       rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
> >
> >
> > Are you sure you need this #ifdef CONFIG_PROVE_LOCKING?
> > rcu_dereference_protected() should already test CONFIG_PROVE_RCU.
> >
> > Ditto for tcf_proto_dereference().
>
> I implemented these macro same way as rtnl_dereference() is implemented,
> which they are intended to substitute.
>
> After removing them I get following compilation error with
> CONFIG_PROVE_LOCKING disabled:


This is pretty odd, because net/core/neighbour.c uses it without
any #ifdef CONFIG_PROVE_LOCKING, for instance:

 192                 neigh = rcu_dereference_protected(n->next,
 193                                                   lockdep_is_held(&tbl->lock));
 194                 rcu_assign_pointer(*np, neigh);
 195                 neigh_mark_dead(n);
 196                 retval = true;

So how does this compile when CONFIG_PROVE_LOCKING
is disabled? :-/
Cong Wang Feb. 19, 2019, 5:08 a.m. UTC | #9
On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> I looked at the code and problem seems to be matchall classifier
> specific. My implementation of unlocked cls API assumes that concurrent
> insertions are possible and checks for it when deleting "empty" tp.
> Since classifiers don't expose number of elements, the only way to test
> this is to do tp->walk() on them and assume that walk callback is called
> once per filter on every classifier. In your example new tp is created
> for second filter, filter insertion fails, number of elements on newly
> created tp is checked with tp->walk() before deleting it. However,
> matchall classifier always calls the tp->walk() callback once, even when
> it doesn't have a valid filter (in this case with NULL filter pointer).

Again, this can be eliminated by just switching to normal
non-retry logic. Reviewing this kind of unlock-and-retry logic is yet
another headache; I have no idea why you are such a big fan of it.
Cong Wang Feb. 19, 2019, 5:26 a.m. UTC | #10
On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Another problem that I found in cls_fw and cls_route is that they set
> arg->stop when empty. Both of them have code unchanged since it was
> committed initially in 2005 so I assume this convention is no longer
> relevant because all other classifiers don't do that (they only set
> arg->stop when arg->fn returns negative value).
>

The question is why you want to use arg->stop==0 as
an indication of emptiness. Isn't that what arg->count==0
is supposed to indicate?
Vlad Buslov Feb. 19, 2019, 12:31 p.m. UTC | #11
On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> Another problem that I found in cls_fw and cls_route is that they set
>> arg->stop when empty. Both of them have code unchanged since it was
>> committed initially in 2005 so I assume this convention is no longer
>> relevant because all other classifiers don't do that (they only set
>> arg->stop when arg->fn returns negative value).
>>
>
> The question is why do you want to use arg->stop==0 as
> an indication for emptiness? Isn't what arg->count==0
> supposed to be?

Good question! I initially wanted to implement it like that, but
reconsidered because iterating through all filters on a classifier to
count them is O(N), whereas terminating on the first filter and relying
on arg->stop==1 is constant time. Making a function called
"tcf_proto_is_empty" linear in the number of filters seemed sloppy to me...
Vlad Buslov Feb. 19, 2019, 3:20 p.m. UTC | #12
On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>> I looked at the code and problem seems to be matchall classifier
>> specific. My implementation of unlocked cls API assumes that concurrent
>> insertions are possible and checks for it when deleting "empty" tp.
>> Since classifiers don't expose number of elements, the only way to test
>> this is to do tp->walk() on them and assume that walk callback is called
>> once per filter on every classifier. In your example new tp is created
>> for second filter, filter insertion fails, number of elements on newly
>> created tp is checked with tp->walk() before deleting it. However,
>> matchall classifier always calls the tp->walk() callback once, even when
>> it doesn't have a valid filter (in this case with NULL filter pointer).
>
> Again, this can be eliminated by just switching to normal
> non-retry logic. This is yet another headache to review this
> kind of unlock-and-retry logic, I have no idea why you are such
> a big fan of it.

The retry approach was suggested to me multiple times by Jiri on
previous code reviews, so I assumed it is the preferred approach in such
cases. I don't have a strong preference in this regard, but locking the
whole tp on filter update would remove any parallelism when updating the
same classifier instance concurrently. The goal of these changes is to
allow parallel rule updates, and to achieve that I had to introduce some
complexity into the code.

Now let me explain why these two approaches result in completely different
performance in this case. Let's start with a list of the most CPU-consuming
parts of the new-filter creation process, in descending order (raw data at
the end of this mail):

1) Hardware offload - if available and no skip_hw.
2) Exts (actions) initialization - the most expensive part even with a
single action; CPU usage increases with the number of actions per filter.
3) cls API.
4) Flower classifier data structure initialization.

Note that 1)+2) is ~80% of the cost of creating a flower filter. So if we
just lock the whole flower classifier instance during a rule update, we
serialize 1, 2 and 4, and only the cls API (~13% of the CPU cost) can be
executed concurrently. However, in the proposed flower implementation the
hw offloading and action initialization code is called without any locks,
and tp->lock is only obtained when modifying flower data structures,
which means that only 3) is serialized and everything else (87% of the CPU
cost) can be executed in parallel.

First page of profiling data:

Samples: 100K of event 'cycles:ppp', Event count (approx.): 11191878316
  Children      Self  Command  Shared Object       Symbol
+   84.71%     0.08%  tc       [kernel.vmlinux]    [k] entry_SYSCALL_64_after_hwframe
+   84.62%     0.06%  tc       [kernel.vmlinux]    [k] do_syscall_64
+   82.63%     0.01%  tc       libc-2.25.so        [.] __libc_sendmsg
+   82.37%     0.00%  tc       [kernel.vmlinux]    [k] __sys_sendmsg
+   82.37%     0.00%  tc       [kernel.vmlinux]    [k] ___sys_sendmsg
+   82.34%     0.00%  tc       [kernel.vmlinux]    [k] sock_sendmsg
+   82.34%     0.01%  tc       [kernel.vmlinux]    [k] netlink_sendmsg
+   82.15%     0.15%  tc       [kernel.vmlinux]    [k] netlink_unicast
+   82.10%     0.11%  tc       [kernel.vmlinux]    [k] netlink_rcv_skb
+   80.76%     0.22%  tc       [kernel.vmlinux]    [k] rtnetlink_rcv_msg
+   80.10%     0.24%  tc       [kernel.vmlinux]    [k] tc_new_tfilter
+   69.30%     2.11%  tc       [cls_flower]        [k] fl_change
+   33.56%     0.05%  tc       [kernel.vmlinux]    [k] tcf_exts_validate
+   33.50%     0.12%  tc       [kernel.vmlinux]    [k] tcf_action_init
+   33.30%     0.10%  tc       [kernel.vmlinux]    [k] tcf_action_init_1
+   32.78%     0.11%  tc       [act_gact]          [k] tcf_gact_init
+   30.93%     0.16%  tc       [kernel.vmlinux]    [k] tc_setup_cb_call
+   29.96%     0.60%  tc       [mlx5_core]         [k] mlx5e_configure_flower
+   27.62%     0.23%  tc       [mlx5_core]         [k] mlx5e_tc_add_nic_flow
+   27.31%     0.45%  tc       [kernel.vmlinux]    [k] tcf_idr_create
+   25.45%     1.75%  tc       [kernel.vmlinux]    [k] pcpu_alloc
+   16.33%     0.07%  tc       [mlx5_core]         [k] mlx5_cmd_exec
+   16.26%     1.96%  tc       [mlx5_core]         [k] cmd_exec
+   14.28%     1.05%  tc       [mlx5_core]         [k] mlx5_add_flow_rules
+   14.02%     0.26%  tc       [kernel.vmlinux]    [k] pcpu_alloc_area
+   13.09%     0.13%  tc       [mlx5_core]         [k] mlx5_fc_create
+    9.77%     0.30%  tc       [mlx5_core]         [k] add_rule_fg.isra.28
+    9.08%     0.84%  tc       [mlx5_core]         [k] mlx5_cmd_set_fte
+    8.90%     0.09%  tc       [mlx5_core]         [k] mlx5_cmd_fc_alloc
+    7.90%     0.12%  tc       [kernel.vmlinux]    [k] tfilter_notify
+    7.34%     0.61%  tc       [kernel.vmlinux]    [k] __queue_work
+    7.25%     0.26%  tc       [kernel.vmlinux]    [k] tcf_fill_node
+    6.73%     0.23%  tc       [kernel.vmlinux]    [k] wait_for_completion_timeout
+    6.67%     0.20%  tc       [cls_flower]        [k] fl_dump
+    6.52%     5.93%  tc       [kernel.vmlinux]    [k] memset_erms
+    5.77%     0.49%  tc       [kernel.vmlinux]    [k] schedule_timeout
+    5.57%     1.29%  tc       [kernel.vmlinux]    [k] try_to_wake_up
+    5.50%     0.11%  tc       [kernel.vmlinux]    [k] pcpu_block_update_hint_alloc
+    5.40%     0.85%  tc       [kernel.vmlinux]    [k] pcpu_block_refresh_hint
+    5.28%     0.11%  tc       [kernel.vmlinux]    [k] queue_work_on
+    5.19%     4.96%  tc       [kernel.vmlinux]    [k] find_next_bit
+    4.77%     0.11%  tc       [kernel.vmlinux]    [k] idr_alloc_u32
+    4.71%     0.10%  tc       [kernel.vmlinux]    [k] schedule
+    4.62%     0.30%  tc       [kernel.vmlinux]    [k] __sched_text_start
+    4.48%     4.41%  tc       [kernel.vmlinux]    [k] idr_get_free
+    4.19%     0.04%  tc       [kernel.vmlinux]    [k] tcf_idr_check_alloc
Cong Wang Feb. 20, 2019, 10:43 p.m. UTC | #13
On Tue, Feb 19, 2019 at 4:31 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >> Another problem that I found in cls_fw and cls_route is that they set
> >> arg->stop when empty. Both of them have code unchanged since it was
> >> committed initially in 2005 so I assume this convention is no longer
> >> relevant because all other classifiers don't do that (they only set
> >> arg->stop when arg->fn returns negative value).
> >>
> >
> > The question is why do you want to use arg->stop==0 as
> > an indication for emptiness? Isn't what arg->count==0
> > supposed to be?
>
> Good question! I initially wanted to implement it like that, but
> reconsidered because iterating through all filters on classifier to
> count them is O(N), and terminating on first filter and relying on
> arg->stop==1 is constant time. Making function that is called
> "tcf_proto_is_empty" linear on number of filters seemed sloppy to me...

Good point, however arg->stop _was_ supposed to be set only when an
error happens. Probably you want a new arg here to stop on the first
entry.
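
One possible shape for that suggestion, sketched roughly (the nonempty
field and walker_check_empty are hypothetical names, not an actual patch):
add a dedicated flag to struct tcf_walker that the callback sets on the
first real entry, so arg->stop keeps its error-only meaning and a NULL fh
from matchall no longer makes an empty tp look populated.

static int walker_check_empty(struct tcf_proto *tp, void *fh,
			      struct tcf_walker *arg)
{
	if (fh) {
		arg->nonempty = true;	/* hypothetical new bool in tcf_walker */
		return -1;		/* still stop the walk early */
	}
	return 0;
}

static bool tcf_proto_is_empty(struct tcf_proto *tp)
{
	struct tcf_walker walker = { .fn = walker_check_empty, };

	if (tp->ops->walk) {
		tp->ops->walk(tp, &walker);
		return !walker.nonempty;
	}
	return true;
}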
Cong Wang Feb. 20, 2019, 11 p.m. UTC | #14
On Tue, Feb 19, 2019 at 7:20 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
>
> On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
> >>
> >> I looked at the code and problem seems to be matchall classifier
> >> specific. My implementation of unlocked cls API assumes that concurrent
> >> insertions are possible and checks for it when deleting "empty" tp.
> >> Since classifiers don't expose number of elements, the only way to test
> >> this is to do tp->walk() on them and assume that walk callback is called
> >> once per filter on every classifier. In your example new tp is created
> >> for second filter, filter insertion fails, number of elements on newly
> >> created tp is checked with tp->walk() before deleting it. However,
> >> matchall classifier always calls the tp->walk() callback once, even when
> >> it doesn't have a valid filter (in this case with NULL filter pointer).
> >
> > Again, this can be eliminated by just switching to normal
> > non-retry logic. This is yet another headache to review this
> > kind of unlock-and-retry logic, I have no idea why you are such
> > a big fan of it.
>
> The retry approach was suggested to me multiple times by Jiri on
> previous code reviews so I assumed it is preferred approach in such
> cases. I don't have a strong preference in this regard, but locking
> whole tp on filter update will remove any parallelism when updating same
> classifier instance concurrently. The goal of these changes is to allow
> parallel rule update and to achieve that I had to introduce some
> complexity into the code.

Yeah, but with unlock-and-retry it would waste more time when a retry
occurs, so it can't be better in the worst-case scenario.

The question is essentially: do we want to waste CPU cycles when a
conflict occurs, or just block there until it is safe to enter the
critical section?

And is the retry bounded? Is it possible that we would retry infinitely
as long as we time it correctly?


>
> Now let me explain why these two approaches result completely different
> performance in this case. Lets start with a list of most CPU-consuming
> parts in new filter creation process in descending order (raw data at
> the end of this mail):
>
> 1) Hardware offload - if available and no skip_hw.
> 2) Exts (actions) initalization - most expensive part even with single
> action, CPU usage increases with number of actions per filter.
> 3) cls API.
> 4) Flower classifier data structure initialization.
>
> Note that 1)+2) is ~80% of cost of creating a flower filter. So if we
> just lock the whole flower classifier instance during rule update we
> serialize 1, 2 and 4, and only cls API (~13% of CPU cost) can be
> executed concurrently. However, in proposed flower implementation hw
> offloading and action initialization code is called without any locks
> and tp->lock is only obtained when modifying flower data structures,
> which means that only 3) is serialized and everything else (87% of CPU
> cost) can be executed in parallel.

What about when a conflict is detected and the whole change is retried?
And, of course, how often do conflicts happen?

Thanks.
Vlad Buslov Feb. 21, 2019, 3:49 p.m. UTC | #15
On Wed 20 Feb 2019 at 22:43, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 4:31 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 Feb 2019 at 05:26, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Fri, Feb 15, 2019 at 7:35 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >> Another problem that I found in cls_fw and cls_route is that they set
>> >> arg->stop when empty. Both of them have code unchanged since it was
>> >> committed initially in 2005 so I assume this convention is no longer
>> >> relevant because all other classifiers don't do that (they only set
>> >> arg->stop when arg->fn returns negative value).
>> >>
>> >
>> > The question is why do you want to use arg->stop==0 as
>> > an indication for emptiness? Isn't what arg->count==0
>> > supposed to be?
>>
>> Good question! I initially wanted to implement it like that, but
>> reconsidered because iterating through all filters on classifier to
>> count them is O(N), and terminating on first filter and relying on
>> arg->stop==1 is constant time. Making function that is called
>> "tcf_proto_is_empty" linear on number of filters seemed sloppy to me...
>
> Good point, however arg->stop _was_ supposed to set only when
> error happens. Probably you want a new arg here to stop on the first
> entry.

Got it. I'll prepare a patch for that.
Vlad Buslov Feb. 21, 2019, 5:11 p.m. UTC | #16
On Wed 20 Feb 2019 at 23:00, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Feb 19, 2019 at 7:20 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>>
>>
>> On Tue 19 Feb 2019 at 05:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
>> > On Fri, Feb 15, 2019 at 2:02 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> >>
>> >> I looked at the code and problem seems to be matchall classifier
>> >> specific. My implementation of unlocked cls API assumes that concurrent
>> >> insertions are possible and checks for it when deleting "empty" tp.
>> >> Since classifiers don't expose number of elements, the only way to test
>> >> this is to do tp->walk() on them and assume that walk callback is called
>> >> once per filter on every classifier. In your example new tp is created
>> >> for second filter, filter insertion fails, number of elements on newly
>> >> created tp is checked with tp->walk() before deleting it. However,
>> >> matchall classifier always calls the tp->walk() callback once, even when
>> >> it doesn't have a valid filter (in this case with NULL filter pointer).
>> >
>> > Again, this can be eliminated by just switching to normal
>> > non-retry logic. This is yet another headache to review this
>> > kind of unlock-and-retry logic, I have no idea why you are such
>> > a big fan of it.
>>
>> The retry approach was suggested to me multiple times by Jiri on
>> previous code reviews so I assumed it is preferred approach in such
>> cases. I don't have a strong preference in this regard, but locking
>> whole tp on filter update will remove any parallelism when updating same
>> classifier instance concurrently. The goal of these changes is to allow
>> parallel rule update and to achieve that I had to introduce some
>> complexity into the code.
>
> Yeah, but with unlock-and-retry it would waste more time when
> retry occurs. So it can't be better in the worst scenario.
>
> The question is essentially that do we want to waste CPU cycles
> when conflicts occurs or just block there until it is safe to enter
> the critical section?
>
> And, is the retry bound? Is it possible that we would retry infinitely
> as long as we time it correctly?
>
>
>>
>> Now let me explain why these two approaches result completely different
>> performance in this case. Lets start with a list of most CPU-consuming
>> parts in new filter creation process in descending order (raw data at
>> the end of this mail):
>>
>> 1) Hardware offload - if available and no skip_hw.
>> 2) Exts (actions) initalization - most expensive part even with single
>> action, CPU usage increases with number of actions per filter.
>> 3) cls API.
>> 4) Flower classifier data structure initialization.
>>
>> Note that 1)+2) is ~80% of cost of creating a flower filter. So if we
>> just lock the whole flower classifier instance during rule update we
>> serialize 1, 2 and 4, and only cls API (~13% of CPU cost) can be
>> executed concurrently. However, in proposed flower implementation hw
>> offloading and action initialization code is called without any locks
>> and tp->lock is only obtained when modifying flower data structures,
>> which means that only 3) is serialized and everything else (87% of CPU
>> cost) can be executed in parallel.
>
> What about when conflicts detected and retry the whole change?
> And, of course, how often do conflicts happen?
>
> Thanks.

I had similar concerns when designing this change. Let's look at the two
cases where this retry is needed.

One process creates the first filter on a classifier and fails, while
other processes are concurrently trying to add filters to the same
block/chain/tp:

1) Process obtains filter_chain_lock, performs unsuccessful tp lookup,
   releases the lock.
2) Calls tcf_chain_tp_insert_unique() which obtains filter_chain_lock,
   inserts new tp, releases the lock.
3) Calls tp->ops->change() that returns an error.
4) Calls tcf_chain_tp_delete_empty() which takes filter_chain_lock, verifies that no
   filters were added to tp concurrently, sets tp->deleting flag, removes
   tp from chain.

This is supposed to be a very rare occurrence, because for the retry to
happen it not only requires concurrent insertions to the same
block/chain/tp, but also that a tp with the requested prio didn't exist
before and that no concurrent process succeeded in adding at least one
filter to the tp during step 3, before it is marked for deletion in
step 4 (otherwise tcf_proto_check_delete() fails and concurrent threads
don't retry).

Another case is when the last filter is being deleted while concurrent
processes are adding new filters to the same block/chain/tp:

1) tc_del_tfilter() gets last filter with tp->ops->get()
2) Deletes it with tp->ops->delete()...
3) ... which returns the 'last' hint set to true.
4) Calls tcf_chain_tp_delete_empty() which takes filter_chain_lock, verifies that no
   filters were added to tp concurrently, sets tp->deleting flag, removes
   tp from chain.

This case is also quite rare because it requires concurrent users to
successfully look up the tp before tp->deleting is set to true and the
tp is removed from the chain, but not to create any new filters on the
tp during that time.

After considering this I decided that it is not worth penalizing the
common case of updating filters, by completely removing parallelism when
updates target the same tp instance, for such rare corner cases as
described above.

Now regarding forcing users to retry indefinitely. In the latter case no
more than one retry is possible, because concurrent add processes create
a new tp on the first retry. In the former case multiple retries are
possible, but blocking concurrent users indefinitely would require a
malicious process to somehow always have priority when obtaining
filter_chain_lock during steps 1 and 2, then wait to allow all concurrent
users to look up the tp, then obtain filter_chain_lock in step 4 and
initiate tp deletion before any of the concurrent users that have a
reference to this new tp instance can insert a single filter on it, then
go back to step 1, obtain the lock first, and repeat. I don't see how
this can be timed from userspace repeatedly, as creating the first filter
on a new tp involves multiple cycles of taking and releasing
filter_chain_lock, and each of them requires the attacker to "influence"
the kernel scheduler to behave in a very specific fashion.

Regards,
Vlad

Patch

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31b8ea66a47d..85993d7efee6 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -341,6 +341,8 @@  struct qdisc_skb_cb {
 typedef void tcf_chain_head_change_t(struct tcf_proto *tp_head, void *priv);
 
 struct tcf_chain {
+	/* Protects filter_chain. */
+	struct mutex filter_chain_lock;
 	struct tcf_proto __rcu *filter_chain;
 	struct list_head list;
 	struct tcf_block *block;
@@ -374,6 +376,21 @@  struct tcf_block {
 	struct rcu_head rcu;
 };
 
+#ifdef CONFIG_PROVE_LOCKING
+static inline bool lockdep_tcf_chain_is_locked(struct tcf_chain *chain)
+{
+	return lockdep_is_held(&chain->filter_chain_lock);
+}
+#else
+static inline bool lockdep_tcf_chain_is_locked(struct tcf_block *chain)
+{
+	return true;
+}
+#endif /* #ifdef CONFIG_PROVE_LOCKING */
+
+#define tcf_chain_dereference(p, chain)					\
+	rcu_dereference_protected(p, lockdep_tcf_chain_is_locked(chain))
+
 static inline void tcf_block_offload_inc(struct tcf_block *block, u32 *flags)
 {
 	if (*flags & TCA_CLS_FLAGS_IN_HW)
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 0dcce8b0c7b4..3fce30ae9a9b 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -221,6 +221,7 @@  static struct tcf_chain *tcf_chain_create(struct tcf_block *block,
 	if (!chain)
 		return NULL;
 	list_add_tail(&chain->list, &block->chain_list);
+	mutex_init(&chain->filter_chain_lock);
 	chain->block = block;
 	chain->index = chain_index;
 	chain->refcnt = 1;
@@ -280,6 +281,7 @@  static void tcf_chain_destroy(struct tcf_chain *chain, bool free_block)
 {
 	struct tcf_block *block = chain->block;
 
+	mutex_destroy(&chain->filter_chain_lock);
 	kfree(chain);
 	if (free_block)
 		tcf_block_destroy(block);
@@ -443,9 +445,13 @@  static void tcf_chain_put_explicitly_created(struct tcf_chain *chain)
 
 static void tcf_chain_flush(struct tcf_chain *chain)
 {
-	struct tcf_proto *tp = rtnl_dereference(chain->filter_chain);
+	struct tcf_proto *tp;
 
+	mutex_lock(&chain->filter_chain_lock);
+	tp = tcf_chain_dereference(chain->filter_chain, chain);
 	tcf_chain0_head_change(chain, NULL);
+	mutex_unlock(&chain->filter_chain_lock);
+
 	while (tp) {
 		RCU_INIT_POINTER(chain->filter_chain, tp->next);
 		tcf_proto_destroy(tp, NULL);
@@ -785,11 +791,29 @@  tcf_chain0_head_change_cb_add(struct tcf_block *block,
 
 	mutex_lock(&block->lock);
 	chain0 = block->chain0.chain;
-	if (chain0 && chain0->filter_chain)
-		tcf_chain_head_change_item(item, chain0->filter_chain);
-	list_add(&item->list, &block->chain0.filter_chain_list);
+	if (chain0)
+		tcf_chain_hold(chain0);
+	else
+		list_add(&item->list, &block->chain0.filter_chain_list);
 	mutex_unlock(&block->lock);
 
+	if (chain0) {
+		struct tcf_proto *tp_head;
+
+		mutex_lock(&chain0->filter_chain_lock);
+
+		tp_head = tcf_chain_dereference(chain0->filter_chain, chain0);
+		if (tp_head)
+			tcf_chain_head_change_item(item, tp_head);
+
+		mutex_lock(&block->lock);
+		list_add(&item->list, &block->chain0.filter_chain_list);
+		mutex_unlock(&block->lock);
+
+		mutex_unlock(&chain0->filter_chain_lock);
+		tcf_chain_put(chain0);
+	}
+
 	return 0;
 }
 
@@ -1464,9 +1488,10 @@  struct tcf_chain_info {
 	struct tcf_proto __rcu *next;
 };
 
-static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain_info *chain_info)
+static struct tcf_proto *tcf_chain_tp_prev(struct tcf_chain *chain,
+					   struct tcf_chain_info *chain_info)
 {
-	return rtnl_dereference(*chain_info->pprev);
+	return tcf_chain_dereference(*chain_info->pprev, chain);
 }
 
 static void tcf_chain_tp_insert(struct tcf_chain *chain,
@@ -1475,7 +1500,7 @@  static void tcf_chain_tp_insert(struct tcf_chain *chain,
 {
 	if (*chain_info->pprev == chain->filter_chain)
 		tcf_chain0_head_change(chain, tp);
-	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain_info));
+	RCU_INIT_POINTER(tp->next, tcf_chain_tp_prev(chain, chain_info));
 	rcu_assign_pointer(*chain_info->pprev, tp);
 	tcf_chain_hold(chain);
 }
@@ -1484,7 +1509,7 @@  static void tcf_chain_tp_remove(struct tcf_chain *chain,
 				struct tcf_chain_info *chain_info,
 				struct tcf_proto *tp)
 {
-	struct tcf_proto *next = rtnl_dereference(chain_info->next);
+	struct tcf_proto *next = tcf_chain_dereference(chain_info->next, chain);
 
 	if (tp == chain->filter_chain)
 		tcf_chain0_head_change(chain, next);
@@ -1502,7 +1527,8 @@  static struct tcf_proto *tcf_chain_tp_find(struct tcf_chain *chain,
 
 	/* Check the chain for existence of proto-tcf with this priority */
 	for (pprev = &chain->filter_chain;
-	     (tp = rtnl_dereference(*pprev)); pprev = &tp->next) {
+	     (tp = tcf_chain_dereference(*pprev, chain));
+	     pprev = &tp->next) {
 		if (tp->prio >= prio) {
 			if (tp->prio == prio) {
 				if (prio_allocate ||
@@ -1710,12 +1736,13 @@  static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, prio_allocate);
 	if (IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = PTR_ERR(tp);
-		goto errout;
+		goto errout_locked;
 	}
 
 	if (tp == NULL) {
@@ -1724,29 +1751,37 @@  static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		if (tca[TCA_KIND] == NULL || !protocol) {
 			NL_SET_ERR_MSG(extack, "Filter kind and protocol must be specified");
 			err = -EINVAL;
-			goto errout;
+			goto errout_locked;
 		}
 
 		if (!(n->nlmsg_flags & NLM_F_CREATE)) {
 			NL_SET_ERR_MSG(extack, "Need both RTM_NEWTFILTER and NLM_F_CREATE to create a new filter");
 			err = -ENOENT;
-			goto errout;
+			goto errout_locked;
 		}
 
 		if (prio_allocate)
-			prio = tcf_auto_prio(tcf_chain_tp_prev(&chain_info));
+			prio = tcf_auto_prio(tcf_chain_tp_prev(chain,
+							       &chain_info));
 
+		mutex_unlock(&chain->filter_chain_lock);
 		tp = tcf_proto_create(nla_data(tca[TCA_KIND]),
 				      protocol, prio, chain, extack);
 		if (IS_ERR(tp)) {
 			err = PTR_ERR(tp);
 			goto errout;
 		}
+
+		mutex_lock(&chain->filter_chain_lock);
+		tcf_chain_tp_insert(chain, &chain_info, tp);
+		mutex_unlock(&chain->filter_chain_lock);
 		tp_created = 1;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
 		err = -EINVAL;
-		goto errout;
+		goto errout_locked;
+	} else {
+		mutex_unlock(&chain->filter_chain_lock);
 	}
 
 	fh = tp->ops->get(tp, t->tcm_handle);
@@ -1772,15 +1807,11 @@  static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 	err = tp->ops->change(net, skb, tp, cl, t->tcm_handle, tca, &fh,
 			      n->nlmsg_flags & NLM_F_CREATE ? TCA_ACT_NOREPLACE : TCA_ACT_REPLACE,
 			      extack);
-	if (err == 0) {
-		if (tp_created)
-			tcf_chain_tp_insert(chain, &chain_info, tp);
+	if (err == 0)
 		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
 			       RTM_NEWTFILTER, false);
-	} else {
-		if (tp_created)
-			tcf_proto_destroy(tp, NULL);
-	}
+	else if (tp_created)
+		tcf_proto_destroy(tp, NULL);
 
 errout:
 	if (chain)
@@ -1790,6 +1821,10 @@  static int tc_new_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		/* Replay the request. */
 		goto replay;
 	return err;
+
+errout_locked:
+	mutex_unlock(&chain->filter_chain_lock);
+	goto errout;
 }
 
 static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
@@ -1865,31 +1900,34 @@  static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, false);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = tp ? PTR_ERR(tp) : -ENOENT;
-		goto errout;
+		goto errout_locked;
 	} else if (tca[TCA_KIND] && nla_strcmp(tca[TCA_KIND], tp->ops->kind)) {
 		NL_SET_ERR_MSG(extack, "Specified filter kind does not match existing one");
 		err = -EINVAL;
+		goto errout_locked;
+	} else if (t->tcm_handle == 0) {
+		tcf_chain_tp_remove(chain, &chain_info, tp);
+		mutex_unlock(&chain->filter_chain_lock);
+
+		tfilter_notify(net, skb, n, tp, block, q, parent, fh,
+			       RTM_DELTFILTER, false);
+		tcf_proto_destroy(tp, extack);
+		err = 0;
 		goto errout;
 	}
+	mutex_unlock(&chain->filter_chain_lock);
 
 	fh = tp->ops->get(tp, t->tcm_handle);
 
 	if (!fh) {
-		if (t->tcm_handle == 0) {
-			tcf_chain_tp_remove(chain, &chain_info, tp);
-			tfilter_notify(net, skb, n, tp, block, q, parent, fh,
-				       RTM_DELTFILTER, false);
-			tcf_proto_destroy(tp, extack);
-			err = 0;
-		} else {
-			NL_SET_ERR_MSG(extack, "Specified filter handle not found");
-			err = -ENOENT;
-		}
+		NL_SET_ERR_MSG(extack, "Specified filter handle not found");
+		err = -ENOENT;
 	} else {
 		bool last;
 
@@ -1899,7 +1937,10 @@  static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		if (err)
 			goto errout;
 		if (last) {
+			mutex_lock(&chain->filter_chain_lock);
 			tcf_chain_tp_remove(chain, &chain_info, tp);
+			mutex_unlock(&chain->filter_chain_lock);
+
 			tcf_proto_destroy(tp, extack);
 		}
 	}
@@ -1909,6 +1950,10 @@  static int tc_del_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		tcf_chain_put(chain);
 	tcf_block_release(q, block);
 	return err;
+
+errout_locked:
+	mutex_unlock(&chain->filter_chain_lock);
+	goto errout;
 }
 
 static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
@@ -1966,8 +2011,10 @@  static int tc_get_tfilter(struct sk_buff *skb, struct nlmsghdr *n,
 		goto errout;
 	}
 
+	mutex_lock(&chain->filter_chain_lock);
 	tp = tcf_chain_tp_find(chain, &chain_info, protocol,
 			       prio, false);
+	mutex_unlock(&chain->filter_chain_lock);
 	if (!tp || IS_ERR(tp)) {
 		NL_SET_ERR_MSG(extack, "Filter with specified priority/protocol not found");
 		err = tp ? PTR_ERR(tp) : -ENOENT;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 66ba2ce2320f..e24568f9246c 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -1366,7 +1366,11 @@  static void mini_qdisc_rcu_func(struct rcu_head *head)
 void mini_qdisc_pair_swap(struct mini_Qdisc_pair *miniqp,
 			  struct tcf_proto *tp_head)
 {
-	struct mini_Qdisc *miniq_old = rtnl_dereference(*miniqp->p_miniq);
+	/* Protected with chain0->filter_chain_lock.
+	 * Can't access chain directly because tp_head can be NULL.
+	 */
+	struct mini_Qdisc *miniq_old =
+		rcu_dereference_protected(*miniqp->p_miniq, 1);
 	struct mini_Qdisc *miniq;
 
 	if (!tp_head) {