Message ID | 20190225154544.10453-1-vladbu@mellanox.com |
---|---|
State | Accepted |
Delegated to: | David Miller |
Series | [net-next] net: sched: don't release block->lock when dumping chains |
From: Vlad Buslov <vladbu@mellanox.com>
Date: Mon, 25 Feb 2019 17:45:44 +0200

> Function tc_dump_chain() obtains and releases block->lock on each iteration
> of its inner loop that dumps all chains on block. Outputting chain template
> info is a fast operation, so locking/unlocking the mutex multiple times is
> needless overhead when the lock is highly contended. Modify tc_dump_chain()
> to only obtain block->lock once and dump all chains without releasing it.
>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied, thanks Vlad.
On Mon, Feb 25, 2019 at 7:45 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> Function tc_dump_chain() obtains and releases block->lock on each iteration
> of its inner loop that dumps all chains on block. Outputting chain template
> info is a fast operation, so locking/unlocking the mutex multiple times is
> needless overhead when the lock is highly contended. Modify tc_dump_chain()
> to only obtain block->lock once and dump all chains without releasing it.
>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>

Thanks for the followup!

Isn't it similar for __tcf_get_next_proto() in tcf_chain_dump()?
And for tc_dump_tfilter()?
On Tue 26 Feb 2019 at 00:15, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Mon, Feb 25, 2019 at 7:45 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> [...]
>
> Thanks for the followup!
>
> Isn't it similar for __tcf_get_next_proto() in tcf_chain_dump()?
> And for tc_dump_tfilter()?

Not really. These two dump all tp filters, not just a template, which is
O(n) in the number of filters and can be slow because the hw offload API is
called for each of them. Our typical use case involves a periodic filter
dump (to update stats) while multiple concurrent user-space threads are
updating filters, so it is important for them to be able to execute in
parallel.
On Tue, Feb 26, 2019 at 8:10 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> On Tue 26 Feb 2019 at 00:15, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > [...]
> > Isn't it similar for __tcf_get_next_proto() in tcf_chain_dump()?
> > And for tc_dump_tfilter()?
>
> Not really. These two dump all tp filters, not just a template, which is
> O(n) in the number of filters and can be slow because the hw offload API is
> called for each of them. Our typical use case involves a periodic filter
> dump (to update stats) while multiple concurrent user-space threads are
> updating filters, so it is important for them to be able to execute in
> parallel.

Hmm, but if these are read-only, you probably don't even need a mutex; you
can just use the RCU read lock to protect the list iteration, and you can
still grab the refcnt in the same way.
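For reference, a minimal sketch of the RCU-plus-refcount pattern Cong is
describing, assuming a hypothetical element type with an atomic refcount_t
(the struct item / item_get_first names are illustrative, not the actual
cls_api.c structures): iterate under the RCU read lock, take a reference on
a live entry, then drop the RCU lock so later work may sleep.

#include <linux/list.h>
#include <linux/rculist.h>
#include <linux/rcupdate.h>
#include <linux/refcount.h>

/* Hypothetical element: refcnt must be atomic (refcount_t) for this
 * lockless read side to be safe without a mutex.
 */
struct item {
        struct list_head list;
        refcount_t refcnt;
};

/* Walk an RCU-protected list, take a reference on the first live entry,
 * then drop the RCU read lock so the caller may sleep (e.g. in driver
 * offload callbacks) while still holding the reference.
 */
static struct item *item_get_first(struct list_head *head)
{
        struct item *it;

        rcu_read_lock();
        list_for_each_entry_rcu(it, head, list) {
                if (refcount_inc_not_zero(&it->refcnt)) {
                        rcu_read_unlock();
                        return it;      /* caller drops the reference later */
                }
        }
        rcu_read_unlock();

        return NULL;
}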
On Wed 27 Feb 2019 at 23:03, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Tue, Feb 26, 2019 at 8:10 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> [...]
>> Not really. These two dump all tp filters, not just a template, which is
>> O(n) in the number of filters and can be slow because the hw offload API is
>> called for each of them. Our typical use case involves a periodic filter
>> dump (to update stats) while multiple concurrent user-space threads are
>> updating filters, so it is important for them to be able to execute in
>> parallel.
>
> Hmm, but if these are read-only, you probably don't even need a mutex; you
> can just use the RCU read lock to protect the list iteration, and you can
> still grab the refcnt in the same way.

That is how it worked in my initial implementation. However, it doesn't work
with hw offloads because driver callbacks can sleep.
On Thu, Feb 28, 2019 at 6:53 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>
> On Wed 27 Feb 2019 at 23:03, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> > [...]
> > Hmm, but if these are read-only, you probably don't even need a mutex; you
> > can just use the RCU read lock to protect the list iteration, and you can
> > still grab the refcnt in the same way.
>
> That is how it worked in my initial implementation. However, it doesn't work
> with hw offloads because driver callbacks can sleep.

Hmm? You drop the RCU read lock after grabbing the refcnt, right?
If so, what's the problem with sleeping?
On Sat 02 Mar 2019 at 00:08, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Thu, Feb 28, 2019 at 6:53 AM Vlad Buslov <vladbu@mellanox.com> wrote:
>> [...]
>> That is how it worked in my initial implementation. However, it doesn't work
>> with hw offloads because driver callbacks can sleep.
>
> Hmm? You drop the RCU read lock after grabbing the refcnt, right?
> If so, what's the problem with sleeping?

Okay, I misunderstood your suggestion. In tc_dump_tfilter() we can't use RCU
in __tcf_get_next_chain() because chain reference counters are not atomic and
require the protection of block->lock. __tcf_get_next_proto() requires
chain->filter_chain_lock because it checks the 'deleting' flag in addition to
taking a reference to the tp.
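For contrast, a rough sketch of the constraint Vlad describes, with
illustrative names (fake_chain, fake_block) rather than the actual cls_api.c
code: when the reference count is a plain integer and a 'deleting' flag must
be checked under the same mutex that writers hold, the lookup has to take
that mutex, so an RCU-only read side is not sufficient.

#include <linux/list.h>
#include <linux/mutex.h>
#include <linux/types.h>

/* Illustrative structures: refcnt and deleting are only consistent while
 * owner lock is held, so they cannot be sampled safely under just
 * rcu_read_lock().
 */
struct fake_chain {
        struct list_head list;
        unsigned int refcnt;    /* plain integer, protected by block->lock */
        bool deleting;          /* also protected by block->lock */
};

struct fake_block {
        struct mutex lock;
        struct list_head chain_list;
};

/* Take a reference on the first chain that is not being deleted. */
static struct fake_chain *fake_chain_get_first(struct fake_block *block)
{
        struct fake_chain *chain, *found = NULL;

        mutex_lock(&block->lock);
        list_for_each_entry(chain, &block->chain_list, list) {
                if (chain->deleting)
                        continue;
                chain->refcnt++;        /* non-atomic: needs block->lock */
                found = chain;
                break;
        }
        mutex_unlock(&block->lock);

        return found;
}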
diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
index 3543be31d400..5c6a9baa389f 100644
--- a/net/sched/cls_api.c
+++ b/net/sched/cls_api.c
@@ -2996,12 +2996,12 @@ static int tc_ctl_chain(struct sk_buff *skb, struct nlmsghdr *n,
 /* called with RTNL */
 static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
 {
-        struct tcf_chain *chain, *chain_prev;
         struct net *net = sock_net(skb->sk);
         struct nlattr *tca[TCA_MAX + 1];
         struct Qdisc *q = NULL;
         struct tcf_block *block;
         struct tcmsg *tcm = nlmsg_data(cb->nlh);
+        struct tcf_chain *chain;
         long index_start;
         long index;
         u32 parent;
@@ -3064,11 +3064,8 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
         index_start = cb->args[0];
         index = 0;
 
-        for (chain = __tcf_get_next_chain(block, NULL);
-             chain;
-             chain_prev = chain,
-                     chain = __tcf_get_next_chain(block, chain),
-                     tcf_chain_put(chain_prev)) {
+        mutex_lock(&block->lock);
+        list_for_each_entry(chain, &block->chain_list, list) {
                 if ((tca[TCA_CHAIN] &&
                      nla_get_u32(tca[TCA_CHAIN]) != chain->index))
                         continue;
@@ -3076,17 +3073,18 @@ static int tc_dump_chain(struct sk_buff *skb, struct netlink_callback *cb)
                         index++;
                         continue;
                 }
+                if (tcf_chain_held_by_acts_only(chain))
+                        continue;
                 err = tc_chain_fill_node(chain->tmplt_ops, chain->tmplt_priv,
                                          chain->index, net, skb, block,
                                          NETLINK_CB(cb->skb).portid,
                                          cb->nlh->nlmsg_seq, NLM_F_MULTI,
                                          RTM_NEWCHAIN);
-                if (err <= 0) {
-                        tcf_chain_put(chain);
+                if (err <= 0)
                         break;
-                }
                 index++;
         }
+        mutex_unlock(&block->lock);
 
         if (tcm->tcm_ifindex == TCM_IFINDEX_MAGIC_BLOCK)
                 tcf_block_refcnt_put(block, true);
Function tc_dump_chain() obtains and releases block->lock on each iteration
of its inner loop that dumps all chains on block. Outputting chain template
info is a fast operation, so locking/unlocking the mutex multiple times is
needless overhead when the lock is highly contended. Modify tc_dump_chain()
to only obtain block->lock once and dump all chains without releasing it.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/cls_api.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)