Message ID:   20190206230019.1303-1-xiyou.wangcong@gmail.com
State:        Awaiting Upstream
Delegated to: David Miller
Series:       [net-next,v2] mlx5: use RCU lock in mlx5_eq_cq_get()
On 02/06/2019 03:00 PM, Cong Wang wrote:
> mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> gets a lot of contentions when we test some heavy workload
> with 60 RX queues and 80 CPU's, and it is clearly shown in the
> flame graph.
>
> In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> we don't have to take a spinlock on this hot path. This is pretty
> much similar to commit 291c566a2891
> ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> are still serialized with the spinlock, and with synchronize_irq()
> it should be safe to just move the fast path to RCU read lock.
>
> This patch itself reduces the latency by about 50% for our memcached
> workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
> this spinlock gets some rework in commit 02d92f790364
> ("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> index ee04aab65a9f..7092457705a2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> @@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
>  	struct mlx5_cq_table *table = &eq->cq_table;
>  	struct mlx5_core_cq *cq = NULL;
>
> -	spin_lock(&table->lock);
> +	rcu_read_lock();
>  	cq = radix_tree_lookup(&table->tree, cqn);
>  	if (likely(cq))
>  		mlx5_cq_hold(cq);

I suspect that you need a variant that makes sure refcount is not zero.
(Typical RCU rules apply.)

	if (cq && !refcount_inc_not_zero(&cq->refcount))
		cq = NULL;

See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 ("rds: fix refcount bug
in rds_sock_addref") for a similar issue I fixed recently.
On 02/06/2019 03:36 PM, Eric Dumazet wrote:
> I suspect that you need a variant that makes sure refcount is not zero.
>
> (Typical RCU rules apply.)
>
> 	if (cq && !refcount_inc_not_zero(&cq->refcount))
> 		cq = NULL;
>
> See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 ("rds: fix refcount bug
> in rds_sock_addref") for a similar issue I fixed recently.

By the way, we also could avoid two atomics on cq->refcount by using
rcu_read_lock()/unlock() in the two callers.
On Wed, Feb 6, 2019 at 3:36 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 02/06/2019 03:00 PM, Cong Wang wrote:
> > mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> > gets a lot of contentions when we test some heavy workload
> > with 60 RX queues and 80 CPU's, and it is clearly shown in the
> > flame graph.
> >
> > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > we don't have to take a spinlock on this hot path. This is pretty
> > much similar to commit 291c566a2891
> > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > are still serialized with the spinlock, and with synchronize_irq()
> > it should be safe to just move the fast path to RCU read lock.
> >
> > This patch itself reduces the latency by about 50% for our memcached
> > workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
> > this spinlock gets some rework in commit 02d92f790364
> > ("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
> >
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> >  1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > index ee04aab65a9f..7092457705a2 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > @@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> >  	struct mlx5_cq_table *table = &eq->cq_table;
> >  	struct mlx5_core_cq *cq = NULL;
> >
> > -	spin_lock(&table->lock);
> > +	rcu_read_lock();
> >  	cq = radix_tree_lookup(&table->tree, cqn);
> >  	if (likely(cq))
> >  		mlx5_cq_hold(cq);
>
> I suspect that you need a variant that makes sure refcount is not zero.
>
> (Typical RCU rules apply.)
>
> 	if (cq && !refcount_inc_not_zero(&cq->refcount))
> 		cq = NULL;
>
> See commit 6fa19f5637a6c22bc0999596bcc83bdcac8a4fa6 ("rds: fix refcount bug
> in rds_sock_addref") for a similar issue I fixed recently.

synchronize_irq() is called before mlx5_cq_put(), so I don't
see why readers could get 0 refcnt.

For the rds you mentioned, it doesn't wait for readers, this
is why it needs to check against 0 and why it is different from
this one.

Thanks.
On 02/06/2019 04:04 PM, Cong Wang wrote:
> synchronize_irq() is called before mlx5_cq_put(), so I don't
> see why readers could get 0 refcnt.

Then the more reasons to get rid of the refcount increment/decrement completely ...

Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
since synchronize_irq() is enough.

> For the rds you mentioned, it doesn't wait for readers, this
> is why it needs to check against 0 and why it is different from
> this one.
>
> Thanks.
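[Editor's note: the teardown ordering the thread converges on can be sketched as follows. This is kernel-style pseudocode based only on this discussion — the function names follow the patch and Saeed's quote, the bodies are illustrative, not the actual mlx5 code:]

```
/* Sketch: with this ordering, any IRQ handler that found the CQ via
 * radix_tree_lookup() is guaranteed to have returned before the CQ is
 * freed, so the lookup path needs neither a refcount nor RCU. */
mlx5_eq_del_cq(eq, cq):
	spin_lock(&table->lock);
	radix_tree_delete(&table->tree, cq->cqn);  /* 1. unpublish: no new lookups */
	spin_unlock(&table->lock);

	synchronize_irq(eq->irqn);                 /* 2. wait out any in-flight
	                                              IRQ handler on this EQ */

	/* 3. only now drop the final reference / free the CQ */
```

synchronize_irq() here plays the role a grace period plays for RCU readers: it does not block new handlers from the deleted entry (step 1 already did), it only waits for handlers that may still hold a pointer obtained before the delete.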
On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 02/06/2019 04:04 PM, Cong Wang wrote:
>
> > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > see why readers could get 0 refcnt.
>
> Then the more reasons to get rid of the refcount increment/decrement completely ...
>
> Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> since synchronize_irq() is enough.

I already suggested this; quoting myself from my first reply to V0 of this patch:

"another way to do it is not to do any refcounting in the irq handler and
fence cq removal via synchronize_irq(eq->irqn) on mlx5_eq_del_cq."

I already have a patch; I was just waiting for Cong to push V2.
On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> On 02/06/2019 04:04 PM, Cong Wang wrote:
>
> > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > see why readers could get 0 refcnt.
>
> Then the more reasons to get rid of the refcount increment/decrement completely ...
>
> Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> since synchronize_irq() is enough.

Excellent point.

For the refcnt, I am afraid we still have to hold refcnt for the tasklet,
mlx5_cq_tasklet_cb. But yeah, should be safe to remove from IRQ path.
On Wed, Feb 6, 2019 at 4:53 PM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Feb 6, 2019 at 4:28 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >
> > On 02/06/2019 04:04 PM, Cong Wang wrote:
> >
> > > synchronize_irq() is called before mlx5_cq_put(), so I don't
> > > see why readers could get 0 refcnt.
> >
> > Then the more reasons to get rid of the refcount increment/decrement completely ...
> >
> > Technically, even the rcu_read_lock() and rcu_read_unlock() are not needed,
> > since synchronize_irq() is enough.
>
> Excellent point.
>
> For the refcnt, I am afraid we still have to hold refcnt for the tasklet,
> mlx5_cq_tasklet_cb. But yeah, should be safe to remove from IRQ path.

The tasklet path is for RDMA CQs only; netdev CQ handling will be
refcount-free.
On Wed, 2019-02-06 at 15:00 -0800, Cong Wang wrote:
> mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> gets a lot of contentions when we test some heavy workload
> with 60 RX queues and 80 CPU's, and it is clearly shown in the
> flame graph.
>
> In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> we don't have to take a spinlock on this hot path. This is pretty
> much similar to commit 291c566a2891
> ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> are still serialized with the spinlock, and with synchronize_irq()
> it should be safe to just move the fast path to RCU read lock.
>
> This patch itself reduces the latency by about 50% for our memcached
> workload on a 4.14 kernel we test. In upstream, as pointed out by Saeed,
> this spinlock gets some rework in commit 02d92f790364
> ("net/mlx5: CQ Database per EQ"), so the difference could be smaller.
>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Acked-by: Saeed Mahameed <saeedm@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>

Applied to mlx5-next.

Will be sent to net-next in my next pull request.

Thanks!
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index ee04aab65a9f..7092457705a2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -114,11 +114,11 @@ static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	struct mlx5_core_cq *cq = NULL;
 
-	spin_lock(&table->lock);
+	rcu_read_lock();
 	cq = radix_tree_lookup(&table->tree, cqn);
 	if (likely(cq))
 		mlx5_cq_hold(cq);
-	spin_unlock(&table->lock);
+	rcu_read_unlock();
 
 	return cq;
 }
@@ -371,9 +371,9 @@ int mlx5_eq_add_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	int err;
 
-	spin_lock_irq(&table->lock);
+	spin_lock(&table->lock);
 	err = radix_tree_insert(&table->tree, cq->cqn, cq);
-	spin_unlock_irq(&table->lock);
+	spin_unlock(&table->lock);
 
 	return err;
 }
@@ -383,9 +383,9 @@ int mlx5_eq_del_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq)
 	struct mlx5_cq_table *table = &eq->cq_table;
 	struct mlx5_core_cq *tmp;
 
-	spin_lock_irq(&table->lock);
+	spin_lock(&table->lock);
 	tmp = radix_tree_delete(&table->tree, cq->cqn);
-	spin_unlock_irq(&table->lock);
+	spin_unlock(&table->lock);
 
 	if (!tmp) {
 		mlx5_core_warn(eq->dev, "cq 0x%x not found in eq 0x%x tree\n",
 			       eq->eqn, cq->cqn);