
[net-next] mlx4: do not fire tasklet unless necessary

Message ID 1486729678.7793.139.camel@edumazet-glaptop3.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Feb. 10, 2017, 12:27 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

All rx and tx netdev interrupts are handled respectively by
mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.

But mlx4_eq_int() also fires a tasklet to service all items that were
queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
do unless a user CQE was handled.

This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
tasklet invocations.

This patch saves this overhead by carefully firing the tasklet directly
from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Cc: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
 drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
 2 files changed, 6 insertions(+), 9 deletions(-)
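
The heart of the change is a "kick on first insertion" pattern: the
tasklet is scheduled only when the per-EQ list transitions from empty
to non-empty, so any number of CQs queued from one interrupt cause
exactly one tasklet_schedule(). A minimal user-space sketch of the
idea (hypothetical names, not the driver code):

#include <stdbool.h>
#include <stdio.h>

struct node { struct node *next; };

static struct node *pending;	/* per-EQ work list; locking omitted */
static int kicks;		/* times the "tasklet" was scheduled */

static void schedule_worker(void)	/* stands in for tasklet_schedule() */
{
	kicks++;
}

static void queue_work(struct node *n)
{
	bool kick = (pending == NULL);	/* empty -> non-empty transition? */

	n->next = pending;
	pending = n;
	if (kick)
		schedule_worker();	/* only the first entry kicks */
}

int main(void)
{
	struct node a, b, c;

	queue_work(&a);	/* schedules the worker */
	queue_work(&b);	/* no extra schedule */
	queue_work(&c);	/* no extra schedule */
	printf("3 items queued, %d kick(s)\n", kicks);	/* prints 1 */
	return 0;
}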

Comments

David Miller Feb. 14, 2017, 4:29 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 10 Feb 2017 04:27:58 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> All rx and tx netdev interrupts are handled respectively by
> mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
> 
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
> do unless a user CQE was handled.
> 
> This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
> tasklet invocations.
> 
> This patch saves this overhead by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Tariq/Saeed, please review.
Saeed Mahameed Feb. 15, 2017, 11:10 a.m. UTC | #2
On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> All rx and tx netdev interrupts are handled respectively by
> mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
>
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
> do unless a user CQE was handled.
>
> This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
> tasklet invocations.
>
> This patch saves this overhead by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
>  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
>  2 files changed, 6 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
>
>  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>  {
> -       unsigned long flags;
>         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> +       unsigned long flags;
> +       bool kick;
>
>         spin_lock_irqsave(&tasklet_ctx->lock, flags);
>         /* When migrating CQs between EQs will be implemented, please note
> @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>          */
>         if (list_empty_careful(&cq->tasklet_ctx.list)) {
>                 atomic_inc(&cq->refcount);
> +               kick = list_empty(&tasklet_ctx->list);

So the first CQ in would fire the tasklet, but wouldn't this cause CQE
processing loss in the same mlx4_eq_int() loop, if the tasklet were
fast enough to run while other CQEs are still adding themselves to the
tasklet_ctx->list?

Anyway, I tried to find race scenarios that could cause such a thing,
but the synchronization looks good.
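
For reference, the consumer side takes the same lock and splices the
whole pending list out before running the handlers; abridged and
paraphrased from mlx4_cq_tasklet_cb() in cq.c (details may differ):

	spin_lock_irqsave(&ctx->lock, flags);
	list_splice_tail_init(&ctx->list, &ctx->process_list);
	spin_unlock_irqrestore(&ctx->lock, flags);

	list_for_each_entry_safe(cq, temp, &ctx->process_list,
				 tasklet_ctx.list) {
		list_del_init(&cq->tasklet_ctx.list);
		cq->tasklet_ctx.comp(cq);
		/* ... refcount put, budget check ... */
	}

Any CQ added after the splice sees an empty ctx->list again and fires
a fresh kick, so a concurrent tasklet run cannot strand queued CQs.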

>                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> +               if (kick)
> +                       tasklet_schedule(&tasklet_ctx->task);
>         }
>         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
>  }
> diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
> index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>  {
>         struct mlx4_priv *priv = mlx4_priv(dev);
>         struct mlx4_eqe *eqe;
> -       int cqn = -1;
> +       int cqn;
>         int eqes_found = 0;
>         int set_ci = 0;
>         int port;
> @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>
>         eq_set_ci(eq, 1);
>
> -       /* cqn is 24bit wide but is initialized such that its higher bits
> -        * are ones too. Thus, if we got any event, cqn's high bits should be off
> -        * and we need to schedule the tasklet.
> -        */
> -       if (!(cqn & ~0xffffff))

What if we simply changed this condition to:
if (!list_empty_careful(&eq->tasklet_ctx.list))

Wouldn't this be sort of equivalent to what you did? This way we
would simply fire the tasklet only when needed and not on every
handled CQE.
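
For reference, list_empty_careful() checks both link pointers, so a
list that is in the middle of being modified is not falsely reported
as empty; from include/linux/list.h:

static inline int list_empty_careful(const struct list_head *head)
{
	struct list_head *next = head->next;
	return (next == head) && (next == head->prev);
}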

> -               tasklet_schedule(&eq->tasklet_ctx.task);
> -
>         return eqes_found;
>  }
>
>
>
Eric Dumazet Feb. 15, 2017, 1:29 p.m. UTC | #3
On Wed, 2017-02-15 at 13:10 +0200, Saeed Mahameed wrote:
> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > All rx and tx netdev interrupts are handled respectively by
> > mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
> >
> > But mlx4_eq_int() also fires a tasklet to service all items that were
> > queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
> > do unless a user CQE was handled.
> >
> > This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
> > tasklet invocations.
> >
> > This patch saves this overhead by carefully firing the tasklet directly
> > from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
> >  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
> >  2 files changed, 6 insertions(+), 9 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
> > index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
> > @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
> >
> >  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
> >  {
> > -       unsigned long flags;
> >         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
> > +       unsigned long flags;
> > +       bool kick;
> >
> >         spin_lock_irqsave(&tasklet_ctx->lock, flags);
> >         /* When migrating CQs between EQs will be implemented, please note
> > @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
> >          */
> >         if (list_empty_careful(&cq->tasklet_ctx.list)) {
> >                 atomic_inc(&cq->refcount);
> > +               kick = list_empty(&tasklet_ctx->list);
> 
> So the first CQ in would fire the tasklet, but wouldn't this cause CQE
> processing loss in the same mlx4_eq_int() loop, if the tasklet were
> fast enough to run while other CQEs are still adding themselves to the
> tasklet_ctx->list?


mlx4_eq_int() is a hard irq handler.

How could a tasklet run in the middle of it?

A tasklet is a softirq handler.

A softirq must wait until the current hard irq handler is done.
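
In outline, the same-CPU ordering looks like this (a simplified
call-flow sketch; exact function names vary by kernel version):

/*
 *   mlx4_msi_x_interrupt()          hard irq handler
 *     mlx4_eq_int()
 *       mlx4_add_cq_to_tasklet()
 *         tasklet_schedule()        only raises TASKLET_SOFTIRQ
 *   irq_exit()
 *     __do_softirq()
 *       tasklet_action()
 *         mlx4_cq_tasklet_cb()      runs after mlx4_eq_int() returned
 */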
> 
> Anyway, I tried to find race scenarios that could cause such a thing,
> but the synchronization looks good.
> 
> >                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
> > +               if (kick)
> > +                       tasklet_schedule(&tasklet_ctx->task);
> >         }
> >         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
> >  }
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
> > index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
> > @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
> >  {
> >         struct mlx4_priv *priv = mlx4_priv(dev);
> >         struct mlx4_eqe *eqe;
> > -       int cqn = -1;
> > +       int cqn;
> >         int eqes_found = 0;
> >         int set_ci = 0;
> >         int port;
> > @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
> >
> >         eq_set_ci(eq, 1);
> >
> > -       /* cqn is 24bit wide but is initialized such that its higher bits
> > -        * are ones too. Thus, if we got any event, cqn's high bits should be off
> > -        * and we need to schedule the tasklet.
> > -        */
> > -       if (!(cqn & ~0xffffff))
> 
> What if we simply changed this condition to:
> if (!list_empty_careful(&eq->tasklet_ctx.list))
> 
> Wouldn't this be sort of equivalent to what you did? This way we
> would simply fire the tasklet only when needed and not on every
> handled CQE.

Still, this test would be done one million times per second on my hosts.

What is the point, exactly?

Thanks.
Saeed Mahameed Feb. 15, 2017, 1:59 p.m. UTC | #4
On Wed, Feb 15, 2017 at 3:29 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 13:10 +0200, Saeed Mahameed wrote:
>> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > From: Eric Dumazet <edumazet@google.com>
>> >
>> > All rx and tx netdev interrupts are handled respectively by
>> > mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
>> >
>> > But mlx4_eq_int() also fires a tasklet to service all items that were
>> > queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
>> > do unless a user CQE was handled.
>> >
>> > This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
>> > tasklet invocations.
>> >
>> > This patch saves this overhead by carefully firing the tasklet directly
>> > from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>> >
>> > Signed-off-by: Eric Dumazet <edumazet@google.com>
>> > Cc: Tariq Toukan <tariqt@mellanox.com>
>> > Cc: Saeed Mahameed <saeedm@mellanox.com>
>> > ---
>> >  drivers/net/ethernet/mellanox/mlx4/cq.c |    6 +++++-
>> >  drivers/net/ethernet/mellanox/mlx4/eq.c |    9 +--------
>> >  2 files changed, 6 insertions(+), 9 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
>> > @@ -81,8 +81,9 @@ void mlx4_cq_tasklet_cb(unsigned long data)
>> >
>> >  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>> >  {
>> > -       unsigned long flags;
>> >         struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
>> > +       unsigned long flags;
>> > +       bool kick;
>> >
>> >         spin_lock_irqsave(&tasklet_ctx->lock, flags);
>> >         /* When migrating CQs between EQs will be implemented, please note
>> > @@ -92,7 +93,10 @@ static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
>> >          */
>> >         if (list_empty_careful(&cq->tasklet_ctx.list)) {
>> >                 atomic_inc(&cq->refcount);
>> > +               kick = list_empty(&tasklet_ctx->list);
>>
>> So the first CQ in would fire the tasklet, but wouldn't this cause CQE
>> processing loss in the same mlx4_eq_int() loop, if the tasklet were
>> fast enough to run while other CQEs are still adding themselves to the
>> tasklet_ctx->list?
>
>
> mlx4_eq_int() is a hard irq handler.
>
> How could a tasklet run in the middle of it?
>

Can the tasklet run on a different core?

> A tasklet is a softirq handler.
>
> A softirq must wait until the current hard irq handler is done.
>>
>> Anyway, I tried to find race scenarios that could cause such a thing,
>> but the synchronization looks good.
>>
>> >                 list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
>> > +               if (kick)
>> > +                       tasklet_schedule(&tasklet_ctx->task);
>> >         }
>> >         spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
>> >  }
>> > diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > +++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
>> > @@ -494,7 +494,7 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>> >  {
>> >         struct mlx4_priv *priv = mlx4_priv(dev);
>> >         struct mlx4_eqe *eqe;
>> > -       int cqn = -1;
>> > +       int cqn;
>> >         int eqes_found = 0;
>> >         int set_ci = 0;
>> >         int port;
>> > @@ -840,13 +840,6 @@ static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
>> >
>> >         eq_set_ci(eq, 1);
>> >
>> > -       /* cqn is 24bit wide but is initialized such that its higher bits
>> > -        * are ones too. Thus, if we got any event, cqn's high bits should be off
>> > -        * and we need to schedule the tasklet.
>> > -        */
>> > -       if (!(cqn & ~0xffffff))
>>
>> What if we simply changed this condition to:
>> if (!list_empty_careful(&eq->tasklet_ctx.list))
>>
>> Wouldn't this be sort of equivalent to what you did? This way we
>> would simply fire the tasklet only when needed and not on every
>> handled CQE.
>
> Still, this test would be done one million times per second on my hosts.
>
> What is the point, exactly?
>

The point is that if the EQ is full of CQEs from different CQs, you
would do the "kick = list_empty(&tasklet_ctx->list);" test for every
CQ added rather than once at the end.

In the mlx4_en case you have only two CQs on each EQ, but in RoCE/IB
you can have as many CQs as you want.

> Thanks.
>
>
Eric Dumazet Feb. 15, 2017, 2:04 p.m. UTC | #5
On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:

> 
> mlx4_eq_int() is a hard irq handler.
> 
> How could a tasklet run in the middle of it?
> 
> A tasklet is a softirq handler.

Speaking of mlx4_eq_int(), 50% of cycles are spent on the mb() (mfence)
in eq_set_ci().

I wonder why this very expensive mb() is required, right before exiting
the interrupt handler.
Eric Dumazet Feb. 15, 2017, 2:34 p.m. UTC | #6
On Wed, 2017-02-15 at 15:59 +0200, Saeed Mahameed wrote:

> Can the tasklet run on a different core?

No. Tasklets are scheduled on the local CPU, like softirqs in general.

They are not migrated unless the CPU is removed (hotplug).


> 
> The point is that if the EQ is full of CQEs from different CQs, you
> would do the "kick = list_empty(&tasklet_ctx->list);" test for every
> CQ added rather than once at the end.
> 
> In the mlx4_en case you have only two CQs on each EQ, but in RoCE/IB
> you can have as many CQs as you want.

list_empty() before one list_add_tail() is a single instruction.

By doing it right before manipulating the list, it comes with zero
cache line penalty, while doing it in the wrong place incurs one
extra cache line miss.

These extra tasklet invocations can cost 3% of CPU cycles.
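
For context, list_empty() on an already-hot cache line really is a
single compare; from include/linux/list.h (modulo READ_ONCE()
annotations in newer kernels):

static inline int list_empty(const struct list_head *head)
{
	return head->next == head;
}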
Matan Barak Feb. 15, 2017, 2:52 p.m. UTC | #7
On 15/02/2017 13:10, Saeed Mahameed wrote:
> On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> All rx and tx netdev interrupts are handled respectively by
>> mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
>>
>> But mlx4_eq_int() also fires a tasklet to service all items that were
>> queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
>> do unless a user CQE was handled.
>>
>> This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
>> tasklet invocations.
>>
>> This patch saves this overhead by carefully firing the tasklet directly
>> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>>

So, in the case of RDMA CQs, we add some per-CQE overhead of comparing
the list pointers and branching on that. Maybe we could add an
invoke_tasklet boolean field on mlx4_cq and return its value from
mlx4_cq_completion.
That way we could do invoke_tasklet |= mlx4_cq_completion(....);

Outside the while loop we could just do:
if (invoke_tasklet)
     tasklet_schedule

Anyway, I guess that even with the per-CQE overhead, the performance
impact here is pretty negligible - so I guess that's fine too :)
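
A sketch of that alternative inside the mlx4_eq_int() event loop
(hypothetical: it assumes mlx4_cq_completion() is changed to return
true when the completed CQ is a user-space CQ that needs the tasklet):

	bool invoke_tasklet = false;

	while (/* more EQEs to process */) {
		/* ... decode the EQE, extract cqn ... */
		invoke_tasklet |= mlx4_cq_completion(dev, cqn);
	}

	eq_set_ci(eq, 1);

	if (invoke_tasklet)
		tasklet_schedule(&eq->tasklet_ctx.task);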
Eric Dumazet Feb. 15, 2017, 3:26 p.m. UTC | #8
On Wed, 2017-02-15 at 16:52 +0200, Matan Barak (External) wrote:

> So, in the case of RDMA CQs, we add some per-CQE overhead of comparing
> the list pointers and branching on that. Maybe we could add an
> invoke_tasklet boolean field on mlx4_cq and return its value from
> mlx4_cq_completion.
> That way we could do invoke_tasklet |= mlx4_cq_completion(....);
> 
> Outside the while loop we could just do:
> if (invoke_tasklet)
>      tasklet_schedule
> 
> Anyway, I guess that even with the per-CQE overhead, the performance
> impact here is pretty negligible - so I guess that's fine too :)


The real question, or suggestion, would be to use/fire a tasklet only
under stress.

Firing a tasklet adds a lot of latency for user-space CQ completions,
since softirqs might have to be handled by a kernel thread (ksoftirqd).

I would be surprised if no customer was hit by your commit
(net/mlx4_core: Use tasklet for user-space CQ completion events),
especially when using specific (RT) scheduler classes.
Saeed Mahameed Feb. 16, 2017, 12:34 p.m. UTC | #9
On Wed, Feb 15, 2017 at 5:26 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 16:52 +0200, Matan Barak (External) wrote:
>
>> So, in the case of RDMA CQs, we add some per-CQE overhead of comparing
>> the list pointers and branching on that. Maybe we could add an
>> invoke_tasklet boolean field on mlx4_cq and return its value from
>> mlx4_cq_completion.
>> That way we could do invoke_tasklet |= mlx4_cq_completion(....);
>>
>> Outside the while loop we could just do:
>> if (invoke_tasklet)
>>      tasklet_schedule
>>
>> Anyway, I guess that even with the per-CQE overhead, the performance
>> impact here is pretty negligible - so I guess that's fine too :)
>
>
> The real question, or suggestion, would be to use/fire a tasklet only
> under stress.
>
> Firing a tasklet adds a lot of latency for user-space CQ completions,
> since softirqs might have to be handled by a kernel thread (ksoftirqd).
>

At least for the mlx4_en driver we don't need this tasklet, and it is
only adding overhead (we have NAPI).

We must consider removing it for mlx4_en CQs and moving the tasklet
handling to mlx4_ib.

I will ack the patch.
Saeed Mahameed Feb. 16, 2017, 12:38 p.m. UTC | #10
On Fri, Feb 10, 2017 at 2:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> All rx and tx netdev interrupts are handled respectively by
> mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
>
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
> do unless a user CQE was handled.
>
> This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
> tasklet invocations.
>
> This patch saves this overhead by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>

Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Saeed Mahameed Feb. 16, 2017, 12:44 p.m. UTC | #11
On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
>
>>
>> mlx4_eq_int() is a hard irq handler.
>>
>> How could a tasklet run in the middle of it?
>>
>> A tasklet is a softirq handler.
>
> Speaking of mlx4_eq_int(), 50% of cycles are spent on the mb() (mfence)
> in eq_set_ci().
>

I wonder why you have so many interrupts. Don't you have some kind of
interrupt moderation?
What test are you running that gets your CPU so busy?

> I wonder why this very expensive mb() is required, right before exiting
> the interrupt handler.

To make sure the HW knows we handled completions up to the (ci)
consumer index, so it will generate the next irq.
Eric Dumazet Feb. 16, 2017, 3:30 p.m. UTC | #12
On Thu, 2017-02-16 at 14:38 +0200, Saeed Mahameed wrote:

> Acked-by: Saeed Mahameed <saeedm@mellanox.com>

Thanks for reviewing this Saeed !
Eric Dumazet Feb. 16, 2017, 3:49 p.m. UTC | #13
On Thu, 2017-02-16 at 14:44 +0200, Saeed Mahameed wrote:
> On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
> >
> >>
> >> mlx4_eq_int() is a hard irq handler.
> >>
> >> How could a tasklet run in the middle of it?
> >>
> >> A tasklet is a softirq handler.
> >
> > Speaking of mlx4_eq_int(), 50% of cycles are spent on the mb() (mfence)
> > in eq_set_ci().
> >
> 
> I wonder why you have so many interrupts. Don't you have some kind of
> interrupt moderation?
> What test are you running that gets your CPU so busy?


Simply 8 RX queues. About 140,000 irqs per second per RX queue,
for a moderate network load on a 40Gbit NIC.

Interrupt moderation is a latency killer; we want our usecs back.

> 
> > I wonder why this very expensive mb() is required, right before exiting
> > the interrupt handler.
> 
> To make sure the HW knows we handled completions up to the (ci)
> consumer index, so it will generate the next irq.


So why is a mb() needed, exactly?

wmb() seems enough.
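
For reference, eq_set_ci() at the time looked roughly like this
(paraphrased from drivers/net/ethernet/mellanox/mlx4/eq.c):

static void eq_set_ci(struct mlx4_eq *eq, int req_not)
{
	__raw_writel((__force u32) cpu_to_be32((eq->cons_index & 0xffffff) |
					       req_not << 31),
		     eq->doorbell);
	/* We still want ordering, just less expensive */
	mb();
}

One plausible reason for the full barrier is that __raw_writel() has
no implied ordering, and the doorbell write must not be reordered
against the preceding reads of the consumed EQEs, which a wmb() alone
would not order; the thread leaves this open below.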
Saeed Mahameed Feb. 16, 2017, 9:17 p.m. UTC | #14
On Thu, Feb 16, 2017 at 5:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-02-16 at 14:44 +0200, Saeed Mahameed wrote:
>> On Wed, Feb 15, 2017 at 4:04 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> > On Wed, 2017-02-15 at 05:29 -0800, Eric Dumazet wrote:
>> >
>> >>
>> >> mlx4_eq_int() is a hard irq handler.
>> >>
>> >> How could a tasklet run in the middle of it?
>> >>
>> >> A tasklet is a softirq handler.
>> >
>> > Speaking of mlx4_eq_int(), 50% of cycles are spent on the mb() (mfence)
>> > in eq_set_ci().
>> >
>>
>> I wonder why you have so many interrupts. Don't you have some kind of
>> interrupt moderation?
>> What test are you running that gets your CPU so busy?
>
>
> Simply 8 RX queues. About 140,000 irqs per second per RX queue,
> for a moderate network load on a 40Gbit NIC.
>

So I guess you are not busy polling, and adaptive moderation decided
to lower rx-usecs for you, and you are looking to improve latency.

> Interrupt moderation is a latency killer; we want our usecs back.
>

Well, for RX we have adaptive moderation, and for TX we have
xmit_more, so interrupt moderation can be good if it is done right.

Do I understand from this that you are against interrupt moderation?

>>
>> > I wonder why this very expensive mb() is required, right before exiting
>> > the interrupt handler.
>>
>> To make sure the HW knows we handled completions up to the (ci)
>> consumer index, so it will generate the next irq.
>
>
> So why is a mb() needed, exactly?
>
> wmb() seems enough.
>

Yes, seems right. Not sure why it is mb() though; I will have to check
and get back to you on this.
Eric Dumazet Feb. 16, 2017, 11:13 p.m. UTC | #15
On Thu, 2017-02-16 at 23:17 +0200, Saeed Mahameed wrote:

> So I guess you are not busy polling, and adaptive moderation decided
> to lower rx-usecs for you, and you are looking to improve latency.
> 
> > Interrupt moderation is a latency killer; we want our usecs back.
> >
> 
> Well, for RX we have adaptive moderation, and for TX we have
> xmit_more, so interrupt moderation can be good if it is done right.
> 
> Do I understand from this that you are against interrupt moderation?

Absolutely against it. We prefer gro_flush_timeout.

The default values in mlx4 were never tuned for 40Gbit.

If you want to be able to reach 40Gbit on a single TCP flow, you want:

ethtool -C ethX rx-frames 16 rx-usecs-high 32

Or disable interrupt mitigation, of course.


Interrupt moderation comes for free with NAPI, really:

If under stress, the number of interrupts will naturally decrease,
so what's the point of hardware _delaying_ an incoming packet,
exactly???

BTW, interrupt moderation in mlx4 has at least two bugs.

I will send a patch.


> 
> >>
> >> > I wonder why this very expensive mb() is required, right before exiting
> >> > the interrupt handler.
> >>
> >> To make sure the HW knows we handled completions up to the (ci)
> >> consumer index, so it will generate the next irq.
> >
> >
> > So why is a mb() needed, exactly?
> >
> > wmb() seems enough.
> >
> 
> Yes, seems right. Not sure why it is mb() though; I will have to check
> and get back to you on this.

Thanks.
David Miller Feb. 17, 2017, 3:55 p.m. UTC | #16
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 10 Feb 2017 04:27:58 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> All rx and tx netdev interrupts are handled respectively by
> mlx4_en_rx_irq() and mlx4_en_tx_irq(), which simply schedule a NAPI.
> 
> But mlx4_eq_int() also fires a tasklet to service all items that were
> queued via mlx4_add_cq_to_tasklet(), yet this handler has nothing to
> do unless a user CQE was handled.
> 
> This is very confusing, as "mpstat -I SCPU ..." shows a huge number of
> tasklet invocations.
> 
> This patch saves this overhead by carefully firing the tasklet directly
> from mlx4_add_cq_to_tasklet(), removing four atomic operations per IRQ.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Applied, thanks.
Saeed Mahameed Feb. 19, 2017, 10:33 p.m. UTC | #17
On Fri, Feb 17, 2017 at 1:13 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2017-02-16 at 23:17 +0200, Saeed Mahameed wrote:
>
>> So I guess you are not busy polling, and adaptive moderation decided
>> to lower rx-usecs for you, and you are looking to improve latency.
>>
>> > Interrupt moderation is a latency killer; we want our usecs back.
>> >
>>
>> Well, for RX we have adaptive moderation, and for TX we have
>> xmit_more, so interrupt moderation can be good if it is done right.
>>
>> Do I understand from this that you are against interrupt moderation?
>
> Absolutely against it. We prefer gro_flush_timeout.
>
> The default values in mlx4 were never tuned for 40Gbit.
>
> If you want to be able to reach 40Gbit on a single TCP flow, you want:
>
> ethtool -C ethX rx-frames 16 rx-usecs-high 32
>
> Or disable interrupt mitigation, of course.
>
>
> Interrupt moderation comes for free with NAPI, really:
>
> If under stress, the number of interrupts will naturally decrease,
> so what's the point of hardware _delaying_ an incoming packet,
> exactly???
>

Theoretically NAPI provides some kind of interrupt-delaying mechanism
(only under stress), but is it as good as explicit HW interrupt
moderation?

For example, in mlx5 the adaptive mechanism is more intelligent than
the one we have in mlx4: it always tries to improve BW/packet rate
and reduce the number of interrupts per second, and it will park at
the optimal values.

As you know, the lower the interrupt overhead, the better the latency.

I assume there are many cases where NAPI will not be good enough and
will keep running at the max interrupt rate, while the above mechanism
will deliberately look for and find a better interrupt rate for at
least the same packet rate or even a better one (NAPI will not do
that for you).

But we will have to measure before we decide NAPI as-is is good enough.

> BTW, interrupt moderation in mlx4 has at least two bugs.
>
> I will send a patch.

Saw it, nice catch!

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/cq.c b/drivers/net/ethernet/mellanox/mlx4/cq.c
index 6b8635378f1fcb2aae4e8ac390bcd09d552c2256..fa6d2354a0e910ee160863e3cbe21a512d77bf03 100644
--- a/drivers/net/ethernet/mellanox/mlx4/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/cq.c
@@ -81,8 +81,9 @@  void mlx4_cq_tasklet_cb(unsigned long data)
 
 static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 {
-	unsigned long flags;
 	struct mlx4_eq_tasklet *tasklet_ctx = cq->tasklet_ctx.priv;
+	unsigned long flags;
+	bool kick;
 
 	spin_lock_irqsave(&tasklet_ctx->lock, flags);
 	/* When migrating CQs between EQs will be implemented, please note
@@ -92,7 +93,10 @@  static void mlx4_add_cq_to_tasklet(struct mlx4_cq *cq)
 	 */
 	if (list_empty_careful(&cq->tasklet_ctx.list)) {
 		atomic_inc(&cq->refcount);
+		kick = list_empty(&tasklet_ctx->list);
 		list_add_tail(&cq->tasklet_ctx.list, &tasklet_ctx->list);
+		if (kick)
+			tasklet_schedule(&tasklet_ctx->task);
 	}
 	spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index 0509996957d9664b612358dd805359f4bc67b8dc..39232b6a974f4b4b961d3b0b8634f04e6b9d0caa 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -494,7 +494,7 @@  static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 {
 	struct mlx4_priv *priv = mlx4_priv(dev);
 	struct mlx4_eqe *eqe;
-	int cqn = -1;
+	int cqn;
 	int eqes_found = 0;
 	int set_ci = 0;
 	int port;
@@ -840,13 +840,6 @@  static int mlx4_eq_int(struct mlx4_dev *dev, struct mlx4_eq *eq)
 
 	eq_set_ci(eq, 1);
 
-	/* cqn is 24bit wide but is initialized such that its higher bits
-	 * are ones too. Thus, if we got any event, cqn's high bits should be off
-	 * and we need to schedule the tasklet.
-	 */
-	if (!(cqn & ~0xffffff))
-		tasklet_schedule(&eq->tasklet_ctx.task);
-
 	return eqes_found;
 }