diff mbox

mlx4 netpoll and rx/tx weirdness

Message ID 87ob17kz96.fsf_-_@xmission.com
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Eric W. Biederman March 15, 2014, 8:01 p.m. UTC
David Miller <davem@davemloft.net> writes:

> From: Ben Hutchings <ben@decadent.org.uk>
> Date: Sat, 15 Mar 2014 15:23:34 +0000
>
>> On Fri, 2014-03-14 at 18:11 -0700, Eric W. Biederman wrote:
>>> Processing any incoming packets with a with a napi budget of 0
>>> is incorrect driver behavior.
>>> 
>>> This matters as netpoll will shortly call drivers with a budget of 0
>>> to avoid receive packet processing happening in hard irq context.
>> 
>> But this also prevents handling TX completions, at which point you may
>> as well change efx_netpoll() to a no-op.  And then, does it make sense
>> to implement ndo_poll_controller at all?
>> 
>> Note that sfc does have a module parameter to enable separate RX and TX
>> completions so they could be polled separately, but it is disabled by
>> default.
>
> TX completions should run unconditionally, irregardless of the given
> budget.
>
> This is how I have coded all of my drivers, and I how I tell others
> to do so.

Given this comment I have to ask:  How insane is the mellanox mlx4
driver that has separate rx and tx queues, separate rx and tx interrupts
and uses separate napi bottom halves to process each, and honors the
budget passed into it's rx napi handler.

Right now the mlx4 code disables interrupts and calls napi_synchronize
in it's netpoll routine instead of just scheduling the napi bottom
halves as all of the sane drivers do.  napi_synchronize trying to sleep
in hard irq context is pretty terrible, but I can see roughly how that
should be fixed.

What I can't see is what is a clean thing to do with the mlx4 tx bottom
napi bottom half.  As it won't processes the tx cq when I pass in a
budget of 0.

What I can't see is how to prevent a netpoll stalling after enough
packets are transmitted from hard irq context say with sysrq-l on
a machine with enough cpus to fill the tx queue.

If possible I would very much like to get this to work.

Eric

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Eric Dumazet March 16, 2014, 4:17 p.m. UTC | #1
On Sat, 2014-03-15 at 13:01 -0700, Eric W. Biederman wrote:

> Given this comment I have to ask:  How insane is the mellanox mlx4
> driver that has separate rx and tx queues, separate rx and tx interrupts
> and uses separate napi bottom halves to process each, and honors the
> budget passed into it's rx napi handler.

I believe this driver had TX completion from hard irq, and NAPI polling
for RX.

This was changed quite recently to also do TX completion from softirq
handler.

> 
> Right now the mlx4 code disables interrupts and calls napi_synchronize
> in it's netpoll routine instead of just scheduling the napi bottom
> halves as all of the sane drivers do.  napi_synchronize trying to sleep
> in hard irq context is pretty terrible, but I can see roughly how that
> should be fixed.
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> index fa5ee719e04b..2e6fded14e60 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
> @@ -1302,16 +1302,10 @@ out:
>  static void mlx4_en_netpoll(struct net_device *dev)
>  {
>         struct mlx4_en_priv *priv = netdev_priv(dev);
> -       struct mlx4_en_cq *cq;
> -       unsigned long flags;
>         int i;
>  
>         for (i = 0; i < priv->rx_ring_num; i++) {
> -               cq = priv->rx_cq[i];
> -               spin_lock_irqsave(&cq->lock, flags);
> -               napi_synchronize(&cq->napi);
> -               mlx4_en_process_rx_cq(dev, cq, 0);
> -               spin_unlock_irqrestore(&cq->lock, flags);
> +               napi_schedule(&priv->tx_cq[i]->napi);
>         }
>  }

Wait a minute, I thought ndo_poll_controller() had to be synchronous,
not schedule a napi ?

> 
> What I can't see is what is a clean thing to do with the mlx4 tx bottom
> napi bottom half.  As it won't processes the tx cq when I pass in a
> budget of 0.

Just ignore the budget to drain tx queues, as David said.

> 
> What I can't see is how to prevent a netpoll stalling after enough
> packets are transmitted from hard irq context say with sysrq-l on
> a machine with enough cpus to fill the tx queue.

Is this specific to mlx4 ?


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller March 17, 2014, 9:22 p.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Sun, 16 Mar 2014 09:17:01 -0700

> Wait a minute, I thought ndo_poll_controller() had to be synchronous,
> not schedule a napi ?

ndo_poll_controller() basically runs the hardware interrupt handler,
which might schedule NAPI processing.  Then netpoll invokes
the NAPI poll in whatever state ndo_poll_controller() left it in.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet March 17, 2014, 9:40 p.m. UTC | #3
On Mon, 2014-03-17 at 17:22 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Sun, 16 Mar 2014 09:17:01 -0700
> 
> > Wait a minute, I thought ndo_poll_controller() had to be synchronous,
> > not schedule a napi ?
> 
> ndo_poll_controller() basically runs the hardware interrupt handler,
> which might schedule NAPI processing.  Then netpoll invokes
> the NAPI poll in whatever state ndo_poll_controller() left it in.

Ah right, this makes sense, poll_napi() uses a spin_trylock()...



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index fa5ee719e04b..2e6fded14e60 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -1302,16 +1302,10 @@  out:
 static void mlx4_en_netpoll(struct net_device *dev)
 {
        struct mlx4_en_priv *priv = netdev_priv(dev);
-       struct mlx4_en_cq *cq;
-       unsigned long flags;
        int i;
 
        for (i = 0; i < priv->rx_ring_num; i++) {
-               cq = priv->rx_cq[i];
-               spin_lock_irqsave(&cq->lock, flags);
-               napi_synchronize(&cq->napi);
-               mlx4_en_process_rx_cq(dev, cq, 0);
-               spin_unlock_irqrestore(&cq->lock, flags);
+               napi_schedule(&priv->tx_cq[i]->napi);
        }
 }