
[net] ixgbe: check return value of napi_complete_done()

Message ID: 20180920190113.490005-1-songliubraving@fb.com
State: Accepted
Delegated to: Jeff Kirsher
Series: [net] ixgbe: check return value of napi_complete_done()

Commit Message

Song Liu Sept. 20, 2018, 7:01 p.m. UTC
The NIC driver should only enable interrupts when napi_complete_done()
returns true. This patch adds the check for ixgbe.

Cc: stable@vger.kernel.org # 4.10+
Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++++++-----
 1 file changed, 7 insertions(+), 5 deletions(-)

Comments

Eric Dumazet Sept. 20, 2018, 8:35 p.m. UTC | #1
On 09/20/2018 12:01 PM, Song Liu wrote:
> The NIC driver should only enable interrupts when napi_complete_done()
> returns true. This patch adds the check for ixgbe.
> 
> Cc: stable@vger.kernel.org # 4.10+
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---


Well, unfortunately we do not know why this is needed,
which is why I have not yet sent this patch formally.

netpoll has correct synchronization:

poll_napi() places the current cpu number into napi->poll_owner before calling poll_one_napi().

netpoll_poll_lock() also uses napi->poll_owner.

When netpoll calls the ixgbe poll() method, it passes a budget of 0,
meaning napi_complete_done() is not called.
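
To illustrate (hand-simplified from net/core/netpoll.c and the ixgbe poll()
of this era, not the exact source):

        /* netpoll side (poll_one_napi): the driver's poll() gets a budget of 0,
         * so it must not do RX work and must not complete the NAPI
         */
        work = napi->poll(napi, 0);

        /* driver side (ixgbe_poll): clean the TX rings, then bail out before
         * napi_complete_done() / interrupt re-enable when budget <= 0
         */
        ixgbe_for_each_ring(ring, q_vector->tx)
                clean_complete &= ixgbe_clean_tx_irq(q_vector, ring, budget);

        /* Exit if we are called by netpoll */
        if (budget <= 0)
                return budget;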

As long as we cannot explain the problem properly in the changelog,
we should investigate; otherwise we will probably see dozens of patches coming
that try to fix a 'potential hazard'.

Thanks.
Kirsher, Jeffrey T Sept. 20, 2018, 9:01 p.m. UTC | #2
On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
> On 09/20/2018 12:01 PM, Song Liu wrote:
> > The NIC driver should only enable interrupts when napi_complete_done()
> > returns true. This patch adds the check for ixgbe.
> > 
> > Cc: stable@vger.kernel.org # 4.10+
> > Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> > Suggested-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Song Liu <songliubraving@fb.com>
> > ---
> 
> 
> Well, unfortunately we do not know why this is needed,
> this is why I have not yet sent this patch formally.
> 
> netpoll has correct synchronization :
> 
> poll_napi() places into napi->poll_owner current cpu number before
> calling poll_one_napi()
> 
> netpoll_poll_lock() does also use napi->poll_owner
> 
> When netpoll calls ixgbe poll() method, it passed a budget of 0,
> meaning napi_complete_done() is not called.
> 
> As long as we can not explain the problem properly in the changelog,
> we should investigate, otherwise we will probably see coming dozens of
> patches
> trying to fix a 'potential hazard'.

Agreed, which is why I have our validation and developers looking into it,
while we test the current patch from Song.
Song Liu Sept. 20, 2018, 10:42 p.m. UTC | #3
> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
> 
> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>> The NIC driver should only enable interrupts when napi_complete_done()
>>> returns true. This patch adds the check for ixgbe.
>>> 
>>> Cc: stable@vger.kernel.org # 4.10+
>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>> Suggested-by: Eric Dumazet <edumazet@google.com>
>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>> ---
>> 
>> 
>> Well, unfortunately we do not know why this is needed,
>> this is why I have not yet sent this patch formally.
>> 
>> netpoll has correct synchronization :
>> 
>> poll_napi() places into napi->poll_owner current cpu number before
>> calling poll_one_napi()
>> 
>> netpoll_poll_lock() does also use napi->poll_owner
>> 
>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>> meaning napi_complete_done() is not called.
>> 
>> As long as we can not explain the problem properly in the changelog,
>> we should investigate, otherwise we will probably see coming dozens of
>> patches
>> trying to fix a 'potential hazard'.
> 
> Agreed, which is why I have our validation and developers looking into it,
> while we test the current patch from Song.

I figured out what the issue is here, and I have a proposal to fix it. I
have verified that this fixes the issue in our tests, but Alexei suggests
that it may not be the right way to fix it.

Here is what happened:

netpoll tries to send an skb with netpoll_start_xmit(). If that fails, it
calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
within the same NIC. 
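
(For reference, a typical multi-queue ndo_poll_controller() looks roughly like
this; an illustrative sketch with made-up names, not any particular driver's
exact code:)

        static void example_netpoll(struct net_device *dev)
        {
                struct example_adapter *adapter = netdev_priv(dev);
                int i;

                /* schedule every queue's NAPI onto the calling CPU */
                for (i = 0; i < adapter->num_q_vectors; i++)
                        napi_schedule(&adapter->q_vector[i]->napi);
        }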

This is problematic, because at the end napi_schedule() calls:

    ____napi_schedule(this_cpu_ptr(&softnet_data), n);

which attaches these NAPIs to softnet_data on THIS CPU. This is done
via napi->poll_list. 

Then suddenly ksoftirqd on this CPU owns multiple NAPIs, and it will
not give up ownership until it calls napi_complete_done(). However,
for a very busy server, we usually use 16 CPUs to poll NAPI, so this
CPU can easily be overloaded. As a result, each call of napi->poll()
will hit the budget (of 64), it will not call napi_complete_done(),
and the NAPI stays in the poll_list of this CPU.
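
(This is the napi_poll() behavior I am describing, paraphrased and simplified
from net/core/dev.c, not the exact source:)

        weight = n->weight;             /* 64 here */
        work = n->poll(n, weight);

        if (work == weight)
                /* budget exhausted: the NAPI is re-queued on this CPU's
                 * poll_list instead of completing, so it never leaves
                 */
                list_add_tail(&n->poll_list, repoll);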

When this happens, the host usually cannot get out of this state until
we throttle/stop client traffic. 


I am pretty confident this is what happened. Please let me know if 
anything above doesn't make sense. 


Here is my proposal to fix it: instead of polling all NAPIs within one
NIC, I would have netpoll poll only the NAPI that will free space
for netpoll_start_xmit(). I attached my two RFC patches to the end of
this email.

I chatted with Alexei about this. He thinks polling only one NAPI may
not guarantee that netpoll makes progress on the TX queue we are aiming
for. Also, the bigger problem may be the fact that NAPIs can get
pinned to one CPU and cannot get released.

At this point, I really don't know the best way to fix this.

I will also work on a repro with netperf. 

Please let me know your suggestions. 

Thanks,
Song


========================= RFCs ============================


From 7b6d1d0e21bc7a0034dbcacfdaaec95a0e5f5b14 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Thu, 20 Sep 2018 13:09:26 -0700
Subject: [RFC net 1/2] net: netpoll: when failed to send a message, only poll
 the target NAPI

Currently, netpoll polls all NAPIs in the NIC if it fails to send data to
one TX queue. This is not necessary and sometimes problematic, because
ndo_poll_controller() calls napi_schedule() for all NAPIs in the NIC and
attaches the available NAPIs to this_cpu's softnet_data->poll_list.
A modern high-end NIC has tens of NAPIs and requires multiple CPUs to poll
the data from the queues. Attaching multiple NAPIs to one CPU's poll_list
fans the softirq work of multiple CPUs into one CPU. As a result, we see one
CPU pegged in ksoftirqd, while other CPUs cannot get assigned work.

This patch fixes this issue by polling only the target NAPI when a send
fails. A new hook, ndo_netpoll_get_napi(), is added for the driver to map
a busy TX queue to the NAPI to poll. If the driver implements this hook,
netpoll will only poll the NAPI of the target TX queue.

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 include/linux/netdevice.h |  5 +++++
 net/core/netpoll.c        | 30 ++++++++++++++++++++++++------
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 6a7ac01fa605..576414b150ff 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1033,6 +1033,9 @@ struct dev_ifalias {
  *
  * void (*ndo_poll_controller)(struct net_device *dev);
  *     Run napi_schedule() for all NAPI within this device.
+ * struct napi_struct* (*ndo_netpoll_get_napi)(struct net_device *dev,
+ *                                             struct netdev_queue *txq);
+ *     Returns the NAPI to poll for the given TX queue.
  *
  *     SR-IOV management functions.
  * int (*ndo_set_vf_mac)(struct net_device *dev, int vf, u8* mac);
@@ -1267,6 +1270,8 @@ struct net_device_ops {
                                                        __be16 proto, u16 vid);
 #ifdef CONFIG_NET_POLL_CONTROLLER
        void                    (*ndo_poll_controller)(struct net_device *dev);
+       struct napi_struct*     (*ndo_netpoll_get_napi)(struct net_device *dev,
+                                                       struct netdev_queue *txq);
        int                     (*ndo_netpoll_setup)(struct net_device *dev,
                                                     struct netpoll_info *info);
        void                    (*ndo_netpoll_cleanup)(struct net_device *dev);
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 57557a6a950c..218a46acbdfc 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -187,7 +187,7 @@ static void poll_napi(struct net_device *dev)
        }
 }

-static void netpoll_poll_dev(struct net_device *dev)
+static void netpoll_poll_dev(struct net_device *dev, struct napi_struct *napi)
 {
        const struct net_device_ops *ops;
        struct netpoll_info *ni = rcu_dereference_bh(dev->npinfo);
@@ -210,10 +210,23 @@ static void netpoll_poll_dev(struct net_device *dev)
                return;
        }

-       /* Process pending work on NIC */
-       ops->ndo_poll_controller(dev);
+       if (napi) {
+               int cpu = smp_processor_id();

-       poll_napi(dev);
+               /* If we know which NAPI to poll, just poll it. */
+               napi_schedule(napi);
+               if (cmpxchg(&napi->poll_owner, -1, cpu) == -1) {
+                       poll_one_napi(napi);
+                       smp_store_release(&napi->poll_owner, -1);
+               }
+       } else {
+               /* If we are not sure which NAPI to poll, process all
+                * pending work on NIC.
+                */
+               ops->ndo_poll_controller(dev);
+
+               poll_napi(dev);
+       }

        up(&ni->dev_lock);

@@ -303,7 +316,7 @@ static struct sk_buff *find_skb(struct netpoll *np, int len, int reserve)

        if (!skb) {
                if (++count < 10) {
-                       netpoll_poll_dev(np->dev);
+                       netpoll_poll_dev(np->dev, NULL);
                        goto repeat;
                }
                return NULL;
@@ -345,8 +358,13 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
        /* don't get messages out of order, and no recursion */
        if (skb_queue_len(&npinfo->txq) == 0 && !netpoll_owner_active(dev)) {
                struct netdev_queue *txq;
+               struct napi_struct *napi;

                txq = netdev_pick_tx(dev, skb, NULL);
+               if (dev->netdev_ops->ndo_netpoll_get_napi)
+                       napi = dev->netdev_ops->ndo_netpoll_get_napi(dev, txq);
+               else
+                       napi = NULL;

                /* try until next clock tick */
                for (tries = jiffies_to_usecs(1)/USEC_PER_POLL;
@@ -363,7 +381,7 @@ void netpoll_send_skb_on_dev(struct netpoll *np, struct sk_buff *skb,
                        }

                        /* tickle device maybe there is some cleanup */
-                       netpoll_poll_dev(np->dev);
+                       netpoll_poll_dev(np->dev, napi);

                        udelay(USEC_PER_POLL);
                }
--
2.17.1






From 9f6bd53cf011820054802f17ede2f73de0ec7d41 Mon Sep 17 00:00:00 2001
From: Song Liu <songliubraving@fb.com>
Date: Thu, 20 Sep 2018 15:08:33 -0700
Subject: [RFC net 2/2] net: bnxt: add ndo_netpoll_get_napi

Signed-off-by: Song Liu <songliubraving@fb.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 177587f9c3f1..0a82a61a16d9 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -7685,6 +7685,20 @@ static void bnxt_poll_controller(struct net_device *dev)
                napi_schedule(&txr->bnapi->napi);
        }
 }
+
+static struct napi_struct* bnxt_netpoll_get_napi(struct net_device *dev,
+                                                struct netdev_queue *txq)
+{
+       struct bnxt *bp = netdev_priv(dev);
+       int i;
+
+       for (i = 0; i < bp->tx_nr_rings; i++) {
+               struct bnxt_tx_ring_info *txr = &bp->tx_ring[i];
+               if (txq == netdev_get_tx_queue(dev, txr->txq_index))
+                       return &txr->bnapi->napi;
+       }
+       return NULL;
+}
 #endif

 static void bnxt_timer(struct timer_list *t)
@@ -8522,6 +8536,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
 #endif
 #ifdef CONFIG_NET_POLL_CONTROLLER
        .ndo_poll_controller    = bnxt_poll_controller,
+       .ndo_netpoll_get_napi   = bnxt_netpoll_get_napi,
 #endif
        .ndo_setup_tc           = bnxt_setup_tc,
 #ifdef CONFIG_RFS_ACCEL
--
2.17.1
Eric Dumazet Sept. 20, 2018, 11:22 p.m. UTC | #4
On 09/20/2018 03:42 PM, Song Liu wrote:
> 
> 
>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
>>
>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>> returns true. This patch adds the check for ixgbe.
>>>>
>>>> Cc: stable@vger.kernel.org # 4.10+
>>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>>> Suggested-by: Eric Dumazet <edumazet@google.com>
>>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>>> ---
>>>
>>>
>>> Well, unfortunately we do not know why this is needed,
>>> this is why I have not yet sent this patch formally.
>>>
>>> netpoll has correct synchronization :
>>>
>>> poll_napi() places into napi->poll_owner current cpu number before
>>> calling poll_one_napi()
>>>
>>> netpoll_poll_lock() does also use napi->poll_owner
>>>
>>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>>> meaning napi_complete_done() is not called.
>>>
>>> As long as we can not explain the problem properly in the changelog,
>>> we should investigate, otherwise we will probably see coming dozens of
>>> patches
>>> trying to fix a 'potential hazard'.
>>
>> Agreed, which is why I have our validation and developers looking into it,
>> while we test the current patch from Song.
> 
> I figured out what is the issue here. And I have a proposal to fix it. I 
> have verified that this fixes the issue in our tests. But Alexei suggests
> that it may not be the right way to fix. 
> 
> Here is what happened:
> 
> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
> within the same NIC. 
> 
> This is problematic, because at the end napi_schedule() calls:
> 
>     ____napi_schedule(this_cpu_ptr(&softnet_data), n);
> 
> which attached these NAPIs to softnet_data on THIS CPU. This is done
> via napi->poll_list. 
> 
> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
> not give up the ownership until it calls napi_complete_done(). However, 
> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
> CPU can easily be overloaded. And as a result, each call of napi->poll() 
> will hit budget (of 64), and it will not call napi_complete_done(), 
> and the NAPI stays in the poll_list of this CPU. 
> 
> When this happens, the host usually cannot get out of this state until
> we throttle/stop client traffic. 
> 
> 
> I am pretty confident this is what happened. Please let me know if 
> anything above doesn't make sense. 
> 
> 
> Here is my proposal to fix it: Instead of polling all NAPIs within one
> NIC, I would have netpoll to only poll the NAPI that will free space
> for netpoll_start_xmit(). I attached my two RFC patches to the end of 
> this email. 
> 
> I chatted with Alexei about this. He think polling only one NAPI may 
> not guarantee netpoll make progress with the TX queue we are aiming 
> for. Also, the bigger problem may be the fact that NAPIs could get 
> pinned to one CPU and cannot get released. 
> 
> At this point, I really don't know what is the best way to fix this. 
> 
> I will also work on a repro with netperf. 

Thanks !

> 
> Please let me know your suggestions. 
> 

Yeah, maybe NICs using NAPI should not provide an ndo_poll_controller() method at all,
since it is very risky (it potentially grabs many NAPIs and ends up in this locked situation).

poll_napi() could attempt to free skbs one NAPI at a time,
without the current cpu stealing all NAPIs.


diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
        }
 
        ops = dev->netdev_ops;
-       if (!ops->ndo_poll_controller) {
-               up(&ni->dev_lock);
-               return;
-       }
-
-       /* Process pending work on NIC */
-       ops->ndo_poll_controller(dev);
+       if (ops->ndo_poll_controller)
+               ops->ndo_poll_controller(dev);
 
        poll_napi(dev);
Song Liu Sept. 20, 2018, 11:43 p.m. UTC | #5
> On Sep 20, 2018, at 4:22 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 09/20/2018 03:42 PM, Song Liu wrote:
>> 
>> 
>>> On Sep 20, 2018, at 2:01 PM, Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:
>>> 
>>> On Thu, 2018-09-20 at 13:35 -0700, Eric Dumazet wrote:
>>>> On 09/20/2018 12:01 PM, Song Liu wrote:
>>>>> The NIC driver should only enable interrupts when napi_complete_done()
>>>>> returns true. This patch adds the check for ixgbe.
>>>>> 
>>>>> Cc: stable@vger.kernel.org # 4.10+
>>>>> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
>>>>> Suggested-by: Eric Dumazet <edumazet@google.com>
>>>>> Signed-off-by: Song Liu <songliubraving@fb.com>
>>>>> ---
>>>> 
>>>> 
>>>> Well, unfortunately we do not know why this is needed,
>>>> this is why I have not yet sent this patch formally.
>>>> 
>>>> netpoll has correct synchronization :
>>>> 
>>>> poll_napi() places into napi->poll_owner current cpu number before
>>>> calling poll_one_napi()
>>>> 
>>>> netpoll_poll_lock() does also use napi->poll_owner
>>>> 
>>>> When netpoll calls ixgbe poll() method, it passed a budget of 0,
>>>> meaning napi_complete_done() is not called.
>>>> 
>>>> As long as we can not explain the problem properly in the changelog,
>>>> we should investigate, otherwise we will probably see coming dozens of
>>>> patches
>>>> trying to fix a 'potential hazard'.
>>> 
>>> Agreed, which is why I have our validation and developers looking into it,
>>> while we test the current patch from Song.
>> 
>> I figured out what is the issue here. And I have a proposal to fix it. I 
>> have verified that this fixes the issue in our tests. But Alexei suggests
>> that it may not be the right way to fix. 
>> 
>> Here is what happened:
>> 
>> netpoll tries to send skb with netpoll_start_xmit(). If that fails, it 
>> calls netpoll_poll_dev(), which calls ndo_poll_controller(). Then, in 
>> the driver, ndo_poll_controller() calls napi_schedule() for ALL NAPIs 
>> within the same NIC. 
>> 
>> This is problematic, because at the end napi_schedule() calls:
>> 
>>    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
>> 
>> which attached these NAPIs to softnet_data on THIS CPU. This is done
>> via napi->poll_list. 
>> 
>> Then suddenly ksoftirqd on this CPU owns multiple NAPIs. And it will
>> not give up the ownership until it calls napi_complete_done(). However, 
>> for a very busy server, we usually use 16 CPUs to poll NAPI, so this
>> CPU can easily be overloaded. And as a result, each call of napi->poll() 
>> will hit budget (of 64), and it will not call napi_complete_done(), 
>> and the NAPI stays in the poll_list of this CPU. 
>> 
>> When this happens, the host usually cannot get out of this state until
>> we throttle/stop client traffic. 
>> 
>> 
>> I am pretty confident this is what happened. Please let me know if 
>> anything above doesn't make sense. 
>> 
>> 
>> Here is my proposal to fix it: Instead of polling all NAPIs within one
>> NIC, I would have netpoll to only poll the NAPI that will free space
>> for netpoll_start_xmit(). I attached my two RFC patches to the end of 
>> this email. 
>> 
>> I chatted with Alexei about this. He think polling only one NAPI may 
>> not guarantee netpoll make progress with the TX queue we are aiming 
>> for. Also, the bigger problem may be the fact that NAPIs could get 
>> pinned to one CPU and cannot get released. 
>> 
>> At this point, I really don't know what is the best way to fix this. 
>> 
>> I will also work on a repro with netperf. 
> 
> Thanks !
> 
>> 
>> Please let me know your suggestions. 
>> 
> 
> Yeah, maybe that NICs using NAPI could not provide an ndo_poll_controller() method at all,
> since it is very risky (potentially grab many NAPI, and end up in this locked situation)
> 
> poll_napi() could attempt to free skbs one napi at a time,
> without the current cpu stealing all NAPI.
> 
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 57557a6a950cc9cdff959391576a03381d328c1a..a992971d366090ba69d5c1af32eadd554d6880cf 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -205,13 +205,8 @@ static void netpoll_poll_dev(struct net_device *dev)
>        }
> 
>        ops = dev->netdev_ops;
> -       if (!ops->ndo_poll_controller) {
> -               up(&ni->dev_lock);
> -               return;
> -       }
> -
> -       /* Process pending work on NIC */
> -       ops->ndo_poll_controller(dev);
> +       if (ops->ndo_poll_controller)
> +               ops->ndo_poll_controller(dev);
> 
>        poll_napi(dev);
> 

I tried to totally skip ndo_poll_controller() here. It did avoid hitting
the issue. However, netpoll will drop (fail to send) more packets. 

Thanks,
Song
Eric Dumazet Sept. 20, 2018, 11:49 p.m. UTC | #6
On 09/20/2018 04:43 PM, Song Liu wrote:
> 

> I tried to totally skip ndo_poll_controller() here. It did avoid hitting
> the issue. However, netpoll will drop (fail to send) more packets. 
>

Why is it failing ?

If you are under high memory pressure, and you absolutely want memory to send
netpoll packets, then maybe you want to grab all NAPI contexts as a way to prevent other cpus
from feeding incoming packets to the host and adding more memory pressure ;)
Song Liu Sept. 21, 2018, 7:17 a.m. UTC | #7
> On Sep 20, 2018, at 4:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> 
> 
> 
> On 09/20/2018 04:43 PM, Song Liu wrote:
>> 
> 
>> I tried to totally skip ndo_poll_controller() here. It did avoid hitting
>> the issue. However, netpoll will drop (fail to send) more packets. 
>> 
> 
> Why is it failing ?
> 
> If you are under high memory pressure, then maybe if you absolutely want memory to send
> netpoll packets, you want to grab all NAPI contexts as a way to prevent other cpus 
> from feeding incoming packets to the host and add more memory pressure ;)
> 

I did the test with Eric's latest patch (and disabled ndo_poll_controller
in the driver). The result didn't show a significant increase in dropped packets.
I guess the packet drops in my earlier test were caused by some other changes
I had mixed in.

So I think this patch does fix the issue. Thanks Eric!

For ixgbe, I think we need to check the napi_complete_done() return value
anyway. Otherwise, the driver will re-enable IRQs while still in polling mode.

Song
Eric Dumazet Sept. 21, 2018, 1:33 p.m. UTC | #8
On 09/21/2018 12:17 AM, Song Liu wrote:
> 
> 
>> On Sep 20, 2018, at 4:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 09/20/2018 04:43 PM, Song Liu wrote:
>>>
>>
>>> I tried to totally skip ndo_poll_controller() here. It did avoid hitting
>>> the issue. However, netpoll will drop (fail to send) more packets. 
>>>
>>
>> Why is it failing ?
>>
>> If you are under high memory pressure, then maybe if you absolutely want memory to send
>> netpoll packets, you want to grab all NAPI contexts as a way to prevent other cpus 
>> from feeding incoming packets to the host and add more memory pressure ;)
>>
> 
> I did the test with Eric's latest patch (and disable ndo_poll_controller 
> in driver). The result didn't show significant increase in drop packets. 
> I guess packet drops in my earlier test was caused by some other changes
> I mixed there. 
> 
> So I think this patch does fix the issue. Thanks Eric!

Great, this is awesome.

I will prepare a patch series for net tree.

The core infrastructure is just better at being able to drain TX completions
without risking stealing the NAPI context forever.


> 
> For ixgbe, I think we need to check napi_complete_done() return value
> anyway. Otherwise, the driver will enable IRQ in polling mode.  
> 

Sure, let's prepare such optimizations for net-next.
Alexei Starovoitov Sept. 21, 2018, 2:55 p.m. UTC | #9
On 9/21/18 6:33 AM, Eric Dumazet wrote:
>
>
> On 09/21/2018 12:17 AM, Song Liu wrote:
>>
>>
>>> On Sep 20, 2018, at 4:49 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>
>>>
>>>
>>> On 09/20/2018 04:43 PM, Song Liu wrote:
>>>>
>>>
>>>> I tried to totally skip ndo_poll_controller() here. It did avoid hitting
>>>> the issue. However, netpoll will drop (fail to send) more packets.
>>>>
>>>
>>> Why is it failing ?
>>>
>>> If you are under high memory pressure, then maybe if you absolutely want memory to send
>>> netpoll packets, you want to grab all NAPI contexts as a way to prevent other cpus
>>> from feeding incoming packets to the host and add more memory pressure ;)
>>>
>>
>> I did the test with Eric's latest patch (and disable ndo_poll_controller
>> in driver). The result didn't show significant increase in drop packets.
>> I guess packet drops in my earlier test was caused by some other changes
>> I mixed there.
>>
>> So I think this patch does fix the issue. Thanks Eric!
>
> Great, this is awesome.
>
> I will prepare a patch series for net tree.
>
> The core infrastructure is just better at being able to drain TX completions
> without risking stealing the NAPI context forever.

should we remove ndo_poll_controller then?
My understanding is that the patch helps by not letting
drivers do napi_schedule() for all queues onto this_cpu, right?
But most of the drivers do exactly that in their ndo_poll_controller
implementations, which means most of the drivers will experience
this nasty behavior.
Eric Dumazet Sept. 21, 2018, 2:59 p.m. UTC | #10
On 09/21/2018 07:55 AM, Alexei Starovoitov wrote:

> 
> should we remove ndo_poll_controller then?
> My understanding that the patch helps by not letting
> drivers do napi_schedule() for all queues into this_cpu, right?
> But most of the drivers do exactly that in their ndo_poll_controller
> implementations. Means most of the drivers will experience
> this nasty behavior.
> 

Some legacy drivers do not use NAPI yet, but provide ndo_poll_controller().

I believe users caring about system behavior with multi-queue NICs are
all using NAPI-enabled drivers, so this should be fine.
Alexei Starovoitov Sept. 21, 2018, 3:14 p.m. UTC | #11
On 9/21/18 7:59 AM, Eric Dumazet wrote:
>
>
> On 09/21/2018 07:55 AM, Alexei Starovoitov wrote:
>
>>
>> should we remove ndo_poll_controller then?
>> My understanding that the patch helps by not letting
>> drivers do napi_schedule() for all queues into this_cpu, right?
>> But most of the drivers do exactly that in their ndo_poll_controller
>> implementations. Means most of the drivers will experience
>> this nasty behavior.
>>
>
> Some legacy drivers do not use NAPI yet, but provide ndo_poll_controller()
>
> I believe users caring about system behavior with multi queue NIC are
> all using NAPI enabled drivers, so this should be fine.

I'm not following.
All modern napi-enabled drivers need to _remove_ ndo_poll_controller
from the driver. This is a lot of churn.
Isn't it cleaner (less error prone) to introduce a new ndo
for legacy drivers without napi?
Eric Dumazet Sept. 21, 2018, 3:43 p.m. UTC | #12
On 09/21/2018 08:14 AM, Alexei Starovoitov wrote:
> On 9/21/18 7:59 AM, Eric Dumazet wrote:
>>
>>
>> On 09/21/2018 07:55 AM, Alexei Starovoitov wrote:
>>
>>>
>>> should we remove ndo_poll_controller then?
>>> My understanding that the patch helps by not letting
>>> drivers do napi_schedule() for all queues into this_cpu, right?
>>> But most of the drivers do exactly that in their ndo_poll_controller
>>> implementations. Means most of the drivers will experience
>>> this nasty behavior.
>>>
>>
>> Some legacy drivers do not use NAPI yet, but provide ndo_poll_controller()
>>
>> I believe users caring about system behavior with multi queue NIC are
>> all using NAPI enabled drivers, so this should be fine.
> 
> I'm not following.
> All modern napi enabled drivers need to _remove_ ndo_poll_controller
> from the driver. This is a lot of churn.

netpoll could skip calling ndo_poll_controller() if at least one NAPI
has been registered on the device.

diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 57557a6a950cc9cdff959391576a03381d328c1a..149474c1ad71dde295d3a2b085ef51eb50278d81 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -189,7 +189,6 @@ static void poll_napi(struct net_device *dev)
 
 static void netpoll_poll_dev(struct net_device *dev)
 {
-       const struct net_device_ops *ops;
        struct netpoll_info *ni = rcu_dereference_bh(dev->npinfo);
 
        /* Don't do any rx activity if the dev_lock mutex is held
@@ -204,14 +203,12 @@ static void netpoll_poll_dev(struct net_device *dev)
                return;
        }
 
-       ops = dev->netdev_ops;
-       if (!ops->ndo_poll_controller) {
-               up(&ni->dev_lock);
-               return;
-       }
+       if (list_empty(&dev->napi_list)) {
+               const struct net_device_ops *ops = dev->netdev_ops;
 
-       /* Process pending work on NIC */
-       ops->ndo_poll_controller(dev);
+               if (ops->ndo_poll_controller)
+                       ops->ndo_poll_controller(dev);
+       }
 
        poll_napi(dev);
 

But this looks a bit risky; I know some drivers use NAPI only for RX,
and a conventional hard IRQ handler for TX completions.


Better to be safe and apply a small patch series. I can handle that, do not worry.



> Isn't it cleaner (less error prone) to introduce new ndo
> for legacy drivers without napi?

Not really; this is basically not doable, since none of us can test this.
Bowers, AndrewX Sept. 27, 2018, 4:12 p.m. UTC | #13
> -----Original Message-----
> From: Intel-wired-lan [mailto:intel-wired-lan-bounces@osuosl.org] On
> Behalf Of Song Liu
> Sent: Thursday, September 20, 2018 12:01 PM
> To: netdev@vger.kernel.org
> Cc: Song Liu <songliubraving@fb.com>; intel-wired-lan@lists.osuosl.org;
> kernel-team@fb.com; stable@vger.kernel.org
> Subject: [Intel-wired-lan] [PATCH net] ixgbe: check return value of
> napi_complete_done()
> 
> The NIC driver should only enable interrupts when napi_complete_done()
> returns true. This patch adds the check for ixgbe.
> 
> Cc: stable@vger.kernel.org # 4.10+
> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
> Suggested-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Song Liu <songliubraving@fb.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 12 +++++++-----
>  1 file changed, 7 insertions(+), 5 deletions(-)

Tested-by: Andrew Bowers <andrewx.bowers@intel.com>

Patch

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 9a23d33a47ed..1497df6a8dff 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3196,11 +3196,13 @@  int ixgbe_poll(struct napi_struct *napi, int budget)
 		return budget;
 
 	/* all work done, exit the polling mode */
-	napi_complete_done(napi, work_done);
-	if (adapter->rx_itr_setting & 1)
-		ixgbe_set_itr(q_vector);
-	if (!test_bit(__IXGBE_DOWN, &adapter->state))
-		ixgbe_irq_enable_queues(adapter, BIT_ULL(q_vector->v_idx));
+	if (likely(napi_complete_done(napi, work_done))) {
+		if (adapter->rx_itr_setting & 1)
+			ixgbe_set_itr(q_vector);
+		if (!test_bit(__IXGBE_DOWN, &adapter->state))
+			ixgbe_irq_enable_queues(adapter,
+						BIT_ULL(q_vector->v_idx));
+	}
 
 	return min(work_done, budget - 1);
 }