Message ID | 20200412060854.334895-1-leon@kernel.org |
---|---|
State | Changes Requested |
Delegated to: | David Miller |
Headers | show |
Series | [net,v1] net/sched: Don't print dump stack in event of transmission timeout | expand |
On Sun, 12 Apr 2020 09:08:54 +0300 Leon Romanovsky wrote: > Hi Dave, > > This is a new version of previously sent v0 [1] with change in print error > level as was suggested by Jakub and Cong. I'm asking you to reevaluate > your previous decision [2] given the fact that this is user triggered > bug and very similar scenario was committed by Linus "fs/filesystems.c: > downgrade user-reachable WARN_ONCE() to pr_warn_once()" a couple of days > ago [3]. > > [1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org > [2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net > [3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539 How is it user triggerable? If there's a IB-specific reason maybe ib netdev should stop implementing ndo_tx_timeout.
On Sun, Apr 12, 2020 at 11:59:13AM -0700, Jakub Kicinski wrote: > On Sun, 12 Apr 2020 09:08:54 +0300 Leon Romanovsky wrote: > > Hi Dave, > > > > This is a new version of previously sent v0 [1] with change in print error > > level as was suggested by Jakub and Cong. I'm asking you to reevaluate > > your previous decision [2] given the fact that this is user triggered > > bug and very similar scenario was committed by Linus "fs/filesystems.c: > > downgrade user-reachable WARN_ONCE() to pr_warn_once()" a couple of days > > ago [3]. > > > > [1] https://lore.kernel.org/netdev/20200402152336.538433-1-leon@kernel.org > > [2] https://lore.kernel.org/netdev/20200402.180218.940555077368617365.davem@davemloft.net > > [3] https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/commit/?h=x86/urgent&id=26c5d78c976ca298e59a56f6101a97b618ba3539 > > How is it user triggerable? If there's a IB-specific reason maybe ib > netdev should stop implementing ndo_tx_timeout. It is happening if device is extremely over loaded with traffic, internally HW decreases the performance (HW bug), it is causing to the TX timeouts and to the WARN_ON splat. We don't want to stop implementing ndo_tx_timeout, because it works right most (if not all) of the time. If it is very important, I will dig into internal bug reports to see the possible reproduction scenarios, but from what I saw till now, it is statistical failure. And it is not IB specific, but mlx4 specific. Thanks
This is cause by a device"overwhelmed with traffic"? Sounds like normal operation to me. That's a bug, and the driver handling the device with this problem should adjust how it implements TX timeouts to accomodate this.
On Sun, Apr 12, 2020 at 09:19:25PM -0700, David Miller wrote: > > This is cause by a device"overwhelmed with traffic"? Sounds like > normal operation to me. > > That's a bug, and the driver handling the device with this problem > should adjust how it implements TX timeouts to accomodate this. From the internal bug description, hope that it makes sense. ----- A timeout may occur if the amount of the reported bytes higher than the queue limit, in this case, the kernel closes the queue and only after getting a completion it wil reopen it. In the debug we saw that in some situations the driver gets a **delayed completion**, completions arrive after **1 min**, therefore, the amount of queued bytes exceeds the DQL max size. As a result, the kernel after watchdog_timeo calls the driver's timeout function, that prints timeout to dmesg. After debugging the issue with FW to understand the root cause of the delayed completions we understand that since the IB and the TCP traffic are running at the same service level (SL), the same schedule queue schedules between all the QPs, and in this case if one of the IB QPs get stuck because of congestion, all other QPs will be stuck (include the TCP QPs) until releasing the stuck QP. ----- User separates traffic to different SLs. Thanks
From: Leon Romanovsky <leon@kernel.org> Date: Apr/12/2020, 07:08:54 (UTC+00:00) > [ 281.170584] ------------[ cut here ]------------ Not objecting to the patch it-self (because usually stack trace is useless), but just FYI we use this marker in our CI to track for timeouts or crashes. I'm not sure if anyone else is using it. And actually, can you please explain why BQL is not suppressing your timeouts ? --- Thanks, Jose Miguel Abreu
On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote: > From: Leon Romanovsky <leon@kernel.org> > Date: Apr/12/2020, 07:08:54 (UTC+00:00) > > > [ 281.170584] ------------[ cut here ]------------ > > Not objecting to the patch it-self (because usually stack trace is > useless), but just FYI we use this marker in our CI to track for timeouts > or crashes. I'm not sure if anyone else is using it. I didn't delete the "NETDEV WATCHDOG .." message and it will be still visible as a marker. > > And actually, can you please explain why BQL is not suppressing your > timeouts ? Driver can't distinguish between "real" timeout and "mixed traffic" timeout, so we don't want to completely disable "dev->netdev_ops->ndo_tx_timeout(dev, i);" call in watchdog [1]. The goal is to leave functionality in place and simply remove stack trace to be similar to other BUG prints in that file [2]. [1] https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c#L444 [2] https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c#L328 > > --- > Thanks, > Jose Miguel Abreu
From: Leon Romanovsky <leon@kernel.org> Date: Apr/13/2020, 11:20:53 (UTC+00:00) > On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote: > > From: Leon Romanovsky <leon@kernel.org> > > Date: Apr/12/2020, 07:08:54 (UTC+00:00) > > > > > [ 281.170584] ------------[ cut here ]------------ > > > > Not objecting to the patch it-self (because usually stack trace is > > useless), but just FYI we use this marker in our CI to track for timeouts > > or crashes. I'm not sure if anyone else is using it. > > I didn't delete the "NETDEV WATCHDOG .." message and it will be still > visible as a marker. > > > > > And actually, can you please explain why BQL is not suppressing your > > timeouts ? > > Driver can't distinguish between "real" timeout and "mixed traffic" timeout, The point is that you should not get any "mixed traffic" timeout if the driver uses BQL because Queue will be disabled long before timeout happens as per queue size usage ... --- Thanks, Jose Miguel Abreu
On Mon, Apr 13, 2020 at 10:37:24AM +0000, Jose Abreu wrote: > From: Leon Romanovsky <leon@kernel.org> > Date: Apr/13/2020, 11:20:53 (UTC+00:00) > > > On Mon, Apr 13, 2020 at 09:01:32AM +0000, Jose Abreu wrote: > > > From: Leon Romanovsky <leon@kernel.org> > > > Date: Apr/12/2020, 07:08:54 (UTC+00:00) > > > > > > > [ 281.170584] ------------[ cut here ]------------ > > > > > > Not objecting to the patch it-self (because usually stack trace is > > > useless), but just FYI we use this marker in our CI to track for timeouts > > > or crashes. I'm not sure if anyone else is using it. > > > > I didn't delete the "NETDEV WATCHDOG .." message and it will be still > > visible as a marker. > > > > > > > > And actually, can you please explain why BQL is not suppressing your > > > timeouts ? > > > > Driver can't distinguish between "real" timeout and "mixed traffic" timeout, > > The point is that you should not get any "mixed traffic" timeout if the > driver uses BQL because Queue will be disabled long before timeout happens > as per queue size usage ... Sorry, if I misunderstood you, but you are proposing to count traffic, right? If yes, RDMA traffic bypasses the SW stack and not visible to the kernel, hence the BQL will count only ETH portion of that mixed traffic, while RDMA traffic is the one who "blocked" transmission channel (QP in RDMA terminology). Thanks > > --- > Thanks, > Jose Miguel Abreu
From: Leon Romanovsky <leon@kernel.org> Date: Apr/13/2020, 11:54:08 (UTC+00:00) > Sorry, if I misunderstood you, but you are proposing to count traffic, right? > > If yes, RDMA traffic bypasses the SW stack and not visible to the kernel, hence > the BQL will count only ETH portion of that mixed traffic, while RDMA traffic > is the one who "blocked" transmission channel (QP in RDMA terminology). Sorry but you don't mention in your commit message that this is RDMA specific so that's why I brought up the topic of BQL. Apologies for the misunderstood. --- Thanks, Jose Miguel Abreu
On Mon, Apr 13, 2020 at 11:01:42AM +0000, Jose Abreu wrote: > From: Leon Romanovsky <leon@kernel.org> > Date: Apr/13/2020, 11:54:08 (UTC+00:00) > > > Sorry, if I misunderstood you, but you are proposing to count traffic, right? > > > > If yes, RDMA traffic bypasses the SW stack and not visible to the kernel, hence > > the BQL will count only ETH portion of that mixed traffic, while RDMA traffic > > is the one who "blocked" transmission channel (QP in RDMA terminology). > > Sorry but you don't mention in your commit message that this is RDMA > specific so that's why I brought up the topic of BQL. Apologies for the > misunderstood. No problem, I'm glad that you asked those questions, hope that my answers clear the rationale behind change from WARN_ON to be pr_err(). Thanks > > --- > Thanks, > Jose Miguel Abreu
On Sat, Apr 11, 2020 at 11:09 PM Leon Romanovsky <leon@kernel.org> wrote: > > From: Leon Romanovsky <leonro@mellanox.com> > > In event of transmission timeout, the drivers are given an opportunity > to recover and continue to work after some in-house cleanups. > > Such event can be caused by HW bugs, wrong congestion configurations > and many more other scenarios. In such case, users are interested to > get a simple "NETDEV WATCHDOG ... " print, which points to the relevant > netdevice in trouble. > > The dump stack printed later was added in the commit b4192bbd85d2 > ("net: Add a WARN_ON_ONCE() to the transmit timeout function") to give > extra information, like list of the modules and which driver is involved. > > While the latter is already printed in "NETDEV WATCHDOG ... ", the list > of modules rarely needed and can be collected later. > > So let's remove the WARN_ONCE() and make dmesg look more user-friendly in > large cluster setups. ... > > Fixes: b4192bbd85d2 ("net: Add a WARN_ON_ONCE() to the transmit timeout function") > Signed-off-by: Leon Romanovsky <leonro@mellanox.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Thanks.
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c index 6c9595f1048a..185f03db3d55 100644 --- a/net/sched/sch_generic.c +++ b/net/sched/sch_generic.c @@ -439,8 +439,9 @@ static void dev_watchdog(struct timer_list *t) if (some_queue_timedout) { trace_net_dev_xmit_timeout(dev, i); - WARN_ONCE(1, KERN_INFO "NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", - dev->name, netdev_drivername(dev), i); + pr_err_once("NETDEV WATCHDOG: %s (%s): transmit queue %u timed out\n", + dev->name, + netdev_drivername(dev), i); dev->netdev_ops->ndo_tx_timeout(dev, i); } if (!mod_timer(&dev->watchdog_timer,