
[LEDE-DEV] Transmit timeouts with mtk_eth_soc and MT7621

Message ID CAKfDRXjwDqXBMJakmmfSP5+kCE-YimQacYYB8gj2y6br=j-+TQ@mail.gmail.com
State RFC

Commit Message

Kristian Evensen Aug. 26, 2017, 2:47 p.m. UTC
Hello again,

On Sat, Aug 26, 2017 at 12:38 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> Hi,
>
> On Sat, Aug 26, 2017 at 7:43 AM, Mingyu Li <igvtee@gmail.com> wrote:
>> Hi.
>>
>> I checked the code again and found that xmit_more can cause the TX
>> timeout. You can reference this:
>> https://www.mail-archive.com/netdev@vger.kernel.org/msg123334.html
>> So the patch should be like this (edit mtk_eth_soc.c):
>>
>>         tx_num = fe_cal_txd_req(skb);
>>         if (unlikely(fe_empty_txd(ring) <= tx_num)) {
>> +                if (skb->xmit_more)
>> +                        fe_reg_w32(ring->tx_next_idx, FE_REG_TX_CTX_IDX0);
>>                 netif_stop_queue(dev);
>>                 netif_err(priv, tx_queued, dev,
>>                           "Tx Ring full when queue awake!\n");
>>
>> But I am not sure. Maybe the pause frames cause the problem.
>
> Thanks for the patch. I tested it, but I unfortunately still see the
> error. I also added a print-statement inside the conditional and can
> see that the condition is never hit. I also don't see the "Tx Ring
> full"-message. One difference which I noticed now though, is that I
> don't see the bursty bandwidth pattern I described earlier (32, 0, 32,
> 0, ...). With your patch, it is always 32, 0, crash.
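
For context, the hazard Mingyu's snippet is aimed at is the xmit_more
contract: when the stack sets skb->xmit_more, the driver is allowed to
postpone the doorbell write, so if the queue is then stopped without a
final kick, already-queued descriptors are never fetched by the DMA
engine. A minimal sketch of the pattern, using the register and field
names that are already in mtk_eth_soc.c:

    /* Defer the doorbell while the stack promises more packets, but
     * always kick the DMA engine when the queue is (about to be)
     * stopped, otherwise the queued descriptors are never fetched. */
    if (netif_xmit_stopped(netdev_get_tx_queue(dev, 0)) || !skb->xmit_more)
        fe_reg_w32(ring->tx_next_idx, FE_REG_TX_CTX_IDX0);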

I spent some more time looking into this today and think I might have
been able to solve the issue. My test has been running for ~2 hours
now without any errors (before, it would at best work for 10-15
minutes), and I do not see any performance regressions. Before going
into detail, I should probably point out that I am not very familiar
with driver development, so my observations and changes might not be
appropriate/correct :)

I guess our common theory is that something is causing TX interrupts
not to be enabled again. After reading up on NAPI
(https://wiki.linuxfoundation.org/networking/napi), it seems that the
recommended way of using NAPI on clear-on-write devices (like the
RT5350) is to use NAPI for RX and do TX in the interrupt handler. In
the current driver, both TX and RX are handled in the NAPI callback
fe_poll(). I have modified the driver to split RX and TX, so now
fe_poll() only deals with RX and TX is dealt with in fe_handle_irq().
I have attached the (messy) patch I am currently testing. If this is
an OK solution, I will clean up the patch and submit it to the list. I
will leave my tests running overnight and report back if anything pops
up.
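
To make the idea concrete, the interrupt handler in my test build
roughly becomes the sketch below (condensed from the attached patch;
treat it as a sketch of the idea rather than the exact code I am
testing):

    static irqreturn_t fe_handle_irq(int irq, void *dev)
    {
        struct fe_priv *priv = netdev_priv(dev);
        u32 status = fe_reg_r32(FE_REG_FE_INT_STATUS);

        if (unlikely(!status))
            return IRQ_NONE;

        if (status & priv->soc->rx_int) {
            /* RX stays in NAPI: mask RX interrupts and schedule the poll. */
            if (likely(napi_schedule_prep(&priv->rx_napi))) {
                fe_int_disable(priv->soc->rx_int);
                __napi_schedule(&priv->rx_napi);
            }
        }

        if (status & priv->soc->tx_int) {
            /* TX completion is now handled directly in the hard IRQ. */
            fe_reg_w32(priv->soc->tx_int, FE_REG_FE_INT_STATUS);
            fe_poll_tx(priv);
        }

        return IRQ_HANDLED;
    }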

I guess that John's new driver is the future for mtk_eth_soc, but I
believe that fixing this issue in the current driver is worthwhile.
As things are now, it is possible to DoS RT5350-based routers running
LEDE 17.01 just by sending the right type of traffic.

-Kristian

Comments

Kristian Evensen Aug. 26, 2017, 2:57 p.m. UTC | #1
Hi,

On Sat, Aug 26, 2017 at 4:47 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I guess our common theory is that something is causing TX interrupts
> not to be enabled again. After reading up on NAPI
> (https://wiki.linuxfoundation.org/networking/napi), it seems that the
> recommended way of using NAPI on clear-on-write devices (like the
> RT5350) is to use NAPI for RX and do TX in the interrupt handler. In
> the current driver, both TX and RX are handled in the NAPI callback
> fe_poll(). I have modified the driver to split RX and TX, so now
> fe_poll() only deals with RX and TX is dealt with in fe_handle_irq().
> I have attached the (messy) patch I am currently testing. If this is
> an OK solution, I will clean up the patch and submit it to the list. I
> will leave my tests running overnight and report back if anything pops
> up.

I did not have to wait very long for something to happen. Unfortunately, my
router crashed right after I sent the previous email. So I guess that
means back to the drawing board.

-Kristian
Kristian Evensen Aug. 28, 2017, 11:16 a.m. UTC | #2
Hello again,

On Sat, Aug 26, 2017 at 4:57 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I did not have to wait very long for something to happen. Unfortunately, my
> router crashed right after I sent the previous email. So I guess that
> means back to the drawing board.

I have kept working on this issue throughout the weekend and have
noticed something strange. I don't know if this observation is
relevant or not, and I still haven't found out why the queue times out
(why the queue is stopped), but perhaps the observation can help point
us in the right direction.

I noticed that TX_WB_DDONE is set, so I decided to dump the state of
each txd when the timeout occurs. When I do this (and the timeout has
been triggered), I see that dtx always points to txds with DDONE set,
while ctx points to txds where DDONE is not set. To me, this seems to
be the inverse of how it should be. I.e., dtx should point to a txd
where DDONE is not set, while ctx should point to a txd with DDONE
set. Could it be that these counters/indexes for some reason go out
of sync and that this triggers the timeout? There is data to be sent,
but it is not seen.

To give a more complete example, say we have the following error:
[ 1816.133130] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[ 1816.139304] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000067
[ 1816.145288] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0,
base=0ee44000, max=512, ctx=302, dtx=202, fdx=202, next=302
[ 1816.155968] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0,
base=0ec0e000, max=512, calc=486, drx=484

Here, txd[202-301] have DDONE set, while DDONE is not set for
txd[302-201] (wrapping around the end of the ring).
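
For reference, the dump is produced by simply walking the ring and
checking the DDONE bit and skb pointer of each descriptor, roughly like
the sketch below. The field names (tx_dma[], txd2, TX_DMA_DONE, tx_buf,
tx_ring_size) are written from memory and should be double-checked
against mtk_eth_soc.h:

    /* Sketch: report which TX descriptors the DMA engine has completed
     * (DDONE set) and which ones still have an skb attached. */
    static void fe_dump_txd_state(struct fe_priv *priv)
    {
        struct fe_tx_ring *ring = &priv->tx_ring;
        int i;

        for (i = 0; i < ring->tx_ring_size; i++)
            pr_info("txd[%d]: ddone=%d skb=%d\n", i,
                    !!(ring->tx_dma[i].txd2 & TX_DMA_DONE),
                    ring->tx_buf[i].skb ? 1 : 0);
    }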

Thanks in advance for all the help so far,
Kristian
Kristian Evensen Nov. 8, 2017, 9:56 p.m. UTC | #3
Hello,

It turns out that the assumption that the "transmit timed out"-issue
was related to pause frames/flow control was incorrect. I have
recently started to see the error again, with flow control disabled.
However, unlike last time, I am now able to reliably trigger the
issue.

The timeout seems to be triggered by connectivity problems between
MT7621-based routers (not sure if it applies to other devices with the
MT7530 switch) and the next hop. I checked each client connected to
some of the routers exhibiting this issue, and it turned out that some
had bad cables, etc. In order to check the theory in a more controlled
fashion, I set up the following small testbed:

NUC (192.168.1.1) <-> (192.168.1.2) ZBT 3526 (192.168.2.1) <->
(192.168.2.2) ZBT 2626 (192.168.3.1) <-> (192.168.3.2) Client

I then configured port forwarding from the 3526 all the way to the
client and hammered the client with small UDP packets. Then, at random
points, I intentionally hung the kernel on the 2626 by triggering an
RCU stall. L2 was still up, but the 2626 no longer replied to any
packets, including ARP (so the neighbor-table entry for 192.168.2.2
was quickly lost). More or less as soon as the kernel hung, the
transmit timeout error message started showing up. If I restart
networking or enable/disable the ports, then everything works fine for
a bit (I can, for example, ping 192.168.1.1 from 192.168.1.2), but
after some time the error appears again.
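
(For anyone who wants to reproduce this: a crude way to provoke the
stall is a throwaway module whose init path never returns. The sketch
below is the general idea, not necessarily the exact module I used, and
it will of course wedge the CPU it runs on.)

    #include <linux/module.h>
    #include <linux/delay.h>
    #include <linux/rcupdate.h>

    /* WARNING: loading this deliberately wedges one CPU inside an RCU
     * read-side critical section and triggers RCU stall warnings.
     * Test hardware only. */
    static int __init stall_init(void)
    {
        rcu_read_lock();
        for (;;)
            mdelay(100);    /* busy-wait, never schedule */
        return 0;
    }
    module_init(stall_init);
    MODULE_LICENSE("GPL");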

I have been trying to solve this myself for a couple of days, but I am
starting to run out of ideas. Could it be that there is some traffic
destined for the client (via the 2626) that gets stuck in the TX
queue on the 3526? Any help, pointers on where to look or ideas for
what could be wrong would be much appreciated.

Thanks in advance for the help,
Kristian
Kristian Evensen Nov. 9, 2017, 11:06 a.m. UTC | #4
Hello,

On Wed, Nov 8, 2017 at 10:56 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I have been trying to solve this myself for a couple of days, but I am
> starting to run out of ideas. Could it be that there is some traffic
> destined for the client (via the 2626) that gets stuck in the TX
> queue on the 3526? Any help, pointers on where to look or ideas for
> what could be wrong would be much appreciated.

I have been doing some more debugging today, and it
seems that the problem is that the hardware for some reason stops
transmitting data. I dumped the content of the TX ring on every
timeout, and the output seems to indicate that the state of the ring
is sane. For example, with the following error:

[  325.650638] mtk_soc_eth 1e100000.ethernet eth0: transmit timed out
[  325.656857] mtk_soc_eth 1e100000.ethernet eth0: dma_cfg:80000067
[  325.662902] mtk_soc_eth 1e100000.ethernet eth0: tx_ring=0,
base=0ef58000, max=512, ctx=384, dtx=0, fdx=0, next=384
[  325.673234] mtk_soc_eth 1e100000.ethernet eth0: rx_ring=0,
base=0ef5e000, max=512, calc=5, drx=6

I see that the CPU txds [384-511] have DDONE set and no SKB, while
DDONE is not set for the DMA txds [0-383] and an skb is attached. I
also looked at the content of the skb, and as far as I can see it is
valid. Looking at the content of the SKB also shows that
fe_reset_pending() does its job. For every timeout, there is a new set
of packets on the ring. So new packets are put on the ring, but none
are sent.

-Kristian
Kristian Evensen Nov. 9, 2017, 4:21 p.m. UTC | #5
On Thu, Nov 9, 2017 at 12:06 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I see that the CPU txds [384-511] have DDONE set and no SKB, while
> DDONE is not set for the DMA txds [0-383] and an skb is attached. I
> also looked at the content of the skb, and as far as I can see it is
> valid. Looking at the content of the SKB also shows that
> fe_reset_pending() does its job. For every timeout, there is a new set
> of packets on the ring. So new packets are put on the ring, but none
> are sent.

I have been hammering away on this issue during the day, and it seems
that the DMA engine, TX, etc. work just fine. However, for some
reason, the port with the router that has hung is able to stop the
whole switch. If I disable the port (or disconnect the cable), then TX
works again and I can, for example, reach 192.168.1.1 from 192.168.1.2
in my testbed. When running ping (from 192.168.1.2 to 192.168.1.1)
while disconnecting the cable, the first packets had a very high RTT
(~20ms). Running tcpdump showed that the reply arrived immediately, so
it seems the packets are stuck in a TX buffer for a really long time.
Could it be that there is a cache or something internally on the
switch that is causing packets to be held back, and that this cache is
invalidated and buffers flushed when I disable the port? I cleared the
switch, DIP, and SIP tables without any effect.

If I enable the port, then the problem appears again after a little
while (~30 seconds).

-Kristian
Kristian Evensen Nov. 9, 2017, 7:42 p.m. UTC | #6
On Thu, Nov 9, 2017 at 5:21 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I have been hammering away on this issue during the day, and it seems
> that the DMA engine, TX, etc. work just fine. However, for some
> reason, the port with the router that has hung is able to stop the
> whole switch. If I disable the port (or disconnect the cable), then TX
> works again and I can, for example, reach 192.168.1.1 from 192.168.1.2
> in my testbed. When running ping (from 192.168.1.2 to 192.168.1.1)
> while disconnecting the cable, the first packets had a very high RTT
> (~20ms). Running tcpdump showed that the reply arrived immediately, so
> it seems the packets are stuck in a TX buffer for a really long time.
> Could it be that there is a cache or something internally on the
> switch that is causing packets to be held back, and that this cache is
> invalidated and buffers flushed when I disable the port? I cleared the
> switch, DIP, and SIP tables without any effect.
>
> If I enable the port, then the problem appears again after a little
> while (~30 seconds).

I replaced the 3526 with other devices containing the mt7530 switch
(both mt7621- and mt7623-based boards), and the issue seems to be
related to the switch rather than the SoC. I am able to reliably
trigger the timeout on all devices I have tested, both running
proprietary drivers/firmware and LEDE. I guess this suggests that
there is some traffic pattern or network behavior that triggers an
error in the MT7530 and causes TX to freeze. Restarting the ports
makes the switch work again, but as long as the "bad" device is
connected to the mt7530, it is just a matter of time before the
timeout is back.

-Kristian
Kristian Evensen Nov. 16, 2017, 3:15 p.m. UTC | #7
Hello,

On Thu, Nov 9, 2017 at 8:42 PM, Kristian Evensen
<kristian.evensen@gmail.com> wrote:
> I replaced the 3526 with other devices containing the mt7530 switch
> (both mt7621- and mt7623-based boards), and the issue seems to be
> related to the switch rather than the SoC. I am able to reliably
> trigger the timeout on all devices I have tested, both running
> proprietary drivers/firmware and LEDE. I guess this suggests that
> there is some traffic pattern or network behavior that triggers an
> error in the MT7530 and causes TX to freeze. Restarting the ports
> makes the switch work again, but as long as the "bad" device is
> connected to the mt7530, it is just a matter of time before the
> timeout is back.

I think I am ready to conclude on this issue. First of all, I have
discovered that I made an incorrect statement earlier: I have not seen
the problem with flow control disabled. After finding a network tap
and a device which passes pause frames up to the driver (so I can see
them with tcpdump), I think I finally see what is going on.

I connected the tap between router #1 and router #2, and performed the
test described earlier with flow control enabled and disabled. When
triggering the RCU stall, I see a continuous flood of pause frames
coming from router #2. This flood happens irrespective of whether flow
control is enabled or not. However, with flow control enabled, I see
that the RxPause and TxPause counters increase. With flow control
disabled, they remain at 0. In other words, it seems that the switch
filters out pause frames if the bit is unset in the feature register
(it would be great if someone could confirm or deny this).

The MT7530 switch seems to use one buffer for all ports, so what I
have seen all along is head-of-line blocking. Since I use iperf in UDP
mode, the traffic destined for router #2 never slows down and fills up
the buffer. Thus, all other traffic is blocked. When disabling the
port used by router #2, the buffer is cleared and packets can flow as
normal again. With flow control disabled, I do not see the head-of-line
blocking. If I am connected to router #1, I can always reach it.
If flow control is enabled, router #1 stops replying to, for example,
ping when the pause flood starts.

I don't know what the correct "solution" to this problem is. I asked
Piotr to mark my patch for always disabling flow control as not
applicable, but perhaps it should be brought back if everyone agrees
that disabling flow control is OK. If not, then perhaps the following
patch should be accepted so that it becomes possible to switch flow
control on/off: https://lists.openwrt.org/pipermail/openwrt-devel/2016-April/040705.html
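
(For reference, and independent of the swconfig patch linked above, the
generic kernel mechanism for exposing such a knob is the ethtool pause
API, so that "ethtool -A ethX rx off tx off" works. A rough sketch of
the driver-side hooks is below; the rx_pause_en/tx_pause_en fields are
hypothetical, and the hardware-specific register writes are left out.)

    /* Sketch only: ethtool hooks a driver could wire into its
     * ethtool_ops (.get_pauseparam/.set_pauseparam). */
    static void fe_get_pauseparam(struct net_device *dev,
                                  struct ethtool_pauseparam *pause)
    {
        struct fe_priv *priv = netdev_priv(dev);

        pause->autoneg = 0;
        pause->rx_pause = priv->rx_pause_en;   /* hypothetical field */
        pause->tx_pause = priv->tx_pause_en;   /* hypothetical field */
    }

    static int fe_set_pauseparam(struct net_device *dev,
                                 struct ethtool_pauseparam *pause)
    {
        struct fe_priv *priv = netdev_priv(dev);

        priv->rx_pause_en = pause->rx_pause;
        priv->tx_pause_en = pause->tx_pause;
        /* ...program the MAC/switch flow-control bits here... */
        return 0;
    }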

BR,
Kristian

Patch

From 98ce0313cd8654fe69028c19b6f08da1d0671d75 Mon Sep 17 00:00:00 2001
From: Kristian Evensen <kristian.evensen@gmail.com>
Date: Sat, 26 Aug 2017 16:05:44 +0200
Subject: [PATCH] [FIX] Move TX out of Napi

This is a broken patch only meant for backup.
---
 .../drivers/net/ethernet/mtk/mtk_eth_soc.c         | 36 +++++++++++++++++-----
 1 file changed, 28 insertions(+), 8 deletions(-)

diff --git a/target/linux/ramips/files-4.9/drivers/net/ethernet/mtk/mtk_eth_soc.c b/target/linux/ramips/files-4.9/drivers/net/ethernet/mtk/mtk_eth_soc.c
index 5b354a6563..af865826e8 100644
--- a/target/linux/ramips/files-4.9/drivers/net/ethernet/mtk/mtk_eth_soc.c
+++ b/target/linux/ramips/files-4.9/drivers/net/ethernet/mtk/mtk_eth_soc.c
@@ -684,10 +684,13 @@  static int fe_tx_map_dma(struct sk_buff *skb, struct net_device *dev,
 	 */
 	wmb();
 	if (unlikely(fe_empty_txd(ring) <= ring->tx_thresh)) {
+		pr_info_ratelimited("netif_stop_queue %u %u\n", fe_empty_txd(ring), ring->tx_thresh);
 		netif_stop_queue(dev);
 		smp_mb();
-		if (unlikely(fe_empty_txd(ring) > ring->tx_thresh))
+		if (unlikely(fe_empty_txd(ring) > ring->tx_thresh)) {
+			pr_info_ratelimited("netif_wake_queue\n");
 			netif_wake_queue(dev);
+		}
 	}
 
 	if (netif_xmit_stopped(netdev_get_tx_queue(dev, 0)) || !skb->xmit_more)
@@ -781,6 +784,10 @@  static int fe_start_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	tx_num = fe_cal_txd_req(skb);
 	if (unlikely(fe_empty_txd(ring) <= tx_num)) {
+		if (skb->xmit_more) {
+			pr_info_ratelimited("SKB XMIT_MORE\n");
+			fe_reg_w32(ring->tx_next_idx, FE_REG_TX_CTX_IDX0);
+		}
 		netif_stop_queue(dev);
 		netif_err(priv, tx_queued, dev,
 			  "Tx Ring full when queue awake!\n");
@@ -967,11 +974,12 @@  static int fe_poll(struct napi_struct *napi, int budget)
 		status_reg = FE_REG_FE_INT_STATUS;
 	}
 
+    /*
    if (status & tx_intr) {
            fe_reg_w32(tx_intr, FE_REG_FE_INT_STATUS);
            tx_done = fe_poll_tx(priv);
            status = fe_reg_r32(FE_REG_FE_INT_STATUS);
-   }
+   }*/
 
 	if (status & rx_intr)
 		rx_done = fe_poll_rx(napi, budget, priv, rx_intr);
@@ -1042,18 +1050,30 @@  static irqreturn_t fe_handle_irq(int irq, void *dev)
 {
 	struct fe_priv *priv = netdev_priv(dev);
 	u32 status, int_mask;
+	u32 tx_intr, rx_intr;
+
+	tx_intr = priv->soc->tx_int;
+	rx_intr = priv->soc->rx_int;
 
 	status = fe_reg_r32(FE_REG_FE_INT_STATUS);
 
 	if (unlikely(!status))
 		return IRQ_NONE;
 
-	int_mask = (priv->soc->rx_int | priv->soc->tx_int);
+	int_mask = tx_intr | rx_intr;
 	if (likely(status & int_mask)) {
-		if (likely(napi_schedule_prep(&priv->rx_napi))) {
-			fe_int_disable(int_mask);
-			__napi_schedule(&priv->rx_napi);
-		}
+		if (status & rx_intr) {
+			if (likely(napi_schedule_prep(&priv->rx_napi))) {
+				fe_int_disable(rx_intr);
+				__napi_schedule(&priv->rx_napi);
+			}
+		}
+
+		if (status & tx_intr) {
+			fe_reg_w32(tx_intr, FE_REG_FE_INT_STATUS);
+			fe_poll_tx(priv);
+			fe_int_enable(tx_intr);
+		}
 	} else {
 		fe_reg_w32(status, FE_REG_FE_INT_STATUS);
 	}
@@ -1549,7 +1569,7 @@  static int fe_probe(struct platform_device *pdev)
 
 	platform_set_drvdata(pdev, netdev);
 
-	netif_info(priv, probe, netdev, "mediatek frame engine at 0x%08lx, irq %d\n",
+	netif_info(priv, probe, netdev, "mediatek frame engine at 0x%08lx, irq %d - patched\n",
 		   netdev->base_addr, netdev->irq);
 
 	return 0;
-- 
2.11.0