diff mbox

[OpenWrt-Devel] ar71xx: check for stuck DMA on AR724x & fix sirq storm after recovery

Message ID 568FE8A2.5010406@openwrt.org
State RFC
Headers show

Commit Message

Felix Fietkau Jan. 8, 2016, 4:49 p.m. UTC
On 2016-01-08 06:26, Conn O'Griofa wrote:
> Hi,
> 
> I'm proposing the following patch to resolve ticket #18922 fully.
> 
> With the current master revision, when a tx timeout condition occurs,
> the interface recovers successfully, but a soft irq storm occurs
> (causing ksoftirqd to peg the CPU, due to this goto being called
> without end:
> https://github.com/openwrt-mirror/openwrt/blob/master/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c#L1073
> ). Forcing the tx and rx rings to be cleared and re-inited in
> ag71xx_restart_work_func seems to avoid the sirq storm, but I'd
> appreciate feedback on whether there's a more effective workaround.
> 
> Additionally, ag71xx_check_dma_stuck *does* successfully detect the
> stuck DMA condition on AR7241 (TR-WL842ND v1), so enabling the check
> for this chipset series ensures a link adjust occurs *before* an
> actual tx timeout is detected. This avoids the brief network
> interruption that normally occurs during the DMA stuck -> tx timeout
> -> link adjust condition.
> 
> Conn
> 
> P.S. The sirq storm also occurs when ag71xx_check_dma_stuck is utilized on this chipset to avoid the tx timeout condition, so it appears that both changes are necessary (or at least, a better way to solve the sirq storm needs to be discovered).
Thanks for investigating this further. Please try this patch:
---

Comments

Conn O'Griofa Jan. 8, 2016, 6:48 p.m. UTC | #1
On 08/01/16 16:49, Felix Fietkau wrote:

> Thanks for investigating this further. Please try this patch:

Unfortunately, your proposed patch doesn't work. When I trigger the timeout condition, a tx timeout occurs and the interface doesn't recover correctly:

[  249.075582] ------------[ cut here ]------------
[  249.080296] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:303 dev_watchdog+0x1dc/0x260()
[  249.088800] NETDEV WATCHDOG: eth1 (ag71xx): transmit queue 0 timed out
[  249.095378] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath pppox ppp_generic nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_id xt_hl xt_helper xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables crc_ccitt compat fuse sch_fq_pie sch_cake act_ipt em_nbyte sch_teql sch_htb sch_prio sch_pie sch_codel sch_gred em_meta cls_basic em_text sch_tbf sch_red sch_sfq sch_fq act_police em_cmp sch_dsmark act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc sch_ingress ledtrig_usbdev ip6t_REJ
 E
CT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables ifb zram lzo_decompress lzo_compress lz4_decompress lz4_compress zsmalloc usb_storage ehci_platform ehci_hcd sd_mod scsi_mod gpio_button_hotplug ext4 jbd2 mbcache usbcore nls_base usb_common crc16 crypto_hash
[  249.211641] CPU: 0 PID: 0 Comm: swapper Not tainted 4.1.13 #1
[  249.217434] Stack : 80378d20 00000000 00000001 803c0000 803b8ab0 803b8743 8035b52c 00000000
[  249.217434] 	  80413514 000000c8 00000004 0000000a 00000100 800a6110 803c7d04 803b0000
[  249.217434] 	  00000003 000000c8 8035edb4 803b3c44 00000100 800a473c 00000006 00000000
[  249.217434] 	  00000000 80410000 00000000 00000000 00000000 00000000 00000000 00000000
[  249.217434] 	  00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[  249.217434] 	  ...
[  249.253511] Call Trace:
[  249.256003] [<80071f5c>] show_stack+0x50/0x84
[  249.260417] [<8008150c>] warn_slowpath_common+0xa0/0xd0
[  249.265693] [<80081568>] warn_slowpath_fmt+0x2c/0x38
[  249.270694] [<8026240c>] dev_watchdog+0x1dc/0x260
[  249.275456] [<800ad290>] call_timer_fn.isra.3+0x24/0x80
[  249.280735] [<800ad498>] run_timer_softirq+0x1ac/0x1ec
[  249.285936] [<80083b30>] __do_softirq+0x16c/0x298
[  249.290678] [<80083e7c>] irq_exit+0x54/0x70
[  249.294897] [<80060830>] ret_from_irq+0x0/0x4
[  249.299313] [<80060a80>] __r4k_wait+0x20/0x40
[  249.303709] [<800a189c>] cpu_startup_entry+0x114/0x13c
[  249.308904] [<803d5bc4>] start_kernel+0x464/0x484
[  249.313649]
[  249.315156] ---[ end trace 5936bede9e8ff3e6 ]---
[  249.319817] eth1: tx timeout
[  260.075531] eth1: tx timeout
[  275.075459] eth1: tx timeout
[  285.075418] eth1: tx timeout

Triggering a manual 'ifconfig eth1 down; ifconfig eth1 up; ifup wan' still restores connectivity with your patch, if it helps.

Conn
Conn O'Griofa Jan. 8, 2016, 7:17 p.m. UTC | #2
On 08/01/16 18:48, Conn O'Griofa wrote:

> Unfortunately, your proposed patch doesn't work. When I trigger the timeout condition, a tx timeout occurs and the interface doesn't recover correctly:

I'm not able to test recompiles for a few hours at least, but:

* After your patch failed, I already tested quick rebuild, changing ag71xx_restart_work_func to use:

	rtnl_lock();
	ag71xx_stop(ag->dev);
	ag71xx_open(ag->dev);
	rtnl_unlock();

With that change to your patch, the link adjust was triggered before the tx timeout was triggered (expected), but the sirq storm still occurred.

* In ag71xx_hw_enable, netif_start_queue is issued. Since this function is used for the fast restart, that should probably be changed to netif_wake_queue so that the kernel will check for anything pending in the queue to be sent (which is certain to be true). I'll check this as soon as it's possible.

Conn
Conn O'Griofa Jan. 8, 2016, 11:34 p.m. UTC | #3
On 08/01/16 19:17, Conn O'Griofa wrote:

> * In ag71xx_hw_enable, netif_start_queue is issued. Since this function is used for the fast restart, that should probably be changed to netif_wake_queue so that the kernel will check for anything pending in the queue to be sent (which is certain to be true). I'll check this as soon as it's possible.

This change wasn't sufficient to fix the problem. After examining your changes a bit further, there's an issue: your ag71xx_hw_disable function calls ag71xx_hw_stop, but the complement function ag71xx_hw_enable doesn't call ag71xx_hw_start. Since you're trying to avoid a link adjust, the ag71xx_hw_start function never gets called during recovery with your patch.

I tried replacing netif_start_queue(dev) with ag71xx_hw_start(ag) in ag71xx_hw_enable. With this change, when the DMA stuck issue occurs, there's no longer any tx timeouts logged, but the interface stops responding. Perhaps it's also necessary to call ag71xx_fast_reset (which only gets called during a link adjust)?

Conn
diff mbox

Patch

--- a/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
+++ b/target/linux/ar71xx/files/drivers/net/ethernet/atheros/ag71xx/ag71xx_main.c
@@ -631,34 +631,61 @@  void ag71xx_link_adjust(struct ag71xx *ag)
 		ag71xx_rr(ag, AG71XX_REG_MAC_IFCTL));
 }
 
+static int ag71xx_hw_enable(struct ag71xx *ag)
+{
+	int ret;
+
+	ret = ag71xx_rings_init(ag);
+	if (ret)
+		return ret;
+
+	napi_enable(&ag->napi);
+	ag71xx_wr(ag, AG71XX_REG_TX_DESC, ag->tx_ring.descs_dma);
+	ag71xx_wr(ag, AG71XX_REG_RX_DESC, ag->rx_ring.descs_dma);
+	netif_start_queue(ag->dev);
+
+	return 0;
+}
+
+static void ag71xx_hw_disable(struct ag71xx *ag)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&ag->lock, flags);
+
+	netif_stop_queue(ag->dev);
+
+	ag71xx_hw_stop(ag);
+	ag71xx_dma_reset(ag);
+
+	napi_disable(&ag->napi);
+	del_timer_sync(&ag->oom_timer);
+
+	spin_unlock_irqrestore(&ag->lock, flags);
+
+	ag71xx_rings_cleanup(ag);
+}
+
 static int ag71xx_open(struct net_device *dev)
 {
 	struct ag71xx *ag = netdev_priv(dev);
 	unsigned int max_frame_len;
 	int ret;
 
+	netif_carrier_off(dev);
 	max_frame_len = ag71xx_max_frame_len(dev->mtu);
 	ag->rx_buf_size = max_frame_len + NET_SKB_PAD + NET_IP_ALIGN;
 
 	/* setup max frame length */
 	ag71xx_wr(ag, AG71XX_REG_MAC_MFL, max_frame_len);
+	ag71xx_hw_set_macaddr(ag, dev->dev_addr);
 
-	ret = ag71xx_rings_init(ag);
+	ret = ag71xx_hw_enable(ag);
 	if (ret)
 		goto err;
 
-	napi_enable(&ag->napi);
-
-	netif_carrier_off(dev);
 	ag71xx_phy_start(ag);
 
-	ag71xx_wr(ag, AG71XX_REG_TX_DESC, ag->tx_ring.descs_dma);
-	ag71xx_wr(ag, AG71XX_REG_RX_DESC, ag->rx_ring.descs_dma);
-
-	ag71xx_hw_set_macaddr(ag, dev->dev_addr);
-
-	netif_start_queue(dev);
-
 	return 0;
 
 err:
@@ -669,24 +696,10 @@  err:
 static int ag71xx_stop(struct net_device *dev)
 {
 	struct ag71xx *ag = netdev_priv(dev);
-	unsigned long flags;
 
 	netif_carrier_off(dev);
 	ag71xx_phy_stop(ag);
-
-	spin_lock_irqsave(&ag->lock, flags);
-
-	netif_stop_queue(dev);
-
-	ag71xx_hw_stop(ag);
-	ag71xx_dma_reset(ag);
-
-	napi_disable(&ag->napi);
-	del_timer_sync(&ag->oom_timer);
-
-	spin_unlock_irqrestore(&ag->lock, flags);
-
-	ag71xx_rings_cleanup(ag);
+	ag71xx_hw_disable(ag);
 
 	return 0;
 }
@@ -870,14 +883,10 @@  static void ag71xx_restart_work_func(struct work_struct *work)
 {
 	struct ag71xx *ag = container_of(work, struct ag71xx, restart_work);
 
-	if (ag71xx_get_pdata(ag)->is_ar724x) {
-		ag->link = 0;
-		ag71xx_link_adjust(ag);
-		return;
-	}
-
-	ag71xx_stop(ag->dev);
-	ag71xx_open(ag->dev);
+	rtnl_lock();
+	ag71xx_hw_disable(ag);
+	ag71xx_hw_enable(ag);
+	rtnl_unlock();
 }
 
 static bool ag71xx_check_dma_stuck(struct ag71xx *ag, unsigned long timestamp)
@@ -919,7 +928,7 @@  static int ag71xx_tx_packets(struct ag71xx *ag, bool flush)
 		struct sk_buff *skb = ring->buf[i].skb;
 
 		if (!flush && !ag71xx_desc_empty(desc)) {
-			if (pdata->is_ar7240 &&
+			if (pdata->is_ar724x &&
 			    ag71xx_check_dma_stuck(ag, ring->buf[i].timestamp))
 				schedule_work(&ag->restart_work);
 			break;