diff mbox

commit 16ecba59 breaks 82574L under heavy load.

Message ID 20170720234455.7nxtui7shmnzivxd@f1.synalogic.ca
State Not Applicable
Headers show

Commit Message

Benjamin Poirier July 20, 2017, 11:44 p.m. UTC
On 2017/07/20 10:00, Lennart Sorensen wrote:
> On Wed, Jul 19, 2017 at 05:07:47PM -0700, Benjamin Poirier wrote:
> > Are you sure about this? In my testing, while triggering the overrun
> > with the msleep, I read ICR when entering e1000_msix_other() and RXO is
> > consistently set.
> 
> I had thousands of calls to e1000_msix_other where the only bit set
> was OTHER.
> 
> I don't know if the cause is overruns, it just seems plausible.
> 
> > I'm working on a patch that uses that fact to handle the situation and
> > limit the interrupt.
> 
> Excellent.
> 

Could you please test the following patch and let me know if it:
1) reduces the interrupt rate of the Other msi-x vector
2) avoids the link flaps
or
3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
In this case, please paste icr values printed.

Thanks

Comments

Lennart Sorensen July 21, 2017, 3:27 p.m. UTC | #1
On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

I will give it a try.

Another odd behaviour I see is that the driver will hang in
napi_synchronize on shutdown if there is traffic at the time (at least
I think that's the trigger, maybe the trigger is if there has been an
overload of traffic and the backlog in napi was used).

From doing some searching, this seems to be a problem that has plagued
some people for years with this driver.

I am having trouble figuring out exactly what napi_synchronize is waiting
for and who is supposed to toggle the flag it is waiting on.  The flag
appears to work backwards from what I would have expected it to do.
I see lots of places that can set the bit, but only napi_enable seems
to clear it again, and I don't see how that would get called for all
the places that potentially set the bit.
Lennart Sorensen July 21, 2017, 4:09 p.m. UTC | #2
On Fri, Jul 21, 2017 at 11:27:09AM -0400,  wrote:
> On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> > Could you please test the following patch and let me know if it:
> > 1) reduces the interrupt rate of the Other msi-x vector
> > 2) avoids the link flaps
> > or
> > 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> > In this case, please paste icr values printed.
> 
> I will give it a try.

So test looks excellent.  Seems to only get interrupts when link state
actually changes now.

> Another odd behaviour I see is that the driver will hang in
> napi_synchronize on shutdown if there is traffic at the time (at least
> I think that's the trigger, maybe the trigger is if there has been an
> overload of traffic and the backlog in napi was used).
> 
> From doing some searching, this seems to be a problem that has plagued
> some people for years with this driver.
> 
> I am having trouble figuring out exactly what napi_synchronize is waiting
> for and who is supposed to toggle the flag it is waiting on.  The flag
> appears to work backwards from what I would have expected it to do.
> I see lots of places that can set the bit, but only napi_enable seems
> to clear it again, and I don't see how that would get called for all
> the places that potentially set the bit.

I just realized NAPI_STATE_SCHED and NAPIF_STATE_SCHED are the same
thing and I need to look at both of those.

Still something seems odd in some corner case where napi gets stuck and
you can't close the port anymore due to napi_synchronize never being
able to finish.  Some traffic pattern causes that SCHED state bit to
get into the wrong state and nothing ever clears it.  Even managed to
see it get stuck so it never passed traffic again and hung on shutdown.
The napi poll was never called again.
Lennart Sorensen July 21, 2017, 7:02 p.m. UTC | #3
On Thu, Jul 20, 2017 at 04:44:55PM -0700, Benjamin Poirier wrote:
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.

By the way, while at fixing the e1000e, I just noticed that
if you are blasting the port with traffic when it comes up,
you risk getting a transmit queue time out, because the queue
is started before the carrier is up.  ixgbe already fixed that in
cdc04dcce0598fead6029a2f95e95a4d2ea419c2.  igb has the same problem (which
goes away by moving the queue start to the watchdog after carrier_on,
I just haven't got around to sending that patch yet).

I am going to try moving the queue start to the watchdog and try it again.

Trace looked like this:

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316 dev_watchdog+0x1f9/0x200
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
Modules linked in: dpi_drv(PO) ccu_util(PO) ipv4_mb(PO) l2bridge_config_util(PO) l2_config_util(PO) route_config_util(PO) qos_config_util(PO) sysapp_common(PO) chantry_fwd_eng_2800_config(PO) shim_module(PO) sadb_cc(PO) ipsecXformer(PO) libeCrypto(PO) ipmatch_cc(PO) l2h_cc(PO) ndproxy_cc(PO) arpint_cc(PO) portinfo_cc(PO) chantryqos_cc(PO) redirector_cc(PO) ix_ph(PO) fpm_core_cc(PO) pulse_cc(PO) vnstt_cc(PO) vnsap_cc(PO) fm_cc(PO) rutm_cc(PO) mutm_cc(PO) ethernet_tx_cc(PO) stkdrv_cc(PO) l2bridge_cc(PO) events_util(PO) sched_cc(PO) qm_cc(PO) ipv4_cc(PO) wred_cc(PO) tc_meter_cc(PO) dscp_classifier_cc(PO) classifier_6t_cc(PO) ent586_cc(PO) dev_cc_arp(PO) chantry_fwd_eng_2800_tables(PO) ether_arp_lib(PO) rtmv4_lib(PO) lkup_lib(PO) l2tm_lib(PO) fragmentation_lib(PO) properties_lib(PO) msg_support_lib(PO)
 utilities_lib(PO) cci_lib(PO) rm_lib(PO) libossl(O) vip(O) productSpec_x86_dp(PO) e1000e
CPU: 0 PID: 0 Comm: swapper/0 Tainted: P           O    4.9.24 #20
Hardware name: Supermicro X7SPA-HF/X7SPA-HF, BIOS 1.2a       06/23/12  
 0000000000000000 ffffffff811cef1b ffff88007fc03e88 0000000000000000
 ffffffff81037ade 0000000000000000 ffff88007fc03ed8 0000000000000001
 0000000000000000 0000000000000082 0000000000000001 ffffffff81037b4c
Call Trace:
 <IRQ> 
 [<ffffffff811cef1b>] ? dump_stack+0x46/0x5b
 [<ffffffff81037ade>] ? __warn+0xbe/0xe0
 [<ffffffff81037b4c>] ? warn_slowpath_fmt+0x4c/0x50
 [<ffffffff8107ac92>] ? mod_timer+0xf2/0x150
 [<ffffffff812ffe69>] ? dev_watchdog+0x1f9/0x200
 [<ffffffff812ffc70>] ? dev_graft_qdisc+0x70/0x70
 [<ffffffff8107aeb1>] ? call_timer_fn.isra.26+0x11/0x80
 [<ffffffff8107b048>] ? run_timer_softirq+0x128/0x150
 [<ffffffff8103b16b>] ? __do_softirq+0xeb/0x1f0
 [<ffffffff8103b365>] ? irq_exit+0x55/0x60
 [<ffffffff81024da9>] ? smp_apic_timer_interrupt+0x39/0x50
 [<ffffffff813ab19c>] ? apic_timer_interrupt+0x7c/0x90
 <EOI> 
 [<ffffffff813aa1e1>] ? mwait_idle+0x51/0x80
 [<ffffffff81067717>] ? cpu_startup_entry+0xa7/0x130
 [<ffffffff81663cf4>] ? start_kernel+0x306/0x30e
---[ end trace ee759b7a56e1110b ]---
Philip Prindeville July 24, 2017, 9:56 p.m. UTC | #4
> On Jul 20, 2017, at 5:44 PM, Benjamin Poirier <bpoirier@suse.com> wrote:
> 
> [snip]
> Could you please test the following patch and let me know if it:
> 1) reduces the interrupt rate of the Other msi-x vector
> 2) avoids the link flaps
> or
> 3) logs some dmesg warnings of the form "Other interrupt with unhandled [...]"
> In this case, please paste icr values printed.
> 
> Thanks
> 
> diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
> index 0641c0098738..afb7ebe20b24 100644
> --- a/drivers/net/ethernet/intel/e1000e/defines.h
> +++ b/drivers/net/ethernet/intel/e1000e/defines.h
> @@ -398,6 +398,7 @@
> #define E1000_ICR_LSC           0x00000004 /* Link Status Change */
> #define E1000_ICR_RXSEQ         0x00000008 /* Rx sequence error */
> #define E1000_ICR_RXDMT0        0x00000010 /* Rx desc min. threshold (0) */
> +#define E1000_ICR_RXO           0x00000040 /* Receiver Overrun */
> #define E1000_ICR_RXT0          0x00000080 /* Rx timer intr (ring 0) */
> #define E1000_ICR_ECCER         0x00400000 /* Uncorrectable ECC Error */
> /* If this bit asserted, the driver should claim the interrupt */
> diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
> index c7c994eb410e..f7b46eba3efb 100644
> --- a/drivers/net/ethernet/intel/e1000e/e1000.h
> +++ b/drivers/net/ethernet/intel/e1000e/e1000.h
> @@ -351,6 +351,10 @@ struct e1000_adapter {
> 	s32 ptp_delta;
> 
> 	u16 eee_advert;
> +
> +	unsigned int uh_count;
> +	u32 uh_values[16];
> +	unsigned int uh_values_nb;
> };
> 
> struct e1000_info {
> diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
> index b3679728caac..46697338c0e1 100644
> --- a/drivers/net/ethernet/intel/e1000e/netdev.c
> +++ b/drivers/net/ethernet/intel/e1000e/netdev.c
> @@ -46,6 +46,8 @@
> 
> #include "e1000.h"
> 
> +DEFINE_RATELIMIT_STATE(other_uh_ratelimit_state, HZ, 1);
> +
> #define DRV_EXTRAVERSION "-k"
> 
> #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
> @@ -1904,12 +1906,60 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
> 	struct net_device *netdev = data;
> 	struct e1000_adapter *adapter = netdev_priv(netdev);
> 	struct e1000_hw *hw = &adapter->hw;
> +	u32 icr;
> +	bool enable = true;
> +	bool handled = false;
> +	unsigned int i;
> 
> -	hw->mac.get_link_status = true;
> +	icr = er32(ICR);
> +	if (icr & E1000_ICR_RXO) {
> +		ew32(ICR, E1000_ICR_RXO);
> +		enable = false;
> +		/* napi poll will re-enable Other, make sure it runs */
> +		if (napi_schedule_prep(&adapter->napi)) {
> +			adapter->total_rx_bytes = 0;
> +			adapter->total_rx_packets = 0;
> +			__napi_schedule(&adapter->napi);
> +		}
> +		handled = true;
> +	}
> +	if (icr & E1000_ICR_LSC) {
> +		ew32(ICR, E1000_ICR_LSC);
> +		hw->mac.get_link_status = true;
> +		/* guard against interrupt when we're going down */
> +		if (!test_bit(__E1000_DOWN, &adapter->state)) {
> +			mod_timer(&adapter->watchdog_timer, jiffies + 1);
> +		}
> +		handled = true;
> +	}
> 
> -	/* guard against interrupt when we're going down */
> -	if (!test_bit(__E1000_DOWN, &adapter->state)) {
> -		mod_timer(&adapter->watchdog_timer, jiffies + 1);
> +	if (!handled) {
> +		adapter->uh_count++;
> +		/* only print unseen icr values */
> +		if (adapter->uh_values_nb < ARRAY_SIZE(adapter->uh_values)) {
> +			for (i = 0; i < ARRAY_SIZE(adapter->uh_values); i++) {
> +				if (adapter->uh_values[i] == icr) {
> +					break;
> +				}
> +			}
> +			if (i == ARRAY_SIZE(adapter->uh_values)) {
> +				adapter->uh_values[adapter->uh_values_nb] =
> +					icr;
> +				adapter->uh_values_nb++;
> +				netdev_warn(netdev,
> +					    "Other interrupt with unhandled icr 0x%08x\n",
> +					    icr);
> +			}
> +		}
> +	}
> +	if (adapter->uh_count && __ratelimit(&other_uh_ratelimit_state)) {
> +		netdev_warn(netdev,
> +			    "Other interrupt with unhandled cause, count %u\n",
> +			    adapter->uh_count);
> +		adapter->uh_count = 0;
> +	}
> +
> +	if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
> 		ew32(IMS, E1000_IMS_OTHER);
> 	}
> 
> @@ -2681,7 +2731,8 @@ static int e1000e_poll(struct napi_struct *napi, int weight)
> 		napi_complete_done(napi, work_done);
> 		if (!test_bit(__E1000_DOWN, &adapter->state)) {
> 			if (adapter->msix_entries)
> -				ew32(IMS, adapter->rx_ring->ims_val);
> +				ew32(IMS, adapter->rx_ring->ims_val |
> +				     E1000_IMS_OTHER);
> 			else
> 				e1000_irq_enable(adapter);
> 		}
> @@ -4197,7 +4248,7 @@ static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
> 	struct e1000_hw *hw = &adapter->hw;
> 
> 	if (adapter->msix_entries)
> -		ew32(ICS, E1000_ICS_OTHER);
> +		ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
> 	else
> 		ew32(ICS, E1000_ICS_LSC);
> }
> @@ -7572,6 +7623,8 @@ static int __init e1000_init_module(void)
> 		e1000e_driver_version);
> 	pr_info("Copyright(c) 1999 - 2015 Intel Corporation.\n");
> 
> +	ratelimit_set_flags(&other_uh_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
> +
> 	return pci_register_driver(&e1000_driver);
> }
> module_init(e1000_init_module);



We’ve been running this patch on a Lanner FW-7568 (Atom D525, 6x 82574L NICs) under heavy load both with and without RPS enabled and have yet to see a single link flap.

Is it going into linux-stable?

Thanks,

-Philip
diff mbox

Patch

diff --git a/drivers/net/ethernet/intel/e1000e/defines.h b/drivers/net/ethernet/intel/e1000e/defines.h
index 0641c0098738..afb7ebe20b24 100644
--- a/drivers/net/ethernet/intel/e1000e/defines.h
+++ b/drivers/net/ethernet/intel/e1000e/defines.h
@@ -398,6 +398,7 @@ 
 #define E1000_ICR_LSC           0x00000004 /* Link Status Change */
 #define E1000_ICR_RXSEQ         0x00000008 /* Rx sequence error */
 #define E1000_ICR_RXDMT0        0x00000010 /* Rx desc min. threshold (0) */
+#define E1000_ICR_RXO           0x00000040 /* Receiver Overrun */
 #define E1000_ICR_RXT0          0x00000080 /* Rx timer intr (ring 0) */
 #define E1000_ICR_ECCER         0x00400000 /* Uncorrectable ECC Error */
 /* If this bit asserted, the driver should claim the interrupt */
diff --git a/drivers/net/ethernet/intel/e1000e/e1000.h b/drivers/net/ethernet/intel/e1000e/e1000.h
index c7c994eb410e..f7b46eba3efb 100644
--- a/drivers/net/ethernet/intel/e1000e/e1000.h
+++ b/drivers/net/ethernet/intel/e1000e/e1000.h
@@ -351,6 +351,10 @@  struct e1000_adapter {
 	s32 ptp_delta;
 
 	u16 eee_advert;
+
+	unsigned int uh_count;
+	u32 uh_values[16];
+	unsigned int uh_values_nb;
 };
 
 struct e1000_info {
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index b3679728caac..46697338c0e1 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -46,6 +46,8 @@ 
 
 #include "e1000.h"
 
+DEFINE_RATELIMIT_STATE(other_uh_ratelimit_state, HZ, 1);
+
 #define DRV_EXTRAVERSION "-k"
 
 #define DRV_VERSION "3.2.6" DRV_EXTRAVERSION
@@ -1904,12 +1906,60 @@  static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
 	struct net_device *netdev = data;
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_hw *hw = &adapter->hw;
+	u32 icr;
+	bool enable = true;
+	bool handled = false;
+	unsigned int i;
 
-	hw->mac.get_link_status = true;
+	icr = er32(ICR);
+	if (icr & E1000_ICR_RXO) {
+		ew32(ICR, E1000_ICR_RXO);
+		enable = false;
+		/* napi poll will re-enable Other, make sure it runs */
+		if (napi_schedule_prep(&adapter->napi)) {
+			adapter->total_rx_bytes = 0;
+			adapter->total_rx_packets = 0;
+			__napi_schedule(&adapter->napi);
+		}
+		handled = true;
+	}
+	if (icr & E1000_ICR_LSC) {
+		ew32(ICR, E1000_ICR_LSC);
+		hw->mac.get_link_status = true;
+		/* guard against interrupt when we're going down */
+		if (!test_bit(__E1000_DOWN, &adapter->state)) {
+			mod_timer(&adapter->watchdog_timer, jiffies + 1);
+		}
+		handled = true;
+	}
 
-	/* guard against interrupt when we're going down */
-	if (!test_bit(__E1000_DOWN, &adapter->state)) {
-		mod_timer(&adapter->watchdog_timer, jiffies + 1);
+	if (!handled) {
+		adapter->uh_count++;
+		/* only print unseen icr values */
+		if (adapter->uh_values_nb < ARRAY_SIZE(adapter->uh_values)) {
+			for (i = 0; i < ARRAY_SIZE(adapter->uh_values); i++) {
+				if (adapter->uh_values[i] == icr) {
+					break;
+				}
+			}
+			if (i == ARRAY_SIZE(adapter->uh_values)) {
+				adapter->uh_values[adapter->uh_values_nb] =
+					icr;
+				adapter->uh_values_nb++;
+				netdev_warn(netdev,
+					    "Other interrupt with unhandled icr 0x%08x\n",
+					    icr);
+			}
+		}
+	}
+	if (adapter->uh_count && __ratelimit(&other_uh_ratelimit_state)) {
+		netdev_warn(netdev,
+			    "Other interrupt with unhandled cause, count %u\n",
+			    adapter->uh_count);
+		adapter->uh_count = 0;
+	}
+
+	if (enable && !test_bit(__E1000_DOWN, &adapter->state)) {
 		ew32(IMS, E1000_IMS_OTHER);
 	}
 
@@ -2681,7 +2731,8 @@  static int e1000e_poll(struct napi_struct *napi, int weight)
 		napi_complete_done(napi, work_done);
 		if (!test_bit(__E1000_DOWN, &adapter->state)) {
 			if (adapter->msix_entries)
-				ew32(IMS, adapter->rx_ring->ims_val);
+				ew32(IMS, adapter->rx_ring->ims_val |
+				     E1000_IMS_OTHER);
 			else
 				e1000_irq_enable(adapter);
 		}
@@ -4197,7 +4248,7 @@  static void e1000e_trigger_lsc(struct e1000_adapter *adapter)
 	struct e1000_hw *hw = &adapter->hw;
 
 	if (adapter->msix_entries)
-		ew32(ICS, E1000_ICS_OTHER);
+		ew32(ICS, E1000_ICS_LSC | E1000_ICS_OTHER);
 	else
 		ew32(ICS, E1000_ICS_LSC);
 }
@@ -7572,6 +7623,8 @@  static int __init e1000_init_module(void)
 		e1000e_driver_version);
 	pr_info("Copyright(c) 1999 - 2015 Intel Corporation.\n");
 
+	ratelimit_set_flags(&other_uh_ratelimit_state, RATELIMIT_MSG_ON_RELEASE);
+
 	return pci_register_driver(&e1000_driver);
 }
 module_init(e1000_init_module);