
[net-next] bnx2x: avoid soft lockup in bnx2x_poll()

Message ID 1449582880.7632.19.camel@edumazet-glaptop2.roam.corp.google.com
State Accepted, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Dec. 8, 2015, 1:54 p.m. UTC
From: Eric Dumazet <edumazet@google.com>

Under heavy TX load, bnx2x_poll() can loop forever and trigger
soft lockup bugs.

A napi poll handler must yield after one TX completion round,
risk of livelock is too high otherwise.

Bug is very easy to trigger using a debug build, and udp flood, because
of added cpu cycles in TX completion, and we do not receive enough
packets to break the loop.

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ariel Elior <ariel.elior@qlogic.com>
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |   51 +++++---------
 1 file changed, 21 insertions(+), 30 deletions(-)
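
The rule stated in the changelog (one TX completion round per poll, then yield)
corresponds to the usual budget-driven NAPI poll skeleton. The following is a
minimal sketch of that general pattern, using hypothetical helper names
(foo_clean_tx(), foo_clean_rx(), foo_irq_enable()); it illustrates the idiom
only and is not the bnx2x code itself.

struct foo_priv {
	struct napi_struct napi;
	/* device state, rings, etc. */
};

static int foo_poll(struct napi_struct *napi, int budget)
{
	struct foo_priv *priv = container_of(napi, struct foo_priv, napi);
	int work_done;

	/* Exactly one TX completion round per poll call, no looping. */
	foo_clean_tx(priv);

	/* Process at most 'budget' RX packets. */
	work_done = foo_clean_rx(priv, budget);

	if (work_done < budget) {
		/* All pending work handled: leave polling mode and
		 * re-enable device interrupts.
		 */
		napi_complete_done(napi, work_done);
		foo_irq_enable(priv);
	}

	/* Returning 'budget' keeps NAPI polling; returning less gives
	 * other softirqs (and eventually ksoftirqd) a chance to run.
	 */
	return work_done;
}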




Comments

Eric Dumazet Dec. 8, 2015, 6:03 p.m. UTC | #1
On Tue, 2015-12-08 at 17:02 +0000, Yuval Mintz wrote:
> > Under heavy TX load, bnx2x_poll() can loop forever and trigger soft lockup bugs.
> > 
> > A napi poll handler must yield after one TX completion round, risk of livelock is
> > too high otherwise.
> > 
> > Bug is very easy to trigger using a debug build, and udp flood, because of added
> > cpu cycles in TX completion, and we do not receive enough packets to break the
> > loop.
> 
> Eric - I understand what you're doing and it looks fine [to me, at least].
> Out of curiosity, do you know whether removing the loop damages any
> other flow, i.e., by slowing the transmitter as transmission rings get
> filled completely between consecutive NAPI runs?

I saw no downsides yet. Most of the time TX is blocked by BQL these
days, before the TX ring itself fills completely.
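
BQL here is Byte Queue Limits: the transmit path reports the bytes it queues
and the TX completion path reports the bytes it frees, so the stack stops the
queue well before the hardware ring itself fills. A hedged sketch of that
pairing, using hypothetical driver names and a single queue for simplicity
(not bnx2x code):

/* Hypothetical single-queue driver, for illustration only. */
static netdev_tx_t foo_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	/* ... post skb to the hardware TX ring (omitted) ... */
	netdev_tx_sent_queue(netdev_get_tx_queue(dev, 0), skb->len);
	return NETDEV_TX_OK;
}

static void foo_tx_reclaim(struct net_device *dev, unsigned int pkts,
			   unsigned int bytes)
{
	/* ... reclaim completed TX descriptors (omitted) ... */
	netdev_tx_completed_queue(netdev_get_tx_queue(dev, 0), pkts, bytes);
}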

I added some instrumentation, and even with the patch and a non-debug
kernel we can see:

bnx2x: bnx2x_poll() took 455036 nsec

455 usec for one bnx2x_poll() is already quite big, but one can tweak the
TX ring (ethtool -G eth1 tx ....) if latencies are a serious concern.

Note that my patch should not slow transmitters, only give other softirqs
a chance to be serviced, and eventually hand control to ksoftirqd under
stress.
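
The instrumentation mentioned here is not part of the posted patch; a
hypothetical version could be as small as a wrapper around the poll routine
(registered with netif_napi_add() in place of bnx2x_poll()) that logs slow
invocations:

static int bnx2x_poll_timed(struct napi_struct *napi, int budget)
{
	ktime_t start = ktime_get();
	int work_done = bnx2x_poll(napi, budget);
	s64 delta_ns = ktime_to_ns(ktime_sub(ktime_get(), start));

	/* Log poll invocations longer than 100 usec (arbitrary threshold),
	 * producing lines like the one quoted above.
	 */
	if (delta_ns > 100 * NSEC_PER_USEC)
		pr_info("bnx2x: bnx2x_poll() took %lld nsec\n", delta_ns);

	return work_done;
}

The ethtool -G suggestion helps because a smaller TX ring bounds how many
completions a single round can find, which keeps individual poll calls short
when latency matters.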


David Miller Dec. 9, 2015, 3:55 a.m. UTC | #2
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 08 Dec 2015 05:54:40 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> Under heavy TX load, bnx2x_poll() can loop forever and trigger
> soft lockup bugs.
> 
> A napi poll handler must yield after one TX completion round,
> risk of livelock is too high otherwise.
> 
> Bug is very easy to trigger using a debug build, and udp flood, because
> of added cpu cycles in TX completion, and we do not receive enough
> packets to break the loop.
> 
> Reported-by: Willem de Bruijn <willemb@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Indeed, no driver should ever do more than one TX completion round
per NAPI poll call, _ever_.
Yuval Mintz Dec. 9, 2015, 10:41 a.m. UTC | #3
> > > Under heavy TX load, bnx2x_poll() can loop forever and trigger soft
> > > lockup bugs.
> > >
> > > A napi poll handler must yield after one TX completion round, risk
> > > of livelock is too high otherwise.
> > >
> > > Bug is very easy to trigger using a debug build, and udp flood,
> > > because of added cpu cycles in TX completion, and we do not receive
> > > enough packets to break the loop.
> >
> > Eric - I understand what you're doing and it looks fine [to me, at least].
> > Out of curiosity, do you know whether removing the loop damages any
> > other flow, i.e., by slowing the transmitter as transmission rings get
> > filled completely between consecutive NAPI runs?
>
> I saw no downsides yet. Most of the time TX is blocked by BQL these
> days, before the TX ring itself fills completely.
>
> I added some instrumentation, and even with the patch and a non-debug
> kernel we can see:
>
> bnx2x: bnx2x_poll() took 455036 nsec
>
> 455 usec for one bnx2x_poll() is already quite big, but one can tweak the
> TX ring (ethtool -G eth1 tx ....) if latencies are a serious concern.
>
> Note that my patch should not slow transmitters, only give other softirqs
> a chance to be serviced, and eventually hand control to ksoftirqd under
> stress.

Cool. Thanks Eric.

Acked-by: Yuval Mintz <Yuval.Mintz@qlogic.com>

Patch

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index f6634a524153..d7cc2aa7c37a 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -3204,42 +3204,32 @@  int bnx2x_set_power_state(struct bnx2x *bp, pci_power_t state)
  */
 static int bnx2x_poll(struct napi_struct *napi, int budget)
 {
-	int work_done = 0;
-	u8 cos;
 	struct bnx2x_fastpath *fp = container_of(napi, struct bnx2x_fastpath,
 						 napi);
 	struct bnx2x *bp = fp->bp;
+	int rx_work_done;
+	u8 cos;
 
-	while (1) {
 #ifdef BNX2X_STOP_ON_ERROR
-		if (unlikely(bp->panic)) {
-			napi_complete(napi);
-			return 0;
-		}
+	if (unlikely(bp->panic)) {
+		napi_complete(napi);
+		return 0;
+	}
 #endif
-		for_each_cos_in_tx_queue(fp, cos)
-			if (bnx2x_tx_queue_has_work(fp->txdata_ptr[cos]))
-				bnx2x_tx_int(bp, fp->txdata_ptr[cos]);
-
-		if (bnx2x_has_rx_work(fp)) {
-			work_done += bnx2x_rx_int(fp, budget - work_done);
-
-			/* must not complete if we consumed full budget */
-			if (work_done >= budget)
-				break;
-		}
+	for_each_cos_in_tx_queue(fp, cos)
+		if (bnx2x_tx_queue_has_work(fp->txdata_ptr[cos]))
+			bnx2x_tx_int(bp, fp->txdata_ptr[cos]);
 
-		/* Fall out from the NAPI loop if needed */
-		if (!(bnx2x_has_rx_work(fp) || bnx2x_has_tx_work(fp))) {
+	rx_work_done = (bnx2x_has_rx_work(fp)) ? bnx2x_rx_int(fp, budget) : 0;
 
-			/* No need to update SB for FCoE L2 ring as long as
-			 * it's connected to the default SB and the SB
-			 * has been updated when NAPI was scheduled.
-			 */
-			if (IS_FCOE_FP(fp)) {
-				napi_complete(napi);
-				break;
-			}
+	if (rx_work_done < budget) {
+		/* No need to update SB for FCoE L2 ring as long as
+		 * it's connected to the default SB and the SB
+		 * has been updated when NAPI was scheduled.
+		 */
+		if (IS_FCOE_FP(fp)) {
+			napi_complete(napi);
+		} else {
 			bnx2x_update_fpsb_idx(fp);
 			/* bnx2x_has_rx_work() reads the status block,
 			 * thus we need to ensure that status block indices
@@ -3264,12 +3254,13 @@  static int bnx2x_poll(struct napi_struct *napi, int budget)
 				bnx2x_ack_sb(bp, fp->igu_sb_id, USTORM_ID,
 					     le16_to_cpu(fp->fp_hc_idx),
 					     IGU_INT_ENABLE, 1);
-				break;
+			} else {
+				rx_work_done = budget;
 			}
 		}
 	}
 
-	return work_done;
+	return rx_work_done;
 }
 
 /* we split the first BD into headers and data BDs
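
The completion path in the rewritten function follows the common NAPI re-arm
idiom: refresh the completion index, re-check for work, and only complete NAPI
and re-enable the interrupt if nothing raced in; otherwise return the full
budget so the core calls the poll routine again. A generic sketch of that
idiom, with hypothetical helpers rather than the actual bnx2x functions:

	if (work_done < budget) {
		foo_refresh_completion_index(priv);
		/* Order the index update above against the work check below,
		 * mirroring the status-block comment in the patch.
		 */
		smp_rmb();

		if (!foo_has_work(priv)) {
			napi_complete_done(napi, work_done);
			foo_irq_enable(priv);
		} else {
			/* New work raced in: stay in polling mode. */
			work_done = budget;
		}
	}
	return work_done;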