diff mbox

[3/5] net: mvneta: do not schedule in mvneta_tx_timeout

Message ID 1389548333.3720.73.camel@deadeye.wl.decadent.org.uk
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Ben Hutchings Jan. 12, 2014, 5:38 p.m. UTC
[Putting another hat on]

On Sun, 2014-01-12 at 17:55 +0100, Willy Tarreau wrote:
> Hi Ben,
> 
> On Sun, Jan 12, 2014 at 04:49:51PM +0000, Ben Hutchings wrote:
> (...)
> > > So for now, let's simply ignore these timeouts generally caused by bugs
> > > only.
> > 
> > No, don't ignore them.  Schedule a work item to reset the device.  (And
> > remember to cancel it when stopping the device.)
> 
> OK I can try to do that. Could you recommend me one driver which does this
> successfully so that I can see exactly what needs to be taken care of ?

sfc does it, though the reset logic there is more complicated than you
would need.

I think this will DTRT, but it's compile-tested only.  I have been given
an OpenBlocks AX3 but haven't set it up yet.

Ben.

---
mvneta: Defer restart from TX watchdog handler to work item

A restart requires sleeping, but the watchdog handler runs in atomic
context.

The work item can race with down-ing of the interface, so take the RTNL
lock and do nothing if the interface is already down.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
---

Comments

Willy Tarreau Jan. 12, 2014, 10:14 p.m. UTC | #1
Hi Ben,

On Sun, Jan 12, 2014 at 05:38:53PM +0000, Ben Hutchings wrote:
> [Putting another hat on]
> 
> On Sun, 2014-01-12 at 17:55 +0100, Willy Tarreau wrote:
> > Hi Ben,
> > 
> > On Sun, Jan 12, 2014 at 04:49:51PM +0000, Ben Hutchings wrote:
> > (...)
> > > > So for now, let's simply ignore these timeouts generally caused by bugs
> > > > only.
> > > 
> > > No, don't ignore them.  Schedule a work item to reset the device.  (And
> > > remember to cancel it when stopping the device.)
> > 
> > OK I can try to do that. Could you recommend me one driver which does this
> > successfully so that I can see exactly what needs to be taken care of ?
> 
> sfc does it, though the reset logic there is more complicated than you
> would need.

OK.

> I think this will DTRT, but it's compile-tested only.

OK, I'll test it ASAP. I think I can force the tx timeout by disabling
the link state detection and unplugging the cable during a transfer.

> I have been given an OpenBlocks AX3 but haven't set it up yet.

Ah you're another lucky owner of this really great device :-)

I've sent Eric Leblond a complete howto in french, so it won't be of
a big use to you but if I can find some time and you don't find other
info, I can try to redo it in english. However you may be interested
in this article I put online with a few patches to make your life
easier :

   http://1wt.eu/articles/openblocks-http-server/

It's a line-rate (1.488 Mpps) HTTP server I've done on it with a few
patches that may be useful for other network tests.

Cheers,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Willy Tarreau Jan. 14, 2014, 3:33 p.m. UTC | #2
Hi Ben,

On Sun, Jan 12, 2014 at 05:38:53PM +0000, Ben Hutchings wrote:
> I think this will DTRT, but it's compile-tested only.  I have been given
> an OpenBlocks AX3 but haven't set it up yet.

OK I just managed to test your patch. I managed to force a Tx timeout by
forcing the link to 100/half and transfering 1000 concurrent streams.

Unfortunately for now the patch doesn't manage to recover, and the system
randomly panics one or two seconds after the link is brought up. Twice the
system did not panic but I lost all communications until a down/up cycle,
after which a panic happened during transfers.

However I could verify that the scheduled function is correctly called. I
suspect that something else might be wrong in the driver's reset sequence
(eg: unmapping pages still in use by the NIC or I don't know what), but
your patch does exactly what it's supposed to do.

At least, if the restart function does not do anything, everything works
fine. I see that the function is called (I added printk there) and the
transfer is not perturbated at all anymore.

So now I'm wondering whether the right thing should not be to just keep
your scheduled function and make it only log that a timeout was caught.

Another point which bothers me is that I suspect we're triggering Tx
timeouts too fast, because I regularly get these on 100 Mbps during
regular traffic (which ended up in immediate panics with previous code).

Thanks,
Willy

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index d5f0d72..a91dcc2 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -234,6 +234,7 @@  struct mvneta_port {
 	struct mvneta_tx_queue *txqs;
 	struct timer_list tx_done_timer;
 	struct net_device *dev;
+	struct work_struct restart_work;
 
 	u32 cause_rx_tx;
 	struct napi_struct napi;
@@ -2231,8 +2232,22 @@  static void mvneta_tx_timeout(struct net_device *dev)
 	struct mvneta_port *pp = netdev_priv(dev);
 
 	netdev_info(dev, "tx timeout\n");
-	mvneta_stop_dev(pp);
-	mvneta_start_dev(pp);
+
+	/* defer mvneta_restart(), as we're in atomic context here */
+	schedule_work(&pp->restart_work);
+}
+
+static void mvneta_restart(struct work_struct *work)
+{
+	struct mvneta_port *pp =
+		container_of(work, struct mvneta_port, restart_work);
+
+	rtnl_lock();
+	if (netif_running(pp->dev)) {
+		mvneta_stop_dev(pp);
+		mvneta_start_dev(pp);
+	}
+	rtnl_unlock();
 }
 
 /* Return positive if MTU is valid */
@@ -2792,6 +2807,8 @@  static int mvneta_probe(struct platform_device *pdev)
 
 	pp = netdev_priv(dev);
 
+	INIT_WORK(&pp->restart_work, mvneta_restart);
+
 	u64_stats_init(&pp->tx_stats.syncp);
 	u64_stats_init(&pp->rx_stats.syncp);
 
@@ -2890,6 +2907,7 @@  static int mvneta_remove(struct platform_device *pdev)
 	struct mvneta_port *pp = netdev_priv(dev);
 
 	unregister_netdev(dev);
+	cancel_work_sync(&pp->restart_work);
 	mvneta_deinit(pp);
 	clk_disable_unprepare(pp->clk);
 	iounmap(pp->base);