Patchwork [net-next] tg3: Prevent system hang during repeated EEH errors.

Submitter Nithin Sujir
Date June 14, 2013, 9:15 p.m.
Message ID <1371244506-18969-1-git-send-email-nsujir@broadcom.com>
Permalink /patch/251537/
State Superseded
Delegated to: David Miller

Comments

Nithin Sujir - June 14, 2013, 9:15 p.m.
From: Michael Chan <mchan@broadcom.com>

The current tg3 code assumes that the pci_error_handlers callbacks are
always called in sequence.  In particular, during ->error_detected(), NAPI
is disabled and the device is shut down.  The device is later reset and
NAPI re-enabled in ->slot_reset() and ->resume().

In EEH, if more than 6 errors are detected in an hour, only
->error_detected() will be called.  This will leave the driver in an
inconsistent state as NAPI is disabled but netif_running state is still
true.  When the device is later closed, we'll try to disable NAPI again
and it will loop forever.

We fix this by closing the device if we encounter any error conditions
during the normal sequence of the pci_error_handlers.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Signed-off-by: Nithin Nayak Sujir <nsujir@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)
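
The hang described in the commit message can be illustrated with a minimal userspace sketch in C. This is not kernel code. It assumes that napi_disable() works by waiting to take a "scheduled" flag that only napi_enable() releases; the names model_napi, model_napi_disable and model_napi_enable are hypothetical, and the max_tries bound exists only so the model can report the hang instead of reproducing it.

```c
#include <stdbool.h>

/* Simplified userspace model, NOT kernel code.  One flag stands in for
 * the NAPI "scheduled" bit that a disable call takes and keeps holding. */
struct model_napi {
	bool sched;
};

/* Models napi_disable(): retry until the flag can be taken.  The model
 * assumes the real function retries without limit; max_tries is a
 * hypothetical bound so the caller can detect the would-be hang.
 * Returns true if the disable completed, false if it would still wait. */
bool model_napi_disable(struct model_napi *n, int max_tries)
{
	for (int i = 0; i < max_tries; i++) {
		if (!n->sched) {
			n->sched = true;	/* taken, and held until enable */
			return true;
		}
	}
	return false;
}

/* Models napi_enable(): release the flag so a later disable can succeed. */
void model_napi_enable(struct model_napi *n)
{
	n->sched = false;
}
```

In this model, a second model_napi_disable() with no model_napi_enable() in between never succeeds, which corresponds to the close path after a lone ->error_detected().  Calling model_napi_enable() first, as the patch does with tg3_napi_enable() before dev_close(), lets the disable complete.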
Benjamin Poirier - June 17, 2013, 6:28 p.m.
On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
[...]
> @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
>  	rc = PCI_ERS_RESULT_RECOVERED;
>  
>  done:
> +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> +		tg3_napi_enable(tp);
> +		dev_close(netdev);
> +	}
>  	rtnl_unlock();
>  
>  	return rc;
> @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
>  	if (err) {
>  		tg3_full_unlock(tp);
>  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> +		tg3_napi_enable(tp);
> +		dev_close(netdev);
>  		goto done;
>  	}

Are these two hunks needed?
1) These functions do not call tg3_netif_stop() or tg3_napi_disable()
2) an error in tg3_io_resume() does not trigger device removal in
handle_eeh_events(). In fact the ->resume callback has no return value.

>  
> -- 
> 1.8.1.4
> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Benjamin Poirier - June 17, 2013, 6:56 p.m.
On 2013/06/17 14:28, Benjamin Poirier wrote:
> On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
> [...]
> > @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
> >  	rc = PCI_ERS_RESULT_RECOVERED;
> >  
> >  done:
> > +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> > +	}
> >  	rtnl_unlock();
> >  
> >  	return rc;
> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}
> 
> Are these two hunks needed?
> 1) These functions do not call tg3_netif_stop() or tg3_napi_disable()

Ok, I see why this is relevant, since the slot_reset and resume
callbacks are always called after the error_detected callback.

> 2) an error in tg3_io_resume() does not trigger device removal in
> handle_eeh_events(). In fact the ->resume callback has no return value.

Nevertheless, this hunk

> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}

duplicates the error handling code already in tg3_restart_hw().

Michael Chan - June 17, 2013, 6:59 p.m.
On Mon, 2013-06-17 at 14:28 -0400, Benjamin Poirier wrote: 
> On 2013/06/14 14:15, Nithin Nayak Sujir wrote:
> [...]
> > @@ -17796,6 +17799,10 @@ static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
> >  	rc = PCI_ERS_RESULT_RECOVERED;
> >  
> >  done:
> > +	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> > +	}
> >  	rtnl_unlock();
> >  
> >  	return rc;
> > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> >  	if (err) {
> >  		tg3_full_unlock(tp);
> >  		netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > +		tg3_napi_enable(tp);
> > +		dev_close(netdev);
> >  		goto done;
> >  	}
> 
> Are these two hunks needed?
> 1) These functions do not call tg3_netif_stop() or tg3_napi_disable()
> 2) an error in tg3_io_resume() does not trigger device removal in
> handle_eeh_events(). In fact the ->resume callback has no return value.
> 

The normal sequence is:

error_detected(), slot_reset(), resume()

In error_detected(), the chip will be shut down and NAPI will be disabled
if netif_running state is true.  When everything works correctly, the chip
will be re-enabled in resume() and NAPI re-enabled.  If we run into any
error in this sequence, the sequence will not complete normally.  In
this case, if netif_running state is true, we know that NAPI has been
disabled earlier in error_detected(), and we need to properly close the
device.
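
The sequence above can be sketched as a small userspace state model in C. This is an illustration under stated assumptions, not the driver code: struct model_dev and every model_* function are hypothetical stand-ins, the reset_ok parameter simulates whether the slot reset succeeded, and model_close() returns false where the real close path would wait forever.

```c
#include <stdbool.h>

/* Simplified userspace model, NOT driver code. */
struct model_dev {
	bool running;		/* stands in for netif_running() */
	bool napi_enabled;	/* stands in for the NAPI state */
};

/* error_detected(): NAPI is disabled only if the interface is up. */
void model_error_detected(struct model_dev *d)
{
	if (d->running)
		d->napi_enabled = false;
}

/* Close path: it disables NAPI itself, so NAPI must be enabled on entry.
 * Returns false for the case where the real code would never return. */
bool model_close(struct model_dev *d)
{
	if (!d->napi_enabled)
		return false;
	d->napi_enabled = false;
	d->running = false;
	return true;
}

/* slot_reset() with the cleanup from the patch: on failure, re-enable
 * NAPI first so that the close can disable it again. */
bool model_slot_reset(struct model_dev *d, bool reset_ok)
{
	if (!reset_ok && d->running) {
		d->napi_enabled = true;	/* models tg3_napi_enable() */
		model_close(d);		/* models dev_close() */
	}
	return reset_ok;
}

/* resume(): the normal end of the sequence re-enables NAPI. */
void model_resume(struct model_dev *d)
{
	if (d->running)
		d->napi_enabled = true;
}
```

In the model, a failed model_slot_reset() leaves the device closed with NAPI disabled exactly once, while calling model_close() directly after model_error_detected() reports the hang case.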




Michael Chan - June 17, 2013, 7:11 p.m.
On Mon, 2013-06-17 at 14:56 -0400, Benjamin Poirier wrote:
> Nevertheless, this hunk
> 
> > > @@ -17826,6 +17833,8 @@ static void tg3_io_resume(struct pci_dev *pdev)
> > >     if (err) {
> > >             tg3_full_unlock(tp);
> > >             netdev_err(netdev, "Cannot restart hardware after reset.\n");
> > > +           tg3_napi_enable(tp);
> > > +           dev_close(netdev);
> > >             goto done;
> > >     }
> 
> duplicates the error handling code already in tg3_restart_hw(). 

Very good point.  We'll modify the patch and re-send.  Thanks a lot
Benjamin.



Patch

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 28a645f..bfe1831 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -17747,10 +17747,13 @@  static pci_ers_result_t tg3_io_error_detected(struct pci_dev *pdev,
 	tg3_full_unlock(tp);
 
 done:
-	if (state == pci_channel_io_perm_failure)
+	if (state == pci_channel_io_perm_failure) {
+		tg3_napi_enable(tp);
+		dev_close(netdev);
 		err = PCI_ERS_RESULT_DISCONNECT;
-	else
+	} else {
 		pci_disable_device(pdev);
+	}
 
 	rtnl_unlock();
 
@@ -17796,6 +17799,10 @@  static pci_ers_result_t tg3_io_slot_reset(struct pci_dev *pdev)
 	rc = PCI_ERS_RESULT_RECOVERED;
 
 done:
+	if (rc != PCI_ERS_RESULT_RECOVERED && netif_running(netdev)) {
+		tg3_napi_enable(tp);
+		dev_close(netdev);
+	}
 	rtnl_unlock();
 
 	return rc;
@@ -17826,6 +17833,8 @@  static void tg3_io_resume(struct pci_dev *pdev)
 	if (err) {
 		tg3_full_unlock(tp);
 		netdev_err(netdev, "Cannot restart hardware after reset.\n");
+		tg3_napi_enable(tp);
+		dev_close(netdev);
 		goto done;
 	}