Message ID | 1416653807-4859-1-git-send-email-gwshan@linux.vnet.ibm.com |
---|---|
State | Deferred, archived |
Delegated to: | David Miller |
On 11/22/2014 12:56 PM, Gavin Shan wrote:
> The patch fixes couple of EEH recovery failures on PPC PowerNV
> platform:
>
>   * Release reserved memory regions in mlx4_pci_err_detected().
>     Otherwise, __mlx4_init_one() fails because of reserving
>     same memory regions recursively.
>   * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
>     pci_enable_device() in __mlx4_init_one() doesn't enable
>     the PCI device because it's already in enabled state indicated
>     by struct pci_dev::enable_cnt.
>   * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>     Otherwise, __mlx4_init_one() runs into kernel crash because
>     of dereferencing to NULL pointer.
>
> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
> PowerNV platform.
>
>   # lspci
>   0003:0f:00.0 Network controller: Mellanox Technologies \
>                MT27500 Family [ConnectX-3]
>
> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>

Hi Gavin,

Yishai (added to the CC) is few days before sending a patchset to fix
the reset flow and inside it there is a fix to EEH recovery.
I would be happy if you could wait for the whole reset flow fix by Yishai.

If you'd like, I can send you the patchset to try. Currently it is under
review inside Mellanox before being sent to the mailing list.

Thanks,
Amir
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Sun, Nov 23, 2014 at 06:21:47PM +0200, Amir Vadai wrote:
>On 11/22/2014 12:56 PM, Gavin Shan wrote:
>> The patch fixes couple of EEH recovery failures on PPC PowerNV
>> platform:
>>
>>   * Release reserved memory regions in mlx4_pci_err_detected().
>>     Otherwise, __mlx4_init_one() fails because of reserving
>>     same memory regions recursively.
>>   * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
>>     pci_enable_device() in __mlx4_init_one() doesn't enable
>>     the PCI device because it's already in enabled state indicated
>>     by struct pci_dev::enable_cnt.
>>   * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>>     Otherwise, __mlx4_init_one() runs into kernel crash because
>>     of dereferencing to NULL pointer.
>>
>> With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>> PowerNV platform.
>>
>>   # lspci
>>   0003:0f:00.0 Network controller: Mellanox Technologies \
>>                MT27500 Family [ConnectX-3]
>>
>> Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>
>Hi Gavin,
>
>Yishai (added to the CC) is few days before sending a patchset to fix
>the reset flow and inside it there is a fix to EEH recovery.
>I would be happy if you could wait for the whole reset flow fix by Yishai.
>

Yes, It's not urgent and I can wait. Thanks for the info.

>If you'd like, I can send you the patchset to try. Currently it is under
>review inside Mellanox before being sent to the mailing list.
>

It would be nice to send me the patchset for me to have a try.

Thanks,
Gavin

>Thanks,
>Amir

On Sat, Nov 22, 2014 at 09:56:47PM +1100, Gavin Shan wrote:

Yishai already had patches fixing the issue. So please ignore this patch
and drop it.

Thanks,
Gavin

>The patch fixes couple of EEH recovery failures on PPC PowerNV
>platform:
>
>  * Release reserved memory regions in mlx4_pci_err_detected().
>    Otherwise, __mlx4_init_one() fails because of reserving
>    same memory regions recursively.
>  * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
>    pci_enable_device() in __mlx4_init_one() doesn't enable
>    the PCI device because it's already in enabled state indicated
>    by struct pci_dev::enable_cnt.
>  * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
>    Otherwise, __mlx4_init_one() runs into kernel crash because
>    of dereferencing to NULL pointer.
>
>With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
>PowerNV platform.
>
>  # lspci
>  0003:0f:00.0 Network controller: Mellanox Technologies \
>               MT27500 Family [ConnectX-3]
>
>Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
>---
> drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>index 90de6e1..e118ac9 100644
>--- a/drivers/net/ethernet/mellanox/mlx4/main.c
>+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>@@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
> 	kfree(dev->caps.qp1_proxy);
> 	kfree(dev->dev_vfs);
>
>-	memset(priv, 0, sizeof(*priv));
> 	priv->pci_dev_data = pci_dev_data;
> 	priv->removed = 1;
> }
>@@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
> 					      pci_channel_state_t state)
> {
> 	mlx4_unload_one(pdev);
>+	pci_release_regions(pdev);
>+	pci_disable_device(pdev);
>
> 	return state == pci_channel_io_perm_failure ?
> 		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;
>--
>1.8.3.2
>

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 90de6e1..e118ac9 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2809,7 +2809,6 @@ static void mlx4_unload_one(struct pci_dev *pdev)
 	kfree(dev->caps.qp1_proxy);
 	kfree(dev->dev_vfs);
 
-	memset(priv, 0, sizeof(*priv));
 	priv->pci_dev_data = pci_dev_data;
 	priv->removed = 1;
 }
@@ -2900,6 +2899,8 @@ static pci_ers_result_t mlx4_pci_err_detected(struct pci_dev *pdev,
 					      pci_channel_state_t state)
 {
 	mlx4_unload_one(pdev);
+	pci_release_regions(pdev);
+	pci_disable_device(pdev);
 
 	return state == pci_channel_io_perm_failure ?
 		PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_NEED_RESET;

The patch fixes couple of EEH recovery failures on PPC PowerNV
platform:

  * Release reserved memory regions in mlx4_pci_err_detected().
    Otherwise, __mlx4_init_one() fails because of reserving
    same memory regions recursively.
  * Disable PCI device in mlx4_pci_err_detected(). Otherwise,
    pci_enable_device() in __mlx4_init_one() doesn't enable
    the PCI device because it's already in enabled state indicated
    by struct pci_dev::enable_cnt.
  * Don't clear struct mlx4_priv instance in mlx4_pci_err_detected().
    Otherwise, __mlx4_init_one() runs into kernel crash because
    of dereferencing to NULL pointer.

With the patch applied, EEH recovery for mlx4 adapter succeeds on PPC
PowerNV platform.

  # lspci
  0003:0f:00.0 Network controller: Mellanox Technologies \
               MT27500 Family [ConnectX-3]

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)
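
The second item in the commit message relies on pci_enable_device() being reference counted through struct pci_dev::enable_cnt. The following is a minimal user-space sketch of that behaviour, not kernel code: `struct fake_pci_dev`, the `fake_*` helpers and `hw_enabled_after_recovery()` are hypothetical names made up for illustration, and the assumption that a slot reset clears the hardware enable state while leaving the counter untouched is a simplification.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for struct pci_dev (not the kernel type). */
struct fake_pci_dev {
    int enable_cnt;  /* models pci_dev::enable_cnt */
    int hw_enabled;  /* 1 while the simulated hardware is programmed as enabled */
};

/* Simplified model of pci_enable_device(): only the first caller touches hardware. */
int fake_pci_enable_device(struct fake_pci_dev *pdev)
{
    if (++pdev->enable_cnt > 1)
        return 0;            /* counter says "already enabled": nothing is done */
    pdev->hw_enabled = 1;
    return 0;
}

/* Simplified model of pci_disable_device(): only the last caller touches hardware. */
void fake_pci_disable_device(struct fake_pci_dev *pdev)
{
    assert(pdev->enable_cnt > 0);
    if (--pdev->enable_cnt > 0)
        return;
    pdev->hw_enabled = 0;
}

/* Assumption: a slot reset wipes hardware state but knows nothing about the counter. */
void fake_slot_reset(struct fake_pci_dev *pdev)
{
    pdev->hw_enabled = 0;
}

/* Walks probe -> error detected -> slot reset -> re-init and reports the result. */
int hw_enabled_after_recovery(int disable_in_err_detected)
{
    struct fake_pci_dev pdev = { 0, 0 };

    fake_pci_enable_device(&pdev);           /* initial probe */
    if (disable_in_err_detected)
        fake_pci_disable_device(&pdev);      /* the call the patch adds */
    fake_slot_reset(&pdev);                  /* platform resets the slot */
    fake_pci_enable_device(&pdev);           /* re-init path after the reset */
    return pdev.hw_enabled;
}
```

In this model, `hw_enabled_after_recovery(0)` returns 0 because the second enable sees a non-zero counter and returns early, while `hw_enabled_after_recovery(1)` returns 1 because the disable call brought the counter back to zero first.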