Message ID | 1401619783-23659-1-git-send-email-ogerlitz@mellanox.com |
---|---|
State | Changes Requested, archived |
Delegated to: | David Miller |
Hello.

On 06/01/2014 02:49 PM, Or Gerlitz wrote:

> From: Jack Morgenstein <jackm@dev.mellanox.co.il>

> Commit befdf89 did not take into account the case where the Host

   Please also specify that commit's summary line in parens.

> driver is being unloaded. In this case, pci_get_drvdata for the VF
> remove_one call may return NULL, so that dereferencing the priv
> struct results in a kernel oops.

> The fix is to also test that the dev pointer returned by
> pci_get_drvdata is non-NULL.

> Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()")
> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>

WBR, Sergei

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Sun, Jun 1, 2014 at 7:41 PM, Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> wrote:
> On 06/01/2014 02:49 PM, Or Gerlitz wrote:
>> Commit befdf89 did not take into account the case where the Host
>
> Please also specify that commit's summary line in parens.

Did that below, see where we say
Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()")

>> driver is being unloaded. In this case, pci_get_drvdata for the VF
>> remove_one call may return NULL, so that dereferencing the priv
>> struct results in a kernel oops.
>> The fix is to also test that the dev pointer returned by
>> pci_get_drvdata is non-NULL.
>> Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()")
>> Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
>> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
On Sun, Jun 01, 2014 at 01:49:43PM +0300, Or Gerlitz wrote:
>From: Jack Morgenstein <jackm@dev.mellanox.co.il>
>
>Commit befdf89 did not take into account the case where the Host
>driver is being unloaded. In this case, pci_get_drvdata for the VF

As I understand it, unloading the PF's driver while there are live VFs is not
allowed. Quoting the driver code:

	/* in SRIOV it is not allowed to unload the pf's
	 * driver while there are alive vf's */
	if (mlx4_is_master(dev) && mlx4_how_many_lives_vf(dev))
		printk(KERN_ERR "Removing PF when there are assigned VF's !!!\n");

Actually, I don't understand this restriction clearly. Maybe my understanding
of an alive VF is not correct.

Also, in your code, unloading the PF's driver calls pci_disable_sriov(), which
destroys the VFs. In your test, is the VF's driver still there?

>remove_one call may return NULL, so that dereferencing the priv
>struct results in a kernel oops.

Sorry, I still can't understand this situation. Could you describe it in more
detail? Are you unloading the PF's driver in the host first, and then trying
to release the VF's driver?

>
>The fix is to also test that the dev pointer returned by
>pci_get_drvdata is non-NULL.
>
>Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()")
>Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
>---
> drivers/net/ethernet/mellanox/mlx4/main.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
>diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
>index c187d74..a6ae089 100644
>--- a/drivers/net/ethernet/mellanox/mlx4/main.c
>+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
>@@ -2629,7 +2629,7 @@ static void __mlx4_remove_one(struct pci_dev *pdev)
> 	int pci_dev_data;
> 	int p;
>
>-	if (priv->removed)
>+	if (!dev || priv->removed)
> 		return;

This fix looks good to me.

As I remember, I had this check in my first version, but I removed the check
on dev based on Bjorn's suggestion, since I agreed that there was no chance
for dev to be NULL. Bjorn, it seems we were not correct :(

>
> 	pci_dev_data = priv->pci_dev_data;
>--
>1.7.1
On Mon, Jun 2, 2014 at 8:29 AM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote: > On Sun, Jun 01, 2014 at 01:49:43PM +0300, Or Gerlitz wrote: >>From: Jack Morgenstein <jackm@dev.mellanox.co.il> >> >>Commit befdf89 did not take into account the case where the Host >>driver is being unloaded. In this case, pci_get_drvdata for the VF > > In my mind, unloading PF's driver when there is alive VFs is not allowed. > Quoted in driver code: > > /* in SRIOV it is not allowed to unload the pf's > * driver while there are alive vf's */ > if (mlx4_is_master(dev) && mlx4_how_many_lives_vf(dev)) > printk(KERN_ERR "Removing PF when there are assigned VF's !!!\n"); > > Actually, I don't understand this restriction clearly. Maybe my understanding > of alive VF is not correct. > > And in your code, unload PF's driver would call pci_disable_sriov() which will > destroy the VFs. While in your test, the VF's driver is still there? > >>remove_one call may return NULL, so that dereferencing the priv >>struct results in a kernel oops. > > Sorry for my poor mind, I still can't understand this situation. > Would you describe the situation more? You are unloading PF's driver in Host > at first, and then try to release the VF's driver? > >> >>The fix is to also test that the dev pointer returned by >>pci_get_drvdata is non-NULL. 
>> >>Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()") >>Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> >>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> >>--- >> drivers/net/ethernet/mellanox/mlx4/main.c | 2 +- >> 1 files changed, 1 insertions(+), 1 deletions(-) >> >>diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c >>index c187d74..a6ae089 100644 >>--- a/drivers/net/ethernet/mellanox/mlx4/main.c >>+++ b/drivers/net/ethernet/mellanox/mlx4/main.c >>@@ -2629,7 +2629,7 @@ static void __mlx4_remove_one(struct pci_dev *pdev) >> int pci_dev_data; >> int p; >> >>- if (priv->removed) >>+ if (!dev || priv->removed) >> return; > > This fix looks good to me. > > As I remembered, I had this check in my first version, but I removed the check > on dev based on the suggestion from Bjorn. Since I agreed that there is no > chance for dev to be NULL. Bjorn, seems we are not correct :( Writing a driver is not an empirical process of trying things to see what works. You need to actively design a consistent structure so you know why and when things are safe. I object to gratuitous "dev == NULL" checks because often they are just a way of patching up a driver design that isn't well thought-out. As I wrote before: From the PCI core's perspective, after .probe() returns successfully, we can call any driver entry point and pass the pci_dev to it, and expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected() sort of breaks that assumption because you clear out pci_drvdata(). Right now, the only other entry point mlx4 really implements is mlx4_remove_one(), and it has a hack that tests whether pci_drvdata() is NULL. But that's ... a hack, and you'll have to do the same if/when you implement suspend/resume/sriov_configure/etc. 
Bjorn
From: Bjorn Helgaas <bhelgaas@google.com>
Date: Mon, 2 Jun 2014 10:10:01 -0600

> Writing a driver is not an empirical process of trying things to see
> what works. You need to actively design a consistent structure so you
> know why and when things are safe. I object to gratuitous "dev ==
> NULL" checks because often they are just a way of patching up a driver
> design that isn't well thought-out.
>
> As I wrote before:
>
> From the PCI core's perspective, after .probe() returns successfully,
> we can call any driver entry point and pass the pci_dev to it, and
> expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected()
> sort of breaks that assumption because you clear out pci_drvdata().
> Right now, the only other entry point mlx4 really implements is
> mlx4_remove_one(), and it has a hack that tests whether pci_drvdata()
> is NULL. But that's ... a hack, and you'll have to do the same
> if/when you implement suspend/resume/sriov_configure/etc.

Agreed.
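The design rule stated above (private data stays valid from a successful .probe() until .remove()) can be sketched in plain userspace C. This is only a model under stated assumptions, not the mlx4 code: struct fake_pci_dev, struct fake_priv, fake_probe(), fake_teardown() and fake_remove() are hypothetical names, and the drvdata field merely stands in for pci_set_drvdata()/pci_get_drvdata().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct pci_dev and the driver's private data. */
struct fake_pci_dev {
	void *drvdata;		/* models pci_set_drvdata()/pci_get_drvdata() */
};

struct fake_priv {
	bool removed;		/* true once resources have been torn down */
	int teardown_count;	/* how many times teardown really ran */
};

/* .probe(): allocate private data and publish it through drvdata. */
static int fake_probe(struct fake_pci_dev *pdev)
{
	struct fake_priv *priv = calloc(1, sizeof(*priv));

	if (!priv)
		return -1;
	pdev->drvdata = priv;
	return 0;
}

/*
 * Inner teardown: releases resources but leaves drvdata in place, so any
 * later entry point still finds valid private data. The 'removed' flag
 * makes a second call a no-op.
 */
static void fake_teardown(struct fake_pci_dev *pdev)
{
	struct fake_priv *priv = pdev->drvdata;

	assert(priv != NULL);	/* invariant: valid from probe until remove */
	if (priv->removed)
		return;
	priv->teardown_count++;
	priv->removed = true;
}

/* .remove(): the only place that frees private data and clears drvdata. */
static void fake_remove(struct fake_pci_dev *pdev)
{
	fake_teardown(pdev);
	free(pdev->drvdata);
	pdev->drvdata = NULL;
}
```

In this model no entry point other than fake_remove() ever needs a NULL test, because nothing else clears drvdata; the assert documents the invariant rather than working around its absence.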
On Mon, Jun 02, 2014 at 10:10:01AM -0600, Bjorn Helgaas wrote: >On Mon, Jun 2, 2014 at 8:29 AM, Wei Yang <weiyang@linux.vnet.ibm.com> wrote: >> On Sun, Jun 01, 2014 at 01:49:43PM +0300, Or Gerlitz wrote: >>>From: Jack Morgenstein <jackm@dev.mellanox.co.il> >>> >>>Commit befdf89 did not take into account the case where the Host >>>driver is being unloaded. In this case, pci_get_drvdata for the VF >> >> In my mind, unloading PF's driver when there is alive VFs is not allowed. >> Quoted in driver code: >> >> /* in SRIOV it is not allowed to unload the pf's >> * driver while there are alive vf's */ >> if (mlx4_is_master(dev) && mlx4_how_many_lives_vf(dev)) >> printk(KERN_ERR "Removing PF when there are assigned VF's !!!\n"); >> >> Actually, I don't understand this restriction clearly. Maybe my understanding >> of alive VF is not correct. >> >> And in your code, unload PF's driver would call pci_disable_sriov() which will >> destroy the VFs. While in your test, the VF's driver is still there? >> >>>remove_one call may return NULL, so that dereferencing the priv >>>struct results in a kernel oops. >> >> Sorry for my poor mind, I still can't understand this situation. >> Would you describe the situation more? You are unloading PF's driver in Host >> at first, and then try to release the VF's driver? >> >>> >>>The fix is to also test that the dev pointer returned by >>>pci_get_drvdata is non-NULL. 
>>> >>>Fixes: befdf89 ("preserve pcd_dev_data after __mlx4_remove_one()") >>>Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il> >>>Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com> >>>--- >>> drivers/net/ethernet/mellanox/mlx4/main.c | 2 +- >>> 1 files changed, 1 insertions(+), 1 deletions(-) >>> >>>diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c >>>index c187d74..a6ae089 100644 >>>--- a/drivers/net/ethernet/mellanox/mlx4/main.c >>>+++ b/drivers/net/ethernet/mellanox/mlx4/main.c >>>@@ -2629,7 +2629,7 @@ static void __mlx4_remove_one(struct pci_dev *pdev) >>> int pci_dev_data; >>> int p; >>> >>>- if (priv->removed) >>>+ if (!dev || priv->removed) >>> return; >> >> This fix looks good to me. >> >> As I remembered, I had this check in my first version, but I removed the check >> on dev based on the suggestion from Bjorn. Since I agreed that there is no >> chance for dev to be NULL. Bjorn, seems we are not correct :( > >Writing a driver is not an empirical process of trying things to see >what works. You need to actively design a consistent structure so you >know why and when things are safe. I object to gratuitous "dev == >NULL" checks because often they are just a way of patching up a driver >design that isn't well thought-out. > >As I wrote before: > > From the PCI core's perspective, after .probe() returns successfully, > we can call any driver entry point and pass the pci_dev to it, and > expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected() > sort of breaks that assumption because you clear out pci_drvdata(). > Right now, the only other entry point mlx4 really implements is > mlx4_remove_one(), and it has a hack that tests whether pci_drvdata() > is NULL. But that's ... a hack, and you'll have to do the same > if/when you implement suspend/resume/sriov_configure/etc. Thanks for your kindness. 
After re-reading it, I understand it better. It is not only about the
Mellanox driver, but about the whole picture of how to write a driver.

1. We should make the driver entry points safe to call after .probe()
   returns successfully.
2. If there is an exception, with a hack that tests pci_drvdata(), we need
   the same hack in suspend/resume/etc.

Now back to the current mlx4 driver: mlx4_remove_one() is called by .shutdown
and .remove. As I understand it, these two hooks are invoked by rmmod or
reboot. By doing so, it is trying to comply with rule 1, making sure
pci_drvdata() is valid after .probe() succeeds. So I am curious in which case
the driver breaks this rule.

My suggestions are:

1. To comply with rule 1, it would be better to fix this point instead of
   adding a hack.
2. Or, to comply with rule 2, the driver needs to check pci_drvdata() in
   every driver entry point instead of in just one. For example,
   mlx4_pci_slot_reset() needs this check too.

Bjorn, thanks again. I hope my understanding is correct this time :-)

>
>Bjorn
On Mon, Jun 2, 2014 at 7:10 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> Writing a driver is not an empirical process of trying things to see
> what works. You need to actively design a consistent structure so you
> know why and when things are safe. I object to gratuitous "dev ==
> NULL" checks because often they are just a way of patching up a driver
> design that isn't well thought-out.

Bjorn, first and foremost: agreed.

Next, to be precise, the use case of rebooting the host while the
driver was loaded in SRIOV mode and NO VFs probed to VMs worked before
commit befdf89 and is now broken.

Reading your response further, I understand that the code was probably
using a sort of hackish branching to make that happen, and you suggest
we rewrite that section properly so it can serve well when we
(hopefully soon) implement sriov_configure and possibly also
suspend/resume. Point taken.

Dave, as for this patch, again, the regression (inability to reboot
the host node while the driver is loaded) exists in the latest
upstream code as of befdf89 / 3.15-rc1.

Now, taking into account that 3.15 is past rc8 and the IL devel team
has a holiday this week, I don't see us coming up in time with a
deeper fix for 3.15, so maybe you can go ahead and merge this
one-liner for 3.15?

Or.

> As I wrote before:
> From the PCI core's perspective, after .probe() returns successfully,
> we can call any driver entry point and pass the pci_dev to it, and
> expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected()
> sort of breaks that assumption because you clear out pci_drvdata().
> Right now, the only other entry point mlx4 really implements is
> mlx4_remove_one(), and it has a hack that tests whether pci_drvdata()
> is NULL. But that's ... a hack, and you'll have to do the same
> if/when you implement suspend/resume/sriov_configure/etc.
On Tue, Jun 03, 2014 at 11:15:43AM +0300, Or Gerlitz wrote: >On Mon, Jun 2, 2014 at 7:10 PM, Bjorn Helgaas <bhelgaas@google.com> wrote: >> Writing a driver is not an empirical process of trying things to see >> what works. You need to actively design a consistent structure so you >> know why and when things are safe. I object to gratuitous "dev == >> NULL" checks because often they are just a way of patching up a driver >> design that isn't well thought-out. > >Bjorn, 1st and most -- Agreed. > >Next, to be precise, the use case of rebooting the host while the >driver was loaded in SRIOV mode and NO VFs probed to VMs worked before >commit befdf89 and is now broken. > >Reading further your response, I understand that the code was probably >using a sort of hackish branching to make that to happen, and you >suggest we re-write that section properly so it can serve well when >(hopefully soon) implemenet >sriov_configure and possibly also suspend/resume, point taken. > >Dave, as for this patch, again, the regression of inability to reboot >the host node >while the driver is loaded exists in the latest upstream code as of >befdf89 / 3.15-rc1 > >Now, taking into account that 3.15 is after rc8 and the IL devel team >has a holiday this week, I don't see us coming in time with a more >deeper fix for 3.15, so maybe you can eventaully go and merge this one >liner for 3.15? I am glad to verify your patch, if you wish. > >Or. > > >> As I wrote before: >> From the PCI core's perspective, after .probe() returns successfully, >> we can call any driver entry point and pass the pci_dev to it, and >> expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected() >> sort of breaks that assumption because you clear out pci_drvdata(). >> Right now, the only other entry point mlx4 really implements is >> mlx4_remove_one(), and it has a hack that tests whether pci_drvdata() >> is NULL. But that's ... 
a hack, and you'll have to do the same >> if/when you implement suspend/resume/sriov_configure/etc.
On Tue, Jun 03, 2014 at 11:15:43AM +0300, Or Gerlitz wrote:
>On Mon, Jun 2, 2014 at 7:10 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> Writing a driver is not an empirical process of trying things to see
>> what works. You need to actively design a consistent structure so you
>> know why and when things are safe. I object to gratuitous "dev ==
>> NULL" checks because often they are just a way of patching up a driver
>> design that isn't well thought-out.
>
>Bjorn, 1st and most -- Agreed.
>
>Next, to be precise, the use case of rebooting the host while the
>driver was loaded in SRIOV mode and NO VFs probed to VMs worked before
>commit befdf89 and is now broken.
>
>Reading further your response, I understand that the code was probably
>using a sort of hackish branching to make that to happen, and you
>suggest we re-write that section properly so it can serve well when
>(hopefully soon) implemenet
>sriov_configure and possibly also suspend/resume, point taken.
>
>Dave, as for this patch, again, the regression of inability to reboot
>the host node
>while the driver is loaded exists in the latest upstream code as of
>befdf89 / 3.15-rc1
>
>Now, taking into account that 3.15 is after rc8 and the IL devel team
>has a holiday this week, I don't see us coming in time with a more
>deeper fix for 3.15, so maybe you can eventaully go and merge this one
>liner for 3.15?
>
>Or.

Hi, Or,

I ran some tests following your steps to reproduce the case. Below is my
analysis.

I ran "rmmod mlx4_core" and "kexec" after probing the Mellanox driver. Below
are the logs from the two steps, respectively.

[root@tian-lp1 ywywyang]# rmmod mlx4_core
[ 534.159740] mlx4_core 0003:05:00.1: mlx4_remove_one: called
[ 534.161272] mlx4_core 0003:05:00.0: Received reset from slave:1
[ 534.161509] mlx4_core 0003:05:00.0: mlx4_remove_one: called
[ 534.170823] mlx4_core 0003:05:00.0: Disabling SR-IOV

[root@tian-lp1 ywywyang]# kexec -e
[ 669.089322] kvm: exiting hardware virtualization
[ 669.091746] mlx4_core 0003:05:00.1: mlx4_remove_one: called
[ 669.326754] mlx4_core 0003:05:00.0: Received reset from slave:1
[ 674.488417] lpfc 0006:01:00.4: 2:2885 Port Status Event: port status reg 0x81000000, port smphr reg 0xc000, error 1=0x9f000001, error 2=0xa9fa47fd
[ 675.618578] mlx4_core 0003:05:00.0: mlx4_remove_one: called
[ 675.691278] mlx4_en 0003:05:00.0: removed PHC
[ 675.700414] mlx4_core 0003:05:00.0: Disabling SR-IOV
[ 675.700630] mlx4_core 0003:05:00.1: mlx4_remove_one: called
[ 675.700701] Unable to handle kernel paging request for data at address 0x00000370
[ 675.700769] Faulting instruction address: 0xd00000001a13fb88
[ 675.700826] Oops: Kernel access of bad area, sig: 11 [#1]
[---]

During rmmod the driver works fine, while kexec produces an oops. kexec is
almost the same as a reboot. We see that the driver for PCI device
0003:05:00.1 is "removed" twice, and the second time the driver triggers an
error.

rmmod and kexec call different driver entry points: rmmod -> .remove and
kexec -> .shutdown. I think this is the reason for the oops during reboot.
In .shutdown, the driver is not detached. And when there are VFs, both
.shutdown and .remove are invoked on the VF.

I took a quick glance at the e1000e driver, where .shutdown and .remove
behave differently. So maybe .shutdown needs different handling than .remove.

Adding a check in .remove is a quick fix for this case, though.

This is my draft analysis for your reference. I hope it is correct and helps
you to some extent. Have a good day :-)
On Mon, Jun 2, 2014 at 7:10 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
[...]
> From the PCI core's perspective, after .probe() returns successfully,
> we can call any driver entry point and pass the pci_dev to it, and
> expect it to work. Doing mlx4_remove_one() in mlx4_pci_err_detected()

Note that __mlx4_remove_one() is what is called from
mlx4_pci_err_detected(), and the former is built in a way that allows
it to be called twice. In that respect, I agree with the fix provided by
Wei Yang in this thread, which essentially makes .shutdown behave in a
similar way and call __mlx4_remove_one(), and I will submit it for
inclusion.

> sort of breaks that assumption because you clear out pci_drvdata().
> Right now, the only other entry point mlx4 really implements is
> mlx4_remove_one(), and it has a hack that tests whether pci_drvdata()
> is NULL. But that's ... a hack, and you'll have to do the same
> if/when you implement suspend/resume/sriov_configure/etc.
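The direction described above, where .shutdown only runs the repeatable inner teardown and leaves the private data attached, can be modeled in userspace C roughly as follows. This is a sketch under assumptions, not the patch that was eventually submitted: fake_inner_remove(), fake_shutdown(), fake_remove() and fake_reboot_sequence() are hypothetical names, and the call order (.shutdown on a VF followed by .remove on the same VF) is taken from the kexec analysis earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_priv {
	bool removed;		/* set once teardown has run */
	int teardowns;		/* counts real teardown passes */
};

struct fake_pci_dev {
	struct fake_priv *priv;	/* models the pointer kept in drvdata */
};

/* Repeatable inner teardown: a second call returns early. */
static void fake_inner_remove(struct fake_pci_dev *pdev)
{
	struct fake_priv *priv = pdev->priv;

	assert(priv != NULL);	/* holds because shutdown never clears it */
	if (priv->removed)
		return;
	priv->teardowns++;
	priv->removed = true;
}

/* .shutdown: quiesce the device but keep the private data attached. */
static void fake_shutdown(struct fake_pci_dev *pdev)
{
	fake_inner_remove(pdev);
}

/* .remove: run the same teardown, then detach the private data. */
static void fake_remove(struct fake_pci_dev *pdev)
{
	fake_inner_remove(pdev);
	pdev->priv = NULL;
}

/*
 * Models the reboot ordering from the analysis: .shutdown runs on the VF,
 * then disabling SR-IOV on the PF triggers .remove on the same VF.
 */
static void fake_reboot_sequence(struct fake_pci_dev *vf)
{
	fake_shutdown(vf);
	fake_remove(vf);
}
```

With this split, the second call on the VF finds a valid private struct with removed already set, so it returns early instead of dereferencing a NULL pointer.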
diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index c187d74..a6ae089 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2629,7 +2629,7 @@ static void __mlx4_remove_one(struct pci_dev *pdev)
 	int pci_dev_data;
 	int p;
 
-	if (priv->removed)
+	if (!dev || priv->removed)
 		return;
 
 	pci_dev_data = priv->pci_dev_data;
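The one-line guard in the diff depends on evaluation order: the NULL test must come before any access through priv. A minimal userspace sketch of that ordering follows. It assumes, for illustration only, that the private struct embeds the device struct as its first member so that converting one pointer to the other involves no memory access; struct fake_dev, struct fake_priv and fake_inner_remove() are hypothetical names, not the mlx4 definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct fake_dev {
	int id;
};

struct fake_priv {
	struct fake_dev dev;	/* first member: same address as the priv */
	bool removed;
	int teardowns;
};

/* Returns true if teardown ran, false if the guard skipped it. */
static bool fake_inner_remove(struct fake_dev *dev)
{
	/* Pointer conversion only; nothing is dereferenced here. */
	struct fake_priv *priv = (struct fake_priv *)dev;

	/* '!dev' is tested first; '||' stops before reading priv->removed. */
	if (!dev || priv->removed)
		return false;

	assert(priv != NULL);
	priv->teardowns++;
	priv->removed = true;
	return true;
}
```

Swapping the two operands of '||' would read priv->removed through a NULL pointer in the failing case, which is the kind of access the oops in the thread reports.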