Message ID | 4A5B5274.2020801@mellanox.co.il |
---|---|
State | Accepted, archived |
Delegated to: | David Miller |
Headers | show |
From: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Date: Mon, 13 Jul 2009 18:27:48 +0300 > There is a race condition when the mlx4_core module is being unloaded > during the execution of restart task due to catastrophic error. > Added a global mutex that synchs those operations. If the catastrophic task > tries to catch the mutex, and it is already taken, it means that somebody is unloading the > module, and there is no point in executing the restart operation. > If the unload function tries to catch the mutex and it is taken, > it would wait for the catas task to finish and then unload the module. > > Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> Applied, thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> Applied, thanks.
Dave, please don't apply mlx4_core patches without giving me a chance to
review them. In this case the patch looks buggy to me: I don't see how
it handles, say, hot remove of one device -- it only handles module
removal. And I would hope we could fix this without adding a global
symbol as namespace polluting as "drv_mutex".
Yevgeny didn't even send this patch to you; he just cc'ed netdev as a
courtesy. However I understand that the physical location of mlx4_core
in drivers/net makes it easy to do this. Maybe this is the best
argument in favor of moving the mlx4_core stuff to drivers/shared?
- R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
From: Roland Dreier <rdreier@cisco.com> Date: Mon, 13 Jul 2009 12:45:32 -0700 > > Applied, thanks. > > Dave, please don't apply mlx4_core patches without giving me a chance to > review them. In this case the patch looks buggy to me: I don't see how > it handles, say, hot remove of one device -- it only handles module > removal. And I would hope we could fix this without adding a global > symbol as namespace polluting as "drv_mutex". > > Yevgeny didn't even send this patch to you; he just cc'ed netdev as a > courtesy. However I understand that the physical location of mlx4_core > in drivers/net makes it easy to do this. Maybe this is the best > argument in favor of moving the mlx4_core stuff to drivers/shared? If it gets sent to netdev, it's for a networking driver, and it says "PATCH" rather than "RFC" or "please review" or "don't apply" you cannot reasonably expect me to not look into applying the thing. And if you're saying that patches for this device should start not going through me, and the tactic to accomplish that is to move the bulk of the driver into some driver/shared area, that's really weird. Anyways I didn't push the patch out to kernel.org yet so it's easy for me to remove it. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
> If it gets sent to netdev, it's for a networking driver, and it says > "PATCH" rather than "RFC" or "please review" or "don't apply" you > cannot reasonably expect me to not look into applying the thing. > And if you're saying that patches for this device should start not > going through me, and the tactic to accomplish that is to move the > bulk of the driver into some driver/shared area, that's really weird. Well, first of all, if a driver, networking or not, has an active maintainer, I would expect you to give that maintainer a chance to look at any not-totally-trivial patches affecting that driver. But in this case, mlx4_core (as opposed to mlx4_en from the same drivers/net/mlx4 directory) really is not a network driver -- it is a low-level multiplexer for access to hardware that really is more InfiniBand than ethernet (with a dash of Fibre Channel thrown in). And yes, I am saying that making it clearer that mlx4_core is not an network driver by moving the source to a more appropriate place does seem to make sense. > Anyways I didn't push the patch out to kernel.org yet so it's easy for > me to remove it. Thanks. - R. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c index aa9674b..e3aa7e9 100644 --- a/drivers/net/mlx4/catas.c +++ b/drivers/net/mlx4/catas.c @@ -91,6 +91,9 @@ static void catas_reset(struct work_struct *work) LIST_HEAD(tlist); int ret; + if (!mutex_trylock(&drv_mutex)) + return; + spin_lock_irq(&catas_lock); list_splice_init(&catas_list, &tlist); spin_unlock_irq(&catas_lock); @@ -103,6 +106,7 @@ static void catas_reset(struct work_struct *work) else mlx4_dbg(dev, "Reset succeeded\n"); } + mutex_unlock(&drv_mutex); } void mlx4_start_catas_poll(struct mlx4_dev *dev) diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c index dac621b..9cd5123 100644 --- a/drivers/net/mlx4/main.c +++ b/drivers/net/mlx4/main.c @@ -77,6 +77,8 @@ static char mlx4_version[] __devinitdata = DRV_NAME ": Mellanox ConnectX core driver v" DRV_VERSION " (" DRV_RELDATE ")\n"; +struct mutex drv_mutex; + static struct mlx4_profile default_profile = { .num_qp = 1 << 17, .num_srq = 1 << 16, @@ -1325,6 +1327,8 @@ static int __init mlx4_init(void) { int ret; + mutex_init(&drv_mutex); + if (mlx4_verify_params()) return -EINVAL; @@ -1340,7 +1344,9 @@ static int __init mlx4_init(void) static void __exit mlx4_cleanup(void) { + mutex_lock(&drv_mutex); pci_unregister_driver(&mlx4_driver); + mutex_unlock(&drv_mutex); destroy_workqueue(mlx4_wq); } diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h index 5bd79c2..bd8fb43 100644 --- a/drivers/net/mlx4/mlx4.h +++ b/drivers/net/mlx4/mlx4.h @@ -284,6 +284,8 @@ struct mlx4_sense { struct delayed_work sense_poll; }; +extern struct mutex drv_mutex; + struct mlx4_priv { struct mlx4_dev dev;
There is a race condition when the mlx4_core module is being unloaded during the execution of restart task due to catastrophic error. Added a global mutex that synchs those operations. If the catastrophic task tries to catch the mutex, and it is already taken, it means that somebody is unloading the module, and there is no point in executing the restart operation. If the unload function tries to catch the mutex and it is taken, it would wait for the catas task to finish and then unload the module. Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il> --- drivers/net/mlx4/catas.c | 4 ++++ drivers/net/mlx4/main.c | 6 ++++++ drivers/net/mlx4/mlx4.h | 2 ++ 3 files changed, 12 insertions(+), 0 deletions(-)