diff mbox

[ofa-general] mlx4_core: Synch catastrophic flow with module unload

Message ID 4A5B5274.2020801@mellanox.co.il
State Accepted, archived
Delegated to: David Miller
Headers show

Commit Message

Yevgeny Petrilin July 13, 2009, 3:27 p.m. UTC
There is a race condition when the mlx4_core module is being unloaded
during the execution of restart task due to catastrophic error.
Added a global mutex that synchs those operations. If the catastrophic task
tries to catch the mutex, and it is already taken, it means that somebody is unloading the
module, and there is no point in executing the restart operation.
If the unload function tries to catch the mutex and it is taken,
it would wait for the catas task to finish and then unload the module.

Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
---
 drivers/net/mlx4/catas.c |    4 ++++
 drivers/net/mlx4/main.c  |    6 ++++++
 drivers/net/mlx4/mlx4.h  |    2 ++
 3 files changed, 12 insertions(+), 0 deletions(-)

Comments

David Miller July 13, 2009, 6:14 p.m. UTC | #1
From: Yevgeny Petrilin <yevgenyp@mellanox.co.il>
Date: Mon, 13 Jul 2009 18:27:48 +0300

> There is a race condition when the mlx4_core module is being unloaded
> during the execution of restart task due to catastrophic error.
> Added a global mutex that synchs those operations. If the catastrophic task
> tries to catch the mutex, and it is already taken, it means that somebody is unloading the
> module, and there is no point in executing the restart operation.
> If the unload function tries to catch the mutex and it is taken,
> it would wait for the catas task to finish and then unload the module.
> 
> Signed-off-by: Yevgeny Petrilin <yevgenyp@mellanox.co.il>

Applied, thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier July 13, 2009, 7:45 p.m. UTC | #2
> Applied, thanks.

Dave, please don't apply mlx4_core patches without giving me a chance to
review them.  In this case the patch looks buggy to me: I don't see how
it handles, say, hot remove of one device -- it only handles module
removal.  And I would hope we could fix this without adding a global
symbol as namespace polluting as "drv_mutex".

Yevgeny didn't even send this patch to you; he just cc'ed netdev as a
courtesy.  However I understand that the physical location of mlx4_core
in drivers/net makes it easy to do this.  Maybe this is the best
argument in favor of moving the mlx4_core stuff to drivers/shared?

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
David Miller July 13, 2009, 8:09 p.m. UTC | #3
From: Roland Dreier <rdreier@cisco.com>
Date: Mon, 13 Jul 2009 12:45:32 -0700

>  > Applied, thanks.
> 
> Dave, please don't apply mlx4_core patches without giving me a chance to
> review them.  In this case the patch looks buggy to me: I don't see how
> it handles, say, hot remove of one device -- it only handles module
> removal.  And I would hope we could fix this without adding a global
> symbol as namespace polluting as "drv_mutex".
> 
> Yevgeny didn't even send this patch to you; he just cc'ed netdev as a
> courtesy.  However I understand that the physical location of mlx4_core
> in drivers/net makes it easy to do this.  Maybe this is the best
> argument in favor of moving the mlx4_core stuff to drivers/shared?

If it gets sent to netdev, it's for a networking driver, and it says
"PATCH" rather than "RFC" or "please review" or "don't apply" you
cannot reasonably expect me to not look into applying the thing.

And if you're saying that patches for this device should start not
going through me, and the tactic to accomplish that is to move the
bulk of the driver into some driver/shared area, that's really weird.

Anyways I didn't push the patch out to kernel.org yet so it's easy for
me to remove it.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Roland Dreier July 13, 2009, 9:53 p.m. UTC | #4
> If it gets sent to netdev, it's for a networking driver, and it says
 > "PATCH" rather than "RFC" or "please review" or "don't apply" you
 > cannot reasonably expect me to not look into applying the thing.

 > And if you're saying that patches for this device should start not
 > going through me, and the tactic to accomplish that is to move the
 > bulk of the driver into some driver/shared area, that's really weird.

Well, first of all, if a driver, networking or not, has an active
maintainer, I would expect you to give that maintainer a chance to look
at any not-totally-trivial patches affecting that driver.

But in this case, mlx4_core (as opposed to mlx4_en from the same
drivers/net/mlx4 directory) really is not a network driver -- it is a
low-level multiplexer for access to hardware that really is more
InfiniBand than ethernet (with a dash of Fibre Channel thrown in).  And
yes, I am saying that making it clearer that mlx4_core is not an
network driver by moving the source to a more appropriate place does
seem to make sense.

 > Anyways I didn't push the patch out to kernel.org yet so it's easy for
 > me to remove it.

Thanks.

 - R.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/drivers/net/mlx4/catas.c b/drivers/net/mlx4/catas.c
index aa9674b..e3aa7e9 100644
--- a/drivers/net/mlx4/catas.c
+++ b/drivers/net/mlx4/catas.c
@@ -91,6 +91,9 @@  static void catas_reset(struct work_struct *work)
 	LIST_HEAD(tlist);
 	int ret;
 
+	if (!mutex_trylock(&drv_mutex))
+		return;
+
 	spin_lock_irq(&catas_lock);
 	list_splice_init(&catas_list, &tlist);
 	spin_unlock_irq(&catas_lock);
@@ -103,6 +106,7 @@  static void catas_reset(struct work_struct *work)
 		else
 			mlx4_dbg(dev, "Reset succeeded\n");
 	}
+	mutex_unlock(&drv_mutex);
 }
 
 void mlx4_start_catas_poll(struct mlx4_dev *dev)
diff --git a/drivers/net/mlx4/main.c b/drivers/net/mlx4/main.c
index dac621b..9cd5123 100644
--- a/drivers/net/mlx4/main.c
+++ b/drivers/net/mlx4/main.c
@@ -77,6 +77,8 @@  static char mlx4_version[] __devinitdata =
 	DRV_NAME ": Mellanox ConnectX core driver v"
 	DRV_VERSION " (" DRV_RELDATE ")\n";
 
+struct mutex drv_mutex;
+
 static struct mlx4_profile default_profile = {
 	.num_qp		= 1 << 17,
 	.num_srq	= 1 << 16,
@@ -1325,6 +1327,8 @@  static int __init mlx4_init(void)
 {
 	int ret;
 
+	mutex_init(&drv_mutex);
+
 	if (mlx4_verify_params())
 		return -EINVAL;
 
@@ -1340,7 +1344,9 @@  static int __init mlx4_init(void)
 
 static void __exit mlx4_cleanup(void)
 {
+	mutex_lock(&drv_mutex);
 	pci_unregister_driver(&mlx4_driver);
+	mutex_unlock(&drv_mutex);
 	destroy_workqueue(mlx4_wq);
 }
 
diff --git a/drivers/net/mlx4/mlx4.h b/drivers/net/mlx4/mlx4.h
index 5bd79c2..bd8fb43 100644
--- a/drivers/net/mlx4/mlx4.h
+++ b/drivers/net/mlx4/mlx4.h
@@ -284,6 +284,8 @@  struct mlx4_sense {
 	struct delayed_work	sense_poll;
 };
 
+extern struct mutex drv_mutex;
+
 struct mlx4_priv {
 	struct mlx4_dev		dev;