
[v3,for-next,01/13] IB/core: Use SRCU when reading client_list or device_list

Message ID 1431253604-9214-2-git-send-email-haggaie@mellanox.com
State Not Applicable, archived
Delegated to: David Miller

Commit Message

Haggai Eran May 10, 2015, 10:26 a.m. UTC
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding an SRCU to protect the lists. Readers now don't need
the mutex, and are safe just by using srcu_read_lock/unlock.
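
For example, a new user-facing API that needs to walk the device list
could then do something like the following (an illustrative sketch, not
part of this patch):

    struct ib_device *device;
    int id;

    id = srcu_read_lock(&device_srcu);
    list_for_each_entry_rcu(device, &device_list, core_list) {
        /* read-only use of device; no device_mutex needed */
    }
    srcu_read_unlock(&device_srcu, id);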

The ib_register_device, ib_register_client, and ib_unregister_client
functions are modified to only lock the device_mutex during their
respective list modification, and use the SRCU for iteration on the other
list. In ib_unregister_device, the client list iteration remains in the
mutex critical section as it is done in reverse order.

This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.

[1] http://www.spinics.net/lists/linux-rdma/msg24733.html

Cc: Matan Barak <matanb@mellanox.com>
Cc: Jason Gunthorpe <jgunthorpe@obsidianresearch.com>
Signed-off-by: Haggai Eran <haggaie@mellanox.com>
---
 drivers/infiniband/core/device.c | 75 ++++++++++++++++++++++++++++++----------
 1 file changed, 56 insertions(+), 19 deletions(-)

Comments

Jason Gunthorpe May 11, 2015, 6:18 p.m. UTC | #1
On Sun, May 10, 2015 at 01:26:32PM +0300, Haggai Eran wrote:
> Currently the RDMA subsystem's device list and client list are protected by
> a single mutex. This prevents adding user-facing APIs that iterate these
> lists, since using them may cause a deadlock. The patch attempts to solve
> this problem by adding an SRCU to protect the lists. Readers now don't need
> the mutex, and are safe just by using srcu_read_lock/unlock.
> 
> The ib_register_device, ib_register_client, and ib_unregister_client
> functions are modified to only lock the device_mutex during their
> respective list modification, and use the SRCU for iteration on the other
> list. In ib_unregister_device, the client list iteration remains in the
> mutex critical section as it is done in reverse order.
> 
> This patch attempts to solve a similar need [1] that was seen in the RoCE
> v2 patch series.
> 
> [1] http://www.spinics.net/lists/linux-rdma/msg24733.html

So at first blush this looked reasonable, but no, it is racy:

ib_register_client:
 mutex_lock(&device_mutex);
 list_add_tail_rcu(&client->list, &client_list);
 mutex_unlock(&device_mutex);

 id = srcu_read_lock(&device_srcu);
 list_for_each_entry_rcu(device, &device_list, core_list)
                        client->add(device);

ib_register_device:
  mutex_lock(&device_mutex);
  list_add_tail(&device->core_list, &device_list);
  mutex_unlock(&device_mutex);

  id = srcu_read_lock(&device_srcu);
  list_for_each_entry_rcu(client, &client_list, list)
       client->add(device);

So, if two threads call the two registers then the new client will
have its add called twice on the same device.

There are similar problems with removal.
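
Spelled out, one interleaving that produces the double add (client C,
device D; illustrative only):

    /*
     * A (ib_register_client):  adds C to client_list, drops device_mutex
     * B (ib_register_device):  adds D to device_list, drops device_mutex
     * A:                       walks device_list, sees D -> C->add(D)
     * B:                       walks client_list, sees C -> C->add(D) again
     */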

I'm not sure RCU is the right way to approach this. The driver core
has the same basic task to perform, maybe review its locking
arrangement between the device list and driver list.

[Actually we probably should be using the driver core here, with IB
 clients as device drivers, but that is way beyond scope of this..]

I had a bunch of other comments, but they are not relevant,
considering the above.

Jason
--
Haggai Eran May 12, 2015, 6:07 a.m. UTC | #2
On 11/05/2015 21:18, Jason Gunthorpe wrote:
> So at first blush this looked reasonable, but no, it is racy:
> 
> ib_register_client:
>  mutex_lock(&device_mutex);
>  list_add_tail_rcu(&client->list, &client_list);
>  mutex_unlock(&device_mutex);
> 
>  id = srcu_read_lock(&device_srcu);
>  list_for_each_entry_rcu(device, &device_list, core_list)
>                         client->add(device);
> 
> ib_register_device:
>   mutex_lock(&device_mutex);
>   list_add_tail(&device->core_list, &device_list);
>   mutex_unlock(&device_mutex);
> 
>   id = srcu_read_lock(&device_srcu);
>   list_for_each_entry_rcu(client, &client_list, list)
>        client->add(device);
> 
> So, if two threads call the two registers then the new client will
> have its add called twice on the same device.
> 
> There are similar problems with removal.

Right. Sorry I missed that. Our first draft just kept the addition and
removal of elements on the device and client lists under the mutex,
thinking that only the new code in this patchset that traverses the
lists would use the SRCU read lock. We then changed it, thinking it
would be better to make some use of the SRCU in this patch as well.

> 
> I'm not sure RCU is the right way to approach this. The driver core
> has the same basic task to perform, maybe review its locking
> arrangement between the device list and driver list.
> 
> [Actually we probably should be using the driver core here, with IB
>  clients as device drivers, but that is way beyond scope of this..]

So, I'm not very familiar with that code, but it seems that the main
difference is that in the core a single driver can be attached to a
device. The device_add function calls bus_add_device to add it to the
bus's list, and later calls bus_probe_device to find a driver,
without any lock between these calls. And bus_add_driver adds a new
driver to the bus's driver list and then calls driver_attach without any
lock between them. The only thing I see that makes sure a driver won't
be attached twice to a device is the dev->driver field, which is updated
under the device lock during the probing process.

I guess a similar thing we can do is to rely on the context we associate
with a pair of a client and a device. If such a context exists, we don't
need to call client->add again. What do you think?
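
A sketch of what that check could look like, reusing the existing
client_data_list bookkeeping (the helper name is made up):

    /* Hypothetical helper: does this client already have a context on dev? */
    static bool client_context_exists(struct ib_device *device,
                                      struct ib_client *client)
    {
        struct ib_client_data *context;
        unsigned long flags;
        bool found = false;

        spin_lock_irqsave(&device->client_data_lock, flags);
        list_for_each_entry(context, &device->client_data_list, list)
            if (context->client == client) {
                found = true;
                break;
            }
        spin_unlock_irqrestore(&device->client_data_lock, flags);

        return found;
    }

    /* ...and the registration paths would check it before client->add(): */
    if (client->add && !client_context_exists(device, client) &&
        !add_client_context(device, client))
        client->add(device);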

Haggai
--
Jason Gunthorpe May 12, 2015, 6:23 p.m. UTC | #3
On Tue, May 12, 2015 at 09:07:51AM +0300, Haggai Eran wrote:
> > I'm not sure RCU is the right way to approach this. The driver core
> > has the same basic task to perform, maybe review its locking
> > arrangement between the device list and driver list.
> > 
> > [Actually we probably should be using the driver core here, with IB
> >  clients as device drivers, but that is way beyond scope of this..]
> 
> So, I'm not very familiar with that code, but it seems that the main
> difference is that in the core a single driver can be attached to a
> device. 

Roughly, a bus (IB would be a bus) has devices attached to it, and
devices have drivers attached to them. Bus:Device is 1:N,
Device:Driver is 1:1.

There are a couple of reasons to be interested in re-using the driver
core, perhaps the best is that it has all the infrastructure to let us
auto-load client modules...

> I guess a similar thing we can do is to rely on the context we associate
> with a pair of a client and a device. If such a context exists, we don't
> need to call client->add again. What do you think?

I didn't look closely, isn't this enough?

device_register:
 mutex_lock(client_mutex);
 down_write(devices_rwsem);
 list_add(device_list,dev,..);
 up_write(devices_rwsem);

 /* Caller must prevent device_register/unregister concurrency on the
    same dev */

 foreach(client_list)
   .. client->add(dev,...) .. 

 mutex_unlock(client_mutex)

client_register:
 mutex_lock(client_mutex)
 list_add(client_list,new_client..)
 down_read(devices_rwsem);
 foreach(device_list)
   .. client->add(dev,new_client,..) ..
 up_read(devices_rwsem);
 mutex_unlock(client_mutex)

[Note, I didn't check this carefully, just intuitively seems like a
 sane starting point]

Jason
--
Haggai Eran May 13, 2015, 8:10 a.m. UTC | #4
On 12/05/2015 21:23, Jason Gunthorpe wrote:
> On Tue, May 12, 2015 at 09:07:51AM +0300, Haggai Eran wrote:
>>> I'm not sure RCU is the right way to approach this. The driver core
>>> has the same basic task to perform, maybe review its locking
>>> arrangement between the device list and driver list.
>>>
>>> [Actually we probably should be using the driver core here, with IB
>>>  clients as device drivers, but that is way beyond scope of this..]
>>
>> So, I'm not very familiar with that code, but it seems that the main
>> difference is that in the core a single driver can be attached to a
>> device. 
> 
> Roughly, a bus (IB would be a bus) has devices attached to it, and
> devices have drivers attached to them. Bus:Device is 1:N,
> Device:Driver is 1:1.
> 
> There are a couple of reasons to be interested in re-using the driver
> core, perhaps the best is that it has all the infrastructure to let us
> auto-load client modules...

So you want an ib_core bus and each client is a "device" or a driver on
that bus? But this doesn't get you the relation between clients and ib
devices.

> 
>> I guess a similar thing we can do is to rely on the context we associate
>> with a pair of a client and a device. If such a context exists, we don't
>> need to call client->add again. What do you think?
> 
> I didn't look closely, isn't this enough?
> 
> device_register:
>  mutex_lock(client_mutex);
>  down_write(devices_rwsem);
>  list_add(device_list,dev,..);
>  up_write(devices_rwsem);
> 
>  /* Caller must prevent device_register/unregister concurrency on the
>     same dev */
> 
>  foreach(client_list)
>    .. client->add(dev,...) .. 
> 
>  mutex_unlock(client_mutex)
> 
> client_register:
>  mutex_lock(client_mutex)
>  list_add(client_list,new_client..)
>  down_read(devices_rwsem);
>  foreach(device_list)
>    .. client->add(dev,new_client,..) ..
>  up_read(devices_rwsem);
>  mutex_unlock(client_mutex)
> 
> [Note, I didn't check this carefully, just intuitively seems like a
>  sane starting point]

That could perhaps work for the RoCEv2 patch-set, as their deadlock
relates to iterating the device list. This patch set however does an
iteration on the client list (patch 3). Because a client remove could
block on this iteration, you can still get a deadlock.

I think I prefer keeping the single device_mutex and the SRCU, but
keeping the device_mutex locked as it is today, covering all of the
registration and unregistration code. Only the new code that reads the
client list or the device list without modification to the other list
will use the SRCU read lock.

Haggai
--
Jason Gunthorpe May 13, 2015, 3:57 p.m. UTC | #5
On Wed, May 13, 2015 at 11:10:15AM +0300, Haggai Eran wrote:

> >> I guess a similar thing we can do is to rely on the context we associate
> >> with a pair of a client and a device. If such a context exists, we don't
> >> need to call client->add again. What do you think?
> > 
> > I didn't look closely, isn't this enough?
> > 
> > device_register:
> >  mutex_lock(client_mutex);
> >  down_write(devices_rwsem);
> >  list_add(device_list,dev,..);
> >  up_write(devices_rwsem);
> > 
> >  /* Caller must prevent device_register/unregister concurrency on the
> >     same dev */
> > 
> >  foreach(client_list)
> >    .. client->add(dev,...) .. 
> > 
> >  mutex_unlock(client_mutex)
> > 
> > client_register:
> >  mutex_lock(client_mutex)
> >  list_add(client_list,new_client..)
> >  down_read(devices_rwsem);
> >  foreach(device_list)
> >    .. client->add(dev,new_client,..) ..
> >  up_read(devices_rwsem);
> >  mutex_unlock(client_mutex)
> > 
> > [Note, I didn't check this carefully, just intuitively seems like a
> >  sane starting point]
> 
> That could perhaps work for the RoCEv2 patch-set, as their deadlock
> relates to iterating the device list. This patch set however does an
> iteration on the client list (patch 3). Because a client remove could
> block on this iteration, you can still get a deadlock.

Really? Gross.

Still, I think you got it right in your analysis, but we don't need RCU:

device_register:
 mutex_lock(modify_mutex);
 down_write(lists_rwsem);
 list_add(device_list,dev,..);
 up_write(lists_rwsem);

 // implied by modify_mutex: down_read(lists_rwsem)
 foreach(client_list)
    .. client->add(dev,...) ..
 mutex_unlock(modify_mutex)

client_register:
 mutex_lock(modify_mutex);
 // implied by modify_mutex: down_read(lists_rwsem)
 foreach(device_list)
    .. client->add(dev,new_client...) ..

 down_write(lists_rwsem);
 list_add(client_list,new_client..);
 up_write(lists_rwsem);

 mutex_unlock(modify_mutex)

client_unregister:
 mutex_lock(modify_mutex);
 down_write(lists_rwsem);
 list_cut(client_list,..rm_client..);
 up_write(lists_rwsem);

 // implied by modify_mutex: down_read(lists_rwsem)
 foreach(device_list)
    .. client->remove(dev,rm_client...) ..

 mutex_unlock(modify_mutex)

etc. Notice the ordering.
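
A rough C rendering of the two register paths, keeping the made-up lock
names and leaving out the client-context bookkeeping (a sketch only;
unregister would mirror these with the list removal done first, as above):

    static DEFINE_MUTEX(modify_mutex);  /* serializes register/unregister */
    static DECLARE_RWSEM(lists_rwsem);  /* protects device_list and client_list */

    int ib_register_device(struct ib_device *device)
    {
        struct ib_client *client;

        mutex_lock(&modify_mutex);

        down_write(&lists_rwsem);
        list_add_tail(&device->core_list, &device_list);
        up_write(&lists_rwsem);

        /* read access to client_list is implied by holding modify_mutex */
        list_for_each_entry(client, &client_list, list)
            if (client->add)
                client->add(device);

        mutex_unlock(&modify_mutex);
        return 0;
    }

    int ib_register_client(struct ib_client *client)
    {
        struct ib_device *device;

        mutex_lock(&modify_mutex);

        /* read access to device_list is implied by holding modify_mutex */
        list_for_each_entry(device, &device_list, core_list)
            if (client->add)
                client->add(device);

        down_write(&lists_rwsem);
        list_add_tail(&client->list, &client_list);
        up_write(&lists_rwsem);

        mutex_unlock(&modify_mutex);
        return 0;
    }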

> I think I prefer keeping the single device_mutex and the SRCU, but
> keeping the device_mutex locked as it is today, covering all of the
> registration and unregistration code. Only the new code that reads the
> client list or the device list without modification to the other list
> will use the SRCU read lock.

In this case, I don't see a justification to use RCU; we'd need a
high read load before optimizing the rwsem into RCU would make
sense.

I'm not sure you should ever use RCU until you've designed the locking
using traditional means.

Jason
--
Haggai Eran May 14, 2015, 10:35 a.m. UTC | #6
On 13/05/2015 18:57, Jason Gunthorpe wrote:
> On Wed, May 13, 2015 at 11:10:15AM +0300, Haggai Eran wrote:
> 
>>>> I guess a similar thing we can do is to rely on the context we associate
>>>> with a pair of a client and a device. If such a context exists, we don't
>>>> need to call client->add again. What do you think?
>>>
>>> I didn't look closely, isn't this enough?
>>>
>>> device_register:
>>>  mutex_lock(client_mutex);
>>>  down_write(devices_rwsem);
>>>  list_add(device_list,dev,..);
>>>  up_write(devices_rwsem);
>>>
>>>  /* Caller must prevent device_register/unregister concurrency on the
>>>     same dev */
>>>
>>>  foreach(client_list)
>>>    .. client->add(dev,...) .. 
>>>
>>>  mutex_unlock(client_mutex)
>>>
>>> client_register:
>>>  mutex_lock(client_mutex)
>>>  list_add(client_list,new_client..)
>>>  down_read(devices_rwsem);
>>>  foreach(device_list)
>>>    .. client->add(dev,new_client,..) ..
>>>  up_read(devices_rwsem);
>>>  mutex_unlock(client_mutex)
>>>
>>> [Note, I didn't check this carefully, just intuitively seems like a
>>>  sane starting point]
>>
>> That could perhaps work for the RoCEv2 patch-set, as their deadlock
>> relates to iterating the device list. This patch set however does an
>> iteration on the client list (patch 3). Because a client remove could
>> block on this iteration, you can still get a deadlock.
> 
> Really? Gross.
> 
> Still, I think you got it right in your analysis, but we don't need RCU:
> 
> device_register:
>  mutex_lock(modify_mutex);
>  down_write(lists_rwsem);
>  list_add(device_list,dev,..);
>  up_write(lists_rwsem);
> 
>  // implied by modify_mutex: down_read(lists_rwsem)
>  foreach(client_list)
>     .. client->add(dev,...) ..
>  mutex_unlock(modify_mutex)
> 
> client_register:
>  mutex_lock(modify_mutex);
>  // implied by modify_mutex: down_read(lists_rwsem)
>  foreach(device_list)
>     .. client->add(dev,new_client...) ..
> 
>  down_write(lists_rwsem);
>  list_add(client_list,new_client..);
>  up_write(lists_rwsem);
> 
>  mutex_unlock(modify_mutex)
> 
> client_unregister:
>  mutex_lock(modify_mutex);
>  down_write(lists_rwsem);
>  list_cut(client_list,..rm_client..);
>  up_write(lists_rwsem);
> 
>  // implied by modify_mutex: down_read(lists_rwsem)
>  foreach(device_list)
>     .. client->remove(dev,rm_client...) ..
> 
>  mutex_unlock(modify_mutex)
> 
> etc. Notice the ordering.
> 

Looks good. I'll use it in the next version of the patch.

--

Patch

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index b360350a0b20..7d90b2ca2eba 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -58,12 +58,11 @@  EXPORT_SYMBOL_GPL(ib_wq);
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
+/* device_srcu protects access to both device_list and client_list. */
+static struct srcu_struct device_srcu;
+
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list.  In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex protects writer access to both device_list and client_list.
  */
 static DEFINE_MUTEX(device_mutex);
 
@@ -276,6 +275,7 @@  int ib_register_device(struct ib_device *device,
 					    u8, struct kobject *))
 {
 	int ret;
+	int id;
 
 	mutex_lock(&device_mutex);
 
@@ -315,13 +315,19 @@  int ib_register_device(struct ib_device *device,
 
 	device->reg_state = IB_DEV_REGISTERED;
 
+	mutex_unlock(&device_mutex);
+
+	id = srcu_read_lock(&device_srcu);
 	{
 		struct ib_client *client;
 
-		list_for_each_entry(client, &client_list, list)
+		list_for_each_entry_rcu(client, &client_list, list)
 			if (client->add && !add_client_context(device, client))
 				client->add(device);
 	}
+	srcu_read_unlock(&device_srcu, id);
+
+	return 0;
 
  out:
 	mutex_unlock(&device_mutex);
@@ -338,6 +344,7 @@  EXPORT_SYMBOL(ib_register_device);
 void ib_unregister_device(struct ib_device *device)
 {
 	struct ib_client *client;
+	LIST_HEAD(contexts);
 	struct ib_client_data *context, *tmp;
 	unsigned long flags;
 
@@ -347,21 +354,26 @@  void ib_unregister_device(struct ib_device *device)
 		if (client->remove)
 			client->remove(device);
 
-	list_del(&device->core_list);
+	list_del_rcu(&device->core_list);
+
+	mutex_unlock(&device_mutex);
+
+	synchronize_srcu(&device_srcu);
 
 	kfree(device->gid_tbl_len);
 	kfree(device->pkey_tbl_len);
 
-	mutex_unlock(&device_mutex);
-
 	ib_device_unregister_sysfs(device);
 
 	spin_lock_irqsave(&device->client_data_lock, flags);
-	list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
-		kfree(context);
+	list_cut_position(&contexts, &device->client_data_list,
+			  device->client_data_list.prev);
 	spin_unlock_irqrestore(&device->client_data_lock, flags);
 
 	device->reg_state = IB_DEV_UNREGISTERED;
+
+	list_for_each_entry_safe(context, tmp, &contexts, list)
+		kfree(context);
 }
 EXPORT_SYMBOL(ib_unregister_device);
 
@@ -381,15 +393,19 @@  EXPORT_SYMBOL(ib_unregister_device);
 int ib_register_client(struct ib_client *client)
 {
 	struct ib_device *device;
+	int id;
 
 	mutex_lock(&device_mutex);
+	list_add_tail_rcu(&client->list, &client_list);
+	mutex_unlock(&device_mutex);
 
-	list_add_tail(&client->list, &client_list);
-	list_for_each_entry(device, &device_list, core_list)
+	id = srcu_read_lock(&device_srcu);
+
+	list_for_each_entry_rcu(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
 
-	mutex_unlock(&device_mutex);
+	srcu_read_unlock(&device_srcu, id);
 
 	return 0;
 }
@@ -407,11 +423,13 @@  void ib_unregister_client(struct ib_client *client)
 {
 	struct ib_client_data *context, *tmp;
 	struct ib_device *device;
+	LIST_HEAD(contexts);
 	unsigned long flags;
+	int id;
 
-	mutex_lock(&device_mutex);
+	id = srcu_read_lock(&device_srcu);
 
-	list_for_each_entry(device, &device_list, core_list) {
+	list_for_each_entry_rcu(device, &device_list, core_list) {
 		if (client->remove)
 			client->remove(device);
 
@@ -419,13 +437,21 @@  void ib_unregister_client(struct ib_client *client)
 		list_for_each_entry_safe(context, tmp, &device->client_data_list, list)
 			if (context->client == client) {
 				list_del(&context->list);
-				kfree(context);
+				list_add(&context->list, &contexts);
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
-	list_del(&client->list);
 
+	srcu_read_unlock(&device_srcu, id);
+
+	mutex_lock(&device_mutex);
+	list_del_rcu(&client->list);
 	mutex_unlock(&device_mutex);
+
+	synchronize_srcu(&device_srcu);
+
+	list_for_each_entry_safe(context, tmp, &contexts, list)
+		kfree(context);
 }
 EXPORT_SYMBOL(ib_unregister_client);
 
@@ -738,9 +764,15 @@  static int __init ib_core_init(void)
 {
 	int ret;
 
+	ret = init_srcu_struct(&device_srcu);
+	if (ret) {
+		pr_warn("Couldn't initialize SRCU\n");
+		return ret;
+	}
+
 	ib_wq = alloc_workqueue("infiniband", 0, 0);
 	if (!ib_wq)
-		return -ENOMEM;
+		goto err_srcu;
 
 	ret = ib_sysfs_setup();
 	if (ret) {
@@ -770,6 +802,9 @@  err_sysfs:
 
 err:
 	destroy_workqueue(ib_wq);
+err_srcu:
+	cleanup_srcu_struct(&device_srcu);
+
 	return ret;
 }
 
@@ -780,6 +815,8 @@  static void __exit ib_core_cleanup(void)
 	ib_sysfs_cleanup();
 	/* Make sure that any pending umem accounting work is done. */
 	destroy_workqueue(ib_wq);
+	srcu_barrier(&device_srcu);
+	cleanup_srcu_struct(&device_srcu);
 }
 
 module_init(ib_core_init);