Patchwork xfrm_state locking regression...

login
register
mail settings
Submitter Timo Teräs
Date Sept. 23, 2008, 12:08 p.m.
Message ID <48D8DC28.1020001@iki.fi>
Download mbox | patch
Permalink /patch/1053/
State Changes Requested
Delegated to: David Miller
Headers show

Comments

Timo Teräs - Sept. 23, 2008, 12:08 p.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote:
>> This would make it possibly to reclaim the deleted entries right
>> away.
>>
>> Does this sound better?
> 
> Yep this sounds pretty good to me.
> 
> Thanks,

So the patch would look something like this. Compile tested.
Will test later when it becomes possible to reboot my box.

Cheers,
 Timo

ipsec: Fix up xfrm_state_walk.state on node deletion

Now that we track xfrm_state_walks, it makes more sense to
fix up the state pointer on deletion of the node if needed.

This allows accurately to delete all entries immediately,
instead of possibly waiting for userland.

Also fixed locking of xfrm_state_walks list handling.

Signed-off-by: Timo Teras <timo.teras@iki.fi>
---
 include/linux/netlink.h |    2 +-
 include/net/xfrm.h      |    3 +-
 net/xfrm/xfrm_state.c   |   77 +++++++++++++++++-----------------------------
 3 files changed, 31 insertions(+), 51 deletions(-)
Herbert Xu - Sept. 23, 2008, 12:14 p.m.
On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote:
>
> +		list_for_each_entry(walk, &xfrm_state_walks, list) {
> +			if (walk->state == x)
> +				walk->state = next;
> +		}

Just make a new list in xfrm_state that has all the dumpers sitting
on it.  In fact all you need is a hlist, or even just a pointer.

Walking through an unbounded list on the fast path is not good :)

Cheers,
Timo Teräs - Sept. 23, 2008, 12:25 p.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote:
>> +		list_for_each_entry(walk, &xfrm_state_walks, list) {
>> +			if (walk->state == x)
>> +				walk->state = next;
>> +		}
> 
> Just make a new list in xfrm_state that has all the dumpers sitting
> on it.  In fact all you need is a hlist, or even just a pointer.

There can be a lot of xfrm_state structs. As in thousands. So it
will take more memory then. hlist would be enough, so it'd be a
4/8 bytes addition. Single pointer would not be enough as we can
have multiple walks on the same entry.

> Walking through an unbounded list on the fast path is not good :)

I was thinking about the same. That's why I wasn't so sure if this is
better. In practice there aren't many walks active. Maybe we limit
the maximum simultaneous walks?

But in real life, there is only a one or two walkers if any active.
So I would not consider a global update too expensive. But if you think
it might become a problem, I'd add an hlist to xfrm_state of walkers
referencing that entry.

- Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 23, 2008, 12:56 p.m.
On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote:
>
> There can be a lot of xfrm_state structs. As in thousands. So it
> will take more memory then. hlist would be enough, so it'd be a
> 4/8 bytes addition. Single pointer would not be enough as we can
> have multiple walks on the same entry.

Right, we need doubly linked lists in the walkers to ease unlinking.
But only a single pointer (i.e., an hlist) is needed in xfrm_state.

> I was thinking about the same. That's why I wasn't so sure if this is
> better. In practice there aren't many walks active. Maybe we limit
> the maximum simultaneous walks?

That sounds like a hack :)

> But in real life, there is only a one or two walkers if any active.
> So I would not consider a global update too expensive. But if you think
> it might become a problem, I'd add an hlist to xfrm_state of walkers
> referencing that entry.

Well in real life dumps don't tend to get suspended either :)
In any case, a suspended dump is only going to pin down some
memory, while a very long list walk may kill the box.

Cheers,
Timo Teräs - Sept. 23, 2008, 1:01 p.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote:
>> There can be a lot of xfrm_state structs. As in thousands. So it
>> will take more memory then. hlist would be enough, so it'd be a
>> 4/8 bytes addition. Single pointer would not be enough as we can
>> have multiple walks on the same entry.
> 
> Right, we need doubly linked lists in the walkers to ease unlinking.
> But only a single pointer (i.e., an hlist) is needed in xfrm_state.

Right.
 
>> I was thinking about the same. That's why I wasn't so sure if this is
>> better. In practice there aren't many walks active. Maybe we limit
>> the maximum simultaneous walks?
> 
> That sounds like a hack :)

Yes :)
 
>> But in real life, there is only a one or two walkers if any active.
>> So I would not consider a global update too expensive. But if you think
>> it might become a problem, I'd add an hlist to xfrm_state of walkers
>> referencing that entry.
> 
> Well in real life dumps don't tend to get suspended either :)
> In any case, a suspended dump is only going to pin down some
> memory, while a very long list walk may kill the box.

So, what to do?
1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
   xfrm_state_put(all.next) on destruct.
2. Add per-entry hlist of walkers currently referencing it.
3. Use the global walker list.

1 can keep memory allocated until userland wakes up. 2 & 3 can make
the delete of that entry slow if there's many walkers suspended.

Btw. the current stuff in net-next is broken. There's no locking
for xfrm_state_walkers list handling.

Cheers,
  Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 23, 2008, 1:07 p.m.
On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote:
>
> So, what to do?
> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
>    xfrm_state_put(all.next) on destruct.
> 2. Add per-entry hlist of walkers currently referencing it.
> 3. Use the global walker list.
> 
> 1 can keep memory allocated until userland wakes up. 2 & 3 can make
> the delete of that entry slow if there's many walkers suspended.

I'd cross 3 off the list because 2 is just so much better :)

I'd slightly lean towards 2 but now that you mention it yes even
that is vulnerable to loads of dumpers sitting on the same entry.
So SELINUX folks wouldn't like that :)

> Btw. the current stuff in net-next is broken. There's no locking
> for xfrm_state_walkers list handling.

What about xfrm_cfg_mutex?

Cheers,
Timo Teräs - Sept. 23, 2008, 1:30 p.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote:
>> So, what to do?
>> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
>>    xfrm_state_put(all.next) on destruct.
>> 2. Add per-entry hlist of walkers currently referencing it.
>> 3. Use the global walker list.
>>
>> 1 can keep memory allocated until userland wakes up. 2 & 3 can make
>> the delete of that entry slow if there's many walkers suspended.
> 
> I'd cross 3 off the list because 2 is just so much better :)
> 
> I'd slightly lean towards 2 but now that you mention it yes even
> that is vulnerable to loads of dumpers sitting on the same entry.
> So SELINUX folks wouldn't like that :)

Umm... right. It's a tricky problem. Cannot think of perfect
solution atm. But I guess 3 is in general case the best. But in
worst case scenarios 1 performs better.

I have no strong opinion either way. So what ever you want, I'm
happy to provide a patch for.

>> Btw. the current stuff in net-next is broken. There's no locking
>> for xfrm_state_walkers list handling.
> 
> What about xfrm_cfg_mutex?

It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
touch xfrm_state_walks list without locking properly. At least in
the version I'm looking (= net-next-2.6 via git web interface).

- Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 23, 2008, 1:32 p.m.
On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote:
>
> I have no strong opinion either way. So what ever you want, I'm
> happy to provide a patch for.

Let's just leave it as it is for now.

> > What about xfrm_cfg_mutex?
> 
> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
> touch xfrm_state_walks list without locking properly. At least in
> the version I'm looking (= net-next-2.6 via git web interface).

Their callers should be holding it.

Cheers,
Timo Teräs - Sept. 23, 2008, 1:46 p.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote:
>> I have no strong opinion either way. So what ever you want, I'm
>> happy to provide a patch for.
> 
> Let's just leave it as it is for now.

Well, 1 is an improvement over the current implementation, so
I think it'd be better than not doing anything.

>>> What about xfrm_cfg_mutex?
>> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
>> touch xfrm_state_walks list without locking properly. At least in
>> the version I'm looking (= net-next-2.6 via git web interface).
> 
> Their callers should be holding it.

Ah, the other layers take it at least on _walk_init paths. But
_walk_done can be called from recv() syscalls. The af_key
implementation does not take xfrm_cfg_mutex there. I don't think
xfrm_user does that either as it does not pass cb_mutex to
netlink_kernel_create. So at least the _state_walk_done path
is unsafe as-is, I think.

- Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@  struct netlink_callback
 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
 	int		(*done)(struct netlink_callback *cb);
 	int		family;
-	long		args[7];
+	long		args[6];
 };
 
 struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..7f787c7 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@  struct xfrm_state
 {
 	struct list_head	all;
 	union {
-		struct list_head	gclist;
+		struct hlist_node	gclist;
 		struct hlist_node	bydst;
 	};
 	struct hlist_node	bysrc;
@@ -1247,7 +1247,6 @@  struct xfrm6_tunnel {
 
 struct xfrm_state_walk {
 	struct list_head list;
-	unsigned long genid;
 	struct xfrm_state *state;
 	int count;
 	u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..636f7ee 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,11 +59,6 @@  static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
-/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
-static unsigned long xfrm_state_walk_ongoing;
-/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
-static unsigned long xfrm_state_walk_completed;
-
 /* List of outstanding state walks used to set the completed counter.  */
 static LIST_HEAD(xfrm_state_walks);
 
@@ -199,8 +194,7 @@  static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static LIST_HEAD(xfrm_state_gc_leftovers);
-static LIST_HEAD(xfrm_state_gc_list);
+static HLIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -412,23 +406,16 @@  static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x, *tmp;
-	unsigned long completed;
+	struct xfrm_state *x;
+	struct hlist_node *entry, *tmp;
+	struct hlist_head gc_list;
 
-	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
+	hlist_move_list(&xfrm_state_gc_list, &gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	completed = xfrm_state_walk_completed;
-	mutex_unlock(&xfrm_cfg_mutex);
-
-	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
-		if ((long)(x->lastused - completed) > 0)
-			break;
-		list_del(&x->gclist);
+	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist)
 		xfrm_state_gc_destroy(x);
-	}
 
 	wake_up(&km_waitq);
 }
@@ -556,7 +543,7 @@  void __xfrm_state_destroy(struct xfrm_state *x)
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_add_tail(&x->gclist, &xfrm_state_gc_list);
+	hlist_add_head(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -564,13 +551,22 @@  EXPORT_SYMBOL(__xfrm_state_destroy);
 
 int __xfrm_state_delete(struct xfrm_state *x)
 {
+	struct xfrm_state_walk *walk;
+	struct xfrm_state *next;
 	int err = -ESRCH;
 
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
-		x->lastused = xfrm_state_walk_ongoing;
-		list_del_rcu(&x->all);
+		if (list_is_last(&x->all, &xfrm_state_walks))
+			next = NULL;
+		else
+			next = container_of(x->all.next, struct xfrm_state, all);
+		list_for_each_entry(walk, &xfrm_state_walks, list) {
+			if (walk->state == x)
+				walk->state = next;
+		}
+		list_del(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1566,15 +1562,16 @@  int xfrm_state_walk(struct xfrm_state_walk *walk,
 		    int (*func)(struct xfrm_state *, int, void*),
 		    void *data)
 {
-	struct xfrm_state *old, *x, *last = NULL;
+	struct xfrm_state *x, *last = NULL;
 	int err = 0;
 
 	if (walk->state == NULL && walk->count != 0)
 		return 0;
 
-	old = x = walk->state;
-	walk->state = NULL;
 	spin_lock_bh(&xfrm_state_lock);
+	x = walk->state;
+	walk->state = NULL;
+
 	if (x == NULL)
 		x = list_first_entry(&xfrm_state_all, struct xfrm_state, all);
 	list_for_each_entry_from(x, &xfrm_state_all, all) {
@@ -1585,7 +1582,6 @@  int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (last) {
 			err = func(last, walk->count, data);
 			if (err) {
-				xfrm_state_hold(last);
 				walk->state = last;
 				goto out;
 			}
@@ -1601,8 +1597,7 @@  int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
-		xfrm_state_put(old);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
@@ -1612,33 +1607,19 @@  void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->proto = proto;
 	walk->state = NULL;
 	walk->count = 0;
+
+	spin_lock_bh(&xfrm_state_lock);
 	list_add_tail(&walk->list, &xfrm_state_walks);
-	walk->genid = ++xfrm_state_walk_ongoing;
+	spin_unlock_bh(&xfrm_state_lock);
 }
 EXPORT_SYMBOL(xfrm_state_walk_init);
 
 void xfrm_state_walk_done(struct xfrm_state_walk *walk)
 {
-	struct list_head *prev;
-
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-
-	prev = walk->list.prev;
+	spin_lock_bh(&xfrm_state_lock);
 	list_del(&walk->list);
-
-	if (prev != &xfrm_state_walks) {
-		list_entry(prev, struct xfrm_state_walk, list)->genid =
-			walk->genid;
-		return;
-	}
-
-	xfrm_state_walk_completed = walk->genid;
-
-	if (!list_empty(&xfrm_state_gc_leftovers))
-		schedule_work(&xfrm_state_gc_work);
+	spin_unlock_bh(&xfrm_state_lock);
+	walk->state = NULL;
 }
 EXPORT_SYMBOL(xfrm_state_walk_done);