
xfrm_state locking regression...

Message ID: 48D88BCC.5030806@iki.fi
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Timo Teras Sept. 23, 2008, 6:25 a.m. UTC
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote:
>> The extra step there would be a hold call. A recursive _put on the
>> _put of some entry can happen only on the dump path. Otherwise the
>> ->next entry is first held in the state delete, but is released again
>> right away by the _put that is the final step of _delete().
>>
>> So basically one additional atomic_inc() and one atomic_dec() on the
>> normal _delete() path.
> 
> Can you post a patch? If this can be done locklessly then yes
> it would probably be a good way to go.  However, I'm not sure
> whether I understand your lockless proposal yet.

Something like this. I just compile tested, so I'm not sure if it
actually works.

This basically reverts the GC changes. The interesting bits are in
xfrm_state_delete(). It will _hold() the ->next entry unless we
were at the last entry. This makes sure that while an entry is in
XFRM_STATE_DEAD, its all.next stays valid all the way until the
entry is destroyed.

The corresponding put is in _destroy(). I added it as the final thing
to do, so that hopefully the compiler optimizes the tail recursion into
iteration. (It seemed to do that in my case.)

And finally, _walk() was right all along. It resumes dumping by first
skipping all dead entries. Just before returning, once all dead entries
have been skipped, the originally held entry is _put(). That _put() call
(or the one in _walk_done()) causes all the dead entries that were held
to be put iteratively.
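
Condensed, the interesting bits look roughly like this (only a sketch;
the real hunks are in the patch below, and everything elided here stays
as it is today):

int __xfrm_state_delete(struct xfrm_state *x)
{
        int err = -ESRCH;

        if (x->km.state != XFRM_STATE_DEAD) {
                x->km.state = XFRM_STATE_DEAD;
                spin_lock(&xfrm_state_lock);
                list_del_rcu(&x->all);
                /* Pin the successor: a walker sleeping on x can still
                 * follow x->all.next until x is finally destroyed. */
                if (x->all.next != &xfrm_state_all)
                        xfrm_state_hold(container_of(x->all.next,
                                                     struct xfrm_state, all));
                /* ... unhash from bydst/bysrc/byspi as before ... */
                spin_unlock(&xfrm_state_lock);
                /* ... rest of _delete() unchanged ... */
                err = 0;
        }

        return err;
}

void __xfrm_state_destroy(struct xfrm_state *x)
{
        struct xfrm_state *next = NULL;

        WARN_ON(x->km.state != XFRM_STATE_DEAD);

        if (x->all.next != &xfrm_state_all)
                next = container_of(x->all.next, struct xfrm_state, all);
        /* ... move x to the gc list and schedule the gc work ... */

        /* Drop the pin as the very last step; if next is itself a dead
         * entry this unwinds the whole chain. */
        if (next != NULL)
                xfrm_state_put(next);
}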

- Timo


Comments

Herbert Xu Sept. 23, 2008, 6:47 a.m. UTC | #1
On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote:
>
> Something like this. I just compile tested, so I'm not sure if it
> actually works.

It is quite clever :)
 
> The corresponding put is in _destroy(). I added it as the final thing
> to do, so that hopefully the compiler optimizes the tail recursion into
> iteration. (It seemed to do that in my case.)

Yeah, we should open-code this to be absolutely sure that we don't
get killed by a stupid compiler.
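
Something like the following, as an untested sketch of what the
open-coded loop could look like:

void __xfrm_state_destroy(struct xfrm_state *x)
{
        struct xfrm_state *next;

        for (;;) {
                WARN_ON(x->km.state != XFRM_STATE_DEAD);

                next = NULL;
                if (x->all.next != &xfrm_state_all)
                        next = container_of(x->all.next,
                                            struct xfrm_state, all);

                spin_lock_bh(&xfrm_state_gc_lock);
                hlist_add_head(&x->gclist, &xfrm_state_gc_list);
                spin_unlock_bh(&xfrm_state_gc_lock);
                schedule_work(&xfrm_state_gc_work);

                /* Open-coded xfrm_state_put(next): keep unwinding only
                 * if our pin was the last reference. */
                if (next == NULL || !atomic_dec_and_test(&next->refcnt))
                        break;
                x = next;
        }
}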

However, I think this is still open to the same problem that my
patch had, i.e., if a dumper goes to sleep during the dump we may
prevent entries from being freed indefinitely.  Yes your idea
is better in that we may only withhold say a fraction (depending
on the order of deletion it could be anywhere from none to every
entry) of the entries instead of all of them, but fundamentally
the same issue is still there.

Considering the fact that dumps require root privileges I'm not
convinced as of yet that this is worth it.

Hmm, could we perhaps go back to your original scheme of keeping
everything on the list and see if we can use RCU to make it lockless
instead?
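
Just to sketch the direction (heavily hand-waved; it does not solve
the hard part of resuming after the dump has slept outside the RCU
read side, and freeing would also have to go through an RCU grace
period):

        struct xfrm_state *x;

        rcu_read_lock();
        list_for_each_entry_rcu(x, &xfrm_state_all, all) {
                if (x->km.state == XFRM_STATE_DEAD)
                        continue;
                /* ... dump x into the skb ... */
        }
        rcu_read_unlock();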

Cheers,
Timo Teras Sept. 23, 2008, 6:56 a.m. UTC | #2
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote:
>> Something like this. I just compile tested, so I'm not sure if it
>> actually works.
> 
> It is quite clever :)

Thanks :)

And actually, we can make _delete() do a normal list_del() and not hold
the next pointer if there are no running dumps. That is the common case
anyway, so there would then be no penalty at all.
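
Roughly like this (hypothetical sketch only, not part of the patch I
posted; 'xfrm_state_walkers' would be a count of active walks and
'x->gc_next' would remember whether a successor got pinned, both
maintained under xfrm_state_lock):

        /* in __xfrm_state_delete(), instead of the unconditional hold: */
        spin_lock(&xfrm_state_lock);
        list_del_rcu(&x->all);
        x->gc_next = NULL;
        if (xfrm_state_walkers != 0 && x->all.next != &xfrm_state_all) {
                x->gc_next = container_of(x->all.next,
                                          struct xfrm_state, all);
                xfrm_state_hold(x->gc_next);
        }
        spin_unlock(&xfrm_state_lock);

        /* __xfrm_state_destroy() then puts x->gc_next (when set) instead
         * of recomputing the successor from x->all.next. */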

>> The corresponding put is in _destroy(). I added it as the final thing
>> to do, so that hopefully the compiler optimizes the tail recursion into
>> iteration. (It seemed to do that in my case.)
> 
> Yeah, we should open-code this to be absolutely sure that we don't
> get killed by a stupid compiler.

Right. The patch was just to show the idea. I will do a proper patch
with this and the above-mentioned optimization if you think we should
switch to this until we get really cool lockless dumping.

> However, I think this is still open to the same problem that my
> patch had, i.e., if a dumper goes to sleep during the dump we may
> prevent entries from being freed indefinitely.  Yes your idea
> is better in that we may only withhold say a fraction (depending
> on the order of deletion it could be anywhere from none to every
> entry) of the entries instead of all of them, but fundamentally
> the same issue is still there.
> 
> Considering the fact that dumps require root privileges I'm not
> convinced as of yet that this is worth it.

You are right. It can keep memory allocated. And in the worst case you
end up keeping all the deleted entries (if entries are deleted in the
order they appear in the all list). But I would not be too worried
about it either.

> Hmm, could we perhaps go back to your original scheme of keeping
> everything on the list and see if we can use RCU to make it lockless
> instead?

That would be ideal, of course.

- Timo

Timo Teras Sept. 23, 2008, 9:39 a.m. UTC | #3
Herbert Xu wrote:
> However, I think this is still open to the same problem that my
> patch had, i.e., if a dumper goes to sleep during the dump we may
> prevent entries from being freed indefinitely.  Yes your idea
> is better in that we may only withhold say a fraction (depending
> on the order of deletion it could be anywhere from none to every
> entry) of the entries instead of all of them, but fundamentally
> the same issue is still there.
> 
> Considering the fact that dumps require root privileges I'm not
> convinced as of yet that this is worth it.
> 
> Hmm, could we perhaps go back to your original scheme of keeping
> everything on the list and see if we can use RCU to make it lockless
> instead?

How about this: we keep a list of walks as your latest patch does.
When walking is interrupted, we do not hold the entry; we just store
the pointer in the walk iterator. When an entry is deleted from the
lists we go through the walk contexts, and if any of them points to
the entry being deleted, we just update it to the next entry.

The list of walkers can be protected by xfrm_state_lock. That
needs to be taken anyway for accessing/modifying the other lists.

Most of the time there are no walks active, so the penalty for
_destroy() is minimal.

This would make it possible to reclaim the deleted entries right
away.
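
In code it could be something like this (rough sketch only; it keeps
the 'list' member in struct xfrm_state_walk, and the 'xfrm_state_walkers'
list head and the re-parking loop are illustrations, not a patch):

static LIST_HEAD(xfrm_state_walkers);   /* active xfrm_state_walk contexts */

        /* in __xfrm_state_delete(), with xfrm_state_lock held, before
         * list_del_rcu(&x->all): */
        struct xfrm_state_walk *walk;

        list_for_each_entry(walk, &xfrm_state_walkers, list) {
                if (walk->state != x)
                        continue;
                /* Re-park the interrupted dump on x's successor (or at
                 * the end of the list) so x can be freed right away. */
                if (x->all.next != &xfrm_state_all)
                        walk->state = container_of(x->all.next,
                                                   struct xfrm_state, all);
                else
                        walk->state = NULL;
        }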

Does this sound better?

Cheers,
  Timo

Herbert Xu Sept. 23, 2008, 11:24 a.m. UTC | #4
On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote:
>
> This would make it possible to reclaim the deleted entries right
> away.
> 
> Does this sound better?

Yep this sounds pretty good to me.

Thanks,

Patch

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@  struct netlink_callback
 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
 	int		(*done)(struct netlink_callback *cb);
 	int		family;
-	long		args[7];
+	long		args[6];
 };
 
 struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..d5cba7b 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@  struct xfrm_state
 {
 	struct list_head	all;
 	union {
-		struct list_head	gclist;
+		struct hlist_node	gclist;
 		struct hlist_node	bydst;
 	};
 	struct hlist_node	bysrc;
@@ -1246,8 +1246,6 @@  struct xfrm6_tunnel {
 };
 
 struct xfrm_state_walk {
-	struct list_head list;
-	unsigned long genid;
 	struct xfrm_state *state;
 	int count;
 	u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..f45f006 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,14 +59,6 @@  static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
-/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
-static unsigned long xfrm_state_walk_ongoing;
-/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
-static unsigned long xfrm_state_walk_completed;
-
-/* List of outstanding state walks used to set the completed counter.  */
-static LIST_HEAD(xfrm_state_walks);
-
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -199,8 +191,7 @@  static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static LIST_HEAD(xfrm_state_gc_leftovers);
-static LIST_HEAD(xfrm_state_gc_list);
+static HLIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -412,23 +403,17 @@  static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x, *tmp;
-	unsigned long completed;
+	struct xfrm_state *x;
+	struct hlist_node *entry, *tmp;
+	struct hlist_head gc_list;
 
-	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
+	gc_list.first = xfrm_state_gc_list.first;
+	INIT_HLIST_HEAD(&xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	completed = xfrm_state_walk_completed;
-	mutex_unlock(&xfrm_cfg_mutex);
-
-	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
-		if ((long)(x->lastused - completed) > 0)
-			break;
-		list_del(&x->gclist);
+	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist)
 		xfrm_state_gc_destroy(x);
-	}
 
 	wake_up(&km_waitq);
 }
@@ -553,12 +538,19 @@  EXPORT_SYMBOL(xfrm_state_alloc);
 
 void __xfrm_state_destroy(struct xfrm_state *x)
 {
+	struct xfrm_state *next = NULL;
+
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
+	if (x->all.next != &xfrm_state_all)
+		next = container_of(x->all.next, struct xfrm_state, all);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_add_tail(&x->gclist, &xfrm_state_gc_list);
+	hlist_add_head(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
+
+	if (next != NULL)
+		xfrm_state_put(next);
 }
 EXPORT_SYMBOL(__xfrm_state_destroy);
 
@@ -569,8 +561,10 @@  int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
-		x->lastused = xfrm_state_walk_ongoing;
 		list_del_rcu(&x->all);
+		if (x->all.next != &xfrm_state_all)
+			xfrm_state_hold(container_of(x->all.next,
+						     struct xfrm_state, all));
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1612,33 +1606,15 @@  void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->proto = proto;
 	walk->state = NULL;
 	walk->count = 0;
-	list_add_tail(&walk->list, &xfrm_state_walks);
-	walk->genid = ++xfrm_state_walk_ongoing;
 }
 EXPORT_SYMBOL(xfrm_state_walk_init);
 
 void xfrm_state_walk_done(struct xfrm_state_walk *walk)
 {
-	struct list_head *prev;
-
 	if (walk->state != NULL) {
 		xfrm_state_put(walk->state);
 		walk->state = NULL;
 	}
-
-	prev = walk->list.prev;
-	list_del(&walk->list);
-
-	if (prev != &xfrm_state_walks) {
-		list_entry(prev, struct xfrm_state_walk, list)->genid =
-			walk->genid;
-		return;
-	}
-
-	xfrm_state_walk_completed = walk->genid;
-
-	if (!list_empty(&xfrm_state_gc_leftovers))
-		schedule_work(&xfrm_state_gc_work);
 }
 EXPORT_SYMBOL(xfrm_state_walk_done);