Patchwork xfrm_state locking regression...

Submitter Herbert Xu
Date Sept. 22, 2008, 11:42 a.m.
Message ID <20080922114256.GA27055@gondor.apana.org.au>
Permalink /patch/870/
State Accepted
Delegated to: David Miller

Comments

Herbert Xu - Sept. 22, 2008, 11:42 a.m.
On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote:
>
> But the flaw still exists: the entries that iterator->next points
> to can be deleted if one walk takes a long time while other
> walks complete quickly in the meantime.

Thanks, I guess that's why I'm not paid to work on RCU :)

> So if we do list_del_rcu() on delete, could we also xfrm_state_hold()
> the entry pointed to by that list entry. And then on GC we could
> xfrm_state_put() the next entry.

Unfortunately it's not that simple since we'll be in the same
bind if the entry after the next entry gets deleted as well as
the next entry.

The only other solution is to go back to the original scheme
where we keep the list intact until GC.  However, nobody has
come up with a way of doing that that allows the SA creation
path to run locklessly with respect to the dumping path.

So I think we'll have to stick with the quasi-RCU solution for
now.  Here's a patch to fix the flaw that you discovered.

ipsec: Fix xfrm_state_walk race

As discovered by Timo Teräs, the current xfrm_state_walk scheme
is racy because if a second dump finishes before the first, we
may free xfrm states that the first dump would walk over later.

This patch fixes this by storing the dumps in a list in order
to calculate the correct completion counter which cures this
problem.

I've expanded netlink_cb in order to accommodate the extra state
related to this.  It shouldn't be a big deal since netlink_cb
is kmalloced for each dump and we're just increasing it by 4 or
8 bytes.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>


Cheers,
Timo Teräs - Sept. 22, 2008, 1:01 p.m.
Herbert Xu wrote:
> On Sun, Sep 21, 2008 at 06:21:27PM +0300, Timo Teräs wrote:
>> So if we do list_del_rcu() on delete, could we also xfrm_state_hold()
>> the entry pointed to by that list entry. And then on GC we could
>> xfrm_state_put() the next entry.
> 
> Unfortunately it's not that simple since we'll be in the same
> bind if the entry after the next entry gets deleted as well as
> the next entry.

Well, I was thinking that we hold the next pointer. And when
continuing the dump, we can first skip all entries that are marked
as dead (each next pointer is valid since each of the next pointers
are held once). When we find the first valid entry to dump we
_put() the originally held entry. That would recursively _put() all
the next entries which were held.

> The only other solution is to go back to the original scheme
> where we keep the list intact until GC.  However, nobody has
> come up with a way of doing that that allows the SA creation
> path to run locklessly with respect to the dumping path.
> 
> So I think we'll have to stick with the quasi-RCU solution for
> now.  Here's a patch to fix the flaw that you discovered.

But yes, this would work as well.

Not sure which one would be faster. I guess the holding of
individual entries would be at least more memory efficient.

Cheers,
  Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Herbert Xu - Sept. 22, 2008, 11:50 p.m.
On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote:
>
> > Unfortunately it's not that simple since we'll be in the same
> > bind if the entry after the next entry gets deleted as well as
> > the next entry.
> 
> Well, I was thinking that we hold the next pointer. And when
> continuing the dump, we can first skip all entries that are marked
> as dead (each next pointer is valid since each of the next pointers
> are held once). When we find the first valid entry to dump we
> _put() the originally held entry. That would recursively _put() all
> the next entries which were held.

No that doesn't work.  Let's say we store the entry X in walk->state,
and we hold X as well as X->next.  Now X, X->next, and X->next->next
get deleted from the list.  What'll happen is that X and X->next
will stick around but X->next->next will be freed.  So when we
resume from X we'll dump X and X->next correctly, but then hit
X->next->next and be in the same shithole.

Really, the only way to do what you want here is to hold everything
from X to the very end of the list.  The problem with that is that
you can no longer drop the reference counts when we resume.

You can see why even if we only hold X and X->next.  Should X->next
be deleted from the list, when we resume from X we'll no longer be
able to drop the reference count on the original X->next since it
now points to a new entry.

> Not sure which one would be faster. I guess the holding of
> individual entries would be at least more memory efficient.

I think this is a lot more efficient than storing every single
entry from X to the end of the list :)

Cheers,
David Miller - Sept. 23, 2008, 2:48 a.m.
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 22 Sep 2008 19:42:56 +0800

> ipsec: Fix xfrm_state_walk race
> 
> As discovered by Timo Teräs, the current xfrm_state_walk scheme
> is racy because if a second dump finishes before the first, we
> may free xfrm states that the first dump would walk over later.
> 
> This patch fixes this by storing the dumps in a list in order
> to calculate the correct completion counter which cures this
> problem.
> 
> I've expanded netlink_cb in order to accommodate the extra state
> related to this.  It shouldn't be a big deal since netlink_cb
> is kmalloced for each dump and we're just increasing it by 4 or
> 8 bytes.
> 
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Looks good, applied to net-next-2.6
Timo Teräs - Sept. 23, 2008, 4:53 a.m.
Herbert Xu wrote:
> On Mon, Sep 22, 2008 at 04:01:14PM +0300, Timo Teräs wrote:
>>> Unfortunately it's not that simple since we'll be in the same
>>> bind if the entry after the next entry gets deleted as well as
>>> the next entry.
>> Well, I was thinking that we hold the next pointer. And when
>> continuing the dump, we can first skip all entries that are marked
>> as dead (each next pointer is valid since each of the next pointers
>> are held once). When we find the first valid entry to dump we
>> _put() the originally held entry. That would recursively _put() all
>> the next entries which were held.
> 
> No that doesn't work.  Let's say we store the entry X in walk->state,
> and we hold X as well as X->next.  Now X, X->next, and X->next->next
> get deleted from the list.  What'll happen is that X and X->next
> will stick around but X->next->next will be freed.  So when we
> resume from X we'll dump X and X->next correctly, but then hit
> X->next->next and be in the same shithole.

I think it would work. Here are the scenarios:
We hold X as dumping is interrupted there. X->next points statically to
some non-deleted entry and is held.

Now, if X->next gets deleted, it's marked dead and X->next->next is held
too. Thus when there are multiple deleted entries in a chain, the whole
chain is held recursively/iteratively. When walking is continued at X,
the first thing we do is skip all dead entries from X; after that we
put X, and that triggers put() for all the X->next entries which were
held iteratively.

If X->next is not deleted, and X->next->next gets deleted, the X->next
list structure is updated correctly by list_del_rcu and the entry can
actually be freed even if the walk didn't reach that entry (it
would be skipped anyway, as it's marked dead on deletion).

So the idea was to hold X->next from the deletion function, not from
the walking function. That is, we always hold deleted->next while
there are ongoing walks, and on the final _put() we _put() the ->next
entry.

I think that would work.

- Timo
Herbert Xu - Sept. 23, 2008, 4:59 a.m.
On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote:
>
> So the idea was to hold X->next from deletion function, not from
> the walking function. That would be, we always hold deleted->next when
> there are ongoing walks. And on final _put() we _put() the ->next
> entry.
> 
> I think that would work.

Right, this would work.

However, now you're again slowing down the relatively fast path of
state deletion by moving extra work there from the non-critical
dump path.

When we optimise code for time, we don't necessarily try to minimise
the work overall.  We instead try to minimise the amount of work on
the critical paths.  It's OK to let the other paths such as dumping
do a bit more work if it improves the fast paths.

Cheers,
Timo Teräs - Sept. 23, 2008, 5:17 a.m.
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 07:53:18AM +0300, Timo Teräs wrote:
>> So the idea was to hold X->next from deletion function, not from
>> the walking function. That would be, we always hold deleted->next when
>> there are ongoing walks. And on final _put() we _put() the ->next
>> entry.
>>
>> I think that would work.
> 
> Right, this would work.
> 
> However, now you're again slowing down the relatively fast path of
> state deletion by moving extra work there from the non-critical
> dump path.

The extra step there would be a hold call. The recursive _put() on a
_put() of some entry can happen only on the dump path, since otherwise
the ->next entry is first held in state delete, but is immediately
_put() again by the final _put() of _delete().

So basically one additional atomic_inc() and one atomic_dec() on the
normal _delete() path.

> When we optimise code for time, we don't necessarily try to minimise
> the work overall.  We instead try to minimise the amount of work on
> the critical paths.  It's OK to let the other paths such as dumping
> do a bit more work if it improves the fast paths.

Not sure about the overall penalty, but yes I know it would have some
penalty. But at least there would be no locking.

In particular, with your approach, a userland process that gets
suspended in the middle of a dump would block garbage collection of
all entries indefinitely, until that process is resumed or killed.

So this would allow deletion and GC of entries even during walking.
But this was just a thought. Maybe it's not worth trying.

Cheers,
  Timo
Herbert Xu - Sept. 23, 2008, 5:22 a.m.
On Tue, Sep 23, 2008 at 08:17:14AM +0300, Timo Teräs wrote:
>
> The extra step there wold be a hold call. The recursive _put on a
> _put of some entry can happen only on dump path. As otherwise the
> ->next entry is first held in state delete, but would be immediately
> _put on the _put as the final step of _delete().
> 
> So basically one additional atomic_inc() and one atomic_dec() on the
> normal _delete() path.

Can you post a patch? If this can be done locklessly then yes
it would probably be a good way to go.  However, I'm not sure
whether I understand your lockless proposal yet.

Thanks,

Patch

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 9ff1b54..cbba776 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@  struct netlink_callback
 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
 	int		(*done)(struct netlink_callback *cb);
 	int		family;
-	long		args[6];
+	long		args[7];
 };
 
 struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 4bb9499..48630b2 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -1246,6 +1246,8 @@  struct xfrm6_tunnel {
 };
 
 struct xfrm_state_walk {
+	struct list_head list;
+	unsigned long genid;
 	struct xfrm_state *state;
 	int count;
 	u8 proto;
@@ -1281,13 +1283,7 @@  static inline void xfrm6_fini(void)
 extern int xfrm_proc_init(void);
 #endif
 
-static inline void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
-{
-	walk->proto = proto;
-	walk->state = NULL;
-	walk->count = 0;
-}
-
+extern void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto);
 extern int xfrm_state_walk(struct xfrm_state_walk *walk,
 			   int (*func)(struct xfrm_state *, int, void*), void *);
 extern void xfrm_state_walk_done(struct xfrm_state_walk *walk);
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index fe84a46..f7805c5 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -64,6 +64,9 @@  static unsigned long xfrm_state_walk_ongoing;
 /* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
 static unsigned long xfrm_state_walk_completed;
 
+/* List of outstanding state walks used to set the completed counter.  */
+static LIST_HEAD(xfrm_state_walks);
+
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -1582,7 +1585,6 @@  int xfrm_state_walk(struct xfrm_state_walk *walk,
 			if (err) {
 				xfrm_state_hold(last);
 				walk->state = last;
-				xfrm_state_walk_ongoing++;
 				goto out;
 			}
 		}
@@ -1597,25 +1599,44 @@  int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL) {
+	if (old != NULL)
 		xfrm_state_put(old);
-		xfrm_state_walk_completed++;
-		if (!list_empty(&xfrm_state_gc_leftovers))
-			schedule_work(&xfrm_state_gc_work);
-	}
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
 
+void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
+{
+	walk->proto = proto;
+	walk->state = NULL;
+	walk->count = 0;
+	list_add_tail(&walk->list, &xfrm_state_walks);
+	walk->genid = ++xfrm_state_walk_ongoing;
+}
+EXPORT_SYMBOL(xfrm_state_walk_init);
+
 void xfrm_state_walk_done(struct xfrm_state_walk *walk)
 {
+	struct list_head *prev;
+
 	if (walk->state != NULL) {
 		xfrm_state_put(walk->state);
 		walk->state = NULL;
-		xfrm_state_walk_completed++;
-		if (!list_empty(&xfrm_state_gc_leftovers))
-			schedule_work(&xfrm_state_gc_work);
 	}
+
+	prev = walk->list.prev;
+	list_del(&walk->list);
+
+	if (prev != &xfrm_state_walks) {
+		list_entry(prev, struct xfrm_state_walk, list)->genid =
+			walk->genid;
+		return;
+	}
+
+	xfrm_state_walk_completed = walk->genid;
+
+	if (!list_empty(&xfrm_state_gc_leftovers))
+		schedule_work(&xfrm_state_gc_work);
 }
 EXPORT_SYMBOL(xfrm_state_walk_done);