Message ID | 48D8DC28.1020001@iki.fi
---|---
State | Changes Requested, archived
Delegated to: | David Miller
On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote:
>
> +	list_for_each_entry(walk, &xfrm_state_walks, list) {
> +		if (walk->state == x)
> +			walk->state = next;
> +	}

Just make a new list in xfrm_state that has all the dumpers sitting
on it. In fact all you need is a hlist, or even just a pointer.

Walking through an unbounded list on the fast path is not good :)

Cheers,
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 03:08:08PM +0300, Timo Teräs wrote:
>> +	list_for_each_entry(walk, &xfrm_state_walks, list) {
>> +		if (walk->state == x)
>> +			walk->state = next;
>> +	}
>
> Just make a new list in xfrm_state that has all the dumpers sitting
> on it. In fact all you need is a hlist, or even just a pointer.

There can be a lot of xfrm_state structs. As in thousands. So it will
take more memory then. An hlist would be enough, so it'd be a 4/8 byte
addition. A single pointer would not be enough, as we can have multiple
walks on the same entry.

> Walking through an unbounded list on the fast path is not good :)

I was thinking about the same. That's why I wasn't so sure if this is
better. In practice there aren't many walks active. Maybe we limit the
maximum number of simultaneous walks?

But in real life there are only one or two walkers active, if any. So I
would not consider a global update too expensive. But if you think it
might become a problem, I'd add to xfrm_state an hlist of the walkers
referencing that entry.

- Timo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote:
>
> There can be a lot of xfrm_state structs. As in thousands. So it will
> take more memory then. An hlist would be enough, so it'd be a 4/8 byte
> addition. A single pointer would not be enough, as we can have
> multiple walks on the same entry.

Right, we need doubly linked lists in the walkers to ease unlinking.
But only a single pointer (i.e., an hlist) is needed in xfrm_state.

> I was thinking about the same. That's why I wasn't so sure if this is
> better. In practice there aren't many walks active. Maybe we limit
> the maximum number of simultaneous walks?

That sounds like a hack :)

> But in real life there are only one or two walkers active, if any.
> So I would not consider a global update too expensive. But if you
> think it might become a problem, I'd add to xfrm_state an hlist of
> the walkers referencing that entry.

Well, in real life dumps don't tend to get suspended either :) In any
case, a suspended dump is only going to pin down some memory, while a
very long list walk may kill the box.

Cheers,
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 03:25:41PM +0300, Timo Teräs wrote:
>> There can be a lot of xfrm_state structs. As in thousands. So it will
>> take more memory then. An hlist would be enough, so it'd be a 4/8
>> byte addition. A single pointer would not be enough, as we can have
>> multiple walks on the same entry.
>
> Right, we need doubly linked lists in the walkers to ease unlinking.
> But only a single pointer (i.e., an hlist) is needed in xfrm_state.

Right.

>> I was thinking about the same. That's why I wasn't so sure if this is
>> better. In practice there aren't many walks active. Maybe we limit
>> the maximum number of simultaneous walks?
>
> That sounds like a hack :)

Yes :)

>> But in real life there are only one or two walkers active, if any.
>> So I would not consider a global update too expensive. But if you
>> think it might become a problem, I'd add to xfrm_state an hlist of
>> the walkers referencing that entry.
>
> Well, in real life dumps don't tend to get suspended either :)
> In any case, a suspended dump is only going to pin down some
> memory, while a very long list walk may kill the box.

So, what to do?
1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
   xfrm_state_put(all.next) on destruct.
2. Add a per-entry hlist of the walkers currently referencing it.
3. Use the global walker list.

1 can keep memory allocated until userland wakes up. 2 and 3 can make
the delete of an entry slow if many walkers are suspended on it.

Btw, the current stuff in net-next is broken: there's no locking for
the xfrm_state_walks list handling.

Cheers,
Timo
On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote:
>
> So, what to do?
> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
>    xfrm_state_put(all.next) on destruct.
> 2. Add a per-entry hlist of the walkers currently referencing it.
> 3. Use the global walker list.
>
> 1 can keep memory allocated until userland wakes up. 2 and 3 can make
> the delete of an entry slow if many walkers are suspended on it.

I'd cross 3 off the list because 2 is just so much better :)

I'd slightly lean towards 2, but now that you mention it, yes, even
that is vulnerable to loads of dumpers sitting on the same entry. So
the SELinux folks wouldn't like that :)

> Btw, the current stuff in net-next is broken: there's no locking for
> the xfrm_state_walks list handling.

What about xfrm_cfg_mutex?

Cheers,
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 04:01:29PM +0300, Timo Teräs wrote:
>> So, what to do?
>> 1. Go back to: list_del_rcu, xfrm_state_hold(all.next) on delete and
>>    xfrm_state_put(all.next) on destruct.
>> 2. Add a per-entry hlist of the walkers currently referencing it.
>> 3. Use the global walker list.
>>
>> 1 can keep memory allocated until userland wakes up. 2 and 3 can make
>> the delete of an entry slow if many walkers are suspended on it.
>
> I'd cross 3 off the list because 2 is just so much better :)
>
> I'd slightly lean towards 2, but now that you mention it, yes, even
> that is vulnerable to loads of dumpers sitting on the same entry. So
> the SELinux folks wouldn't like that :)

Umm... right. It's a tricky problem; I cannot think of a perfect
solution at the moment. I guess 3 is the best in the general case, but
in worst-case scenarios 1 performs better.

I have no strong opinion either way, so whatever you want, I'm happy
to provide a patch for.

>> Btw, the current stuff in net-next is broken: there's no locking for
>> the xfrm_state_walks list handling.
>
> What about xfrm_cfg_mutex?

It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done} touch
the xfrm_state_walks list without locking properly, at least in the
version I'm looking at (= net-next-2.6 via the git web interface).

- Timo
On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote:
>
> I have no strong opinion either way, so whatever you want, I'm happy
> to provide a patch for.

Let's just leave it as it is for now.

> > What about xfrm_cfg_mutex?
>
> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
> touch the xfrm_state_walks list without locking properly, at least in
> the version I'm looking at (= net-next-2.6 via the git web interface).

Their callers should be holding it.

Cheers,
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 04:30:06PM +0300, Timo Teräs wrote:
>> I have no strong opinion either way, so whatever you want, I'm happy
>> to provide a patch for.
>
> Let's just leave it as it is for now.

Well, 1 is an improvement over the current implementation, so I think
it'd be better than doing nothing.

>>> What about xfrm_cfg_mutex?
>> It's used only in xfrm_state_gc_task. xfrm_state_walk_{init,done}
>> touch the xfrm_state_walks list without locking properly, at least in
>> the version I'm looking at (= net-next-2.6 via the git web
>> interface).
>
> Their callers should be holding it.

Ah, the other layers take it at least on the _walk_init paths. But
_walk_done can be called from recv() syscalls. The af_key
implementation does not take xfrm_cfg_mutex there, and I don't think
xfrm_user does either, as it does not pass cb_mutex to
netlink_kernel_create. So at least the _state_walk_done path is unsafe
as-is, I think.

- Timo
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
 	int		(*done)(struct netlink_callback *cb);
 	int		family;
-	long		args[7];
+	long		args[6];
 };
 
 struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..7f787c7 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@ struct xfrm_state
 {
 	struct list_head	all;
 	union {
-		struct list_head	gclist;
+		struct hlist_node	gclist;
 		struct hlist_node	bydst;
 	};
 	struct hlist_node	bysrc;
@@ -1247,7 +1247,6 @@ struct xfrm6_tunnel {
 
 struct xfrm_state_walk {
 	struct list_head list;
-	unsigned long genid;
 	struct xfrm_state *state;
 	int count;
 	u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..636f7ee 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,11 +59,6 @@
 static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
-/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
-static unsigned long xfrm_state_walk_ongoing;
-/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
-static unsigned long xfrm_state_walk_completed;
-
 /* List of outstanding state walks used to set the completed counter. */
 static LIST_HEAD(xfrm_state_walks);
@@ -199,8 +194,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static LIST_HEAD(xfrm_state_gc_leftovers);
-static LIST_HEAD(xfrm_state_gc_list);
+static HLIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -412,23 +406,16 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x, *tmp;
-	unsigned long completed;
+	struct xfrm_state *x;
+	struct hlist_node *entry, *tmp;
+	struct hlist_head gc_list;
 
-	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
+	hlist_move_list(&xfrm_state_gc_list, &gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	completed = xfrm_state_walk_completed;
-	mutex_unlock(&xfrm_cfg_mutex);
-
-	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
-		if ((long)(x->lastused - completed) > 0)
-			break;
-		list_del(&x->gclist);
+	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist)
 		xfrm_state_gc_destroy(x);
-	}
 
 	wake_up(&km_waitq);
 }
@@ -556,7 +543,7 @@ void __xfrm_state_destroy(struct xfrm_state *x)
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_add_tail(&x->gclist, &xfrm_state_gc_list);
+	hlist_add_head(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
 }
@@ -564,13 +551,22 @@ EXPORT_SYMBOL(__xfrm_state_destroy);
 
 int __xfrm_state_delete(struct xfrm_state *x)
 {
+	struct xfrm_state_walk *walk;
+	struct xfrm_state *next;
 	int err = -ESRCH;
 
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
-		x->lastused = xfrm_state_walk_ongoing;
-		list_del_rcu(&x->all);
+		if (list_is_last(&x->all, &xfrm_state_walks))
+			next = NULL;
+		else
+			next = container_of(x->all.next, struct xfrm_state, all);
+		list_for_each_entry(walk, &xfrm_state_walks, list) {
+			if (walk->state == x)
+				walk->state = next;
+		}
+		list_del(&x->all);
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1566,15 +1562,16 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		    int (*func)(struct xfrm_state *, int, void*),
 		    void *data)
 {
-	struct xfrm_state *old, *x, *last = NULL;
+	struct xfrm_state *x, *last = NULL;
 	int err = 0;
 
 	if (walk->state == NULL && walk->count != 0)
 		return 0;
 
-	old = x = walk->state;
-	walk->state = NULL;
 	spin_lock_bh(&xfrm_state_lock);
+	x = walk->state;
+	walk->state = NULL;
+
 	if (x == NULL)
 		x = list_first_entry(&xfrm_state_all, struct xfrm_state, all);
 	list_for_each_entry_from(x, &xfrm_state_all, all) {
@@ -1585,7 +1582,6 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		if (last) {
 			err = func(last, walk->count, data);
 			if (err) {
-				xfrm_state_hold(last);
 				walk->state = last;
 				goto out;
 			}
@@ -1601,8 +1597,7 @@ int xfrm_state_walk(struct xfrm_state_walk *walk,
 		err = func(last, 0, data);
 out:
 	spin_unlock_bh(&xfrm_state_lock);
-	if (old != NULL)
-		xfrm_state_put(old);
+
 	return err;
 }
 EXPORT_SYMBOL(xfrm_state_walk);
@@ -1612,33 +1607,19 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 {
 	walk->proto = proto;
 	walk->state = NULL;
 	walk->count = 0;
+
+	spin_lock_bh(&xfrm_state_lock);
 	list_add_tail(&walk->list, &xfrm_state_walks);
-	walk->genid = ++xfrm_state_walk_ongoing;
+	spin_unlock_bh(&xfrm_state_lock);
 }
 EXPORT_SYMBOL(xfrm_state_walk_init);
 
 void xfrm_state_walk_done(struct xfrm_state_walk *walk)
 {
-	struct list_head *prev;
-
-	if (walk->state != NULL) {
-		xfrm_state_put(walk->state);
-		walk->state = NULL;
-	}
-
-	prev = walk->list.prev;
+	spin_lock_bh(&xfrm_state_lock);
 	list_del(&walk->list);
-
-	if (prev != &xfrm_state_walks) {
-		list_entry(prev, struct xfrm_state_walk, list)->genid =
-			walk->genid;
-		return;
-	}
-
-	xfrm_state_walk_completed = walk->genid;
-
-	if (!list_empty(&xfrm_state_gc_leftovers))
-		schedule_work(&xfrm_state_gc_work);
+	spin_unlock_bh(&xfrm_state_lock);
+	walk->state = NULL;
 }
 EXPORT_SYMBOL(xfrm_state_walk_done);