| Message ID | 48D88BCC.5030806@iki.fi |
| --- | --- |
| State | Changes Requested, archived |
| Delegated to | David Miller |
On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote:
>
> Something like this. I just compile tested, so I'm not sure if it
> actually works.

It is quite clever :)

> The corresponding put is in _destroy(). Added it as a final thing
> to do so hopefully compiler optimizes tail recursion to iteration.
> (Seemed to do that in my case.)

Yeah, we should open-code this to be absolutely sure that we don't
get killed by a stupid compiler.

However, I think this is still open to the same problem that my patch
had, i.e., if a dumper goes to sleep during the dump we may prevent
entries from being freed indefinitely. Yes, your idea is better in
that we may only withhold a fraction of the entries (depending on the
order of deletion it could be anywhere from none to every entry)
instead of all of them, but fundamentally the same issue is still
there.

Considering the fact that dumps require root privileges, I'm not
convinced yet that this is worth it.

Hmm, could we perhaps go back to your original scheme of keeping
everything on the list and see if we can use RCU to make it lockless
instead?

Cheers,
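For illustration, a minimal sketch of what open-coding the recursion could look like. This is not from the posted patch: it assumes `__xfrm_state_destroy()` is changed to only queue the entry for the gc worker, leaving the release of the pinned successor to the caller, and the helper name `xfrm_state_put_chain()` is made up for this example.

```c
/*
 * Hypothetical helper: drop a reference and walk the chain of pinned
 * successors iteratively, so stack usage stays constant even if the
 * compiler does not turn a tail call into a jump.
 */
static void xfrm_state_put_chain(struct xfrm_state *x)
{
	while (x != NULL) {
		struct xfrm_state *next = NULL;

		if (!atomic_dec_and_test(&x->refcnt))
			break;			/* still referenced elsewhere */

		/* Last reference: pick up the successor pinned by _delete(). */
		if (x->all.next != &xfrm_state_all)
			next = container_of(x->all.next, struct xfrm_state, all);

		__xfrm_state_destroy(x);	/* only queues x for the gc worker */
		x = next;			/* iterate instead of recursing */
	}
}
```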
Herbert Xu wrote:
> On Tue, Sep 23, 2008 at 09:25:16AM +0300, Timo Teräs wrote:
>> Something like this. I just compile tested, so I'm not sure if it
>> actually works.
>
> It is quite clever :)

Thanks :) And actually, we can make _delete() do a normal list_del()
and not hold the next pointer if there are no running dumps. That is
the common case anyway, so then there would be no penalty at all.

>> The corresponding put is in _destroy(). Added it as a final thing
>> to do so hopefully compiler optimizes tail recursion to iteration.
>> (Seemed to do that in my case.)
>
> Yeah we should open code this to be absolutely sure that we don't
> get killed by a stupid compiler.

Right. The patch was just to show the idea. I will do a proper patch
with this and the above-mentioned optimization if you think we should
switch to this until we get the really cool lockless dumping.

> However, I think this is still open to the same problem that my
> patch had, i.e., if a dumper goes to sleep during the dump we may
> prevent entries from being freed indefinitely. Yes your idea
> is better in that we may only withhold say a fraction (depending
> on the order of deletion it could be anywhere from none to every
> entry) of the entries instead of all of them, but fundamentally
> the same issue is still there.
>
> Considering the fact that dumps require root privileges I'm not
> convinced as of yet that this is worth it.

You are right, it can keep memory allocated. In the worst case you end
up keeping all the deleted entries (entries are deleted in the order
they appear in the all list). But I would not be too worried about
that either.

> Hmm, could we perhaps go back to your original scheme of keeping
> everything on the list and see if we can use RCU to make it lockless
> instead?

That would be ideal, of course.

- Timo
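A rough sketch of the optimization Timo describes at the top of his reply: only pin the successor when a dump is actually in progress. The counter name and the helper are hypothetical, not from a posted patch; they assume the walk-init/done paths maintain the counter under xfrm_state_lock.

```c
/*
 * Hypothetical counter of in-progress dumps; incremented in
 * xfrm_state_walk_init() and decremented in xfrm_state_walk_done(),
 * both under xfrm_state_lock.
 */
static unsigned int xfrm_state_walkers;

/* Unlink x->all from xfrm_state_all; caller holds xfrm_state_lock. */
static void xfrm_state_unlink_all(struct xfrm_state *x)
{
	if (xfrm_state_walkers == 0) {
		/* Common case: no dump running, a plain unlink is enough. */
		list_del(&x->all);
		return;
	}

	/* A dump may be parked on x: keep x->all.next valid and pin it. */
	list_del_rcu(&x->all);
	if (x->all.next != &xfrm_state_all)
		xfrm_state_hold(container_of(x->all.next,
					     struct xfrm_state, all));
}
```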
Herbert Xu wrote:
> However, I think this is still open to the same problem that my
> patch had, i.e., if a dumper goes to sleep during the dump we may
> prevent entries from being freed indefinitely. Yes your idea
> is better in that we may only withhold say a fraction (depending
> on the order of deletion it could be anywhere from none to every
> entry) of the entries instead of all of them, but fundamentally
> the same issue is still there.
>
> Considering the fact that dumps require root privileges I'm not
> convinced as of yet that this is worth it.
>
> Hmm, could we perhaps go back to your original scheme of keeping
> everything on the list and see if we can use RCU to make it lockless
> instead?

How about this: we keep a list of walks, as your latest patch does.
When a walk is interrupted, we do not hold the entry; we just store
the pointer in the walk iterator. When the entry is deleted from the
lists, we go through the walk contexts, and if someone is pointing to
the entry being deleted, we just update it to the next one.

The list of walkers can be protected by xfrm_state_lock. That needs to
be taken anyway for accessing and modifying the other lists. Most of
the time there are no walks active, so the penalty for _destroy() is
minimal.

This would make it possible to reclaim the deleted entries right away.

Does this sound better?

Cheers,
  Timo
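A sketch of what that scheme could look like; the list name, the walk fields shown, and the fixup helper are illustrative only, not code from a posted patch.

```c
/* Hypothetical global list of in-progress walks, under xfrm_state_lock. */
static LIST_HEAD(xfrm_state_walkers);

struct xfrm_state_walk {
	struct list_head	list;	/* link on xfrm_state_walkers */
	struct xfrm_state	*state;	/* resume point; no reference held */
	int			count;
	u8			proto;
};

/*
 * Called from __xfrm_state_delete() with xfrm_state_lock held: any walk
 * parked on the entry being removed is advanced to its successor, so the
 * entry can be reclaimed immediately.
 */
static void xfrm_state_walk_fixup(struct xfrm_state *x)
{
	struct xfrm_state_walk *walk;

	list_for_each_entry(walk, &xfrm_state_walkers, list) {
		if (walk->state != x)
			continue;
		if (x->all.next != &xfrm_state_all)
			walk->state = container_of(x->all.next,
						   struct xfrm_state, all);
		else
			walk->state = NULL;	/* walk had reached the tail */
	}
}
```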
On Tue, Sep 23, 2008 at 12:39:51PM +0300, Timo Teräs wrote:
>
> This would make it possible to reclaim the deleted entries right
> away.
>
> Does this sound better?

Yep, this sounds pretty good to me.

Thanks,
```diff
diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index cbba776..9ff1b54 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -220,7 +220,7 @@ struct netlink_callback
 	int		(*dump)(struct sk_buff * skb, struct netlink_callback *cb);
 	int		(*done)(struct netlink_callback *cb);
 	int		family;
-	long		args[7];
+	long		args[6];
 };
 
 struct netlink_notify
diff --git a/include/net/xfrm.h b/include/net/xfrm.h
index 48630b2..d5cba7b 100644
--- a/include/net/xfrm.h
+++ b/include/net/xfrm.h
@@ -122,7 +122,7 @@ struct xfrm_state
 {
 	struct list_head	all;
 	union {
-		struct list_head	gclist;
+		struct hlist_node	gclist;
 		struct hlist_node	bydst;
 	};
 	struct hlist_node	bysrc;
@@ -1246,8 +1246,6 @@ struct xfrm6_tunnel {
 };
 
 struct xfrm_state_walk {
-	struct list_head list;
-	unsigned long genid;
 	struct xfrm_state *state;
 	int count;
 	u8 proto;
diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 053970e..f45f006 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -59,14 +59,6 @@ static unsigned int xfrm_state_hashmax __read_mostly = 1 * 1024 * 1024;
 static unsigned int xfrm_state_num;
 static unsigned int xfrm_state_genid;
 
-/* Counter indicating ongoing walk, protected by xfrm_state_lock. */
-static unsigned long xfrm_state_walk_ongoing;
-/* Counter indicating walk completion, protected by xfrm_cfg_mutex. */
-static unsigned long xfrm_state_walk_completed;
-
-/* List of outstanding state walks used to set the completed counter. */
-static LIST_HEAD(xfrm_state_walks);
-
 static struct xfrm_state_afinfo *xfrm_state_get_afinfo(unsigned int family);
 static void xfrm_state_put_afinfo(struct xfrm_state_afinfo *afinfo);
 
@@ -199,8 +191,7 @@ static DEFINE_RWLOCK(xfrm_state_afinfo_lock);
 static struct xfrm_state_afinfo *xfrm_state_afinfo[NPROTO];
 
 static struct work_struct xfrm_state_gc_work;
-static LIST_HEAD(xfrm_state_gc_leftovers);
-static LIST_HEAD(xfrm_state_gc_list);
+static HLIST_HEAD(xfrm_state_gc_list);
 static DEFINE_SPINLOCK(xfrm_state_gc_lock);
 
 int __xfrm_state_delete(struct xfrm_state *x);
@@ -412,23 +403,17 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 
 static void xfrm_state_gc_task(struct work_struct *data)
 {
-	struct xfrm_state *x, *tmp;
-	unsigned long completed;
+	struct xfrm_state *x;
+	struct hlist_node *entry, *tmp;
+	struct hlist_head gc_list;
 
-	mutex_lock(&xfrm_cfg_mutex);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_splice_tail_init(&xfrm_state_gc_list, &xfrm_state_gc_leftovers);
+	gc_list.first = xfrm_state_gc_list.first;
+	INIT_HLIST_HEAD(&xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 
-	completed = xfrm_state_walk_completed;
-	mutex_unlock(&xfrm_cfg_mutex);
-
-	list_for_each_entry_safe(x, tmp, &xfrm_state_gc_leftovers, gclist) {
-		if ((long)(x->lastused - completed) > 0)
-			break;
-		list_del(&x->gclist);
+	hlist_for_each_entry_safe(x, entry, tmp, &gc_list, gclist)
 		xfrm_state_gc_destroy(x);
-	}
 
 	wake_up(&km_waitq);
 }
@@ -553,12 +538,19 @@ EXPORT_SYMBOL(xfrm_state_alloc);
 
 void __xfrm_state_destroy(struct xfrm_state *x)
 {
+	struct xfrm_state *next = NULL;
+
 	WARN_ON(x->km.state != XFRM_STATE_DEAD);
 
+	if (x->all.next != &xfrm_state_all)
+		next = container_of(x->all.next, struct xfrm_state, all);
 	spin_lock_bh(&xfrm_state_gc_lock);
-	list_add_tail(&x->gclist, &xfrm_state_gc_list);
+	hlist_add_head(&x->gclist, &xfrm_state_gc_list);
 	spin_unlock_bh(&xfrm_state_gc_lock);
 	schedule_work(&xfrm_state_gc_work);
+
+	if (next != NULL)
+		xfrm_state_put(next);
 }
 EXPORT_SYMBOL(__xfrm_state_destroy);
 
@@ -569,8 +561,10 @@ int __xfrm_state_delete(struct xfrm_state *x)
 	if (x->km.state != XFRM_STATE_DEAD) {
 		x->km.state = XFRM_STATE_DEAD;
 		spin_lock(&xfrm_state_lock);
-		x->lastused = xfrm_state_walk_ongoing;
 		list_del_rcu(&x->all);
+		if (x->all.next != &xfrm_state_all)
+			xfrm_state_hold(container_of(x->all.next,
+						     struct xfrm_state, all));
 		hlist_del(&x->bydst);
 		hlist_del(&x->bysrc);
 		if (x->id.spi)
@@ -1612,33 +1606,15 @@ void xfrm_state_walk_init(struct xfrm_state_walk *walk, u8 proto)
 	walk->proto = proto;
 	walk->state = NULL;
 	walk->count = 0;
-	list_add_tail(&walk->list, &xfrm_state_walks);
-	walk->genid = ++xfrm_state_walk_ongoing;
 }
 EXPORT_SYMBOL(xfrm_state_walk_init);
 
 void xfrm_state_walk_done(struct xfrm_state_walk *walk)
 {
-	struct list_head *prev;
-
 	if (walk->state != NULL) {
 		xfrm_state_put(walk->state);
 		walk->state = NULL;
 	}
-
-	prev = walk->list.prev;
-	list_del(&walk->list);
-
-	if (prev != &xfrm_state_walks) {
-		list_entry(prev, struct xfrm_state_walk, list)->genid =
-			walk->genid;
-		return;
-	}
-
-	xfrm_state_walk_completed = walk->genid;
-
-	if (!list_empty(&xfrm_state_gc_leftovers))
-		schedule_work(&xfrm_state_gc_work);
 }
 EXPORT_SYMBOL(xfrm_state_walk_done);
```
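For context, a hedged sketch (not part of the patch above) of how a dump could resume under this scheme: walk->state may point at an entry that has since been deleted, but its ->all.next is still valid because __xfrm_state_delete() pinned the successor, so the loop simply skips dead entries until it reaches a live one. The function name and the callback signature are illustrative only.

```c
static int xfrm_state_walk_resume(struct xfrm_state_walk *walk,
				  int (*func)(struct xfrm_state *, int, void *),
				  void *data)
{
	struct xfrm_state *x = walk->state;
	int err = 0;

	spin_lock_bh(&xfrm_state_lock);
	list_for_each_entry_from(x, &xfrm_state_all, all) {
		if (x->km.state == XFRM_STATE_DEAD)
			continue;	/* kept only so ->all.next stays valid */
		if (!xfrm_id_proto_match(x->id.proto, walk->proto))
			continue;
		err = func(x, ++walk->count, data);
		if (err)
			break;
	}
	spin_unlock_bh(&xfrm_state_lock);

	return err;
}
```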