Message ID: 20090218120508.GB4100@elte.hu
State: RFC, archived
Delegated to: David Miller
Ingo Molnar wrote:
> * Patrick McHardy <kaber@trash.net> wrote:
>
>> We need to avoid having a timer that was deleted by one CPU
>> getting re-added by another, but want to avoid taking the
>> conntrack lock for every timer update. The timer-internal
>> locking is enough for this as long as we have a mod_timer
>> variant that forwards a timer, but doesn't activate it in
>> case it isn't active already.
>
> that makes sense - but the implementation is still somewhat
> ugly. How about the one below instead? Not tested.

This seems to fulfill our needs. I also like the mod_timer_pending()
name better than mod_timer_noact().

> One open question is this construct in mod_timer():
>
> +	/*
> +	 * This is a common optimization triggered by the
> +	 * networking code - if the timer is re-modified
> +	 * to be the same thing then just return:
> +	 */
> +	if (timer->expires == expires && timer_pending(timer))
> +		return 1;
>
> We've had this for ages, but it seems rather SMP-unsafe.
> timer_pending(), if used in an unserialized fashion, can be any
> random value in theory - there's no internal serialization here
> anywhere.
>
> We could end up with incorrectly not re-activating a timer in
> mod_timer() for example - have such things never been observed
> in practice?

Yes, it seems racy if done for timers that might get activated.
For forwarding only without activation it seems OK, in that case
the timer_pending check doesn't seem necessary at all.

> So the original patch which added this to mod_timer_noact() was
> racy i think, and we cannot preserve this optimization outside
> of the timer list lock. (we could do it inside of it.)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> * Patrick McHardy <kaber@trash.net> wrote:
>>
>>> We need to avoid having a timer that was deleted by one CPU
>>> getting re-added by another, but want to avoid taking the
>>> conntrack lock for every timer update. The timer-internal
>>> locking is enough for this as long as we have a mod_timer
>>> variant that forwards a timer, but doesn't activate it in
>>> case it isn't active already.
>>
>> that makes sense - but the implementation is still somewhat ugly. How
>> about the one below instead? Not tested.
>
> This seems to fulfill our needs. I also like the mod_timer_pending()
> name better than mod_timer_noact().
>
>> One open question is this construct in mod_timer():
>>
>> +	/*
>> +	 * This is a common optimization triggered by the
>> +	 * networking code - if the timer is re-modified
>> +	 * to be the same thing then just return:
>> +	 */
>> +	if (timer->expires == expires && timer_pending(timer))
>> +		return 1;
>>
>> We've had this for ages, but it seems rather SMP-unsafe.
>> timer_pending(), if used in an unserialized fashion, can be any random
>> value in theory - there's no internal serialization here anywhere.
>>
>> We could end up with incorrectly not re-activating a timer in
>> mod_timer() for example - have such things never been observed in
>> practice?
>
> Yes, it seems racy if done for timers that might get
> activated. For forwarding only without activation it seems OK,
> in that case the timer_pending check doesn't seem necessary at
> all.

ok. To accelerate matters i've committed the new API patch into a
new standalone topic branch: tip:timers/new-apis.

Unless there are objections or test failures, you (or Stephen or
David) can then git-pull it into the networking tree via the Git
coordinates below - and you'll get this single commit in a
surgical manner - no other timer changes are included.

Doing so has the advantage of:

 - You not having to wait a kernel cycle for the API to go
   upstream.

 - You can also push it upstream without waiting for the timer
   tree. (the timer tree and the networking tree will share the
   exact same commit)

 - It will also all merge cleanly with the timer tree in
   linux-next, etc.

I'd suggest to do it in about a week, to make sure any after
effects have trickled down and to make sure the topic has become
append-only. You can ping Thomas and me about testing/review
status then, whenever you want to do the pull.

	Ingo

------------->

You can pull the latest timers/new-apis git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git timers/new-apis

Thanks,

	Ingo

------------------>

Ingo Molnar (1):
      timers: add mod_timer_pending()

 arch/powerpc/platforms/cell/spufs/sched.c  |    2 +-
 drivers/infiniband/hw/ipath/ipath_driver.c |    6 +-
 include/linux/timer.h                      |   22 +-----
 kernel/relay.c                             |    2 +-
 kernel/timer.c                             |  110 ++++++++++++++++++---------
 5 files changed, 80 insertions(+), 62 deletions(-)
Ingo Molnar wrote:
> To accelerate matters i've committed the new API patch into a
> new standalone topic branch: tip:timers/new-apis.
>
> Unless there are objections or test failures, you (or Stephen or
> David) can then git-pull it into the networking tree via the Git
> coordinates below - and you'll get this single commit in a
> surgical manner - no other timer changes are included.
>
> Doing so has the advantage of:
>
>  - You not having to wait a kernel cycle for the API to go
>    upstream.
>
>  - You can also push it upstream without waiting for the timer
>    tree. (the timer tree and the networking tree will share the
>    exact same commit)
>
>  - It will also all merge cleanly with the timer tree in
>    linux-next, etc.
>
> I'd suggest to do it in about a week, to make sure any after
> effects have trickled down and to make sure the topic has become
> append-only. You can ping Thomas and me about testing/review
> status then, whenever you want to do the pull.

Thanks Ingo. I'll wait for Stephen to rebase his patches on
top of your change and the test results and will let you know.
* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> To accelerate matters i've committed the new API patch into a new
>> standalone topic branch: tip:timers/new-apis.
>>
>> Unless there are objections or test failures, you (or Stephen or
>> David) can then git-pull it into the networking tree via the Git
>> coordinates below - and you'll get this single commit in a surgical
>> manner - no other timer changes are included.
>>
>> Doing so has the advantage of:
>>
>>  - You not having to wait a kernel cycle for the API to go
>>    upstream.
>>
>>  - You can also push it upstream without waiting for the timer tree.
>>    (the timer tree and the networking tree will share the exact same
>>    commit)
>>
>>  - It will also all merge cleanly with the timer tree in linux-next,
>>    etc.
>>
>> I'd suggest to do it in about a week, to make sure any after effects
>> have trickled down and to make sure the topic has become append-only.
>> You can ping Thomas and me about testing/review status then, whenever
>> you want to do the pull.
>
> Thanks Ingo. I'll wait for Stephen to rebase his patches on
> top of your change and the test results and will let you know.

Stress-testing here in the last ~2 hours on eight x86 test-boxes
showed no problems so far.

	Ingo
On 02/18, Ingo Molnar wrote:
>
> Based on an idea from Stephen Hemminger: introduce
> mod_timer_pending() which is a mod_timer() offspring
> that is an invariant on already removed timers.

This also can be used by workqueues, see

	http://marc.info/?l=linux-kernel&m=122209752020413

but can't we add another helper? Because,

> +static inline int
> +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  {
>  	struct tvec_base *base, *new_base;
>  	unsigned long flags;
> -	int ret = 0;
> +	int ret;
> +
> +	ret = 0;
>
>  	timer_stats_timer_set_start_info(timer);
>  	BUG_ON(!timer->function);
> @@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
>  	if (timer_pending(timer)) {
>  		detach_timer(timer, 0);
>  		ret = 1;
> +	} else {
> +		if (pending_only)
> +			goto out_unlock;

This can change the base (CPU) of the pending timer.

How about

	int __update_timer(struct timer_list *timer, unsigned long expires)
	{
		struct tvec_base *base;
		unsigned long flags;
		int ret = 0;

		base = lock_timer_base(timer, &flags);
		if (timer_pending(timer)) {
			detach_timer(timer, 0);
			timer->expires = expires;
			internal_add_timer(base, timer);
			ret = 1;
		}
		spin_unlock_irqrestore(&base->lock, flags);

		return ret;
	}

?

Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
which the timer is pending.

Or, perhaps, we can modify __mod_timer() further,

	static inline int
	__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
	{
		struct tvec_base *base;
		unsigned long flags;
		int ret;

		ret = 0;

		timer_stats_timer_set_start_info(timer);
		BUG_ON(!timer->function);

		base = lock_timer_base(timer, &flags);

		if (timer_pending(timer)) {
			detach_timer(timer, 0);
			ret = 1;
		} else if (pending_only)
			goto out_unlock;

		debug_timer_activate(timer);

		if (!pending_only) {
			struct tvec_base *new_base = __get_cpu_var(tvec_bases);

			if (base != new_base) {
				/*
				 * We are trying to schedule the timer on the local CPU.
				 * However we can't change timer's base while it is running,
				 * otherwise del_timer_sync() can't detect that the timer's
				 * handler yet has not finished. This also guarantees that
				 * the timer is serialized wrt itself.
				 */
				if (likely(base->running_timer != timer)) {
					/* See the comment in lock_timer_base() */
					timer_set_base(timer, NULL);
					spin_unlock(&base->lock);
					base = new_base;
					spin_lock(&base->lock);
					timer_set_base(timer, base);
				}
			}
		}

		timer->expires = expires;
		internal_add_timer(base, timer);

	out_unlock:
		spin_unlock_irqrestore(&base->lock, flags);

		return ret;
	}

What do you all think?

Oleg.
* Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/18, Ingo Molnar wrote:
>>
>> Based on an idea from Stephen Hemminger: introduce
>> mod_timer_pending() which is a mod_timer() offspring
>> that is an invariant on already removed timers.
>
> This also can be used by workqueues, see
>
>	http://marc.info/?l=linux-kernel&m=122209752020413
>
> but can't we add another helper? Because,
>
>> +static inline int
>> +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>>  {
>>  	struct tvec_base *base, *new_base;
>>  	unsigned long flags;
>> -	int ret = 0;
>> +	int ret;
>> +
>> +	ret = 0;
>>
>>  	timer_stats_timer_set_start_info(timer);
>>  	BUG_ON(!timer->function);
>> @@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
>>  	if (timer_pending(timer)) {
>>  		detach_timer(timer, 0);
>>  		ret = 1;
>> +	} else {
>> +		if (pending_only)
>> +			goto out_unlock;
>
> This can change the base (CPU) of the pending timer.
>
> How about
>
>	int __update_timer(struct timer_list *timer, unsigned long expires)
>	{
>		struct tvec_base *base;
>		unsigned long flags;
>		int ret = 0;
>
>		base = lock_timer_base(timer, &flags);
>		if (timer_pending(timer)) {
>			detach_timer(timer, 0);
>			timer->expires = expires;
>			internal_add_timer(base, timer);
>			ret = 1;
>		}
>		spin_unlock_irqrestore(&base->lock, flags);
>
>		return ret;
>	}
>
> ?
>
> Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
> which the timer is pending.
>
> Or, perhaps, we can modify __mod_timer() further,
>
>	static inline int
>	__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>	{
>		struct tvec_base *base;
>		unsigned long flags;
>		int ret;
>
>		ret = 0;
>
>		timer_stats_timer_set_start_info(timer);
>		BUG_ON(!timer->function);
>
>		base = lock_timer_base(timer, &flags);
>
>		if (timer_pending(timer)) {
>			detach_timer(timer, 0);
>			ret = 1;
>		} else if (pending_only)
>			goto out_unlock;
>
>		debug_timer_activate(timer);
>
>		if (!pending_only) {
>			struct tvec_base *new_base = __get_cpu_var(tvec_bases);
>
>			if (base != new_base) {
>				/*
>				 * We are trying to schedule the timer on the local CPU.
>				 * However we can't change timer's base while it is running,
>				 * otherwise del_timer_sync() can't detect that the timer's
>				 * handler yet has not finished. This also guarantees that
>				 * the timer is serialized wrt itself.
>				 */
>				if (likely(base->running_timer != timer)) {
>					/* See the comment in lock_timer_base() */
>					timer_set_base(timer, NULL);
>					spin_unlock(&base->lock);
>					base = new_base;
>					spin_lock(&base->lock);
>					timer_set_base(timer, base);
>				}
>			}
>		}
>
>		timer->expires = expires;
>		internal_add_timer(base, timer);
>
>	out_unlock:
>		spin_unlock_irqrestore(&base->lock, flags);
>
>		return ret;
>	}
>
> What do you all think?

if then i'd put it into a separate commit.

I think the auto-migration of all the mod_timer() variants is a
scalability feature: if for example a networking socket's main
user migrates to another CPU, then the timer 'follows' it - even
if the timer never actually expires (which is quite common for
high-speed high-reliability networking transports).

By keeping it on the same CPU we'd allow the timer's and the
task's affinity to differ.

Agreed?

	Ingo
On 02/18, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
>> Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
>> which the timer is pending.
>>
>> Or, perhaps, we can modify __mod_timer() further,
>
> if then i'd put it into a separate commit.
>
> I think the auto-migration of all the mod_timer() variants is a
> scalability feature: if for example a networking socket's main
> user migrates to another CPU, then the timer 'follows' it - even
> if the timer never actually expires (which is quite common for
> high-speed high-reliability networking transports).

OK.

But sometimes it is better (or necessary) to prevent the
migration. Since you have already changed __mod_timer() it
would be ugly to add yet another helper. Perhaps we should
turn "bool pending_only" into "int flags" right now?

This is minor, and perhaps we will never need the
TIMER_DONT_MIGRATE flag. But if we ever need it, then we have
to audit all callers.

Oleg.
* Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/18, Ingo Molnar wrote:
>>
>> * Oleg Nesterov <oleg@redhat.com> wrote:
>>
>>> Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
>>> which the timer is pending.
>>>
>>> Or, perhaps, we can modify __mod_timer() further,
>>
>> if then i'd put it into a separate commit.
>>
>> I think the auto-migration of all the mod_timer() variants is a
>> scalability feature: if for example a networking socket's main
>> user migrates to another CPU, then the timer 'follows' it - even
>> if the timer never actually expires (which is quite common for
>> high-speed high-reliability networking transports).
>
> OK.
>
> But sometimes it is better (or necessary) to prevent the
> migration. Since you have already changed __mod_timer() it
> would be ugly to add yet another helper. Perhaps we should
> turn "bool pending_only" into "int flags" right now?
>
> This is minor, and perhaps we will never need the
> TIMER_DONT_MIGRATE flag. But if we ever need it, then we have
> to audit all callers.

hm, dunno - such unused flags are generally frowned upon,
especially if they influence the code flow in a dynamic way.

In fact i tried to avoid this flag here too - but __mod_timer()
is too small, the flag is used in the middle, and two separate
helpers would have made the code look worse.

	Ingo
Index: linux/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux.orig/arch/powerpc/platforms/cell/spufs/sched.c
+++ linux/arch/powerpc/platforms/cell/spufs/sched.c
@@ -508,7 +508,7 @@ static void __spu_add_to_rq(struct spu_c
 		list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]);
 		set_bit(ctx->prio, spu_prio->bitmap);
 		if (!spu_prio->nr_waiting++)
-			__mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
+			mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
 	}
 }
Index: linux/drivers/infiniband/hw/ipath/ipath_driver.c
===================================================================
--- linux.orig/drivers/infiniband/hw/ipath/ipath_driver.c
+++ linux/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -2715,7 +2715,7 @@ static void ipath_hol_signal_up(struct i
  * to prevent HoL blocking, then start the HoL timer that
  * periodically continues, then stop procs, so they can detect
  * link down if they want, and do something about it.
- * Timer may already be running, so use __mod_timer, not add_timer.
+ * Timer may already be running, so use mod_timer, not add_timer.
  */
 void ipath_hol_down(struct ipath_devdata *dd)
 {
@@ -2724,7 +2724,7 @@ void ipath_hol_down(struct ipath_devdata
 	dd->ipath_hol_next = IPATH_HOL_DOWNCONT;
 	dd->ipath_hol_timer.expires = jiffies +
 		msecs_to_jiffies(ipath_hol_timeout_ms);
-	__mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
+	mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
 }
 
 /*
@@ -2763,7 +2763,7 @@ void ipath_hol_event(unsigned long opaqu
 	else {
 		dd->ipath_hol_timer.expires = jiffies +
 			msecs_to_jiffies(ipath_hol_timeout_ms);
-		__mod_timer(&dd->ipath_hol_timer,
+		mod_timer(&dd->ipath_hol_timer,
 			dd->ipath_hol_timer.expires);
 	}
 }
Index: linux/include/linux/timer.h
===================================================================
--- linux.orig/include/linux/timer.h
+++ linux/include/linux/timer.h
@@ -161,8 +161,8 @@ static inline int timer_pending(const st
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
-extern int __mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
+extern int mod_timer_pending(struct timer_list *timer, unsigned long expires);
 
 /*
  * The jiffies value which is added to now, when there is no timer
@@ -221,25 +221,7 @@ static inline void timer_stats_timer_cle
 }
 #endif
 
-/**
- * add_timer - start a timer
- * @timer: the timer to be added
- *
- * The kernel will do a ->function(->data) callback from the
- * timer interrupt at the ->expires point in the future. The
- * current time is 'jiffies'.
- *
- * The timer's ->expires, ->function (and if the handler uses it, ->data)
- * fields must be set prior calling this function.
- *
- * Timers with an ->expires field in the past will be executed in the next
- * timer tick.
- */
-static inline void add_timer(struct timer_list *timer)
-{
-	BUG_ON(timer_pending(timer));
-	__mod_timer(timer, timer->expires);
-}
+extern void add_timer(struct timer_list *timer);
 
 #ifdef CONFIG_SMP
 extern int try_to_del_timer_sync(struct timer_list *timer);
Index: linux/kernel/relay.c
===================================================================
--- linux.orig/kernel/relay.c
+++ linux/kernel/relay.c
@@ -748,7 +748,7 @@ size_t relay_switch_subbuf(struct rchan_
 			 * from the scheduler (trying to re-grab
 			 * rq->lock), so defer it.
 			 */
-			__mod_timer(&buf->timer, jiffies + 1);
+			mod_timer(&buf->timer, jiffies + 1);
 		}
 
 		old = buf->data;
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -600,11 +600,14 @@ static struct tvec_base *lock_timer_base
 	}
 }
 
-int __mod_timer(struct timer_list *timer, unsigned long expires)
+static inline int
+__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0;
+	int ret;
+
+	ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
 	if (timer_pending(timer)) {
 		detach_timer(timer, 0);
 		ret = 1;
+	} else {
+		if (pending_only)
+			goto out_unlock;
 	}
 
 	debug_timer_activate(timer);
@@ -640,42 +646,28 @@ int __mod_timer(struct timer_list *timer
 
 	timer->expires = expires;
 	internal_add_timer(base, timer);
+
+out_unlock:
 	spin_unlock_irqrestore(&base->lock, flags);
 
 	return ret;
 }
-EXPORT_SYMBOL(__mod_timer);
-
 /**
- * add_timer_on - start a timer on a particular CPU
- * @timer: the timer to be added
- * @cpu: the CPU to start it on
+ * mod_timer_pending - modify a pending timer's timeout
+ * @timer: the pending timer to be modified
+ * @expires: new timeout in jiffies
  *
- * This is not very scalable on SMP. Double adds are not possible.
+ * mod_timer_pending() is the same for pending timers as mod_timer(),
+ * but will not re-activate and modify already deleted timers.
+ *
+ * It is useful for unserialized use of timers.
  */
-void add_timer_on(struct timer_list *timer, int cpu)
+int mod_timer_pending(struct timer_list *timer, unsigned long expires)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
-	unsigned long flags;
-
-	timer_stats_timer_set_start_info(timer);
-	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
-	debug_timer_activate(timer);
-	internal_add_timer(base, timer);
-	/*
-	 * Check whether the other CPU is idle and needs to be
-	 * triggered to reevaluate the timer wheel when nohz is
-	 * active. We are protected against the other CPU fiddling
-	 * with the timer by holding the timer base lock. This also
-	 * makes sure that a CPU on the way to idle can not evaluate
-	 * the timer wheel.
-	 */
-	wake_up_idle_cpu(cpu);
-	spin_unlock_irqrestore(&base->lock, flags);
+	return __mod_timer(timer, expires, true);
 }
+EXPORT_SYMBOL(mod_timer_pending);
 
 /**
  * mod_timer - modify a timer's timeout
@@ -699,9 +691,6 @@ void add_timer_on(struct timer_list *tim
  */
 int mod_timer(struct timer_list *timer, unsigned long expires)
 {
-	BUG_ON(!timer->function);
-
-	timer_stats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -710,12 +699,62 @@ int mod_timer(struct timer_list *timer,
 	if (timer->expires == expires && timer_pending(timer))
 		return 1;
 
-	return __mod_timer(timer, expires);
+	return __mod_timer(timer, expires, false);
 }
-
 EXPORT_SYMBOL(mod_timer);
 
 /**
+ * add_timer - start a timer
+ * @timer: the timer to be added
+ *
+ * The kernel will do a ->function(->data) callback from the
+ * timer interrupt at the ->expires point in the future. The
+ * current time is 'jiffies'.
+ *
+ * The timer's ->expires, ->function (and if the handler uses it, ->data)
+ * fields must be set prior calling this function.
+ *
+ * Timers with an ->expires field in the past will be executed in the next
+ * timer tick.
+ */
+void add_timer(struct timer_list *timer)
+{
+	BUG_ON(timer_pending(timer));
+	mod_timer(timer, timer->expires);
+}
+EXPORT_SYMBOL(add_timer);
+
+/**
+ * add_timer_on - start a timer on a particular CPU
+ * @timer: the timer to be added
+ * @cpu: the CPU to start it on
+ *
+ * This is not very scalable on SMP. Double adds are not possible.
+ */
+void add_timer_on(struct timer_list *timer, int cpu)
+{
+	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	unsigned long flags;
+
+	timer_stats_timer_set_start_info(timer);
+	BUG_ON(timer_pending(timer) || !timer->function);
+	spin_lock_irqsave(&base->lock, flags);
+	timer_set_base(timer, base);
+	debug_timer_activate(timer);
+	internal_add_timer(base, timer);
+	/*
+	 * Check whether the other CPU is idle and needs to be
+	 * triggered to reevaluate the timer wheel when nohz is
+	 * active. We are protected against the other CPU fiddling
+	 * with the timer by holding the timer base lock. This also
+	 * makes sure that a CPU on the way to idle can not evaluate
+	 * the timer wheel.
+	 */
+	wake_up_idle_cpu(cpu);
+	spin_unlock_irqrestore(&base->lock, flags);
+}
+
+/**
  * del_timer - deactive a timer.
  * @timer: the timer to be deactivated
 
@@ -744,7 +783,6 @@ int del_timer(struct timer_list *timer)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(del_timer);
 
 #ifdef CONFIG_SMP
@@ -778,7 +816,6 @@ out:
 
 	return ret;
 }
-
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
 /**
@@ -816,7 +853,6 @@ int del_timer_sync(struct timer_list *ti
 			cpu_relax();
 	}
 }
-
 EXPORT_SYMBOL(del_timer_sync);
 
 #endif
@@ -1314,7 +1350,7 @@ signed long __sched schedule_timeout(sig
 	expire = timeout + jiffies;
 
 	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
-	__mod_timer(&timer, expire);
+	__mod_timer(&timer, expire, false);
 	schedule();
 	del_singleshot_timer_sync(&timer);