diff mbox

timers: add mod_timer_pending()

Message ID 20090218120508.GB4100@elte.hu
State RFC, archived
Delegated to: David Miller
Headers show

Commit Message

Ingo Molnar Feb. 18, 2009, 12:05 p.m. UTC
* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>>> -extern int __mod_timer(struct timer_list *timer, unsigned long expires);
>>> +extern int __mod_timer(struct timer_list *timer, unsigned long expires, int activate);
>>
>> This is not really acceptable, it slows down every single add_timer() 
>> and mod_timer() call in the kernel with a flag that has one specific 
>> value in all but your case. There's more than 2000 such callsites in 
>> the kernel.
>>
>> Why dont you use something like this instead:
>>
>> 	if (del_timer(timer))
>> 		add_timer(timer);
>
> We need to avoid having a timer that was deleted by one CPU
> getting re-added by another, but want to avoid taking the
> conntrack lock for every timer update. The timer-internal
> locking is enough for this as long as we have a mod_timer
> variant that forwards a timer, but doesn't activate it in
> case it isn't active already.

that makes sense - but the implementation is still somewhat 
ugly. How about the one below instead? Not tested.

One open question is this construct in mod_timer():

+	/*
+	 * This is a common optimization triggered by the
+	 * networking code - if the timer is re-modified
+	 * to be the same thing then just return:
+	 */
+	if (timer->expires == expires && timer_pending(timer))
+		return 1;

We've had this for ages, but it seems rather SMP-unsafe. 
timer_pending(), if used in an unserialized fashion, can be any 
random value in theory - there's no internal serialization here 
anywhere.

We could end up with incorrectly not re-activating a timer in 
mod_timer() for example - have such things never been observed 
in practice?

So the original patch which added this to mod_timer_noact() was 
racy i think, and we cannot preserve this optimization outside 
of the timer list lock. (we could do it inside of it.)

	Ingo

------------------->
Subject: timers: add mod_timer_pending()
From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 18 Feb 2009 12:23:29 +0100

Impact: new timer API

Based on an idea from Stephen Hemminger: introduce
 mod_timer_pending() which is a mod_timer() offspring
that is an invariant on already removed timers.

(regular mod_timer() re-activates non-pending timers.)

This is useful for the networking code in that it can
allow unserialized mod_timer_pending() timer-forwarding
calls, but a single del_timer*() will stop the timer
from being reactivated again.

Also while at it:

- optimize the regular mod_timer() path some more, the
  timer-stat and a debug check was needlessly duplicated
  in __mod_timer().

- make the exports come straight after the function, as
  most other exports in timer.c already did.

- eliminate __mod_timer() as an external API, change the
  users to mod_timer().

The regular mod_timer() code path is not impacted
significantly, due to inlining optimizations and due to
the simplifications - but performance testing would be nice
nevertheless.

Based-on-patch-from: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/powerpc/platforms/cell/spufs/sched.c  |    2 
 drivers/infiniband/hw/ipath/ipath_driver.c |    6 -
 include/linux/timer.h                      |   22 -----
 kernel/relay.c                             |    2 
 kernel/timer.c                             |  110 +++++++++++++++++++----------
 5 files changed, 80 insertions(+), 62 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Patrick McHardy Feb. 18, 2009, 12:33 p.m. UTC | #1
Ingo Molnar wrote:
> * Patrick McHardy <kaber@trash.net> wrote:
> 
>> We need to avoid having a timer that was deleted by one CPU
>> getting re-added by another, but want to avoid taking the
>> conntrack lock for every timer update. The timer-internal
>> locking is enough for this as long as we have a mod_timer
>> variant that forwards a timer, but doesn't activate it in
>> case it isn't active already.
> 
> that makes sense - but the implementation is still somewhat 
> ugly. How about the one below instead? Not tested.

This seems to fulfill our needs. I also like the mod_timer_pending()
name better than mod_timer_noact().

> One open question is this construct in mod_timer():
> 
> +	/*
> +	 * This is a common optimization triggered by the
> +	 * networking code - if the timer is re-modified
> +	 * to be the same thing then just return:
> +	 */
> +	if (timer->expires == expires && timer_pending(timer))
> +		return 1;
> 
> We've had this for ages, but it seems rather SMP-unsafe. 
> timer_pending(), if used in an unserialized fashion, can be any 
> random value in theory - there's no internal serialization here 
> anywhere.
> 
> We could end up with incorrectly not re-activating a timer in 
> mod_timer() for example - have such things never been observed 
> in practice?

Yes, it seems racy if done for timers that might get activated.
For forwarding only without activation it seems OK, in that case
the timer_pending check doesn't seem necessary at all.

> So the original patch which added this to mod_timer_noact() was 
> racy i think, and we cannot preserve this optimization outside 
> of the timer list lock. (we could do it inside of it.)
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Feb. 18, 2009, 12:50 p.m. UTC | #2
* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> * Patrick McHardy <kaber@trash.net> wrote:
>>
>>> We need to avoid having a timer that was deleted by one CPU
>>> getting re-added by another, but want to avoid taking the
>>> conntrack lock for every timer update. The timer-internal
>>> locking is enough for this as long as we have a mod_timer
>>> variant that forwards a timer, but doesn't activate it in
>>> case it isn't active already.
>>
>> that makes sense - but the implementation is still somewhat ugly. How 
>> about the one below instead? Not tested.
>
> This seems to fulfill our needs. I also like the mod_timer_pending()
> name better than mod_timer_noact().
>
>> One open question is this construct in mod_timer():
>>
>> +	/*
>> +	 * This is a common optimization triggered by the
>> +	 * networking code - if the timer is re-modified
>> +	 * to be the same thing then just return:
>> +	 */
>> +	if (timer->expires == expires && timer_pending(timer))
>> +		return 1;
>>
>> We've had this for ages, but it seems rather SMP-unsafe.  
>> timer_pending(), if used in an unserialized fashion, can be any random 
>> value in theory - there's no internal serialization here anywhere.
>>
>> We could end up with incorrectly not re-activating a timer in  
>> mod_timer() for example - have such things never been observed in 
>> practice?
>
> Yes, it seems racy if done for timers that might get 
> activated. For forwarding only without activation it seems OK, 
> in that case the timer_pending check doesn't seem necessary at 
> all.

ok.

To accelerate matters i've committed the new API patch into a 
new standalone topic branch: tip:timers/new-apis.

Unless there are objections or test failures, you (or Stephen or 
David) can then git-pull it into the networking tree via the Git 
coordinates below - and you'll get this single commit in a 
surgical manner - no other timer changes are included.

Doing so has the advantage of:

- You not having to wait a kernel cycle for the API to go
  upstream.

- You can also push it upstream without waiting for the timer 
  tree. (the timer tree and the networking tree will share the 
  exact same commit)

- It will also all merge cleanly with the timer tree in 
  linux-next, etc.

I'd suggest to do it in about a week, to make sure any after 
effects have trickled down and to make sure the topic has become 
append-only. You can ping Thomas and me about testing/review 
status then, whenever you want to do the pull.

	Ingo

------------->

You can pull the latest timers/new-apis git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git timers/new-apis

 Thanks,

	Ingo

------------------>
Ingo Molnar (1):
      timers: add mod_timer_pending()


 arch/powerpc/platforms/cell/spufs/sched.c  |    2 +-
 drivers/infiniband/hw/ipath/ipath_driver.c |    6 +-
 include/linux/timer.h                      |   22 +-----
 kernel/relay.c                             |    2 +-
 kernel/timer.c                             |  110 ++++++++++++++++++---------
 5 files changed, 80 insertions(+), 62 deletions(-)

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Patrick McHardy Feb. 18, 2009, 12:54 p.m. UTC | #3
Ingo Molnar wrote:
> To accelerate matters i've committed the new API patch into a 
> new standalone topic branch: tip:timers/new-apis.
> 
> Unless there are objections or test failures, you (or Stephen or 
> David) can then git-pull it into the networking tree via the Git 
> coordinates below - and you'll get this single commit in a 
> surgical manner - no other timer changes are included.
> 
> Doing so has the advantage of:
> 
> - You not having to wait a kernel cycle for the API to go
>   upstream.
> 
> - You can also push it upstream without waiting for the timer 
>   tree. (the timer tree and the networking tree will share the 
>   exact same commit)
> 
> - It will also all merge cleanly with the timer tree in 
>   linux-next, etc.
> 
> I'd suggest to do it in about a week, to make sure any after 
> effects have trickled down and to make sure the topic has become 
> append-only. You can ping Thomas and me about testing/review 
> status then, whenever you want to do the pull.

Thanks Ingo. I'll wait for Stephen to rebase his patches on
top of your change and the test results and will let you know.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Feb. 18, 2009, 1:47 p.m. UTC | #4
* Patrick McHardy <kaber@trash.net> wrote:

> Ingo Molnar wrote:
>> To accelerate matters i've committed the new API patch into a new 
>> standalone topic branch: tip:timers/new-apis.
>>
>> Unless there are objections or test failures, you (or Stephen or  
>> David) can then git-pull it into the networking tree via the Git  
>> coordinates below - and you'll get this single commit in a surgical 
>> manner - no other timer changes are included.
>>
>> Doing so has the advantage of:
>>
>> - You not having to wait a kernel cycle for the API to go
>>   upstream.
>>
>> - You can also push it upstream without waiting for the timer   tree. 
>> (the timer tree and the networking tree will share the   exact same 
>> commit)
>>
>> - It will also all merge cleanly with the timer tree in   linux-next, 
>> etc.
>>
>> I'd suggest to do it in about a week, to make sure any after effects 
>> have trickled down and to make sure the topic has become append-only. 
>> You can ping Thomas and me about testing/review status then, whenever 
>> you want to do the pull.
>
> Thanks Ingo. I'll wait for Stephen to rebase his patches on 
> top of your change and the test results and will let you know.

Stress-testing here in the last ~2 hours on eight x86 test-boxes 
showed no problems so far.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Nesterov Feb. 18, 2009, 5 p.m. UTC | #5
On 02/18, Ingo Molnar wrote:
>
> Based on an idea from Stephen Hemminger: introduce
>  mod_timer_pending() which is a mod_timer() offspring
> that is an invariant on already removed timers.

This also can be used by workqueues, see

	http://marc.info/?l=linux-kernel&m=122209752020413

but can't we add another helper? Because,

> +static inline int
> +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  {
>  	struct tvec_base *base, *new_base;
>  	unsigned long flags;
> -	int ret = 0;
> +	int ret;
> +
> +	ret = 0;
>
>  	timer_stats_timer_set_start_info(timer);
>  	BUG_ON(!timer->function);
> @@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
>  	if (timer_pending(timer)) {
>  		detach_timer(timer, 0);
>  		ret = 1;
> +	} else {
> +		if (pending_only)
> +			goto out_unlock;

This can change the base (CPU) of the pending timer.

How about

	int __update_timer(struct timer_list *timer, unsigned long expires)
	{
		struct tvec_base *base;
		unsigned long flags;
		int ret = 0;

		base = lock_timer_base(timer, &flags);
		if (timer_pending(timer)) {
			detach_timer(timer, 0);
			timer->expires = expires;
			internal_add_timer(base, timer);
			ret = 1;
		}
		spin_unlock_irqrestore(&base->lock, flags);

		return ret;
	}

?

Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
which the timer is pending.

Or, perhaps, we can modify __mod_timer() further,

	static inline int
	__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
	{
		struct tvec_base *base;
		unsigned long flags;
		int ret;

		ret = 0;

		timer_stats_timer_set_start_info(timer);
		BUG_ON(!timer->function);

		base = lock_timer_base(timer, &flags);

		if (timer_pending(timer)) {
			detach_timer(timer, 0);
			ret = 1;
		} else if (pending_only)
			goto out_unlock;
		}

		debug_timer_activate(timer);

		if (!pending_only) {
			struct tvec_base *new_base = __get_cpu_var(tvec_bases);

			if (base != new_base) {
				/*
				 * We are trying to schedule the timer on the local CPU.
				 * However we can't change timer's base while it is running,
				 * otherwise del_timer_sync() can't detect that the timer's
				 * handler yet has not finished. This also guarantees that
				 * the timer is serialized wrt itself.
				 */
				if (likely(base->running_timer != timer)) {
					/* See the comment in lock_timer_base() */
					timer_set_base(timer, NULL);
					spin_unlock(&base->lock);
					base = new_base;
					spin_lock(&base->lock);
					timer_set_base(timer, base);
				}
			}
		}

		timer->expires = expires;
		internal_add_timer(base, timer);

	out_unlock:
		spin_unlock_irqrestore(&base->lock, flags);

		return ret;
	}

What do you all think?

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Feb. 18, 2009, 6:23 p.m. UTC | #6
* Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/18, Ingo Molnar wrote:
> >
> > Based on an idea from Stephen Hemminger: introduce
> >  mod_timer_pending() which is a mod_timer() offspring
> > that is an invariant on already removed timers.
> 
> This also can be used by workqueues, see
> 
> 	http://marc.info/?l=linux-kernel&m=122209752020413
> 
> but can't we add another helper? Because,
> 
> > +static inline int
> > +__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
> >  {
> >  	struct tvec_base *base, *new_base;
> >  	unsigned long flags;
> > -	int ret = 0;
> > +	int ret;
> > +
> > +	ret = 0;
> >
> >  	timer_stats_timer_set_start_info(timer);
> >  	BUG_ON(!timer->function);
> > @@ -614,6 +617,9 @@ int __mod_timer(struct timer_list *timer
> >  	if (timer_pending(timer)) {
> >  		detach_timer(timer, 0);
> >  		ret = 1;
> > +	} else {
> > +		if (pending_only)
> > +			goto out_unlock;
> 
> This can change the base (CPU) of the pending timer.
> 
> How about
> 
> 	int __update_timer(struct timer_list *timer, unsigned long expires)
> 	{
> 		struct tvec_base *base;
> 		unsigned long flags;
> 		int ret = 0;
> 
> 		base = lock_timer_base(timer, &flags);
> 		if (timer_pending(timer)) {
> 			detach_timer(timer, 0);
> 			timer->expires = expires;
> 			internal_add_timer(base, timer);
> 			ret = 1;
> 		}
> 		spin_unlock_irqrestore(&base->lock, flags);
> 
> 		return ret;
> 	}
> 
> ?
> 
> Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
> which the timer is pending.
> 
> Or, perhaps, we can modify __mod_timer() further,
> 
> 	static inline int
> 	__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
> 	{
> 		struct tvec_base *base;
> 		unsigned long flags;
> 		int ret;
> 
> 		ret = 0;
> 
> 		timer_stats_timer_set_start_info(timer);
> 		BUG_ON(!timer->function);
> 
> 		base = lock_timer_base(timer, &flags);
> 
> 		if (timer_pending(timer)) {
> 			detach_timer(timer, 0);
> 			ret = 1;
> 		} else if (pending_only)
> 			goto out_unlock;
> 		}
> 
> 		debug_timer_activate(timer);
> 
> 		if (!pending_only) {
> 			struct tvec_base *new_base = __get_cpu_var(tvec_bases);
> 
> 			if (base != new_base) {
> 				/*
> 				 * We are trying to schedule the timer on the local CPU.
> 				 * However we can't change timer's base while it is running,
> 				 * otherwise del_timer_sync() can't detect that the timer's
> 				 * handler yet has not finished. This also guarantees that
> 				 * the timer is serialized wrt itself.
> 				 */
> 				if (likely(base->running_timer != timer)) {
> 					/* See the comment in lock_timer_base() */
> 					timer_set_base(timer, NULL);
> 					spin_unlock(&base->lock);
> 					base = new_base;
> 					spin_lock(&base->lock);
> 					timer_set_base(timer, base);
> 				}
> 			}
> 		}
> 
> 		timer->expires = expires;
> 		internal_add_timer(base, timer);
> 
> 	out_unlock:
> 		spin_unlock_irqrestore(&base->lock, flags);
> 
> 		return ret;
> 	}
> 
> What do you all think?

if then i'd put it into a separate commit.

I think the auto-migration of all the mod_timer() variants is a 
scalability feature: if for example a networking socket's main 
user migrates to another CPU, then the timer 'follows' it - even 
if the timer never actually expires (which is quite common for 
high-speed high-reliability networking transports).

By keeping it on the same CPU we'd allow the timer's and the 
task's affinity to differ.

Agreed?

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Oleg Nesterov Feb. 18, 2009, 6:58 p.m. UTC | #7
On 02/18, Ingo Molnar wrote:
>
> * Oleg Nesterov <oleg@redhat.com> wrote:
>
> > Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
> > which the timer is pending.
> >
> > Or, perhaps, we can modify __mod_timer() further,
>
> if then i'd put it into a separate commit.
>
> I think the auto-migration of all the mod_timer() variants is a
> scalability feature: if for example a networking socket's main
> user migrates to another CPU, then the timer 'follows' it - even
> if the timer never actually expires (which is quite common for
> high-speed high-reliability networking transports).

OK.

But sometimes it is better (or necessary) to prevent the migration.
Since you already are changed __mod_timer() it would be ugly to
add yet another helper. Perhaps we should turn "bool pending_only"
into "int flags" right now?

This is minor, and perhaps we will never need the TIMER_DONT_MIGRATE
flag. But if ever need, then we have to audit all callers.

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Ingo Molnar Feb. 18, 2009, 7:24 p.m. UTC | #8
* Oleg Nesterov <oleg@redhat.com> wrote:

> On 02/18, Ingo Molnar wrote:
> >
> > * Oleg Nesterov <oleg@redhat.com> wrote:
> >
> > > Unlike __mod_timer(..., bool pending_only), it preserves the CPU on
> > > which the timer is pending.
> > >
> > > Or, perhaps, we can modify __mod_timer() further,
> >
> > if then i'd put it into a separate commit.
> >
> > I think the auto-migration of all the mod_timer() variants is a
> > scalability feature: if for example a networking socket's main
> > user migrates to another CPU, then the timer 'follows' it - even
> > if the timer never actually expires (which is quite common for
> > high-speed high-reliability networking transports).
> 
> OK.
> 
> But sometimes it is better (or necessary) to prevent the 
> migration. Since you already are changed __mod_timer() it 
> would be ugly to add yet another helper. Perhaps we should 
> turn "bool pending_only" into "int flags" right now?
> 
> This is minor, and perhaps we will never need the 
> TIMER_DONT_MIGRATE flag. But if ever need, then we have to 
> audit all callers.

hm, dunno - such unused flags are generally frowned upon, 
especially if they influence the code flow in a dynamic way. In 
fact i tried to avoid this flag here too - but __mod_timer() is 
too small, the flag is used in the middle, and two separate 
helpers would have made the code look worse.

	Ingo
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux/arch/powerpc/platforms/cell/spufs/sched.c
===================================================================
--- linux.orig/arch/powerpc/platforms/cell/spufs/sched.c
+++ linux/arch/powerpc/platforms/cell/spufs/sched.c
@@ -508,7 +508,7 @@  static void __spu_add_to_rq(struct spu_c
 		list_add_tail(&ctx->rq, &spu_prio->runq[ctx->prio]);
 		set_bit(ctx->prio, spu_prio->bitmap);
 		if (!spu_prio->nr_waiting++)
-			__mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
+			mod_timer(&spusched_timer, jiffies + SPUSCHED_TICK);
 	}
 }
 
Index: linux/drivers/infiniband/hw/ipath/ipath_driver.c
===================================================================
--- linux.orig/drivers/infiniband/hw/ipath/ipath_driver.c
+++ linux/drivers/infiniband/hw/ipath/ipath_driver.c
@@ -2715,7 +2715,7 @@  static void ipath_hol_signal_up(struct i
  * to prevent HoL blocking, then start the HoL timer that
  * periodically continues, then stop procs, so they can detect
  * link down if they want, and do something about it.
- * Timer may already be running, so use __mod_timer, not add_timer.
+ * Timer may already be running, so use mod_timer, not add_timer.
  */
 void ipath_hol_down(struct ipath_devdata *dd)
 {
@@ -2724,7 +2724,7 @@  void ipath_hol_down(struct ipath_devdata
 	dd->ipath_hol_next = IPATH_HOL_DOWNCONT;
 	dd->ipath_hol_timer.expires = jiffies +
 		msecs_to_jiffies(ipath_hol_timeout_ms);
-	__mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
+	mod_timer(&dd->ipath_hol_timer, dd->ipath_hol_timer.expires);
 }
 
 /*
@@ -2763,7 +2763,7 @@  void ipath_hol_event(unsigned long opaqu
 	else {
 		dd->ipath_hol_timer.expires = jiffies +
 			msecs_to_jiffies(ipath_hol_timeout_ms);
-		__mod_timer(&dd->ipath_hol_timer,
+		mod_timer(&dd->ipath_hol_timer,
 			dd->ipath_hol_timer.expires);
 	}
 }
Index: linux/include/linux/timer.h
===================================================================
--- linux.orig/include/linux/timer.h
+++ linux/include/linux/timer.h
@@ -161,8 +161,8 @@  static inline int timer_pending(const st
 
 extern void add_timer_on(struct timer_list *timer, int cpu);
 extern int del_timer(struct timer_list * timer);
-extern int __mod_timer(struct timer_list *timer, unsigned long expires);
 extern int mod_timer(struct timer_list *timer, unsigned long expires);
+extern int mod_timer_pending(struct timer_list *timer, unsigned long expires);
 
 /*
  * The jiffies value which is added to now, when there is no timer
@@ -221,25 +221,7 @@  static inline void timer_stats_timer_cle
 }
 #endif
 
-/**
- * add_timer - start a timer
- * @timer: the timer to be added
- *
- * The kernel will do a ->function(->data) callback from the
- * timer interrupt at the ->expires point in the future. The
- * current time is 'jiffies'.
- *
- * The timer's ->expires, ->function (and if the handler uses it, ->data)
- * fields must be set prior calling this function.
- *
- * Timers with an ->expires field in the past will be executed in the next
- * timer tick.
- */
-static inline void add_timer(struct timer_list *timer)
-{
-	BUG_ON(timer_pending(timer));
-	__mod_timer(timer, timer->expires);
-}
+extern void add_timer(struct timer_list *timer);
 
 #ifdef CONFIG_SMP
   extern int try_to_del_timer_sync(struct timer_list *timer);
Index: linux/kernel/relay.c
===================================================================
--- linux.orig/kernel/relay.c
+++ linux/kernel/relay.c
@@ -748,7 +748,7 @@  size_t relay_switch_subbuf(struct rchan_
 			 * from the scheduler (trying to re-grab
 			 * rq->lock), so defer it.
 			 */
-			__mod_timer(&buf->timer, jiffies + 1);
+			mod_timer(&buf->timer, jiffies + 1);
 	}
 
 	old = buf->data;
Index: linux/kernel/timer.c
===================================================================
--- linux.orig/kernel/timer.c
+++ linux/kernel/timer.c
@@ -600,11 +600,14 @@  static struct tvec_base *lock_timer_base
 	}
 }
 
-int __mod_timer(struct timer_list *timer, unsigned long expires)
+static inline int
+__mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 {
 	struct tvec_base *base, *new_base;
 	unsigned long flags;
-	int ret = 0;
+	int ret;
+
+	ret = 0;
 
 	timer_stats_timer_set_start_info(timer);
 	BUG_ON(!timer->function);
@@ -614,6 +617,9 @@  int __mod_timer(struct timer_list *timer
 	if (timer_pending(timer)) {
 		detach_timer(timer, 0);
 		ret = 1;
+	} else {
+		if (pending_only)
+			goto out_unlock;
 	}
 
 	debug_timer_activate(timer);
@@ -640,42 +646,28 @@  int __mod_timer(struct timer_list *timer
 
 	timer->expires = expires;
 	internal_add_timer(base, timer);
+
+out_unlock:
 	spin_unlock_irqrestore(&base->lock, flags);
 
 	return ret;
 }
 
-EXPORT_SYMBOL(__mod_timer);
-
 /**
- * add_timer_on - start a timer on a particular CPU
- * @timer: the timer to be added
- * @cpu: the CPU to start it on
+ * mod_timer_pending - modify a pending timer's timeout
+ * @timer: the pending timer to be modified
+ * @expires: new timeout in jiffies
  *
- * This is not very scalable on SMP. Double adds are not possible.
+ * mod_timer_pending() is the same for pending timers as mod_timer(),
+ * but will not re-activate and modify already deleted timers.
+ *
+ * It is useful for unserialized use of timers.
  */
-void add_timer_on(struct timer_list *timer, int cpu)
+int mod_timer_pending(struct timer_list *timer, unsigned long expires)
 {
-	struct tvec_base *base = per_cpu(tvec_bases, cpu);
-	unsigned long flags;
-
-	timer_stats_timer_set_start_info(timer);
-	BUG_ON(timer_pending(timer) || !timer->function);
-	spin_lock_irqsave(&base->lock, flags);
-	timer_set_base(timer, base);
-	debug_timer_activate(timer);
-	internal_add_timer(base, timer);
-	/*
-	 * Check whether the other CPU is idle and needs to be
-	 * triggered to reevaluate the timer wheel when nohz is
-	 * active. We are protected against the other CPU fiddling
-	 * with the timer by holding the timer base lock. This also
-	 * makes sure that a CPU on the way to idle can not evaluate
-	 * the timer wheel.
-	 */
-	wake_up_idle_cpu(cpu);
-	spin_unlock_irqrestore(&base->lock, flags);
+	return __mod_timer(timer, expires, true);
 }
+EXPORT_SYMBOL(mod_timer_pending);
 
 /**
  * mod_timer - modify a timer's timeout
@@ -699,9 +691,6 @@  void add_timer_on(struct timer_list *tim
  */
 int mod_timer(struct timer_list *timer, unsigned long expires)
 {
-	BUG_ON(!timer->function);
-
-	timer_stats_timer_set_start_info(timer);
 	/*
 	 * This is a common optimization triggered by the
 	 * networking code - if the timer is re-modified
@@ -710,12 +699,62 @@  int mod_timer(struct timer_list *timer, 
 	if (timer->expires == expires && timer_pending(timer))
 		return 1;
 
-	return __mod_timer(timer, expires);
+	return __mod_timer(timer, expires, false);
 }
-
 EXPORT_SYMBOL(mod_timer);
 
 /**
+ * add_timer - start a timer
+ * @timer: the timer to be added
+ *
+ * The kernel will do a ->function(->data) callback from the
+ * timer interrupt at the ->expires point in the future. The
+ * current time is 'jiffies'.
+ *
+ * The timer's ->expires, ->function (and if the handler uses it, ->data)
+ * fields must be set prior calling this function.
+ *
+ * Timers with an ->expires field in the past will be executed in the next
+ * timer tick.
+ */
+void add_timer(struct timer_list *timer)
+{
+	BUG_ON(timer_pending(timer));
+	mod_timer(timer, timer->expires);
+}
+EXPORT_SYMBOL(add_timer);
+
+/**
+ * add_timer_on - start a timer on a particular CPU
+ * @timer: the timer to be added
+ * @cpu: the CPU to start it on
+ *
+ * This is not very scalable on SMP. Double adds are not possible.
+ */
+void add_timer_on(struct timer_list *timer, int cpu)
+{
+	struct tvec_base *base = per_cpu(tvec_bases, cpu);
+	unsigned long flags;
+
+	timer_stats_timer_set_start_info(timer);
+	BUG_ON(timer_pending(timer) || !timer->function);
+	spin_lock_irqsave(&base->lock, flags);
+	timer_set_base(timer, base);
+	debug_timer_activate(timer);
+	internal_add_timer(base, timer);
+	/*
+	 * Check whether the other CPU is idle and needs to be
+	 * triggered to reevaluate the timer wheel when nohz is
+	 * active. We are protected against the other CPU fiddling
+	 * with the timer by holding the timer base lock. This also
+	 * makes sure that a CPU on the way to idle can not evaluate
+	 * the timer wheel.
+	 */
+	wake_up_idle_cpu(cpu);
+	spin_unlock_irqrestore(&base->lock, flags);
+}
+
+/**
  * del_timer - deactive a timer.
  * @timer: the timer to be deactivated
  *
@@ -744,7 +783,6 @@  int del_timer(struct timer_list *timer)
 
 	return ret;
 }
-
 EXPORT_SYMBOL(del_timer);
 
 #ifdef CONFIG_SMP
@@ -778,7 +816,6 @@  out:
 
 	return ret;
 }
-
 EXPORT_SYMBOL(try_to_del_timer_sync);
 
 /**
@@ -816,7 +853,6 @@  int del_timer_sync(struct timer_list *ti
 		cpu_relax();
 	}
 }
-
 EXPORT_SYMBOL(del_timer_sync);
 #endif
 
@@ -1314,7 +1350,7 @@  signed long __sched schedule_timeout(sig
 	expire = timeout + jiffies;
 
 	setup_timer_on_stack(&timer, process_timeout, (unsigned long)current);
-	__mod_timer(&timer, expire);
+	__mod_timer(&timer, expire, false);
 	schedule();
 	del_singleshot_timer_sync(&timer);