diff mbox

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - any one else seeing this?

Message ID 20170820230040.706b62ac@roar.ozlabs.ibm.com
State Not Applicable
Delegated to: David Miller
Headers show

Commit Message

Nicholas Piggin Aug. 20, 2017, 1 p.m. UTC
On Sun, 20 Aug 2017 14:45:53 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Wed, 16 Aug 2017 09:27:31 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > 
> > Thomas, John, am I misinterpreting the timer trace event messages?  
> 
> So I did some digging, and what you find is that rcu_sched seems to do a
> simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> The process_timeout timer never fires (when it finally does wake after
> one of these events, it usually removes the timer with del_timer_sync).
> 
> So this patch seems to fix it. Testing, comments welcome.

Okay this had a problem of trying to forward the timer from a timer
callback function.

This was my other approach which also fixes the RCU warnings, but it's
a little more complex. I reworked it a bit so the mod_timer fast path
hopefully doesn't have much more overhead (actually by reading jiffies
only when needed, it probably saves a load).

Thanks,
Nick

--
[PATCH] timers: Fix excessive granularity of new timers after a nohz idle

When a timer base is idle, it is forwarded when a new timer is added to
ensure that granularity does not become excessive. When not idle, the
timer tick is expected to increment the base.

However there is a window after a timer is restarted from nohz, when it
is marked not-idle, and before the timer tick on this CPU, where a timer
may be added on an ancient base that does not get forwarded (beacause
the timer appears not-idle).

This results in excessive granularity. So much so that a 1 jiffy timeout
has blown out to 10s of seconds and triggered the RCU stall warning
detector.

Fix this by keeping track of whether the timer has been idle since it was
last run or forwarded, and allow forwarding in the case that is true (even
if it is not currently idle).

Also add a comment noting a case where we could get an unexpectedly
large granularity for a timer. I debugged this problem by adding
warnings for such cases, but it seems we can't add them in general due
to this corner case.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 kernel/time/timer.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

Comments

Paul E. McKenney Aug. 20, 2017, 6:35 p.m. UTC | #1
On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> On Sun, 20 Aug 2017 14:45:53 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Wed, 16 Aug 2017 09:27:31 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > 
> > > Thomas, John, am I misinterpreting the timer trace event messages?  
> > 
> > So I did some digging, and what you find is that rcu_sched seems to do a
> > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > The process_timeout timer never fires (when it finally does wake after
> > one of these events, it usually removes the timer with del_timer_sync).
> > 
> > So this patch seems to fix it. Testing, comments welcome.
> 
> Okay this had a problem of trying to forward the timer from a timer
> callback function.
> 
> This was my other approach which also fixes the RCU warnings, but it's
> a little more complex. I reworked it a bit so the mod_timer fast path
> hopefully doesn't have much more overhead (actually by reading jiffies
> only when needed, it probably saves a load).

Giving this one a whirl!

							Thanx, Paul

> Thanks,
> Nick
> 
> --
> [PATCH] timers: Fix excessive granularity of new timers after a nohz idle
> 
> When a timer base is idle, it is forwarded when a new timer is added to
> ensure that granularity does not become excessive. When not idle, the
> timer tick is expected to increment the base.
> 
> However there is a window after a timer is restarted from nohz, when it
> is marked not-idle, and before the timer tick on this CPU, where a timer
> may be added on an ancient base that does not get forwarded (beacause
> the timer appears not-idle).
> 
> This results in excessive granularity. So much so that a 1 jiffy timeout
> has blown out to 10s of seconds and triggered the RCU stall warning
> detector.
> 
> Fix this by keeping track of whether the timer has been idle since it was
> last run or forwarded, and allow forwarding in the case that is true (even
> if it is not currently idle).
> 
> Also add a comment noting a case where we could get an unexpectedly
> large granularity for a timer. I debugged this problem by adding
> warnings for such cases, but it seems we can't add them in general due
> to this corner case.
> 
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  kernel/time/timer.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
> 
> diff --git a/kernel/time/timer.c b/kernel/time/timer.c
> index 8f5d1bf18854..ee7b8b688b48 100644
> --- a/kernel/time/timer.c
> +++ b/kernel/time/timer.c
> @@ -203,6 +203,7 @@ struct timer_base {
>  	bool			migration_enabled;
>  	bool			nohz_active;
>  	bool			is_idle;
> +	bool			was_idle;
>  	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
>  	struct hlist_head	vectors[WHEEL_SIZE];
>  } ____cacheline_aligned;
> @@ -856,13 +857,19 @@ get_target_base(struct timer_base *base, unsigned tflags)
> 
>  static inline void forward_timer_base(struct timer_base *base)
>  {
> -	unsigned long jnow = READ_ONCE(jiffies);
> +	unsigned long jnow;
> 
>  	/*
> -	 * We only forward the base when it's idle and we have a delta between
> -	 * base clock and jiffies.
> +	 * We only forward the base when we are idle or have just come out
> +	 * of idle (was_idle logic), and have a delta between base clock
> +	 * and jiffies. In the common case, run_timers will take care of it.
>  	 */
> -	if (!base->is_idle || (long) (jnow - base->clk) < 2)
> +	if (likely(!base->was_idle))
> +		return;
> +
> +	jnow = READ_ONCE(jiffies);
> +	base->was_idle = base->is_idle;
> +	if ((long)(jnow - base->clk) < 2)
>  		return;
> 
>  	/*
> @@ -938,6 +945,13 @@ __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
>  	 * same array bucket then just return:
>  	 */
>  	if (timer_pending(timer)) {
> +		/*
> +		 * The downside of this optimization is that it can result in
> +		 * larger granularity than you would get from adding a new
> +		 * timer with this expiry. Would a timer flag for networking
> +		 * be appropriate, then we can try to keep expiry of general
> +		 * timers within ~1/8th of their interval?
> +		 */
>  		if (timer->expires == expires)
>  			return 1;
> 
> @@ -1499,8 +1513,10 @@ u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
>  		/*
>  		 * If we expect to sleep more than a tick, mark the base idle:
>  		 */
> -		if ((expires - basem) > TICK_NSEC)
> +		if ((expires - basem) > TICK_NSEC) {
> +			base->was_idle = true;
>  			base->is_idle = true;
> +		}
>  	}
>  	raw_spin_unlock(&base->lock);
> 
> @@ -1587,6 +1603,12 @@ static inline void __run_timers(struct timer_base *base)
>  	struct hlist_head heads[LVL_DEPTH];
>  	int levels;
> 
> +	/*
> +	 * was_idle must be cleared before running timers so that any timer
> +	 * functions that call mod_timer will not try to forward the base.
> +	 */
> +	base->was_idle = false;
> +
>  	if (!time_after_eq(jiffies, base->clk))
>  		return;
> 
> -- 
> 2.13.3
> 

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney Aug. 20, 2017, 9:14 p.m. UTC | #2
On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:
> > On Sun, 20 Aug 2017 14:45:53 +1000
> > Nicholas Piggin <npiggin@gmail.com> wrote:
> > 
> > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > 
> > > > Thomas, John, am I misinterpreting the timer trace event messages?  
> > > 
> > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > > The process_timeout timer never fires (when it finally does wake after
> > > one of these events, it usually removes the timer with del_timer_sync).
> > > 
> > > So this patch seems to fix it. Testing, comments welcome.
> > 
> > Okay this had a problem of trying to forward the timer from a timer
> > callback function.
> > 
> > This was my other approach which also fixes the RCU warnings, but it's
> > a little more complex. I reworked it a bit so the mod_timer fast path
> > hopefully doesn't have much more overhead (actually by reading jiffies
> > only when needed, it probably saves a load).
> 
> Giving this one a whirl!

No joy here, but then again there are other reasons to believe that I
am seeing a different bug than Dave and Jonathan are.

OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
not statistically different from what I see without either patch.

But no statistical difference compared to without patch, and I still
see the "rcu_sched kthread starved" messages.  For whatever it is worth,
by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
Hmmm...  I am also seeing that without any of your patches.  Might
be hypervisor preemption, I guess.

							Thanx, Paul

PS.  I will be off the grid for the next day or so.  Eclipse day here...

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicholas Piggin Aug. 21, 2017, 12:52 a.m. UTC | #3
On Sun, 20 Aug 2017 14:14:29 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > >   
> > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:  
> > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > 
> > > > > Thomas, John, am I misinterpreting the timer trace event messages?    
> > > > 
> > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > > > The process_timeout timer never fires (when it finally does wake after
> > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > 
> > > > So this patch seems to fix it. Testing, comments welcome.  
> > > 
> > > Okay this had a problem of trying to forward the timer from a timer
> > > callback function.
> > > 
> > > This was my other approach which also fixes the RCU warnings, but it's
> > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > only when needed, it probably saves a load).  
> > 
> > Giving this one a whirl!  
> 
> No joy here, but then again there are other reasons to believe that I
> am seeing a different bug than Dave and Jonathan are.
> 
> OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> not statistically different from what I see without either patch.
> 
> But no statistical difference compared to without patch, and I still
> see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> Hmmm...  I am also seeing that without any of your patches.  Might
> be hypervisor preemption, I guess.

Okay it makes the warnings go away for me, but I'm just booting then
leaving the system idle. You're doing some CPU hotplug activity?

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Nicholas Piggin Aug. 21, 2017, 6:06 a.m. UTC | #4
On Mon, 21 Aug 2017 10:52:58 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Sun, 20 Aug 2017 14:14:29 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:  
> > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:    
> > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > > >     
> > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:    
> > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > Thomas, John, am I misinterpreting the timer trace event messages?      
> > > > > 
> > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > 
> > > > > So this patch seems to fix it. Testing, comments welcome.    
> > > > 
> > > > Okay this had a problem of trying to forward the timer from a timer
> > > > callback function.
> > > > 
> > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > only when needed, it probably saves a load).    
> > > 
> > > Giving this one a whirl!    
> > 
> > No joy here, but then again there are other reasons to believe that I
> > am seeing a different bug than Dave and Jonathan are.
> > 
> > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > not statistically different from what I see without either patch.
> > 
> > But no statistical difference compared to without patch, and I still
> > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > Hmmm...  I am also seeing that without any of your patches.  Might
> > be hypervisor preemption, I guess.  
> 
> Okay it makes the warnings go away for me, but I'm just booting then
> leaving the system idle. You're doing some CPU hotplug activity?

Okay found a bug in the patch (it was not forwarding properly before
adding the first timer after an idle) and a few other concerns.

There's still a problem of a timer function doing a mod timer from
within expire_timers. It can't forward the base, which might currently
be quite a way behind. I *think* after we close these gaps and get
timely wakeups for timers on there, it should not get too far behind
for standard timers.

Deferrable is a different story. Firstly it has no idle tracking so we
never forward it. Even if we wanted to, we can't do it reliably because
it could contain timers way behind the base. They are "deferrable", so
you get what you pay for, but this still means there's a window where
you can add a deferrable timer and get a far later expiry than you
asked for despite the CPU never going idle after you added it.

All these problems would seem to go away if mod_timer just queued up
the timer to a single list on the base then pushed them into the
wheel during your wheel processing softirq... Although maybe you end
up with excessive passes over big queue of timers. Anyway that
wouldn't be suitable for 4.13 even if it could work.

I'll send out an updated minimal fix after some more testing...

Thanks,
Nick
--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jonathan Cameron Aug. 21, 2017, 10:18 a.m. UTC | #5
On Mon, 21 Aug 2017 16:06:05 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Mon, 21 Aug 2017 10:52:58 +1000
> Nicholas Piggin <npiggin@gmail.com> wrote:
> 
> > On Sun, 20 Aug 2017 14:14:29 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:    
> > > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:      
> > > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > > > >       
> > > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:      
> > > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > > 
> > > > > > > Thomas, John, am I misinterpreting the timer trace event messages?        
> > > > > > 
> > > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > > 
> > > > > > So this patch seems to fix it. Testing, comments welcome.      
> > > > > 
> > > > > Okay this had a problem of trying to forward the timer from a timer
> > > > > callback function.
> > > > > 
> > > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > > only when needed, it probably saves a load).      
> > > > 
> > > > Giving this one a whirl!      
> > > 
> > > No joy here, but then again there are other reasons to believe that I
> > > am seeing a different bug than Dave and Jonathan are.
> > > 
> > > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > > not statistically different from what I see without either patch.
> > > 
> > > But no statistical difference compared to without patch, and I still
> > > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > > Hmmm...  I am also seeing that without any of your patches.  Might
> > > be hypervisor preemption, I guess.    
> > 
> > Okay it makes the warnings go away for me, but I'm just booting then
> > leaving the system idle. You're doing some CPU hotplug activity?  
> 
> Okay found a bug in the patch (it was not forwarding properly before
> adding the first timer after an idle) and a few other concerns.
> 
> There's still a problem of a timer function doing a mod timer from
> within expire_timers. It can't forward the base, which might currently
> be quite a way behind. I *think* after we close these gaps and get
> timely wakeups for timers on there, it should not get too far behind
> for standard timers.
> 
> Deferrable is a different story. Firstly it has no idle tracking so we
> never forward it. Even if we wanted to, we can't do it reliably because
> it could contain timers way behind the base. They are "deferrable", so
> you get what you pay for, but this still means there's a window where
> you can add a deferrable timer and get a far later expiry than you
> asked for despite the CPU never going idle after you added it.
> 
> All these problems would seem to go away if mod_timer just queued up
> the timer to a single list on the base then pushed them into the
> wheel during your wheel processing softirq... Although maybe you end
> up with excessive passes over big queue of timers. Anyway that
> wouldn't be suitable for 4.13 even if it could work.
> 
> I'll send out an updated minimal fix after some more testing...

Hi All,

I'm back in the office with hardware access on our D05 64 core ARM64
boards.

I think we still have by far the quickest test cases for this so
feel free to ping me anything you want tested quickly (we were
looking at an average of less than 10 minutes to trigger
with machine idling).

Nick, I'm currently running your previous version and we are over an
hour so even without any instances of the issue so it looks like a
considerable improvement.  I'll see if I can line a couple of boards
up for an overnight run if you have your updated version out by then.

Be great to finally put this one to bed.

Thanks,

Jonathan

> 
> Thanks,
> Nick

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paul E. McKenney Aug. 22, 2017, 12:38 a.m. UTC | #6
On Mon, Aug 21, 2017 at 10:52:58AM +1000, Nicholas Piggin wrote:
> On Sun, 20 Aug 2017 14:14:29 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Sun, Aug 20, 2017 at 11:35:14AM -0700, Paul E. McKenney wrote:
> > > On Sun, Aug 20, 2017 at 11:00:40PM +1000, Nicholas Piggin wrote:  
> > > > On Sun, 20 Aug 2017 14:45:53 +1000
> > > > Nicholas Piggin <npiggin@gmail.com> wrote:
> > > >   
> > > > > On Wed, 16 Aug 2017 09:27:31 -0700
> > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:  
> > > > > > On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> > > > > > 
> > > > > > Thomas, John, am I misinterpreting the timer trace event messages?    
> > > > > 
> > > > > So I did some digging, and what you find is that rcu_sched seems to do a
> > > > > simple scheudle_timeout(1) and just goes out to lunch for many seconds.
> > > > > The process_timeout timer never fires (when it finally does wake after
> > > > > one of these events, it usually removes the timer with del_timer_sync).
> > > > > 
> > > > > So this patch seems to fix it. Testing, comments welcome.  
> > > > 
> > > > Okay this had a problem of trying to forward the timer from a timer
> > > > callback function.
> > > > 
> > > > This was my other approach which also fixes the RCU warnings, but it's
> > > > a little more complex. I reworked it a bit so the mod_timer fast path
> > > > hopefully doesn't have much more overhead (actually by reading jiffies
> > > > only when needed, it probably saves a load).  
> > > 
> > > Giving this one a whirl!  
> > 
> > No joy here, but then again there are other reasons to believe that I
> > am seeing a different bug than Dave and Jonathan are.
> > 
> > OK, not -entirely- without joy -- 10 of 14 runs were error-free, which
> > is a good improvement over 0 of 84 for your earlier patch.  ;-)  But
> > not statistically different from what I see without either patch.
> > 
> > But no statistical difference compared to without patch, and I still
> > see the "rcu_sched kthread starved" messages.  For whatever it is worth,
> > by the way, I also see this: "hrtimer: interrupt took 5712368 ns".
> > Hmmm...  I am also seeing that without any of your patches.  Might
> > be hypervisor preemption, I guess.
> 
> Okay it makes the warnings go away for me, but I'm just booting then
> leaving the system idle. You're doing some CPU hotplug activity?

Yes, along with rcutorture, so a very different workload.

						Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe sparclinux" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 8f5d1bf18854..ee7b8b688b48 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -203,6 +203,7 @@  struct timer_base {
 	bool			migration_enabled;
 	bool			nohz_active;
 	bool			is_idle;
+	bool			was_idle;
 	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
 	struct hlist_head	vectors[WHEEL_SIZE];
 } ____cacheline_aligned;
@@ -856,13 +857,19 @@  get_target_base(struct timer_base *base, unsigned tflags)
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-	unsigned long jnow = READ_ONCE(jiffies);
+	unsigned long jnow;
 
 	/*
-	 * We only forward the base when it's idle and we have a delta between
-	 * base clock and jiffies.
+	 * We only forward the base when we are idle or have just come out
+	 * of idle (was_idle logic), and have a delta between base clock
+	 * and jiffies. In the common case, run_timers will take care of it.
 	 */
-	if (!base->is_idle || (long) (jnow - base->clk) < 2)
+	if (likely(!base->was_idle))
+		return;
+
+	jnow = READ_ONCE(jiffies);
+	base->was_idle = base->is_idle;
+	if ((long)(jnow - base->clk) < 2)
 		return;
 
 	/*
@@ -938,6 +945,13 @@  __mod_timer(struct timer_list *timer, unsigned long expires, bool pending_only)
 	 * same array bucket then just return:
 	 */
 	if (timer_pending(timer)) {
+		/*
+		 * The downside of this optimization is that it can result in
+		 * larger granularity than you would get from adding a new
+		 * timer with this expiry. Would a timer flag for networking
+		 * be appropriate, then we can try to keep expiry of general
+		 * timers within ~1/8th of their interval?
+		 */
 		if (timer->expires == expires)
 			return 1;
 
@@ -1499,8 +1513,10 @@  u64 get_next_timer_interrupt(unsigned long basej, u64 basem)
 		/*
 		 * If we expect to sleep more than a tick, mark the base idle:
 		 */
-		if ((expires - basem) > TICK_NSEC)
+		if ((expires - basem) > TICK_NSEC) {
+			base->was_idle = true;
 			base->is_idle = true;
+		}
 	}
 	raw_spin_unlock(&base->lock);
 
@@ -1587,6 +1603,12 @@  static inline void __run_timers(struct timer_base *base)
 	struct hlist_head heads[LVL_DEPTH];
 	int levels;
 
+	/*
+	 * was_idle must be cleared before running timers so that any timer
+	 * functions that call mod_timer will not try to forward the base.
+	 */
+	base->was_idle = false;
+
 	if (!time_after_eq(jiffies, base->clk))
 		return;