diff mbox

RCU lockup issues when CONFIG_SOFTLOCKUP_DETECTOR=n - anyone else seeing this?

Message ID 20170815154743.GK7017@linux.vnet.ibm.com (mailing list archive)
State Not Applicable

Commit Message

Paul E. McKenney Aug. 15, 2017, 3:47 p.m. UTC
On Wed, Aug 02, 2017 at 05:25:55PM +0100, Jonathan Cameron wrote:
> On Tue, 1 Aug 2017 11:46:46 -0700
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> 
> > On Mon, Jul 31, 2017 at 04:27:57PM +0100, Jonathan Cameron wrote:
> > > On Mon, 31 Jul 2017 08:04:11 -0700
> > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > >   
> > > > On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:  
> > > > > On Fri, 28 Jul 2017 12:03:50 -0700
> > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > >     
> > > > > > On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:    
> > > > > > > On Fri, 28 Jul 2017 09:55:29 -0700
> > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > >       
> > > > > > > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:      
> > > > > > > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > > > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:        
> > > > > > > > 
> > > > > > > > [ . . . ]
> > > > > > > >       
> > > > > > > > > Ok.  Some info.  I disabled a few drivers (USB and SAS) in the interest of having
> > > > > > > > > fewer timer events.  Issue became much easier to trigger (on some runs before
> > > > > > > > > I could get tracing up and running).
> > > > > > > > > 
> > > > > > > > > So the logs are large enough that pastebin doesn't like them - please shout if
> > > > > > > > > another timer period is of interest.
> > > > > > > > > 
> > > > > > > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > > > > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > > > > > > 
> > > > > > > > > The relevant timeout on the RCU stall detector was 8 seconds.  Event is
> > > > > > > > > detected around 835.
> > > > > > > > > 
> > > > > > > > > It's a lot of logs, so I haven't identified a smoking gun yet but there
> > > > > > > > > may well be one in there.        
> > > > > > > > 
> > > > > > > > The dmesg says:
> > > > > > > > 
> > > > > > > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > > > > > > > 
> > > > > > > > So I look for "rcu_preempt" timer events and find these:
> > > > > > > > 
> > > > > > > > rcu_preempt-9     [019] ....   827.579114: timer_init: timer=ffff8017d5fc7da0
> > > > > > > > rcu_preempt-9     [019] d..1   827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout 
> > > > > > > > 
> > > > > > > > Next look for "ffff8017d5fc7da0" and I don't find anything else.      
> > > > > > > It does show up off the bottom of what would fit in pastebin...
> > > > > > > 
> > > > > > >      rcu_preempt-9     [001] d..1   837.681077: timer_cancel: timer=ffff8017d5fc7da0
> > > > > > >      rcu_preempt-9     [001] ....   837.681086: timer_init: timer=ffff8017d5fc7da0
> > > > > > >      rcu_preempt-9     [001] d..1   837.681087: timer_start: timer=ffff8017d5fc7da0 function=process_timeout expires=4295101298 [timeout=1] cpu=1 idx=0 flags=      
> > > > > > 
> > > > > > Odd.  I would expect an expiration...  And ten seconds is way longer
> > > > > > than the requested one jiffy!
> > > > > >     
> > > > > > > > The timeout was one jiffy, and more than a second later, no expiration.
> > > > > > > > Is it possible that this event was lost?  I am not seeing any sign of
> > > > > > > > this in the trace.
> > > > > > > > 
> > > > > > > > I don't see any sign of CPU hotplug (and I test with lots of that in
> > > > > > > > any case).
> > > > > > > > 
> > > > > > > > The last time we saw something like this it was a timer HW/driver problem,
> > > > > > > > but it is a bit hard to imagine such a problem affecting both ARM64
> > > > > > > > and SPARC.  ;-)      
> > > > > > > Could be different issues, both of which were hidden by that lockup detector.
> > > > > > > 
> > > > > > > There is an errata workaround for the timers on this particular board.
> > > > > > > I'm only vaguely aware of it, so may be unconnected.
> > > > > > > 
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2&id=bb42ca47401010fc02901b5e8f79e40a26f208cb
> > > > > > > 
> > > > > > > Seems unlikely though! + we've not yet seen it on the other chips that
> > > > > > > errata affects (not that that means much).
> > > > > > 
> > > > > > If you can reproduce quickly, might be worth trying anyway...
> > > > > > 
> > > > > > 							Thanx, Paul    
> > > > > Errata fix is running already and was for all those tests.    
> > > > 
> > > > I was afraid of that...  ;-)  
> > > It's a pretty rare errata it seems.  Not actually managed to catch
> > > one yet.   
> > > >   
> > > > > I'll have a dig into the timers today and see where I get to.    
> > > > 
> > > > Look forward to seeing what you find!  
> > > Nothing obvious turning up other than we don't seem to have the issue
> > > when we aren't running hrtimers.
> > > 
> > > On the plus side I just got a report that it is affecting our d03
> > > boards which is good on the basis I couldn't tell what the difference
> > > could be wrt this issue!
> > > 
> > > It indeed looks like we are consistently missing a timer before
> > > the rcu splat occurs.  
> > 
> > And for my part, my tests with CONFIG_HZ_PERIODIC=y and
> > CONFIG_RCU_FAST_NO_HZ=n showed roughly the same failure rate
> > as other runs.
> > 
> > Missing a timer can most certainly give RCU severe heartburn!  ;-)
> > Do you have what you need to track down the missing timer?  
> 
> Not managed to make much progress yet.  Turning on any additional tracing
> in that area seems to make the issue stop happening or at least
> occur very infrequently. Which certainly makes it 'fun' to find.
> 
> As a long shot I applied a locking fix from another reported issue that
> was causing rcu stalls and it seemed good for much longer, but
> eventually still occurred.
> 
> (from the thread rcu_sched stall while waiting in csd_lock_wait())

On the perhaps unlikely off-chance that it helps locate something,
here is a patch that adds a trace_printk() to check how long a CPU
believes that it can sleep when going idle.  The thought is to check
to see if a CPU with a timer set to expire in one jiffy thinks that
it can sleep for (say) 30 seconds.
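As a rough post-processing sketch (illustrative only: the helper name, line format, and threshold are assumptions, not part of the patch or this thread), the resulting trace could be scanned for implausibly long sleep estimates:

```python
import re

# The trace_printk() in the patch prints (tick - ktime_get()) / 1000,
# i.e. the expected sleep in microseconds. This pattern and helper are
# hypothetical post-processing, not part of the patch itself.
PATTERN = re.compile(r"tick_nohz_stop_sched_tick: (-?\d+)")

def suspicious_sleeps(lines, threshold_us=1_000_000):
    """Yield (line, sleep_us) for samples whose estimate exceeds threshold_us."""
    for line in lines:
        m = PATTERN.search(line)
        if m:
            sleep_us = int(m.group(1))
            if sleep_us > threshold_us:
                yield line, sleep_us

# Sample lines in the shape reported later in this thread.
trace = [
    "<idle>-0 [054] d... 1576.030170: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 34469831",
    "<idle>-0 [048] d... 1602.039703: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 9973",
]

for line, us in suspicious_sleeps(trace):
    print(f"{us / 1e6:.2f} s estimated sleep: {line}")
```

A one-second threshold is arbitrary; the point is that any estimate vastly larger than the shortest pending timer deserves a closer look.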

Didn't find anything for my problem, but I believe that yours is
different, so...

							Thanx, Paul

------------------------------------------------------------------------

commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Aug 14 08:54:39 2017 -0700

    EXP: Trace tick return from tick_nohz_stop_sched_tick
    
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

Comments

Jonathan Cameron Aug. 16, 2017, 1:24 a.m. UTC | #1
On Tue, 15 Aug 2017 08:47:43 -0700
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:

> On Wed, Aug 02, 2017 at 05:25:55PM +0100, Jonathan Cameron wrote:
> > On Tue, 1 Aug 2017 11:46:46 -0700
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> >   
> > > On Mon, Jul 31, 2017 at 04:27:57PM +0100, Jonathan Cameron wrote:  
> > > > On Mon, 31 Jul 2017 08:04:11 -0700
> > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > >     
> > > > > On Mon, Jul 31, 2017 at 12:08:47PM +0100, Jonathan Cameron wrote:    
> > > > > > On Fri, 28 Jul 2017 12:03:50 -0700
> > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > >       
> > > > > > > On Fri, Jul 28, 2017 at 06:27:05PM +0100, Jonathan Cameron wrote:      
> > > > > > > > On Fri, 28 Jul 2017 09:55:29 -0700
> > > > > > > > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> wrote:
> > > > > > > >         
> > > > > > > > > On Fri, Jul 28, 2017 at 02:24:03PM +0100, Jonathan Cameron wrote:        
> > > > > > > > > > On Fri, 28 Jul 2017 08:44:11 +0100
> > > > > > > > > > Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:          
> > > > > > > > > 
> > > > > > > > > [ . . . ]
> > > > > > > > >         
> > > > > > > > > > Ok.  Some info.  I disabled a few drivers (USB and SAS) in the interest of having
> > > > > > > > > > fewer timer events.  Issue became much easier to trigger (on some runs before
> > > > > > > > > > I could get tracing up and running).
> > > > > > > > > > 
> > > > > > > > > > So the logs are large enough that pastebin doesn't like them - please shout if
> > > > > > > > > > another timer period is of interest.
> > > > > > > > > > 
> > > > > > > > > > https://pastebin.com/iUZDfQGM for the timer trace.
> > > > > > > > > > https://pastebin.com/3w1F7amH for dmesg.  
> > > > > > > > > > 
> > > > > > > > > > The relevant timeout on the RCU stall detector was 8 seconds.  Event is
> > > > > > > > > > detected around 835.
> > > > > > > > > > 
> > > > > > > > > > It's a lot of logs, so I haven't identified a smoking gun yet but there
> > > > > > > > > > may well be one in there.          
> > > > > > > > > 
> > > > > > > > > The dmesg says:
> > > > > > > > > 
> > > > > > > > > rcu_preempt kthread starved for 2508 jiffies! g112 c111 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
> > > > > > > > > 
> > > > > > > > > So I look for "rcu_preempt" timer events and find these:
> > > > > > > > > 
> > > > > > > > > rcu_preempt-9     [019] ....   827.579114: timer_init: timer=ffff8017d5fc7da0
> > > > > > > > > rcu_preempt-9     [019] d..1   827.579115: timer_start: timer=ffff8017d5fc7da0 function=process_timeout 
> > > > > > > > > 
> > > > > > > > > Next look for "ffff8017d5fc7da0" and I don't find anything else.        
> > > > > > > > It does show up off the bottom of what would fit in pastebin...
> > > > > > > > 
> > > > > > > >      rcu_preempt-9     [001] d..1   837.681077: timer_cancel: timer=ffff8017d5fc7da0
> > > > > > > >      rcu_preempt-9     [001] ....   837.681086: timer_init: timer=ffff8017d5fc7da0
> > > > > > > >      rcu_preempt-9     [001] d..1   837.681087: timer_start: timer=ffff8017d5fc7da0 function=process_timeout expires=4295101298 [timeout=1] cpu=1 idx=0 flags=        
> > > > > > > 
> > > > > > > Odd.  I would expect an expiration...  And ten seconds is way longer
> > > > > > > than the requested one jiffy!
> > > > > > >       
> > > > > > > > > The timeout was one jiffy, and more than a second later, no expiration.
> > > > > > > > > Is it possible that this event was lost?  I am not seeing any sign of
> > > > > > > > > this in the trace.
> > > > > > > > > 
> > > > > > > > > I don't see any sign of CPU hotplug (and I test with lots of that in
> > > > > > > > > any case).
> > > > > > > > > 
> > > > > > > > > The last time we saw something like this it was a timer HW/driver problem,
> > > > > > > > > but it is a bit hard to imagine such a problem affecting both ARM64
> > > > > > > > > and SPARC.  ;-)        
> > > > > > > > Could be different issues, both of which were hidden by that lockup detector.
> > > > > > > > 
> > > > > > > > There is an errata workaround for the timers on this particular board.
> > > > > > > > I'm only vaguely aware of it, so may be unconnected.
> > > > > > > > 
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/clocksource/arm_arch_timer.c?h=v4.13-rc2&id=bb42ca47401010fc02901b5e8f79e40a26f208cb
> > > > > > > > 
> > > > > > > > Seems unlikely though! + we've not yet seen it on the other chips that
> > > > > > > > errata affects (not that that means much).
> > > > > > > 
> > > > > > > If you can reproduce quickly, might be worth trying anyway...
> > > > > > > 
> > > > > > > 							Thanx, Paul      
> > > > > > Errata fix is running already and was for all those tests.      
> > > > > 
> > > > > I was afraid of that...  ;-)    
> > > > It's a pretty rare errata it seems.  Not actually managed to catch
> > > > one yet.     
> > > > >     
> > > > > > I'll have a dig into the timers today and see where I get to.      
> > > > > 
> > > > > Look forward to seeing what you find!    
> > > > Nothing obvious turning up other than we don't seem to have the issue
> > > > when we aren't running hrtimers.
> > > > 
> > > > On the plus side I just got a report that it is affecting our d03
> > > > boards which is good on the basis I couldn't tell what the difference
> > > > could be wrt this issue!
> > > > 
> > > > It indeed looks like we are consistently missing a timer before
> > > > the rcu splat occurs.    
> > > 
> > > And for my part, my tests with CONFIG_HZ_PERIODIC=y and
> > > CONFIG_RCU_FAST_NO_HZ=n showed roughly the same failure rate
> > > as other runs.
> > > 
> > > Missing a timer can most certainly give RCU severe heartburn!  ;-)
> > > Do you have what you need to track down the missing timer?    
> > 
> > Not managed to make much progress yet.  Turning on any additional tracing
> > in that area seems to make the issue stop happening or at least
> > occur very infrequently. Which certainly makes it 'fun' to find.
> > 
> > As a long shot I applied a locking fix from another reported issue that
> > was causing rcu stalls and it seemed good for much longer, but
> > eventually still occurred.
> > 
> > (from the thread rcu_sched stall while waiting in csd_lock_wait())  
> 
> On the perhaps unlikely off-chance that it helps locate something,
> here is a patch that adds a trace_printk() to check how long a CPU
> believes that it can sleep when going idle.  The thought is to check
> to see if a CPU with a timer set to expire in one jiffy thinks that
> it can sleep for (say) 30 seconds.
> 
> Didn't find anything for my problem, but I believe that yours is
> different, so...
> 
> 							Thanx, Paul

Hi Paul,

Thanks for the suggestion.  I hadn't thought to look at the
expected time being wrong.

I have noted that adding other tracepoints (and turning them on)
in that function cause the problem to 'disappear' though so it
would seem very timing sensitive.  Fingers crossed this doesn't
have the same effect!

Our progress on this has been a bit limited partly as I have been
traveling and haven't sorted out remote hardware access.
May be the start of next week before I can try this out.

Agreed, the problems look to be different.

Interesting question is whether the other known cases fall
into one or the other category?

Thanks for the help and good luck with your variant!

Jonathan
> 
> ------------------------------------------------------------------------
> 
> commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Mon Aug 14 08:54:39 2017 -0700
> 
>     EXP: Trace tick return from tick_nohz_stop_sched_tick
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c7a899c5ce64..7358a5073dfb 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -817,6 +817,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
>  	 * (not only the tick).
>  	 */
>  	ts->sleep_length = ktime_sub(dev->next_event, now);
> +	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
>  	return tick;
>  }
>  
>
Michael Ellerman Aug. 16, 2017, 12:43 p.m. UTC | #2
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
...
>
> commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
> Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> Date:   Mon Aug 14 08:54:39 2017 -0700
>
>     EXP: Trace tick return from tick_nohz_stop_sched_tick
>     
>     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index c7a899c5ce64..7358a5073dfb 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -817,6 +817,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
>  	 * (not only the tick).
>  	 */
>  	ts->sleep_length = ktime_sub(dev->next_event, now);
> +	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
>  	return tick;
>  }

Should I be seeing negative values? A small sample:

          <idle>-0     [015] d...  1602.039695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250019
          <idle>-0     [009] d...  1602.039701: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
          <idle>-0     [007] d...  1602.039702: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
          <idle>-0     [048] d...  1602.039703: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 9973
          <idle>-0     [006] d...  1602.039704: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250027
          <idle>-0     [001] d...  1602.039730: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250053
          <idle>-0     [008] d...  1602.039732: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250055
          <idle>-0     [006] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
          <idle>-0     [009] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
          <idle>-0     [001] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018


I have a full trace, I'll send it to you off-list.

cheers
Paul E. McKenney Aug. 16, 2017, 12:56 p.m. UTC | #3
On Wed, Aug 16, 2017 at 10:43:52PM +1000, Michael Ellerman wrote:
> "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> ...
> >
> > commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
> > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > Date:   Mon Aug 14 08:54:39 2017 -0700
> >
> >     EXP: Trace tick return from tick_nohz_stop_sched_tick
> >     
> >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> >
> > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > index c7a899c5ce64..7358a5073dfb 100644
> > --- a/kernel/time/tick-sched.c
> > +++ b/kernel/time/tick-sched.c
> > @@ -817,6 +817,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
> >  	 * (not only the tick).
> >  	 */
> >  	ts->sleep_length = ktime_sub(dev->next_event, now);
> > +	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
> >  	return tick;
> >  }
> 
> Should I be seeing negative values? A small sample:

Maybe due to hypervisor preemption delays, but I confess that I am
surprised to see them this large.  1,602,250,019 microseconds is something
like a half hour, which could result in stall warnings all by itself.

>           <idle>-0     [015] d...  1602.039695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250019
>           <idle>-0     [009] d...  1602.039701: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
>           <idle>-0     [007] d...  1602.039702: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
>           <idle>-0     [048] d...  1602.039703: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 9973
>           <idle>-0     [006] d...  1602.039704: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250027
>           <idle>-0     [001] d...  1602.039730: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250053
>           <idle>-0     [008] d...  1602.039732: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250055
>           <idle>-0     [006] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
>           <idle>-0     [009] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
>           <idle>-0     [001] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
> 
> 
> I have a full trace, I'll send it to you off-list.

I will take a look!

							Thanx, Paul
Paul E. McKenney Aug. 16, 2017, 4:27 p.m. UTC | #4
On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 16, 2017 at 10:43:52PM +1000, Michael Ellerman wrote:
> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
> > ...
> > >
> > > commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
> > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > > Date:   Mon Aug 14 08:54:39 2017 -0700
> > >
> > >     EXP: Trace tick return from tick_nohz_stop_sched_tick
> > >     
> > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
> > >
> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> > > index c7a899c5ce64..7358a5073dfb 100644
> > > --- a/kernel/time/tick-sched.c
> > > +++ b/kernel/time/tick-sched.c
> > > @@ -817,6 +817,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
> > >  	 * (not only the tick).
> > >  	 */
> > >  	ts->sleep_length = ktime_sub(dev->next_event, now);
> > > +	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
> > >  	return tick;
> > >  }
> > 
> > Should I be seeing negative values? A small sample:
> 
> Maybe due to hypervisor preemption delays, but I confess that I am
> surprised to see them this large.  1,602,250,019 microseconds is something
> like a half hour, which could result in stall warnings all by itself.
> 
> >           <idle>-0     [015] d...  1602.039695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250019
> >           <idle>-0     [009] d...  1602.039701: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
> >           <idle>-0     [007] d...  1602.039702: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250025
> >           <idle>-0     [048] d...  1602.039703: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 9973
> >           <idle>-0     [006] d...  1602.039704: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250027
> >           <idle>-0     [001] d...  1602.039730: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250053
> >           <idle>-0     [008] d...  1602.039732: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602250055
> >           <idle>-0     [006] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
> >           <idle>-0     [009] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
> >           <idle>-0     [001] d...  1602.049695: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: -1602260018
> > 
> > 
> > I have a full trace, I'll send it to you off-list.
> 
> I will take a look!

And from your ps output, PID 9 is rcu_sched, which is the RCU grace-period
kthread that stalled.  This kthread was starved, based on this from your
dmesg:

[ 1602.067008] rcu_sched kthread starved for 2603 jiffies! g7275 c7274 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1

The RCU_GP_WAIT_FQS says that this kthread is periodically scanning for
idle-CPU and offline-CPU quiescent states, which means that its waits
will be accompanied by short timeouts.  The "starved for 2603 jiffies"
says that it has not run for one good long time.  The ->state is its
task_struct ->state field.

The immediately preceding dmesg line is as follows:

[ 1602.063851]  (detected by 53, t=2603 jiffies, g=7275, c=7274, q=608)

In other words, the rcu_sched grace-period kthread has been starved
for the entire duration of the current grace period, as shown by the
t=2603.

Let's turn now to the trace output, looking for the last bit of the
rcu_sched task's activity:

       rcu_sched-9     [054] d...  1576.030096: timer_start: timer=c0000007fae1bc20 function=process_timeout expires=4295094922 [timeout=1] cpu=54 idx=0 flags=
    ksoftirqd/53-276   [053] ..s.  1576.030097: rcu_invoke_callback: rcu_sched rhp=c000000fcf8c4eb0 func=__d_free
       rcu_sched-9     [054] d...  1576.030097: rcu_utilization: Start context switch
    ksoftirqd/53-276   [053] ..s.  1576.030098: rcu_invoke_callback: rcu_sched rhp=c000000fcff74ee0 func=proc_i_callback
       rcu_sched-9     [054] d...  1576.030098: rcu_grace_period: rcu_sched 7275 cpuqs
       rcu_sched-9     [054] d...  1576.030099: rcu_utilization: End context switch

So this task set up a timer ("timer_start:") for one jiffy ("[timeout=1]",
but what is with "expires=4295094922"?)  and blocked ("rcu_utilization:
Start context switch" and "rcu_utilization: End context switch"),
recording its CPU's quiescent state in the process ("rcu_grace_period:
rcu_sched 7275 cpuqs").

Of course, the timer will have expired in the context of some other task,
but a search for "c0000007fae1bc20" (see the "timer=" in the first trace
line above) shows nothing (to be painfully accurate, the search wraps back
to earlier uses of this timer by rcu_sched).  So the timer never did fire.

The next question is "what did CPU 054 do next?"  We find it entering idle:

          <idle>-0     [054] d...  1576.030167: tick_stop: success=1 dependency=NONE
          <idle>-0     [054] d...  1576.030167: hrtimer_cancel: hrtimer=c000000fff88c680
          <idle>-0     [054] d...  1576.030168: hrtimer_start: hrtimer=c000000fff88c680 function=tick_sched_timer expires=1610710000000 softexpires=1610710000000
          <idle>-0     [054] d...  1576.030170: __tick_nohz_idle_enter: tick_nohz_stop_sched_tick: 34469831
          <idle>-0     [054] d...  1576.030171: rcu_dyntick: Start 140000000000000 0

So we have an hrtimer set for 1610710000000, whatever time that might
happen to map to.  And that is the last we hear from CPU 054, so it
apparently maps to something a bit too far into the future.  Let's
assume that this value is nanoseconds since boot, in which case we have
1,610.710000000 seconds, which is eight seconds after the stall-warning
message.  And -way- longer than the one-jiffy timeout requested!

Thomas, John, am I misinterpreting the timer trace event messages?

							Thanx, Paul
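The analysis above boils down to simple arithmetic on the quoted numbers. As a sketch (values copied from the trace and dmesg excerpts in this message; the 10 ms jiffy assumes HZ=100, which is reported elsewhere in the thread):

```python
# Values quoted in the message above (trace timestamps in seconds).
timer_set_s        = 1576.030096     # rcu_sched arms a one-jiffy timer on CPU 54
sleep_estimate_us  = 34469831        # tick_nohz_stop_sched_tick estimate as CPU 54 idles
hrtimer_expires_ns = 1610710000000   # tick_sched_timer armed on idle entry
stall_warning_s    = 1602.067008     # dmesg timestamp of the starvation report

sleep_estimate_s  = sleep_estimate_us / 1e6    # ~34.47 s
hrtimer_expires_s = hrtimer_expires_ns / 1e9   # 1610.71 s

# The idle path believes it can sleep ~34 s even though a one-jiffy timer
# (10 ms at HZ=100) is pending, and the next tick is programmed well after
# the stall warning already fired.
print(round(sleep_estimate_s, 2))                     # 34.47
print(round(hrtimer_expires_s - stall_warning_s, 2))  # 8.64
print(round(hrtimer_expires_s - timer_set_s, 2))      # 34.68
```

That is, the programmed wakeup lands about 34.7 seconds after the one-jiffy timer was set and about 8.6 seconds after the stall warning, consistent with the "-way- longer than the one-jiffy timeout" observation.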
Michael Ellerman Aug. 17, 2017, 1:55 p.m. UTC | #5
"Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:

> On Wed, Aug 16, 2017 at 05:56:17AM -0700, Paul E. McKenney wrote:
>> On Wed, Aug 16, 2017 at 10:43:52PM +1000, Michael Ellerman wrote:
>> > "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> writes:
>> > ...
>> > >
>> > > commit 33103e7b1f89ef432dfe3337d2a6932cdf5c1312
>> > > Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> > > Date:   Mon Aug 14 08:54:39 2017 -0700
>> > >
>> > >     EXP: Trace tick return from tick_nohz_stop_sched_tick
>> > >     
>> > >     Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
>> > >
>> > > diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> > > index c7a899c5ce64..7358a5073dfb 100644
>> > > --- a/kernel/time/tick-sched.c
>> > > +++ b/kernel/time/tick-sched.c
>> > > @@ -817,6 +817,7 @@ static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
>> > >  	 * (not only the tick).
>> > >  	 */
>> > >  	ts->sleep_length = ktime_sub(dev->next_event, now);
>> > > +	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
>> > >  	return tick;
>> > >  }
>> > 
>> > Should I be seeing negative values? A small sample:
>> 
>> Maybe due to hypervisor preemption delays, but I confess that I am
>> surprised to see them this large.  1,602,250,019 microseconds is something
>> like a half hour, which could result in stall warnings all by itself.

Hmm. This is a bare metal machine. So no hypervisor.

>> I will take a look!
>
> And from your ps output, PID 9 is rcu_sched, which is the RCU grace-period
> kthread that stalled.  This kthread was starved, based on this from your
> dmesg:
>
> [ 1602.067008] rcu_sched kthread starved for 2603 jiffies! g7275 c7274 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x1
>
> The RCU_GP_WAIT_FQS says that this kthread is periodically scanning for
> idle-CPU and offline-CPU quiescent states, which means that its waits
> will be accompanied by short timeouts.  The "starved for 2603 jiffies"
> says that it has not run for one good long time.  The ->state is its
> task_struct ->state field.
>
> The immediately preceding dmesg line is as follows:
>
> [ 1602.063851]  (detected by 53, t=2603 jiffies, g=7275, c=7274, q=608)
>
> In other words, the rcu_sched grace-period kthread has been starved
> for the entire duration of the current grace period, as shown by the
> t=2603.
>
> Let's turn now to the trace output, looking for the last bit of the
> rcu_sched task's activity:
>
>        rcu_sched-9     [054] d...  1576.030096: timer_start: timer=c0000007fae1bc20 function=process_timeout expires=4295094922 [timeout=1] cpu=54 idx=0 flags=
>     ksoftirqd/53-276   [053] ..s.  1576.030097: rcu_invoke_callback: rcu_sched rhp=c000000fcf8c4eb0 func=__d_free
>        rcu_sched-9     [054] d...  1576.030097: rcu_utilization: Start context switch
>     ksoftirqd/53-276   [053] ..s.  1576.030098: rcu_invoke_callback: rcu_sched rhp=c000000fcff74ee0 func=proc_i_callback
>        rcu_sched-9     [054] d...  1576.030098: rcu_grace_period: rcu_sched 7275 cpuqs
>        rcu_sched-9     [054] d...  1576.030099: rcu_utilization: End context switch
>
> So this task set up a timer ("timer_start:") for one jiffy ("[timeout=1]",
> but what is with "expires=4295094922"?)

That's a good one.

I have HZ=100, and therefore:

INITIAL_JIFFIES = (1 << 32) - (300 * 100) = 4294937296

So the expires value of 4295094922 is:

4295094922 - 4294937296 = 157626

Jiffies since boot.

Or 1,576,260,000,000 ns == 1576.26 s.
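Checked mechanically (HZ and the expires value as stated above; the kernel initializes jiffies about 300 seconds before the 32-bit wrap so that wrap bugs surface early):

```python
HZ = 100
# jiffies starts ~300 s before the 32-bit wrap point.
INITIAL_JIFFIES = (1 << 32) - 300 * HZ          # 4294937296

expires = 4295094922                            # from the timer_start trace line
jiffies_since_boot = expires - INITIAL_JIFFIES  # 157626
seconds_since_boot = jiffies_since_boot / HZ    # 1576.26 s

print(jiffies_since_boot, seconds_since_boot)   # 157626 1576.26
```

So the expires value is entirely sane: it corresponds to 1576.26 s, one jiffy after the 1576.03 trace timestamp (to within the jiffy granularity).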

> Of course, the timer will have expired in the context of some other task,
> but a search for "c0000007fae1bc20" (see the "timer=" in the first trace
> line above) shows nothing (to be painfully accurate, the search wraps back
> to earlier uses of this timer by rcu_sched).  So the timer never did fire.

Or it just wasn't in the trace?

I'll try and get it to trace a bit longer and see if that is helpful.

cheers
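One hedged reading of the negative samples reported above (an assumption; the thread does not confirm it): on the early-exit path where the tick is not actually stopped, this version of tick_nohz_stop_sched_tick() can reach the patched return with tick == 0, in which case (tick - ktime_get()) / 1000 is simply minus the uptime in microseconds. The sample magnitudes track the trace timestamps closely, which is consistent with that reading:

```python
# Assumption: a negative sample is -(ktime_get()) / 1000, i.e. tick == 0.
trace_timestamp_s = 1602.039695   # trace timestamp of the CPU 15 sample
sample_us = -1602250019           # value that sample printed

implied_uptime_s = -sample_us / 1e6
# The implied uptime agrees with the trace clock to within a fraction of
# a second, as expected if tick was 0 on that path rather than a real
# half-hour sleep estimate.
print(round(implied_uptime_s - trace_timestamp_s, 3))  # ~0.21
```

Under that reading the negative values are harmless noise from CPUs that kept their tick, and only the positive outliers (like the 34-second estimate on CPU 54) matter.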

Patch

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index c7a899c5ce64..7358a5073dfb 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -817,6 +817,7 @@  static ktime_t tick_nohz_stop_sched_tick(struct tick_sched *ts,
 	 * (not only the tick).
 	 */
 	ts->sleep_length = ktime_sub(dev->next_event, now);
+	trace_printk("tick_nohz_stop_sched_tick: %lld\n", (tick - ktime_get()) / 1000);
 	return tick;
 }