
Message ID 1649818529.j46672mh2p.astroid@bobo.none (mailing list archive)
State Handled Elsewhere

Commit Message

Nicholas Piggin April 13, 2022, 5:11 a.m. UTC
+Daniel, Thomas, Viresh

Subject: Re: rcu_sched self-detected stall on CPU

Excerpts from Michael Ellerman's message of April 9, 2022 12:42 am:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> "Paul E. McKenney" <paulmck@kernel.org> writes:
>>> On Wed, Apr 06, 2022 at 05:31:10PM +0800, Zhouyi Zhou wrote:
>>>> Hi
>>>> 
>>>> I can reproduce it in a ppc virtual cloud server provided by Oregon
>>>> State University.  Following is what I do:
>>>> 1) curl -l https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/snapshot/linux-5.18-rc1.tar.gz
>>>> -o linux-5.18-rc1.tar.gz
>>>> 2) tar zxf linux-5.18-rc1.tar.gz
>>>> 3) cp config linux-5.18-rc1/.config
>>>> 4) cd linux-5.18-rc1
>>>> 5) make vmlinux -j 8
>>>> 6) qemu-system-ppc64 -kernel vmlinux -nographic -vga none -no-reboot
>>>> -smp 2 (QEMU 4.2.1)
>>>> 7) after 12 rounds, the bug got reproduced:
>>>> (http://154.223.142.244/logs/20220406/qemu.log.txt)
>>>
>>> Just to make sure, are you both seeing the same thing?  Last I knew,
>>> Zhouyi was chasing an RCU-tasks issue that appears only in kernels
>>> built with CONFIG_PROVE_RCU=y, which Miguel does not have set.  Or did
>>> I miss something?
>>>
>>> Miguel is instead seeing an RCU CPU stall warning where RCU's grace-period
>>> kthread slept for three milliseconds, but did not wake up for more than
>>> 20 seconds.  This kthread would normally have awakened on CPU 1, but
>>> CPU 1 looks to me to be very unhealthy, as can be seen in your console
>>> output below (but maybe my idea of what is healthy for powerpc systems
>>> is outdated).  Please see also the inline annotations.
>>>
>>> Thoughts from the PPC guys?
>>
>> I haven't seen it in my testing. But using Miguel's config I can
>> reproduce it seemingly on every boot.
>>
>> For me it bisects to:
>>
>>   35de589cb879 ("powerpc/time: improve decrementer clockevent processing")
>>
>> Which seems plausible.
>>
>> Reverting that on mainline makes the bug go away.
>>
>> I don't see an obvious bug in the diff, but I could be wrong, or the old
>> code was papering over an existing bug?
>>
>> I'll try and work out what it is about Miguel's config that exposes
>> this vs our defconfig, that might give us a clue.
> 
> It's CONFIG_HIGH_RES_TIMERS=n which triggers the stall.
> 
> I can reproduce just with:
> 
>   $ make ppc64le_guest_defconfig
>   $ ./scripts/config -d HIGH_RES_TIMERS
> 
> We have no defconfigs that disable HIGH_RES_TIMERS, I didn't even
> realise you could disable it TBH :)
> 
> The Rust CI has it disabled because I copied that from the x86 defconfig
> they were using back when I added the Rust support. I think that was
> meant to be a stripped down fast config for CI, but the result is it's
> just using a badly tested combination which is not helpful.
> 
> So I'll send a patch to turn HIGH_RES_TIMERS on for the Rust CI, and we
> can debug this further without blocking them.

So we traced the problem down to what looks like a misunderstanding
between the decrementer clock event device and the core code.

The decrementer is only oneshot*ish*. It actually needs to either be
reprogrammed or shut down, otherwise it just continues to raise
interrupts.

Before commit 35de589cb879, it was sort of two-shot. The initial
interrupt at the programmed time would set the driver's internal next_tb
variable to ~0 and call ->event_handler(). If that handler did not
set_next_event or stop the timer, the interrupt would fire again
immediately, notice next_tb is ~0, and only then stop the decrementer
interrupt.
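
Roughly, the old logic looked like this (a paraphrased sketch of the
arch/powerpc/kernel/time.c logic, from memory, not the literal code):

static void timer_interrupt_sketch(struct clock_event_device *evt)
{
	u64 *next_tb = this_cpu_ptr(&decrementers_next_tb);
	u64 now = get_tb();

	if (now >= *next_tb) {
		/* Shot 1: mark "nothing programmed", run the handler. */
		*next_tb = ~(u64)0;
		if (evt->event_handler)
			evt->event_handler(evt);
	} else {
		/*
		 * Reprogram toward next_tb. If the handler left next_tb
		 * at ~0, the delta clamps to decrementer_max, which is
		 * what finally silences the interrupt (shot 2).
		 */
		u64 delta = *next_tb - now;
		set_dec(delta < decrementer_max ? delta : decrementer_max);
	}
}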

So that was already kind of ugly, and commit 35de589cb879 turned it into
a hang.

The problem happens when the tick is stopped with an event still
pending: tick_nohz_handler() is called, but it bails out because
tick_stopped == 1, so the device never gets programmed again and the
interrupt keeps firing.
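
That is, the tail of tick_nohz_handler() currently does (see the patch
at the end for the exact context):

	tick_sched_do_timer(ts, now);
	tick_sched_handle(ts, regs);

	/* No need to reprogram if we are running tickless  */
	if (unlikely(ts->tick_stopped))
		return;	/* on powerpc the pending decrementer keeps firing */

	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);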

How to fix it? Before commit a7cba02deced, powerpc's decrementer was
truly oneshot, but we would like to avoid returning to that because it
requires additional programming of the hardware on every timer
interrupt. We have the ONESHOT_STOPPED state, which seems to be just
about what we want.
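
If I read kernel/time/tick-oneshot.c right, the core already enters that
state when an event is programmed for KTIME_MAX (sketched from memory,
not verbatim):

int tick_program_event(ktime_t expires, int force)
{
	struct clock_event_device *dev = __this_cpu_read(tick_cpu_device.evtdev);

	if (unlikely(expires == KTIME_MAX)) {
		/* The device is not needed any more: stop it. */
		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT_STOPPED);
		return 0;
	}

	if (unlikely(clockevent_state_oneshot_stopped(dev))) {
		/* Waking from stopped: switch back to ONESHOT first. */
		clockevents_switch_state(dev, CLOCK_EVT_STATE_ONESHOT);
	}

	return clockevents_program_event(dev, expires, force);
}

That is why the patch below programs KTIME_MAX from the stopped-tick
path.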

Did the ONESHOT_STOPPED patch just miss this case, or is there a reason
we don't stop the device here? The patch below seems to fix the hang
(though it has not been heavily tested).
 
Thanks,
Nick

---

Comments

Nicholas Piggin April 13, 2022, 6:10 a.m. UTC | #1
Oops, fixed subject...

Excerpts from Nicholas Piggin's message of April 13, 2022 3:11 pm:
> [full quote of the message above, including the patch, trimmed]
Paul E. McKenney April 14, 2022, 5:15 p.m. UTC | #2
On Wed, Apr 13, 2022 at 04:10:02PM +1000, Nicholas Piggin wrote:
> [quoted message and patch trimmed]

This looks plausible to me based on my interactions with ticks, but it
would be good to have someone who understands that code better than I
do look it over.

							Thanx, Paul

Thomas Gleixner April 22, 2022, 3:53 p.m. UTC | #3
On Wed, Apr 13 2022 at 15:11, Nicholas Piggin wrote:
> So we traced the problem down to what looks like a misunderstanding
> between the decrementer clock event device and the core code.
>
> The decrementer is only oneshot*ish*. It actually needs to either be
> reprogrammed or shut down, otherwise it just continues to raise
> interrupts.

I always thought that PPC had sane timers. That's really disillusioning.

> Before commit 35de589cb879, it was sort of two-shot. The initial
> interrupt at the programmed time would set the driver's internal
> next_tb variable to ~0 and call ->event_handler(). If that handler did
> not set_next_event or stop the timer, the interrupt would fire again
> immediately, notice next_tb is ~0, and only then stop the decrementer
> interrupt.
>
> So that was already kind of ugly, and commit 35de589cb879 turned it
> into a hang.
>
> The problem happens when the tick is stopped with an event still
> pending: tick_nohz_handler() is called, but it bails out because
> tick_stopped == 1, so the device never gets programmed again and the
> interrupt keeps firing.
>
> How to fix it? Before commit a7cba02deced, powerpc's decrementer was
> truly oneshot, but we would like to avoid returning to that because it
> requires additional programming of the hardware on every timer
> interrupt. We have the ONESHOT_STOPPED state, which seems to be just
> about what we want.
>
> Did the ONESHOT_STOPPED patch just miss this case, or is there a reason
> we don't stop the device here? The patch below seems to fix the hang
> (though it has not been heavily tested).

This was definitely overlooked, but it's arguable that it is not
required for real oneshot clockevent devices. This should only handle
the case where the interrupt was already pending.

The ONESHOT_STOPPED state was introduced to handle the case where the
last timer gets canceled, so the already programmed event does not fire.

It was not necessarily meant to "fix" clockevent devices which are
pretending to be ONESHOT, but keep firing over and over.

That said, I'm fine with the change, along with a big fat comment
explaining why this is required.

Thanks,

        tglx
Nicholas Piggin April 23, 2022, 2:29 a.m. UTC | #4
Excerpts from Thomas Gleixner's message of April 23, 2022 1:53 am:
> On Wed, Apr 13 2022 at 15:11, Nicholas Piggin wrote:
>> So we traced the problem down to what looks like a misunderstanding
>> between the decrementer clock event device and the core code.
>>
>> The decrementer is only oneshot*ish*. It actually needs to either be
>> reprogrammed or shut down, otherwise it just continues to raise
>> interrupts.
> 
> I always thought that PPC had sane timers. That's really disillusioning.

My comment was probably a misleading explanation of the whole
situation. This weirdness is actually in software, in the powerpc clock
event driver, due to a recent change I made that assumes the clock event
device is put into the oneshot-stopped state.

The hardware is relatively sane, I think: a globally synchronized,
constant-rate, high-frequency clock is distributed to the CPUs, so reads
don't go off-core, and each CPU has a "decrementer" event interrupt at
the same frequency as the clock -- program it with a positive value and
it counts down to zero, then raises what is basically a level-triggered
interrupt.

Before my change, the decrementer interrupt would always clear the
interrupt on entry. The event_handler usually programs another timer
anyway, so I tried to avoid that initial clear, counting on the
oneshot_stopped callback to clear the interrupt when no other timer is
set.
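
Concretely, the driver side looks something like this (a simplified
sketch of the arrangement in arch/powerpc/kernel/time.c, not the exact
code):

/* "Stopping" the decrementer just means programming its maximum
 * count, which also clears the effectively level-triggered interrupt. */
static int decrementer_shutdown(struct clock_event_device *dev)
{
	decrementer_set_next_event(decrementer_max, dev);
	return 0;
}

static struct clock_event_device decrementer_clockevent = {
	.name			= "decrementer",
	.features		= CLOCK_EVT_FEAT_ONESHOT,
	.set_next_event		= decrementer_set_next_event,
	.set_state_shutdown	= decrementer_shutdown,
	.set_state_oneshot_stopped = decrementer_shutdown,
};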

>> [earlier quoted context trimmed]
> 
> This was definitely overlooked, but it's arguable that it is not
> required for real oneshot clockevent devices. This should only handle
> the case where the interrupt was already pending.
> 
> The ONESHOT_STOPPED state was introduced to handle the case where the
> last timer gets canceled, so the already programmed event does not fire.
> 
> It was not necessarily meant to "fix" clockevent devices which are
> pretending to be ONESHOT, but keep firing over and over.
> 
> That said, I'm fine with the change, along with a big fat comment
> explaining why this is required.

Thanks for taking a look and confirming. I just sent a patch with a
comment and what looks like another missed case. Hopefully it's okay.

Thanks,
Nick

Patch

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 2d76c91b85de..7e13a55b6b71 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -1364,9 +1364,11 @@ static void tick_nohz_handler(struct clock_event_device *dev)
 	tick_sched_do_timer(ts, now);
 	tick_sched_handle(ts, regs);
 
-	/* No need to reprogram if we are running tickless  */
-	if (unlikely(ts->tick_stopped))
+	if (unlikely(ts->tick_stopped)) {
+		/* If we are tickless, change the clock event to stopped */
+		tick_program_event(KTIME_MAX, 1);
 		return;
+	}
 
 	hrtimer_forward(&ts->sched_timer, now, TICK_NSEC);
 	tick_program_event(hrtimer_get_expires(&ts->sched_timer), 1);