diff mbox series

[v1] powerpc/watchdog: help remote CPUs to flush NMI printk output

Message ID 20211110030416.2085103-1-npiggin@gmail.com (mailing list archive)
State Superseded
Headers show
Series [v1] powerpc/watchdog: help remote CPUs to flush NMI printk output | expand

Commit Message

Nicholas Piggin Nov. 10, 2021, 3:04 a.m. UTC
The printk layer at the moment does not seem to have a good way to force
flush printk messages that are created in NMI context, except in the
panic path.

NMI-context printk messages normally get to the console with irq_work,
but that won't help if the CPU is stuck with irqs disabled, as can be
the case for hard lockup watchdog messages.

The watchdog currently flushes the printk buffers after detecting a
lockup on remote CPUs, but they may not have processed their NMI IPI
yet by that stage, or they may have self-detected a lockup in which
case they won't go via this NMI IPI path.

Improve the situation by having NMI-context mark a flag if it called
printk, and have watchdog timer interrupts check if that flag was set
and try to flush if it was. Latency is not a big problem because we
were already stuck for a while, just need to try to make sure the
messages eventually make it out.

Cc: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
This patch is actually based on top of this one which is planned to go
upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2

Prior to commit 93d102f094be that is fixed by the above, we had a printk
flush function with a different name but basically does the same job, so
this patch can be backported, just needs some care. I'm posting it for
review now for feedback so it's ready to go when the printk patch is
upstream.

Thanks,
Nick

 arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------
 1 file changed, 23 insertions(+), 6 deletions(-)

Comments

Daniel Axtens Nov. 12, 2021, 6:08 a.m. UTC | #1
Hi,

> The printk layer at the moment does not seem to have a good way to force
> flush printk messages that are created in NMI context, except in the
> panic path.
>
> NMI-context printk messages normally get to the console with irq_work,
> but that won't help if the CPU is stuck with irqs disabled, as can be
> the case for hard lockup watchdog messages.
>
> The watchdog currently flushes the printk buffers after detecting a
> lockup on remote CPUs, but they may not have processed their NMI IPI
> yet by that stage, or they may have self-detected a lockup in which
> case they won't go via this NMI IPI path.
>
> Improve the situation by having NMI-context mark a flag if it called
> printk, and have watchdog timer interrupts check if that flag was set
> and try to flush if it was. Latency is not a big problem because we
> were already stuck for a while, just need to try to make sure the
> messages eventually make it out.

Initially I was surprised that this doesn't affect the printk code
itself, just the powerpc code...

>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> This patch is actually based on top of this one which is planned to go
> upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2
>
> Prior to commit 93d102f094be that is fixed by the above, we had a printk
> flush function with a different name but basically does the same job, so
> this patch can be backported, just needs some care. I'm posting it for
> review now for feedback so it's ready to go when the printk patch is
> upstream.
>
> Thanks,
> Nick
>
>  arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------
>  1 file changed, 23 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
> index b6533539386b..a7b6b0691203 100644
> --- a/arch/powerpc/kernel/watchdog.c
> +++ b/arch/powerpc/kernel/watchdog.c
> @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb);
>  /* SMP checker bits */
>  static unsigned long __wd_smp_lock;
>  static unsigned long __wd_reporting;
> +static unsigned long __wd_nmi_output;
>  static cpumask_t wd_smp_cpus_pending;
>  static cpumask_t wd_smp_cpus_stuck;
>  static u64 wd_smp_last_reset_tb;
> @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs)
>  	else
>  		dump_stack();
>  
> +	/*
> +	 * We printk()ed from NMI context, the contents may not get flushed
> +	 * if we return to a context with interrupts disabled because
> +	 * printk uses irq_work to schedule flushes of NMI output.
> +	 * __wd_nmi_output says there has been output from NMI context, so
> +	 * other CPUs are recruited to help flush it.
> +	 *
> +	 * xchg is not needed here (it could be a simple atomic store), but
> +	 * it gives the memory ordering and atomicity required.
> +	 */
> +	xchg(&__wd_nmi_output, 1);
> +
>  	/* Do not panic from here because that can recurse into NMI IPI layer */
>  }

I think, looking at this and the other site where __wd_nmi_output is
set, that this works because you set the flag only when you are done
printing from the non-panic lockup context on this CPU. I was initially
worried that you set this flag part way through printing, and then it
might get cleared by another CPU while you're still trying to print.
However, in this function it's right at the end - there's nothing else
left to do, and ...

>  DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
> @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>  		print_irqtrace_events(current);
>  		show_regs(regs);
>  
> +		xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi
> +
>  		if (sysctl_hardlockup_all_cpu_backtrace)
>  			trigger_allbutself_cpu_backtrace();

in this one, the only things that can happen afterwards are
 - a panic, which does its own flushing, and

- trigger_allbutself_cpu_backtrace(), which seems to just send IPIs, not
 do any printing of its own.

All of which is fine, but I wonder if we need a more descriptive
variable name or if the comment needs to specify that the flag should
only be set at certain times.

Kind regards,
Daniel
Nicholas Piggin Nov. 19, 2021, 9:15 a.m. UTC | #2
Excerpts from Daniel Axtens's message of November 12, 2021 4:08 pm:
> Hi,
> 
>> The printk layer at the moment does not seem to have a good way to force
>> flush printk messages that are created in NMI context, except in the
>> panic path.
>>
>> NMI-context printk messages normally get to the console with irq_work,
>> but that won't help if the CPU is stuck with irqs disabled, as can be
>> the case for hard lockup watchdog messages.
>>
>> The watchdog currently flushes the printk buffers after detecting a
>> lockup on remote CPUs, but they may not have processed their NMI IPI
>> yet by that stage, or they may have self-detected a lockup in which
>> case they won't go via this NMI IPI path.
>>
>> Improve the situation by having NMI-context mark a flag if it called
>> printk, and have watchdog timer interrupts check if that flag was set
>> and try to flush if it was. Latency is not a big problem because we
>> were already stuck for a while, just need to try to make sure the
>> messages eventually make it out.

Sorry just coming back to this now the printk fix was merged upstream.

> Initially I was surprised that this doesn't affect the printk code
> itself, just the powerpc code...

Yeah I'm actually not sure how other watchdogs handle this. If they
use nmi_trigger_cpumask_backtrace() then that does have a flush.

>> Cc: Laurent Dufour <ldufour@linux.ibm.com>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> This patch is actually based on top of this one which is planned to go
>> upstream in rc1/2. https://marc.info/?l=linux-kernel&m=163626070312052&w=2
>>
>> Prior to commit 93d102f094be that is fixed by the above, we had a printk
>> flush function with a different name but basically does the same job, so
>> this patch can be backported, just needs some care. I'm posting it for
>> review now for feedback so it's ready to go when the printk patch is
>> upstream.
>>
>> Thanks,
>> Nick
>>
>>  arch/powerpc/kernel/watchdog.c | 29 +++++++++++++++++++++++------
>>  1 file changed, 23 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
>> index b6533539386b..a7b6b0691203 100644
>> --- a/arch/powerpc/kernel/watchdog.c
>> +++ b/arch/powerpc/kernel/watchdog.c
>> @@ -86,6 +86,7 @@ static DEFINE_PER_CPU(u64, wd_timer_tb);
>>  /* SMP checker bits */
>>  static unsigned long __wd_smp_lock;
>>  static unsigned long __wd_reporting;
>> +static unsigned long __wd_nmi_output;
>>  static cpumask_t wd_smp_cpus_pending;
>>  static cpumask_t wd_smp_cpus_stuck;
>>  static u64 wd_smp_last_reset_tb;
>> @@ -154,6 +155,18 @@ static void wd_lockup_ipi(struct pt_regs *regs)
>>  	else
>>  		dump_stack();
>>  
>> +	/*
>> +	 * We printk()ed from NMI context, the contents may not get flushed
>> +	 * if we return to a context with interrupts disabled because
>> +	 * printk uses irq_work to schedule flushes of NMI output.
>> +	 * __wd_nmi_output says there has been output from NMI context, so
>> +	 * other CPUs are recruited to help flush it.
>> +	 *
>> +	 * xchg is not needed here (it could be a simple atomic store), but
>> +	 * it gives the memory ordering and atomicity required.
>> +	 */
>> +	xchg(&__wd_nmi_output, 1);
>> +
>>  	/* Do not panic from here because that can recurse into NMI IPI layer */
>>  }
> 
> I think, looking at this and the other site where __wd_nmi_output is
> set, that this works because you set the flag only when you are done
> printing from the non-panic lockup context on this CPU. I was initially
> worried that you set this flag part way through printing, and then it
> might get cleared by another CPU while you're still trying to print.
> However, in this function it's right at the end - there's nothing else
> left to do, and ...
> 
>>  DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>> @@ -386,6 +401,8 @@ DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
>>  		print_irqtrace_events(current);
>>  		show_regs(regs);
>>  
>> +		xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi
>> +
>>  		if (sysctl_hardlockup_all_cpu_backtrace)
>>  			trigger_allbutself_cpu_backtrace();
> 
> in this one, the only things that can happen afterwards are
>  - a panic, which does its own flushing, and
> 
> - trigger_allbutself_cpu_backtrace(), which seems to just send IPIs, not
>  do any printing of its own.

Yeah, on powerpc that actually ends up not using NMI IPIs because that's 
dangerous and core code uses it for less severe things than a hard 
lockup so we disabled it, so if you can take a regular IPI then you can
run irq work when it returns and print things. If we changed it to use
NMI IPIs I guess we might want to do something, maybe.

> All of which is fine, but I wonder if we need a more descriptive
> variable name or if the comment needs to specify that the flag should
> only be set at certain times.

I did try to add some explanation in that comment. I'll see if that can 
be made a bit more readable and expand on it a bit.

Thanks,
Nick
diff mbox series

Patch

diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index b6533539386b..a7b6b0691203 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -86,6 +86,7 @@  static DEFINE_PER_CPU(u64, wd_timer_tb);
 /* SMP checker bits */
 static unsigned long __wd_smp_lock;
 static unsigned long __wd_reporting;
+static unsigned long __wd_nmi_output;
 static cpumask_t wd_smp_cpus_pending;
 static cpumask_t wd_smp_cpus_stuck;
 static u64 wd_smp_last_reset_tb;
@@ -154,6 +155,18 @@  static void wd_lockup_ipi(struct pt_regs *regs)
 	else
 		dump_stack();
 
+	/*
+	 * We printk()ed from NMI context, the contents may not get flushed
+	 * if we return to a context with interrupts disabled because
+	 * printk uses irq_work to schedule flushes of NMI output.
+	 * __wd_nmi_output says there has been output from NMI context, so
+	 * other CPUs are recruited to help flush it.
+	 *
+	 * xchg is not needed here (it could be a simple atomic store), but
+	 * it gives the memory ordering and atomicity required.
+	 */
+	xchg(&__wd_nmi_output, 1);
+
 	/* Do not panic from here because that can recurse into NMI IPI layer */
 }
 
@@ -227,12 +240,6 @@  static void watchdog_smp_panic(int cpu)
 		cpumask_clear(&wd_smp_cpus_ipi);
 	}
 
-	/*
-	 * Force flush any remote buffers that might be stuck in IRQ context
-	 * and therefore could not run their irq_work.
-	 */
-	printk_trigger_flush();
-
 	if (hardlockup_panic)
 		nmi_panic(NULL, "Hard LOCKUP");
 
@@ -337,6 +344,14 @@  static void watchdog_timer_interrupt(int cpu)
 
 	if ((s64)(tb - wd_smp_last_reset_tb) >= (s64)wd_smp_panic_timeout_tb)
 		watchdog_smp_panic(cpu);
+
+	if (__wd_nmi_output && xchg(&__wd_nmi_output, 0)) {
+		/*
+		 * This triggers a flush to try get stuck NMI printk output
+		 * from remote CPUs out to the console. See wd_lockup_ipi.
+		 */
+		printk_trigger_flush();
+	}
 }
 
 DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
@@ -386,6 +401,8 @@  DEFINE_INTERRUPT_HANDLER_NMI(soft_nmi_interrupt)
 		print_irqtrace_events(current);
 		show_regs(regs);
 
+		xchg(&__wd_nmi_output, 1); // see wd_lockup_ipi
+
 		if (sysctl_hardlockup_all_cpu_backtrace)
 			trigger_allbutself_cpu_backtrace();