Message ID | 20180508065854.23064-1-npiggin@gmail.com |
---|---|
State | Accepted |
Headers | show |
Series | core/console: fix deadlock when printing with console lock held | expand |
This is unchanged since the RFC, passed some testing: `pdbg stop` a thread, then enter xmon in the host which cases it to NMI IPI all other CPUs. Result now is that it skips the stopped thread and it simply doesn't make it into xmon. pdbg cont resumes as normal, (but it looks like the thread doesn't receive its doorbell to get into xmon). Sending it a sreset with pdbg kicks it into xmon, and then resuming from xmon seems to result in the system recovering fine. Without this patch, the thread will be given the sreset, which can have undefined consequences if we're currently ramming or in some other stat. At best it will kick the thread out of quiesce and we lose our pdbg debugging session, at worst probably checkstop the box. These fixes should not affect non-error paths, or fix the timeout setting speial wakeup, or other errors we've often seen. Thanks, Nick Nicholas Piggin (2): core/direct-controls: fix p9_cont_thread for stopped/inactive threads core/direct-controls: improve p9_stop_thread error handling core/direct-controls.c | 101 ++++++++++++++++++++++++++++++----------- 1 file changed, 75 insertions(+), 26 deletions(-)
Uh, sorry ignore this, I messed up my git send. The console deadlock
patch is good.
On Tue, 8 May 2018 16:58:54 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:
> This is unchanged since the RFC, passed some testing: `pdbg stop`
On 08/05/18 16:58, Nicholas Piggin wrote: > Some debugging options will print while the console lock is held, > which is why the console lock is taken as a recursive lock. > However console_write calls __flush_console, which will drop and > re-take the lock non-recursively in some cases. > > Just set con_need_flush and return from __flush_console if we are > holding the console lock already. > > This stack usage message (taken with this patch applied) could lead > to a deadlock without this: > > CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0 > CPU 0000 Backtrace: > S: 0000000031c03370 R: 00000000300cb808 .list_check_node+0x1c > S: 0000000031c03410 R: 00000000300cb910 .list_check+0x38 > S: 0000000031c034b0 R: 00000000300190ac .try_lock_caller+0xb8 > S: 0000000031c03540 R: 00000000300192e0 .lock_caller+0x80 > S: 0000000031c03600 R: 0000000030012c70 .__flush_console+0x134 > S: 0000000031c036d0 R: 00000000300130cc .console_write+0x68 > S: 0000000031c03780 R: 00000000300347bc .vprlog+0xc8 > S: 0000000031c03970 R: 0000000030034844 ._prlog+0x50 > S: 0000000031c03a00 R: 00000000300364a4 .log_simple_error+0x74 > S: 0000000031c03b90 R: 000000003004ab48 .occ_pstates_init+0x184 > S: 0000000031c03d50 R: 000000003001480c .load_and_boot_kernel+0x38c > S: 0000000031c03e30 R: 000000003001571c .main_cpu_entry+0x62c > S: 0000000031c03f00 R: 0000000030002700 boot_entry+0x1c0 > > Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> Thanks Nick! I'm not sure if it solves whatever it was that I was hitting, with this patch applied the machine hangs but doesn't print a deadlock detection error. I'll need to do some further digging I think.
Nicholas Piggin <npiggin@gmail.com> writes: > Some debugging options will print while the console lock is held, > which is why the console lock is taken as a recursive lock. > However console_write calls __flush_console, which will drop and > re-take the lock non-recursively in some cases. > > Just set con_need_flush and return from __flush_console if we are > holding the console lock already. > > This stack usage message (taken with this patch applied) could lead > to a deadlock without this: > > CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0 > CPU 0000 Backtrace: > S: 0000000031c03370 R: 00000000300cb808 .list_check_node+0x1c > S: 0000000031c03410 R: 00000000300cb910 .list_check+0x38 > S: 0000000031c034b0 R: 00000000300190ac .try_lock_caller+0xb8 > S: 0000000031c03540 R: 00000000300192e0 .lock_caller+0x80 > S: 0000000031c03600 R: 0000000030012c70 .__flush_console+0x134 > S: 0000000031c036d0 R: 00000000300130cc .console_write+0x68 > S: 0000000031c03780 R: 00000000300347bc .vprlog+0xc8 > S: 0000000031c03970 R: 0000000030034844 ._prlog+0x50 > S: 0000000031c03a00 R: 00000000300364a4 .log_simple_error+0x74 > S: 0000000031c03b90 R: 000000003004ab48 .occ_pstates_init+0x184 > S: 0000000031c03d50 R: 000000003001480c .load_and_boot_kernel+0x38c > S: 0000000031c03e30 R: 000000003001571c .main_cpu_entry+0x62c > S: 0000000031c03f00 R: 0000000030002700 boot_entry+0x1c0 > > Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> > Signed-off-by: Nicholas Piggin <npiggin@gmail.com> > --- > core/console.c | 12 ++++++++---- > 1 file changed, 8 insertions(+), 4 deletions(-) Merged to master as of 277615348ba64dbdff426206d1b6a619811163b1
diff --git a/core/console.c b/core/console.c index b9129c9f..e8256758 100644 --- a/core/console.c +++ b/core/console.c @@ -100,7 +100,7 @@ void clear_console(void) * Optionally can skip flushing to drivers, leaving messages * just in memory console. */ -static bool __flush_console(bool flush_to_drivers) +static bool __flush_console(bool flush_to_drivers, bool need_unlock) { struct cpu_thread *cpu = this_cpu(); size_t req, len = 0; @@ -114,8 +114,12 @@ static bool __flush_console(bool flush_to_drivers) * Console flushing is suspended on this CPU, typically because * some critical locks are held that would potentially cause a * flush to deadlock + * + * Also if it recursed on con_lock (need_unlock is false). This + * can happen due to debug code firing (e.g., list or stack + * debugging). */ - if (cpu->con_suspend) { + if (cpu->con_suspend || !need_unlock) { cpu->con_need_flush = true; return false; } @@ -176,7 +180,7 @@ bool flush_console(void) bool ret; lock(&con_lock); - ret = __flush_console(true); + ret = __flush_console(true, true); unlock(&con_lock); return ret; @@ -250,7 +254,7 @@ ssize_t console_write(bool flush_to_drivers, const void *buf, size_t count) write_char(c); } - __flush_console(flush_to_drivers); + __flush_console(flush_to_drivers, need_unlock); if (need_unlock) unlock(&con_lock);
Some debugging options will print while the console lock is held, which is why the console lock is taken as a recursive lock. However console_write calls __flush_console, which will drop and re-take the lock non-recursively in some cases. Just set con_need_flush and return from __flush_console if we are holding the console lock already. This stack usage message (taken with this patch applied) could lead to a deadlock without this: CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0 CPU 0000 Backtrace: S: 0000000031c03370 R: 00000000300cb808 .list_check_node+0x1c S: 0000000031c03410 R: 00000000300cb910 .list_check+0x38 S: 0000000031c034b0 R: 00000000300190ac .try_lock_caller+0xb8 S: 0000000031c03540 R: 00000000300192e0 .lock_caller+0x80 S: 0000000031c03600 R: 0000000030012c70 .__flush_console+0x134 S: 0000000031c036d0 R: 00000000300130cc .console_write+0x68 S: 0000000031c03780 R: 00000000300347bc .vprlog+0xc8 S: 0000000031c03970 R: 0000000030034844 ._prlog+0x50 S: 0000000031c03a00 R: 00000000300364a4 .log_simple_error+0x74 S: 0000000031c03b90 R: 000000003004ab48 .occ_pstates_init+0x184 S: 0000000031c03d50 R: 000000003001480c .load_and_boot_kernel+0x38c S: 0000000031c03e30 R: 000000003001571c .main_cpu_entry+0x62c S: 0000000031c03f00 R: 0000000030002700 boot_entry+0x1c0 Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com> Signed-off-by: Nicholas Piggin <npiggin@gmail.com> --- core/console.c | 12 ++++++++---- 1 file changed, 8 insertions(+), 4 deletions(-)