core/console: fix deadlock when printing with console lock held

Message ID 20180508065854.23064-1-npiggin@gmail.com
State Accepted
Headers show
Series
  • core/console: fix deadlock when printing with console lock held
Related show

Commit Message

Nicholas Piggin May 8, 2018, 6:58 a.m.
Some debugging options will print while the console lock is held,
which is why the console lock is taken as a recursive lock.
However console_write calls __flush_console, which will drop and
re-take the lock non-recursively in some cases.

Just set con_need_flush and return from __flush_console if we are
holding the console lock already.

This stack usage message (taken with this patch applied) could lead
to a deadlock without this:

CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
CPU 0000 Backtrace:
S: 0000000031c03370 R: 00000000300cb808   .list_check_node+0x1c
S: 0000000031c03410 R: 00000000300cb910   .list_check+0x38
S: 0000000031c034b0 R: 00000000300190ac   .try_lock_caller+0xb8
S: 0000000031c03540 R: 00000000300192e0   .lock_caller+0x80
S: 0000000031c03600 R: 0000000030012c70   .__flush_console+0x134
S: 0000000031c036d0 R: 00000000300130cc   .console_write+0x68
S: 0000000031c03780 R: 00000000300347bc   .vprlog+0xc8
S: 0000000031c03970 R: 0000000030034844   ._prlog+0x50
S: 0000000031c03a00 R: 00000000300364a4   .log_simple_error+0x74
S: 0000000031c03b90 R: 000000003004ab48   .occ_pstates_init+0x184
S: 0000000031c03d50 R: 000000003001480c   .load_and_boot_kernel+0x38c
S: 0000000031c03e30 R: 000000003001571c   .main_cpu_entry+0x62c
S: 0000000031c03f00 R: 0000000030002700   boot_entry+0x1c0

Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 core/console.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

Comments

Nicholas Piggin May 8, 2018, 6:58 a.m. | #1
This is unchanged since the RFC, passed some testing: `pdbg stop`
a thread, then enter xmon in the host which cases it to NMI IPI
all other CPUs. Result now is that it skips the stopped thread and
it simply doesn't make it into xmon. pdbg cont resumes as normal,
(but it looks like the thread doesn't receive its doorbell to get
into xmon). Sending it a sreset with pdbg kicks it into xmon, and
then resuming from xmon seems to result in the system recovering
fine.

Without this patch, the thread will be given the sreset, which
can have undefined consequences if we're currently ramming or
in some other stat. At best it will kick the thread out of quiesce
and we lose our pdbg debugging session, at worst probably checkstop
the box.

These fixes should not affect non-error paths, or fix the timeout
setting speial wakeup, or other errors we've often seen.

Thanks,
Nick

Nicholas Piggin (2):
  core/direct-controls: fix p9_cont_thread for stopped/inactive threads
  core/direct-controls: improve p9_stop_thread error handling

 core/direct-controls.c | 101 ++++++++++++++++++++++++++++++-----------
 1 file changed, 75 insertions(+), 26 deletions(-)
Nicholas Piggin May 8, 2018, 7 a.m. | #2
Uh, sorry ignore this, I messed up my git send. The console deadlock
patch is good.

On Tue,  8 May 2018 16:58:54 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> This is unchanged since the RFC, passed some testing: `pdbg stop`
Andrew Donnellan May 9, 2018, 4:12 a.m. | #3
On 08/05/18 16:58, Nicholas Piggin wrote:
> Some debugging options will print while the console lock is held,
> which is why the console lock is taken as a recursive lock.
> However console_write calls __flush_console, which will drop and
> re-take the lock non-recursively in some cases.
> 
> Just set con_need_flush and return from __flush_console if we are
> holding the console lock already.
> 
> This stack usage message (taken with this patch applied) could lead
> to a deadlock without this:
> 
> CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
> CPU 0000 Backtrace:
> S: 0000000031c03370 R: 00000000300cb808   .list_check_node+0x1c
> S: 0000000031c03410 R: 00000000300cb910   .list_check+0x38
> S: 0000000031c034b0 R: 00000000300190ac   .try_lock_caller+0xb8
> S: 0000000031c03540 R: 00000000300192e0   .lock_caller+0x80
> S: 0000000031c03600 R: 0000000030012c70   .__flush_console+0x134
> S: 0000000031c036d0 R: 00000000300130cc   .console_write+0x68
> S: 0000000031c03780 R: 00000000300347bc   .vprlog+0xc8
> S: 0000000031c03970 R: 0000000030034844   ._prlog+0x50
> S: 0000000031c03a00 R: 00000000300364a4   .log_simple_error+0x74
> S: 0000000031c03b90 R: 000000003004ab48   .occ_pstates_init+0x184
> S: 0000000031c03d50 R: 000000003001480c   .load_and_boot_kernel+0x38c
> S: 0000000031c03e30 R: 000000003001571c   .main_cpu_entry+0x62c
> S: 0000000031c03f00 R: 0000000030002700   boot_entry+0x1c0
> 
> Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Thanks Nick! I'm not sure if it solves whatever it was that I was 
hitting, with this patch applied the machine hangs but doesn't print a 
deadlock detection error. I'll need to do some further digging I think.
Stewart Smith May 25, 2018, 12:24 a.m. | #4
Nicholas Piggin <npiggin@gmail.com> writes:
> Some debugging options will print while the console lock is held,
> which is why the console lock is taken as a recursive lock.
> However console_write calls __flush_console, which will drop and
> re-take the lock non-recursively in some cases.
>
> Just set con_need_flush and return from __flush_console if we are
> holding the console lock already.
>
> This stack usage message (taken with this patch applied) could lead
> to a deadlock without this:
>
> CPU 0000 lowest stack mark 11768 bytes left pc=300cb808 token=0
> CPU 0000 Backtrace:
> S: 0000000031c03370 R: 00000000300cb808   .list_check_node+0x1c
> S: 0000000031c03410 R: 00000000300cb910   .list_check+0x38
> S: 0000000031c034b0 R: 00000000300190ac   .try_lock_caller+0xb8
> S: 0000000031c03540 R: 00000000300192e0   .lock_caller+0x80
> S: 0000000031c03600 R: 0000000030012c70   .__flush_console+0x134
> S: 0000000031c036d0 R: 00000000300130cc   .console_write+0x68
> S: 0000000031c03780 R: 00000000300347bc   .vprlog+0xc8
> S: 0000000031c03970 R: 0000000030034844   ._prlog+0x50
> S: 0000000031c03a00 R: 00000000300364a4   .log_simple_error+0x74
> S: 0000000031c03b90 R: 000000003004ab48   .occ_pstates_init+0x184
> S: 0000000031c03d50 R: 000000003001480c   .load_and_boot_kernel+0x38c
> S: 0000000031c03e30 R: 000000003001571c   .main_cpu_entry+0x62c
> S: 0000000031c03f00 R: 0000000030002700   boot_entry+0x1c0
>
> Reported-by: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  core/console.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)

Merged to master as of 277615348ba64dbdff426206d1b6a619811163b1

Patch

diff --git a/core/console.c b/core/console.c
index b9129c9f..e8256758 100644
--- a/core/console.c
+++ b/core/console.c
@@ -100,7 +100,7 @@  void clear_console(void)
  * Optionally can skip flushing to drivers, leaving messages
  * just in memory console.
  */
-static bool __flush_console(bool flush_to_drivers)
+static bool __flush_console(bool flush_to_drivers, bool need_unlock)
 {
 	struct cpu_thread *cpu = this_cpu();
 	size_t req, len = 0;
@@ -114,8 +114,12 @@  static bool __flush_console(bool flush_to_drivers)
 	 * Console flushing is suspended on this CPU, typically because
 	 * some critical locks are held that would potentially cause a
 	 * flush to deadlock
+	 *
+	 * Also if it recursed on con_lock (need_unlock is false). This
+	 * can happen due to debug code firing (e.g., list or stack
+	 * debugging).
 	 */
-	if (cpu->con_suspend) {
+	if (cpu->con_suspend || !need_unlock) {
 		cpu->con_need_flush = true;
 		return false;
 	}
@@ -176,7 +180,7 @@  bool flush_console(void)
 	bool ret;
 
 	lock(&con_lock);
-	ret = __flush_console(true);
+	ret = __flush_console(true, true);
 	unlock(&con_lock);
 
 	return ret;
@@ -250,7 +254,7 @@  ssize_t console_write(bool flush_to_drivers, const void *buf, size_t count)
 		write_char(c);
 	}
 
-	__flush_console(flush_to_drivers);
+	__flush_console(flush_to_drivers, need_unlock);
 
 	if (need_unlock)
 		unlock(&con_lock);