diff mbox series

Fix possible deadlock with DEBUG build

Message ID 20201211053106.273472-1-hegdevasant@linux.vnet.ibm.com
State Accepted
Headers show
Series Fix possible deadlock with DEBUG build | expand

Commit Message

Vasant Hegde Dec. 11, 2020, 5:31 a.m. UTC
Sample output from Cédric:
-------------------------
[   88.294111649,7] cpu_idle_p9 called on cpu 0x063c with pm disabled
[   88.289365222,7] cpu_idle_p9 called on cpu 0x025f with pm disabled
[   88.289900684,7] cpu_idle_p9 called on cpu 0x045f with pm disabled
[   88.302621295,7] CHIPTOD: Base TFMR=0x2512000000000000
[   88.289899701,7] cpu_idle_p9 called on cpu 0x0456 with pm disabled
LOCK ERROR: Deadlock detected @0x30402740 (state: 0x0000000400000001)
[   88.332264757,3] ***********************************************
[   88.332300051,3] < assert failed at core/lock.c:32 >
[   88.332328282,3]     .
[   88.332347335,3]      .
[   88.332364894,3]       .
[   88.332377963,3]         OO__)
[   88.332395458,3]        <"__/
[   88.332412628,3]         ^ ^
[   88.332450246,3] Fatal TRAP at 00000000300286a0   .lock_error+0x64  MSR 9000000000021002
[   88.332501812,3] CFAR : 00000000300414f4 MSR  : 9000000000021002
[   88.332536539,3] SRR0 : 00000000300286a0 SRR1 : 9000000000021002
[   88.332574644,3] HSRR0: 0000000030020024 HSRR1: 9000000000001000
[   88.332610635,3] DSISR: 00000000         DAR  : 0000000000000000
[   88.332650628,3] LR   : 0000000030028690 CTR  : 00000000300f9fa0
[   88.332684451,3] CR   : 20002000         XER  : 00000000
[   88.332712767,3] GPR00: 0000000030028690 GPR16: 0000000032c98000
[   88.332748046,3] GPR01: 0000000032c9b0a0 GPR17: 0000000000000000
[   88.332784060,3] GPR02: 0000000030169d00 GPR18: 0000000000000000
[   88.332822091,3] GPR03: 0000000032c9b310 GPR19: 0000000000000000
[   88.332861357,3] GPR04: 0000000030041480 GPR20: 0000000000000000
[   88.332897229,3] GPR05: 0000000000000000 GPR21: 0000000000000000
[   88.332937051,3] GPR06: 0000000000000010 GPR22: 0000000000000000
[   88.332968463,3] GPR07: 0000000000000000 GPR23: 0000000000000000
[   88.333007333,3] GPR08: 000000000002cbb5 GPR24: 0000000000000000
[   88.333041971,3] GPR09: 0000000000000000 GPR25: 0000000000000000
[   88.333081073,3] GPR10: 0000000000000000 GPR26: 0000000000000003
[   88.333114301,3] GPR11: 3839616263646566 GPR27: 0000000000000211
[   88.333156040,3] GPR12: 0000000020002000 GPR28: 000000003042a134
[   88.333189222,3] GPR13: 0000000000000000 GPR29: 0000000030402740
[   88.333225638,3] GPR14: 0000000000000000 GPR30: 0000000000000001
[   88.333259730,3] GPR15: 0000000000000000 GPR31: 0000000000000000
CPU 0211 Backtrace:
 S: 0000000032c9b3b0 R: 0000000030028690   .lock_error+0x54
 S: 0000000032c9b440 R: 0000000030028828   .add_lock_request+0xd0
 S: 0000000032c9b4f0 R: 0000000030028a9c   .lock_caller+0x8c
 S: 0000000032c9b5a0 R: 0000000030021b30   .__mcount_stack_check+0x70
 S: 0000000032c9b650 R: 00000000300fabb0   .list_check_node+0x1c
 S: 0000000032c9b6f0 R: 00000000300fac98   .list_check+0x38
 S: 0000000032c9b790 R: 00000000300289bc   .try_lock_caller+0xac
 S: 0000000032c9b830 R: 0000000030028ad8   .lock_caller+0xc8
 S: 0000000032c9b8e0 R: 0000000030028d74   .lock_recursive_caller+0x54
 S: 0000000032c9b980 R: 0000000030020cb8   .console_write+0x48
 S: 0000000032c9ba30 R: 00000000300445a8   .vprlog+0xc8
 S: 0000000032c9bc20 R: 0000000030044630   ._prlog+0x50
 S: 0000000032c9bcb0 R: 0000000030029204   .cpu_idle_p9+0x74
 S: 0000000032c9bd40 R: 0000000030029628   .cpu_idle_pm+0x4c
 S: 0000000032c9bde0 R: 0000000030023fe0   .__secondary_cpu_entry+0xa0
 S: 0000000032c9be70 R: 0000000030024034   .secondary_cpu_entry+0x40
 S: 0000000032c9bf00 R: 0000000030003290   secondary_wait+0x8c

CPU 0x4:
  opal_run_pollers ->
    check_stacks -> takes stack_check_lock lock
      prlog ->
        console_write -> waits for con_lock

CPU 0x211
  cpu_idle_p9 ->
    prlog ->
      console_write -> Takes con_lock lock
       list_check_node -> tries to take stack_check_lock and hits deadlock.

I think we don't need to hold `stack_check_lock` while printing
backtraces. Instead it makes sense to hold backtrace lock (bt_lock)
and print output.

Reported-by: Cédric Le Goater <clg@kaod.org>
Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
---
 core/stack.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Cédric Le Goater Dec. 11, 2020, 8:28 a.m. UTC | #1
On 12/11/20 6:31 AM, Vasant Hegde wrote:
> Sample output from Cédric:
> -------------------------
> [   88.294111649,7] cpu_idle_p9 called on cpu 0x063c with pm disabled
> [   88.289365222,7] cpu_idle_p9 called on cpu 0x025f with pm disabled
> [   88.289900684,7] cpu_idle_p9 called on cpu 0x045f with pm disabled
> [   88.302621295,7] CHIPTOD: Base TFMR=0x2512000000000000
> [   88.289899701,7] cpu_idle_p9 called on cpu 0x0456 with pm disabled
> LOCK ERROR: Deadlock detected @0x30402740 (state: 0x0000000400000001)
> [   88.332264757,3] ***********************************************
> [   88.332300051,3] < assert failed at core/lock.c:32 >
> [   88.332328282,3]     .
> [   88.332347335,3]      .
> [   88.332364894,3]       .
> [   88.332377963,3]         OO__)
> [   88.332395458,3]        <"__/
> [   88.332412628,3]         ^ ^
> [   88.332450246,3] Fatal TRAP at 00000000300286a0   .lock_error+0x64  MSR 9000000000021002
> [   88.332501812,3] CFAR : 00000000300414f4 MSR  : 9000000000021002
> [   88.332536539,3] SRR0 : 00000000300286a0 SRR1 : 9000000000021002
> [   88.332574644,3] HSRR0: 0000000030020024 HSRR1: 9000000000001000
> [   88.332610635,3] DSISR: 00000000         DAR  : 0000000000000000
> [   88.332650628,3] LR   : 0000000030028690 CTR  : 00000000300f9fa0
> [   88.332684451,3] CR   : 20002000         XER  : 00000000
> [   88.332712767,3] GPR00: 0000000030028690 GPR16: 0000000032c98000
> [   88.332748046,3] GPR01: 0000000032c9b0a0 GPR17: 0000000000000000
> [   88.332784060,3] GPR02: 0000000030169d00 GPR18: 0000000000000000
> [   88.332822091,3] GPR03: 0000000032c9b310 GPR19: 0000000000000000
> [   88.332861357,3] GPR04: 0000000030041480 GPR20: 0000000000000000
> [   88.332897229,3] GPR05: 0000000000000000 GPR21: 0000000000000000
> [   88.332937051,3] GPR06: 0000000000000010 GPR22: 0000000000000000
> [   88.332968463,3] GPR07: 0000000000000000 GPR23: 0000000000000000
> [   88.333007333,3] GPR08: 000000000002cbb5 GPR24: 0000000000000000
> [   88.333041971,3] GPR09: 0000000000000000 GPR25: 0000000000000000
> [   88.333081073,3] GPR10: 0000000000000000 GPR26: 0000000000000003
> [   88.333114301,3] GPR11: 3839616263646566 GPR27: 0000000000000211
> [   88.333156040,3] GPR12: 0000000020002000 GPR28: 000000003042a134
> [   88.333189222,3] GPR13: 0000000000000000 GPR29: 0000000030402740
> [   88.333225638,3] GPR14: 0000000000000000 GPR30: 0000000000000001
> [   88.333259730,3] GPR15: 0000000000000000 GPR31: 0000000000000000
> CPU 0211 Backtrace:
>  S: 0000000032c9b3b0 R: 0000000030028690   .lock_error+0x54
>  S: 0000000032c9b440 R: 0000000030028828   .add_lock_request+0xd0
>  S: 0000000032c9b4f0 R: 0000000030028a9c   .lock_caller+0x8c
>  S: 0000000032c9b5a0 R: 0000000030021b30   .__mcount_stack_check+0x70
>  S: 0000000032c9b650 R: 00000000300fabb0   .list_check_node+0x1c
>  S: 0000000032c9b6f0 R: 00000000300fac98   .list_check+0x38
>  S: 0000000032c9b790 R: 00000000300289bc   .try_lock_caller+0xac
>  S: 0000000032c9b830 R: 0000000030028ad8   .lock_caller+0xc8
>  S: 0000000032c9b8e0 R: 0000000030028d74   .lock_recursive_caller+0x54
>  S: 0000000032c9b980 R: 0000000030020cb8   .console_write+0x48
>  S: 0000000032c9ba30 R: 00000000300445a8   .vprlog+0xc8
>  S: 0000000032c9bc20 R: 0000000030044630   ._prlog+0x50
>  S: 0000000032c9bcb0 R: 0000000030029204   .cpu_idle_p9+0x74
>  S: 0000000032c9bd40 R: 0000000030029628   .cpu_idle_pm+0x4c
>  S: 0000000032c9bde0 R: 0000000030023fe0   .__secondary_cpu_entry+0xa0
>  S: 0000000032c9be70 R: 0000000030024034   .secondary_cpu_entry+0x40
>  S: 0000000032c9bf00 R: 0000000030003290   secondary_wait+0x8c
> 
> CPU 0x4:
>   opal_run_pollers ->
>     check_stacks -> takes stack_check_lock lock
>       prlog ->
>         console_write -> waits for con_lock
> 
> CPU 0x211
>   cpu_idle_p9 ->
>     prlog ->
>       console_write -> Takes con_lock lock
>        list_check_node -> tries to take stack_check_lock and hits deadlock.
> 
> I think we don't need to hold `stack_check_lock` while printing
> backtraces. Instead it makes sense to hold backtrace lock (bt_lock)
> and print output.
> 
> Reported-by: Cédric Le Goater <clg@kaod.org>
> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>

On a 4 socket, 352 threads :

Tested-by: Cédric Le Goater <clg@kaod.org>

Thanks,

C. 

> ---
>  core/stack.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/core/stack.c b/core/stack.c
> index 05506f6b3..3edf98411 100644
> --- a/core/stack.c
> +++ b/core/stack.c
> @@ -250,7 +250,7 @@ void check_stacks(void)
>  		unlock(&stack_check_lock);
>  	}
>  	if (lowest) {
> -		lock(&stack_check_lock);
> +		lock(&bt_lock);
>  		prlog(PR_NOTICE, "CPU %04x lowest stack mark %lld bytes left"
>  		      " pc=%08llx token=%lld\n",
>  		      lowest->pir, lowest->stack_bot_mark, lowest->stack_bot_pc,
> @@ -258,7 +258,7 @@ void check_stacks(void)
>  		backtrace_print(lowest->stack_bot_bt,
>  				&lowest->stack_bot_bt_metadata,
>  				NULL, NULL, true);
> -		unlock(&stack_check_lock);
> +		unlock(&bt_lock);
>  	}
>  
>  	this_cpu()->in_mcount = false;
>
Vasant Hegde Dec. 16, 2020, 7:58 a.m. UTC | #2
On 12/11/20 1:58 PM, Cédric Le Goater wrote:
> On 12/11/20 6:31 AM, Vasant Hegde wrote:
>> Sample output from Cédric:
>> -------------------------
>> [   88.294111649,7] cpu_idle_p9 called on cpu 0x063c with pm disabled
>> [   88.289365222,7] cpu_idle_p9 called on cpu 0x025f with pm disabled
>> [   88.289900684,7] cpu_idle_p9 called on cpu 0x045f with pm disabled
>> [   88.302621295,7] CHIPTOD: Base TFMR=0x2512000000000000
>> [   88.289899701,7] cpu_idle_p9 called on cpu 0x0456 with pm disabled
>> LOCK ERROR: Deadlock detected @0x30402740 (state: 0x0000000400000001)
>> [   88.332264757,3] ***********************************************
>> [   88.332300051,3] < assert failed at core/lock.c:32 >
>> [   88.332328282,3]     .
>> [   88.332347335,3]      .
>> [   88.332364894,3]       .
>> [   88.332377963,3]         OO__)
>> [   88.332395458,3]        <"__/
>> [   88.332412628,3]         ^ ^
>> [   88.332450246,3] Fatal TRAP at 00000000300286a0   .lock_error+0x64  MSR 9000000000021002
>> [   88.332501812,3] CFAR : 00000000300414f4 MSR  : 9000000000021002
>> [   88.332536539,3] SRR0 : 00000000300286a0 SRR1 : 9000000000021002
>> [   88.332574644,3] HSRR0: 0000000030020024 HSRR1: 9000000000001000
>> [   88.332610635,3] DSISR: 00000000         DAR  : 0000000000000000
>> [   88.332650628,3] LR   : 0000000030028690 CTR  : 00000000300f9fa0
>> [   88.332684451,3] CR   : 20002000         XER  : 00000000
>> [   88.332712767,3] GPR00: 0000000030028690 GPR16: 0000000032c98000
>> [   88.332748046,3] GPR01: 0000000032c9b0a0 GPR17: 0000000000000000
>> [   88.332784060,3] GPR02: 0000000030169d00 GPR18: 0000000000000000
>> [   88.332822091,3] GPR03: 0000000032c9b310 GPR19: 0000000000000000
>> [   88.332861357,3] GPR04: 0000000030041480 GPR20: 0000000000000000
>> [   88.332897229,3] GPR05: 0000000000000000 GPR21: 0000000000000000
>> [   88.332937051,3] GPR06: 0000000000000010 GPR22: 0000000000000000
>> [   88.332968463,3] GPR07: 0000000000000000 GPR23: 0000000000000000
>> [   88.333007333,3] GPR08: 000000000002cbb5 GPR24: 0000000000000000
>> [   88.333041971,3] GPR09: 0000000000000000 GPR25: 0000000000000000
>> [   88.333081073,3] GPR10: 0000000000000000 GPR26: 0000000000000003
>> [   88.333114301,3] GPR11: 3839616263646566 GPR27: 0000000000000211
>> [   88.333156040,3] GPR12: 0000000020002000 GPR28: 000000003042a134
>> [   88.333189222,3] GPR13: 0000000000000000 GPR29: 0000000030402740
>> [   88.333225638,3] GPR14: 0000000000000000 GPR30: 0000000000000001
>> [   88.333259730,3] GPR15: 0000000000000000 GPR31: 0000000000000000
>> CPU 0211 Backtrace:
>>   S: 0000000032c9b3b0 R: 0000000030028690   .lock_error+0x54
>>   S: 0000000032c9b440 R: 0000000030028828   .add_lock_request+0xd0
>>   S: 0000000032c9b4f0 R: 0000000030028a9c   .lock_caller+0x8c
>>   S: 0000000032c9b5a0 R: 0000000030021b30   .__mcount_stack_check+0x70
>>   S: 0000000032c9b650 R: 00000000300fabb0   .list_check_node+0x1c
>>   S: 0000000032c9b6f0 R: 00000000300fac98   .list_check+0x38
>>   S: 0000000032c9b790 R: 00000000300289bc   .try_lock_caller+0xac
>>   S: 0000000032c9b830 R: 0000000030028ad8   .lock_caller+0xc8
>>   S: 0000000032c9b8e0 R: 0000000030028d74   .lock_recursive_caller+0x54
>>   S: 0000000032c9b980 R: 0000000030020cb8   .console_write+0x48
>>   S: 0000000032c9ba30 R: 00000000300445a8   .vprlog+0xc8
>>   S: 0000000032c9bc20 R: 0000000030044630   ._prlog+0x50
>>   S: 0000000032c9bcb0 R: 0000000030029204   .cpu_idle_p9+0x74
>>   S: 0000000032c9bd40 R: 0000000030029628   .cpu_idle_pm+0x4c
>>   S: 0000000032c9bde0 R: 0000000030023fe0   .__secondary_cpu_entry+0xa0
>>   S: 0000000032c9be70 R: 0000000030024034   .secondary_cpu_entry+0x40
>>   S: 0000000032c9bf00 R: 0000000030003290   secondary_wait+0x8c
>>
>> CPU 0x4:
>>    opal_run_pollers ->
>>      check_stacks -> takes stack_check_lock lock
>>        prlog ->
>>          console_write -> waits for con_lock
>>
>> CPU 0x211
>>    cpu_idle_p9 ->
>>      prlog ->
>>        console_write -> Takes con_lock lock
>>         list_check_node -> tries to take stack_check_lock and hits deadlock.
>>
>> I think we don't need to hold `stack_check_lock` while printing
>> backtraces. Instead it makes sense to hold backtrace lock (bt_lock)
>> and print output.
>>
>> Reported-by: Cédric Le Goater <clg@kaod.org>
>> Signed-off-by: Vasant Hegde <hegdevasant@linux.vnet.ibm.com>
> 
> On a 4 socket, 352 threads :
> 
> Tested-by: Cédric Le Goater <clg@kaod.org>

Sorry. I had missed to add your Test-by. Thanks for testing.

Merged to master as b2ac14f4.

-Vasant
diff mbox series

Patch

diff --git a/core/stack.c b/core/stack.c
index 05506f6b3..3edf98411 100644
--- a/core/stack.c
+++ b/core/stack.c
@@ -250,7 +250,7 @@  void check_stacks(void)
 		unlock(&stack_check_lock);
 	}
 	if (lowest) {
-		lock(&stack_check_lock);
+		lock(&bt_lock);
 		prlog(PR_NOTICE, "CPU %04x lowest stack mark %lld bytes left"
 		      " pc=%08llx token=%lld\n",
 		      lowest->pir, lowest->stack_bot_mark, lowest->stack_bot_pc,
@@ -258,7 +258,7 @@  void check_stacks(void)
 		backtrace_print(lowest->stack_bot_bt,
 				&lowest->stack_bot_bt_metadata,
 				NULL, NULL, true);
-		unlock(&stack_check_lock);
+		unlock(&bt_lock);
 	}
 
 	this_cpu()->in_mcount = false;