[X,0/4] LP#1821259 Fix for deadlock in cpu_stopper

Message ID 20190321234412.11113-1-mfo@canonical.com
Series LP#1821259 Fix for deadlock in cpu_stopper

Message

Mauricio Faria de Oliveira March 21, 2019, 11:44 p.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1821259

[Impact] 

 * This problem hard locks up 2 CPUs in a deadlock, which in
   turn soft locks up other CPUs; the system becomes unusable.

 * This is relatively rare / difficult to hit because it is a
   corner case in scheduling/load balancing that requires precise
   timing with the CPU stopper code, and it needs an SMP plus
   _NUMA_ system (but it can be hit with the synthetic test case
   attached to the LP bug).

 * Since SMP plus NUMA usually means _servers_, it seems like a
   good idea to prevent this bug / hard lockups / rebooting.

 * The fix resolves the potential deadlock by moving one of the
   calls required for the deadlock out from under the locked code
   (see the sketch below).
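
 * Roughly, the combined effect of the series (shown here only as a
   simplified sketch, not a verbatim copy of the kernel code) is that
   the stopper thread wakeups in cpu_stop_queue_two_works() no longer
   happen while the per-CPU stopper locks are held, so they cannot
   deadlock against a CPU that holds its runqueue lock and is spinning
   on one of those stopper locks:

       /* Before: wakeups issued while both stopper locks are held. */
       spin_lock_irq(&stopper1->lock);
       spin_lock(&stopper2->lock);
       /* ... queue work1 and work2 ... */
       wake_up_process(stopper1->thread);  /* needs the target CPU's rq lock */
       wake_up_process(stopper2->thread);
       spin_unlock(&stopper2->lock);
       spin_unlock_irq(&stopper1->lock);

       /*
        * After: wakeups are only recorded under the locks (wake_q_add)
        * and issued once the locks are dropped, with preemption disabled
        * so both stopper threads are woken back-to-back.
        */
       DEFINE_WAKE_Q(wakeq);

       preempt_disable();
       spin_lock_irq(&stopper1->lock);
       spin_lock(&stopper2->lock);
       /* ... queue work1 and work2, wake_q_add() each stopper thread ... */
       spin_unlock(&stopper2->lock);
       spin_unlock_irq(&stopper1->lock);
       wake_up_q(&wakeq);
       preempt_enable();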

[Test Case]

 * There's a synthetic test case to reproduce this problem
   (although without the stack traces - just a system hang)
   attached to this LP bug.

 * It uses kprobes, mdelay, and cpu stopper calls to force the
   relevant code paths to execute and to make the timing/locking
   condition occur.

 * $ sudo insmod kmod-stopper.ko 

   Some dmesg logging occurs, and the system either hangs or not.
   See examples in the comments, and the illustrative sketch of
   the kprobes/mdelay approach below.
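
 * Purely as an illustration of that kprobes/mdelay technique (this
   is NOT the actual kmod-stopper.ko attached to the LP bug, and the
   probed symbol below is just a placeholder), a minimal module would
   look roughly like this:

       #include <linux/module.h>
       #include <linux/kprobes.h>
       #include <linux/delay.h>

       /* Busy-wait on entry to the probed function to widen the race window. */
       static int delay_pre(struct kprobe *p, struct pt_regs *regs)
       {
               mdelay(100);
               return 0;
       }

       static struct kprobe kp = {
               .symbol_name = "cpu_stop_queue_two_works",  /* placeholder */
               .pre_handler = delay_pre,
       };

       static int __init stopper_delay_init(void)
       {
               return register_kprobe(&kp);
       }

       static void __exit stopper_delay_exit(void)
       {
               unregister_kprobe(&kp);
       }

       module_init(stopper_delay_init);
       module_exit(stopper_delay_exit);
       MODULE_LICENSE("GPL");

   The real reproducer additionally issues cpu stopper calls so that
   two CPUs actually race while the delay widens the window.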
   
[Regression Potential] 

 * These are patches to the CPU stopper code (kernel/stop_machine.c),
   and they change how it works a bit; however, there are no later
   upstream fixes on top of these patches, and they are still at the
   top of the 'git log --oneline -- kernel/stop_machine.c' output.

 * These patches have been verified with the synthetic test case
   and 'stress-ng --class scheduler --sequential 0' (no regressions)
   on a guest with 2 CPUs and on a physical system with 24 CPUs.

[Other Info]
 
 * The patches are required on Xenial and later.
 * There are 4 patches for Xenial, and 2 patches pending for Bionic.
 * All patches are applied from Cosmic onwards.

Isaac J. Manjarres (2):
  stop_machine: Disable preemption when waking two stopper threads
  stop_machine: Disable preemption after queueing stopper threads

Peter Zijlstra (1):
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

Prasad Sodagudi (1):
  stop_machine: Atomically queue and wake stopper threads

 kernel/stop_machine.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

Comments

Marcelo Henrique Cerri March 27, 2019, 12:52 p.m. UTC | #1
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Mauricio Faria de Oliveira March 27, 2019, 3:20 p.m. UTC | #2
Pinging for reviews for this cycle, if at all possible.
Thank you!

On Thu, Mar 21, 2019 at 8:44 PM Mauricio Faria de Oliveira
<mfo@canonical.com> wrote:
> [...]
Khalid Elmously March 28, 2019, 3:01 a.m. UTC | #3
On 2019-03-21 20:44:08 , Mauricio Faria de Oliveira wrote:
> [...]

Interesting problem and good investigation.
The detailed comments, in the backport section and in the bug, are appreciated.

Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
Khalid Elmously March 28, 2019, 3:03 a.m. UTC | #4
On 2019-03-21 20:44:08 , Mauricio Faria de Oliveira wrote:
> [...]