Message ID | 20190321234412.11113-1-mfo@canonical.com |
---|---|
Series | LP#1821259 Fix for deadlock in cpu_stopper |
Acked-by: Marcelo Henrique Cerri <marcelo.cerri@canonical.com>
Pinging for reviews for this cycle, if at all possible.
Thank you!

On Thu, Mar 21, 2019 at 8:44 PM Mauricio Faria de Oliveira <mfo@canonical.com> wrote:
>
> BugLink: https://bugs.launchpad.net/bugs/1821259
>
> [Impact]
>
>  * This problem hard locks up 2 CPUs in a deadlock, and this
>    soft locks up other CPUs as an effect; the system becomes
>    unusable.
>
>  * This is relatively rare / difficult to hit because it's a
>    corner case in scheduling/load balancing that needs timing
>    with CPU stopper code. And it needs an SMP plus _NUMA_ system.
>    (but it can be hit with the synthetic test case attached in LP.)
>
>  * Since SMP plus NUMA usually equals _servers_ it looks like
>    a good idea to prevent this bug / hard lockups / rebooting.
>
>  * The fix resolves the potential deadlock by removing one of
>    the calls required to deadlock from under the locked code.
>
> [Test Case]
>
>  * There's a synthetic test case to reproduce this problem
>    (although without the stack traces - just a system hang)
>    attached to this LP bug.
>
>  * It uses kprobes/mdelay/cpu stopper calls to force the code
>    to execute and force the timing/locking condition to occur.
>
>  * $ sudo insmod kmod-stopper.ko
>
>    Some dmesg logging occurs, and the system either hangs or not.
>    See examples in comments.
>
> [Regression Potential]
>
>  * These are patches to the cpu stop_machine.c code, and they
>    change a bit how it works; however, there are no upstream
>    fixes for these patches anymore and they are still the top
>    of the 'git log --oneline -- kernel/stop_machine.c' output.
>
>  * These patches have been verified with the synthetic test case
>    and 'stress-ng --class scheduler --sequential 0' (no regressions)
>    on a guest with 2 CPUs and one physical system with 24 CPUs.
>
> [Other Info]
>
>  * The patches are required on Xenial and later.
>  * There are 4 patches for Xenial, and 2 patches pending for Bionic.
>  * All patches are applied from Cosmic onwards.
>
> Isaac J. Manjarres (2):
>   stop_machine: Disable preemption when waking two stopper threads
>   stop_machine: Disable preemption after queueing stopper threads
>
> Peter Zijlstra (1):
>   stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock
>
> Prasad Sodagudi (1):
>   stop_machine: Atomically queue and wake stopper threads
>
>  kernel/stop_machine.c | 32 +++++++++++++++++++++++++++-----
>  1 file changed, 27 insertions(+), 5 deletions(-)
>
> --
> 2.17.1
>
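
For readers skimming the thread, the idea in the last [Impact] bullet (take the wake-up call out from under the locked code) can be illustrated with a small userspace sketch. This is not the kernel code and is not taken from the patches: the `Stopper` class, `queue_two_works()`, the `to_wake` list and the lock ordering by name are all hypothetical stand-ins, and Python threading locks only loosely resemble the kernel's stopper locks.

```python
import threading


class Stopper:
    """Toy stand-in for a per-CPU stopper: a work list guarded by a lock."""

    def __init__(self, name):
        self.name = name
        self.lock = threading.Lock()
        self.works = []
        self.wakeups = 0
        self.woken_under_lock = False

    def wake(self):
        # Record whether our lock was still held at wake-up time. A wake-up
        # issued under the lock is the pattern the fix is described as
        # removing, because the wake-up path may itself need that lock.
        if self.lock.locked():
            self.woken_under_lock = True
        self.wakeups += 1


def queue_two_works(s1, s2, work1, work2):
    """Queue work on both stoppers under their locks, then wake them unlocked."""
    to_wake = []  # deferred wake-up list (hypothetical name)
    # Assumption of this sketch: take the locks in a fixed order (by name)
    # so that two callers can never acquire them in opposite order.
    first, second = sorted((s1, s2), key=lambda s: s.name)
    with first.lock:
        with second.lock:
            s1.works.append(work1)
            s2.works.append(work2)
            to_wake.extend((s1, s2))
    # Both locks are released here; only now are the wake-ups performed.
    for stopper in to_wake:
        stopper.wake()


if __name__ == "__main__":
    a, b = Stopper("cpu0"), Stopper("cpu1")
    queue_two_works(a, b, "work-a", "work-b")
    print(a.works, b.works, a.woken_under_lock, b.woken_under_lock)
```

The point of the sketch is only the ordering: both work items are queued while the locks are held, so the queueing stays atomic, and the wake-ups happen after the locks are dropped.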
On 2019-03-21 20:44:08, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1821259
> [...]

Interesting problem and good investigation. The detailed comments, in the backport section and in the bug, are appreciated.

Acked-by: Khalid Elmously <khalid.elmously@canonical.com>
On 2019-03-21 20:44:08, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1821259
> [...]
> --
> kernel-team mailing list
> kernel-team@lists.ubuntu.com
> https://lists.ubuntu.com/mailman/listinfo/kernel-team