X-Patchwork-Id: 1060598
From: Mauricio Faria de Oliveira
To: kernel-team@lists.ubuntu.com
Subject: [X][PATCH 0/4] LP#1821259 Fix for deadlock in cpu_stopper
Date: Thu, 21 Mar 2019 20:44:08 -0300
Message-Id: <20190321234412.11113-1-mfo@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/1821259

[Impact]

 * This problem hard locks up 2 CPUs in a deadlock, which in turn
   soft locks up other CPUs; the system becomes unusable.

 * It is relatively rare and difficult to hit: it is a corner case
   in scheduling/load balancing that depends on precise timing
   against the CPU stopper code, and it requires an SMP plus _NUMA_
   system.
   (However, it can be hit with the synthetic test case attached
   to the LP bug.)

 * Since SMP plus NUMA usually means _servers_, it seems worthwhile
   to prevent this bug and the resulting hard lockups and reboots.

 * The fix resolves the potential deadlock by removing one of the
   calls required for the deadlock from under the lock.

[Test Case]

 * There is a synthetic test case that reproduces this problem
   (although without the stack traces, just a system hang)
   attached to this LP bug.

 * It uses kprobes/mdelay/cpu stopper calls to force the code to
   execute and to force the timing/locking condition to occur.

 * $ sudo insmod kmod-stopper.ko

   Some dmesg logging occurs, and the system either hangs or not.
   See the examples in the LP bug comments.

[Regression Potential]

 * These are patches to the CPU stop_machine.c code, and they change
   how it works slightly; however, there are no further upstream
   fixes to these patches, and they are still at the top of the
   'git log --oneline -- kernel/stop_machine.c' output.

 * These patches have been verified with the synthetic test case
   and 'stress-ng --class scheduler --sequential 0' (no regressions)
   on a guest with 2 CPUs and on one physical system with 24 CPUs.

[Other Info]

 * The patches are required on Xenial and later.
 * There are 4 patches for Xenial, and 2 patches pending for Bionic.
 * All patches are applied from Cosmic onwards.

Isaac J. Manjarres (2):
  stop_machine: Disable preemption when waking two stopper threads
  stop_machine: Disable preemption after queueing stopper threads

Peter Zijlstra (1):
  stop_machine, sched: Fix migrate_swap() vs. active_balance() deadlock

Prasad Sodagudi (1):
  stop_machine: Atomically queue and wake stopper threads

 kernel/stop_machine.c | 32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

Acked-by: Marcelo Henrique Cerri
Acked-by: Khalid Elmously