From patchwork Thu Mar 21 23:48:34 2019
X-Patchwork-Submitter: Mauricio Faria de Oliveira
X-Patchwork-Id: 1060604
From: Mauricio Faria de Oliveira
To: kernel-team@lists.ubuntu.com
Subject: [B][PATCH 0/2] Fix for LP#1821259 (pending patches for) Fix for deadlock in cpu_stopper
Date: Thu, 21 Mar 2019 20:48:34 -0300
Message-Id: <20190321234836.11774-1-mfo@canonical.com>
X-Mailer: git-send-email 2.17.1
List-Id: Kernel team discussions

BugLink: https://bugs.launchpad.net/bugs/1821259

Bionic only needs 2 of the 4 patches submitted for Xenial.
All patches are applied / not needed on Cosmic and later.

[Impact]

 * This problem hard locks up 2 CPUs in a deadlock, and this
   soft locks up other CPUs as an effect; the system becomes
   unusable.

 * This is relatively rare / difficult to hit because it's a
   corner case in scheduling/load balancing that needs timing
   with the CPU stopper code. It also needs an SMP plus _NUMA_
   system (but it can be hit with the synthetic test case
   attached in LP).

 * Since SMP plus NUMA usually equals _servers_, it looks like
   a good idea to prevent this bug / hard lockups / rebooting.

 * The fix resolves the potential deadlock by removing one of
   the calls required to deadlock from under the locked code.

[Test Case]

 * There's a synthetic test case to reproduce this problem
   (although without the stack traces - just a system hang)
   attached to this LP bug.

 * It uses kprobes/mdelay/cpu stopper calls to force the code
   to execute and force the timing/locking condition to occur.

 * $ sudo insmod kmod-stopper.ko

   Some dmesg logging occurs, and the system either hangs or not.
   See examples in comments.

[Regression Potential]

 * These are patches to the cpu stop_machine.c code, and they
   change a bit how it works; however, there are no further
   upstream fixes for these patches, and they are still at the
   top of the 'git log --oneline -- kernel/stop_machine.c' output.

 * These patches have been verified with the synthetic test case
   and 'stress-ng --class scheduler --sequential 0' (no regressions)
   on a guest with 2 CPUs and one physical system with 24 CPUs.

[Other Info]

 * The patches are required on Xenial and later.
 * There are 4 patches for Xenial, and 2 patches pending for Bionic.
 * All patches are applied from Cosmic onwards.

Isaac J. Manjarres (1):
  stop_machine: Disable preemption after queueing stopper threads

Prasad Sodagudi (1):
  stop_machine: Atomically queue and wake stopper threads

 kernel/stop_machine.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

Acked-by: Khalid Elmously
Acked-by: Marcelo Henrique Cerri
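
[Illustration]

The [Impact] section says the fix removes one of the calls required to deadlock from under the locked code. The sketch below is a user-space analogy of that idea only, not the kernel change itself. It assumes two pthread mutexes standing in for the stopper lock and a scheduler-side lock, and every name in it (queue_lock, sched_lock, wake_worker, queue_work_nested, queue_work_deferred) is hypothetical. The "before" variant issues the wake-up while queue_lock is held, which creates a queue_lock -> sched_lock ordering; the "after" variant only queues under the lock and wakes once the lock is dropped.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical user-space stand-ins; none of these names come from the kernel. */
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER; /* plays the stopper lock */
static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER; /* plays a scheduler lock */

static bool queue_lock_held; /* tracked by hand, for this demo only */
static int nested_wakeups;   /* wake-ups issued while queue_lock was held */
static int pending_work;     /* number of queued work items */

/* Waking the worker needs sched_lock, like a wake-up that enters the scheduler. */
static void wake_worker(void)
{
    if (queue_lock_held)
        nested_wakeups++; /* queue_lock -> sched_lock ordering: deadlock-prone */

    pthread_mutex_lock(&sched_lock);
    /* ... mark the worker runnable ... */
    pthread_mutex_unlock(&sched_lock);
}

/* Before: the wake-up is called from inside the locked region. */
static void queue_work_nested(void)
{
    pthread_mutex_lock(&queue_lock);
    queue_lock_held = true;
    pending_work++;
    wake_worker();
    queue_lock_held = false;
    pthread_mutex_unlock(&queue_lock);
}

/* After: only note that a wake-up is due; issue it after unlocking. */
static void queue_work_deferred(void)
{
    bool need_wake;

    pthread_mutex_lock(&queue_lock);
    queue_lock_held = true;
    pending_work++;
    need_wake = true; /* work was queued, so a wake-up is due */
    queue_lock_held = false;
    pthread_mutex_unlock(&queue_lock);

    if (need_wake)
        wake_worker(); /* sched_lock is now taken without queue_lock held */
}
```

Calling queue_work_nested() once increments nested_wakeups, while queue_work_deferred() leaves it unchanged; in the deferred variant no path takes sched_lock with queue_lock held, so a second thread taking the two locks in the opposite order can no longer complete a lock-order cycle through this code.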