From patchwork Fri Jan 11 11:08:37 2019
X-Patchwork-Submitter: Mauricio Faria de Oliveira
X-Patchwork-Id: 1023517
From: Mauricio Faria de Oliveira
To: kernel-team@lists.ubuntu.com
Subject: [SRU C][PATCH v2 0/6] blk-wbt: fix for LP#1810998
Date: Fri, 11 Jan 2019 09:08:37 -0200
Message-Id: <20190111110843.18042-1-mfo@canonical.com>

BugLink: https://bugs.launchpad.net/bugs/1810998

[Impact]

 * Users may experience CPU hard lockups when performing rigorous
   writes to NVMe drives.
 * The fix addresses a scheduling issue in the original implementation
   of wbt/writeback throttling.

 * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock contention and
   thundering herd issue in wbt_wait"), plus its fix commit 38cfb5a45ee0
   ("blk-wbt: improve waking of tasks"), plus a few dependency commits
   for each fix.

 * The backports are trivial: they mainly replace rq_wait_inc_below()
   with the equivalent atomic_inc_below(), and maintain the __wbt_done()
   signature, both due to the lack of commit a79050434b45 ("blk-rq-qos:
   refactor out common elements of blk-wbt"), which changes a lot of
   other/unrelated code.

[Test Case]

 * This command has been reported to reproduce the problem:

   $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128

 * It generates stack traces as below on the original kernel, and does
   not generate them on the modified/patched kernel.

 * The user/reporter verified that a test kernel with these patches
   resolved the problem.

 * The developer checked for regressions on 2 systems (4-core and
   24-core, but without NVMe), and no error messages were logged
   to dmesg.

[Regression Potential]

 * The regression potential is contained within the writeback throttling
   mechanism (block/blk-wbt.*).

 * The commits have been checked against later fixes in linux-next as of
   2019-01-08, and all known fix commits are included.

[Other Info]

 * The problem was introduced with the blk-wbt mechanism in v4.10-rc1,
   and the fix commits landed in v4.19-rc1 and -rc2, so only Bionic and
   Cosmic need this.

[Stack Traces]

[ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
...
[ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 393.628720] Call Trace:
[ 393.628721]  <IRQ>
[ 393.628724]  enqueue_task_fair+0x6c/0x7f0
[ 393.628726]  ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628728]  ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
[ 393.628731]  activate_task+0x57/0xc0
[ 393.628735]  ? sched_clock+0x9/0x10
[ 393.628736]  ? sched_clock+0x9/0x10
[ 393.628738]  ttwu_do_activate+0x49/0x90
[ 393.628739]  try_to_wake_up+0x1df/0x490
[ 393.628741]  default_wake_function+0x12/0x20
[ 393.628743]  autoremove_wake_function+0x12/0x40
[ 393.628744]  __wake_up_common+0x73/0x130
[ 393.628745]  __wake_up_common_lock+0x80/0xc0
[ 393.628746]  __wake_up+0x13/0x20
[ 393.628749]  __wbt_done.part.21+0xa4/0xb0
[ 393.628749]  wbt_done+0x72/0xa0
[ 393.628753]  blk_mq_free_request+0xca/0x1a0
[ 393.628755]  blk_mq_end_request+0x48/0x90
[ 393.628760]  nvme_complete_rq+0x23/0x120 [nvme_core]
[ 393.628763]  nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 393.628764]  __blk_mq_complete_request+0xd2/0x140
[ 393.628766]  blk_mq_complete_request+0x18/0x20
[ 393.628767]  nvme_process_cq+0xe1/0x1b0 [nvme]
[ 393.628768]  nvme_irq+0x23/0x50 [nvme]
[ 393.628772]  __handle_irq_event_percpu+0x44/0x1a0
[ 393.628773]  handle_irq_event_percpu+0x32/0x80
[ 393.628774]  handle_irq_event+0x3b/0x60
[ 393.628778]  handle_edge_irq+0x7c/0x190
[ 393.628779]  handle_irq+0x20/0x30
[ 393.628783]  do_IRQ+0x46/0xd0
[ 393.628784]  common_interrupt+0x84/0x84
[ 393.628785]  </IRQ>
...
[ 393.628794]  ? cpuidle_enter_state+0x97/0x2f0
[ 393.628796]  cpuidle_enter+0x17/0x20
[ 393.628797]  call_cpuidle+0x23/0x40
[ 393.628798]  do_idle+0x18c/0x1f0
[ 393.628799]  cpu_startup_entry+0x73/0x80
[ 393.628802]  start_secondary+0x1a6/0x200
[ 393.628804]  secondary_startup_64+0xa5/0xb0
[ 393.628805] Code: ...

[ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled

[ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 435.602858]  30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
[ 435.610203]  (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
[ 435.617025] Sending NMI from CPU 8 to CPUs 30:
[ 435.617029] NMI backtrace for cpu 30
[ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
...
[ 435.617047] Call Trace:
[ 435.617048]  <IRQ>
[ 435.617051]  enqueue_entity+0x9f/0x6b0
[ 435.617053]  enqueue_task_fair+0x6c/0x7f0
[ 435.617056]  activate_task+0x57/0xc0
[ 435.617059]  ? sched_clock+0x9/0x10
[ 435.617060]  ? sched_clock+0x9/0x10
[ 435.617061]  ttwu_do_activate+0x49/0x90
[ 435.617063]  try_to_wake_up+0x1df/0x490
[ 435.617065]  default_wake_function+0x12/0x20
[ 435.617067]  autoremove_wake_function+0x12/0x40
[ 435.617068]  __wake_up_common+0x73/0x130
[ 435.617069]  __wake_up_common_lock+0x80/0xc0
[ 435.617070]  __wake_up+0x13/0x20
[ 435.617073]  __wbt_done.part.21+0xa4/0xb0
[ 435.617074]  wbt_done+0x72/0xa0
[ 435.617077]  blk_mq_free_request+0xca/0x1a0
[ 435.617079]  blk_mq_end_request+0x48/0x90
[ 435.617084]  nvme_complete_rq+0x23/0x120 [nvme_core]
[ 435.617087]  nvme_pci_complete_rq+0x7a/0x130 [nvme]
[ 435.617088]  __blk_mq_complete_request+0xd2/0x140
[ 435.617090]  blk_mq_complete_request+0x18/0x20
[ 435.617091]  nvme_process_cq+0xe1/0x1b0 [nvme]
[ 435.617093]  nvme_irq+0x23/0x50 [nvme]
[ 435.617096]  __handle_irq_event_percpu+0x44/0x1a0
[ 435.617097]  handle_irq_event_percpu+0x32/0x80
[ 435.617098]  handle_irq_event+0x3b/0x60
[ 435.617101]  handle_edge_irq+0x7c/0x190
[ 435.617102]  handle_irq+0x20/0x30
[ 435.617106]  do_IRQ+0x46/0xd0
[ 435.617107]  common_interrupt+0x84/0x84
[ 435.617108]  </IRQ>
...
[ 435.617117]  ? cpuidle_enter_state+0x97/0x2f0
[ 435.617118]  cpuidle_enter+0x17/0x20
[ 435.617119]  call_cpuidle+0x23/0x40
[ 435.617121]  do_idle+0x18c/0x1f0
[ 435.617122]  cpu_startup_entry+0x73/0x80
[ 435.617125]  start_secondary+0x1a6/0x200
[ 435.617127]  secondary_startup_64+0xa5/0xb0
[ 435.617128] Code: ...
Anchal Agarwal (1):
  blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

Jens Axboe (5):
  blk-wbt: move disable check into get_limit()
  blk-wbt: use wq_has_sleeper() for wq active check
  blk-wbt: fix has-sleeper queueing check
  blk-wbt: abstract out end IO completion handler
  blk-wbt: improve waking of tasks

 block/blk-wbt.c | 107 +++++++++++++++++++++++++++++++++---------------
 1 file changed, 75 insertions(+), 32 deletions(-)

Acked-by: Stefan Bader
Acked-by: Kleber Sacilotto de Souza