Message ID: 20190110031219.30676-1-mfo@canonical.com
Series:     blk-wbt: fix for LP#1810998
On 10.01.19 04:12, Mauricio Faria de Oliveira wrote:
> BugLink: https://bugs.launchpad.net/bugs/1810998
>
> [Impact]
>
>  * Users may experience cpu hard lockups when performing
>    rigorous writes to NVMe drives.
>
>  * The fix addresses a scheduling issue in the original
>    implementation of wbt/writeback throttling.
>
>  * The fix is commit 2887e41b910b ("blk-wbt: Avoid lock
>    contention and thundering herd issue in wbt_wait"),
>    plus its fix commit 38cfb5a45ee0 ("blk-wbt: improve
>    waking of tasks").
>
>  * There are additional commits to help with a cleaner
>    backport and future maintenance:
>    - Cosmic: total of 8 clean cherry picks.
>    - Bionic: total of 13 commits, 9 clean cherry picks
>      and 4 backports, which are just changes to context
>      lines (i.e. refresh) without any functional changes
>      in the backport itself.
>
> [Test Case]
>
>  * This command has been reported to reproduce the problem:
>
>    $ sudo iozone -R -s 5G -r 1m -S 2048 -i 0 -G -c -o -l 128 -u 128 -t 128
>
>  * It generates stack traces as below on the original kernel,
>    and does not generate them on the modified/patched kernel.
>
>  * The user/reporter verified that the test kernel with these
>    patches resolved the problem.
>
>  * The developer verified on 2 systems (4-core and 24-core, but
>    no NVMe) for regressions, and no error messages were logged
>    to dmesg.
>
> [Regression Potential]
>
>  * The commits have been verified for fixes in later commits in
>    linux-next as of 2019-01-08, and all known fix commits are in.
>
>  * The regression potential is mostly contained in the writeback
>    throttling code (block/blk-wbt.*), as almost all of the 13 patches
>    change exclusively that, except for 4 of them (2 of which are
>    sysfs):
>
>    - blk-rq-qos: refactor out common elements of blk-wbt (block/)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-sysfs.c)
>    - block: Protect less code with sysfs_lock in blk_{un,}register_queue() (blk-{mq-}sysfs.c)
>    - block: pass struct request instead of struct blk_issue_stat to wbt (block/, still mostly blk-wbt.*)
>
> [Other Info]
>
>  * Alternatively, it is probably possible to introduce just the
>    two commits that fix this, with some changes to their code in
>    the backport; but since the 'blk-rq-qos: refactor ..' commit
>    may become a dependency for many additional/future fixes, it
>    seemed interesting to pull it in earlier in the 18.04 cycle.
>
>  * The problem was introduced with the blk-wbt mechanism, in
>    v4.10-rc1, and the fix commits landed in v4.19-rc1 and -rc2,
>    so only Bionic and Cosmic need this.
>
> [Stack Traces]
>
> [ 393.628647] NMI watchdog: Watchdog detected hard LOCKUP on cpu 30
> ...
> [ 393.628704] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 393.628720] Call Trace:
> [ 393.628721] <IRQ>
> [ 393.628724] enqueue_task_fair+0x6c/0x7f0
> [ 393.628726] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628728] ? __update_load_avg_blocked_se.isra.37+0xd1/0x150
> [ 393.628731] activate_task+0x57/0xc0
> [ 393.628735] ? sched_clock+0x9/0x10
> [ 393.628736] ? sched_clock+0x9/0x10
> [ 393.628738] ttwu_do_activate+0x49/0x90
> [ 393.628739] try_to_wake_up+0x1df/0x490
> [ 393.628741] default_wake_function+0x12/0x20
> [ 393.628743] autoremove_wake_function+0x12/0x40
> [ 393.628744] __wake_up_common+0x73/0x130
> [ 393.628745] __wake_up_common_lock+0x80/0xc0
> [ 393.628746] __wake_up+0x13/0x20
> [ 393.628749] __wbt_done.part.21+0xa4/0xb0
> [ 393.628749] wbt_done+0x72/0xa0
> [ 393.628753] blk_mq_free_request+0xca/0x1a0
> [ 393.628755] blk_mq_end_request+0x48/0x90
> [ 393.628760] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 393.628763] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 393.628764] __blk_mq_complete_request+0xd2/0x140
> [ 393.628766] blk_mq_complete_request+0x18/0x20
> [ 393.628767] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 393.628768] nvme_irq+0x23/0x50 [nvme]
> [ 393.628772] __handle_irq_event_percpu+0x44/0x1a0
> [ 393.628773] handle_irq_event_percpu+0x32/0x80
> [ 393.628774] handle_irq_event+0x3b/0x60
> [ 393.628778] handle_edge_irq+0x7c/0x190
> [ 393.628779] handle_irq+0x20/0x30
> [ 393.628783] do_IRQ+0x46/0xd0
> [ 393.628784] common_interrupt+0x84/0x84
> [ 393.628785] </IRQ>
> ...
> [ 393.628794] ? cpuidle_enter_state+0x97/0x2f0
> [ 393.628796] cpuidle_enter+0x17/0x20
> [ 393.628797] call_cpuidle+0x23/0x40
> [ 393.628798] do_idle+0x18c/0x1f0
> [ 393.628799] cpu_startup_entry+0x73/0x80
> [ 393.628802] start_secondary+0x1a6/0x200
> [ 393.628804] secondary_startup_64+0xa5/0xb0
> [ 393.628805] Code: ...
>
> [ 405.981597] nvme nvme1: I/O 393 QID 6 timeout, completion polled
>
> [ 435.597209] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 435.602858] 30-...0: (1 GPs behind) idle=e26/1/0 softirq=6834/6834 fqs=4485
> [ 435.610203] (detected by 8, t=15005 jiffies, g=6396, c=6395, q=146818)
> [ 435.617025] Sending NMI from CPU 8 to CPUs 30:
> [ 435.617029] NMI backtrace for cpu 30
> [ 435.617031] CPU: 30 PID: 0 Comm: swapper/30 Tainted: P OE 4.15.0-20-generic #21-Ubuntu
> ...
> [ 435.617047] Call Trace:
> [ 435.617048] <IRQ>
> [ 435.617051] enqueue_entity+0x9f/0x6b0
> [ 435.617053] enqueue_task_fair+0x6c/0x7f0
> [ 435.617056] activate_task+0x57/0xc0
> [ 435.617059] ? sched_clock+0x9/0x10
> [ 435.617060] ? sched_clock+0x9/0x10
> [ 435.617061] ttwu_do_activate+0x49/0x90
> [ 435.617063] try_to_wake_up+0x1df/0x490
> [ 435.617065] default_wake_function+0x12/0x20
> [ 435.617067] autoremove_wake_function+0x12/0x40
> [ 435.617068] __wake_up_common+0x73/0x130
> [ 435.617069] __wake_up_common_lock+0x80/0xc0
> [ 435.617070] __wake_up+0x13/0x20
> [ 435.617073] __wbt_done.part.21+0xa4/0xb0
> [ 435.617074] wbt_done+0x72/0xa0
> [ 435.617077] blk_mq_free_request+0xca/0x1a0
> [ 435.617079] blk_mq_end_request+0x48/0x90
> [ 435.617084] nvme_complete_rq+0x23/0x120 [nvme_core]
> [ 435.617087] nvme_pci_complete_rq+0x7a/0x130 [nvme]
> [ 435.617088] __blk_mq_complete_request+0xd2/0x140
> [ 435.617090] blk_mq_complete_request+0x18/0x20
> [ 435.617091] nvme_process_cq+0xe1/0x1b0 [nvme]
> [ 435.617093] nvme_irq+0x23/0x50 [nvme]
> [ 435.617096] __handle_irq_event_percpu+0x44/0x1a0
> [ 435.617097] handle_irq_event_percpu+0x32/0x80
> [ 435.617098] handle_irq_event+0x3b/0x60
> [ 435.617101] handle_edge_irq+0x7c/0x190
> [ 435.617102] handle_irq+0x20/0x30
> [ 435.617106] do_IRQ+0x46/0xd0
> [ 435.617107] common_interrupt+0x84/0x84
> [ 435.617108] </IRQ>
> ...
> [ 435.617117] ? cpuidle_enter_state+0x97/0x2f0
> [ 435.617118] cpuidle_enter+0x17/0x20
> [ 435.617119] call_cpuidle+0x23/0x40
> [ 435.617121] do_idle+0x18c/0x1f0
> [ 435.617122] cpu_startup_entry+0x73/0x80
> [ 435.617125] start_secondary+0x1a6/0x200
> [ 435.617127] secondary_startup_64+0xa5/0xb0
> [ 435.617128] Code: ...
>
> Anchal Agarwal (1):
>   blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait
>
> Bart Van Assche (1):
>   block: Protect less code with sysfs_lock in blk_{un,}register_queue()
>
> Jens Axboe (6):
>   blk-wbt: pass in enum wbt_flags to get_rq_wait()
>   blk-wbt: move disable check into get_limit()
>   blk-wbt: use wq_has_sleeper() for wq active check
>   blk-wbt: fix has-sleeper queueing check
>   blk-wbt: abstract out end IO completion handler
>   blk-wbt: improve waking of tasks
>
> Josef Bacik (2):
>   blk-rq-qos: refactor out common elements of blk-wbt
>   blk-wbt: wake up all when we scale up, not down
>
> Mike Snitzer (1):
>   block: properly protect the 'queue' kobj in blk_unregister_queue
>
> Omar Sandoval (2):
>   block: move some wbt helpers to blk-wbt.c
>   block: pass struct request instead of struct blk_issue_stat to wbt
>
>  block/Makefile         |   2 +-
>  block/blk-core.c       |  14 +-
>  block/blk-mq-sysfs.c   |   9 +-
>  block/blk-mq.c         |  14 +-
>  block/blk-rq-qos.c     | 178 +++++++++++++++
>  block/blk-rq-qos.h     | 106 +++++++++
>  block/blk-settings.c   |   4 +-
>  block/blk-sysfs.c      |  62 +++--
>  block/blk-wbt.c        | 499 ++++++++++++++++++++++-------------------
>  block/blk-wbt.h        |  96 +++-----
>  include/linux/blkdev.h |   4 +-
>  11 files changed, 643 insertions(+), 345 deletions(-)
>  create mode 100644 block/blk-rq-qos.c
>  create mode 100644 block/blk-rq-qos.h

Sorry, in this case, if there is any chance to make a backport of the two
changes actually addressing the problem without changing half the block layer,
I prefer changing as little as possible. That actually makes future maintenance
simpler, because it is done assuming a 4.15-based kernel.

-Stefan
On Thu, Jan 10, 2019 at 1:36 PM Stefan Bader <stefan.bader@canonical.com> wrote:
>
> On 10.01.19 04:12, Mauricio Faria de Oliveira wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1810998

[snip]

> Sorry, in this case, if there is any chance to make a backport of the two
> changes actually addressing the problem without changing half the block layer, I
> prefer changing as little as possible. That actually makes future maintenance
> simpler because that is done assuming to have a 4.15 based kernel.

No worries, that's certainly understandable.
I'll work on the simpler approach.

Thanks.

> -Stefan
>