Message ID | 20220622051731.23563-1-matthew.ruffell@canonical.com |
---|---|
Headers | show |
Series | nbd: requests can become stuck when disconnecting from server with qemu-nbd | expand |
On 6/21/22 23:17, Matthew Ruffell wrote: > BugLink: https://bugs.launchpad.net/bugs/1896350 > > [Impact] > > After 2516ab1("nbd: only clear the queue on device teardown"), present in > 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently > being processed. This change was made to fix a race between using the > NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing > requests. This worked for the most part, as several years ago systemd was > not set up to watch nbd devices for changes in their state. > > But after: > > commit f82abfcda58168d9f667e2094d438763531d3fa6 > From: Tony Asleson <tasleson@redhat.com> > Date: Fri, 8 Feb 2019 15:47:10 -0600 > Subject: rules: watch metadata changes on nbd devices > Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6 > > in systemd v242-rc1, nbd* devices were added to a udev rule to watch those > devices for changes with the inotify subsystem. From man udev: > >> watch >> Watch the device node with inotify; when the node is closed after being >> opened for writing, a change uevent is synthesized. >> >> nowatch >> Disable the watching of a device node with inotify. > > This changed the behaviour of device teardown, since systemd now keeps tabs on > the device with inotify, outstanding requests cannot be cleared as > nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get > stuck, never to complete, because a disconnect has occurred, and never to > timeout, as their timers keep being reset. > > Symptoms of this issue is that the nbd subsystem gets stuck with messages like: > > block nbd15: NBD_DISCONNECT > block nbd15: Send disconnect failed -32 > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds > ... > INFO: task qemu-nbd:1267 blocked for more than 120 seconds. > Not tainted 5.15.0-23-generic #23-Ubuntu > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002 > Call Trace: > <TASK> > __schedule+0x23d/0x590 > ? call_rcu+0xe/0x10 > schedule+0x4e/0xb0 > blk_mq_freeze_queue_wait+0x69/0xa0 > ? wait_woken+0x70/0x70 > blk_mq_freeze_queue+0x1b/0x30 > nbd_add_socket+0x76/0x1f0 [nbd] > __nbd_ioctl+0x18b/0x340 [nbd] > ? security_capable+0x3d/0x60 > nbd_ioctl+0x81/0xb0 [nbd] > blkdev_ioctl+0x12e/0x270 > ? __fget_files+0x86/0xc0 > block_ioctl+0x46/0x50 > __x64_sys_ioctl+0x91/0xc0 > do_syscall_64+0x5c/0xc0 > entry_SYSCALL_64_after_hwframe+0x44/0xae > </TASK> > > Additionally, in syslog you will also see systemd-udevd get stuck: > > systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time > > $ ps aux > ... > 419 1194 root D 0.1 systemd-udevd - > > We can workaround the issue by adding a higher priority udev rule to not watch > nbd* devices. > > $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules > # Disable inotify watching of change events for NBD devices > ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch" > EOF > > $ sudo udevadm control --reload-rules > $ sudo udevadm trigger > > [Fix] > > The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which > was introduced in 5.16, and added to in 5.19. We need to backport all commits > related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective. > > For Focal, Impish and Jammy: > > commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:44 2021 +0800 > Subject: nbd: don't handle response without a corresponding request message > Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > > commit 07175cb1baf4c51051b1fbd391097e349f9a02a9 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:45 2021 +0800 > Subject: nbd: make sure request completion won't concurrent > Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9 > > commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:46 2022 +0800 > Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e > > commit 09dadb5985023e27d4740ebd17e6fea4640110e5 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:47 2022 +0800 > Subject: nbd: fix io hung while disconnecting device > Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5 > > For Focal only (dependency commits): > > commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d > Author: Keith Busch <kbusch@kernel.org> > Date: Fri May 29 07:51:59 2020 -0700 > Subject: blk-mq: blk-mq: provide forced completion method > Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d > > commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > Author: Christoph Hellwig <hch@lst.de> > Date: Thu Jun 11 08:44:47 2020 +0200 > Subject: blk-mq: move failure injection out of blk_mq_complete_request > Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > > I want to talk about the backport of "blk-mq: move failure injection out of > blk_mq_complete_request" for Focal, since it changes a number of drivers using > the blk_mq_complete_request() call, and can be considered a large regression > risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be > enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem. > CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus > blk_should_fake_timeout() just returns false, and is more or less a nop on our > kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is > really just if(true), and I did think about simply backporting that to the > nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" > but after talking with Jay, we decided backporting the entire patch was the best > way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a > custom kernel build with the nbd subsystem. > > Note: Bionic is also affected, but the backport list to the 4.15 kernel is > unreasonable, and due to systemd in Bionic not having the udev rule set to watch > nbd* devices by default, Bionic is not directly affected by the issue. Users > would only see the problem if they add nbd* to the watch list in > /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or > later LXC container running using the hosts nbd kernel module. Because of this > Bionic will be Won't Fix. > > [Testcase] > > The issue can be easily reproduced with: > > $ sudo apt install qemu-utils > > $ cat << EOF >> reproducer.sh > #!/bin/bash > > sudo modprobe nbd > > while : > do > qemu-img create -f qcow2 foo.img 500M > sudo qemu-nbd --disconnect /dev/nbd15 || true > sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img > sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15 > sudo qemu-nbd --disconnect /dev/nbd15 > done > EOF > > $ chmod +x reproducer.sh > $ yes | ./reproducer.sh > > On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or > so, and the kernel log will be filled with stuck request messages and hung > task timeouts after the 120 second mark. > > Test kernels are available in the following ppa: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test > > If you install these test kernels, and re-run the reproducer, your system will > be perfectly stable for many hours with no issues. > > I have torture tested the backport to the Focal kernel for more than 4 hours > with no issues observed. > > [Where problems can occur] > > Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change > to the NBD subsystem is quite large, and changes how requests are processed > and accounted for, depending if they are still outstanding or not. > > For Impish and Jammy, these risks are limited to the NBD subsystem itself. > > On Focal, the risk of regression is slightly larger due to the backport of > "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat > mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT > being disabled on the Focal kernel. > > If a regression were to occur, users may see NBD requests fail, occur in the > incorrect order, or cleared at incorrect times. There is no way to disable > NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older > kernel while a fix is made. > > [Other info] > > My bug report to upstream can be found in the following mailing list thread: > https://lkml.org/lkml/2022/4/22/61 > > You may be wondering why we cannot simply just backport "nbd: fix io hung while > disconnecting device" and resolve the issue with a 1 line change. I actually > built test kernels for this scenario to see if it was possible, and they are > available here: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single > > While it did sort of fix the issue, that is, connect and disconnect from nbd > devices lasted longer than the 30 seconds the kernels lasted previously, after > about 10 minutes of torture testing, I started experiencing race conditions > of requests completing multiple times, leading to the following use after frees: > > Jammy 5.15 test kernel: > https://paste.ubuntu.com/p/FSM6DrgjTy/ > > Impish 5.13 test kernel: > https://paste.ubuntu.com/p/86tNfsM7Vs/ > > Focal 5.4 test kernel: > https://paste.ubuntu.com/p/zzVzWx23sb/ > > This was similar to a mailing list thread I found previously: > https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ > https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/ > > Keith Busch identified this as a double completion of the same request, > which is resolved through the NBD_CMD_INFLIGHT infrastructure. > > Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT > infrastructure for a complete fix. > > Christoph Hellwig (1): > blk-mq: move failure injection out of blk_mq_complete_request > > Keith Busch (1): > blk-mq: blk-mq: provide forced completion method > > Yu Kuai (4): > nbd: don't handle response without a corresponding request message > nbd: make sure request completion won't concurrent > nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > nbd: fix io hung while disconnecting device > > block/blk-mq.c | 27 ++++++------------ > block/blk-timeout.c | 6 ++-- > block/blk.h | 8 ------ > block/bsg-lib.c | 5 +++- > drivers/block/loop.c | 6 ++-- > drivers/block/mtip32xx/mtip32xx.c | 3 +- > drivers/block/nbd.c | 46 +++++++++++++++++++++++++++++-- > drivers/block/null_blk_main.c | 3 +- > drivers/block/skd_main.c | 9 ++++-- > drivers/block/virtio_blk.c | 3 +- > drivers/block/xen-blkfront.c | 3 +- > drivers/md/dm-rq.c | 3 +- > drivers/mmc/core/block.c | 6 ++-- > drivers/nvme/host/nvme.h | 3 +- > drivers/s390/block/dasd.c | 2 +- > drivers/s390/block/scm_blk.c | 3 +- > drivers/scsi/scsi_lib.c | 12 ++------ > include/linux/blk-mq.h | 11 +++++++- > 18 files changed, 99 insertions(+), 60 deletions(-) > Acked-by: Tim Gardner <tim.gardner@canonical.com> Nice work. I know this bug has been going on for awhile.
Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com> On Wed, Jun 22, 2022 at 7:18 AM Matthew Ruffell <matthew.ruffell@canonical.com> wrote: > > BugLink: https://bugs.launchpad.net/bugs/1896350 > > [Impact] > > After 2516ab1("nbd: only clear the queue on device teardown"), present in > 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently > being processed. This change was made to fix a race between using the > NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing > requests. This worked for the most part, as several years ago systemd was > not set up to watch nbd devices for changes in their state. > > But after: > > commit f82abfcda58168d9f667e2094d438763531d3fa6 > From: Tony Asleson <tasleson@redhat.com> > Date: Fri, 8 Feb 2019 15:47:10 -0600 > Subject: rules: watch metadata changes on nbd devices > Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6 > > in systemd v242-rc1, nbd* devices were added to a udev rule to watch those > devices for changes with the inotify subsystem. From man udev: > > > watch > > Watch the device node with inotify; when the node is closed after being > > opened for writing, a change uevent is synthesized. > > > > nowatch > > Disable the watching of a device node with inotify. > > This changed the behaviour of device teardown, since systemd now keeps tabs on > the device with inotify, outstanding requests cannot be cleared as > nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get > stuck, never to complete, because a disconnect has occurred, and never to > timeout, as their timers keep being reset. > > Symptoms of this issue is that the nbd subsystem gets stuck with messages like: > > block nbd15: NBD_DISCONNECT > block nbd15: Send disconnect failed -32 > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds > ... > INFO: task qemu-nbd:1267 blocked for more than 120 seconds. > Not tainted 5.15.0-23-generic #23-Ubuntu > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002 > Call Trace: > <TASK> > __schedule+0x23d/0x590 > ? call_rcu+0xe/0x10 > schedule+0x4e/0xb0 > blk_mq_freeze_queue_wait+0x69/0xa0 > ? wait_woken+0x70/0x70 > blk_mq_freeze_queue+0x1b/0x30 > nbd_add_socket+0x76/0x1f0 [nbd] > __nbd_ioctl+0x18b/0x340 [nbd] > ? security_capable+0x3d/0x60 > nbd_ioctl+0x81/0xb0 [nbd] > blkdev_ioctl+0x12e/0x270 > ? __fget_files+0x86/0xc0 > block_ioctl+0x46/0x50 > __x64_sys_ioctl+0x91/0xc0 > do_syscall_64+0x5c/0xc0 > entry_SYSCALL_64_after_hwframe+0x44/0xae > </TASK> > > Additionally, in syslog you will also see systemd-udevd get stuck: > > systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time > > $ ps aux > ... > 419 1194 root D 0.1 systemd-udevd - > > We can workaround the issue by adding a higher priority udev rule to not watch > nbd* devices. > > $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules > # Disable inotify watching of change events for NBD devices > ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch" > EOF > > $ sudo udevadm control --reload-rules > $ sudo udevadm trigger > > [Fix] > > The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which > was introduced in 5.16, and added to in 5.19. We need to backport all commits > related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective. > > For Focal, Impish and Jammy: > > commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:44 2021 +0800 > Subject: nbd: don't handle response without a corresponding request message > Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > > commit 07175cb1baf4c51051b1fbd391097e349f9a02a9 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:45 2021 +0800 > Subject: nbd: make sure request completion won't concurrent > Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9 > > commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:46 2022 +0800 > Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e > > commit 09dadb5985023e27d4740ebd17e6fea4640110e5 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:47 2022 +0800 > Subject: nbd: fix io hung while disconnecting device > Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5 > > For Focal only (dependency commits): > > commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d > Author: Keith Busch <kbusch@kernel.org> > Date: Fri May 29 07:51:59 2020 -0700 > Subject: blk-mq: blk-mq: provide forced completion method > Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d > > commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > Author: Christoph Hellwig <hch@lst.de> > Date: Thu Jun 11 08:44:47 2020 +0200 > Subject: blk-mq: move failure injection out of blk_mq_complete_request > Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > > I want to talk about the backport of "blk-mq: move failure injection out of > blk_mq_complete_request" for Focal, since it changes a number of drivers using > the blk_mq_complete_request() call, and can be considered a large regression > risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be > enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem. > CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus > blk_should_fake_timeout() just returns false, and is more or less a nop on our > kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is > really just if(true), and I did think about simply backporting that to the > nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" > but after talking with Jay, we decided backporting the entire patch was the best > way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a > custom kernel build with the nbd subsystem. > > Note: Bionic is also affected, but the backport list to the 4.15 kernel is > unreasonable, and due to systemd in Bionic not having the udev rule set to watch > nbd* devices by default, Bionic is not directly affected by the issue. Users > would only see the problem if they add nbd* to the watch list in > /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or > later LXC container running using the hosts nbd kernel module. Because of this > Bionic will be Won't Fix. > > [Testcase] > > The issue can be easily reproduced with: > > $ sudo apt install qemu-utils > > $ cat << EOF >> reproducer.sh > #!/bin/bash > > sudo modprobe nbd > > while : > do > qemu-img create -f qcow2 foo.img 500M > sudo qemu-nbd --disconnect /dev/nbd15 || true > sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img > sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15 > sudo qemu-nbd --disconnect /dev/nbd15 > done > EOF > > $ chmod +x reproducer.sh > $ yes | ./reproducer.sh > > On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or > so, and the kernel log will be filled with stuck request messages and hung > task timeouts after the 120 second mark. > > Test kernels are available in the following ppa: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test > > If you install these test kernels, and re-run the reproducer, your system will > be perfectly stable for many hours with no issues. > > I have torture tested the backport to the Focal kernel for more than 4 hours > with no issues observed. > > [Where problems can occur] > > Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change > to the NBD subsystem is quite large, and changes how requests are processed > and accounted for, depending if they are still outstanding or not. > > For Impish and Jammy, these risks are limited to the NBD subsystem itself. > > On Focal, the risk of regression is slightly larger due to the backport of > "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat > mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT > being disabled on the Focal kernel. > > If a regression were to occur, users may see NBD requests fail, occur in the > incorrect order, or cleared at incorrect times. There is no way to disable > NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older > kernel while a fix is made. > > [Other info] > > My bug report to upstream can be found in the following mailing list thread: > https://lkml.org/lkml/2022/4/22/61 > > You may be wondering why we cannot simply just backport "nbd: fix io hung while > disconnecting device" and resolve the issue with a 1 line change. I actually > built test kernels for this scenario to see if it was possible, and they are > available here: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single > > While it did sort of fix the issue, that is, connect and disconnect from nbd > devices lasted longer than the 30 seconds the kernels lasted previously, after > about 10 minutes of torture testing, I started experiencing race conditions > of requests completing multiple times, leading to the following use after frees: > > Jammy 5.15 test kernel: > https://paste.ubuntu.com/p/FSM6DrgjTy/ > > Impish 5.13 test kernel: > https://paste.ubuntu.com/p/86tNfsM7Vs/ > > Focal 5.4 test kernel: > https://paste.ubuntu.com/p/zzVzWx23sb/ > > This was similar to a mailing list thread I found previously: > https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ > https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/ > > Keith Busch identified this as a double completion of the same request, > which is resolved through the NBD_CMD_INFLIGHT infrastructure. > > Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT > infrastructure for a complete fix. > > Christoph Hellwig (1): > blk-mq: move failure injection out of blk_mq_complete_request > > Keith Busch (1): > blk-mq: blk-mq: provide forced completion method > > Yu Kuai (4): > nbd: don't handle response without a corresponding request message > nbd: make sure request completion won't concurrent > nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > nbd: fix io hung while disconnecting device > > block/blk-mq.c | 27 ++++++------------ > block/blk-timeout.c | 6 ++-- > block/blk.h | 8 ------ > block/bsg-lib.c | 5 +++- > drivers/block/loop.c | 6 ++-- > drivers/block/mtip32xx/mtip32xx.c | 3 +- > drivers/block/nbd.c | 46 +++++++++++++++++++++++++++++-- > drivers/block/null_blk_main.c | 3 +- > drivers/block/skd_main.c | 9 ++++-- > drivers/block/virtio_blk.c | 3 +- > drivers/block/xen-blkfront.c | 3 +- > drivers/md/dm-rq.c | 3 +- > drivers/mmc/core/block.c | 6 ++-- > drivers/nvme/host/nvme.h | 3 +- > drivers/s390/block/dasd.c | 2 +- > drivers/s390/block/scm_blk.c | 3 +- > drivers/scsi/scsi_lib.c | 12 ++------ > include/linux/blk-mq.h | 11 +++++++- > 18 files changed, 99 insertions(+), 60 deletions(-) > > -- > 2.34.1
On 22.06.22 07:17, Matthew Ruffell wrote: > BugLink: https://bugs.launchpad.net/bugs/1896350 > > [Impact] > > After 2516ab1("nbd: only clear the queue on device teardown"), present in > 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently > being processed. This change was made to fix a race between using the > NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing > requests. This worked for the most part, as several years ago systemd was > not set up to watch nbd devices for changes in their state. > > But after: > > commit f82abfcda58168d9f667e2094d438763531d3fa6 > From: Tony Asleson <tasleson@redhat.com> > Date: Fri, 8 Feb 2019 15:47:10 -0600 > Subject: rules: watch metadata changes on nbd devices > Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6 > > in systemd v242-rc1, nbd* devices were added to a udev rule to watch those > devices for changes with the inotify subsystem. From man udev: > >> watch >> Watch the device node with inotify; when the node is closed after being >> opened for writing, a change uevent is synthesized. >> >> nowatch >> Disable the watching of a device node with inotify. > > This changed the behaviour of device teardown, since systemd now keeps tabs on > the device with inotify, outstanding requests cannot be cleared as > nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get > stuck, never to complete, because a disconnect has occurred, and never to > timeout, as their timers keep being reset. > > Symptoms of this issue is that the nbd subsystem gets stuck with messages like: > > block nbd15: NBD_DISCONNECT > block nbd15: Send disconnect failed -32 > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds > ... > block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds > ... > INFO: task qemu-nbd:1267 blocked for more than 120 seconds. > Not tainted 5.15.0-23-generic #23-Ubuntu > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > task:qemu-nbd state:D stack: 0 pid: 1267 ppid: 1 flags:0x00000002 > Call Trace: > <TASK> > __schedule+0x23d/0x590 > ? call_rcu+0xe/0x10 > schedule+0x4e/0xb0 > blk_mq_freeze_queue_wait+0x69/0xa0 > ? wait_woken+0x70/0x70 > blk_mq_freeze_queue+0x1b/0x30 > nbd_add_socket+0x76/0x1f0 [nbd] > __nbd_ioctl+0x18b/0x340 [nbd] > ? security_capable+0x3d/0x60 > nbd_ioctl+0x81/0xb0 [nbd] > blkdev_ioctl+0x12e/0x270 > ? __fget_files+0x86/0xc0 > block_ioctl+0x46/0x50 > __x64_sys_ioctl+0x91/0xc0 > do_syscall_64+0x5c/0xc0 > entry_SYSCALL_64_after_hwframe+0x44/0xae > </TASK> > > Additionally, in syslog you will also see systemd-udevd get stuck: > > systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time > > $ ps aux > ... > 419 1194 root D 0.1 systemd-udevd - > > We can workaround the issue by adding a higher priority udev rule to not watch > nbd* devices. > > $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules > # Disable inotify watching of change events for NBD devices > ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch" > EOF > > $ sudo udevadm control --reload-rules > $ sudo udevadm trigger > > [Fix] > > The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which > was introduced in 5.16, and added to in 5.19. We need to backport all commits > related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective. > > For Focal, Impish and Jammy: > > commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:44 2021 +0800 > Subject: nbd: don't handle response without a corresponding request message > Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d > > commit 07175cb1baf4c51051b1fbd391097e349f9a02a9 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Thu Sep 16 17:33:45 2021 +0800 > Subject: nbd: make sure request completion won't concurrent > Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9 > > commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:46 2022 +0800 > Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e > > commit 09dadb5985023e27d4740ebd17e6fea4640110e5 > Author: Yu Kuai <yukuai3@huawei.com> > Date: Sat May 21 15:37:47 2022 +0800 > Subject: nbd: fix io hung while disconnecting device > Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5 > > For Focal only (dependency commits): > > commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d > Author: Keith Busch <kbusch@kernel.org> > Date: Fri May 29 07:51:59 2020 -0700 > Subject: blk-mq: blk-mq: provide forced completion method > Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d > > commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > Author: Christoph Hellwig <hch@lst.de> > Date: Thu Jun 11 08:44:47 2020 +0200 > Subject: blk-mq: move failure injection out of blk_mq_complete_request > Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2 > > I want to talk about the backport of "blk-mq: move failure injection out of > blk_mq_complete_request" for Focal, since it changes a number of drivers using > the blk_mq_complete_request() call, and can be considered a large regression > risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be > enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem. > CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus > blk_should_fake_timeout() just returns false, and is more or less a nop on our > kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is > really just if(true), and I did think about simply backporting that to the > nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" > but after talking with Jay, we decided backporting the entire patch was the best > way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a > custom kernel build with the nbd subsystem. > > Note: Bionic is also affected, but the backport list to the 4.15 kernel is > unreasonable, and due to systemd in Bionic not having the udev rule set to watch > nbd* devices by default, Bionic is not directly affected by the issue. Users > would only see the problem if they add nbd* to the watch list in > /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or > later LXC container running using the hosts nbd kernel module. Because of this > Bionic will be Won't Fix. > > [Testcase] > > The issue can be easily reproduced with: > > $ sudo apt install qemu-utils > > $ cat << EOF >> reproducer.sh > #!/bin/bash > > sudo modprobe nbd > > while : > do > qemu-img create -f qcow2 foo.img 500M > sudo qemu-nbd --disconnect /dev/nbd15 || true > sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img > sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15 > sudo qemu-nbd --disconnect /dev/nbd15 > done > EOF > > $ chmod +x reproducer.sh > $ yes | ./reproducer.sh > > On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or > so, and the kernel log will be filled with stuck request messages and hung > task timeouts after the 120 second mark. > > Test kernels are available in the following ppa: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test > > If you install these test kernels, and re-run the reproducer, your system will > be perfectly stable for many hours with no issues. > > I have torture tested the backport to the Focal kernel for more than 4 hours > with no issues observed. > > [Where problems can occur] > > Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change > to the NBD subsystem is quite large, and changes how requests are processed > and accounted for, depending if they are still outstanding or not. > > For Impish and Jammy, these risks are limited to the NBD subsystem itself. > > On Focal, the risk of regression is slightly larger due to the backport of > "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat > mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT > being disabled on the Focal kernel. > > If a regression were to occur, users may see NBD requests fail, occur in the > incorrect order, or cleared at incorrect times. There is no way to disable > NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older > kernel while a fix is made. > > [Other info] > > My bug report to upstream can be found in the following mailing list thread: > https://lkml.org/lkml/2022/4/22/61 > > You may be wondering why we cannot simply just backport "nbd: fix io hung while > disconnecting device" and resolve the issue with a 1 line change. I actually > built test kernels for this scenario to see if it was possible, and they are > available here: > > https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single > > While it did sort of fix the issue, that is, connect and disconnect from nbd > devices lasted longer than the 30 seconds the kernels lasted previously, after > about 10 minutes of torture testing, I started experiencing race conditions > of requests completing multiple times, leading to the following use after frees: > > Jammy 5.15 test kernel: > https://paste.ubuntu.com/p/FSM6DrgjTy/ > > Impish 5.13 test kernel: > https://paste.ubuntu.com/p/86tNfsM7Vs/ > > Focal 5.4 test kernel: > https://paste.ubuntu.com/p/zzVzWx23sb/ > > This was similar to a mailing list thread I found previously: > https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ > https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/ > > Keith Busch identified this as a double completion of the same request, > which is resolved through the NBD_CMD_INFLIGHT infrastructure. > > Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT > infrastructure for a complete fix. > > Christoph Hellwig (1): > blk-mq: move failure injection out of blk_mq_complete_request > > Keith Busch (1): > blk-mq: blk-mq: provide forced completion method > > Yu Kuai (4): > nbd: don't handle response without a corresponding request message > nbd: make sure request completion won't concurrent > nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed > nbd: fix io hung while disconnecting device > > block/blk-mq.c | 27 ++++++------------ > block/blk-timeout.c | 6 ++-- > block/blk.h | 8 ------ > block/bsg-lib.c | 5 +++- > drivers/block/loop.c | 6 ++-- > drivers/block/mtip32xx/mtip32xx.c | 3 +- > drivers/block/nbd.c | 46 +++++++++++++++++++++++++++++-- > drivers/block/null_blk_main.c | 3 +- > drivers/block/skd_main.c | 9 ++++-- > drivers/block/virtio_blk.c | 3 +- > drivers/block/xen-blkfront.c | 3 +- > drivers/md/dm-rq.c | 3 +- > drivers/mmc/core/block.c | 6 ++-- > drivers/nvme/host/nvme.h | 3 +- > drivers/s390/block/dasd.c | 2 +- > drivers/s390/block/scm_blk.c | 3 +- > drivers/scsi/scsi_lib.c | 12 ++------ > include/linux/blk-mq.h | 11 +++++++- > 18 files changed, 99 insertions(+), 60 deletions(-) > Applied to jammy:linux/master-next. Impish will not get more updates, so skipped. For Focal, the set does not apply. On Focal "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" fails in the last (#4) hunk. Assuming there is a backported version of that patch somewhere, could you submit this again for just Focal please? Thanks. -Stefan
On Sat, Jul 9, 2022 at 2:34 AM Stefan Bader <stefan.bader@canonical.com> wrote: > > On 22.06.22 07:17, Matthew Ruffell wrote: > > BugLink: https://bugs.launchpad.net/bugs/1896350 > > > > Applied to jammy:linux/master-next. Impish will not get more updates, so > skipped. For Focal, the set does not apply. > > On Focal "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" > fails in the last (#4) hunk. Assuming there is a backported version of that > patch somewhere, could you submit this again for just Focal please? Thanks. > > -Stefan Hi Stefan, I checked the patches, and they apply cleanly to Focal for me. I'm using git 2.34.1 on Jammy, with Focal hash 5063db9ef77566eebd845e5016ef664268039845 (Ubuntu-5.4.0-123.139). Regardless, I will send a V2 that has been output from format-patch from a Focal tree, instead of using the patch from the Jammy tree, which is identical, apart from offsets. $ git checkout -b test3 origin/master-next Branch 'test3' set up to track remote branch 'master-next' from 'origin'. Switched to a new branch 'test3' matthew@desktop:~/Work/kernel/ubuntu-focal$ diff 0005-nbd-don-t-clear-NBD_CMD_INFLIGHT-flag-if-request-is-.patch ~/Work/333142/patches/0005-nbd-don-t-clear-NBD_CMD_INFLIGHT-flag-if-request-is-.patch 1c1 < From fb45ba43e36432839dacd1c27c4c1128d025fa97 Mon Sep 17 00:00:00 2001 --- > From 0c7289ffb3837a671c85c67122666ddba9154ca8 Mon Sep 17 00:00:00 2001 4c4 < Subject: [SRU][F][PATCH V2 5/6] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if --- > Subject: [SRU][F][I][J][PATCH 5/6] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if 24c24 < index 01d030e9f301..6b165101f84b 100644 --- > index d525ad238e2d..f97f3a5adf28 100644 27c27 < @@ -394,13 +394,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, --- > @@ -411,13 +411,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, 43c43 < @@ -456,6 +457,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, --- > @@ -486,6 +487,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req, 51c51 < @@ -715,7 +717,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) --- > @@ -745,7 +747,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index) 60c60 < @@ -804,8 +806,16 @@ static void recv_work(struct work_struct *work) --- > @@ -832,8 +834,16 @@ static void recv_work(struct work_struct *work) matthew@desktop:~/Work/kernel/ubuntu-focal$ git am 000* Patch is empty. When you have resolved this problem, run "git am --continue". If you prefer to skip this patch, run "git am --skip" instead. To restore the original branch and stop patching, run "git am --abort". matthew@desktop:~/Work/kernel/ubuntu-focal$ git am --skip Applying: blk-mq: blk-mq: provide forced completion method Applying: blk-mq: move failure injection out of blk_mq_complete_request Applying: nbd: don't handle response without a corresponding request message Applying: nbd: make sure request completion won't concurrent Applying: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed Applying: nbd: fix io hung while disconnecting device Hopefully this set applies to Focal, please let me know if it does not, this would be great to have fixed. Thanks, Matthew