[SRU,F,I,J,0/6] nbd: requests can become stuck when disconnecting from server with qemu-nbd

Message ID	20220622051731.23563-1-matthew.ruffell@canonical.com
Headers	show Return-Path: <kernel-team-bounces@lists.ubuntu.com> From: Matthew Ruffell <matthew.ruffell@canonical.com> To: kernel-team@lists.ubuntu.com Subject: [SRU][F][I][J][PATCH 0/6] nbd: requests can become stuck when disconnecting from server with qemu-nbd Date: Wed, 22 Jun 2022 17:17:25 +1200 Message-Id: <20220622051731.23563-1-matthew.ruffell@canonical.com> MIME-Version: 1.0 Precedence: list Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: base64 Errors-To: kernel-team-bounces@lists.ubuntu.com Sender: "kernel-team" <kernel-team-bounces@lists.ubuntu.com>
Series	nbd: requests can become stuck when disconnecting from server with qemu-nbd \| expand [SRU,F,I,J,0/6] nbd: requests can become stuck when disconnecting from server with qemu-nbd [SRU,F,1/6] blk-mq: blk-mq: provide forced completion method [SRU,F,2/6] blk-mq: move failure injection out of blk_mq_complete_request [SRU,F,I,J,3/6] nbd: don't handle response without a corresponding request message [SRU,F,I,J,4/6] nbd: make sure request completion won't concurrent [SRU,F,I,J,5/6] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed [SRU,F,I,J,6/6] nbd: fix io hung while disconnecting device

Message ID

20220622051731.23563-1-matthew.ruffell@canonical.com

Headers

From: Matthew Ruffell <matthew.ruffell@canonical.com>
To: kernel-team@lists.ubuntu.com
Subject: [SRU][F][I][J][PATCH 0/6] nbd: requests can become stuck when
 disconnecting from server with qemu-nbd
Date: Wed, 22 Jun 2022 17:17:25 +1200
Message-Id: <20220622051731.23563-1-matthew.ruffell@canonical.com>
MIME-Version: 1.0
Precedence: list
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64
Errors-To: kernel-team-bounces@lists.ubuntu.com
Sender: "kernel-team" <kernel-team-bounces@lists.ubuntu.com>

Series

nbd: requests can become stuck when disconnecting from server with qemu-nbd | expand

Message

Matthew Ruffell June 22, 2022, 5:17 a.m. UTC

BugLink: https://bugs.launchpad.net/bugs/1896350

[Impact]

After 2516ab1("nbd: only clear the queue on device teardown"), present in 
4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
being processed. This change was made to fix a race between using the 
NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
requests. This worked for the most part, as several years ago systemd was
not set up to watch nbd devices for changes in their state.

But after:

commit f82abfcda58168d9f667e2094d438763531d3fa6
From: Tony Asleson <tasleson@redhat.com>
Date: Fri, 8 Feb 2019 15:47:10 -0600
Subject: rules: watch metadata changes on nbd devices
Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6

in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
devices for changes with the inotify subsystem. From man udev:

> watch
>   Watch the device node with inotify; when the node is closed after being 
>   opened for writing, a change uevent is synthesized.
>
> nowatch
>   Disable the watching of a device node with inotify.

This changed the behaviour of device teardown, since systemd now keeps tabs on
the device with inotify, outstanding requests cannot be cleared as 
nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
stuck, never to complete, because a disconnect has occurred, and never to
timeout, as their timers keep being reset.

Symptoms of this issue is that the nbd subsystem gets stuck with messages like:

block nbd15: NBD_DISCONNECT
block nbd15: Send disconnect failed -32
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds
...
block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds
...
INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
      Not tainted 5.15.0-23-generic #23-Ubuntu
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 flags:0x00000002
Call Trace:
 <TASK>
 __schedule+0x23d/0x590
 ? call_rcu+0xe/0x10
 schedule+0x4e/0xb0
 blk_mq_freeze_queue_wait+0x69/0xa0
 ? wait_woken+0x70/0x70
 blk_mq_freeze_queue+0x1b/0x30
 nbd_add_socket+0x76/0x1f0 [nbd]
 __nbd_ioctl+0x18b/0x340 [nbd]
 ? security_capable+0x3d/0x60
 nbd_ioctl+0x81/0xb0 [nbd]
 blkdev_ioctl+0x12e/0x270
 ? __fget_files+0x86/0xc0
 block_ioctl+0x46/0x50
 __x64_sys_ioctl+0x91/0xc0
 do_syscall_64+0x5c/0xc0
 entry_SYSCALL_64_after_hwframe+0x44/0xae
 </TASK>

Additionally, in syslog you will also see systemd-udevd get stuck:

systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time

$ ps aux
...
419    1194 root     D     0.1 systemd-udevd   -

We can workaround the issue by adding a higher priority udev rule to not watch
nbd* devices.

$ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
# Disable inotify watching of change events for NBD devices
ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
EOF

$ sudo udevadm control --reload-rules
$ sudo udevadm trigger

[Fix]

The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
was introduced in 5.16, and added to in 5.19. We need to backport all commits
related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.

For Focal, Impish and Jammy:

commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Sep 16 17:33:44 2021 +0800
Subject: nbd: don't handle response without a corresponding request message
Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d

commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Thu Sep 16 17:33:45 2021 +0800
Subject: nbd: make sure request completion won't concurrent
Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9

commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat May 21 15:37:46 2022 +0800
Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e

commit 09dadb5985023e27d4740ebd17e6fea4640110e5
Author: Yu Kuai <yukuai3@huawei.com>
Date:   Sat May 21 15:37:47 2022 +0800
Subject: nbd: fix io hung while disconnecting device
Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5

For Focal only (dependency commits):

commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
Author: Keith Busch <kbusch@kernel.org>
Date:   Fri May 29 07:51:59 2020 -0700
Subject: blk-mq: blk-mq: provide forced completion method
Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d

commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
Author: Christoph Hellwig <hch@lst.de>
Date:   Thu Jun 11 08:44:47 2020 +0200
Subject: blk-mq: move failure injection out of blk_mq_complete_request
Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2

I want to talk about the backport of "blk-mq: move failure injection out of 
blk_mq_complete_request" for Focal, since it changes a number of drivers using
the blk_mq_complete_request() call, and can be considered a large regression
risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus 
blk_should_fake_timeout() just returns false, and is more or less a nop on our
kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
really just if(true), and I did think about simply backporting that to the
nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
but after talking with Jay, we decided backporting the entire patch was the best
way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
custom kernel build with the nbd subsystem.

Note: Bionic is also affected, but the backport list to the 4.15 kernel is
unreasonable, and due to systemd in Bionic not having the udev rule set to watch
nbd* devices by default, Bionic is not directly affected by the issue. Users
would only see the problem if they add nbd* to the watch list in 
/usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
later LXC container running using the hosts nbd kernel module. Because of this
Bionic will be Won't Fix.

[Testcase]

The issue can be easily reproduced with:

$ sudo apt install qemu-utils

$ cat << EOF >> reproducer.sh
#!/bin/bash

sudo modprobe nbd

while :
do
        qemu-img create -f qcow2 foo.img 500M
        sudo qemu-nbd --disconnect /dev/nbd15 || true
        sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
        sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
        sudo qemu-nbd --disconnect /dev/nbd15
done
EOF

$ chmod +x reproducer.sh
$ yes | ./reproducer.sh

On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
so, and the kernel log will be filled with stuck request messages and hung
task timeouts after the 120 second mark.

Test kernels are available in the following ppa:

https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test

If you install these test kernels, and re-run the reproducer, your system will
be perfectly stable for many hours with no issues.

I have torture tested the backport to the Focal kernel for more than 4 hours
with no issues observed.

[Where problems can occur]

Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
to the NBD subsystem is quite large, and changes how requests are processed
and accounted for, depending if they are still outstanding or not.

For Impish and Jammy, these risks are limited to the NBD subsystem itself.

On Focal, the risk of regression is slightly larger due to the backport of 
"blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
being disabled on the Focal kernel.

If a regression were to occur, users may see NBD requests fail, occur in the
incorrect order, or cleared at incorrect times. There is no way to disable
NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
kernel while a fix is made.

[Other info]

My bug report to upstream can be found in the following mailing list thread:
https://lkml.org/lkml/2022/4/22/61

You may be wondering why we cannot simply just backport "nbd: fix io hung while
disconnecting device" and resolve the issue with a 1 line change. I actually
built test kernels for this scenario to see if it was possible, and they are
available here:

https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single

While it did sort of fix the issue, that is, connect and disconnect from nbd
devices lasted longer than the 30 seconds the kernels lasted previously, after
about 10 minutes of torture testing, I started experiencing race conditions
of requests completing multiple times, leading to the following use after frees:

Jammy 5.15 test kernel:
https://paste.ubuntu.com/p/FSM6DrgjTy/

Impish 5.13 test kernel:
https://paste.ubuntu.com/p/86tNfsM7Vs/

Focal 5.4 test kernel:
https://paste.ubuntu.com/p/zzVzWx23sb/

This was similar to a mailing list thread I found previously:
https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/

Keith Busch identified this as a double completion of the same request,
which is resolved through the NBD_CMD_INFLIGHT infrastructure.

Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
infrastructure for a complete fix.

Christoph Hellwig (1):
  blk-mq: move failure injection out of blk_mq_complete_request

Keith Busch (1):
  blk-mq: blk-mq: provide forced completion method

Yu Kuai (4):
  nbd: don't handle response without a corresponding request message
  nbd: make sure request completion won't concurrent
  nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
  nbd: fix io hung while disconnecting device

 block/blk-mq.c                    | 27 ++++++------------
 block/blk-timeout.c               |  6 ++--
 block/blk.h                       |  8 ------
 block/bsg-lib.c                   |  5 +++-
 drivers/block/loop.c              |  6 ++--
 drivers/block/mtip32xx/mtip32xx.c |  3 +-
 drivers/block/nbd.c               | 46 +++++++++++++++++++++++++++++--
 drivers/block/null_blk_main.c     |  3 +-
 drivers/block/skd_main.c          |  9 ++++--
 drivers/block/virtio_blk.c        |  3 +-
 drivers/block/xen-blkfront.c      |  3 +-
 drivers/md/dm-rq.c                |  3 +-
 drivers/mmc/core/block.c          |  6 ++--
 drivers/nvme/host/nvme.h          |  3 +-
 drivers/s390/block/dasd.c         |  2 +-
 drivers/s390/block/scm_blk.c      |  3 +-
 drivers/scsi/scsi_lib.c           | 12 ++------
 include/linux/blk-mq.h            | 11 +++++++-
 18 files changed, 99 insertions(+), 60 deletions(-)

Comments

Tim Gardner June 22, 2022, 12:34 p.m. UTC | #1

On 6/21/22 23:17, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896350
> 
> [Impact]
> 
> After 2516ab1("nbd: only clear the queue on device teardown"), present in
> 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
> being processed. This change was made to fix a race between using the
> NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
> requests. This worked for the most part, as several years ago systemd was
> not set up to watch nbd devices for changes in their state.
> 
> But after:
> 
> commit f82abfcda58168d9f667e2094d438763531d3fa6
> From: Tony Asleson <tasleson@redhat.com>
> Date: Fri, 8 Feb 2019 15:47:10 -0600
> Subject: rules: watch metadata changes on nbd devices
> Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6
> 
> in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
> devices for changes with the inotify subsystem. From man udev:
> 
>> watch
>>    Watch the device node with inotify; when the node is closed after being
>>    opened for writing, a change uevent is synthesized.
>>
>> nowatch
>>    Disable the watching of a device node with inotify.
> 
> This changed the behaviour of device teardown, since systemd now keeps tabs on
> the device with inotify, outstanding requests cannot be cleared as
> nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
> stuck, never to complete, because a disconnect has occurred, and never to
> timeout, as their timers keep being reset.
> 
> Symptoms of this issue is that the nbd subsystem gets stuck with messages like:
> 
> block nbd15: NBD_DISCONNECT
> block nbd15: Send disconnect failed -32
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds
> ...
> INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
>        Not tainted 5.15.0-23-generic #23-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 flags:0x00000002
> Call Trace:
>   <TASK>
>   __schedule+0x23d/0x590
>   ? call_rcu+0xe/0x10
>   schedule+0x4e/0xb0
>   blk_mq_freeze_queue_wait+0x69/0xa0
>   ? wait_woken+0x70/0x70
>   blk_mq_freeze_queue+0x1b/0x30
>   nbd_add_socket+0x76/0x1f0 [nbd]
>   __nbd_ioctl+0x18b/0x340 [nbd]
>   ? security_capable+0x3d/0x60
>   nbd_ioctl+0x81/0xb0 [nbd]
>   blkdev_ioctl+0x12e/0x270
>   ? __fget_files+0x86/0xc0
>   block_ioctl+0x46/0x50
>   __x64_sys_ioctl+0x91/0xc0
>   do_syscall_64+0x5c/0xc0
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>   </TASK>
> 
> Additionally, in syslog you will also see systemd-udevd get stuck:
> 
> systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
> 
> $ ps aux
> ...
> 419    1194 root     D     0.1 systemd-udevd   -
> 
> We can workaround the issue by adding a higher priority udev rule to not watch
> nbd* devices.
> 
> $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
> # Disable inotify watching of change events for NBD devices
> ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
> EOF
> 
> $ sudo udevadm control --reload-rules
> $ sudo udevadm trigger
> 
> [Fix]
> 
> The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
> was introduced in 5.16, and added to in 5.19. We need to backport all commits
> related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
> 
> For Focal, Impish and Jammy:
> 
> commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:44 2021 +0800
> Subject: nbd: don't handle response without a corresponding request message
> Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> 
> commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:45 2021 +0800
> Subject: nbd: make sure request completion won't concurrent
> Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9
> 
> commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:46 2022 +0800
> Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> 
> commit 09dadb5985023e27d4740ebd17e6fea4640110e5
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:47 2022 +0800
> Subject: nbd: fix io hung while disconnecting device
> Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5
> 
> For Focal only (dependency commits):
> 
> commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
> Author: Keith Busch <kbusch@kernel.org>
> Date:   Fri May 29 07:51:59 2020 -0700
> Subject: blk-mq: blk-mq: provide forced completion method
> Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d
> 
> commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Thu Jun 11 08:44:47 2020 +0200
> Subject: blk-mq: move failure injection out of blk_mq_complete_request
> Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> 
> I want to talk about the backport of "blk-mq: move failure injection out of
> blk_mq_complete_request" for Focal, since it changes a number of drivers using
> the blk_mq_complete_request() call, and can be considered a large regression
> risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
> enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
> CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
> blk_should_fake_timeout() just returns false, and is more or less a nop on our
> kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
> really just if(true), and I did think about simply backporting that to the
> nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> but after talking with Jay, we decided backporting the entire patch was the best
> way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
> custom kernel build with the nbd subsystem.
> 
> Note: Bionic is also affected, but the backport list to the 4.15 kernel is
> unreasonable, and due to systemd in Bionic not having the udev rule set to watch
> nbd* devices by default, Bionic is not directly affected by the issue. Users
> would only see the problem if they add nbd* to the watch list in
> /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
> later LXC container running using the hosts nbd kernel module. Because of this
> Bionic will be Won't Fix.
> 
> [Testcase]
> 
> The issue can be easily reproduced with:
> 
> $ sudo apt install qemu-utils
> 
> $ cat << EOF >> reproducer.sh
> #!/bin/bash
> 
> sudo modprobe nbd
> 
> while :
> do
>          qemu-img create -f qcow2 foo.img 500M
>          sudo qemu-nbd --disconnect /dev/nbd15 || true
>          sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
>          sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
>          sudo qemu-nbd --disconnect /dev/nbd15
> done
> EOF
> 
> $ chmod +x reproducer.sh
> $ yes | ./reproducer.sh
> 
> On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
> so, and the kernel log will be filled with stuck request messages and hung
> task timeouts after the 120 second mark.
> 
> Test kernels are available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test
> 
> If you install these test kernels, and re-run the reproducer, your system will
> be perfectly stable for many hours with no issues.
> 
> I have torture tested the backport to the Focal kernel for more than 4 hours
> with no issues observed.
> 
> [Where problems can occur]
> 
> Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
> to the NBD subsystem is quite large, and changes how requests are processed
> and accounted for, depending if they are still outstanding or not.
> 
> For Impish and Jammy, these risks are limited to the NBD subsystem itself.
> 
> On Focal, the risk of regression is slightly larger due to the backport of
> "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
> mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
> being disabled on the Focal kernel.
> 
> If a regression were to occur, users may see NBD requests fail, occur in the
> incorrect order, or cleared at incorrect times. There is no way to disable
> NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
> kernel while a fix is made.
> 
> [Other info]
> 
> My bug report to upstream can be found in the following mailing list thread:
> https://lkml.org/lkml/2022/4/22/61
> 
> You may be wondering why we cannot simply just backport "nbd: fix io hung while
> disconnecting device" and resolve the issue with a 1 line change. I actually
> built test kernels for this scenario to see if it was possible, and they are
> available here:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single
> 
> While it did sort of fix the issue, that is, connect and disconnect from nbd
> devices lasted longer than the 30 seconds the kernels lasted previously, after
> about 10 minutes of torture testing, I started experiencing race conditions
> of requests completing multiple times, leading to the following use after frees:
> 
> Jammy 5.15 test kernel:
> https://paste.ubuntu.com/p/FSM6DrgjTy/
> 
> Impish 5.13 test kernel:
> https://paste.ubuntu.com/p/86tNfsM7Vs/
> 
> Focal 5.4 test kernel:
> https://paste.ubuntu.com/p/zzVzWx23sb/
> 
> This was similar to a mailing list thread I found previously:
> https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
> https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/
> 
> Keith Busch identified this as a double completion of the same request,
> which is resolved through the NBD_CMD_INFLIGHT infrastructure.
> 
> Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
> infrastructure for a complete fix.
> 
> Christoph Hellwig (1):
>    blk-mq: move failure injection out of blk_mq_complete_request
> 
> Keith Busch (1):
>    blk-mq: blk-mq: provide forced completion method
> 
> Yu Kuai (4):
>    nbd: don't handle response without a corresponding request message
>    nbd: make sure request completion won't concurrent
>    nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>    nbd: fix io hung while disconnecting device
> 
>   block/blk-mq.c                    | 27 ++++++------------
>   block/blk-timeout.c               |  6 ++--
>   block/blk.h                       |  8 ------
>   block/bsg-lib.c                   |  5 +++-
>   drivers/block/loop.c              |  6 ++--
>   drivers/block/mtip32xx/mtip32xx.c |  3 +-
>   drivers/block/nbd.c               | 46 +++++++++++++++++++++++++++++--
>   drivers/block/null_blk_main.c     |  3 +-
>   drivers/block/skd_main.c          |  9 ++++--
>   drivers/block/virtio_blk.c        |  3 +-
>   drivers/block/xen-blkfront.c      |  3 +-
>   drivers/md/dm-rq.c                |  3 +-
>   drivers/mmc/core/block.c          |  6 ++--
>   drivers/nvme/host/nvme.h          |  3 +-
>   drivers/s390/block/dasd.c         |  2 +-
>   drivers/s390/block/scm_blk.c      |  3 +-
>   drivers/scsi/scsi_lib.c           | 12 ++------
>   include/linux/blk-mq.h            | 11 +++++++-
>   18 files changed, 99 insertions(+), 60 deletions(-)
> 
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Nice work. I know this bug has been going on for awhile.

Bartlomiej Zolnierkiewicz June 23, 2022, 5:01 p.m. UTC | #2

Acked-by: Bartlomiej Zolnierkiewicz <bartlomiej.zolnierkiewicz@canonical.com>

On Wed, Jun 22, 2022 at 7:18 AM Matthew Ruffell
<matthew.ruffell@canonical.com> wrote:
>
> BugLink: https://bugs.launchpad.net/bugs/1896350
>
> [Impact]
>
> After 2516ab1("nbd: only clear the queue on device teardown"), present in
> 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
> being processed. This change was made to fix a race between using the
> NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
> requests. This worked for the most part, as several years ago systemd was
> not set up to watch nbd devices for changes in their state.
>
> But after:
>
> commit f82abfcda58168d9f667e2094d438763531d3fa6
> From: Tony Asleson <tasleson@redhat.com>
> Date: Fri, 8 Feb 2019 15:47:10 -0600
> Subject: rules: watch metadata changes on nbd devices
> Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6
>
> in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
> devices for changes with the inotify subsystem. From man udev:
>
> > watch
> >   Watch the device node with inotify; when the node is closed after being
> >   opened for writing, a change uevent is synthesized.
> >
> > nowatch
> >   Disable the watching of a device node with inotify.
>
> This changed the behaviour of device teardown, since systemd now keeps tabs on
> the device with inotify, outstanding requests cannot be cleared as
> nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
> stuck, never to complete, because a disconnect has occurred, and never to
> timeout, as their timers keep being reset.
>
> Symptoms of this issue is that the nbd subsystem gets stuck with messages like:
>
> block nbd15: NBD_DISCONNECT
> block nbd15: Send disconnect failed -32
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds
> ...
> INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
>       Not tainted 5.15.0-23-generic #23-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 flags:0x00000002
> Call Trace:
>  <TASK>
>  __schedule+0x23d/0x590
>  ? call_rcu+0xe/0x10
>  schedule+0x4e/0xb0
>  blk_mq_freeze_queue_wait+0x69/0xa0
>  ? wait_woken+0x70/0x70
>  blk_mq_freeze_queue+0x1b/0x30
>  nbd_add_socket+0x76/0x1f0 [nbd]
>  __nbd_ioctl+0x18b/0x340 [nbd]
>  ? security_capable+0x3d/0x60
>  nbd_ioctl+0x81/0xb0 [nbd]
>  blkdev_ioctl+0x12e/0x270
>  ? __fget_files+0x86/0xc0
>  block_ioctl+0x46/0x50
>  __x64_sys_ioctl+0x91/0xc0
>  do_syscall_64+0x5c/0xc0
>  entry_SYSCALL_64_after_hwframe+0x44/0xae
>  </TASK>
>
> Additionally, in syslog you will also see systemd-udevd get stuck:
>
> systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
>
> $ ps aux
> ...
> 419    1194 root     D     0.1 systemd-udevd   -
>
> We can workaround the issue by adding a higher priority udev rule to not watch
> nbd* devices.
>
> $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
> # Disable inotify watching of change events for NBD devices
> ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
> EOF
>
> $ sudo udevadm control --reload-rules
> $ sudo udevadm trigger
>
> [Fix]
>
> The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
> was introduced in 5.16, and added to in 5.19. We need to backport all commits
> related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
>
> For Focal, Impish and Jammy:
>
> commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:44 2021 +0800
> Subject: nbd: don't handle response without a corresponding request message
> Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
>
> commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:45 2021 +0800
> Subject: nbd: make sure request completion won't concurrent
> Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9
>
> commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:46 2022 +0800
> Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e
>
> commit 09dadb5985023e27d4740ebd17e6fea4640110e5
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:47 2022 +0800
> Subject: nbd: fix io hung while disconnecting device
> Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5
>
> For Focal only (dependency commits):
>
> commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
> Author: Keith Busch <kbusch@kernel.org>
> Date:   Fri May 29 07:51:59 2020 -0700
> Subject: blk-mq: blk-mq: provide forced completion method
> Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d
>
> commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Thu Jun 11 08:44:47 2020 +0200
> Subject: blk-mq: move failure injection out of blk_mq_complete_request
> Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
>
> I want to talk about the backport of "blk-mq: move failure injection out of
> blk_mq_complete_request" for Focal, since it changes a number of drivers using
> the blk_mq_complete_request() call, and can be considered a large regression
> risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
> enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
> CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
> blk_should_fake_timeout() just returns false, and is more or less a nop on our
> kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
> really just if(true), and I did think about simply backporting that to the
> nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> but after talking with Jay, we decided backporting the entire patch was the best
> way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
> custom kernel build with the nbd subsystem.
>
> Note: Bionic is also affected, but the backport list to the 4.15 kernel is
> unreasonable, and due to systemd in Bionic not having the udev rule set to watch
> nbd* devices by default, Bionic is not directly affected by the issue. Users
> would only see the problem if they add nbd* to the watch list in
> /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
> later LXC container running using the hosts nbd kernel module. Because of this
> Bionic will be Won't Fix.
>
> [Testcase]
>
> The issue can be easily reproduced with:
>
> $ sudo apt install qemu-utils
>
> $ cat << EOF >> reproducer.sh
> #!/bin/bash
>
> sudo modprobe nbd
>
> while :
> do
>         qemu-img create -f qcow2 foo.img 500M
>         sudo qemu-nbd --disconnect /dev/nbd15 || true
>         sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
>         sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
>         sudo qemu-nbd --disconnect /dev/nbd15
> done
> EOF
>
> $ chmod +x reproducer.sh
> $ yes | ./reproducer.sh
>
> On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
> so, and the kernel log will be filled with stuck request messages and hung
> task timeouts after the 120 second mark.
>
> Test kernels are available in the following ppa:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test
>
> If you install these test kernels, and re-run the reproducer, your system will
> be perfectly stable for many hours with no issues.
>
> I have torture tested the backport to the Focal kernel for more than 4 hours
> with no issues observed.
>
> [Where problems can occur]
>
> Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
> to the NBD subsystem is quite large, and changes how requests are processed
> and accounted for, depending if they are still outstanding or not.
>
> For Impish and Jammy, these risks are limited to the NBD subsystem itself.
>
> On Focal, the risk of regression is slightly larger due to the backport of
> "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
> mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
> being disabled on the Focal kernel.
>
> If a regression were to occur, users may see NBD requests fail, occur in the
> incorrect order, or cleared at incorrect times. There is no way to disable
> NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
> kernel while a fix is made.
>
> [Other info]
>
> My bug report to upstream can be found in the following mailing list thread:
> https://lkml.org/lkml/2022/4/22/61
>
> You may be wondering why we cannot simply just backport "nbd: fix io hung while
> disconnecting device" and resolve the issue with a 1 line change. I actually
> built test kernels for this scenario to see if it was possible, and they are
> available here:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single
>
> While it did sort of fix the issue, that is, connect and disconnect from nbd
> devices lasted longer than the 30 seconds the kernels lasted previously, after
> about 10 minutes of torture testing, I started experiencing race conditions
> of requests completing multiple times, leading to the following use after frees:
>
> Jammy 5.15 test kernel:
> https://paste.ubuntu.com/p/FSM6DrgjTy/
>
> Impish 5.13 test kernel:
> https://paste.ubuntu.com/p/86tNfsM7Vs/
>
> Focal 5.4 test kernel:
> https://paste.ubuntu.com/p/zzVzWx23sb/
>
> This was similar to a mailing list thread I found previously:
> https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
> https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/
>
> Keith Busch identified this as a double completion of the same request,
> which is resolved through the NBD_CMD_INFLIGHT infrastructure.
>
> Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
> infrastructure for a complete fix.
>
> Christoph Hellwig (1):
>   blk-mq: move failure injection out of blk_mq_complete_request
>
> Keith Busch (1):
>   blk-mq: blk-mq: provide forced completion method
>
> Yu Kuai (4):
>   nbd: don't handle response without a corresponding request message
>   nbd: make sure request completion won't concurrent
>   nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>   nbd: fix io hung while disconnecting device
>
>  block/blk-mq.c                    | 27 ++++++------------
>  block/blk-timeout.c               |  6 ++--
>  block/blk.h                       |  8 ------
>  block/bsg-lib.c                   |  5 +++-
>  drivers/block/loop.c              |  6 ++--
>  drivers/block/mtip32xx/mtip32xx.c |  3 +-
>  drivers/block/nbd.c               | 46 +++++++++++++++++++++++++++++--
>  drivers/block/null_blk_main.c     |  3 +-
>  drivers/block/skd_main.c          |  9 ++++--
>  drivers/block/virtio_blk.c        |  3 +-
>  drivers/block/xen-blkfront.c      |  3 +-
>  drivers/md/dm-rq.c                |  3 +-
>  drivers/mmc/core/block.c          |  6 ++--
>  drivers/nvme/host/nvme.h          |  3 +-
>  drivers/s390/block/dasd.c         |  2 +-
>  drivers/s390/block/scm_blk.c      |  3 +-
>  drivers/scsi/scsi_lib.c           | 12 ++------
>  include/linux/blk-mq.h            | 11 +++++++-
>  18 files changed, 99 insertions(+), 60 deletions(-)
>
> --
> 2.34.1

Stefan Bader July 8, 2022, 2:34 p.m. UTC | #3

On 22.06.22 07:17, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896350
> 
> [Impact]
> 
> After 2516ab1("nbd: only clear the queue on device teardown"), present in
> 4.12-rc1 onward, the ioctl NBD_CLEAR_SOCK can no longer clear requests currently
> being processed. This change was made to fix a race between using the
> NBD_CLEAR_SOCK ioctl to clear requests, and teardown of the device clearing
> requests. This worked for the most part, as several years ago systemd was
> not set up to watch nbd devices for changes in their state.
> 
> But after:
> 
> commit f82abfcda58168d9f667e2094d438763531d3fa6
> From: Tony Asleson <tasleson@redhat.com>
> Date: Fri, 8 Feb 2019 15:47:10 -0600
> Subject: rules: watch metadata changes on nbd devices
> Link: https://github.com/systemd/systemd/commit/f82abfcda58168d9f667e2094d438763531d3fa6
> 
> in systemd v242-rc1, nbd* devices were added to a udev rule to watch those
> devices for changes with the inotify subsystem. From man udev:
> 
>> watch
>>    Watch the device node with inotify; when the node is closed after being
>>    opened for writing, a change uevent is synthesized.
>>
>> nowatch
>>    Disable the watching of a device node with inotify.
> 
> This changed the behaviour of device teardown, since systemd now keeps tabs on
> the device with inotify, outstanding requests cannot be cleared as
> nbd_xmit_timeout() will always return 'BLK_EH_RESET_TIMER', and requests get
> stuck, never to complete, because a disconnect has occurred, and never to
> timeout, as their timers keep being reset.
> 
> Symptoms of this issue is that the nbd subsystem gets stuck with messages like:
> 
> block nbd15: NBD_DISCONNECT
> block nbd15: Send disconnect failed -32
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 30 seconds
> ...
> block nbd15: Possible stuck request 000000007fcf62ba: control (read@523915264,24576B). Runtime 150 seconds
> ...
> INFO: task qemu-nbd:1267 blocked for more than 120 seconds.
>        Not tainted 5.15.0-23-generic #23-Ubuntu
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> task:qemu-nbd        state:D stack:    0 pid: 1267 ppid:     1 flags:0x00000002
> Call Trace:
>   <TASK>
>   __schedule+0x23d/0x590
>   ? call_rcu+0xe/0x10
>   schedule+0x4e/0xb0
>   blk_mq_freeze_queue_wait+0x69/0xa0
>   ? wait_woken+0x70/0x70
>   blk_mq_freeze_queue+0x1b/0x30
>   nbd_add_socket+0x76/0x1f0 [nbd]
>   __nbd_ioctl+0x18b/0x340 [nbd]
>   ? security_capable+0x3d/0x60
>   nbd_ioctl+0x81/0xb0 [nbd]
>   blkdev_ioctl+0x12e/0x270
>   ? __fget_files+0x86/0xc0
>   block_ioctl+0x46/0x50
>   __x64_sys_ioctl+0x91/0xc0
>   do_syscall_64+0x5c/0xc0
>   entry_SYSCALL_64_after_hwframe+0x44/0xae
>   </TASK>
> 
> Additionally, in syslog you will also see systemd-udevd get stuck:
> 
> systemd-udevd[419]: nbd15: Worker [2004] processing SEQNUM=5661 is taking a long time
> 
> $ ps aux
> ...
> 419    1194 root     D     0.1 systemd-udevd   -
> 
> We can workaround the issue by adding a higher priority udev rule to not watch
> nbd* devices.
> 
> $ cat << EOF >> /etc/udev/rules.d/97-nbd-device.rules
> # Disable inotify watching of change events for NBD devices
> ACTION=="add|change", KERNEL=="nbd*", OPTIONS:="nowatch"
> EOF
> 
> $ sudo udevadm control --reload-rules
> $ sudo udevadm trigger
> 
> [Fix]
> 
> The fix relies on infrastructure provided by the flag NBD_CMD_INFLIGHT, which
> was introduced in 5.16, and added to in 5.19. We need to backport all commits
> related to NBD_CMD_INFLIGHT to our kernels for the fix to be effective.
> 
> For Focal, Impish and Jammy:
> 
> commit 4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:44 2021 +0800
> Subject: nbd: don't handle response without a corresponding request message
> Link: https://github.com/torvalds/linux/commit/4e6eef5dc25b528e08ac5b5f64f6ca9d9987241d
> 
> commit 07175cb1baf4c51051b1fbd391097e349f9a02a9
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Thu Sep 16 17:33:45 2021 +0800
> Subject: nbd: make sure request completion won't concurrent
> Link: https://github.com/torvalds/linux/commit/07175cb1baf4c51051b1fbd391097e349f9a02a9
> 
> commit 2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:46 2022 +0800
> Subject: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
> Link: https://github.com/torvalds/linux/commit/2895f1831e911ca87d4efdf43e35eb72a0c7e66e
> 
> commit 09dadb5985023e27d4740ebd17e6fea4640110e5
> Author: Yu Kuai <yukuai3@huawei.com>
> Date:   Sat May 21 15:37:47 2022 +0800
> Subject: nbd: fix io hung while disconnecting device
> Link: https://github.com/torvalds/linux/commit/09dadb5985023e27d4740ebd17e6fea4640110e5
> 
> For Focal only (dependency commits):
> 
> commit 7b11eab041dacfeaaa6d27d9183b247a995bc16d
> Author: Keith Busch <kbusch@kernel.org>
> Date:   Fri May 29 07:51:59 2020 -0700
> Subject: blk-mq: blk-mq: provide forced completion method
> Link: https://github.com/torvalds/linux/commit/7b11eab041dacfeaaa6d27d9183b247a995bc16d
> 
> commit 15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> Author: Christoph Hellwig <hch@lst.de>
> Date:   Thu Jun 11 08:44:47 2020 +0200
> Subject: blk-mq: move failure injection out of blk_mq_complete_request
> Link: https://github.com/torvalds/linux/commit/15f73f5b3e5958f2d169fe13c420eeeeae07bbf2
> 
> I want to talk about the backport of "blk-mq: move failure injection out of
> blk_mq_complete_request" for Focal, since it changes a number of drivers using
> the blk_mq_complete_request() call, and can be considered a large regression
> risk. Now, blk_should_fake_timeout() relies on CONFIG_FAIL_IO_TIMEOUT to be
> enabled, as well as QUEUE_FLAG_FAIL_IO being present in the blk subsystem.
> CONFIG_FAIL_IO_TIMEOUT is not enabled on Ubuntu kernels, and thus
> blk_should_fake_timeout() just returns false, and is more or less a nop on our
> kernels. Because of this, if (likely(!blk_should_fake_timeout(req->q))), is
> really just if(true), and I did think about simply backporting that to the
> nbd patch "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> but after talking with Jay, we decided backporting the entire patch was the best
> way to proceed if any of our users wish to use CONFIG_FAIL_IO_TIMEOUT on a
> custom kernel build with the nbd subsystem.
> 
> Note: Bionic is also affected, but the backport list to the 4.15 kernel is
> unreasonable, and due to systemd in Bionic not having the udev rule set to watch
> nbd* devices by default, Bionic is not directly affected by the issue. Users
> would only see the problem if they add nbd* to the watch list in
> /usr/lib/udev/rules.d/60-block.rules, or if they had a privileged 20.04 or
> later LXC container running using the hosts nbd kernel module. Because of this
> Bionic will be Won't Fix.
> 
> [Testcase]
> 
> The issue can be easily reproduced with:
> 
> $ sudo apt install qemu-utils
> 
> $ cat << EOF >> reproducer.sh
> #!/bin/bash
> 
> sudo modprobe nbd
> 
> while :
> do
>          qemu-img create -f qcow2 foo.img 500M
>          sudo qemu-nbd --disconnect /dev/nbd15 || true
>          sudo qemu-nbd --connect=/dev/nbd15 --cache=writeback --format=qcow2 foo.img
>          sudo mkfs.ext4 -L root -O "^64bit" -E nodiscard /dev/nbd15
>          sudo qemu-nbd --disconnect /dev/nbd15
> done
> EOF
> 
> $ chmod +x reproducer.sh
> $ yes | ./reproducer.sh
> 
> On the Ubuntu kernels, you will see the nbd subsystem hang within 30 seconds or
> so, and the kernel log will be filled with stuck request messages and hung
> task timeouts after the 120 second mark.
> 
> Test kernels are available in the following ppa:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test
> 
> If you install these test kernels, and re-run the reproducer, your system will
> be perfectly stable for many hours with no issues.
> 
> I have torture tested the backport to the Focal kernel for more than 4 hours
> with no issues observed.
> 
> [Where problems can occur]
> 
> Due to needing to backport the infrastructure for NBD_CMD_INFLIGHT, this change
> to the NBD subsystem is quite large, and changes how requests are processed
> and accounted for, depending if they are still outstanding or not.
> 
> For Impish and Jammy, these risks are limited to the NBD subsystem itself.
> 
> On Focal, the risk of regression is slightly larger due to the backport of
> "blk-mq: move failure injection out of blk_mq_complete_request", but is somewhat
> mitigated by blk_should_fake_timeout() being a NOP due to CONFIG_FAIL_IO_TIMEOUT
> being disabled on the Focal kernel.
> 
> If a regression were to occur, users may see NBD requests fail, occur in the
> incorrect order, or cleared at incorrect times. There is no way to disable
> NBD_CMD_INFLIGHT during runtime, and users would need to revert to an older
> kernel while a fix is made.
> 
> [Other info]
> 
> My bug report to upstream can be found in the following mailing list thread:
> https://lkml.org/lkml/2022/4/22/61
> 
> You may be wondering why we cannot simply just backport "nbd: fix io hung while
> disconnecting device" and resolve the issue with a 1 line change. I actually
> built test kernels for this scenario to see if it was possible, and they are
> available here:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/sf333142-test-single
> 
> While it did sort of fix the issue, that is, connect and disconnect from nbd
> devices lasted longer than the 30 seconds the kernels lasted previously, after
> about 10 minutes of torture testing, I started experiencing race conditions
> of requests completing multiple times, leading to the following use after frees:
> 
> Jammy 5.15 test kernel:
> https://paste.ubuntu.com/p/FSM6DrgjTy/
> 
> Impish 5.13 test kernel:
> https://paste.ubuntu.com/p/86tNfsM7Vs/
> 
> Focal 5.4 test kernel:
> https://paste.ubuntu.com/p/zzVzWx23sb/
> 
> This was similar to a mailing list thread I found previously:
> https://groups.google.com/g/syzkaller-bugs/c/jhvr3Yv_QH8/m/DgGG4xYFEQAJ
> https://lore.kernel.org/all/000000000000d0df7f058f625d13@google.com/T/
> 
> Keith Busch identified this as a double completion of the same request,
> which is resolved through the NBD_CMD_INFLIGHT infrastructure.
> 
> Hence, we cannot just pull in the single line change, we need NBD_CMD_INFLIGHT
> infrastructure for a complete fix.
> 
> Christoph Hellwig (1):
>    blk-mq: move failure injection out of blk_mq_complete_request
> 
> Keith Busch (1):
>    blk-mq: blk-mq: provide forced completion method
> 
> Yu Kuai (4):
>    nbd: don't handle response without a corresponding request message
>    nbd: make sure request completion won't concurrent
>    nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
>    nbd: fix io hung while disconnecting device
> 
>   block/blk-mq.c                    | 27 ++++++------------
>   block/blk-timeout.c               |  6 ++--
>   block/blk.h                       |  8 ------
>   block/bsg-lib.c                   |  5 +++-
>   drivers/block/loop.c              |  6 ++--
>   drivers/block/mtip32xx/mtip32xx.c |  3 +-
>   drivers/block/nbd.c               | 46 +++++++++++++++++++++++++++++--
>   drivers/block/null_blk_main.c     |  3 +-
>   drivers/block/skd_main.c          |  9 ++++--
>   drivers/block/virtio_blk.c        |  3 +-
>   drivers/block/xen-blkfront.c      |  3 +-
>   drivers/md/dm-rq.c                |  3 +-
>   drivers/mmc/core/block.c          |  6 ++--
>   drivers/nvme/host/nvme.h          |  3 +-
>   drivers/s390/block/dasd.c         |  2 +-
>   drivers/s390/block/scm_blk.c      |  3 +-
>   drivers/scsi/scsi_lib.c           | 12 ++------
>   include/linux/blk-mq.h            | 11 +++++++-
>   18 files changed, 99 insertions(+), 60 deletions(-)
> 

Applied to jammy:linux/master-next. Impish will not get more updates, so 
skipped. For Focal, the set does not apply.

On Focal "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed" 
fails in the last (#4) hunk. Assuming there is a backported version of that 
patch somewhere, could you submit this again for just Focal please? Thanks.

-Stefan

Matthew Ruffell July 20, 2022, 3:08 a.m. UTC | #4

On Sat, Jul 9, 2022 at 2:34 AM Stefan Bader <stefan.bader@canonical.com> wrote:
>
> On 22.06.22 07:17, Matthew Ruffell wrote:
> > BugLink: https://bugs.launchpad.net/bugs/1896350
> >
>
> Applied to jammy:linux/master-next. Impish will not get more updates, so
> skipped. For Focal, the set does not apply.
>
> On Focal "nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed"
> fails in the last (#4) hunk. Assuming there is a backported version of that
> patch somewhere, could you submit this again for just Focal please? Thanks.
>
> -Stefan

Hi Stefan,

I checked the patches, and they apply cleanly to Focal for me. I'm using git
2.34.1 on Jammy, with Focal hash 5063db9ef77566eebd845e5016ef664268039845
(Ubuntu-5.4.0-123.139). Regardless, I will send a V2 that has been output from
format-patch from a Focal tree, instead of using the patch from the Jammy tree,
which is identical, apart from offsets.

$ git checkout -b test3 origin/master-next
Branch 'test3' set up to track remote branch 'master-next' from 'origin'.
Switched to a new branch 'test3'
matthew@desktop:~/Work/kernel/ubuntu-focal$ diff
0005-nbd-don-t-clear-NBD_CMD_INFLIGHT-flag-if-request-is-.patch
~/Work/333142/patches/0005-nbd-don-t-clear-NBD_CMD_INFLIGHT-flag-if-request-is-.patch
1c1
< From fb45ba43e36432839dacd1c27c4c1128d025fa97 Mon Sep 17 00:00:00 2001
---
> From 0c7289ffb3837a671c85c67122666ddba9154ca8 Mon Sep 17 00:00:00 2001
4c4
< Subject: [SRU][F][PATCH V2 5/6] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if
---
> Subject: [SRU][F][I][J][PATCH 5/6] nbd: don't clear 'NBD_CMD_INFLIGHT' flag if
24c24
< index 01d030e9f301..6b165101f84b 100644
---
> index d525ad238e2d..f97f3a5adf28 100644
27c27
< @@ -394,13 +394,14 @@ static enum blk_eh_timer_return
nbd_xmit_timeout(struct request *req,
---
> @@ -411,13 +411,14 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
43c43
< @@ -456,6 +457,7 @@ static enum blk_eh_timer_return
nbd_xmit_timeout(struct request *req,
---
> @@ -486,6 +487,7 @@ static enum blk_eh_timer_return nbd_xmit_timeout(struct request *req,
51c51
< @@ -715,7 +717,7 @@ static struct nbd_cmd *nbd_read_stat(struct
nbd_device *nbd, int index)
---
> @@ -745,7 +747,7 @@ static struct nbd_cmd *nbd_read_stat(struct nbd_device *nbd, int index)
60c60
< @@ -804,8 +806,16 @@ static void recv_work(struct work_struct *work)
---
> @@ -832,8 +834,16 @@ static void recv_work(struct work_struct *work)
matthew@desktop:~/Work/kernel/ubuntu-focal$ git am 000*
Patch is empty.
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".
matthew@desktop:~/Work/kernel/ubuntu-focal$ git am --skip
Applying: blk-mq: blk-mq: provide forced completion method
Applying: blk-mq: move failure injection out of blk_mq_complete_request
Applying: nbd: don't handle response without a corresponding request message
Applying: nbd: make sure request completion won't concurrent
Applying: nbd: don't clear 'NBD_CMD_INFLIGHT' flag if request is not completed
Applying: nbd: fix io hung while disconnecting device

Hopefully this set applies to Focal, please let me know if it does not, this
would be great to have fixed.

Thanks,
Matthew