
[SRU,B,F,G,H,0/6] raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Message ID 20210506040442.10877-1-matthew.ruffell@canonical.com
Series raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations

Message

Matthew Ruffell May 6, 2021, 4:04 a.m. UTC
BugLink: https://bugs.launchpad.net/bugs/1896578

[Impact]

Block discard is very slow on Raid10, which causes common use cases which invoke
block discard, such as mkfs and fstrim operations, to take a very long time.

For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
that support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
11 minutes, while the same mkfs.xfs operation on Raid 0 takes 4 seconds.

The bigger the devices, the longer it takes.

The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the 
request into millions of 512k bio requests, even if the underlying device 
supports larger requests.
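
As a rough back-of-the-envelope illustration (the arithmetic is mine, not from
the original report), a 1.9TB discard split into 512k bios works out to several
million requests:

$ echo $((1900000000000 / 524288))
3623962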

For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:

$ cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
$ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
2199023255040

Where the Raid10 md device only supports 512k:

$ cat /sys/block/md0/queue/discard_max_bytes
524288
$ cat /sys/block/md0/queue/discard_max_hw_bytes
524288
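
To double check that every member advertises the same limit, the values can be
read in a loop (an optional sanity check, not part of the original report):

$ for d in /sys/block/nvme[0-3]n1; do cat $d/queue/discard_max_bytes; done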

If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11
minutes, and if we examine the stack, it is stuck in blkdev_issue_discard():

$ sudo cat /proc/1626/stack
[<0>] wait_barrier+0x14c/0x230 [raid10]
[<0>] regular_request_wait+0x39/0x150 [raid10]
[<0>] raid10_write_request+0x11e/0x850 [raid10]
[<0>] raid10_make_request+0xd7/0x150 [raid10]
[<0>] md_handle_request+0x123/0x1a0
[<0>] md_submit_bio+0xda/0x120
[<0>] __submit_bio_noacct+0xde/0x320
[<0>] submit_bio_noacct+0x4d/0x90
[<0>] submit_bio+0x4f/0x1b0
[<0>] __blkdev_issue_discard+0x154/0x290
[<0>] blkdev_issue_discard+0x5d/0xc0
[<0>] blk_ioctl_discard+0xc4/0x110
[<0>] blkdev_common_ioctl+0x56c/0x840
[<0>] blkdev_ioctl+0xeb/0x270
[<0>] block_ioctl+0x3d/0x50
[<0>] __x64_sys_ioctl+0x91/0xc0
[<0>] do_syscall_64+0x38/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
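
For reference, the pid above is simply that of the running mkfs.xfs process;
something like the following locates it (illustrative, the pid will differ):

$ pgrep -a mkfs.xfs
1626 mkfs.xfs /dev/md0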

[Fix]

Xiao Ni has developed a patchset which resolves the block discard performance
problems. These commits have now landed in 5.13-rc1.

commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
Author: Xiao Ni <xni@redhat.com>
Date: Thu Feb 4 15:50:43 2021 +0800
Subject: md: add md_submit_discard_bio() for submitting discard bio
Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8

commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
Author: Xiao Ni <xni@redhat.com>
Date: Thu Feb 4 15:50:44 2021 +0800
Subject: md/raid10: extend r10bio devs to raid disks
Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925

commit f2e7e269a7525317752d472bb48a549780e87d22
Author: Xiao Ni <xni@redhat.com>
Date: Thu Feb 4 15:50:45 2021 +0800
Subject: md/raid10: pull the code that wait for blocked dev into one function
Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22

commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
Author: Xiao Ni <xni@redhat.com>
Date: Thu Feb 4 15:50:46 2021 +0800
Subject: md/raid10: improve raid10 discard request
Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa

commit 254c271da0712ea8914f187588e0f81f7678ee2f
Author: Xiao Ni <xni@redhat.com>
Date: Thu Feb 4 15:50:47 2021 +0800
Subject: md/raid10: improve discard request for far layout
Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f

There is also one additional commit which is required, and which was merged
after "md/raid10: improve raid10 discard request". The following commit enables
Raid10 to issue large discards, instead of splitting them into many bios, since
the technical hurdles have now been removed.

commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
Author: Mike Snitzer <snitzer@redhat.com>
Date: Fri Apr 30 14:38:37 2021 -0400
Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44

The commits cherry-pick more or less cleanly onto the 5.11, 5.8, 5.4 and 4.15
kernels, with the following minor backport adjustments (a quick way to verify
the symbol renames is sketched after this list):

1) Calls to submit_bio_noacct() needed to be renamed back to generic_make_request(),
since the function was only renamed upstream in:

commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
Author: Christoph Hellwig <hch@lst.de>
Date: Wed Jul 1 10:59:44 2020 +0200
Subject: block: rename generic_make_request to submit_bio_noacct
Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead

2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its
request_queue argument put back in place. It was recently removed in:

commit 1c02fca620f7273b597591065d366e2cca948d8f
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Dec 3 17:21:38 2020 +0100
Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f

3) bio_split(), mempool_alloc() and bio_clone_fast() all needed the "address of"
operator '&' removed from one of their arguments for the 4.15 kernel, due to
changes made in:

commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
Author: Kent Overstreet <kent.overstreet@gmail.com>
Date: Sun May 20 18:25:52 2018 -0400
Subject: md: convert to bioset_init()/mempool_init()
Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8

4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for
raid0 and raid10" due to not having the following commit, which was merged in 
5.1-rc1:

commit 61697a6abd24acba941359c6268a94f4afe4a53d
Author: Mike Snitzer <snitzer@redhat.com>
Date: Fri Jan 18 14:19:26 2019 -0500
Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d

5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to 
bio_clone_blkcg_association() due to it changing in:

commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
Author: Dennis Zhou <dennis@kernel.org>
Date: Wed Dec 5 12:10:35 2018 -0500
Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
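
When preparing these backports, a quick way to confirm which of the renamed
symbols a given stable tree provides is to grep that tree (a generic sanity
check, not a step from the original workflow):

$ git grep -nE "generic_make_request|submit_bio_noacct" drivers/md/ block/blk-core.c | head
$ git grep -nE "bio_clone_blk(cg|g)_association" block/bio.c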

[Testcase]

You will need a machine with at least 4x NVMe drives which support block
discard. I use an i3.8xlarge instance on AWS, since it meets these requirements.

$ lsblk
xvda 202:0 0 8G 0 disk
└─xvda1 202:1 0 8G 0 part /
nvme0n1 259:2 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme2n1 259:1 0 1.7T 0 disk
nvme3n1 259:3 0 1.7T 0 disk

Create a Raid10 array:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
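
Optionally, confirm the array has assembled before formatting (not a step in
the original instructions):

$ cat /proc/mdstat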

Format the array with XFS:

$ time sudo mkfs.xfs /dev/md0
real 11m14.734s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk

Optionally, do an fstrim:

$ time sudo fstrim /mnt/disk

real 11m37.643s

There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:

https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test

With a test kernel installed, we can see that performance improves dramatically:

$ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1

$ time sudo mkfs.xfs /dev/md0
real 0m4.226s
user 0m0.020s
sys 0m0.148s

$ sudo mkdir /mnt/disk
$ sudo mount /dev/md0 /mnt/disk
$ time sudo fstrim /mnt/disk

real 0m1.991s
user 0m0.020s
sys 0m0.000s

The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
from 11 minutes to 2 seconds.

Performance Matrix (AWS i3.8xlarge):

Kernel    | mkfs.xfs  | fstrim
----------+-----------+----------
4.15      | 7m23.449s | 7m20.678s
5.4       | 8m23.219s | 8m23.927s
5.8       | 2m54.990s | 8m22.010s
4.15-test | 0m4.286s  | 0m1.657s
5.4-test  | 0m6.075s  | 0m3.150s
5.8-test  | 0m2.753s  | 0m2.999s

The test kernel also raises discard_max_bytes to the underlying hardware
limit:

$ cat /sys/block/md0/queue/discard_max_bytes
2199023255040

[Where problems can occur]

A problem has occurred once before, with the previous revision of this patchset.
This is documented in bug 1907262, and in the worst case caused data loss for
some users, specifically on the second and subsequent disks. This was due to two
faults: the first was an incorrectly calculated start offset for block discard
on the second and subsequent disks; the second was an incorrect stripe size for
far layouts.

The kernel team had to revert the patches in an emergency, the faulty kernel
was removed from the archive, and community users were urged to avoid it.

These bugs and a few other minor issues have now been corrected, and we have 
been testing the new patches since mid February. The patches have been tested 
against the testcase in bug 1907262 and do not cause the disks to become 
corrupted.

The regression potential is still the same for this patchset though. If a 
regression were to occur, it could lead to data loss on Raid10 arrays backed by 
NVMe or SSD disks that support block discard.

If a regression happens, users need to disable the fstrim systemd service as
soon as possible, plan an emergency maintenance window, and either downgrade
the kernel to a previous release or upgrade to a corrected kernel.
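
On Ubuntu the periodic fstrim is driven by a systemd timer, so the immediate
mitigation would be along these lines (a sketch; adjust for local setups):

$ sudo systemctl disable --now fstrim.timer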

Mike Snitzer (1):
  dm raid: remove unnecessary discard limits for raid0 and raid10

Xiao Ni (5):
  md: add md_submit_discard_bio() for submitting discard bio
  md/raid10: extend r10bio devs to raid disks
  md/raid10: pull the code that wait for blocked dev into one function
  md/raid10: improve raid10 discard request
  md/raid10: improve discard request for far layout

 drivers/md/dm-raid.c |   9 -
 drivers/md/md.c      |  20 ++
 drivers/md/md.h      |   2 +
 drivers/md/raid0.c   |  14 +-
 drivers/md/raid10.c  | 434 +++++++++++++++++++++++++++++++++++++------
 drivers/md/raid10.h  |   1 +
 6 files changed, 402 insertions(+), 78 deletions(-)

Comments

Tim Gardner May 7, 2021, 1:28 p.m. UTC | #1
This is way more code change than I can wrap my head around. Aside from
the test cases that exercise block discard, have you run any other file
system tests on top of a raid 10 configuration? These changes are not
isolated to just raid 10, so perhaps some testing on other raid
configurations would be appropriate. I know you've already put a lot of
work into this, but it's a huge change. I'd also be inclined to let the
upstream changes bake awhile, then monitor for subsequent fixes.

This isn't a NACK per se, but I'm in no hurry to ACK just yet.

rtg

Matthew Ruffell May 10, 2021, 1:25 a.m. UTC | #2
Hi Tim,

I appreciate your cautiousness, and I too am happy to play the conservative game
on this particular patchset. I _really_ don't want to cause another regression,
the one in December caused too much trouble for everyone as it was.

I'm happy to NACK the patchset for this cycle and resubmit in the next cycle,
as long as we set out and agree on what additional regression testing should be
performed, and that if I go and perform those tests you would seriously consider
accepting these patches.

Testing that I have currently performed:
- Testing against the testcase of mkfs.xfs / mkfs.ext4 / fstrim on a Raid10 md
  block device, as specified in LP #1896578 [1].
- Testing against the disk corruption regression testcase, as reported in
  LP #1907262 [2]. All disks now fsck clean with this new revision of the 
  patchset, which you can see in comment #15 [3] on LP #1896578. 
- Three customers have tested test kernels and haven't found any issues (yet).
  
Testing that I could go perform over the next week or two:
- Run xfstests with the generic/*, xfs/* and ext4/* testsuites over two Raid10
  md block devices, one test block device and one scratch block device (a setup
  along the lines sketched below). I would do two runs, one with a released
  kernel without the patches, and one with the test kernels in [4], to see if
  there are any regressions.
- I could use a cloud instance with NVMe drives as my primary computer over the
  next two or so weeks, and have my /home on a Raid10 md block device, and
  change the fstrim systemd timer from weekly to every 30 minutes, and see if I
  come across data corruption.
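
For reference, a minimal xfstests setup for this would be something like the
following (a sketch only; the md device names and mount points are placeholders):

$ git clone git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
$ cd xfstests-dev && make -j"$(nproc)"
$ cat local.config
export TEST_DEV=/dev/md0
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/md1
export SCRATCH_MNT=/mnt/scratch
$ sudo ./check -g quick    # repeat with FSTYP=ext4 in local.config for the ext4 pass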

Just so that you are aware, I have three different customers who would very much
like these patches to land in the Ubuntu kernels. Two are deploying systems via
MAAS or curtin, and are seeing deployment timeouts when deploying Raid10 to
NVMe disks: their larger arrays take 2-3 hours to perform a block discard with
current kernels, while normal systems take only 15-30 minutes to deploy, and
they don't want to increase the deployment timeout setting for this one outlier.
The third customer doesn't want to spend 2-3 hours waiting for their Raid10 md
arrays to be formatted with a filesystem when it could take 4 seconds instead.

I know this is a big change, and I know that this set has already caused grief
with a regression in December, but customers are requesting this feature, and
because of that, I'm willing to work with you to figure out appropriate testing
required, and hopefully get this landed in Ubuntu kernels safely in the near
future.

Let me know what additional testing you would like to see, and I will go and
complete it.

Thanks,
Matthew

[1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578
[2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262
[3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/comments/15
[4] https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test

Tim Gardner May 10, 2021, 12:08 p.m. UTC | #3
On 5/9/21 7:25 PM, Matthew Ruffell wrote:
> Hi Tim,
> 
> I appreciate your cautiousness, and I too am happy to play the conservative game
> on this particular patchset. I _really_ don't want to cause another regression,
> the one in December caused too much trouble for everyone as it was.
> 
> I'm happy to NACK the patchset for this cycle, and resubmit in the next cycle,
> as long as we set out and agree on what additional regression testing should be
> performed, and if I go perform those tests then you would seriously consider
> accepting these patches.
> 
> Testing that I have currently performed:
> - Testing against the testcase of mkfs.xfs / mkfs.ext4 / fstrim on a Raid10 md
>    block device, as specified in LP #1896578 [1].
> - Testing against the disk corruption regression testcase, as reported in
>    LP #1907262 [2]. All disks now fsck clean with this new revision of the
>    patchset, which you can see in comment #15 [3] on LP #1896578.
> - Three customers have tested test kernels and haven't found any issues (yet).
>    
> Testing that I could go perform over the next week or two:
> - Run xfstests with the generic/*, xfs/* and ext4/* testsuites over two Raid10
>    md block devices, one test block device, one scratch block device. I would
>    do two runs, one with a released kernel without the patches, and one with the
>    test kernels in [4], to see if there is any regressions.
> - I could use a cloud instance with NVMe drives as my primary computer over the
>    next two or so weeks, and have my /home on a Raid10 md block device, and
>    change the fstrim systemd timer from weekly to every 30 minutes, and see if I
>    come across data corruption.
> 

I like this setup ^. It's a bit more live and random than focused
testing. Plus, you're more likely to notice regressions or delays in
file system access.

> Just so that you are aware, I have three different customers who would very much
> like these patches to land in the Ubuntu kernels. Two are deploying systems via
> MAAS or curtin, and are seeing deployment timeouts when deploying Raid10 to
> NVMe disks, since their larger arrays take 2-3 hours to perform a block discard
> on with current kernels, when normal systems only take 15-30 mins to deploy, so
> they don't want to increase the deployment timeout setting for this one outlier.
> The other customer doesn't want to spend 2-3 hours waiting for their Raid10 md
> arrays to format with a filesystem, when it could take 4 seconds instead.
> 

I can see why they are pushing to get this fixed.

> I know this is a big change, and I know that this set has already caused grief
> with a regression in December, but customers are requesting this feature, and
> because of that, I'm willing to work with you to figure out appropriate testing
> required, and hopefully get this landed in Ubuntu kernels safely in the near
> future.
> 
> Let me know what additional testing you would like to see, and I will go and
> complete it.
> 
> Thanks,
> Matthew
> 
> [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578
> [2] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1907262
> [3] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1896578/comments/15
> [4] https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
> 

This patch set won't make the 2021.05.10 SRU cycle, so let's schedule it
for the next SRU cycle (beginning about June 1). That'll give another
few weeks of testing. If there are no regressions then I think we can get
it included. Be sure to ping me around then so I can annoy someone else
on the team to review as well. Watch for upstream Fix patches in the
meantime.

rtg
-----------
Tim Gardner
Canonical, Inc
Tim Gardner May 20, 2021, 3:54 p.m. UTC | #4
Acked-by: Tim Gardner <tim.gardner@canonical.com>

Matthew's personal use of this patch set appears to be going well.

Matthew Ruffell May 25, 2021, 6 a.m. UTC | #5
Hi Tim,

Thanks for the ack. Please also ask your colleagues for a second and maybe
third review. 

The patches still apply cleanly to master-next for all series.

I just wanted to circle around with how testing has been going.

My instance on Google Cloud has 4x 375GB NVMe disks, arranged in Raid10 and
mounted as /home.

I changed the fstrim service to include /home (ProtectHome=no), and set the
fstrim.timer to run hourly instead of weekly.
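
A sketch of the kind of drop-ins this implies (assumed; the exact override
files are not reproduced here):

$ cat /etc/systemd/system/fstrim.service.d/override.conf
[Service]
ProtectHome=no
$ cat /etc/systemd/system/fstrim.timer.d/override.conf
[Timer]
OnCalendar=
OnCalendar=hourly
$ sudo systemctl daemon-reload && sudo systemctl restart fstrim.timer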

I have my kernel git repos on the instance, and over the last week or two I have
been doing kernel builds and general work. I did install ubuntu-desktop and try
RDP, but it was super slow, and unusable, so I stuck to ssh.

Things have been pretty stable.

I have been doing regular reboots, and the instance always comes back up, so
/home hasn't been corrupted.

I did an fsck, and it came back mostly clean, with an off-by-one free inode
count being the only complaint:

$ sudo fsck -n -f /dev/md0
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Warning!  /dev/md0 is mounted.
Warning: skipping journal recovery because doing a read-only filesystem check.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free inodes count wrong (49055036, counted=49055037).
Fix? no

/dev/md0: 80580/49135616 files (0.2% non-contiguous), 4370396/196541952 blocks

I also did a git fsck on the kernel git repos, and they came back clean, with
the same warnings as my regular work pc non-raid10.

Raid10 ext4: https://paste.ubuntu.com/p/HB2CJ7h3jw/
Standard ext4: https://paste.ubuntu.com/p/4rG32zMXJx/

So my kernel git trees experienced no data corruption.

One of my customers has been running the Bionic HWE test kernel on their QA
test clusters for a few weeks now, and they haven't run into any issues. They
are also asking for a hotfix so they can roll out the kernel more widely, but
I think they can hold on for a kernel in -proposed.

One of my other customers has been doing some more testing with the test
kernels, and is happy to help test as soon as new kernels land in -proposed.

I also have two community users testing the test kernels.

Thimo, in LP #1907262 [1], the original regression reporter, has been running
the test kernel on a machine, and hasn't found any issues.

Evan, in LP #1896578 [2], has installed the test kernel and can attest to its
speed, but hasn't done any long runs with the kernel to attest to its data
safety.

If the patches get another ack and get applied and built, I am confident we
can get everyone testing the kernel while it is in -proposed. I will also switch
my cloud instance to the -proposed kernel.

Let me know if there is any more testing I can do in the meantime.

Thanks,
Matthew

Kleber Sacilotto de Souza May 27, 2021, 10:02 a.m. UTC | #6
On 06.05.21 06:04, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
> 
> [Impact]
> 
> Block discard is very slow on Raid10, which causes common use cases which invoke
> block discard, such as mkfs and fstrim operations, to take a very long time.
> 
> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
> which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
> 11 minutes, whereas the same mkfs.xfs operation on Raid 0 takes 4 seconds.
> 
> The bigger the devices, the longer it takes.
> 
> The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
> request into millions of 512k bio requests, even if the underlying device
> supports larger requests.
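> 
> As a rough sanity check of that split count (a sketch, assuming a 1.9TB discard
> against the 512k limit shown below):
> 
> $ echo $(( 1900000000000 / 524288 ))
> 3623962
> 
> That is roughly 3.6 million bios for a single discard, versus a single bio when
> the 2.2TB hardware limit is honoured.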
> 
> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
> 
> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
> 2199023255040
> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
> 2199023255040
> 
> Where the Raid10 md device only supports 512k:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes
> 524288
> $ cat /sys/block/md0/queue/discard_max_hw_bytes
> 524288
> 
> If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11
> minutes and if we examine the stack, it is stuck in blkdev_issue_discard()
> 
> $ sudo cat /proc/1626/stack
> [<0>] wait_barrier+0x14c/0x230 [raid10]
> [<0>] regular_request_wait+0x39/0x150 [raid10]
> [<0>] raid10_write_request+0x11e/0x850 [raid10]
> [<0>] raid10_make_request+0xd7/0x150 [raid10]
> [<0>] md_handle_request+0x123/0x1a0
> [<0>] md_submit_bio+0xda/0x120
> [<0>] __submit_bio_noacct+0xde/0x320
> [<0>] submit_bio_noacct+0x4d/0x90
> [<0>] submit_bio+0x4f/0x1b0
> [<0>] __blkdev_issue_discard+0x154/0x290
> [<0>] blkdev_issue_discard+0x5d/0xc0
> [<0>] blk_ioctl_discard+0xc4/0x110
> [<0>] blkdev_common_ioctl+0x56c/0x840
> [<0>] blkdev_ioctl+0xeb/0x270
> [<0>] block_ioctl+0x3d/0x50
> [<0>] __x64_sys_ioctl+0x91/0xc0
> [<0>] do_syscall_64+0x38/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> [Fix]
> 
> Xiao Ni has developed a patchset which resolves the block discard performance
> problems. These commits have now landed in 5.13-rc1.
> 
> commit cf78408f937a67f59f5e90ee8e6cadeed7c128a8
> Author: Xiao Ni <xni@redhat.com>
> Date: Thu Feb 4 15:50:43 2021 +0800
> Subject: md: add md_submit_discard_bio() for submitting discard bio
> Link: https://github.com/torvalds/linux/commit/cf78408f937a67f59f5e90ee8e6cadeed7c128a8
> 
> commit c2968285925adb97b9aa4ede94c1f1ab61ce0925
> Author: Xiao Ni <xni@redhat.com>
> Date: Thu Feb 4 15:50:44 2021 +0800
> Subject: md/raid10: extend r10bio devs to raid disks
> Link: https://github.com/torvalds/linux/commit/c2968285925adb97b9aa4ede94c1f1ab61ce0925
> 
> commit f2e7e269a7525317752d472bb48a549780e87d22
> Author: Xiao Ni <xni@redhat.com>
> Date: Thu Feb 4 15:50:45 2021 +0800
> Subject: md/raid10: pull the code that wait for blocked dev into one function
> Link: https://github.com/torvalds/linux/commit/f2e7e269a7525317752d472bb48a549780e87d22
> 
> commit d30588b2731fb01e1616cf16c3fe79a1443e29aa
> Author: Xiao Ni <xni@redhat.com>
> Date: Thu Feb 4 15:50:46 2021 +0800
> Subject: md/raid10: improve raid10 discard request
> Link: https://github.com/torvalds/linux/commit/d30588b2731fb01e1616cf16c3fe79a1443e29aa
> 
> commit 254c271da0712ea8914f187588e0f81f7678ee2f
> Author: Xiao Ni <xni@redhat.com>
> Date: Thu Feb 4 15:50:47 2021 +0800
> Subject: md/raid10: improve discard request for far layout
> Link: https://github.com/torvalds/linux/commit/254c271da0712ea8914f187588e0f81f7678ee2f
> 
> There is also an additional commit which is required, and was merged after
> "md/raid10: improve raid10 discard request" was merged. The following commit
> enables Raid10 to use large discards, instead of splitting into many bios,
> since the technical hurdles have now been removed.
> 
> commit ca4a4e9a55beeb138bb06e3867f5e486da896d44
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Fri Apr 30 14:38:37 2021 -0400
> Subject: dm raid: remove unnecessary discard limits for raid0 and raid10
> Link: https://github.com/torvalds/linux/commit/ca4a4e9a55beeb138bb06e3867f5e486da896d44
> 
> The commits cherry-pick more or less cleanly to the 5.11, 5.8, 5.4 and 4.15
> kernels, with the following minor backport adjustments:
> 
> 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it
> was recently changed in:
> 
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> Author: Christoph Hellwig <hch@lst.de>
> Date: Wed Jul 1 10:59:44 2020 +0200
> Subject: block: rename generic_make_request to submit_bio_noacct
> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> 
> 2) In the 4.15, 5.4 and 5.8 kernels, trace_block_bio_remap() needs to have its
> request_queue argument put back in place. It was recently removed in:
> 
> commit 1c02fca620f7273b597591065d366e2cca948d8f
> Author: Christoph Hellwig <hch@lst.de>
> Date: Thu Dec 3 17:21:38 2020 +0100
> Subject: block: remove the request_queue argument to the block_bio_remap tracepoint
> Link: https://github.com/torvalds/linux/commit/1c02fca620f7273b597591065d366e2cca948d8f
> 
> 3) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
> '&' removed for one of their arguments for the 4.15 kernel, due to changes made
> in:
> 
> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> Author: Kent Overstreet <kent.overstreet@gmail.com>
> Date: Sun May 20 18:25:52 2018 -0400
> Subject: md: convert to bioset_init()/mempool_init()
> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> 
> 4) The 4.15 kernel does not need "dm raid: remove unnecessary discard limits for
> raid0 and raid10" due to not having the following commit, which was merged in
> 5.1-rc1:
> 
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
> 
> 5) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to
> bio_clone_blkcg_association() due to it changing in:
> 
> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> Author: Dennis Zhou <dennis@kernel.org>
> Date: Wed Dec 5 12:10:35 2018 -0500
> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> 
> [Testcase]
> 
> You will need a machine with at least 4x NVMe drives which support block
> discard. I use an i3.8xlarge instance on AWS, since it has all of these things.
> 
> $ lsblk
> xvda 202:0 0 8G 0 disk
> └─xvda1 202:1 0 8G 0 part /
> nvme0n1 259:2 0 1.7T 0 disk
> nvme1n1 259:0 0 1.7T 0 disk
> nvme2n1 259:1 0 1.7T 0 disk
> nvme3n1 259:3 0 1.7T 0 disk
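> 
> As a quick sketch for confirming that a member drive actually supports discard,
> lsblk's discard view can be used:
> 
> $ lsblk -D /dev/nvme0n1
> 
> Non-zero DISC-GRAN and DISC-MAX columns indicate discard support.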
> 
> Create a Raid10 array:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> Format the array with XFS:
> 
> $ time sudo mkfs.xfs /dev/md0
> real 11m14.734s
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> 
> Optionally, do an fstrim:
> 
> $ time sudo fstrim /mnt/disk
> 
> real 11m37.643s
> 
> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
> 
> https://launchpad.net/~mruffell/+archive/ubuntu/lp1896578-test
> 
> If we install a test kernel, we can see that performance improves dramatically:
> 
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
> 
> $ time sudo mkfs.xfs /dev/md0
> real 0m4.226s
> user 0m0.020s
> sys 0m0.148s
> 
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> $ time sudo fstrim /mnt/disk
> 
> real 0m1.991s
> user 0m0.020s
> sys 0m0.000s
> 
> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
> from 11 minutes to 2 seconds.
> 
> Performance Matrix (AWS i3.8xlarge):
> 
> Kernel    | mkfs.xfs  | fstrim
> ----------|-----------|----------
> 4.15      | 7m23.449s | 7m20.678s
> 5.4       | 8m23.219s | 8m23.927s
> 5.8       | 2m54.990s | 8m22.010s
> 4.15-test | 0m4.286s  | 0m1.657s
> 5.4-test  | 0m6.075s  | 0m3.150s
> 5.8-test  | 0m2.753s  | 0m2.999s
> 
> The test kernel also changes the discard_max_bytes to the underlying hardware
> limit:
> 
> $ cat /sys/block/md0/queue/discard_max_bytes
> 2199023255040
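> 
> To compare the md limit against a member device in one command (a small
> convenience sketch, using the device names from the lsblk output above):
> 
> $ grep . /sys/block/md0/queue/discard_max_bytes /sys/block/nvme0n1/queue/discard_max_bytes
> /sys/block/md0/queue/discard_max_bytes:2199023255040
> /sys/block/nvme0n1/queue/discard_max_bytes:2199023255040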
> 
> [Where problems can occur]
> 
> A problem has occurred once before, with the previous revision of this patchset.
> This has been documented in bug 1907262, and caused a worst-case scenario of
> data loss for some users, in this particular case on the second and subsequent
> disks. This was due to two faults: the first was an incorrectly calculated start
> offset for block discard on the second and subsequent disks; the second was an
> incorrect stripe size for far layouts.
> 
> The kernel team was forced to revert the patches in an emergency, the faulty
> kernel was removed from the archive, and community users were urged to avoid
> it.
> 
> These bugs and a few other minor issues have now been corrected, and we have
> been testing the new patches since mid February. The patches have been tested
> against the testcase in bug 1907262 and do not cause the disks to become
> corrupted.
> 
> The regression potential is still the same for this patchset though. If a
> regression were to occur, it could lead to data loss on Raid10 arrays backed by
> NVMe or SSD disks that support block discard.
> 
> If a regression happens, users need to disable the fstrim systemd service as
> soon as possible, plan an emergency maintenance window, and either downgrade the
> kernel to a previous release or upgrade to a corrected kernel.
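> 
> On a stock Ubuntu install, a minimal sketch of that first mitigation step,
> assuming the periodic trim is driven by the standard fstrim.timer unit:
> 
> $ sudo systemctl disable --now fstrim.timer
> $ systemctl status fstrim.timer   # should now report inactive (dead)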
> 
> Mike Snitzer (1):
>    dm raid: remove unnecessary discard limits for raid0 and raid10
> 
> Xiao Ni (5):
>    md: add md_submit_discard_bio() for submitting discard bio
>    md/raid10: extend r10bio devs to raid disks
>    md/raid10: pull the code that wait for blocked dev into one function
>    md/raid10: improve raid10 discard request
>    md/raid10: improve discard request for far layout
> 
>   drivers/md/dm-raid.c |   9 -
>   drivers/md/md.c      |  20 ++
>   drivers/md/md.h      |   2 +
>   drivers/md/raid0.c   |  14 +-
>   drivers/md/raid10.c  | 434 +++++++++++++++++++++++++++++++++++++------
>   drivers/md/raid10.h  |   1 +
>   6 files changed, 402 insertions(+), 78 deletions(-)
> 


Thanks, Matthew, for the great work put into this.

This patchset has been extensively tested by several people and in
several scenarios. I also do not see any follow-up fixes on Linus'
tree.

It would be good, though, to continue testing with the kernels once
they hit -proposed, if possible.


Acked-by: Kleber Sacilotto de Souza <kleber.souza@canonical.com>

Thanks
Kleber Sacilotto de Souza May 27, 2021, 10:23 a.m. UTC | #7
On 06.05.21 06:04, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
> [...]


Applied to [bionic/focal/groovy/hirsute]:linux.

Thanks,
Kleber