Message ID | 20201029030737.21204-1-matthew.ruffell@canonical.com |
---|---|
Series | raid10: Block discard is very slow, causing severe delays for mkfs and fstrim operations |
On 29.10.20 04:07, Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
>
> [Impact]
>
> Block discard is very slow on Raid10, which causes common use cases which invoke
> block discard, such as mkfs and fstrim operations, to take a very long time.
>
> For example, on an i3.8xlarge instance on AWS, which has 4x 1.9TB NVMe devices
> which support block discard, a mkfs.xfs operation on Raid 10 takes between 8 and
> 11 minutes, where the same mkfs.xfs operation on Raid 0 takes 4 seconds.
>
> The bigger the devices, the longer it takes.
>
> The cause is that Raid10 currently uses a 512k chunk size, and uses this for the
> discard_max_bytes value. If we need to discard 1.9TB, the kernel splits the
> request into millions of 512k bio requests, even if the underlying device
> supports larger requests.
>
> For example, the NVMe devices on i3.8xlarge support 2.2TB of discard at once:
>
> $ cat /sys/block/nvme0n1/queue/discard_max_bytes
> 2199023255040
> $ cat /sys/block/nvme0n1/queue/discard_max_hw_bytes
> 2199023255040
>
> Where the Raid10 md device only supports 512k:
>
> $ cat /sys/block/md0/queue/discard_max_bytes
> 524288
> $ cat /sys/block/md0/queue/discard_max_hw_bytes
> 524288
>
> If we perform a mkfs.xfs operation on the /dev/md array, it takes over 11 minutes
> and if we examine the stack, it is stuck in blkdev_issue_discard()
>
> $ sudo cat /proc/1626/stack
> [<0>] wait_barrier+0x14c/0x230 [raid10]
> [<0>] regular_request_wait+0x39/0x150 [raid10]
> [<0>] raid10_write_request+0x11e/0x850 [raid10]
> [<0>] raid10_make_request+0xd7/0x150 [raid10]
> [<0>] md_handle_request+0x123/0x1a0
> [<0>] md_submit_bio+0xda/0x120
> [<0>] __submit_bio_noacct+0xde/0x320
> [<0>] submit_bio_noacct+0x4d/0x90
> [<0>] submit_bio+0x4f/0x1b0
> [<0>] __blkdev_issue_discard+0x154/0x290
> [<0>] blkdev_issue_discard+0x5d/0xc0
> [<0>] blk_ioctl_discard+0xc4/0x110
> [<0>] blkdev_common_ioctl+0x56c/0x840
> [<0>] blkdev_ioctl+0xeb/0x270
> [<0>] block_ioctl+0x3d/0x50
> [<0>] __x64_sys_ioctl+0x91/0xc0
> [<0>] do_syscall_64+0x38/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> [Fix]
>
> Xiao Ni has developed a patchset which resolves the block discard performance
> problems. These commits have now landed in 5.10-rc1.
>
> commit 2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
> Author: Xiao Ni <xni@redhat.com>
> Date: Tue Aug 25 13:42:59 2020 +0800
> Subject: md: add md_submit_discard_bio() for submitting discard bio
> Link: https://github.com/torvalds/linux/commit/2628089b74d5a64bd0bcb5d247a18f78d7b6f4d0
>
> commit 8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
> Author: Xiao Ni <xni@redhat.com>
> Date: Tue Aug 25 13:43:00 2020 +0800
> Subject: md/raid10: extend r10bio devs to raid disks
> Link: https://github.com/torvalds/linux/commit/8650a889017cb1f6ea6813ccf83a2e9f6fa49dd3
>
> commit f046f5d0d79cdb968f219ce249e497fd1accf484
> Author: Xiao Ni <xni@redhat.com>
> Date: Tue Aug 25 13:43:01 2020 +0800
> Subject: md/raid10: pull codes that wait for blocked dev into one function
> Link: https://github.com/torvalds/linux/commit/f046f5d0d79cdb968f219ce249e497fd1accf484
>
> commit bcc90d280465ebd51ab8688be86e1f00c62dccf9
> Author: Xiao Ni <xni@redhat.com>
> Date: Wed Sep 2 20:00:22 2020 +0800
> Subject: md/raid10: improve raid10 discard request
> Link: https://github.com/torvalds/linux/commit/bcc90d280465ebd51ab8688be86e1f00c62dccf9
>
> commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359
> Author: Xiao Ni <xni@redhat.com>
> Date: Wed Sep 2 20:00:23 2020 +0800
> Subject: md/raid10: improve discard request for far layout
> Link: https://github.com/torvalds/linux/commit/d3ee2d8415a6256c1c41e1be36e80e640c3e6359
>
> There are also some additional commits which are required, and were merged after
> "md/raid10: improve raid10 discard request" was merged. The following commits
> enable Raid10 to use large discards, instead of splitting into many bios,
> since the technical hurdles have now been removed.

The below two patches are marked up as only needed for F and G. What about
Bionic? If the changes they refer to were in 4.12, then those would have to go
to Bionic as well.

Besides that, I am not sure how exactly that might be better phrased, but
personally I stumbled over "remove 'address of' pointer for...". Maybe "do
not use a pointer for one of the arguments to ..." but not sure.

-Stefan

>
> commit e0910c8e4f87bb9f767e61a778b0d9271c4dc512
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Thu Sep 24 13:14:52 2020 -0400
> Subject: dm raid: fix discard limits for raid1 and raid10
> Link: https://github.com/torvalds/linux/commit/e0910c8e4f87bb9f767e61a778b0d9271c4dc512
>
> commit f0e90b6c663a7e3b4736cb318c6c7c589f152c28
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Thu Sep 24 16:40:12 2020 -0400
> Subject: dm raid: remove unnecessary discard limits for raid10
> Link: https://github.com/torvalds/linux/commit/f0e90b6c663a7e3b4736cb318c6c7c589f152c28
>
> All the commits mentioned follow a similar strategy which was implemented in
> Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block
> discard performance issues in Raid0:
>
> commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
> Author: Shaohua Li <shli@fb.com>
> Date: Sun May 7 17:36:24 2017 -0700
> Subject: md/md0: optimize raid0 discard handling
> Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
>
> The commits more or less cherry pick to the 5.8, 5.4 and 4.15 kernels, with the
> following minor fixups:
>
> 1) submit_bio_noacct() needed to be renamed to generic_make_request() since it
> was recently changed in:
>
> commit ed00aabd5eb9fb44d6aff1173234a2e911b9fead
> Author: Christoph Hellwig <hch@lst.de>
> Date: Wed Jul 1 10:59:44 2020 +0200
> Subject: block: rename generic_make_request to submit_bio_noacct
> Link: https://github.com/torvalds/linux/commit/ed00aabd5eb9fb44d6aff1173234a2e911b9fead
>
> 2) bio_split(), mempool_alloc(), bio_clone_fast() all needed their "address of"
> '&' removed for one of their arguments for the 4.15 kernel, due to changes made
> in:
>
> commit afeee514ce7f4cab605beedd03be71ebaf0c5fc8
> Author: Kent Overstreet <kent.overstreet@gmail.com>
> Date: Sun May 20 18:25:52 2018 -0400
> Subject: md: convert to bioset_init()/mempool_init()
> Link: https://github.com/torvalds/linux/commit/afeee514ce7f4cab605beedd03be71ebaf0c5fc8
>
> 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
> and "dm raid: remove unnecessary discard limits for raid10" due to not having
> the following commit, which was merged in 5.1-rc1:
>
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
>
> 4) The 4.15 kernel needed bio_clone_blkg_association() to be renamed to
> bio_clone_blkcg_association() due to it changing in:
>
> commit db6638d7d177a8bc74c9e539e2e0d7d061c767b1
> Author: Dennis Zhou <dennis@kernel.org>
> Date: Wed Dec 5 12:10:35 2018 -0500
> Subject: blkcg: remove bio->bi_css and instead use bio->bi_blkg
> Link: https://github.com/torvalds/linux/commit/db6638d7d177a8bc74c9e539e2e0d7d061c767b1
>
> [Testcase]
>
> You will need a machine with at least 4x NVMe drives which support block discard.
> I use an i3.8xlarge instance on AWS, since it has all of these things.
>
> $ lsblk
> xvda    202:0    0    8G  0 disk
> └─xvda1 202:1    0    8G  0 part /
> nvme0n1 259:2    0  1.7T  0 disk
> nvme1n1 259:0    0  1.7T  0 disk
> nvme2n1 259:1    0  1.7T  0 disk
> nvme3n1 259:3    0  1.7T  0 disk
>
> Create a Raid10 array:
>
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> Format the array with XFS:
>
> $ time sudo mkfs.xfs /dev/md0
> real 11m14.734s
>
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
>
> Optional, do a fstrim:
>
> $ time sudo fstrim /mnt/disk
>
> real 11m37.643s
>
> There are test kernels for 5.8, 5.4 and 4.15 available in the following PPA:
>
> https://launchpad.net/~mruffell/+archive/ubuntu/sf291726-test
>
> If you install a test kernel, we can see that performance dramatically improves:
>
> $ sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
>
> $ time sudo mkfs.xfs /dev/md0
> real 0m4.226s
> user 0m0.020s
> sys 0m0.148s
>
> $ sudo mkdir /mnt/disk
> $ sudo mount /dev/md0 /mnt/disk
> $ time sudo fstrim /mnt/disk
>
> real 0m1.991s
> user 0m0.020s
> sys 0m0.000s
>
> The patches bring mkfs.xfs from 11 minutes down to 4 seconds, and fstrim
> from 11 minutes to 2 seconds.
>
> Performance Matrix (AWS i3.8xlarge):
>
> Kernel    | mkfs.xfs  | fstrim
> ---------------------------------
> 4.15      | 7m23.449s | 7m20.678s
> 5.4       | 8m23.219s | 8m23.927s
> 5.8       | 2m54.990s | 8m22.010s
> 4.15-test | 0m4.286s  | 0m1.657s
> 5.4-test  | 0m6.075s  | 0m3.150s
> 5.8-test  | 0m2.753s  | 0m2.999s
>
> The test kernel also changes the discard_max_bytes to the underlying hardware
> limit:
>
> $ cat /sys/block/md0/queue/discard_max_bytes
> 2199023255040
>
> [Regression Potential]
>
> If a regression were to occur, then it would affect operations which would
> trigger block discard operations, such as mkfs and fstrim, on Raid10 only.
>
> Other Raid levels would not be affected, although, I should note there will be
> a small risk of regression to Raid0, due to one of its functions being
> re-factored and split out, for use in both Raid0 and Raid10.
>
> The changes only affect block discard, so only Raid10 arrays backed by SSD or
> NVMe devices which support block discard will be affected. Traditional hard
> disks, or SSD devices which do not support block discard would not be affected.
>
> If a regression were to occur, users could work around the issue by running
> "mkfs.xfs -K <device>" which would skip block discard entirely.
>
> Mike Snitzer (2):
>   dm raid: fix discard limits for raid1 and raid10
>   dm raid: remove unnecessary discard limits for raid10
>
> Xiao Ni (5):
>   md: add md_submit_discard_bio() for submitting discard bio
>   md/raid10: extend r10bio devs to raid disks
>   md/raid10: pull codes that wait for blocked dev into one function
>   md/raid10: improve raid10 discard request
>   md/raid10: improve discard request for far layout
>
>  drivers/md/dm-raid.c |   9 -
>  drivers/md/md.c      |  20 ++
>  drivers/md/md.h      |   2 +
>  drivers/md/raid0.c   |  14 +-
>  drivers/md/raid10.c  | 423 +++++++++++++++++++++++++++++++++++++------
>  drivers/md/raid10.h  |   1 +
>  6 files changed, 391 insertions(+), 78 deletions(-)
>

On 29/10/20 9:46 pm, Stefan Bader wrote:
>> There are also some additional commits which are required, and were merged after
>> "md/raid10: improve raid10 discard request" was merged. The following commits
>> enable Raid10 to use large discards, instead of splitting into many bios,
>> since the technical hurdles have now been removed.
>
> The below two patches are marked up as only needed for F and G. What about
> Bionic? If the changes they refer to were in 4.12, then those would have to go
> to Bionic as well.

Sorry, it seems I have confused you in my SRU template. See backport note #3
about the two commits for F and G:

> 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
> and "dm raid: remove unnecessary discard limits for raid10" due to not having
> the following commit, which was merged in 5.1-rc1:
>
> commit 61697a6abd24acba941359c6268a94f4afe4a53d
> Author: Mike Snitzer <snitzer@redhat.com>
> Date: Fri Jan 18 14:19:26 2019 -0500
> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d

Now, "dm: eliminate 'split_discard_bios' flag from DM target interface" is a
really messy backport to Bionic. It also changes the DM API and version
requirements, and it also needs a bunch of dependency commits to enable
blk_queue_split() to handle splitting discards based on queue_limits. In 4.15,
this is all handled by DM core.

The two commits marked for F and G but not B do not actually modify performance
in any way, and technically aren't needed to solve the problem. I included them
for F and G to provide a more complete fix, but they can be omitted if necessary.

I decided that making the two commits marked for F and G work in Bionic is too
much of a regression risk, and did not include them in the SRU. The benefits for
the 4.15 kernel do not outweigh the risks, while for the 5.4 and 5.8 kernels,
they will get one of the commits via upstream -stable anyway, so we may as well
ship a complete solution.

As for the paragraph that mentions the changes in 4.12, it is referencing the
overall architecture changes required for raid0 to gain better performance, as
it was used as a design template for the changes to raid10. I included it more
or less as a statement that the new code follows a design implemented in 4.12,
and that the design still holds up today, in order to try to lend the new patches
some credibility, since they are more or less the same refactor as introduced
in:

> All the commits mentioned follow a similar strategy which was implemented in
> Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block
> discard performance issues in Raid0:
>
> commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
> Author: Shaohua Li <shli@fb.com>
> Date: Sun May 7 17:36:24 2017 -0700
> Subject: md/md0: optimize raid0 discard handling
> Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0

"follow a similar strategy which was implemented in Raid0" was a poor choice of
words by me to say that the whole patchset is derived from the refactor of raid0
to fix similar problems back in 4.12.

Sorry for the confusion.

> Besides that, I am not sure how exactly that might be better phrased, but
> personally I stumbled over "remove 'address of' pointer for...". Maybe "do
> not use a pointer for one of the arguments to ..." but not sure.

Possibly we could change the wording to "remove pointer reference for parameter
in ...".

e.g. for "md/raid10: improve raid10 discard request":

(backported from commit bcc90d280465ebd51ab8688be86e1f00c62dccf9)
[mruffell: change submit_bio_noacct() to generic_make_request(), remove pointer
reference for parameter in bio_split(), mempool_alloc(), bio_clone_fast()]
Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>

e.g. for "md/raid10: improve discard request for far layout":

(backported from commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359)
[mruffell: remove pointer reference for parameter in mempool_alloc()]
Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>

Would you like me to send a V2? Or is it possible for you to change in place?

Let me know if you have any more concerns. I know this is a big SRU request, so
I have tried to document it the best I can. A customer would also really
like this SRU to go through, since they are avoiding all use of raid10 due to
the severe delays with block discard, but they want to use raid10 on NVMe
devices, and are frustrated that they can't. At the same time, I also understand
if you have regression concerns due to the amount of code changing.

Thanks,
Matthew

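For a sense of scale behind the [Impact] section of the cover letter, the splitting cost can be estimated with plain shell arithmetic. This is only a back-of-the-envelope sketch: bios_needed is a made-up helper name, and it assumes every bio is filled right up to the discard_max_bytes cap, which ignores any extra alignment-related splitting the kernel may do.

```shell
#!/bin/sh
# Rough count of discard bios needed to cover a range when each bio is
# capped at discard_max_bytes (ceiling division on integers).
# bios_needed is a hypothetical helper, not a kernel or util-linux tool.
bios_needed() {
    total_bytes=$1
    max_bytes=$2
    echo $(( (total_bytes + max_bytes - 1) / max_bytes ))
}

# 1.9TB discard with the 512k cap reported by the unpatched md0 queue
bios_needed 1900000000000 524288          # prints 3623963

# Same discard with the NVMe limit quoted in the cover letter
bios_needed 1900000000000 2199023255040   # prints 1
```

With the 512k cap a single 1.9TB discard becomes roughly 3.6 million bios, which matches the "millions of 512k bio requests" described in the cover letter; with the hardware limit it fits in one request.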
On 02.11.20 01:13, Matthew Ruffell wrote:
> On 29/10/20 9:46 pm, Stefan Bader wrote:
>>> There are also some additional commits which are required, and were merged after
>>> "md/raid10: improve raid10 discard request" was merged. The following commits
>>> enable Raid10 to use large discards, instead of splitting into many bios,
>>> since the technical hurdles have now been removed.
>>
>> The below two patches are marked up as only needed for F and G. What about
>> Bionic? If the changes they refer to were in 4.12, then those would have to go
>> to Bionic as well.
>
> Sorry, it seems I have confused you in my SRU template. See backport note #3
> about the two commits for F and G:
>
>> 3) The 4.15 kernel does not need "dm raid: fix discard limits for raid1 and raid10"
>> and "dm raid: remove unnecessary discard limits for raid10" due to not having
>> the following commit, which was merged in 5.1-rc1:
>>
>> commit 61697a6abd24acba941359c6268a94f4afe4a53d
>> Author: Mike Snitzer <snitzer@redhat.com>
>> Date: Fri Jan 18 14:19:26 2019 -0500
>> Subject: dm: eliminate 'split_discard_bios' flag from DM target interface
>> Link: https://github.com/torvalds/linux/commit/61697a6abd24acba941359c6268a94f4afe4a53d
>
> Now, "dm: eliminate 'split_discard_bios' flag from DM target interface" is a
> really messy backport to Bionic. It also changes the DM API and version
> requirements, and it also needs a bunch of dependency commits to enable
> blk_queue_split() to handle splitting discards based on queue_limits. In 4.15,
> this is all handled by DM core.
>
> The two commits marked for F and G but not B do not actually modify performance
> in any way, and technically aren't needed to solve the problem. I included them
> for F and G to provide a more complete fix, but they can be omitted if necessary.

I guess part of the confusion was that, while your statement said they are not
needed, they sound performance related, and patch #7, in its commit message,
refers to patch #4, which is provided for Bionic as well.

>
> I decided that making the two commits marked for F and G work in Bionic is too
> much of a regression risk, and did not include them in the SRU. The benefits for
> the 4.15 kernel do not outweigh the risks, while for the 5.4 and 5.8 kernels,
> they will get one of the commits via upstream -stable anyway, so we may as well
> ship a complete solution.
>
> As for the paragraph that mentions the changes in 4.12, it is referencing the
> overall architecture changes required for raid0 to gain better performance, as
> it was used as a design template for the changes to raid10. I included it more
> or less as a statement that the new code follows a design implemented in 4.12,
> and that the design still holds up today, in order to try to lend the new patches
> some credibility, since they are more or less the same refactor as introduced
> in:
>
>> All the commits mentioned follow a similar strategy which was implemented in
>> Raid0 in the below commit, which was merged in 4.12-rc2, which fixed block
>> discard performance issues in Raid0:
>>
>> commit 29efc390b9462582ae95eb9a0b8cd17ab956afc0
>> Author: Shaohua Li <shli@fb.com>
>> Date: Sun May 7 17:36:24 2017 -0700
>> Subject: md/md0: optimize raid0 discard handling
>> Link: https://github.com/torvalds/linux/commit/29efc390b9462582ae95eb9a0b8cd17ab956afc0
>
> "follow a similar strategy which was implemented in Raid0" was a poor choice of
> words by me to say that the whole patchset is derived from the refactor of raid0
> to fix similar problems back in 4.12.
>
> Sorry for the confusion.

This is a rather complex case. Confusion to some degree is probably inevitable.

>
>> Besides that, I am not sure how exactly that might be better phrased, but
>> personally I stumbled over "remove 'address of' pointer for...". Maybe "do
>> not use a pointer for one of the arguments to ..." but not sure.
>
> Possibly we could change the wording to "remove pointer reference for parameter
> in ...".
>
> e.g. for "md/raid10: improve raid10 discard request":
>
> (backported from commit bcc90d280465ebd51ab8688be86e1f00c62dccf9)
> [mruffell: change submit_bio_noacct() to generic_make_request(), remove pointer
> reference for parameter in bio_split(), mempool_alloc(), bio_clone_fast()]
> Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>
>
> e.g. for "md/raid10: improve discard request for far layout":
>
> (backported from commit d3ee2d8415a6256c1c41e1be36e80e640c3e6359)
> [mruffell: remove pointer reference for parameter in mempool_alloc()]
> Signed-off-by: Matthew Ruffell <matthew.ruffell@canonical.com>
>
> Would you like me to send a V2? Or is it possible for you to change in place?

This is not something that must change. That I got confused does not mean it is
wrong. Using the "remove pointer reference" form in future will make it
simpler for me to know what I am looking for. If this gets applied we might or
might not change it on the fly, but I do not consider this a blocker.

>
> Let me know if you have any more concerns. I know this is a big SRU request, so
> I have tried to document it the best I can. A customer would also really
> like this SRU to go through, since they are avoiding all use of raid10 due to
> the severe delays with block discard, but they want to use raid10 on NVMe
> devices, and are frustrated that they can't. At the same time, I also understand
> if you have regression concerns due to the amount of code changing.
>

On the good side, there is only one change to md code that is of more generic
use, and that looked to be mostly shifting existing code into a separate
function. So ok. The other changes are raid10 discipline, so with targeted
testing / verification I think we can be fairly safe. So with this:

Acked-by: Stefan Bader <stefan.bader@canonical.com>

> Thanks,
> Matthew
>

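For the targeted verification mentioned in the reply, one quick check is whether the md device now reports the same discard limit as its members, which the [Testcase] section of the cover letter does by hand with cat. A minimal sketch, assuming the array members are whole disks that show up under /sys/block/<md>/slaves/; show_discard_limits and the SYSFS_BLOCK override are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Print discard_max_bytes for an md array and for each member device.
# SYSFS_BLOCK defaults to /sys/block; the override exists only so the
# function can be pointed at a scratch directory for a dry run.
SYSFS_BLOCK=${SYSFS_BLOCK:-/sys/block}

# show_discard_limits is a hypothetical helper, not part of mdadm.
show_discard_limits() {
    md=$1
    printf '%s %s\n' "$md" "$(cat "$SYSFS_BLOCK/$md/queue/discard_max_bytes")"
    for slave in "$SYSFS_BLOCK/$md"/slaves/*; do
        # Skip the unexpanded glob when the array has no slaves directory
        [ -e "$slave" ] || continue
        dev=$(basename "$slave")
        printf '%s %s\n' "$dev" "$(cat "$SYSFS_BLOCK/$dev/queue/discard_max_bytes")"
    done
}

# Example call on a real system (not run here):
#   show_discard_limits md0
```

On an unpatched kernel the md0 line would be expected to show the 524288 value from the cover letter while the NVMe members show 2199023255040; on the test kernels both should agree. Members that are partitions rather than whole disks have no /sys/block entry of their own, so this sketch would need adjusting for them.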
Verified all applies cleanly and build tests ok. Cherry picks look clean and
backports make sense. Appreciate the information and write up. lgtm

Acked-by: Kelsey Skunberg <kelsey.skunberg@canonical.com>

On 2020-10-29 16:07:27 , Matthew Ruffell wrote:
> BugLink: https://bugs.launchpad.net/bugs/1896578
> [...]

Applied to Groovy/master-next

Thanks,
Ian
Applied to Focal/master-next

Thanks,
Ian
Applied to Bionic/master-next

Thanks,
Ian
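For anyone checking the result on the Groovy, Focal or Bionic kernels, the effect described in [Impact] comes down to simple arithmetic: a discard is split into roughly total size divided by discard_max_bytes requests. Below is a minimal sketch of that calculation. The helper name bios_needed is hypothetical, 1.9TB is taken as 1.9 x 10^12 bytes, and the real block layer may split slightly differently because of alignment, so treat the numbers as an estimate only.

```shell
#!/bin/sh
# bios_needed TOTAL_BYTES MAX_BYTES   (hypothetical helper, for illustration)
#
# Prints how many requests a discard of TOTAL_BYTES is split into when
# each request may cover at most MAX_BYTES. This is plain ceiling
# division and ignores any alignment handling done by the kernel.
bios_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}
```

With the old 512k limit, "bios_needed 1900000000000 524288" prints 3623963, which matches the "millions of 512k bio requests" in the cover letter. With the hardware limit of 2199023255040 it prints 1. On a live system the second argument could be taken from "$(cat /sys/block/md0/queue/discard_max_bytes)".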