
[0/5,v7] ext4: Speedup orphan file handling

Message ID 20210816093626.18767-1-jack@suse.cz

Message

Jan Kara Aug. 16, 2021, 9:57 a.m. UTC
Hello,

Here is the seventh version of my series to speed up orphan inode handling in
ext4. I had forgotten to add a check that the orphan file is not exposed in the
directory hierarchy; the only change in this version is the addition of that
fix.

Orphan inode handling in ext4 is a bottleneck for workloads which heavily
exercise truncate / unlink of small files, as they contend on the global
s_orphan_mutex (when you have fast enough storage). This patch set implements
a new way of handling orphan inodes: instead of using a linked list, we store
inode numbers of orphaned inodes in a file, which can be implemented in a more
scalable manner than linked list manipulations. See the description of patch
3/5 for more details.
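The scalability idea can be sketched schematically: the on-disk orphan list
serializes every add/delete behind one global mutex, while an orphan file is an
array of slots that threads can claim under independent per-block locks. The
classes and names below are illustrative only, not the kernel implementation:

```python
# Schematic model of the two schemes; names are illustrative, not the
# kernel's. The old list needs one global lock (s_orphan_mutex); the
# orphan file is an array of blocks, each protected by its own lock.
from threading import Lock

class OrphanList:
    """Old scheme: every orphan add/delete takes one global mutex."""
    def __init__(self):
        self.lock = Lock()          # models s_orphan_mutex
        self.chain = []             # models the on-disk inode chain
    def add(self, ino):
        with self.lock:
            self.chain.insert(0, ino)

class OrphanFile:
    """New scheme: per-block locks, so threads working on different
    blocks do not contend with each other."""
    FREE = 0
    def __init__(self, nr_blocks, slots_per_block):
        self.locks = [Lock() for _ in range(nr_blocks)]
        self.slots = [[self.FREE] * slots_per_block
                      for _ in range(nr_blocks)]
    def add(self, ino, hint):
        b = hint % len(self.locks)  # e.g. start at a per-CPU block
        with self.locks[b]:
            for i, s in enumerate(self.slots[b]):
                if s == self.FREE:
                    self.slots[b][i] = ino
                    return (b, i)
        return None                 # block full: caller tries another
```

When a block is full, add() returns None in this toy model; in the real patch
set the code moves on to other blocks and ultimately falls back to the old
linked-list scheme when the orphan file has no free slot at all.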

The patch set achieves significant gains both for a microbenchmark stressing
orphan inode handling (truncating a file byte by byte, several threads in
parallel) and for the reaim creat_clo workload. I'm happy for any review,
thoughts, or ideas about the patches. I have also implemented full support in
e2fsprogs, which I'll send separately.
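The byte-by-byte truncate microbenchmark mentioned above can be approximated
with a small script (a hypothetical reconstruction, not the actual benchmark
used for the measurements): each thread repeatedly shrinks its own file by one
byte, and on ext4 every shrinking truncate adds and then removes the inode
from the orphan list, so this hammers exactly the contended path.

```python
# Hypothetical reconstruction of the microbenchmark: N threads each
# truncate their own file byte by byte; on ext4 every shrinking
# truncate performs an orphan add + delete, stressing that path.
import os
import tempfile
import threading

def shrink_byte_by_byte(path, size):
    with open(path, "wb") as f:
        f.truncate(size)                    # start at `size` bytes
    for new_size in range(size - 1, -1, -1):
        os.truncate(path, new_size)         # one orphan add/delete each

def run(nthreads=4, size=4096):
    tmpdir = tempfile.mkdtemp()
    paths = [os.path.join(tmpdir, "f%d" % i) for i in range(nthreads)]
    threads = [threading.Thread(target=shrink_byte_by_byte, args=(p, size))
               for p in paths]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [os.path.getsize(p) for p in paths]
```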

								Honza


Changes since v6:
* Rebased on top of v5.14-rc6 + patches fixing exposure of hidden inodes in
  directory hierarchy
* Added a check that the orphan file cannot be linked from the directory
  hierarchy
* Get the orphan file inode with the EXT4_IGET_SPECIAL flag

Changes since v5:
* Added Reviewed-by tags from Ted
* Fixed up sparse warning spotted by 0-day
* Fixed error handling path in ext4_orphan_add() to not leak orphan entry

Changes since v4:
* Rebased on top of v5.14-rc5
* Updated commit message of patch 1/5
* Added Reviewed-by tags from Ted

Changes since v3:
* Added documentation about on-disk format changes
* Added the physical block number into the orphan block checksum
* Improved some sanity checks and handling of a corrupted orphan file
* Improved some changelogs

Changes since v2:
* Updated some comments
* Rebased onto 5.13-rc5
* Changed the orphan file inode from a fixed inode number to an inode number
  stored in the superblock

Changes since v1:
* Orphan blocks now have magic numbers
* Split out orphan handling into a separate source file
* Some smaller updates according to review

Previous versions:
Link: http://lore.kernel.org/r/20210816091810.16994-1-jack@suse.cz # v6
Link: http://lore.kernel.org/r/20210811101006.2033-1-jack@suse.cz # v5
Link: https://lore.kernel.org/linux-ext4/20210712154009.9290-1-jack@suse.cz/ # v4
Link: https://lore.kernel.org/linux-ext4/20210616105655.5129-1-jack@suse.cz/ # v3
Link: https://lore.kernel.org/linux-ext4/1432293717-24010-1-git-send-email-jack@suse.cz/ # v2

Comments

Theodore Ts'o Aug. 24, 2021, 5:14 p.m. UTC | #1
I've been running some tests exercising the orphan_file code, and
there are a number of failures:

ext4/orphan_file: 512 tests, 3 failures, 25 skipped, 7325 seconds
  Failures: ext4/044 generic/475 generic/643
ext4/orphan_file_1k: 524 tests, 6 failures, 37 skipped, 8361 seconds
  Failures: ext4/033 ext4/044 ext4/045 generic/273 generic/476 generic/643

generic/643 is the iomap swap failure, and can be ignored.
generic/475 is a pre-existing test flake that involves simulated disk
failures, which we can also ignore in the context of orphan_file.

However, ext4/044 is one that looks... interesting:

root@kvm-xfstests:~# e2fsck -fn /dev/vdc
e2fsck 1.46.4-orphan-file (22-Aug-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Orphan file (inode 12) block 0 is not clean.
Clear? no

Failed to initialize orphan file.
Recreate? no

This is highly reproducible, and involves using a file system config
that is probably a little unusual:

Filesystem features:      has_journal ext_attr resize_inode dir_index orphan_file filetype sparse_super large_file

(This was created using "mke2fs -t ext3 -O orphan_file".)

The orphan_file_1k failures seem to involve running out of space in
the orphan file, and the fallback to using the old-fashioned orphan
list seems to return ENOSPC?  For example, from ext4/045:

    +mkdir: No space left on device
    +Failed to create directories - 19679

ext4/045 creates a lot of directories with mkdir (ext4/045 tests
creating more than 65000 subdirectories in a directory), and so this
seems to be triggering a failure?

					- Ted
Jan Kara Aug. 25, 2021, 11:30 a.m. UTC | #2
On Tue 24-08-21 13:14:09, Theodore Ts'o wrote:
> I've been running some tests exercising the orphan_file code, and
> there are a number of failures:
> 
> ext4/orphan_file: 512 tests, 3 failures, 25 skipped, 7325 seconds
>   Failures: ext4/044 generic/475 generic/643
> ext4/orphan_file_1k: 524 tests, 6 failures, 37 skipped, 8361 seconds
>   Failures: ext4/033 ext4/044 ext4/045 generic/273 generic/476 generic/643
> 
> generic/643 is the iomap swap failure, and can be ignored.
> generic/475 is a pre-existing test flake that involves simulated disk
> failures, which we can also ignore in the context of orphan_file.
> 
> However, ext4/044 is one that looks... interesting:
> 
> root@kvm-xfstests:~# e2fsck -fn /dev/vdc
> e2fsck 1.46.4-orphan-file (22-Aug-2021)
> Pass 1: Checking inodes, blocks, and sizes
> Pass 2: Checking directory structure
> Pass 3: Checking directory connectivity
> Pass 4: Checking reference counts
> Pass 5: Checking group summary information
> Orphan file (inode 12) block 0 is not clean.
> Clear? no
> 
> Failed to initialize orphan file.
> Recreate? no
> 
> This is highly reproducible, and involves using a file system config
> that is probably a little unusual:
> 
> Filesystem features:      has_journal ext_attr resize_inode dir_index orphan_file filetype sparse_super large_file
> 
> (This was created using "mke2fs -t ext3 -O orphan_file".)

Interesting. I don't see how the orphan handling code gets used at all for
this test. Hrm. Actually it seems to be a bug in the tools themselves, because
just "mke2fs -t ext3 -O orphan_file" followed by "e2fsck -f" reproduces
exactly this failure. It seems that when I added the physical block number to
the orphan file block checksum, I broke e2fsck for the case when metadata_csum
is disabled. I've fixed the bug now (relative diff attached; I can resend the
full series once the other bugs are dealt with as well).
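The failure mode is generic enough to illustrate: whenever a writer mixes an
extra input (here, the physical block number) into a checksum, every checker
must mix in exactly the same input under exactly the same conditions, or
verification fails even though the on-disk data is fine. A toy model of that
bug class follows; it uses plain zlib.crc32 and made-up inputs, not the actual
e2fsprogs code (which uses crc32c and filesystem-specific seeding):

```python
# Toy model of the bug class: writer and checker must agree on whether
# the physical block number is part of the checksummed payload. This
# uses plain zlib.crc32; the real code differs (crc32c, fs seed, etc.).
import zlib

def orphan_block_csum(blk_nr, entries, include_blk_nr):
    payload = b""
    if include_blk_nr:
        # mix the physical block number into the checksummed bytes
        payload += blk_nr.to_bytes(8, "little")
    payload += entries
    return zlib.crc32(payload)

written = orphan_block_csum(1234, b"orphan entries", include_blk_nr=True)
checked = orphan_block_csum(1234, b"orphan entries", include_blk_nr=False)
# written != checked: a checker applying the wrong rule reports the
# block as "not clean" even though nothing on disk is corrupted.
```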

> The orphan_file_1k failures seem to involve running out of space in
> the orphan_file, and the fallback to using the old fashioned orphan
> list seems to return ENOSPC?  For example, from ext4/045:
> 
>     +mkdir: No space left on device
>     +Failed to create directories - 19679
> 
> ext4/045 creates a lot of directories with mkdir (ext4/045 tests
> creating more than 65000 subdirectories in a directory), and so this
> seems to be triggering a failure?

Strange. I don't see how ext4/045 load could run out of space in the orphan
file (and in fact I did test that the fallback when we run out of space in
the orphan file works correctly). Anyway, I'll look into it. Thanks for the
reports!

								Honza
Jan Kara Aug. 25, 2021, 4:13 p.m. UTC | #3
On Wed 25-08-21 13:30:16, Jan Kara wrote:
> On Tue 24-08-21 13:14:09, Theodore Ts'o wrote:
> > I've been running some tests exercising the orphan_file code, and
> > there are a number of failures:
> > 
> > ext4/orphan_file: 512 tests, 3 failures, 25 skipped, 7325 seconds
> >   Failures: ext4/044 generic/475 generic/643
> > ext4/orphan_file_1k: 524 tests, 6 failures, 37 skipped, 8361 seconds
> >   Failures: ext4/033 ext4/044 ext4/045 generic/273 generic/476 generic/643

So I had a look into the other failures. ext4/044 works for me after
fixing e2fsck (in both the 1k and 4k cases). ext4/033, ext4/045, and
generic/273 fail for me in the 1k case even without the orphan file
patches, so I don't think they are a regression caused by my changes
(specifically, ext4/045 is a buggy test - I think the directory h-tree
is not able to hold that many directories for 1k block size).
Interestingly, I couldn't make generic/476 fail either with or without
my patches, so that may be some random failure. I'm now running that
test in a loop to see whether the failure will reproduce so I can
investigate.

So overall I don't see any other regressions with my patches (besides that
e2fsck bug). Or did I miss something?

								Honza
Theodore Ts'o Aug. 25, 2021, 5:48 p.m. UTC | #4
On Wed, Aug 25, 2021 at 06:13:31PM +0200, Jan Kara wrote:
> 
> So I had a look into the other failures... So ext4/044 works for me after
> fixing e2fsck (both in 1k and 4k cases). ext4/033, ext4/045, generic/273
> fail for me in the 1k case even without orphan file patches so I don't
> think they are a regression caused by my changes (specifically ext4/045 is
> a buggy test - I think the directory h-tree is not able to hold that many
> directories for 1k block size). Interestingly, I couldn't make generic/476
> fail for me either with or without my patches so that may be some random
> failure. I'm now running that test in a loop to see whether the failure
> will reproduce to investigate.

Oh, you're right.  I had forgotten I had the following in my
1k.exclude file, and I hadn't copied them over when I set up the
orphan_file_1k config.

# The test fails due to too many block group descriptors when the
# block size is 1k
ext4/033

# This test uses dioread_nolock which currently isn't supported when
# block_size != PAGE_SIZE.
ext4/034

# This test tries to create 65536 directories, and with 1k blocks,
# and long names, we run out of htree depth
ext4/045

# This test creates too many inodes when the block size is 1k
# without using special mkfs.ext4 options to change the inode size.
# This test is a bit bogus anyway, and uses a bunch of magic calculations
# where it's not clear what it was originally trying to test in the
# first place.  So let's just skip it for now.
generic/273

# This test creates too many extended attributes to fit in a 1k block
generic/454

# Normal configurations don't support dax
-g dax

(We can drop ext4/034 from the exclude list since we now *do* support
dioread_nolock when block_size < page_size.)

Thanks for investigating, and apologies for the noise - the reason I
hadn't seen the failures in the 1k case was that they had been
suppressed.

						- Ted
Jan Kara Aug. 25, 2021, 10:13 p.m. UTC | #5
On Wed 25-08-21 13:48:18, Theodore Ts'o wrote:
> On Wed, Aug 25, 2021 at 06:13:31PM +0200, Jan Kara wrote:
> > 
> > So I had a look into the other failures... So ext4/044 works for me after
> > fixing e2fsck (both in 1k and 4k cases). ext4/033, ext4/045, generic/273
> > fail for me in the 1k case even without orphan file patches so I don't
> > think they are a regression caused by my changes (specifically ext4/045 is
> > a buggy test - I think the directory h-tree is not able to hold that many
> > directories for 1k block size). Interestingly, I couldn't make generic/476
> > fail for me either with or without my patches so that may be some random
> > failure. I'm now running that test in a loop to see whether the failure
> > will reproduce to investigate.
> 
> Oh, you're right.  I had forgotten I had the following in my
> 1k.exclude file, and I hadn't copied them over when I set up the
> orphan_file_1k config.
> 
> # The test fails due to too many block group descriptors when the
> # block size is 1k
> ext4/033
> 
> # This test uses dioread_nolock which currently isn't supported when
> # block_size != PAGE_SIZE.
> ext4/034
> 
> # This test tries to create 65536 directories, and with 1k blocks,
> # and long names, we run out of htree depth
> ext4/045
> 
> # This test creates too many inodes when the block size is 1k
> # without using special mkfs.ext4 options to change the inode size.
> # This test is a bit bogus anyway, and uses a bunch of magic calculations
> # where it's not clear what it was originally trying to test in the
> # first place.  So let's just skip it for now.
> generic/273
> 
> # This test creates too many extended attributes to fit in a 1k block
> generic/454
> 
> # Normal configurations don't support dax
> -g dax
> 
> (We can drop ext4/034 from the exclude list since we now *do* support
> dioread_nolock when block_size < page_size.)
> 
> Thanks for investigating, and apologies for the noise - the reason I
> hadn't seen the failures in the 1k case was that they had been
> suppressed.

No problem, glad this is cleared up. I've sent out the next version of
the e2fsprogs patches with the e2fsck bug you found fixed.

								Honza
Theodore Ts'o Aug. 25, 2021, 10:47 p.m. UTC | #6
On Wed, Aug 25, 2021 at 06:13:31PM +0200, Jan Kara wrote:
>
> Interestingly, I couldn't make generic/476
> fail for me either with or without my patches so that may be some random
> failure. I'm now running that test in a loop to see whether the failure
> will reproduce to investigate.

On my test infrastructure (using gce-xfstests, with 5 GiB test and
scratch partitions on a PD-SSD device, 2 CPUs, and 7.5 GB of memory),
it's failing 70% of the time in the 1k and orphan_file_1k cases.

Looking through my notes, that's a known failure:

generic/476	1k
   Run an all-writes fsstress run with multiple threads to shake out
   bugs in the write path.

   50% of the time.  fsck complaining corrupted file system block
   bitmap differences all negative.  Possible cause: races loading
   adjacent block bitmaps in a single 4k page, leading to bitmap
   updates getting lost?

   Eric Whitney reports that it's a punch hole bug, where we are
   leaking a block or a cluster (in bigalloc); with fixed seed he
   can repro reliably at 100%.

					- Ted


TESTRUNID: tytso-20210825172314
KERNEL:    kernel 5.14.0-rc2-xfstests-00019-g3e5533948c16 #326 SMP Mon Aug 23 22:27:25
EDT 2021 x86_64
CMDLINE:   --update-files -c 1k -C 10 generic/476
CPUS:      2
MEM:       7680

ext4/1k: 10 tests, 7 failures, 3399 seconds
  generic/476  Failed   327s
  generic/476  Failed   348s
  generic/476  Failed   344s
  generic/476  Pass     343s
  generic/476  Pass     335s
  generic/476  Failed   348s
  generic/476  Failed   341s
  generic/476  Failed   335s
  generic/476  Failed   345s
  generic/476  Pass     333s
Totals: 10 tests, 0 skipped, 7 failures, 0 errors, 3399s

FSTESTIMG: gce-xfstests/xfstests-202108232144
FSTESTPRJ: gce-xfstests
FSTESTVER: blktests 62ec9d6 (Tue, 1 Jun 2021 10:08:38 -0700)
FSTESTVER: e2fsprogs v1.46.4-5-g4cda2545-orphan_file (Sun, 22 Aug 2021 10:07:15 -0400)
FSTESTVER: fio  fio-3.27 (Wed, 26 May 2021 10:10:32 -0600)
FSTESTVER: fsverity v1.4 (Mon, 14 Jun 2021 16:14:52 -0700)
FSTESTVER: ima-evm-utils v1.3.2 (Wed, 28 Oct 2020 13:18:08 -0400)
FSTESTVER: nvme-cli v1.14-61-gf729e93 (Fri, 25 Jun 2021 10:21:17 -0600)
FSTESTVER: quota  v4.05-40-g25f16b1 (Tue, 16 Mar 2021 17:57:19 +0100)
FSTESTVER: util-linux v2.37 (Tue, 1 Jun 2021 09:52:10 +0200)
FSTESTVER: xfsprogs v5.12.0 (Fri, 21 May 2021 15:53:24 -0400)
FSTESTVER: xfstests-bld 779b6a0 (Mon, 23 Aug 2021 21:29:52 -0400)
FSTESTVER: xfstests linux-v3.8-3295-gddc09fff (Mon, 23 Aug 2021 20:54:57 -0400)
FSTESTCFG: 1k
FSTESTSET: generic/476
FSTESTOPT: count 10 aex
GCE ID:    2585249240830380583
Theodore Ts'o Aug. 26, 2021, 2:55 p.m. UTC | #7
On Mon, Aug 16, 2021 at 11:57:03AM +0200, Jan Kara wrote:
> 
> Here is the seventh version of my series to speed up orphan inode handling in
> ext4. I've forgot to add a check that orphan file is not exposed in directory
> hierarchy. The only change in this version is addition of that fix.

Thanks, applied.

					- Ted