Message ID | 20210921034203.323950-1-sarthakkukreti@google.com |
---|---|
State | New |
Series | mke2fs: Add extended option for prezeroed storage devices |
On Sep 20, 2021, at 9:42 PM, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
>
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
>
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
>
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin), ... and newly-created sparse loopback files
> reads on unmapped extents return zero. This property
> allows mke2fs (with assume_storage_prezeroed) to avoid
> pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
>
> Testing on ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>   initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>   volume space getting allocated for inode tables, even if no file
>   data is added to the filesystem. With assume_storage_prezeroed,
>   the metadata allocation remains at 0.01%.

This seems beneficial, but I'm wondering if this could also be done
automatically when TRIM/DISCARD is used by mke2fs to erase a device?

One safe option to do this automatically would be to start by *reading*
the disk blocks and checking whether they are all zero, and only switch
to zero-block writes if any block is found with non-zero data. That would
avoid the extra space usage from zero-block writes in the above cases,
and also work for the huge majority of users that won't know the
"assume_storage_prezeroed" option even exists, though it won't
necessarily reduce the runtime.

> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..5293d9b0 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
>  		io_channel_set_options(fs->io, opt_string);
>  	}
>
> +  if (assume_storage_prezeroed) {
> +    if (verbose)
> +      printf("%s",
> +             _("Assuming the storage device is prezeroed "
> +             "- skipping inode table and journal wipe\n"));
> +
> +    lazy_itable_init = 1;
> +    itable_zeroed = 1;
> +    zero_hugefile = 0;
> +    journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +  }

Indentation appears to be broken here - only 2 spaces instead of a tab.

This is also missing any kind of test case. Since a large number of the
e2fsck test cases use loopback filesystems created on a sparse file,
using this option there would both provide good test coverage and reduce
the time/space used during testing.

Cheers, Andreas
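The read-before-write idea suggested above can be sketched in C roughly as follows. This is only an illustration of the technique, not code from mke2fs or libext2fs: the helper names (buffer_is_zero, zero_range_if_needed, make_sparse_tmpfile), the 4096-byte chunk size and the temporary file path are all assumptions made for the example, and a short read is simply treated as an error.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096	/* assumed I/O unit for this sketch */

/* Return 1 if buf[0..len) contains only zero bytes, 0 otherwise. */
static int buffer_is_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/*
 * Make [off, off + len) read back as zeros, writing only the chunks that
 * currently hold non-zero data.  Returns the number of chunks written,
 * or -1 on an I/O error (a short read counts as an error here).
 */
static long zero_range_if_needed(int fd, off_t off, size_t len)
{
	static const unsigned char zeros[CHUNK];
	unsigned char buf[CHUNK];
	long written = 0;

	assert(fd >= 0);
	while (len > 0) {
		size_t n = len < CHUNK ? len : CHUNK;

		if (pread(fd, buf, n, off) != (ssize_t)n)
			return -1;
		if (!buffer_is_zero(buf, n)) {
			if (pwrite(fd, zeros, n, off) != (ssize_t)n)
				return -1;
			written++;
		}
		off += n;
		len -= n;
	}
	return written;
}

/* Hypothetical helper: create an unlinked sparse temp file of `size` bytes. */
static int make_sparse_tmpfile(off_t size)
{
	char path[] = "/tmp/zerocheck-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

On a freshly created sparse file the first call writes nothing, so no space is allocated; the cost is that every chunk still has to be read, which is the runtime concern raised in the reply.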
Wouldn't it make more sense to use "write-same" of 0 instead of
writing a page of zeros and task the layers that do thin
provisioning and return 0 on read from unallocated blocks to check
if a block exists before writing zeros to it?

On 9/21/21, 2:40 PM, "Andreas Dilger" <adilger@dilger.ca> wrote:

> This seems beneficial, but I'm wondering if this could also be done
> automatically when TRIM/DISCARD is used by mke2fs to erase a device?
>
> One safe option to do this automatically would be to start by *reading*
> the disk blocks and check if they are all zero, and only switch to
> zero-block writes if any block is found with non-zero data. That would
> avoid the extra space usage from zero-block writes in the above cases,
> and also work for the huge majority of users that won't know the
> "assume_storage_prezeroed" option even exists, though it won't
> necessarily reduce the runtime.
On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices. We also have no idea what the performance
might be. It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive. (e.g.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any other
random variable, including whether "the storage device feels like it
or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes. And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same. Users want mke2fs to be fast, especially
during the distro installation process. That's why we implemented the
lazy inode table initialization feature in the first place. So
reading each block of the inode table to see if it's zero might be
slow, and so we might be better off just doing the lazy itable init
instead.

Hence, I think Sarthak's approach of giving an explicit hint is a good
approach.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum. Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- we presume that the inode table block is from an old
instance of the file system, and return a zero-filled block when
reading that inode table block. (Right now, e2fsck still offers the
chance to just fix the checksum, a holdover from when we were worried
there might be bugs in the metadata checksum code.)

But I don't think the two approaches are mutually exclusive. The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted
Hi all,

Thanks for the discussions on the original patch. I wanted to circle
back and see if you had any further comments/concerns on the second
version of the patchset.

Best
Sarthak

On Mon, Sep 27, 2021 at 3:44 AM Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
>
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
>
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin, newly created sparse loopback files), reads on unmapped extents
> return zero. This property allows mke2fs (with assume_storage_prezeroed)
> to avoid pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
>
> Tests
> -----
> 1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
> filesystem drops the time taken by mke2fs from 0.09s to 0.04s
> and reduces the initial metadata space allocation (stat on
> sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
>
> 2) On ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>   initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>   volume space getting allocated for inode tables, even if no file
>   data is added to the filesystem. With assume_storage_prezeroed,
>   the metadata allocation remains at 0.01%.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> --
> Changes in v2: Added regression test, fixed indentation.
> ---
>  misc/mke2fs.8.in                        |  7 ++++++
>  misc/mke2fs.c                           | 21 ++++++++++++++++-
>  tests/m_assume_storage_prezeroed/expect |  2 ++
>  tests/m_assume_storage_prezeroed/script | 31 +++++++++++++++++++++++++
>  4 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100644 tests/m_assume_storage_prezeroed/expect
>  create mode 100644 tests/m_assume_storage_prezeroed/script
>
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index c0b53245..5c6ea5ec 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
>  entirely one time. If the option value is omitted, it defaults to 1 to
>  enable lazy journal inode zeroing.
>  .TP
> +.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +If enabled,
> +.BR mke2fs
> +assumes that the storage device has been prezeroed, skips zeroing the journal
> +and inode tables, and annotates the block group flags to signal that the inode
> +table has been zeroed.
> +.TP
>  .B no_copy_xattrs
>  Normally
>  .B mke2fs
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..24c69966 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -95,6 +95,7 @@ int journal_size;
>  int journal_flags;
>  int journal_fc_size;
>  static int lazy_itable_init;
> +static int assume_storage_prezeroed;
>  static int packed_meta_blocks;
>  int no_copy_xattrs;
>  static char *bad_blocks_filename = NULL;
> @@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
>  				lazy_itable_init = strtoul(arg, &p, 0);
>  			else
>  				lazy_itable_init = 1;
> +		} else if (!strcmp(token, "assume_storage_prezeroed")) {
> +			if (arg)
> +				assume_storage_prezeroed = strtoul(arg, &p, 0);
> +			else
> +				assume_storage_prezeroed = 1;
>  		} else if (!strcmp(token, "lazy_journal_init")) {
>  			if (arg)
>  				journal_flags |= strtoul(arg, &p, 0) ?
> @@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
>  			"\tnodiscard\n"
>  			"\tencoding=<encoding>\n"
>  			"\tencoding_flags=<flags>\n"
> -			"\tquotatype=<quota type(s) to be enabled>\n\n"),
> +			"\tquotatype=<quota type(s) to be enabled>\n"
> +			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
>  			badopt ? badopt : "");
>  		free(buf);
>  		exit(1);
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
>  		io_channel_set_options(fs->io, opt_string);
>  	}
>
> +	if (assume_storage_prezeroed) {
> +		if (verbose)
> +			printf("%s",
> +			       _("Assuming the storage device is prezeroed "
> +			       "- skipping inode table and journal wipe\n"));
> +
> +		lazy_itable_init = 1;
> +		itable_zeroed = 1;
> +		zero_hugefile = 0;
> +		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +	}
> +
>  	/* Can't undo discard ... */
>  	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
>  		retval = mke2fs_discard_device(fs);
> diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
> new file mode 100644
> index 00000000..2ca3784a
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/expect
> @@ -0,0 +1,2 @@
> +2384
> +336
> diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
> new file mode 100644
> index 00000000..0745fb28
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/script
> @@ -0,0 +1,31 @@
> +test_description="test prezeroed storage metadata allocation"
> +FILE_SIZE=16M
> +
> +LOG=$test_name.log
> +OUT=$test_name.out
> +EXP=$test_dir/expect
> +
> +dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal $TMPFILE.1 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.1 > $OUT
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal -E assume_storage_prezeroed=1 $TMPFILE.2 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.2 >> $OUT
> +
> +rm -f $TMPFILE.1 $TMPFILE.2
> +
> +cmp -s $OUT $EXP
> +status=$?
> +
> +if [ "$status" = 0 ] ; then
> +	echo "$test_name: $test_description: ok"
> +	touch $test_name.ok
> +else
> +	echo "$test_name: $test_description: failed"
> +	cat $LOG > $test_name.failed
> +	diff $EXP $OUT >> $test_name.failed
> +fi
> +
> +unset LOG OUT EXP FILE_SIZE
> \ No newline at end of file
> --
> 2.31.0
>
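The regression test above relies on the difference between a sparse file's apparent size and the space actually allocated to it, which it reads with stat -c "%b". As a rough C illustration of that measurement (a sketch only, not part of the patch): on Linux, st_blocks from fstat() counts allocated 512-byte units. The helper names and the temporary file path below are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create an unlinked sparse temp file of `size` bytes. */
static int make_sparse_tmpfile(off_t size)
{
	char path[] = "/tmp/sparsedemo-XXXXXX";
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	unlink(path);
	if (ftruncate(fd, size) != 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* Apparent size of fd in bytes (st_size), or -1 on error. */
static long long apparent_size(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? (long long)st.st_size : -1;
}

/* Allocated space of fd in 512-byte units (st_blocks), or -1 on error. */
static long long allocated_units(int fd)
{
	struct stat st;

	return fstat(fd, &st) == 0 ? (long long)st.st_blocks : -1;
}

/* Write `len` bytes of non-zero data at `off` and flush them to storage. */
static int fill_nonzero(int fd, off_t off, size_t len)
{
	unsigned char buf[4096];

	assert(fd >= 0);
	memset(buf, 0xA5, sizeof(buf));
	while (len > 0) {
		size_t n = len < sizeof(buf) ? len : sizeof(buf);

		if (pwrite(fd, buf, n, off) != (ssize_t)n)
			return -1;
		off += n;
		len -= n;
	}
	return fsync(fd);
}
```

A freshly truncated 16M file should report far fewer allocated units than its size implies; anything mke2fs writes into it, such as zeroed inode tables, shows up as growth in allocated_units().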
I tried running the regression test, and it was failing for me; it
showed that even with -E assume_storage_prezeroed, the size of the
$TMPFILE.1 and $TMPFILE.2 was the same. Looking into this, it was
because in lib/ext2fs/unix_io.c, when the file is a plain file
io_channel_discard_zeroes_data() returns true, since it assumes that
we can use PUNCH_HOLE to implement unix_io_discard(), which is
guaranteed to work.

So I had to change the regression test to use losetup, which also
meant that the test had to run as root....

Anyway, this is what I've checked into e2fsprogs.

					- Ted

commit bd2e72c5c5521b561d20a881c843a64a5832721a
Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date:   Mon Sep 27 03:39:10 2021 -0700

    mke2fs: add extended option for prezeroed storage devices

    This patch adds an extended option "assume_storage_prezeroed" to
    mke2fs. When enabled, this option acts as a hint to mke2fs that the
    underlying block device was zeroed before mke2fs was called. This
    allows mke2fs to optimize out the zeroing of the inode table and the
    journal, which speeds up the filesystem creation time.

    Additionally, on thinly provisioned storage devices (like Ceph,
    dm-thin, newly created sparse loopback files), reads on unmapped
    extents return zero. This property allows mke2fs (with
    assume_storage_prezeroed) to avoid pre-allocating metadata space for
    inode tables for the entire filesystem and saves space that would
    normally be preallocated for zero inode tables.

    Tests
    -----
    1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
    filesystem drops the time taken by mke2fs from 0.09s to 0.04s
    and reduces the initial metadata space allocation (stat on
    sparse file) from 139736 blocks (545M) to 8672 blocks (34M).

    2) On ChromeOS (running linux kernel 4.19) with dm-thin
    and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':

    - Time taken by mke2fs drops from 1.07s to 0.08s.
    - Avoiding zeroing out the inode table and journal reduces the
      initial metadata space allocation from 0.48% to 0.01%.
    - Lazy inode table zeroing results in a further 1.45% of logical
      volume space getting allocated for inode tables, even if no file
      data is added to the filesystem. With assume_storage_prezeroed,
      the metadata allocation remains at 0.01%.

    [ Fixed regression test to work on newer versions of e2fsprogs -- TYT ]

    Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index b378e4d7..30f97bb5 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
 entirely one time. If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
 .TP
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
+.TP
 .B no_copy_xattrs
 Normally
 .B mke2fs
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c955b318..76b8b8c6 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -96,6 +96,7 @@ int journal_flags;
 int journal_fc_size;
 static e2_blkcnt_t orphan_file_blocks;
 static int lazy_itable_init;
+static int assume_storage_prezeroed;
 static int packed_meta_blocks;
 int no_copy_xattrs;
 static char *bad_blocks_filename = NULL;
@@ -1013,6 +1014,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1131,7 +1137,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3125,6 +3132,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}

+	if (assume_storage_prezeroed) {
+		if (verbose)
+			printf("%s",
+			       _("Assuming the storage device is prezeroed "
+			       "- skipping inode table and journal wipe\n"));
+
+		lazy_itable_init = 1;
+		itable_zeroed = 1;
+		zero_hugefile = 0;
+		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
new file mode 100644
index 00000000..b735e242
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/expect
@@ -0,0 +1,2 @@
+> 10000
+224
diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
new file mode 100644
index 00000000..1a8d8463
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/script
@@ -0,0 +1,63 @@
+test_description="test prezeroed storage metadata allocation"
+FILE_SIZE=16M
+
+LOG=$test_name.log
+OUT=$test_name.out
+EXP=$test_dir/expect
+
+if test "$(id -u)" -ne 0 ; then
+	echo "$test_name: $test_description: skipped (not root)"
+elif ! command -v losetup >/dev/null ; then
+	echo "$test_name: $test_description: skipped (no losetup)"
+else
+	dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+	dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+
+	LOOP1=$(losetup --show --sector-size 4096 -f $TMPFILE.1)
+	if [ ! -b "$LOOP1" ]; then
+		echo "$test_name: $DESCRIPTION: skipped (no loop devices)"
+		rm -f $TMPFILE.1 $TMPFILE.2
+		exit 0
+	fi
+	LOOP2=$(losetup --show --sector-size 4096 -f $TMPFILE.2)
+	if [ ! -b "$LOOP2" ]; then
+		echo "$test_name: $DESCRIPTION: skipped (no loop devices)"
+		rm -f $TMPFILE.1 $TMPFILE.2
+		losetup -d $LOOP1
+		exit 0
+	fi
+
+	echo $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+	$MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+	sync
+	stat $TMPFILE.1 >> $LOG 2>&1
+	SZ=$(stat -c "%b" $TMPFILE.1)
+	if test $SZ -gt 10000 ; then
+		echo "> 10000" > $OUT
+	else
+		echo "$SZ" > $OUT
+	fi
+
+	echo $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+	$MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+	sync
+	stat $TMPFILE.2 >> $LOG 2>&1
+	stat -c "%b" $TMPFILE.2 >> $OUT
+
+	losetup -d $LOOP1
+	losetup -d $LOOP2
+	rm -f $TMPFILE.1 $TMPFILE.2
+
+	cmp -s $OUT $EXP
+	status=$?
+
+	if [ "$status" = 0 ] ; then
+		echo "$test_name: $test_description: ok"
+		touch $test_name.ok
+	else
+		echo "$test_name: $test_description: failed"
+		cat $LOG > $test_name.failed
+		diff $EXP $OUT >> $test_name.failed
+	fi
+fi
+unset LOG OUT EXP FILE_SIZE LOOP1 LOOP2
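The explanation of why the plain-file test failed rests on hole punching: a punched range of a regular file reads back as zeros, so a discard implemented that way really does zero the data. A minimal Linux-only sketch of that behaviour follows. It is an illustration under assumptions, not the unix_io.c implementation: the function name and temporary path are invented, and any fallocate() failure is simply reported as "not supported here".

```c
#define _GNU_SOURCE	/* for fallocate() and FALLOC_FL_* in <fcntl.h> */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/*
 * Write non-zero data to a temp file, punch a hole over it, read it back.
 * Returns  1 if the punched range reads back as zeros,
 *          0 if hole punching failed (treated as unsupported in this sketch),
 *         -1 on any other failure, including non-zero data after the punch.
 */
static int punch_hole_reads_zero(void)
{
	char path[] = "/tmp/punchhole-XXXXXX";
	unsigned char buf[4096];
	size_t i;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	memset(buf, 0xA5, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;

	/* PUNCH_HOLE must be combined with KEEP_SIZE; the file length stays put. */
	if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
		      0, sizeof(buf)) != 0) {
		ret = 0;
		goto out;
	}

	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		goto out;
	for (i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0)
			goto out;
	ret = 1;
out:
	close(fd);
	return ret;
}
```

Because the default mke2fs run on a plain file already gets this guarantee from discard, both files in the original test ended up the same size, which is why the committed test goes through loop devices instead.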
Ted,

I noticed Sarthak's patch is not in the e2fsprogs-1.46.5 December release.
His patch has been in the |master| branch (commit bd2e72c5c552 ("mke2fs: add
extended option for prezeroed storage devices")) since September, but
not in the |maint| branch. Other patches were not included either -
see below. Is this expected?

git log --cherry-mark --oneline --left-right origin/master...origin/maint
< 96185e9b (origin/next, origin/master, origin/HEAD) Merge branch 'maint' into next
< f85b4526 tune2fs: implement support for set/get label iocts
< 8adeabee Merge branch 'maint' into next
< 02827d06 ext2fs: avoid re-reading inode multiple times
< bd2e72c5 mke2fs: add extended option for prezeroed storage devices
< a8f52588 dumpe2fs, debugfs, e2image: Add support for orphan file
< 795101dd tune2fs: Add support for orphan_file feature
< d0c52ffb e2fsck: Add support for handling orphan file
< 818da4a9 mke2fs: Add support for orphan_file feature
< 1d551c68 libext2fs: Support for orphan file feature

Gwendal.

On Sun, Oct 24, 2021 at 9:25 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I tried running the regression test, and it was failing for me; it
> showed that even with -E assume_storage_prezeroed, the size of the
> $TMPFILE.1 and $TMPFILE.2 was the same. Looking into this, it was
> because in lib/ext2fs/unix_io.c, when the file is a plain file
> io_channel_discard_zeroes_data() returns true, since it assumes that
> we can use PUNCH_HOLE to implement unix_io_discard(), which is
> guaranteed to work.
>
> So I had to change the regression test to use losetup, which also
> meant that the test had to run as root....
>
> Anyway, this is what I've checked into e2fsprogs.
>
> - Ted
diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index c0b53245..b82f8445 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -364,6 +364,12 @@ This speeds up file system initialization noticeably, but carries some
 small risk if the system crashes before the journal has been overwritten
 entirely one time. If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
 .TP
 .B no_copy_xattrs
 Normally
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 04b2fbce..5293d9b0 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -95,6 +95,7 @@ int journal_size;
 int journal_flags;
 int journal_fc_size;
 static int lazy_itable_init;
+static int assume_storage_prezeroed;
 static int packed_meta_blocks;
 int no_copy_xattrs;
 static char *bad_blocks_filename = NULL;
@@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}

+  if (assume_storage_prezeroed) {
+    if (verbose)
+      printf("%s",
+             _("Assuming the storage device is prezeroed "
+             "- skipping inode table and journal wipe\n"));
+
+    lazy_itable_init = 1;
+    itable_zeroed = 1;
+    zero_hugefile = 0;
+    journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+  }
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
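The option handling in the parse_extended_opts() hunk follows the usual "name" or "name=value" pattern: a bare name enables the option, and an explicit value is converted with strtoul(). A standalone sketch of that pattern is given below. The function parse_flag_opt is hypothetical and is not how mke2fs structures its parser. Unlike the hunk, which stores the strtoul() result directly, this sketch also rejects trailing garbage and normalizes the result to 0 or 1.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one extended-option token of the form "name" or "name=value".
 * Returns  1 and stores 0 or 1 in *out if the token names `name`
 *            (a bare "name" means 1),
 *          0 if the token is some other option,
 *         -1 if the value is missing or is not a valid number.
 */
static int parse_flag_opt(const char *token, const char *name, int *out)
{
	size_t nlen = strlen(name);
	const char *arg;
	unsigned long val;
	char *end;

	if (strncmp(token, name, nlen) != 0)
		return 0;
	if (token[nlen] == '\0') {	/* bare option: enable */
		*out = 1;
		return 1;
	}
	if (token[nlen] != '=')		/* a different, longer option name */
		return 0;

	arg = token + nlen + 1;
	val = strtoul(arg, &end, 0);	/* base 0: accepts decimal, hex, octal */
	if (end == arg || *end != '\0')
		return -1;
	*out = (val != 0);
	return 1;
}
```

For example, parse_flag_opt("assume_storage_prezeroed=0", "assume_storage_prezeroed", &flag) sets flag to 0, while the bare token sets it to 1, matching the "0 to disable, 1 to enable" wording in the usage text.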