
mke2fs: Add extended option for prezeroed storage devices

Message ID 20210921034203.323950-1-sarthakkukreti@google.com
State New

Commit Message

Sarthak Kukreti Sept. 21, 2021, 3:42 a.m. UTC
From: Sarthak Kukreti <sarthakkukreti@chromium.org>

This patch adds an extended option "assume_storage_prezeroed" to
mke2fs. When enabled, this option acts as a hint to mke2fs that
the underlying block device was zeroed before mke2fs was called.
This allows mke2fs to optimize out the zeroing of the inode
table and the journal, which speeds up the filesystem creation
time.

Additionally, on thinly provisioned storage devices (like Ceph,
dm-thin), reads on unmapped extents return zero. This property
allows mke2fs (with assume_storage_prezeroed) to avoid
pre-allocating metadata space for inode tables for the entire
filesystem and saves space that would normally be preallocated
for zero inode tables.

Testing on ChromeOS (running linux kernel 4.19) with dm-thin
and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':

- Time taken by mke2fs drops from 1.07s to 0.08s.
- Avoiding zeroing out the inode table and journal reduces the
  initial metadata space allocation from 0.48% to 0.01%.
- Lazy inode table zeroing results in a further 1.45% of logical
  volume space getting allocated for inode tables, even if no file
  data is added to the filesystem. With assume_storage_prezeroed,
  the metadata allocation remains at 0.01%.

Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
---
 misc/mke2fs.8.in |  6 ++++++
 misc/mke2fs.c    | 21 ++++++++++++++++++++-
 2 files changed, 26 insertions(+), 1 deletion(-)
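
For illustration, the option semantics described in the commit message (a hint that is off by default and switched on through an extended option) can be sketched in C. This is a minimal sketch, not the code from the patch: the helper name parse_prezeroed_token is hypothetical, and it assumes the usual mke2fs -E convention that a bare option name means 1 and that "=0" switches the option off.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper, not part of mke2fs: parse one "-E" token of the
 * form "name" or "name=value" and report whether the prezeroed hint is on.
 * Returns 1 if the token was recognised, 0 otherwise.
 */
static int parse_prezeroed_token(const char *token, int *prezeroed)
{
	static const char name[] = "assume_storage_prezeroed";
	size_t n = sizeof(name) - 1;

	assert(token != NULL && prezeroed != NULL);

	if (strncmp(token, name, n) != 0)
		return 0;
	if (token[n] == '\0') {		/* bare option: value defaults to 1 */
		*prezeroed = 1;
		return 1;
	}
	if (token[n] != '=')		/* a longer, different option name */
		return 0;

	char *end;
	unsigned long v = strtoul(token + n + 1, &end, 0);

	if (end == token + n + 1 || *end != '\0')	/* empty or trailing junk */
		return 0;
	*prezeroed = (v != 0);
	return 1;
}
```

Rejecting malformed values is a choice made for this sketch only; it says nothing about how mke2fs itself reports bad extended options.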

Comments

Andreas Dilger Sept. 21, 2021, 9:39 p.m. UTC | #1
On Sep 20, 2021, at 9:42 PM, Sarthak Kukreti <sarthakkukreti@chromium.org> wrote:
> 
> From: Sarthak Kukreti <sarthakkukreti@chromium.org>
> 
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
> 
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin),

... and newly-created sparse loopback files

> reads on unmapped extents return zero. This property
> allows mke2fs (with assume_storage_prezeroed) to avoid
> pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
> 
> Testing on ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
> 
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>  initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>  volume space getting allocated for inode tables, even if no file
>  data is added to the filesystem. With assume_storage_prezeroed,
>  the metadata allocation remains at 0.01%.

This seems beneficial, but I'm wondering if this could also be
done automatically when TRIM/DISCARD is used by mke2fs to erase
a device?

One safe option to do this automatically would be to start by
*reading* the disk blocks and check if they are all zero, and only
switch to zero-block writes if any block is found with non-zero
data.  That would avoid the extra space usage from zero-block
writes in the above cases, and also work for the huge majority of
users that won't know the "assume_storage_prezeroed" option even
exists, though it won't necessarily reduce the runtime.
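
The read-before-write idea above could look roughly like the following. This is a sketch under stated assumptions, not mke2fs code: the function names are hypothetical, it works on a plain file descriptor rather than an io_channel, and it treats a short read as an error.

```c
#define _POSIX_C_SOURCE 200809L	/* for pread()/pwrite() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Return 1 if every byte in buf[0..len) is zero. */
static int buf_is_zero(const unsigned char *buf, size_t len)
{
	for (size_t i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/*
 * Zero `count` blocks of `blksz` bytes starting at byte offset `off`,
 * but only write the blocks that are not already zero. Returns the
 * number of blocks actually written, or -1 on an I/O or memory error.
 */
static long zero_blocks_if_needed(int fd, off_t off, size_t blksz, size_t count)
{
	unsigned char *buf, *zeros;
	long written = -1;

	assert(blksz > 0);
	buf = malloc(blksz);
	zeros = calloc(1, blksz);
	if (!buf || !zeros)
		goto out;

	written = 0;
	for (size_t i = 0; i < count; i++) {
		off_t pos = off + (off_t)(i * blksz);

		if (pread(fd, buf, blksz, pos) != (ssize_t)blksz) {
			written = -1;
			break;
		}
		if (buf_is_zero(buf, blksz))
			continue;	/* already zero: leave it unmapped */
		if (pwrite(fd, zeros, blksz, pos) != (ssize_t)blksz) {
			written = -1;
			break;
		}
		written++;
	}
out:
	free(buf);
	free(zeros);
	return written;
}
```

On a thin or sparse backing store this avoids allocating space for blocks that already read back as zero, at the cost of one read per block.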

> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..5293d9b0 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
> 		io_channel_set_options(fs->io, opt_string);
> 	}
> 
> +	if (assume_storage_prezeroed) {
> +	  if (verbose)
> +			printf("%s",
> +				       _("Assuming the storage device is prezeroed "
> +                         "- skipping inode table and journal wipe\n"));
> +
> +	  lazy_itable_init = 1;
> +	  itable_zeroed = 1;
> +	  zero_hugefile = 0;
> +	  journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +	}

Indentation appears to be broken here - only 2 spaces instead of a tab.

This is also missing any kind of test case.  Since a large number of
the e2fsck test cases are using loopback filesystems created on a sparse
file, this would both be good test cases, as well as reducing time/space
used during testing.

Cheers, Andreas
Kiselev, Oleg Sept. 23, 2021, 3:31 a.m. UTC | #2
Wouldn't it make more sense to use "write-same" of 0 instead of writing a page of zeros and task the layers that do thin provisioning and return 0 on read from unallocated blocks to check if a block exists before writing zeros to it?

Theodore Ts'o Sept. 23, 2021, 3:57 a.m. UTC | #3
On Thu, Sep 23, 2021 at 03:31:00AM +0000, Kiselev, Oleg wrote:
> Wouldn't it make more sense to use "write-same" of 0 instead of
> writing a page of zeros and task the layers that do thin
> provisioning and return 0 on read from unallocated blocks to check
> if a block exists before writing zeros to it?

The problem is we have absolutely no idea what "write-same" of 0 will
actually do in terms of whether it will consume storage for various
thinly provisioned devices.  We also have no idea what the performance
might be.  It might be the same speed as explicitly passing in
zero-filled buffers and sending DMA requests to a hard drive.  (e.g.,
potentially very S-L-O-W.)

That's technically true for "discard" as well, except there's a vague
understanding that discard will generally be faster than writing all
zeros --- it's just that it might also be a no-op, or it might
randomly be a no-op, depending on the phase of the moon, or any
other random variable, including whether "the storage device feels
like it or not".

Bottom line --- unfortunately, the SATA/SCSI standards authors were
mealy-mouthed and made discard something which is completely useless
for our purposes.  And since we don't know anything about the
performance of write same and what it might do from the perspective of
thin-provisioned storage, we can't really depend on it either.

The problem is mke2fs really does need to care about the performance
of discard or write same.  Users want mke2fs to be fast, especially
during the distro installation process.  That's why we implemented the
lazy inode table initialization feature in the first place.  So
reading each block from the inode table to see if it's zero might
be slow, and so we might be better off just doing the lazy itable init
instead.

Hence, I think Sarthak's approach of giving an explicit hint is a good
approach.

The other approach we can use is to depend on metadata checksums, and
the fact that a new file system will use a different UUID for the seed
for the checksum.  Unfortunately, in order to make this work well, we
need to change e2fsck so that if the checksum doesn't work out ---
especially if all of the checksums in an inode table block are
incorrect --- we need to assume that it means we should just presume
that the inode table block is from an old instance of the file system,
and return a zero-filled block when reading that inode table block.
(Right now, e2fsck still offers the chance to just fix the checksum,
back when we were worried there might be bugs in the metadata checksum
code.)
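
A toy model of the checksum idea above, to make the rule concrete: if no record in an inode table block verifies against the current filesystem UUID, present the block as zero-filled. Everything here is an assumption made for illustration. The record layout is invented, FNV-1a stands in for the real seeded checksum, and none of the names come from e2fsck or libext2fs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define REC_PAYLOAD 12		/* toy record size, not the ext4 inode layout */

struct toy_inode {
	unsigned char payload[REC_PAYLOAD];
	uint32_t csum;		/* checksum seeded with the filesystem UUID */
};

/* FNV-1a, used here only as a stand-in for a seeded checksum. */
static uint32_t fnv1a(uint32_t h, const void *data, size_t len)
{
	const unsigned char *p = data;

	for (size_t i = 0; i < len; i++) {
		h ^= p[i];
		h *= 16777619u;
	}
	return h;
}

/* Checksum of one record, seeded with the UUID of the owning filesystem. */
static uint32_t toy_csum(const unsigned char uuid[16], const struct toy_inode *ino)
{
	uint32_t seed = fnv1a(2166136261u, uuid, 16);

	return fnv1a(seed, ino->payload, REC_PAYLOAD);
}

/*
 * If no record in the block verifies against the current UUID, treat the
 * block as left over from an older filesystem and present it as zeroes.
 * Returns 1 if the block was replaced with zeroes, 0 if it was kept.
 */
static int filter_stale_block(const unsigned char uuid[16],
			      struct toy_inode *blk, size_t nrec)
{
	assert(nrec > 0);
	for (size_t i = 0; i < nrec; i++)
		if (toy_csum(uuid, &blk[i]) == blk[i].csum)
			return 0;	/* at least one record is ours */
	memset(blk, 0, nrec * sizeof(*blk));
	return 1;
}
```

A block with even one valid record is kept as-is in this sketch, which leaves individual bad checksums to be handled by whatever repair logic already exists.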

But I don't think the two approaches are mutually exclusive.  The
approach of an explicit hint is "safe" and a lot easier to review.

Cheers,

					- Ted
Sarthak Kukreti Oct. 5, 2021, 3:49 a.m. UTC | #4
Hi all,

Thanks for the discussions on the original patch. I wanted to circle
back and see if you had any further comments/concerns on the second
version of the patchset.

Best
Sarthak

On Mon, Sep 27, 2021 at 3:44 AM Sarthak Kukreti
<sarthakkukreti@chromium.org> wrote:
>
> This patch adds an extended option "assume_storage_prezeroed" to
> mke2fs. When enabled, this option acts as a hint to mke2fs that
> the underlying block device was zeroed before mke2fs was called.
> This allows mke2fs to optimize out the zeroing of the inode
> table and the journal, which speeds up the filesystem creation
> time.
>
> Additionally, on thinly provisioned storage devices (like Ceph,
> dm-thin, newly created sparse loopback files), reads on unmapped extents
> return zero. This property allows mke2fs (with assume_storage_prezeroed)
> to avoid pre-allocating metadata space for inode tables for the entire
> filesystem and saves space that would normally be preallocated
> for zero inode tables.
>
> Tests
> -----
> 1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
> filesystem drops the time taken by mke2fs from 0.09s to 0.04s
> and reduces the initial metadata space allocation (stat on
> sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
>
> 2) On ChromeOS (running linux kernel 4.19) with dm-thin
> and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
> - Time taken by mke2fs drops from 1.07s to 0.08s.
> - Avoiding zeroing out the inode table and journal reduces the
>   initial metadata space allocation from 0.48% to 0.01%.
> - Lazy inode table zeroing results in a further 1.45% of logical
>   volume space getting allocated for inode tables, even if no file
>   data is added to the filesystem. With assume_storage_prezeroed,
>   the metadata allocation remains at 0.01%.
>
> Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
> --
> Changes in v2: Added regression test, fixed indentation.
> ---
>  misc/mke2fs.8.in                        |  7 ++++++
>  misc/mke2fs.c                           | 21 ++++++++++++++++-
>  tests/m_assume_storage_prezeroed/expect |  2 ++
>  tests/m_assume_storage_prezeroed/script | 31 +++++++++++++++++++++++++
>  4 files changed, 60 insertions(+), 1 deletion(-)
>  create mode 100644 tests/m_assume_storage_prezeroed/expect
>  create mode 100644 tests/m_assume_storage_prezeroed/script
>
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index c0b53245..5c6ea5ec 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
>  entirely one time.  If the option value is omitted, it defaults to 1 to
>  enable lazy journal inode zeroing.
>  .TP
> +.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +If enabled,
> +.BR mke2fs
> +assumes that the storage device has been prezeroed, skips zeroing the journal
> +and inode tables, and annotates the block group flags to signal that the inode
> +table has been zeroed.
> +.TP
>  .B no_copy_xattrs
>  Normally
>  .B mke2fs
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index 04b2fbce..24c69966 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -95,6 +95,7 @@ int   journal_size;
>  int    journal_flags;
>  int    journal_fc_size;
>  static int     lazy_itable_init;
> +static int     assume_storage_prezeroed;
>  static int     packed_meta_blocks;
>  int            no_copy_xattrs;
>  static char    *bad_blocks_filename = NULL;
> @@ -1012,6 +1013,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                                 lazy_itable_init = strtoul(arg, &p, 0);
>                         else
>                                 lazy_itable_init = 1;
> +               } else if (!strcmp(token, "assume_storage_prezeroed")) {
> +                       if (arg)
> +                               assume_storage_prezeroed = strtoul(arg, &p, 0);
> +                       else
> +                               assume_storage_prezeroed = 1;
>                 } else if (!strcmp(token, "lazy_journal_init")) {
>                         if (arg)
>                                 journal_flags |= strtoul(arg, &p, 0) ?
> @@ -1115,7 +1121,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                         "\tnodiscard\n"
>                         "\tencoding=<encoding>\n"
>                         "\tencoding_flags=<flags>\n"
> -                       "\tquotatype=<quota type(s) to be enabled>\n\n"),
> +                       "\tquotatype=<quota type(s) to be enabled>\n"
> +                       "\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
>                         badopt ? badopt : "");
>                 free(buf);
>                 exit(1);
> @@ -3095,6 +3102,18 @@ int main (int argc, char *argv[])
>                 io_channel_set_options(fs->io, opt_string);
>         }
>
> +       if (assume_storage_prezeroed) {
> +               if (verbose)
> +                       printf("%s",
> +                              _("Assuming the storage device is prezeroed "
> +                              "- skipping inode table and journal wipe\n"));
> +
> +               lazy_itable_init = 1;
> +               itable_zeroed = 1;
> +               zero_hugefile = 0;
> +               journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +       }
> +
>         /* Can't undo discard ... */
>         if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
>                 retval = mke2fs_discard_device(fs);
> diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
> new file mode 100644
> index 00000000..2ca3784a
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/expect
> @@ -0,0 +1,2 @@
> +2384
> +336
> diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
> new file mode 100644
> index 00000000..0745fb28
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/script
> @@ -0,0 +1,31 @@
> +test_description="test prezeroed storage metadata allocation"
> +FILE_SIZE=16M
> +
> +LOG=$test_name.log
> +OUT=$test_name.out
> +EXP=$test_dir/expect
> +
> +dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal $TMPFILE.1 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.1 > $OUT
> +
> +$MKE2FS -o Linux -t ext4 -O has_journal -E assume_storage_prezeroed=1 $TMPFILE.2 >> $LOG 2>&1
> +stat -c "%b" $TMPFILE.2 >> $OUT
> +
> +rm -f $TMPFILE.1 $TMPFILE.2
> +
> +cmp -s $OUT $EXP
> +status=$?
> +
> +if [ "$status" = 0 ] ; then
> +       echo "$test_name: $test_description: ok"
> +       touch $test_name.ok
> +else
> +       echo "$test_name: $test_description: failed"
> +       cat $LOG > $test_name.failed
> +       diff $EXP $OUT >> $test_name.failed
> +fi
> +
> +unset LOG OUT EXP FILE_SIZE
> \ No newline at end of file
> --
> 2.31.0
>
Theodore Ts'o Oct. 25, 2021, 4:25 a.m. UTC | #5
I tried running the regression test, and it was failing for me; it
showed that even with -E assume_storage_prezeroed, the size of the
$TMPFILE.1 and $TMPFILE.2 was the same.  Looking into this, it was
because in lib/ext2fs/unix_io.c, when the file is a plain file
io_channel_discard_zeroes_data() returns true, since it assumes that
we can use PUNCH_HOLE to implement unix_io_discard(), which is
guaranteed to work.
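
For reference, the quantity the regression test compares is the allocated block count of the backing file (st_blocks, which is what stat -c "%b" prints). A minimal C sketch of that measurement follows. The helper names are hypothetical, and the sketch assumes a filesystem where a freshly truncated file is sparse and written blocks are reflected in st_blocks after fsync().

```c
#define _POSIX_C_SOURCE 200809L	/* for pwrite() and fsync() */
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Return the number of blocks allocated to fd, as reported in st_blocks
 * (the value printed by `stat -c "%b"`), or -1 on error.
 */
static long long allocated_blocks(int fd)
{
	struct stat st;

	assert(fd >= 0);
	if (fstat(fd, &st) != 0)
		return -1;
	return (long long)st.st_blocks;
}

/*
 * Write `len` zero bytes at offset `off` and flush them, so the effect on
 * st_blocks is visible. Returns 0 on success, -1 on error.
 */
static int write_zeros(int fd, off_t off, size_t len)
{
	static const unsigned char zeros[4096];

	while (len > 0) {
		size_t chunk = len < sizeof(zeros) ? len : sizeof(zeros);
		ssize_t n = pwrite(fd, zeros, chunk, off);

		if (n <= 0)
			return -1;
		off += n;
		len -= (size_t)n;
	}
	return fsync(fd);
}
```

Calling allocated_blocks() before and after write_zeros() on a sparse file shows the growth that explicit zeroing causes, which is the effect the test tries to detect.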

So I had to change the regression test to use losetup, which also
meant that the test had to run as root....

Anyway, this is what I've checked into e2fsprogs.

      	       	    	       	  - Ted

commit bd2e72c5c5521b561d20a881c843a64a5832721a
Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
Date:   Mon Sep 27 03:39:10 2021 -0700

    mke2fs: add extended option for prezeroed storage devices
    
    This patch adds an extended option "assume_storage_prezeroed" to
    mke2fs. When enabled, this option acts as a hint to mke2fs that the
    underlying block device was zeroed before mke2fs was called.  This
    allows mke2fs to optimize out the zeroing of the inode table and the
    journal, which speeds up the filesystem creation time.
    
    Additionally, on thinly provisioned storage devices (like Ceph,
    dm-thin, newly created sparse loopback files), reads on unmapped
    extents return zero. This property allows mke2fs (with
    assume_storage_prezeroed) to avoid pre-allocating metadata space for
    inode tables for the entire filesystem and saves space that would
    normally be preallocated for zero inode tables.
    
    Tests
    -----
    1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
    filesystem drops the time taken by mke2fs from 0.09s to 0.04s
    and reduces the initial metadata space allocation (stat on
    sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
    
    2) On ChromeOS (running linux kernel 4.19) with dm-thin
    and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
    
    - Time taken by mke2fs drops from 1.07s to 0.08s.
    - Avoiding zeroing out the inode table and journal reduces the
      initial metadata space allocation from 0.48% to 0.01%.
    - Lazy inode table zeroing results in a further 1.45% of logical
      volume space getting allocated for inode tables, even if no file
      data is added to the filesystem. With assume_storage_prezeroed,
      the metadata allocation remains at 0.01%.
    
    [ Fixed regression test to work on newer versions of e2fsprogs -- TYT ]
    
    Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
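
As a worked conversion of the block counts quoted in test 1 of the commit message, the small C helper below turns a block count into MiB. It is only arithmetic, and the helper name is hypothetical. The parenthesised sizes in the commit message match a 4096-byte unit; on Linux, stat's "%b" normally counts 512-byte units (see "%B"), so both cases are worth computing.

```c
#include <assert.h>

/* Convert a block count to MiB for a given block unit in bytes. */
static double blocks_to_mib(unsigned long long blocks, unsigned int unit)
{
	assert(unit > 0);
	return (double)blocks * unit / (1024.0 * 1024.0);
}
```

With a 4096-byte unit, 139736 blocks is 545.84 MiB and 8672 blocks is 33.88 MiB. With a 512-byte unit the same counts are 68.23 MiB and 4.23 MiB.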

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index b378e4d7..30f97bb5 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
 .TP
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
+.TP
 .B no_copy_xattrs
 Normally
 .B mke2fs
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index c955b318..76b8b8c6 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -96,6 +96,7 @@ int	journal_flags;
 int	journal_fc_size;
 static e2_blkcnt_t	orphan_file_blocks;
 static int	lazy_itable_init;
+static int	assume_storage_prezeroed;
 static int	packed_meta_blocks;
 int		no_copy_xattrs;
 static char	*bad_blocks_filename = NULL;
@@ -1013,6 +1014,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1131,7 +1137,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3125,6 +3132,18 @@ int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}
 
+	if (assume_storage_prezeroed) {
+		if (verbose)
+			printf("%s",
+			       _("Assuming the storage device is prezeroed "
+			       "- skipping inode table and journal wipe\n"));
+
+		lazy_itable_init = 1;
+		itable_zeroed = 1;
+		zero_hugefile = 0;
+		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);
diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
new file mode 100644
index 00000000..b735e242
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/expect
@@ -0,0 +1,2 @@
+> 10000
+224
diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
new file mode 100644
index 00000000..1a8d8463
--- /dev/null
+++ b/tests/m_assume_storage_prezeroed/script
@@ -0,0 +1,63 @@
+test_description="test prezeroed storage metadata allocation"
+FILE_SIZE=16M
+
+LOG=$test_name.log
+OUT=$test_name.out
+EXP=$test_dir/expect
+
+if test "$(id -u)" -ne 0 ; then
+    echo "$test_name: $test_description: skipped (not root)"
+elif ! command -v losetup >/dev/null ; then
+    echo "$test_name: $test_description: skipped (no losetup)"
+else
+    dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+    dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
+
+    LOOP1=$(losetup --show --sector-size 4096 -f $TMPFILE.1)
+    if [ ! -b "$LOOP1" ]; then
+        echo "$test_name: $test_description: skipped (no loop devices)"
+        rm -f $TMPFILE.1 $TMPFILE.2
+        exit 0
+    fi
+    LOOP2=$(losetup --show --sector-size 4096 -f $TMPFILE.2)
+    if [ ! -b "$LOOP2" ]; then
+        echo "$test_name: $test_description: skipped (no loop devices)"
+        rm -f $TMPFILE.1 $TMPFILE.2
+	losetup -d $LOOP1
+        exit 0
+    fi
+
+    echo $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+    $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
+    sync
+    stat $TMPFILE.1 >> $LOG 2>&1
+    SZ=$(stat -c "%b" $TMPFILE.1)
+    if test $SZ -gt 10000 ; then
+	echo "> 10000" > $OUT
+    else
+	echo "$SZ" > $OUT
+    fi
+
+    echo $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+    $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
+    sync
+    stat $TMPFILE.2 >> $LOG 2>&1
+    stat -c "%b" $TMPFILE.2 >> $OUT
+
+    losetup -d $LOOP1
+    losetup -d $LOOP2
+    rm -f $TMPFILE.1 $TMPFILE.2
+
+    cmp -s $OUT $EXP
+    status=$?
+
+    if [ "$status" = 0 ] ; then
+	echo "$test_name: $test_description: ok"
+	touch $test_name.ok
+    else
+	echo "$test_name: $test_description: failed"
+	cat $LOG > $test_name.failed
+	diff $EXP $OUT >> $test_name.failed
+    fi
+fi
+unset LOG OUT EXP FILE_SIZE LOOP1 LOOP2
Gwendal Grignou March 11, 2022, 9:49 a.m. UTC | #6
Ted,

I noticed Sarthak's patch is not in e2fsprogs-1.46.5 December release.
His patch is in the |master| branch (commit bd2e72c5c552 ("mke2fs: add
extended option for prezeroed storage devices")) since September, but
not in the |maint| branch. Other patches were not included either -
see below. Is this expected?

git log --cherry-mark --oneline --left-right  origin/master...origin/maint
< 96185e9b (origin/next, origin/master, origin/HEAD) Merge branch 'maint' into next
< f85b4526 tune2fs: implement support for set/get label iocts
< 8adeabee Merge branch 'maint' into next
< 02827d06 ext2fs: avoid re-reading inode multiple times
< bd2e72c5 mke2fs: add extended option for prezeroed storage devices
< a8f52588 dumpe2fs, debugfs, e2image: Add support for orphan file
< 795101dd tune2fs: Add support for orphan_file feature
< d0c52ffb e2fsck: Add support for handling orphan file
< 818da4a9 mke2fs: Add support for orphan_file feature
< 1d551c68 libext2fs: Support for orphan file feature

Gwendal.

On Sun, Oct 24, 2021 at 9:25 PM Theodore Ts'o <tytso@mit.edu> wrote:
>
> I tried running the regression test, and it was failing for me; it
> showed that even with -E assume_storage_prezeroed, the size of the
> $TMPFILE.1 and $TMPFILE.2 was the same.  Looking into this, it was
> because in lib/ext2fs/unix_io.c, when the file is a plain file
> io_channel_discard_zeroes_data() returns true, since it assumes that
> we can use PUNCH_HOLE to implement unix_io_discard(), which is
> guaranteed to work.
>
> So I had to change the regression test to use losetup, which also
> meant that the test had to run as root....
>
> Anyway, this is what I've checked into e2fsprogs.
>
>                                   - Ted
>
> commit bd2e72c5c5521b561d20a881c843a64a5832721a
> Author: Sarthak Kukreti <sarthakkukreti@chromium.org>
> Date:   Mon Sep 27 03:39:10 2021 -0700
>
>     mke2fs: add extended option for prezeroed storage devices
>
>     This patch adds an extended option "assume_storage_prezeroed" to
>     mke2fs. When enabled, this option acts as a hint to mke2fs that the
>     underlying block device was zeroed before mke2fs was called.  This
>     allows mke2fs to optimize out the zeroing of the inode table and the
>     journal, which speeds up the filesystem creation time.
>
>     Additionally, on thinly provisioned storage devices (like Ceph,
>     dm-thin, newly created sparse loopback files), reads on unmapped
>     extents return zero. This property allows mke2fs (with
>     assume_storage_prezeroed) to avoid pre-allocating metadata space for
>     inode tables for the entire filesystem and saves space that would
>     normally be preallocated for zero inode tables.
>
>     Tests
>     -----
>     1) Running 'mke2fs -t ext4' on 10G sparse files on an ext4
>     filesystem drops the time taken by mke2fs from 0.09s to 0.04s
>     and reduces the initial metadata space allocation (stat on
>     sparse file) from 139736 blocks (545M) to 8672 blocks (34M).
>
>     2) On ChromeOS (running linux kernel 4.19) with dm-thin
>     and 200GB thin logical volumes using 'mke2fs -t ext4 <dev>':
>
>     - Time taken by mke2fs drops from 1.07s to 0.08s.
>     - Avoiding zeroing out the inode table and journal reduces the
>       initial metadata space allocation from 0.48% to 0.01%.
>     - Lazy inode table zeroing results in a further 1.45% of logical
>       volume space getting allocated for inode tables, even if no file
>       data is added to the filesystem. With assume_storage_prezeroed,
>       the metadata allocation remains at 0.01%.
>
>     [ Fixed regression test to work on newer versions of e2fsprogs -- TYT ]
>
>     Signed-off-by: Sarthak Kukreti <sarthakkukreti@chromium.org>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>
> diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
> index b378e4d7..30f97bb5 100644
> --- a/misc/mke2fs.8.in
> +++ b/misc/mke2fs.8.in
> @@ -365,6 +365,13 @@ small risk if the system crashes before the journal has been overwritten
>  entirely one time.  If the option value is omitted, it defaults to 1 to
>  enable lazy journal inode zeroing.
>  .TP
> +.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
> +If enabled,
> +.BR mke2fs
> +assumes that the storage device has been prezeroed, skips zeroing the journal
> +and inode tables, and annotates the block group flags to signal that the inode
> +table has been zeroed.
> +.TP
>  .B no_copy_xattrs
>  Normally
>  .B mke2fs
> diff --git a/misc/mke2fs.c b/misc/mke2fs.c
> index c955b318..76b8b8c6 100644
> --- a/misc/mke2fs.c
> +++ b/misc/mke2fs.c
> @@ -96,6 +96,7 @@ int   journal_flags;
>  int    journal_fc_size;
>  static e2_blkcnt_t     orphan_file_blocks;
>  static int     lazy_itable_init;
> +static int     assume_storage_prezeroed;
>  static int     packed_meta_blocks;
>  int            no_copy_xattrs;
>  static char    *bad_blocks_filename = NULL;
> @@ -1013,6 +1014,11 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                                 lazy_itable_init = strtoul(arg, &p, 0);
>                         else
>                                 lazy_itable_init = 1;
> +               } else if (!strcmp(token, "assume_storage_prezeroed")) {
> +                       if (arg)
> +                               assume_storage_prezeroed = strtoul(arg, &p, 0);
> +                       else
> +                               assume_storage_prezeroed = 1;
>                 } else if (!strcmp(token, "lazy_journal_init")) {
>                         if (arg)
>                                 journal_flags |= strtoul(arg, &p, 0) ?
> @@ -1131,7 +1137,8 @@ static void parse_extended_opts(struct ext2_super_block *param,
>                         "\tnodiscard\n"
>                         "\tencoding=<encoding>\n"
>                         "\tencoding_flags=<flags>\n"
> -                       "\tquotatype=<quota type(s) to be enabled>\n\n"),
> +                       "\tquotatype=<quota type(s) to be enabled>\n"
> +                       "\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
>                         badopt ? badopt : "");
>                 free(buf);
>                 exit(1);
> @@ -3125,6 +3132,18 @@ int main (int argc, char *argv[])
>                 io_channel_set_options(fs->io, opt_string);
>         }
>
> +       if (assume_storage_prezeroed) {
> +               if (verbose)
> +                       printf("%s",
> +                              _("Assuming the storage device is prezeroed "
> +                              "- skipping inode table and journal wipe\n"));
> +
> +               lazy_itable_init = 1;
> +               itable_zeroed = 1;
> +               zero_hugefile = 0;
> +               journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
> +       }
> +
>         /* Can't undo discard ... */
>         if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
>                 retval = mke2fs_discard_device(fs);
> diff --git a/tests/m_assume_storage_prezeroed/expect b/tests/m_assume_storage_prezeroed/expect
> new file mode 100644
> index 00000000..b735e242
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/expect
> @@ -0,0 +1,2 @@
> +> 10000
> +224
> diff --git a/tests/m_assume_storage_prezeroed/script b/tests/m_assume_storage_prezeroed/script
> new file mode 100644
> index 00000000..1a8d8463
> --- /dev/null
> +++ b/tests/m_assume_storage_prezeroed/script
> @@ -0,0 +1,63 @@
> +test_description="test prezeroed storage metadata allocation"
> +FILE_SIZE=16M
> +
> +LOG=$test_name.log
> +OUT=$test_name.out
> +EXP=$test_dir/expect
> +
> +if test "$(id -u)" -ne 0 ; then
> +    echo "$test_name: $test_description: skipped (not root)"
> +elif ! command -v losetup >/dev/null ; then
> +    echo "$test_name: $test_description: skipped (no losetup)"
> +else
> +    dd if=/dev/zero of=$TMPFILE.1 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +    dd if=/dev/zero of=$TMPFILE.2 bs=1 count=0 seek=$FILE_SIZE >> $LOG 2>&1
> +
> +    LOOP1=$(losetup --show --sector-size 4096 -f $TMPFILE.1)
> +    if [ ! -b "$LOOP1" ]; then
> +        echo "$test_name: $test_description: skipped (no loop devices)"
> +        rm -f $TMPFILE.1 $TMPFILE.2
> +        exit 0
> +    fi
> +    LOOP2=$(losetup --show --sector-size 4096 -f $TMPFILE.2)
> +    if [ ! -b "$LOOP2" ]; then
> +        echo "$test_name: $test_description: skipped (no loop devices)"
> +        rm -f $TMPFILE.1 $TMPFILE.2
> +       losetup -d $LOOP1
> +        exit 0
> +    fi
> +
> +    echo $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
> +    $MKE2FS -o Linux -t ext4 $LOOP1 >> $LOG 2>&1
> +    sync
> +    stat $TMPFILE.1 >> $LOG 2>&1
> +    SZ=$(stat -c "%b" $TMPFILE.1)
> +    if test $SZ -gt 10000 ; then
> +       echo "> 10000" > $OUT
> +    else
> +       echo "$SZ" > $OUT
> +    fi
> +
> +    echo $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
> +    $MKE2FS -o Linux -t ext4 -E assume_storage_prezeroed=1 $LOOP2 >> $LOG 2>&1
> +    sync
> +    stat $TMPFILE.2 >> $LOG 2>&1
> +    stat -c "%b" $TMPFILE.2 >> $OUT
> +
> +    losetup -d $LOOP1
> +    losetup -d $LOOP2
> +    rm -f $TMPFILE.1 $TMPFILE.2
> +
> +    cmp -s $OUT $EXP
> +    status=$?
> +
> +    if [ "$status" = 0 ] ; then
> +       echo "$test_name: $test_description: ok"
> +       touch $test_name.ok
> +    else
> +       echo "$test_name: $test_description: failed"
> +       cat $LOG > $test_name.failed
> +       diff $EXP $OUT >> $test_name.failed
> +    fi
> +fi
> +unset LOG OUT EXP FILE_SIZE LOOP1 LOOP2
>
Patch

diff --git a/misc/mke2fs.8.in b/misc/mke2fs.8.in
index c0b53245..b82f8445 100644
--- a/misc/mke2fs.8.in
+++ b/misc/mke2fs.8.in
@@ -364,6 +364,13 @@  This speeds up file system initialization noticeably, but carries some
 small risk if the system crashes before the journal has been overwritten
 entirely one time.  If the option value is omitted, it defaults to 1 to
 enable lazy journal inode zeroing.
+.TP
+.B assume_storage_prezeroed\fR[\fB= \fI<0 to disable, 1 to enable>\fR]
+If enabled,
+.BR mke2fs
+assumes that the storage device has been prezeroed, skips zeroing the journal
+and inode tables, and annotates the block group flags to signal that the inode
+table has been zeroed.
 .TP
 .B no_copy_xattrs
 Normally
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index 04b2fbce..5293d9b0 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -95,6 +95,7 @@  int	journal_size;
 int	journal_flags;
 int	journal_fc_size;
 static int	lazy_itable_init;
+static int	assume_storage_prezeroed;
 static int	packed_meta_blocks;
 int		no_copy_xattrs;
 static char	*bad_blocks_filename = NULL;
@@ -1012,6 +1013,11 @@  static void parse_extended_opts(struct ext2_super_block *param,
 				lazy_itable_init = strtoul(arg, &p, 0);
 			else
 				lazy_itable_init = 1;
+		} else if (!strcmp(token, "assume_storage_prezeroed")) {
+			if (arg)
+				assume_storage_prezeroed = strtoul(arg, &p, 0);
+			else
+				assume_storage_prezeroed = 1;
 		} else if (!strcmp(token, "lazy_journal_init")) {
 			if (arg)
 				journal_flags |= strtoul(arg, &p, 0) ?
@@ -1115,7 +1121,8 @@  static void parse_extended_opts(struct ext2_super_block *param,
 			"\tnodiscard\n"
 			"\tencoding=<encoding>\n"
 			"\tencoding_flags=<flags>\n"
-			"\tquotatype=<quota type(s) to be enabled>\n\n"),
+			"\tquotatype=<quota type(s) to be enabled>\n"
+			"\tassume_storage_prezeroed=<0 to disable, 1 to enable>\n\n"),
 			badopt ? badopt : "");
 		free(buf);
 		exit(1);
@@ -3095,6 +3102,18 @@  int main (int argc, char *argv[])
 		io_channel_set_options(fs->io, opt_string);
 	}
 
+	if (assume_storage_prezeroed) {
+		if (verbose)
+			printf("%s",
+			       _("Assuming the storage device is prezeroed "
+			       "- skipping inode table and journal wipe\n"));
+
+		lazy_itable_init = 1;
+		itable_zeroed = 1;
+		zero_hugefile = 0;
+		journal_flags |= EXT2_MKJOURNAL_LAZYINIT;
+	}
+
 	/* Can't undo discard ... */
 	if (!noaction && discard && dev_size && (io_ptr != undo_io_manager)) {
 		retval = mke2fs_discard_device(fs);