mke2fs: use lazy inode init on some discard-able devices

Message ID 4C6EF67A.5080502@redhat.com
State Superseded, archived

Commit Message

Eric Sandeen Aug. 20, 2010, 9:41 p.m. UTC
If a device supports discard -and- returns 0s for discarded blocks,
then we can skip the inode table initialization -and- the inode table
zeroing at mkfs time, and skip the lazy init as well since they are
already zeroed out.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Aug. 23, 2010, 10:49 a.m. UTC | #1
On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:

> If a device supports discard -and- returns 0s for discarded blocks,
> then we can skip the inode table initialization -and- the inode table
> zeroing at mkfs time, and skip the lazy init as well since they are
> already zeroed out.
> 
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

This needs to be configurable in /etc/mke2fs.conf.  Without naming
the manufacturer, I'm aware of at least one device which claims 
that discard works, and will even return zeros --- but after a
power cycle, if the block has not been reallocated, will once again
return the old, pre-discard values that had been stored in that block.

In other words, the discard is not power-cycle persistent...

-- Ted
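
One way to read this request is as a gate: the device's claim is only acted on when the administrator has opted in. The sketch below is a minimal illustration of that decision. The struct, its field names, and the `trust_discard_zeroes` setting are hypothetical; nothing in this thread confirms an mke2fs.conf option by that name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical inputs to the decision; none of these names come from e2fsprogs. */
struct discard_state {
	bool discard_succeeded;      /* the discard request returned success */
	bool device_reports_zeroes;  /* device claims discarded blocks read as 0 */
	bool trust_discard_zeroes;   /* assumed admin opt-in, e.g. from a config file */
};

/*
 * Inode table zeroing may be skipped only if all three conditions hold.
 * A device whose discard is not power-cycle persistent would still pass
 * the first two checks, which is why the explicit opt-in is the last gate.
 */
static bool can_skip_itable_zeroing(const struct discard_state *s)
{
	assert(s != NULL);
	return s->discard_succeeded &&
	       s->device_reports_zeroes &&
	       s->trust_discard_zeroes;
}
```

With the opt-in defaulting to off, a device that misreports its behaviour costs only mkfs time, not correctness.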


Eric Sandeen Aug. 23, 2010, 2:32 p.m. UTC | #2
Theodore Tso wrote:
> On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:
> 
>> If a device supports discard -and- returns 0s for discarded blocks,
>> then we can skip the inode table initialization -and- the inode table
>> zeroing at mkfs time, and skip the lazy init as well since they are
>> already zeroed out.
>> 
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> 
> This needs to be configurable in /etc/mke2fs.conf.  Without naming 
> the manufacturer, I'm aware of at least one device which claims that
> discard works, and will even return zeros --- but after a power
> cycle, if the block has not been reallocated, will once again return
> the old, pre-discard values that had been stored in that block.
> 
> In other words, the discard is not power-cycle persistent...
> 
> -- Ted
> 
> 

yes, I've seen issues like that too.

TBH in that case I'd rather just drop the patch than make another
tunable for the user to figure out...

Making tunables for every permutation of broken hardware doesn't 
scale IMHO. Users will get it wrong often as not, if they even know 
it's there (they'll find out it's there via some web forum or other, 
and it'll become a meme like "set this to go faster" rather than 
understanding all the implications.)

-Eric
Andreas Dilger Aug. 24, 2010, 12:27 a.m. UTC | #3
On 2010-08-23, at 08:32, Eric Sandeen wrote:
> Theodore Tso wrote:
>> On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:
>> 
>>> If a device supports discard -and- returns 0s for discarded blocks,
>>> then we can skip the inode table initialization -and- the inode table
>>> zeroing at mkfs time, and skip the lazy init as well since they are
>>> already zeroed out.
>>> 
>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>> 
>> This needs to be configurable in /etc/mke2fs.conf.  Without naming 
>> the manufacturer, I'm aware of at least one device which claims that
>> discard works, and will even return zeros --- but after a power
>> cycle, if the block has not been reallocated, will once again return
>> the old, pre-discard values that had been stored in that block.
>> 
>> In other words, the discard is not power-cycle persistent...
> 
> yes, I've seen issues like that too.
> 
> TBH in that case I'd rather just drop the patch than make another
> tunable for the user to figure out...

Another option we discussed is to have mke2fs validate whether discard+zero actually works (i.e. write some small non-zero data, discard the whole device, then read the previously written blocks back and check that they are zero).  Granted, it wouldn't handle this "it only fails after a power cycle" problem, but it should detect gratuitously broken hardware.

That should be sufficiently safe until such time as inode checksums are available.  Note also that non-zero inode table blocks are only a major problem if some additional corruption causes the itable_uninit numbers to become invalid (e.g. bad group checksum), in which case the old itable blocks will be used.

We also discussed doing this at least for sparse files, which makes a lot of sense for testing in any case, even if the default is not to do this for SSD devices until they smarten up.
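
The validation idea above (write non-zero data, discard, read back) might look roughly like the following. This is a sketch under assumptions, not code from the patch: `probe_discard_zeroes` and `buffer_is_zero` are hypothetical helpers, the probe discards only the region it wrote rather than the whole device, and it uses buffered I/O, so a real check would probably need O_DIRECT with aligned buffers to be sure the read reaches the device. As noted above, it cannot detect a device that only misbehaves after a power cycle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>
#include <linux/fs.h>	/* BLKDISCARD */

/* Return 1 if every byte of buf is zero, 0 otherwise. */
static int buffer_is_zero(const unsigned char *buf, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/*
 * Hypothetical probe.  DESTROYS the data in [offset, offset + len).
 * Returns 1 if the region reads back as zeros after the discard,
 * 0 if stale or non-zero data came back, and -1 on any I/O error
 * (including devices that do not support BLKDISCARD at all).
 */
static int probe_discard_zeroes(int fd, uint64_t offset, size_t len)
{
	uint64_t range[2] = { offset, len };	/* start, length in bytes */
	unsigned char *buf;
	int result = -1;

	assert(len > 0);
	buf = malloc(len);
	if (!buf)
		return -1;

	memset(buf, 0xA5, len);			/* recognisable non-zero pattern */
	if (pwrite(fd, buf, len, (off_t)offset) != (ssize_t)len)
		goto out;
	if (fsync(fd) < 0)			/* push the pattern to the device */
		goto out;
	if (ioctl(fd, BLKDISCARD, &range) < 0)
		goto out;
	if (pread(fd, buf, len, (off_t)offset) != (ssize_t)len)
		goto out;
	result = buffer_is_zero(buf, len);
out:
	free(buf);
	return result;
}
```

A caller would treat only a return value of 1 as evidence that the device zeroes discarded blocks. Both 0 and -1 mean the inode tables should still be zeroed the usual way.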


Cheers, Andreas





Patch

diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index add7c0c..b7a9e12 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -88,7 +88,8 @@  int	force;
 int	noaction;
 int	journal_size;
 int	journal_flags;
-int	lazy_itable_init;
+int	lazy_itable_init;	/* use lazy inode table init */
+int	lazy_itable_zeroed;	/* inode table zeroed by discard */
 char	*bad_blocks_filename;
 __u32	fs_stride;
 
@@ -300,7 +301,7 @@  _("Warning: the backup superblock/group descriptors at block %u contain\n"
 	ext2fs_badblocks_list_iterate_end(bb_iter);
 }
 
-static void write_inode_tables(ext2_filsys fs, int lazy_flag)
+static void write_inode_tables(ext2_filsys fs, int lazy_flag, int lazy_zeroed)
 {
 	errcode_t	retval;
 	blk64_t		blk;
@@ -325,6 +326,9 @@  static void write_inode_tables(ext2_filsys fs, int lazy_flag)
 				 EXT2_INODE_SIZE(fs->super)) +
 				EXT2_BLOCK_SIZE(fs->super) - 1) /
 			       EXT2_BLOCK_SIZE(fs->super));
+			/* if pre-zeroed by discard, mark as such */
+			if (lazy_zeroed)
+				ext2fs_bg_flags_set(fs, i, EXT2_BG_INODE_ZEROED);
 		} else {
 			/* The kernel doesn't need to zero the itable blocks */
 			ext2fs_bg_flags_set(fs, i, EXT2_BG_INODE_ZEROED);
@@ -1901,7 +1905,11 @@  static int mke2fs_setup_tdb(const char *name, io_manager *io_ptr)
 #define BLKDISCARD	_IO(0x12,119)
 #endif
 
-static void mke2fs_discard_blocks(ext2_filsys fs)
+#ifndef BLKDISCARDZEROES
+#define BLKDISCARDZEROES _IO(0x12,124)
+#endif
+
+static int mke2fs_discard_blocks(ext2_filsys fs)
 {
 	int fd;
 	int ret;
@@ -1917,8 +1925,8 @@  static void mke2fs_discard_blocks(ext2_filsys fs)
 	fd = open64(fs->device_name, O_RDWR);
 
 	/*
-	 * We don't care about whether the ioctl succeeds; it's only an
-	 * optmization for SSDs or sparse storage.
+	 * We don't much care about whether the ioctl succeeds; it's only
+	 * an optimization for SSDs or thinly-provisioned storage.
 	 */
 	if (fd > 0) {
 		ret = ioctl(fd, BLKDISCARD, &range);
@@ -1933,9 +1941,26 @@  static void mke2fs_discard_blocks(ext2_filsys fs)
 		}
 		close(fd);
 	}
+	return ret;
+}
+
+static int mke2fs_discard_zeroes_data(ext2_filsys fs)
+{
+	int fd;
+	int ret;
+	int discard_zeroes_data = 0;
+
+	fd = open64(fs->device_name, O_RDWR);
+
+	if (fd >= 0) {
+		ioctl(fd, BLKDISCARDZEROES, &discard_zeroes_data);
+		close(fd);
+	}
+	return discard_zeroes_data;
 }
 #else
-#define mke2fs_discard_blocks(fs)
+#define mke2fs_discard_blocks(fs)	1
+#define mke2fs_discard_zeroes_data(fs)	0
 #endif
 
 int main (int argc, char *argv[])
@@ -1996,8 +2021,17 @@  int main (int argc, char *argv[])
 	}
 
 	/* Can't undo discard ... */
-	if (discard && (io_ptr != undo_io_manager))
-		mke2fs_discard_blocks(fs);
+	if (discard && (io_ptr != undo_io_manager)) {
+		retval = mke2fs_discard_blocks(fs);
+
+		if (!retval && mke2fs_discard_zeroes_data(fs)) {
+			if (verbose)
+				printf(_("Discard succeeded and will return 0s "
+					 "- enabling lazy_itable_init\n"));
+			lazy_itable_init = 1;
+			lazy_itable_zeroed = 1;
+		}
+	}
 
 	sprintf(tdb_string, "tdb_data_size=%d", fs->blocksize <= 4096 ?
 		32768 : fs->blocksize * 8);
@@ -2147,7 +2181,7 @@  int main (int argc, char *argv[])
 				_("while zeroing block %llu at end of filesystem"),
 				ret_blk);
 		}
-		write_inode_tables(fs, lazy_itable_init);
+		write_inode_tables(fs, lazy_itable_init, lazy_itable_zeroed);
 		create_root_dir(fs);
 		create_lost_and_found(fs);
 		reserve_inodes(fs);
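
For readers who want to try the capability query outside of mke2fs, a standalone variant of the BLKDISCARDZEROES check could be sketched as below. The function name `device_discard_zeroes_data` and its path-based interface are assumptions made for illustration. The fallback definition of the ioctl number is the one used in the patch. Unlike the patch, this version checks the ioctl's return value and treats any failure as "does not zero".

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fs.h>

#ifndef BLKDISCARDZEROES
#define BLKDISCARDZEROES _IO(0x12,124)	/* same fallback as in the patch */
#endif

/*
 * Hypothetical helper: returns 1 if the device at 'path' reports that
 * discarded blocks read back as zeros, 0 otherwise.  Open or ioctl
 * failures (for example when 'path' is not a block device) count as 0.
 */
static int device_discard_zeroes_data(const char *path)
{
	unsigned int zeroes = 0;
	int fd;

	assert(path != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return 0;
	if (ioctl(fd, BLKDISCARDZEROES, &zeroes) < 0)
		zeroes = 0;
	close(fd);
	return zeroes != 0;
}
```

A result of 1 only means the device makes the claim. As the comments above point out, some hardware makes it without honouring it across a power cycle.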