Message ID | 4C6EF67A.5080502@redhat.com |
---|---|
State | Superseded, archived |
On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:

> If a device supports discard -and- returns 0s for discarded blocks,
> then we can skip the inode table initialization -and- the inode table
> zeroing at mkfs time, and skip the lazy init as well since they are
> already zeroed out.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>

This needs to be configurable in /etc/mke2fs.conf. Without naming the manufacturer, I'm aware of at least one device which claims that discard works, and will even return zeros --- but after a power cycle, if the block has not been reallocated, will once again return the old, pre-discard values that had been stored in that block.

In other words, the discard is not power-cycle persistent...

-- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Theodore Tso wrote:

> On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:
>
>> If a device supports discard -and- returns 0s for discarded blocks,
>> then we can skip the inode table initialization -and- the inode table
>> zeroing at mkfs time, and skip the lazy init as well since they are
>> already zeroed out.
>>
>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>
> This needs to be configurable in /etc/mke2fs.conf. Without naming
> the manufacturer, I'm aware of at least one device which claims that
> discard works, and will even return zeros --- but after a power
> cycle, if the block has not been reallocated, will once again return
> the old, pre-discard values that had been stored in that block.
>
> In other words, the discard is not power-cycle persistent...
>
> -- Ted

yes, I've seen issues like that too.

TBH in that case I'd rather just drop the patch than make another tunable for the user to figure out...

Making tunables for every permutation of broken hardware doesn't scale IMHO. Users will get it wrong often as not, if they even know it's there (they'll find out it's there via some web forum or other, and it'll become a meme like "set this to go faster" rather than understanding all the implications.)

-Eric

On 2010-08-23, at 08:32, Eric Sandeen wrote:

> Theodore Tso wrote:
>> On Aug 20, 2010, at 5:41 PM, Eric Sandeen wrote:
>>
>>> If a device supports discard -and- returns 0s for discarded blocks,
>>> then we can skip the inode table initialization -and- the inode table
>>> zeroing at mkfs time, and skip the lazy init as well since they are
>>> already zeroed out.
>>>
>>> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
>>
>> This needs to be configurable in /etc/mke2fs.conf. Without naming
>> the manufacturer, I'm aware of at least one device which claims that
>> discard works, and will even return zeros --- but after a power
>> cycle, if the block has not been reallocated, will once again return
>> the old, pre-discard values that had been stored in that block.
>>
>> In other words, the discard is not power-cycle persistent...
>
> yes, I've seen issues like that too.
>
> TBH in that case I'd rather just drop the patch than make another
> tunable for the user to figure out...

What else we discussed is to have mke2fs validate whether the discard+zero works (i.e. write some small non-zero data, discard the whole device, read back previously written data). Granted it wouldn't handle this "it only fails after a power-cycle" problem, but it should detect gratuitously broken hardware. That should be sufficiently safe until such a time that inode checksums are available.

Note also that non-zero inode table blocks are only a major problem if some additional corruption causes the itable_uninit numbers to become invalid (e.g. bad group checksum), in which case the old itable blocks will be used.

We also discussed doing this at least for sparse files, which makes a lot of sense for testing in any case, even if the default is not to do this for SSD devices until they smarten up.

Cheers, Andreas

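The validation Andreas describes (write a small amount of non-zero data, discard the whole device, read the data back) could be sketched roughly as follows. This is only an illustration and not code from the patch or from e2fsprogs: `discard_zeroes_verified`, `discard_fn`, `discard_by_truncate`, `discard_noop` and `PROBE_SIZE` are hypothetical names. The discard step is passed in as a callback; on a real block device it would presumably wrap the BLKDISCARD ioctl the way `mke2fs_discard_blocks()` does, while the two stand-in callbacks here work on a regular file so the logic can be tried without hardware. As noted in the thread, a check like this cannot catch a device that only returns stale data after a power cycle, and a real implementation would likely also need to make sure the read-back is served by the device rather than by a cache.

```c
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define PROBE_SIZE 4096	/* hypothetical probe length, one 4k block */

/* Hypothetical hook: discard [0, len) on fd, return 0 on success. */
typedef int (*discard_fn)(int fd, off_t len);

/*
 * Returns 1 if the probe block reads back as zeros after the discard,
 * 0 if the old data is still visible, -1 on I/O or discard error.
 */
static int discard_zeroes_verified(int fd, off_t len, discard_fn discard)
{
	unsigned char buf[PROBE_SIZE];
	size_t i;

	/* 1. Write a small non-zero pattern at the start of the device. */
	memset(buf, 0xA5, sizeof(buf));
	if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		return -1;
	if (fsync(fd) != 0)
		return -1;

	/* 2. Discard the whole device. */
	if (discard(fd, len) != 0)
		return -1;

	/* 3. Read the probe block back; any non-zero byte is a failure. */
	if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
		return -1;
	for (i = 0; i < sizeof(buf); i++)
		if (buf[i] != 0)
			return 0;
	return 1;
}

/* Stand-in for a working discard on a regular (sparse) file. */
static int discard_by_truncate(int fd, off_t len)
{
	if (ftruncate(fd, 0) != 0)
		return -1;
	return ftruncate(fd, len);
}

/* Stand-in for a device that reports success but keeps the old data. */
static int discard_noop(int fd, off_t len)
{
	(void)fd;
	(void)len;
	return 0;
}
```

With a check of this kind, mke2fs would set EXT2_BG_INODE_ZEROED on the strength of the discard only when the function returns 1, and would otherwise fall back to its existing inode table handling.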
diff --git a/misc/mke2fs.c b/misc/mke2fs.c
index add7c0c..b7a9e12 100644
--- a/misc/mke2fs.c
+++ b/misc/mke2fs.c
@@ -88,7 +88,8 @@ int force;
 int	noaction;
 int	journal_size;
 int	journal_flags;
-int	lazy_itable_init;
+int	lazy_itable_init;	/* use lazy inode table init */
+int	lazy_itable_zeroed;	/* inode table zeroed by discard */
 char	*bad_blocks_filename;
 __u32	fs_stride;
 
@@ -300,7 +301,7 @@ _("Warning: the backup superblock/group descriptors at block %u contain\n"
 	ext2fs_badblocks_list_iterate_end(bb_iter);
 }
 
-static void write_inode_tables(ext2_filsys fs, int lazy_flag)
+static void write_inode_tables(ext2_filsys fs, int lazy_flag, int lazy_zeroed)
 {
 	errcode_t	retval;
 	blk64_t		blk;
@@ -325,6 +326,9 @@ static void write_inode_tables(ext2_filsys fs, int lazy_flag)
 					 EXT2_INODE_SIZE(fs->super)) +
 					EXT2_BLOCK_SIZE(fs->super) - 1) /
 				       EXT2_BLOCK_SIZE(fs->super));
+			/* if pre-zeroed by discard, mark as such */
+			if (lazy_zeroed)
+				ext2fs_bg_flags_set(fs, i, EXT2_BG_INODE_ZEROED);
 		} else {
 			/* The kernel doesn't need to zero the itable blocks */
 			ext2fs_bg_flags_set(fs, i, EXT2_BG_INODE_ZEROED);
@@ -1901,7 +1905,11 @@ static int mke2fs_setup_tdb(const char *name, io_manager *io_ptr)
 #define BLKDISCARD	_IO(0x12,119)
 #endif
 
-static void mke2fs_discard_blocks(ext2_filsys fs)
+#ifndef BLKDISCARDZEROES
+#define BLKDISCARDZEROES	_IO(0x12,124)
+#endif
+
+static int mke2fs_discard_blocks(ext2_filsys fs)
 {
 	int fd;
 	int ret;
@@ -1917,8 +1925,8 @@ static void mke2fs_discard_blocks(ext2_filsys fs)
 	fd = open64(fs->device_name, O_RDWR);
 
 	/*
-	 * We don't care about whether the ioctl succeeds; it's only an
-	 * optmization for SSDs or sparse storage.
+	 * We don't much care about whether the ioctl succeeds; it's only
+	 * an optmization for SSDs or thinly-provisioned storage.
 	 */
 	if (fd > 0) {
 		ret = ioctl(fd, BLKDISCARD, &range);
@@ -1933,9 +1941,26 @@ static void mke2fs_discard_blocks(ext2_filsys fs)
 		}
 		close(fd);
 	}
+	return ret;
+}
+
+static int mke2fs_discard_zeroes_data(ext2_filsys fs)
+{
+	int fd;
+	int ret;
+	int discard_zeroes_data = 0;
+
+	fd = open64(fs->device_name, O_RDWR);
+
+	if (fd > 0) {
+		ioctl(fd, BLKDISCARDZEROES, &discard_zeroes_data);
+		close(fd);
+	}
+	return discard_zeroes_data;
 }
 #else
-#define mke2fs_discard_blocks(fs)
+#define mke2fs_discard_blocks(fs)	1
+#define mke2fs_discard_zeroes_data(fs)	0
 #endif
 
 int main (int argc, char *argv[])
@@ -1996,8 +2021,17 @@ int main (int argc, char *argv[])
 	}
 
 	/* Can't undo discard ... */
-	if (discard && (io_ptr != undo_io_manager))
-		mke2fs_discard_blocks(fs);
+	if (discard && (io_ptr != undo_io_manager)) {
+		retval = mke2fs_discard_blocks(fs);
+
+		if (!retval && mke2fs_discard_zeroes_data(fs)) {
+			if (verbose)
+				printf(_("Discard succeeded and will return 0s "
+					 " - enabling lazy_itable_init\n"));
+			lazy_itable_init = 1;
+			lazy_itable_zeroed = 1;
+		}
+	}
 
 	sprintf(tdb_string, "tdb_data_size=%d", fs->blocksize <= 4096 ?
 		32768 : fs->blocksize * 8);
@@ -2147,7 +2181,7 @@ int main (int argc, char *argv[])
 				  _("while zeroing block %llu at end of filesystem"),
 				  ret_blk);
 	}
-	write_inode_tables(fs, lazy_itable_init);
+	write_inode_tables(fs, lazy_itable_init, lazy_itable_zeroed);
 	create_root_dir(fs);
 	create_lost_and_found(fs);
 	reserve_inodes(fs);

If a device supports discard -and- returns 0s for discarded blocks,
then we can skip the inode table initialization -and- the inode table
zeroing at mkfs time, and skip the lazy init as well since they are
already zeroed out.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---