Message ID | 1330690318-22627-4-git-send-email-lczerner@redhat.com |
---|---|
State | Superseded, archived |
On Fri 02-03-12 13:11:58, Lukas Czerner wrote:
> This commit is an optimization for FITRIM implementation. If the group
> has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
> discard such group. This flag is set on mke2fs time to speed up
> subsequent file system checks, because it says to us that there is
> nothing there in the block group.
>
> Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> allocation from that group takes place we know that when set, there was
> not anything allocated from that group, hence there should not be anything
> to discard from the file system point of view. Of course there might be
> situations where even if BLOCK_UNINIT is set the underlying storage is
> provisioned. This might happen for example when the user disables discard
> on mke2fs, however I think that this niche is not enough to not to take
> advantage of this optimization.
  This patch is correct but I'm undecided whether we really want to do this
optimization or not. It might be unexpected that we didn't trim a block group
which was completely free (from the user's POV). I don't consider FITRIM too
performance critical and I also don't think this will be such a massive
speedup...

								Honza
>
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
> v2: nothing changed
>
>  fs/ext4/mballoc.c |   10 +++++++++-
>  1 files changed, 9 insertions(+), 1 deletions(-)
>
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 8f817f2..9ea1065a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -5033,6 +5033,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  	ext4_fsblk_t first_data_blk =
>  			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
>  	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
> +	struct ext4_group_desc *desc;
>  	int ret = 0;
>  
>  	start = range->start >> sb->s_blocksize_bits;
> @@ -5076,7 +5077,14 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
>  		if (group == last_group)
>  			end = last_cluster;
>  
> -		if (grp->bb_free >= minlen) {
> +		desc = ext4_get_group_desc(sb, group, NULL);
> +		if (!desc) {
> +			ret = -EIO;
> +			break;
> +		}
> +
> +		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
> +		    (grp->bb_free >= minlen)) {
>  			cnt = ext4_trim_all_free(sb, group, first_cluster,
>  						end, minlen);
>  			if (cnt < 0) {
> --
> 1.7.4.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, 5 Mar 2012, Jan Kara wrote:

> On Fri 02-03-12 13:11:58, Lukas Czerner wrote:
> > This commit is an optimization for FITRIM implementation. If the group
> > has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
> > discard such group. This flag is set on mke2fs time to speed up
> > subsequent file system checks, because it says to us that there is
> > nothing there in the block group.
> >
> > Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> > allocation from that group takes place we know that when set, there was
> > not anything allocated from that group, hence there should not be anything
> > to discard from the file system point of view. Of course there might be
> > situations where even if BLOCK_UNINIT is set the underlying storage is
> > provisioned. This might happen for example when the user disables discard
> > on mke2fs, however I think that this niche is not enough to not to take
> > advantage of this optimization.
>   This patch is correct but I'm undecided whether we really want to do this
> optimization or not. It might be unexpected that we didn't trim a block group
> which was completely free (from the user's POV). I don't consider FITRIM too
> performance critical and I also don't think this will be such a massive
> speedup...
>
> 								Honza

Hi Honzo,

on small SSDs it certainly does not bring any significant speedup. But
consider huge thin-provisioned storage and you'll immediately notice the
change, not only because of the storage size, but also because (from my
experience) those things are really slow with discard.

Moreover, ext4 would not be the only one not discarding the whole file
system on FITRIM. See btrfs, which does not map the whole storage to the
file system at creation time, but rather allocates smaller chunks as the
demand for space grows. It also means that they will not discard the whole
file system, but only mapped chunks.

And lastly, FITRIM is supposed to be a way to notify the underlying
storage about the space which is no longer used. In conjunction with
full device discard on mke2fs (which is the default), we can skip UNINIT
groups just because from the fs point of view we are sure enough that this
space is not mapped. Note that the only case where this is not true is
if someone overrides the default mke2fs behaviour, or moves their file
system with dd.

But I am certainly open to discussion about that.

Thanks!
-Lukas
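For readers following the thread: the request under discussion reaches the kernel as a FITRIM ioctl carrying a struct fstrim_range expressed in bytes, which ext4_trim_fs() then shifts by s_blocksize_bits as in the hunk quoted above. The following is only a minimal userspace sketch, assuming Linux and its <linux/fs.h> header. The helper names whole_fs_range and trim_mount are made up for illustration and do not describe how any particular tool is written.

```c
#include <fcntl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: describe "the whole file system" as a trim range.
 * All three fields are in bytes; minlen is the smallest free extent worth
 * discarding. */
struct fstrim_range whole_fs_range(uint64_t minlen_bytes)
{
	struct fstrim_range r = {
		.start  = 0,
		.len    = UINT64_MAX,
		.minlen = minlen_bytes,
	};
	return r;
}

/* Hypothetical helper: issue FITRIM on a mounted file system.
 * Returns 0 on success and -1 on failure (errno set by open/ioctl).
 * On success the kernel writes the number of bytes it discarded back
 * into range.len, which is copied to *trimmed when that is non-NULL. */
int trim_mount(const char *mountpoint, uint64_t minlen_bytes,
	       uint64_t *trimmed)
{
	struct fstrim_range r = whole_fs_range(minlen_bytes);
	int fd = open(mountpoint, O_RDONLY);

	if (fd < 0)
		return -1;

	int ret = ioctl(fd, FITRIM, &r);
	if (ret == 0 && trimmed)
		*trimmed = r.len;

	close(fd);
	return ret;
}
```

The call normally needs CAP_SYS_ADMIN, and with the patch applied the byte count reported back would no longer include groups that still carry BLOCK_UNINIT.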
On Fri, Mar 02, 2012 at 01:11:58PM +0100, Lukas Czerner wrote:
> Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> allocation from that group takes place we know that when set, there was
> not anything allocated from that group, hence there should not be anything
> to discard from the file system point of view.

There's a really good reason to set BLOCK_UNINIT once we have noticed
that all of the blocks in the block group have been released....

If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
the block groups have BLOCK_UNINIT cleared, compared to a freshly
mkfs'ed file system. As a result of my getting really annoyed at how
long it took in this case, I'm planning on making e2fsck set
BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
and debugfs invocations) can also be fast.

					- Ted
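The objection above can be made concrete with a toy model. This is only a sketch, not kernel or e2fsprogs code: struct toy_group and every toy_* function are hypothetical, and the only value taken from ext4 is the bit used for EXT4_BG_BLOCK_UNINIT. It shows that once something other than mke2fs may set the flag on an emptied group, "flag set" no longer implies "nothing to discard".

```c
#include <stdbool.h>
#include <stdint.h>

#define BG_BLOCK_UNINIT 0x0002	/* same bit as EXT4_BG_BLOCK_UNINIT */

/* Hypothetical toy model of one block group; not a kernel structure. */
struct toy_group {
	uint16_t flags;
	uint32_t free_blocks;
	uint32_t total_blocks;
	bool     stale_on_device; /* device still backs blocks the fs freed */
};

/* Fresh file system: mke2fs discarded the device and set the flag. */
void toy_init(struct toy_group *g, uint32_t total)
{
	g->flags = BG_BLOCK_UNINIT;
	g->free_blocks = total;
	g->total_blocks = total;
	g->stale_on_device = false;
}

/* First allocation clears the flag, as the commit message describes. */
void toy_alloc_all(struct toy_group *g)
{
	g->flags &= (uint16_t)~BG_BLOCK_UNINIT;
	g->free_blocks = 0;
}

/* Freeing leaves written blocks behind on the device until discarded. */
void toy_free_all(struct toy_group *g)
{
	g->free_blocks = g->total_blocks;
	g->stale_on_device = true;
}

/* Assumed future behaviour: a checker re-marks an empty group. */
void toy_fsck_set_uninit(struct toy_group *g)
{
	if (g->free_blocks == g->total_blocks)
		g->flags |= BG_BLOCK_UNINIT;
}

/* The patched condition: skip flagged groups, otherwise discard. */
bool toy_trim_patched(struct toy_group *g, uint32_t minlen)
{
	if ((g->flags & BG_BLOCK_UNINIT) || g->free_blocks < minlen)
		return false;
	g->stale_on_device = false;
	return true;
}
```

Running init, alloc, free, fsck and then the patched trim leaves stale_on_device true, which is the case the patch's reasoning rules out only as long as mke2fs is the sole writer of the flag.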
On Tue, 6 Mar 2012, Ted Ts'o wrote:

> On Fri, Mar 02, 2012 at 01:11:58PM +0100, Lukas Czerner wrote:
> > Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
> > allocation from that group takes place we know that when set, there was
> > not anything allocated from that group, hence there should not be anything
> > to discard from the file system point of view.
>
> There's a really good reason to set BLOCK_UNINIT once we have noticed
> that all of the blocks in the block group have been released....
>
> If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
> the block groups have BLOCK_UNINIT cleared, compared to a freshly
> mkfs'ed file system. As a result of my getting really annoyed at how
> long it took in this case, I'm planning on making e2fsck set
> BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
> and debugfs invocations) can also be fast.
>
> 					- Ted

Ok, if there is a plan to implement that, I am fine with dropping the
patches. But since this optimization would be helpful for discard, maybe
we can introduce a BLOCK_DISCARDED/UNPROVISIONED flag? It would be
set only after discard and cleared with the first allocation.

Thanks!
-Lukas
On Wed, Mar 07, 2012 at 08:10:00AM +0100, Lukas Czerner wrote:
>
> Ok, if there is a plan to implement that, I am fine with dropping the
> patches. But since this optimization would be helpful for discard, maybe
> we can introduce a BLOCK_DISCARDED/UNPROVISIONED flag? It would be
> set only after discard and cleared with the first allocation.

Heh, great minds think alike. I was thinking about the possibility of
having a BLOCK_DISCARDED flag this morning, with exactly the semantics
that you are suggesting.

There may be devices in the future that have fast trims which will
prefer to always get the duplicate trim requests, but there is no
question there are a lot of crappy devices out there right now where
trims are extremely expensive, so that seems fair.

Something else to think about for the future: for battery-driven
devices (such as handsets), automatically sending trim commands might
not be a good idea when the device is sleeping or has the screen turned
off. We don't have an easy way of determining whether or not the
device is in a low-powered state (ideally we would only send the trims
right after work has been queued to the device, so it's woken up
already, but when there isn't anything else that needs to be sent to
the device).

					- Ted
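The flag semantics agreed on above ("set only after discard and cleared with the first allocation") can be sketched as a small state machine. Everything here is hypothetical: no BLOCK_DISCARDED flag is defined by this thread's patch, the bit value is invented for the sketch, and the choice to set the flag only when the whole group was free at trim time is an assumption added so that blocks freed later without an intervening allocation are not missed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit, invented for this sketch only. */
#define BG_BLOCK_DISCARDED 0x8000

/* Hypothetical toy model of one block group; not a kernel structure. */
struct disc_group {
	uint16_t flags;
	uint32_t free_blocks;
	uint32_t total_blocks;
	unsigned discards_sent;	/* how many discards reached the device */
};

/* Any allocation invalidates the "already discarded" state. */
void disc_alloc(struct disc_group *g, uint32_t n)
{
	assert(n <= g->free_blocks);
	g->free_blocks -= n;
	g->flags &= (uint16_t)~BG_BLOCK_DISCARDED;
}

/* Freeing does not touch the flag: it was cleared by the allocation. */
void disc_free(struct disc_group *g, uint32_t n)
{
	assert(n <= g->total_blocks - g->free_blocks);
	g->free_blocks += n;
}

/* Trim one group; returns true if a discard was actually issued. */
bool disc_trim(struct disc_group *g, uint32_t minlen)
{
	if (g->flags & BG_BLOCK_DISCARDED)
		return false;		/* duplicate trim suppressed */
	if (g->free_blocks < minlen)
		return false;

	g->discards_sent++;
	/* Assumption: remember the discard only for a fully free group. */
	if (g->free_blocks == g->total_blocks)
		g->flags |= BG_BLOCK_DISCARDED;
	return true;
}
```

Unlike reusing BLOCK_UNINIT, this flag is owned by the discard path alone, so a checker setting BLOCK_UNINIT on empty groups would not affect it. A device that prefers duplicate trims, as mentioned above, would simply ignore the flag.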
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8f817f2..9ea1065a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5033,6 +5033,7 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 	ext4_fsblk_t first_data_blk =
 			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
 	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
+	struct ext4_group_desc *desc;
 	int ret = 0;
 
 	start = range->start >> sb->s_blocksize_bits;
@@ -5076,7 +5077,14 @@ int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 		if (group == last_group)
 			end = last_cluster;
 
-		if (grp->bb_free >= minlen) {
+		desc = ext4_get_group_desc(sb, group, NULL);
+		if (!desc) {
+			ret = -EIO;
+			break;
+		}
+
+		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
+		    (grp->bb_free >= minlen)) {
 			cnt = ext4_trim_all_free(sb, group, first_cluster,
 						end, minlen);
 			if (cnt < 0) {
This commit is an optimization for FITRIM implementation. If the group
has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
discard such group. This flag is set on mke2fs time to speed up
subsequent file system checks, because it says to us that there is
nothing there in the block group.

Because the BLOCK_UNINIT is only set on mke2fs time and cleared when
allocation from that group takes place we know that when set, there was
not anything allocated from that group, hence there should not be anything
to discard from the file system point of view. Of course there might be
situations where even if BLOCK_UNINIT is set the underlying storage is
provisioned. This might happen for example when the user disables discard
on mke2fs, however I think that this niche is not enough to not to take
advantage of this optimization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: nothing changed

 fs/ext4/mballoc.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)
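The per-group decision the patch changes can be isolated from the rest of ext4_trim_fs() as a pure predicate. This is a simplified sketch, not the kernel code: struct group_model and both function names are hypothetical, the flags are assumed to be already converted to CPU byte order (the real code compares against cpu_to_le16()), and only the flag value is taken from ext4.

```c
#include <stdbool.h>
#include <stdint.h>

#define BG_BLOCK_UNINIT 0x0002	/* same bit as EXT4_BG_BLOCK_UNINIT */

/* Hypothetical stand-in for the two pieces of per-group state read here. */
struct group_model {
	uint16_t bg_flags;	/* descriptor flags, CPU byte order */
	uint32_t bb_free;	/* free clusters in the group */
};

/* Condition before the patch: trim whenever enough space is free. */
bool trim_before_patch(const struct group_model *g, uint32_t minlen)
{
	return g->bb_free >= minlen;
}

/* Condition after the patch: also skip groups never allocated from. */
bool trim_after_patch(const struct group_model *g, uint32_t minlen)
{
	return !(g->bg_flags & BG_BLOCK_UNINIT) && g->bb_free >= minlen;
}
```

A freshly created group (flag set, fully free) is the only input on which the two predicates differ, which is exactly the case the commit message argues is safe to skip.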