[4/4,v2] ext4: Do not discard group with BLOCK_UNINIT set

Message ID 1330690318-22627-4-git-send-email-lczerner@redhat.com
State Superseded, archived

Commit Message

Lukas Czerner March 2, 2012, 12:11 p.m. UTC
This commit is an optimization for the FITRIM implementation. If the group
has not been initialized yet (BLOCK_UNINIT flag set), we do not need to
discard it. This flag is set at mke2fs time to speed up subsequent file
system checks, because it tells us that there is nothing in the block
group.

Because BLOCK_UNINIT is only set at mke2fs time and cleared when an
allocation from that group takes place, we know that when it is set, nothing
has ever been allocated from that group, hence there should be nothing to
discard from the file system's point of view. Of course, there might be
situations where the underlying storage is provisioned even though
BLOCK_UNINIT is set. This might happen, for example, when the user disables
discard in mke2fs; however, I think this niche case is not reason enough to
give up this optimization.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
v2: nothing changed

 fs/ext4/mballoc.c |   10 +++++++++-
 1 files changed, 9 insertions(+), 1 deletions(-)

Comments

Jan Kara March 5, 2012, 12:41 p.m. UTC | #1
  This patch is correct but I'm undecided whether we really want to do this
optimization or not. It might be unexpected that we didn't trim a block group
which was completely free (from the user's POV). I don't consider FITRIM too
performance critical and I also don't think this will be such a massive
speedup...

								Honza
Lukas Czerner March 5, 2012, 1:12 p.m. UTC | #2
On Mon, 5 Mar 2012, Jan Kara wrote:

>   This patch is correct but I'm undecided whether we really want to do this
> optimization or not. It might be unexpected that we didn't trim a block group
> which was completely free (from the user's POV). I don't consider FITRIM too
> performance critical and I also don't think this will be such a massive
> speedup...
> 
> 								Honza

Hi Honzo,

On small SSDs it certainly does not bring any significant speedup. But
consider huge thin-provisioned storage and you'll immediately notice the
change, not only because of the storage size, but also because (in
my experience) those things are really slow with discard.

Moreover, ext4 would not be the only one not discarding the whole file
system on FITRIM. See btrfs, which does not map the whole storage to the
file system at creation time, but rather allocates smaller chunks as the
demand for space grows. It also means that it will not discard the
whole file system, but only the mapped chunks.

And lastly, FITRIM is supposed to be a way to notify the underlying
storage about space which is no longer used. In conjunction with the
full device discard on mke2fs (which is the default), we can skip UNINIT
groups because, from the fs point of view, we are sure enough that
this space is not mapped. Note that the only cases where this is not
true are when someone overrides the default mke2fs behaviour, or moves
their file system with dd.

But I am certainly open to discussion about that.

Thanks!
-Lukas
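
For readers unfamiliar with how FITRIM is driven from user space, the following is a minimal sketch of a whole-file-system trim request, roughly what the fstrim utility does. The helper names whole_fs_range() and trim_mountpoint() are made up for illustration; FITRIM and struct fstrim_range come from <linux/fs.h>. The call normally needs CAP_SYS_ADMIN and a file system that supports discard.

```c
#include <errno.h>
#include <fcntl.h>
#include <linux/fs.h>   /* FITRIM, struct fstrim_range */
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: build a request that covers the whole file system.
 * start, len and minlen are all expressed in bytes. */
static struct fstrim_range whole_fs_range(uint64_t minlen_bytes)
{
	struct fstrim_range r;

	memset(&r, 0, sizeof(r));
	r.start = 0;
	r.len = UINT64_MAX;        /* "up to the end of the file system" */
	r.minlen = minlen_bytes;   /* ignore free extents smaller than this */
	return r;
}

/* Hypothetical helper: issue FITRIM on the file system mounted at path.
 * Returns 0 and stores the number of bytes the kernel reports as trimmed,
 * or -1 with errno set. */
static int trim_mountpoint(const char *path, uint64_t minlen_bytes,
			   uint64_t *trimmed)
{
	struct fstrim_range r = whole_fs_range(minlen_bytes);
	int fd, ret, saved;

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	ret = ioctl(fd, FITRIM, &r);
	saved = errno;             /* close() may overwrite errno */
	close(fd);
	errno = saved;
	if (ret < 0)
		return -1;

	*trimmed = r.len;          /* the kernel writes the trimmed byte count back */
	return 0;
}
```

With the patch under discussion applied, groups that still carry BLOCK_UNINIT would not be passed to ext4_trim_all_free(), so they would presumably not contribute to the byte count returned here.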

Theodore Ts'o March 6, 2012, 10:18 p.m. UTC | #3
On Fri, Mar 02, 2012 at 01:11:58PM +0100, Lukas Czerner wrote:
> Because BLOCK_UNINIT is only set at mke2fs time and cleared when an
> allocation from that group takes place, we know that when it is set, nothing
> has ever been allocated from that group, hence there should be nothing to
> discard from the file system's point of view.

There's a really good reason to set BLOCK_UNINIT once we have noticed
that all of the blocks in the block group have been released....

If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
the block groups have BLOCK_UNINIT cleared, compared to a freshly
mkfs'ed file system.  As a result of my getting really annoyed at how
long it took in this case, I'm planning on making e2fsck set
BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
and debugfs invocations) can also be fast.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner March 7, 2012, 7:10 a.m. UTC | #4
On Tue, 6 Mar 2012, Ted Ts'o wrote:

> There's a really good reason to set BLOCK_UNINIT once we have noticed
> that all of the blocks in the block group have been released....
> 
> If you have a 3TB HDD, running e2fsck takes 4 times as long if all of
> the block groups have BLOCK_UNINIT cleared, compared to a freshly
> mkfs'ed file system.  As a result of my getting really annoyed at how
> long it took in this case, I'm planning on making e2fsck set
> BLOCK_UNINIT if possible, so that subsequent e2fsck's (and dumpe2fs
> and debugfs invocations) can also be fast.
> 
> 					- Ted

Ok, if there is a plan to implement that, I am fine with dropping the
patches. But since this optimization would be helpful for discard, maybe we
can introduce a BLOCK_DISCARDED/UNPROVISIONED flag? Which would be
set only after discard and cleared with the first allocation?

Thanks!
-Lukas
Theodore Ts'o March 7, 2012, 5:22 p.m. UTC | #5
On Wed, Mar 07, 2012 at 08:10:00AM +0100, Lukas Czerner wrote:
> 
> Ok, if there is a plan to implement that, I am fine with dropping the
> patches. But since this optimization would be helpful for discard, maybe we
> can introduce a BLOCK_DISCARDED/UNPROVISIONED flag? Which would be
> set only after discard and cleared with the first allocation?

Heh, great minds think alike.  I was thinking about the possibility
of having a BLOCK_DISCARDED flag this morning, with exactly the
semantics that you are suggesting.  There may be devices in the future
that have fast trims which will prefer to always get the duplicate
trim requests, but there is no question there are a lot of crappy
devices out there right now where trims are extremely expensive, so
that seems fair.

Something else to think about for the future: for battery driven
devices (such as handsets), automatically sending trim commands might
not be a good idea when the device is sleeping or has the screen turned
off.  We don't have an easy way of determining whether or not the
device is in a low-power state.  Ideally we would only send the trims
right after work has been queued to the device, so it's woken up
already, but when there isn't anything else that needs to be sent to
the device.

					- Ted
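
The BLOCK_DISCARDED idea discussed above (set only after a discard, cleared by the first allocation) can be modelled in a few lines of user-space C. This is only a sketch of the proposed semantics, not kernel code: the flag value, the struct and all function names are invented here. It also assumes the flag is recorded only when the discarded group was entirely free, which the thread does not specify.

```c
#include <stdbool.h>
#include <stdint.h>

/* Invented bit for illustration; no such ext4 flag is defined in the thread. */
#define BG_DISCARDED_HYPOTHETICAL 0x8000u

/* Toy model of one block group. */
struct group_state {
	uint16_t flags;
	uint32_t total_clusters;
	uint32_t free_clusters;
};

/* The first allocation after a discard clears the flag again. */
static void on_allocate(struct group_state *g, uint32_t n)
{
	g->flags &= (uint16_t)~BG_DISCARDED_HYPOTHETICAL;
	g->free_clusters -= n;
}

/* Freeing space never sets the flag; only a discard does. */
static void on_free(struct group_state *g, uint32_t n)
{
	g->free_clusters += n;
}

/* Returns true if a discard would be issued for this group. */
static bool trim_group(struct group_state *g, uint32_t minlen)
{
	if (g->flags & BG_DISCARDED_HYPOTHETICAL)
		return false;   /* already discarded, skip the duplicate trim */
	if (g->free_clusters < minlen)
		return false;   /* not enough free space to be worth trimming */

	/* ... the real discard of the free extents would be issued here ... */

	/* Assumption: remember the discard only if the whole group is free. */
	if (g->free_clusters == g->total_clusters)
		g->flags |= BG_DISCARDED_HYPOTHETICAL;
	return true;
}
```

In this model a second trim of an untouched, fully free group is skipped, while a group that has seen an allocation since its last discard is trimmed again even if everything in it was freed afterwards.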
Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 8f817f2..9ea1065a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -5033,6 +5033,7 @@  int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 	ext4_fsblk_t first_data_blk =
 			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
 	ext4_fsblk_t max_blks = ext4_blocks_count(EXT4_SB(sb)->s_es);
+	struct ext4_group_desc *desc;
 	int ret = 0;
 
 	start = range->start >> sb->s_blocksize_bits;
@@ -5076,7 +5077,14 @@  int ext4_trim_fs(struct super_block *sb, struct fstrim_range *range)
 		if (group == last_group)
 			end = last_cluster;
 
-		if (grp->bb_free >= minlen) {
+		desc = ext4_get_group_desc(sb, group, NULL);
+		if (!desc) {
+			ret = -EIO;
+			break;
+		}
+
+		if (!(desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) &&
+		    (grp->bb_free >= minlen)) {
 			cnt = ext4_trim_all_free(sb, group, first_cluster,
 						end, minlen);
 			if (cnt < 0) {
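
The condition added by the patch can be tried out in user space with a small model. The sketch below is not the kernel code: struct group_model, to_le16(), should_trim() and count_trimmed() are illustrative names, and to_le16() stands in for the kernel's cpu_to_le16(). The flag value 0x0002 matches EXT4_BG_BLOCK_UNINIT. As in the patch, the constant is converted to little-endian rather than the on-disk field being converted to CPU order.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BG_BLOCK_UNINIT 0x0002u   /* same value as EXT4_BG_BLOCK_UNINIT */

/* Simplified stand-in for the on-disk descriptor plus in-memory group info. */
struct group_model {
	uint16_t bg_flags_le;     /* flags as stored on disk (little-endian) */
	uint32_t bb_free;         /* free clusters in the group */
};

/* Portable stand-in for cpu_to_le16(): lay the value out as little-endian. */
static uint16_t to_le16(uint16_t v)
{
	uint8_t bytes[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
	uint16_t out;

	memcpy(&out, bytes, sizeof(out));
	return out;
}

/* Mirrors the patched test: skip uninitialized groups and groups with
 * fewer than minlen free clusters. */
static bool should_trim(const struct group_model *g, uint32_t minlen)
{
	if (g->bg_flags_le & to_le16(BG_BLOCK_UNINIT))
		return false;
	return g->bb_free >= minlen;
}

/* Counts the groups a FITRIM pass would hand to the trimming routine. */
static unsigned int count_trimmed(const struct group_model *groups,
				  unsigned int ngroups, uint32_t minlen)
{
	unsigned int visited = 0;

	for (unsigned int i = 0; i < ngroups; i++)
		if (should_trim(&groups[i], minlen))
			visited++;
	return visited;
}
```

On a freshly created file system most groups would still carry BLOCK_UNINIT, so in this model count_trimmed() stays small no matter how large the device is, which is the saving the commit message argues for.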