Patchwork "data=writeback" and TRIM don't get along

Submitter Eric Sandeen
Date April 8, 2010, 4:37 a.m.
Message ID <4BBD5D90.4090203@redhat.com>
Permalink /patch/49687/
State New

Comments

Eric Sandeen - April 8, 2010, 4:37 a.m.
Eric Sandeen wrote:
> I'll have to think about the right way to do this... it seems pretty
> convoluted to me right now.
>   
Something like this probably works, but I really REALLY would not test
it on an important filesystem.   :)

I'm not sure it's a good idea to discard it before returning it
to the prealloc pool, because it may well get re-used again
quickly.... not sure if that's helpful.

Just a note, I think eventually we may move to more of a batch discard
in the background, because these little discards are actually quite
inefficient on the hardware we've tested so far.
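A batched approach could, for example, coalesce adjacent freed extents and issue one large discard per merged run instead of one per freed range. A minimal userspace sketch of that merging step (names and structure hypothetical, not the ext4 implementation):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of the merging step a background batch discard
 * might use: adjacent or overlapping freed extents are coalesced so a
 * single large discard replaces many small ones. Not kernel code. */
struct extent {
	unsigned long long start;	/* first block of the freed run */
	unsigned long long count;	/* number of blocks in the run */
};

/* Coalesce a sorted extent array in place; returns the new length. */
static size_t coalesce_extents(struct extent *e, size_t n)
{
	size_t out = 0;
	size_t i;

	if (n == 0)
		return 0;
	for (i = 1; i < n; i++) {
		unsigned long long tail = e[out].start + e[out].count;

		if (e[i].start <= tail) {
			/* Runs touch or overlap: extend the current run. */
			unsigned long long end = e[i].start + e[i].count;

			if (end > tail)
				e[out].count = end - e[out].start;
		} else {
			/* Gap: start a new run. */
			e[++out] = e[i];
		}
	}
	return out + 1;
}
```

Each surviving extent would then be handed to the block layer once, rather than issuing a discard for every individual freed range.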

-Eric

p.s. really.  Don't test this with important data.  I haven't tested
it at all yet.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Sandeen - April 8, 2010, 4:47 a.m.
Eric Sandeen wrote:
> Eric Sandeen wrote:
>> I'll have to think about the right way to do this... it seems pretty
>> convoluted to me right now.
>>   
> Something like this probably works, but I really REALLY would not test
> it on an important filesystem.   :)
> 
> I'm not sure it's a good idea to discard it before returning it
> to the prealloc pool, because it may well get re-used again
> quickly.... not sure if that's helpful.

(now I'm really talking to myself, but scratch that bit - 
ext4_mb_return_to_preallocation is pretty much a no-op)

-Eric

> Just a note, I think eventually we may move to more of a batch discard
> in the background, because these little discards are actually quite
> inefficient on the hardware we've tested so far.
> 
> -Eric
> 
> p.s. really.  Don't test this with important data.  I haven't tested
> it at all yet.
> 
> Index: linux-2.6/fs/ext4/mballoc.c
> ===================================================================
> --- linux-2.6.orig/fs/ext4/mballoc.c
> +++ linux-2.6/fs/ext4/mballoc.c
> @@ -4602,6 +4606,8 @@ do_more:
>  		mb_clear_bits(bitmap_bh->b_data, bit, count);
>  		ext4_mb_free_metadata(handle, &e4b, new_entry);
>  	} else {
> +		ext4_fsblk_t discard_block;
> +
>  		/* need to update group_info->bb_free and bitmap
>  		 * with group lock held. generate_buddy look at
>  		 * them with group lock_held
> @@ -4609,6 +4615,11 @@ do_more:
>  		ext4_lock_group(sb, block_group);
>  		mb_clear_bits(bitmap_bh->b_data, bit, count);
>  		mb_free_blocks(inode, &e4b, bit, count);
> +		discard_block = bit +
> +				ext4_group_first_block_no(sb, block_group);
> +		trace_ext4_discard_blocks(sb,
> +				(unsigned long long)discard_block, count);
> +		sb_issue_discard(sb, discard_block, count);
>  		ext4_mb_return_to_preallocation(inode, &e4b, block, count);
>  	}
>  
> 

Nebojsa Trpkovic - April 8, 2010, 11:48 a.m.
On 04/08/10 06:47, Eric Sandeen wrote:
> Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> I'll have to think about the right way to do this... it seems pretty
>>> convoluted to me right now.
>>>   
>> Something like this probably works, but I really REALLY would not test
>> it on an important filesystem.   :)
>>
>> I'm not sure it's a good idea to discard it before returning it
>> to the prealloc pool, because it may well get re-used again
>> quickly.... not sure if that's helpful.
> 
> (now I'm really talking to myself, but scratch that bit - 
> ext4_mb_return_to_preallocation is pretty much a no-op)
> 
> -Eric
> 
>> Just a note, I think eventually we may move to more of a batch discard
>> in the background, because these little discards are actually quite
>> inefficient on the hardware we've tested so far.
>>
>> -Eric
>>
>> p.s. really.  Don't test this with important data.  I haven't tested
>> it at all yet.
>>
>> Index: linux-2.6/fs/ext4/mballoc.c
>> ===================================================================
>> --- linux-2.6.orig/fs/ext4/mballoc.c
>> +++ linux-2.6/fs/ext4/mballoc.c
>> @@ -4602,6 +4606,8 @@ do_more:
>>  		mb_clear_bits(bitmap_bh->b_data, bit, count);
>>  		ext4_mb_free_metadata(handle, &e4b, new_entry);
>>  	} else {
>> +		ext4_fsblk_t discard_block;
>> +
>>  		/* need to update group_info->bb_free and bitmap
>>  		 * with group lock held. generate_buddy look at
>>  		 * them with group lock_held
>> @@ -4609,6 +4615,11 @@ do_more:
>>  		ext4_lock_group(sb, block_group);
>>  		mb_clear_bits(bitmap_bh->b_data, bit, count);
>>  		mb_free_blocks(inode, &e4b, bit, count);
>> +		discard_block = bit +
>> +				ext4_group_first_block_no(sb, block_group);
>> +		trace_ext4_discard_blocks(sb,
>> +				(unsigned long long)discard_block, count);
>> +		sb_issue_discard(sb, discard_block, count);
>>  		ext4_mb_return_to_preallocation(inode, &e4b, block, count);
>>  	}

Well, to be honest, I'm not much of a programmer, so I doubt my skills
can be of any help here.

Second, unfortunately, my SSD is now my root partition (just one big
sda1), so I cannot experiment with it too much.

And finally,
>> I'm not sure it's a good idea to discard it before returning it
>> to the prealloc pool, because it may well get re-used again
>> quickly.... not sure if that's helpful.

I'm not sure I understood you well about this prealloc pool re-use
mechanism, but...
AFAIK, modern SSDs use very aggressive wear-leveling algorithms.
Writing twice to the same filesystem sector almost never goes to the
same hardware sector. Therefore, holding sectors back from discard for
re-use makes little sense - once re-used, the data will land in some
other physical NAND cell anyway. I guess we can discard a sector as
soon as it holds no valid data.

Nebojsa
Eric Sandeen - April 8, 2010, 3:32 p.m.
Nebojsa Trpkovic wrote:

> Well, to be honest, I'm not some programmer guy, so I doubt my skills
> can be of any help here.
> 
> Second, unfortunately, my SSD is now my root partition (just one big
> sda1), so I cannot experiment with it too much.

Well, you might just keep in mind that:

1) trimming these small amounts has actually looked very inefficient, and
2) data=writeback really isn't very safe in the face of a crash or power loss, and
3) hopefully we'll have a better trim solution eventually.

-Eric

Nebojsa Trpkovic - April 8, 2010, 4:21 p.m.
On 04/08/10 17:32, Eric Sandeen wrote:
> Well, you might just keep in mind that:
> 
> 1) trimming these small amounts has actually looked very inefficient, and
> 2) data=writeback really isn't very safe in the face of a crash or power loss, and
> 3) hopefully we'll have a better trim solution eventually.

1) I understand that big TRIMs are better than small ones, but skipping
some TRIMs completely would lead to slow but sure drive degradation, as
the drive would have less and less spare space for wear leveling.

2) Yes, I'm aware of the possible data=writeback inconsistency, but I've
tried to let the I/O scheduler merge and reorganize as many writes as it
can, all to avoid the small writes to the SSD that are the main cause of
write amplification.

3) I'll stick with not using data=writeback for the time being. I guess
I'm doing just fine even without it. :)


One more noob question, completely off-topic:
Will the md layer pass the TRIM command to the drive if one has ext4 on
Linux software RAID 0/1/5?

Nebojsa
Eric Sandeen - April 8, 2010, 4:34 p.m.
Nebojsa Trpkovic wrote:
> On 04/08/10 17:32, Eric Sandeen wrote:
>> Well, you might just keep in mind that:
>>
>> 1) trimming these small amounts has actually looked very inefficient, and
>> 2) data=writeback really isn't very safe in the face of a crash or power loss, and
>> 3) hopefully we'll have a better trim solution eventually.
> 
> 1) I understand that big TRIMs are better than small ones, but skipping
> some TRIMs completely would lead to slow but sure drive degradation, as
> the drive would have less and less spare space for wear leveling.
> 
> 2) Yes, I'm aware of the possible data=writeback inconsistency, but I've
> tried to let the I/O scheduler merge and reorganize as many writes as it
> can, all to avoid the small writes to the SSD that are the main cause of
> write amplification.
> 
> 3) I'll stick with not using data=writeback for the time being. I guess
> I'm doing just fine even without it. :)
> 
> 
> One more noob question, completely off-topic:
> Will the md layer pass the TRIM command to the drive if one has ext4 on
> Linux software RAID 0/1/5?

I think the answer is "not yet, but it's being worked on."

-Eric
 
> Nebojsa


Patch

Index: linux-2.6/fs/ext4/mballoc.c
===================================================================
--- linux-2.6.orig/fs/ext4/mballoc.c
+++ linux-2.6/fs/ext4/mballoc.c
@@ -4602,6 +4606,8 @@  do_more:
 		mb_clear_bits(bitmap_bh->b_data, bit, count);
 		ext4_mb_free_metadata(handle, &e4b, new_entry);
 	} else {
+		ext4_fsblk_t discard_block;
+
 		/* need to update group_info->bb_free and bitmap
 		 * with group lock held. generate_buddy look at
 		 * them with group lock_held
@@ -4609,6 +4615,11 @@  do_more:
 		ext4_lock_group(sb, block_group);
 		mb_clear_bits(bitmap_bh->b_data, bit, count);
 		mb_free_blocks(inode, &e4b, bit, count);
+		discard_block = bit +
+				ext4_group_first_block_no(sb, block_group);
+		trace_ext4_discard_blocks(sb,
+				(unsigned long long)discard_block, count);
+		sb_issue_discard(sb, discard_block, count);
 		ext4_mb_return_to_preallocation(inode, &e4b, block, count);
 	}
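For reference, the arithmetic in the hunk above turns a group-relative bit number into an absolute filesystem block: ext4_group_first_block_no() is, in effect, the superblock's first data block plus group * blocks-per-group. A standalone restatement of that arithmetic (illustrative values and helper names, not kernel code):

```c
#include <assert.h>

/* Illustrative restatement of the address arithmetic in the patch: an
 * absolute filesystem block is the group's first block plus the
 * group-relative bit number. first_data_block is typically 1 on
 * filesystems with 1 KiB blocks and 0 otherwise. Not kernel code. */
static unsigned long long
group_first_block_no(unsigned long long first_data_block,
		     unsigned long long blocks_per_group,
		     unsigned long long group)
{
	return first_data_block + group * blocks_per_group;
}

/* Mirrors "discard_block = bit + ext4_group_first_block_no(...)". */
static unsigned long long
discard_block_no(unsigned long long first_data_block,
		 unsigned long long blocks_per_group,
		 unsigned long long group, unsigned long long bit)
{
	return bit + group_first_block_no(first_data_block,
					  blocks_per_group, group);
}
```

For example, with 32768 blocks per group and first_data_block of 0, bit 100 of group 2 maps to block 65636, which is the range start handed to sb_issue_discard().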