Patchwork ext4: Do not normalize request from fallocate

login
register
mail settings
Submitter Lukas Czerner
Date March 21, 2013, 3:50 p.m.
Message ID <1363881045-21673-1-git-send-email-lczerner@redhat.com>
Download mbox | patch
Permalink /patch/229732/
State New
Headers show

Comments

Lukas Czerner - March 21, 2013, 3:50 p.m.
Block requests from fallocate has been normalized originally. Then it was
changed by 556b27abf73833923d5cd4be80006292e1b31662 not to normalize it.
And then it was changed by 3c6fe77017bc6ce489f231c35fed3220b6691836
again to normalize the request.

The fact is that we _never_ want to normalize the request from
fallocate. We know exactly how much space we're going to use and we do
not want anyone to mess with the request and there is no point in doing
so.

Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
large fallocate requests were not physically contiguous. However it is
important to see why that is the case. Because the request is so big the
allocator will try to find free group to allocate from skipping block
groups which are used, which is fine. However it will only allocate
extents of 2^15-1 block (limitation of uninitialized extent size)
which will leave one block in each block group free which will make the
extent tree physically non-contiguous, however _only_ by one block which
is perfectly fine.

This will never happen when we normalize the request because for some
reason (maybe bug) it will be normalized to much smaller request (2048
blocks) and those extents will then be merged together not leaving any
free block in between - hence physically contiguous. However the fact
that we're splitting huge requests into ton of smaller ones and then
merging extents together is very _very_ bad for fallocate performance.

The situation is even worst since with commit
ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
uninitialized extents so we end up with absolutely _huge_ extent tree
for bigger fallocate requests which is also bad for performance but not
only when fallocate itself, but even when working with the file
later on.

Fix this by disabling normalization for fallocate. From my simple testing
with this commit fallocate is much faster on non fragmented file
system. On my system fallocate 15T is almost 3x faster with this patch
and removing this file is almost 2x faster - tested on real hardware.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)
Dmitri Monakho - March 21, 2013, 4:03 p.m.
On Thu, 21 Mar 2013 16:50:45 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
> Block requests from fallocate has been normalized originally. Then it was
> changed by 556b27abf73833923d5cd4be80006292e1b31662 not to normalize it.
> And then it was changed by 3c6fe77017bc6ce489f231c35fed3220b6691836
> again to normalize the request.
> 
> The fact is that we _never_ want to normalize the request from
> fallocate. We know exactly how much space we're going to use and we do
> not want anyone to mess with the request and there is no point in doing
> so.
Looks reasonable. 
Reviewed-by:Dmitry Monakhov <dmonakhov@openvz.org>
> 
> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> large fallocate requests were not physically contiguous. However it is
> important to see why that is the case. Because the request is so big the
> allocator will try to find free group to allocate from skipping block
> groups which are used, which is fine. However it will only allocate
> extents of 2^15-1 block (limitation of uninitialized extent size)
> which will leave one block in each block group free which will make the
> extent tree physically non-contiguous, however _only_ by one block which
> is perfectly fine.
> 
> This will never happen when we normalize the request because for some
> reason (maybe bug) it will be normalized to much smaller request (2048
> blocks) and those extents will then be merged together not leaving any
> free block in between - hence physically contiguous. However the fact
> that we're splitting huge requests into ton of smaller ones and then
> merging extents together is very _very_ bad for fallocate performance.
> 
> The situation is even worst since with commit
> ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
> uninitialized extents so we end up with absolutely _huge_ extent tree
> for bigger fallocate requests which is also bad for performance but not
> only when fallocate itself, but even when working with the file
> later on.
> 
> Fix this by disabling normalization for fallocate. From my simple testing
> with this commit fallocate is much faster on non fragmented file
> system. On my system fallocate 15T is almost 3x faster with this patch
> and removing this file is almost 2x faster - tested on real hardware.
> 
> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
> ---
>  fs/ext4/extents.c |   18 ++++++++++--------
>  1 files changed, 10 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index e2bb929..a40a602 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4422,16 +4422,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>  		trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
>  		return ret;
>  	}
> -	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
> -	if (mode & FALLOC_FL_KEEP_SIZE)
> -		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
> +
>  	/*
> -	 * Don't normalize the request if it can fit in one extent so
> -	 * that it doesn't get unnecessarily split into multiple
> -	 * extents.
> +	 * We do NOT want the requests from fallocate to be normalized
> +	 * ever!. We know exactly how much we want to allocate and
> +	 * we do not need to do any mumbo-jumbo with it. Requests bigger
> +	 * than uninit extent size, will be divided automatically into
> +	 * biggest possible extents.
>  	 */
> -	if (len <= EXT_UNINIT_MAX_LEN << blkbits)
> -		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
> +	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
> +		EXT4_GET_BLOCKS_NO_NORMALIZE;
> +	if (mode & FALLOC_FL_KEEP_SIZE)
> +		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>  
>  retry:
>  	while (ret >= 0 && ret < max_blocks) {
> -- 
> 1.7.7.6
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Harm - March 22, 2013, 5:10 p.m.
[Resending without html.]
I haven't looked at ext4 in awhile now. CC'ing Ted and Curt in case
they want to chime in. See comment below as well.

On Thu, Mar 21, 2013 at 9:03 AM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> On Thu, 21 Mar 2013 16:50:45 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
>> Block requests from fallocate has been normalized originally. Then it was
>> changed by 556b27abf73833923d5cd4be80006292e1b31662 not to normalize it.
>> And then it was changed by 3c6fe77017bc6ce489f231c35fed3220b6691836
>> again to normalize the request.
>>
>> The fact is that we _never_ want to normalize the request from
>> fallocate. We know exactly how much space we're going to use and we do
>> not want anyone to mess with the request and there is no point in doing
>> so.
> Looks reasonable.
> Reviewed-by:Dmitry Monakhov <dmonakhov@openvz.org>
>>
>> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that

Is this a publicly available commit? I did not find it with a quick
google search.
Anyway, I don't see a problem with this commit off-hand, but I'll
defer to Ted or Curt.

-Greg Harmon

>> large fallocate requests were not physically contiguous. However it is
>> important to see why that is the case. Because the request is so big the
>> allocator will try to find free group to allocate from skipping block
>> groups which are used, which is fine. However it will only allocate
>> extents of 2^15-1 block (limitation of uninitialized extent size)
>> which will leave one block in each block group free which will make the
>> extent tree physically non-contiguous, however _only_ by one block which
>> is perfectly fine.
>>
>> This will never happen when we normalize the request because for some
>> reason (maybe bug) it will be normalized to much smaller request (2048
>> blocks) and those extents will then be merged together not leaving any
>> free block in between - hence physically contiguous. However the fact
>> that we're splitting huge requests into ton of smaller ones and then
>> merging extents together is very _very_ bad for fallocate performance.
>>
>> The situation is even worst since with commit
>> ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
>> uninitialized extents so we end up with absolutely _huge_ extent tree
>> for bigger fallocate requests which is also bad for performance but not
>> only when fallocate itself, but even when working with the file
>> later on.
>>
>> Fix this by disabling normalization for fallocate. From my simple testing
>> with this commit fallocate is much faster on non fragmented file
>> system. On my system fallocate 15T is almost 3x faster with this patch
>> and removing this file is almost 2x faster - tested on real hardware.
>>
>> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
>> ---
>>  fs/ext4/extents.c |   18 ++++++++++--------
>>  1 files changed, 10 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index e2bb929..a40a602 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -4422,16 +4422,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>>               trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
>>               return ret;
>>       }
>> -     flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
>> -     if (mode & FALLOC_FL_KEEP_SIZE)
>> -             flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>> +
>>       /*
>> -      * Don't normalize the request if it can fit in one extent so
>> -      * that it doesn't get unnecessarily split into multiple
>> -      * extents.
>> +      * We do NOT want the requests from fallocate to be normalized
>> +      * ever!. We know exactly how much we want to allocate and
>> +      * we do not need to do any mumbo-jumbo with it. Requests bigger
>> +      * than uninit extent size, will be divided automatically into
>> +      * biggest possible extents.
>>        */
>> -     if (len <= EXT_UNINIT_MAX_LEN << blkbits)
>> -             flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
>> +     flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
>> +             EXT4_GET_BLOCKS_NO_NORMALIZE;
>> +     if (mode & FALLOC_FL_KEEP_SIZE)
>> +             flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>>
>>  retry:
>>       while (ret >= 0 && ret < max_blocks) {
>> --
>> 1.7.7.6
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 22, 2013, 7:36 p.m.
On Fri, Mar 22, 2013 at 10:10:46AM -0700, Greg Harmon wrote:
> >>
> >> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> 
> Is this a publicly available commit? 

Commit 3c6fe77017bc?  Yes, it landed upstream as v3.2 (try doing git
tag --contains 3c6fe77017bc to find out when it hit upstream).

    	       		       	    	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Greg Harm - March 22, 2013, 10:19 p.m.
On Fri, Mar 22, 2013 at 1:03 PM, Lukáš Czerner <lczerner@redhat.com> wrote:
> On Fri, 22 Mar 2013, Greg Harmon wrote:
>
>> Date: Fri, 22 Mar 2013 10:10:46 -0700
>> From: Greg Harmon <gharm@google.com>
>> To: Dmitry Monakhov <dmonakhov@openvz.org>
>> Cc: Lukas Czerner <lczerner@redhat.com>,
>>     "linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
>>     Theodore Ts'o <tytso@mit.edu>, Curt Wohlgemuth <curtw@google.com>
>> Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
>>
>> [Resending without html.]
>> I haven't looked at ext4 in awhile now. CC'ing Ted and Curt in case
>> they want to chime in. See comment below as well.
>>
>> On Thu, Mar 21, 2013 at 9:03 AM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
>> > On Thu, 21 Mar 2013 16:50:45 +0100, Lukas Czerner <lczerner@redhat.com> wrote:
>> >> Block requests from fallocate has been normalized originally. Then it was
>> >> changed by 556b27abf73833923d5cd4be80006292e1b31662 not to normalize it.
>> >> And then it was changed by 3c6fe77017bc6ce489f231c35fed3220b6691836
>> >> again to normalize the request.
>> >>
>> >> The fact is that we _never_ want to normalize the request from
>> >> fallocate. We know exactly how much space we're going to use and we do
>> >> not want anyone to mess with the request and there is no point in doing
>> >> so.
>> > Looks reasonable.
>> > Reviewed-by:Dmitry Monakhov <dmonakhov@openvz.org>
>> >>
>> >> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
>>
>> Is this a publicly available commit? I did not find it with a quick
>> google search.
>> Anyway, I don't see a problem with this commit off-hand, but I'll
>> defer to Ted or Curt.
>
> Yes it is publicly available (git show
> 3c6fe77017bc6ce489f231c35fed3220b6691836). I cc'ed you because you
> are the author of that commit and this one if effectively
> reverting it, so I was curious whether you have anything to say
> about it in case I've missed something.
>
> Thanks!

Not sure why I was having trouble finding those commits earlier;
(ec22ba8edb507395c95fbc617eea26a6b2d98797 is in git but not indexed on
google). Thanks for the notice.

-Greg

> -Lukas
>
>>
>> -Greg Harmon
>>
>> >> large fallocate requests were not physically contiguous. However it is
>> >> important to see why that is the case. Because the request is so big the
>> >> allocator will try to find free group to allocate from skipping block
>> >> groups which are used, which is fine. However it will only allocate
>> >> extents of 2^15-1 block (limitation of uninitialized extent size)
>> >> which will leave one block in each block group free which will make the
>> >> extent tree physically non-contiguous, however _only_ by one block which
>> >> is perfectly fine.
>> >>
>> >> This will never happen when we normalize the request because for some
>> >> reason (maybe bug) it will be normalized to much smaller request (2048
>> >> blocks) and those extents will then be merged together not leaving any
>> >> free block in between - hence physically contiguous. However the fact
>> >> that we're splitting huge requests into ton of smaller ones and then
>> >> merging extents together is very _very_ bad for fallocate performance.
>> >>
>> >> The situation is even worst since with commit
>> >> ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
>> >> uninitialized extents so we end up with absolutely _huge_ extent tree
>> >> for bigger fallocate requests which is also bad for performance but not
>> >> only when fallocate itself, but even when working with the file
>> >> later on.
>> >>
>> >> Fix this by disabling normalization for fallocate. From my simple testing
>> >> with this commit fallocate is much faster on non fragmented file
>> >> system. On my system fallocate 15T is almost 3x faster with this patch
>> >> and removing this file is almost 2x faster - tested on real hardware.
>> >>
>> >> Signed-off-by: Lukas Czerner <lczerner@redhat.com>
>> >> ---
>> >>  fs/ext4/extents.c |   18 ++++++++++--------
>> >>  1 files changed, 10 insertions(+), 8 deletions(-)
>> >>
>> >> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> >> index e2bb929..a40a602 100644
>> >> --- a/fs/ext4/extents.c
>> >> +++ b/fs/ext4/extents.c
>> >> @@ -4422,16 +4422,18 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
>> >>               trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
>> >>               return ret;
>> >>       }
>> >> -     flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
>> >> -     if (mode & FALLOC_FL_KEEP_SIZE)
>> >> -             flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>> >> +
>> >>       /*
>> >> -      * Don't normalize the request if it can fit in one extent so
>> >> -      * that it doesn't get unnecessarily split into multiple
>> >> -      * extents.
>> >> +      * We do NOT want the requests from fallocate to be normalized
>> >> +      * ever!. We know exactly how much we want to allocate and
>> >> +      * we do not need to do any mumbo-jumbo with it. Requests bigger
>> >> +      * than uninit extent size, will be divided automatically into
>> >> +      * biggest possible extents.
>> >>        */
>> >> -     if (len <= EXT_UNINIT_MAX_LEN << blkbits)
>> >> -             flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
>> >> +     flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
>> >> +             EXT4_GET_BLOCKS_NO_NORMALIZE;
>> >> +     if (mode & FALLOC_FL_KEEP_SIZE)
>> >> +             flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
>> >>
>> >>  retry:
>> >>       while (ret >= 0 && ret < max_blocks) {
>> >> --
>> >> 1.7.7.6
>> >>
>> >> --
>> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> >> the body of a message to majordomo@vger.kernel.org
>> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 24, 2013, 12:11 a.m.
On Thu, Mar 21, 2013 at 04:50:45PM +0100, Lukas Czerner wrote:
> 
> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> large fallocate requests were not physically contiguous. However it is
> important to see why that is the case. Because the request is so big the
> allocator will try to find free group to allocate from skipping block
> groups which are used, which is fine. However it will only allocate
> extents of 2^15-1 block (limitation of uninitialized extent size)
> which will leave one block in each block group free which will make the
> extent tree physically non-contiguous, however _only_ by one block which
> is perfectly fine.

Well, it's actually really unfortunate.  The file ends up being more
fragmented, and from an alignment point of view it's really horrid.
For a RAID array with a power of 2 stripe size, or a flash device with
a power of 2 erase block size, the result is actually quite
spectacularly bad:

File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..   32766:     458752..    491518:  32767:             unwritten
   1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
   2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten
   3:    98301..  131067:     622592..    655358:  32767:     622591: unwritten
   4:   131068..  163834:     655360..    688126:  32767:     655359: unwritten
   5:   163835..  196601:     688128..    720894:  32767:     688127: unwritten
   6:   196602..  229368:     720896..    753662:  32767:     720895: unwritten
   7:   229369..  262135:     753664..    786430:  32767:     753663: unwritten
   8:   262136..  262143:     786432..    786439:      8:     786431: unwritten,eof
1: 9 extents found

That being said, what we were doing before was quite bad, and you're
quite right about your analysis here:

> This will never happen when we normalize the request because for some
> reason (maybe bug) it will be normalized to much smaller request (2048
> blocks) and those extents will then be merged together not leaving any
> free block in between - hence physically contiguous. However the fact
> that we're splitting huge requests into ton of smaller ones and then
> merging extents together is very _very_ bad for fallocate performance.
> 
> The situation is even worst since with commit
> ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
> uninitialized extents so we end up with absolutely _huge_ extent tree
> for bigger fallocate requests which is also bad for performance but not
> only when fallocate itself, but even when working with the file
> later on.

Without this patch, we currently do this for the same 1g file:

Filesystem type is: ef53
File size of 2 is 1073741824 (262144 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..    2047:     305152..    307199:   2048:             unwritten
   1:     2048..    4095:     307200..    309247:   2048:             unwritten
   	  	       .....
 106:   217088..  219135:     522240..    524287:   2048:             unwritten
 107:   219136..  221183:     591872..    593919:   2048:     524288: unwritten
 108:   221184..  223231:     593920..    595967:   2048:             unwritten
 		       .....
 127:   260096..  262143:     632832..    634879:   2048:             unwritten,eof
2: 2 extents found

So I agree that what we're doing is poor, but the question is, can we
do something which is better that either of these two results?

That is, can we improve mballoc so that we keep an fallocated gigabyte
file as physically contiguous as possible, while using an optimal
number of on-disk extents?   i.e., 9 extents of length 32767.

Failing that, can we create 20 extents of length 16384 or so?

	      	     	       	       	  	 - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - March 24, 2013, 2:42 a.m.
On 2013-03-23, at 17:11, Theodore Ts'o <tytso@mit.edu> wrote:

> On Thu, Mar 21, 2013 at 04:50:45PM +0100, Lukas Czerner wrote:
>> 
>> Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
>> large fallocate requests were not physically contiguous. However it is
>> important to see why that is the case. Because the request is so big the
>> allocator will try to find free group to allocate from skipping block
>> groups which are used, which is fine. However it will only allocate
>> extents of 2^15-1 block (limitation of uninitialized extent size)
>> which will leave one block in each block group free which will make the
>> extent tree physically non-contiguous, however _only_ by one block which
>> is perfectly fine.
> 
> Well, it's actually really unfortunate.  The file ends up being more
> fragmented, and from an alignment point of view it's really horrid.

I was also wondering about this. 

> So I agree that what we're doing is poor, but the question is, can we
> do something which is better that either of these two results?

One option is to allocate a 32768-block in allocated extent and then write a 1-block zeroed-out extent. But, that would still cause a lot of seeks to write the single-block IO.  

> That is, can we improve mballoc so that we keep an fallocated gigabyte
> file as physically contiguous as possible, while using an optimal
> number of on-disk extents?   i.e., 9 extents of length 32767.
> 
> Failing that, can we create 20 extents of length 16384 or so?

I think this is probably the best compromise. It also improves then case for converting unwritten extents when overwriting the file, since it would be possible to merge the remaining fragments to the neighboring unwritten extents.

In the latter regard, it might be optimal to allocate approximately 32768/3 or 12288-block extents, since it would always allow merging fragments on both sides of an extent, if needed. 

Cheers, Andreas--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner - March 25, 2013, 10:09 a.m.
On Sat, 23 Mar 2013, Theodore Ts'o wrote:

> Date: Sat, 23 Mar 2013 20:11:43 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukas Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, gharm@google.com
> Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
> 
> On Thu, Mar 21, 2013 at 04:50:45PM +0100, Lukas Czerner wrote:
> > 
> > Commit 3c6fe77017bc6ce489f231c35fed3220b6691836 mentioned that
> > large fallocate requests were not physically contiguous. However it is
> > important to see why that is the case. Because the request is so big the
> > allocator will try to find free group to allocate from skipping block
> > groups which are used, which is fine. However it will only allocate
> > extents of 2^15-1 block (limitation of uninitialized extent size)
> > which will leave one block in each block group free which will make the
> > extent tree physically non-contiguous, however _only_ by one block which
> > is perfectly fine.
> 
> Well, it's actually really unfortunate.  The file ends up being more
> fragmented, and from an alignment point of view it's really horrid.
> For a RAID array with a power of 2 stripe size, or a flash device with
> a power of 2 erase block size, the result is actually quite
> spectacularly bad:

Sorry for being dense, but I am trying to understand why this is so
bad and what is the "expected" column there.

The physical offset of each extent bellow starts on the start of the
block group and it seems to me that it's perfectly aligned for every
power of two up to the block group size.

If the extent would start at the physical offset from the "expected"
column, than it would be misaligned.

Maybe I am missing something, or maybe I misunderstood the concept ?
But the only problem I see is that when we would like to use that
remaining one block, but that's expected and the only way to avoid
that is to allocate smaller extents instead as you suggested below
(16384 blocks).

Thanks!
-Lukas

> 
> File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..   32766:     458752..    491518:  32767:             unwritten
>    1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
>    2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten
>    3:    98301..  131067:     622592..    655358:  32767:     622591: unwritten
>    4:   131068..  163834:     655360..    688126:  32767:     655359: unwritten
>    5:   163835..  196601:     688128..    720894:  32767:     688127: unwritten
>    6:   196602..  229368:     720896..    753662:  32767:     720895: unwritten
>    7:   229369..  262135:     753664..    786430:  32767:     753663: unwritten
>    8:   262136..  262143:     786432..    786439:      8:     786431: unwritten,eof
> 1: 9 extents found
> 
> That being said, what we were doing before was quite bad, and you're
> quite right about your analysis here:
> 
> > This will never happen when we normalize the request because for some
> > reason (maybe bug) it will be normalized to much smaller request (2048
> > blocks) and those extents will then be merged together not leaving any
> > free block in between - hence physically contiguous. However the fact
> > that we're splitting huge requests into ton of smaller ones and then
> > merging extents together is very _very_ bad for fallocate performance.
> > 
> > The situation is even worst since with commit
> > ec22ba8edb507395c95fbc617eea26a6b2d98797 we no longer merge
> > uninitialized extents so we end up with absolutely _huge_ extent tree
> > for bigger fallocate requests which is also bad for performance but not
> > only when fallocate itself, but even when working with the file
> > later on.
> 
> Without this patch, we currently do this for the same 1g file:
> 
> Filesystem type is: ef53
> File size of 2 is 1073741824 (262144 blocks of 4096 bytes)
>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>    0:        0..    2047:     305152..    307199:   2048:             unwritten
>    1:     2048..    4095:     307200..    309247:   2048:             unwritten
>    	  	       .....
>  106:   217088..  219135:     522240..    524287:   2048:             unwritten
>  107:   219136..  221183:     591872..    593919:   2048:     524288: unwritten
>  108:   221184..  223231:     593920..    595967:   2048:             unwritten
>  		       .....
>  127:   260096..  262143:     632832..    634879:   2048:             unwritten,eof
> 2: 2 extents found
> 
> So I agree that what we're doing is poor, but the question is, can we
> do something which is better that either of these two results?
> 
> That is, can we improve mballoc so that we keep an fallocated gigabyte
> file as physically contiguous as possible, while using an optimal
> number of on-disk extents?   i.e., 9 extents of length 32767.
> 
> Failing that, can we create 20 extents of length 16384 or so?
> 
> 	      	     	       	       	  	 - Ted
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 25, 2013, 12:53 p.m.
On Mon, Mar 25, 2013 at 11:09:35AM +0100, Lukáš Czerner wrote:
> 
> Sorry for being dense, but I am trying to understand why this is so
> bad and what is the "expected" column there.
> 
> The physical offset of each extent bellow starts on the start of the
> block group and it seems to me that it's perfectly aligned for every
> power of two up to the block group size.

Yes, but the logical offset isn't aligned.  Consider the simplest
workload, which is where we are writing the 1GB file sequentially.
Let's assume that the raid stripe size is 8M.  So ideally, we would
want each write to be a multiple of 8M, starting at logical block 0.

But look what happens here:

> > File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
> >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> >    0:        0..   32766:     458752..    491518:  32767:             unwritten
> >    1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
> >    2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten

If we do 8M writes, then we would want to write in chunks of 2048
blocks.  So consider what happens when we write the 2048 block chunk
starting with logical block 30720.  The fact that there is a
discontinuity between logical blocks 32766 and 32767 means that we
will have to do a read-modify-write cycle for that particular RAID
stripe.

Does that make more sense?

Another reason why keeping the file as physically contiguous as
possible is because we can now extent caching using the extent status
tree.  So if we can allocate the file using 2 physically contiguous
extents in instead of 9 or 10 physically contiguous extents, it means
the extent status tree uses less memory, too.  For a 1GB file, that
might not make that much difference, but if we caching 2048 of these
1G files (on a 2TB disk, for example), keeping the files as physically
contiguous as possible means we can cache the logical to physical
block mapping of all of these files much more easily.

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner - March 25, 2013, 1:26 p.m.
On Mon, 25 Mar 2013, Theodore Ts'o wrote:

> Date: Mon, 25 Mar 2013 08:53:09 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: linux-ext4@vger.kernel.org, gharm@google.com
> Subject: Re: [PATCH] ext4: Do not normalize request from fallocate
> 
> On Mon, Mar 25, 2013 at 11:09:35AM +0100, Lukáš Czerner wrote:
> > 
> > Sorry for being dense, but I am trying to understand why this is so
> > bad and what is the "expected" column there.
> > 
> > The physical offset of each extent bellow starts on the start of the
> > block group and it seems to me that it's perfectly aligned for every
> > power of two up to the block group size.
> 
> Yes, but the logical offset isn't aligned.  Consider the simplest
> workload, which is where we are writing the 1GB file sequentially.
> Let's assume that the raid stripe size is 8M.  So ideally, we would
> want each write to be a multiple of 8M, starting at logical block 0.
> 
> But look what happens here:
> 
> > > File size of 1 is 1073741824 (262144 blocks of 4096 bytes)
> > >  ext:     logical_offset:        physical_offset: length:   expected: flags:
> > >    0:        0..   32766:     458752..    491518:  32767:             unwritten
> > >    1:    32767..   65533:     491520..    524286:  32767:     491519: unwritten
> > >    2:    65534..   98300:     589824..    622590:  32767:     524287: unwritten
> 
> If we do 8M writes, then we would want to write in chunks of 2048
> blocks.  So consider what happens when we write the 2048 block chunk
> starting with logical block 30720.  The fact that there is a
> discontinuity between logical blocks 32766 and 32767 means that we
> will have to do a read-modify-write cycle for that particular RAID
> stripe.
> 
> Does that make more sense?

Oh, now I get it :) Thanks a lot for explanation I kept thinking
about the physical layout and forgot that the logical is actually
misaligned.

> 
> Another reason why keeping the file as physically contiguous as
> possible is because we can now extent caching using the extent status
> tree.  So if we can allocate the file using 2 physically contiguous
> extents in instead of 9 or 10 physically contiguous extents, it means
> the extent status tree uses less memory, too.  For a 1GB file, that
> might not make that much difference, but if we caching 2048 of these
> 1G files (on a 2TB disk, for example), keeping the files as physically
> contiguous as possible means we can cache the logical to physical
> block mapping of all of these files much more easily.

Yes, that makes sense too.

> 
> Regards,
> 
> 						- Ted
> 

Thanks!
-Lukas
Theodore Ts'o - March 25, 2013, 2:44 p.m.
One of the things I've been thinking we need to add for some time is a
new field to the allocation_request structure which gives a hint to
mballoc about how much space we will eventually need (i.e., if we are
doing a 1GB fallocate, or if we know that we have 256megs worth of
dirty pages that we intend to write out in ext4_da_writepages, we
should tell mballoc that); this is different from how much space we
can actually use (i.e., 32,767 blocks given that's the maximum size we
can add in an uninitialized extents).  If mballoc knew this, it could
reserve that much space as a per-file preallocation.

That might be the easist way to solve this particular issue.

     	      	  	     	- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e2bb929..a40a602 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4422,16 +4422,18 @@  long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
 		trace_ext4_fallocate_exit(inode, offset, max_blocks, ret);
 		return ret;
 	}
-	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT;
-	if (mode & FALLOC_FL_KEEP_SIZE)
-		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
+
 	/*
-	 * Don't normalize the request if it can fit in one extent so
-	 * that it doesn't get unnecessarily split into multiple
-	 * extents.
+	 * We do NOT want the requests from fallocate to be normalized
+	 * ever!. We know exactly how much we want to allocate and
+	 * we do not need to do any mumbo-jumbo with it. Requests bigger
+	 * than uninit extent size, will be divided automatically into
+	 * biggest possible extents.
 	 */
-	if (len <= EXT_UNINIT_MAX_LEN << blkbits)
-		flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;
+	flags = EXT4_GET_BLOCKS_CREATE_UNINIT_EXT |
+		EXT4_GET_BLOCKS_NO_NORMALIZE;
+	if (mode & FALLOC_FL_KEEP_SIZE)
+		flags |= EXT4_GET_BLOCKS_KEEP_SIZE;
 
 retry:
 	while (ret >= 0 && ret < max_blocks) {