Patchwork [2/2] ext4: fix bug in ext4_mb_normalize_request()

login
register
mail settings
Submitter Maurizio Lombardi
Date March 3, 2014, 2 p.m.
Message ID <1393855228-13592-3-git-send-email-mlombard@redhat.com>
Download mbox | patch
Permalink /patch/325843/
State New
Headers show

Comments

Maurizio Lombardi - March 3, 2014, 2 p.m.
When normalizing the data requests, the number of blocks to allocate
must not be higher than the number of blocks per group.
The current implementation does not take care of that and it may
hit a kernel panic if the number of blocks per group is very low.

This patch fixes the bug by ensuring that the number of blocks to allocate
is always less or equal to the number of blocks per group.

How to reproduce the bug:

#mkfs.ext4 -g 1024 /dev/sdX
#mount /dev/sdX /mnt
#dd if=/dev/zero of=/mnt/test bs=1M count=10

[  147.779177] ------------[ cut here ]------------
[  147.780015] kernel BUG at fs/ext4/mballoc.c:3145!
[  147.780015] invalid opcode: 0000 [#1] SMP
[  147.780015] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc loop snd_pcm cirrus snd_timer ttm snd drm_kms_helper soundcore drm parport_pc parport i2c_piix4 pcspkr i2c_core xfs libcrc32c e1000 floppy
[  147.780015] CPU: 0 PID: 66 Comm: kworker/u8:3 Not tainted 3.14.0-rc4+ #12
[  147.780015] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[  147.780015] Workqueue: writeback bdi_writeback_workfn (flush-7:0)
[  147.780015] task: ffff88002ec16300 ti: ffff88002ed20000 task.ti: ffff88002ed20000
[  147.780015] RIP: 0010:[<ffffffff812b779c>]  [<ffffffff812b779c>] ext4_mb_normalize_request+0x60c/0x660
[  147.780015] RSP: 0018:ffff88002ed21778  EFLAGS: 00010206
[  147.780015] RAX: ffff88002e3bb000 RBX: 0000000000000800 RCX: 0000000000000006
[  147.780015] RDX: 0000000000000800 RSI: 0000000000000046 RDI: ffff88002e3bb800
[  147.780015] RBP: ffff88002ed217e8 R08: 000000000000000a R09: 00000000000003a2
[  147.780015] R10: 0000000000000000 R11: 00000000000003a1 R12: ffff880000c17000
[  147.780015] R13: 0000000000000000 R14: 0000000000000800 R15: ffff88003d1fc2f8
[  147.780015] FS:  0000000000000000(0000) GS:ffff88003fc00000(0000) knlGS:0000000000000000
[  147.780015] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  147.780015] CR2: ffffffffff600400 CR3: 000000001c218000 CR4: 00000000000006f0
[  147.780015] Stack:
[  147.780015]  00000000014ca000 0000000000000800 ffff88002e3bb000 ffff88003d1fc0b0
[  147.780015]  ffff88002ed21980 0000080000000800 ffffffff812bd912 ffff88003d1fc2f8
[  147.780015]  ffff88002ed217f8 ffff88002ed21980 ffff88002e3bb000 ffff88002ed21970
[  147.780015] Call Trace:
[  147.780015]  [<ffffffff812bd912>] ? ext4_mb_new_blocks+0x122/0x8d0
[  147.780015]  [<ffffffff812bdbe3>] ext4_mb_new_blocks+0x3f3/0x8d0
[  147.780015]  [<ffffffff8116df7e>] ? free_hot_cold_page_list+0x4e/0xa0
[  147.780015]  [<ffffffff811bc72a>] ? __kmalloc+0x1ea/0x230
[  147.780015]  [<ffffffff812af4a8>] ? ext4_ext_find_extent+0x228/0x2b0
[  147.780015]  [<ffffffff812af4a8>] ? ext4_ext_find_extent+0x228/0x2b0
[  147.780015]  [<ffffffff812b38c1>] ext4_ext_map_blocks+0x611/0xfd0
[  147.780015]  [<ffffffff81284f55>] ext4_map_blocks+0x2b5/0x4d0
[  147.780015]  [<ffffffff81289dd1>] ext4_writepages+0x621/0xd00
[  147.780015]  [<ffffffff81171bbe>] do_writepages+0x1e/0x40
[  147.780015]  [<ffffffff811fecb0>] __writeback_single_inode+0x40/0x200
[  147.780015]  [<ffffffff811ff5d1>] writeback_sb_inodes+0x1c1/0x410
[  147.780015]  [<ffffffff811ff9e4>] wb_writeback+0xf4/0x2c0
[  147.780015]  [<ffffffff810a0f2f>] ? set_worker_desc+0x6f/0x80
[  147.780015]  [<ffffffff81202d98>] bdi_writeback_workfn+0x118/0x440
[  147.780015]  [<ffffffff8109d99a>] process_one_work+0x17a/0x410
[  147.780015]  [<ffffffff8109ed9c>] worker_thread+0x11c/0x370
[  147.780015]  [<ffffffff8109ec80>] ? manage_workers.isra.21+0x2b0/0x2b0
[  147.780015]  [<ffffffff810a55b9>] kthread+0xc9/0xe0
[  147.780015]  [<ffffffff81010000>] ? ftrace_raw_event_xen_mc_flush+0x50/0x180
[  147.780015]  [<ffffffff810a54f0>] ? flush_kthread_worker+0x80/0x80
[  147.780015]  [<ffffffff816ffc3c>] ret_from_fork+0x7c/0xb0
[  147.780015]  [<ffffffff810a54f0>] ? flush_kthread_worker+0x80/0x80
[  147.780015] Code: 1a a4 81 31 c0 e8 05 50 43 00 49 8b 44 24 08 8b 75 b8 48 c7 c7 c3 1a a4 81 48 8b 80 f8 02 00 00 48 8b 50 18 31 c0 e8 e4 4f 43 00 <0f> 0b 44 89 ee 48 c7 c7 b7 1a a4 81 31 c0 e8 d1 4f 43 00 49 8b
[  147.780015] RIP  [<ffffffff812b779c>] ext4_mb_normalize_request+0x60c/0x660
[  147.780015]  RSP <ffff88002ed21778>
[  147.830356] ---[ end trace b82d39f39fe4e04a ]---
[  147.831058] Kernel panic - not syncing: Fatal exception

Signed-off-by: Maurizio Lombardi <mlombard@redhat.com>
---
 fs/ext4/mballoc.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)
Theodore Ts'o - March 6, 2014, 3:44 p.m.
On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> index 08ddfda..546575a 100644
> --- a/fs/ext4/mballoc.c
> +++ b/fs/ext4/mballoc.c
> @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
>  		size	  = ac->ac_o_ex.fe_len << bsbits;
>  	}
>  	size = size >> bsbits;
> +
> +	/* In any case, the size cannot be greater than the number
> +	 * of maximum free blocks per group.
> +	 */
> +	if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> +		int sz_log2;
> +
> +		size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> +
> +		/* Recalculate the start offset */
> +		sz_log2 = __fls(size << bsbits);
> +		start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> +					(sz_log2 - bsbits)) << sz_log2;
> +	}
> +
>  	start = start_off >> bsbits;
>  
>  	/* don't cover already allocated blocks in selected range */

This definitely fixes the bug.  However, there will be some cases
where if the blocks per group is sufficiently small, where for smaller
files, start_off would have been 0 instead of that complicated
expression.

Looking at ext4_mb_normalize_request(), exactly what this code is
trying to do is actually a bit opaque to me, and every time I look at
it I get a headache.

Andreas, can you take a look at this?  I think you may know this code
better --- and it's somewhere I've been waiting to do some cleanup, or
at least some improved code comments.

Thanks!!

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Maurizio Lombardi - March 6, 2014, 4:54 p.m.
On Thu, Mar 06, 2014 at 10:44:07AM -0500, Theodore Ts'o wrote:
> On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > index 08ddfda..546575a 100644
> > --- a/fs/ext4/mballoc.c
> > +++ b/fs/ext4/mballoc.c
> > @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> >  		size	  = ac->ac_o_ex.fe_len << bsbits;
> >  	}
> >  	size = size >> bsbits;
> > +
> > +	/* In any case, the size cannot be greater than the number
> > +	 * of maximum free blocks per group.
> > +	 */
> > +	if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> > +		int sz_log2;
> > +
> > +		size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> > +
> > +		/* Recalculate the start offset */
> > +		sz_log2 = __fls(size << bsbits);
> > +		start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> > +					(sz_log2 - bsbits)) << sz_log2;
> > +	}
> > +
> >  	start = start_off >> bsbits;
> >  
> >  	/* don't cover already allocated blocks in selected range */
> 
> This definitely fixes the bug.  However, there will be some cases
> where if the blocks per group is sufficiently small, where for smaller
> files, start_off would have been 0 instead of that complicated
> expression.

Mmmm... if I correctly understood how ext4_normalize_request() works, everytime
you truncate the value of "size" you have to recalculate the correct start offset.
Note that start_off is zero only in those case where size is left untouched or
increased.

> 
> Looking at ext4_mb_normalize_request(), exactly what this code is
> trying to do is actually a bit opaque to me, and every time I look at
> it I get a headache.

Yes unfortunately the code is not very easy to understand.
I may be missing something and it would be nice to have someone who knows it
better to shed some light on it.

Regards,
Maurizio Lombardi

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner - March 6, 2014, 5:54 p.m.
On Thu, 6 Mar 2014, Maurizio Lombardi wrote:

> Date: Thu, 6 Mar 2014 17:54:16 +0100
> From: Maurizio Lombardi <mlombard@redhat.com>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: adilger.kernel@dilger.ca, linux-ext4@vger.kernel.org,
>     linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
> 
> On Thu, Mar 06, 2014 at 10:44:07AM -0500, Theodore Ts'o wrote:
> > On Mon, Mar 03, 2014 at 03:00:28PM +0100, Maurizio Lombardi wrote:
> > > diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
> > > index 08ddfda..546575a 100644
> > > --- a/fs/ext4/mballoc.c
> > > +++ b/fs/ext4/mballoc.c
> > > @@ -3059,6 +3059,21 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
> > >  		size	  = ac->ac_o_ex.fe_len << bsbits;
> > >  	}
> > >  	size = size >> bsbits;
> > > +
> > > +	/* In any case, the size cannot be greater than the number
> > > +	 * of maximum free blocks per group.
> > > +	 */
> > > +	if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
> > > +		int sz_log2;
> > > +
> > > +		size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
> > > +
> > > +		/* Recalculate the start offset */
> > > +		sz_log2 = __fls(size << bsbits);
> > > +		start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
> > > +					(sz_log2 - bsbits)) << sz_log2;
> > > +	}
> > > +
> > >  	start = start_off >> bsbits;
> > >  
> > >  	/* don't cover already allocated blocks in selected range */
> > 
> > This definitely fixes the bug.  However, there will be some cases
> > where if the blocks per group is sufficiently small, where for smaller
> > files, start_off would have been 0 instead of that complicated
> > expression.
> 
> Mmmm... if I correctly understood how ext4_normalize_request() works, everytime
> you truncate the value of "size" you have to recalculate the correct start offset.
> Note that start_off is zero only in those case where size is left untouched or
> increased.

(ignoring the fact that the ext4_mb_normalize_request() is broken
for now)

Yes it tries to align down the start_off of the bigger requests to the 512,
1024, 2048, or 4096 respectively. What the reason for it is really I have
no idea. The fact is however that this will only affect file systems
with bs smaller than 4k since the start_off will be always aligned to
block size afterwards (obviously).

That said this alignment is only done when the request is "big
enough". With your change we also do it when the block group is
"small enough" which is the change in behaviour which I think was
not really intended.

Honestly I do not think this really matters a lot but this alignment
you've added is not necessary.

All that said, I was getting to rewrite this mess a long time ago,
it's just a reminder that it's something that needs to be done.
Especially since the bigger requests are getting split unnecessarily
which hurts especially in fallocate case.

Thanks!
-Lukas

> 
> > 
> > Looking at ext4_mb_normalize_request(), exactly what this code is
> > trying to do is actually a bit opaque to me, and every time I look at
> > it I get a headache.
> 
> Yes unfortunately the code is not very easy to understand.
> I may be missing something and it would be nice to have someone who knows it
> better to shed some light on it.
> 
> Regards,
> Maurizio Lombardi
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 6, 2014, 6:32 p.m.
On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
> 
> All that said, I was getting to rewrite this mess a long time ago,
> it's just a reminder that it's something that needs to be done.
> Especially since the bigger requests are getting split unnecessarily
> which hurts especially in fallocate case.

Hey Lukas,

If you are going to try to do some clean up work in mballoc, that
would be really really great!  (Although hopefully you can still work
on writing down some thoughts about the data block checksum feature
before the ext4 developer's workshop.  :-)

We should try to get input from Andreas about what some of the more
interesting hueristics in mballoc were trying to accomplish, since
there's a lot going on that's not obvious, and one of the reasons why
I've always been worried about trying to do cleanups was because
something that looks ugly might be papering over some other dark
corner of mballoc.c ---- and so I was fairly certain that one we
started opening up mballoc.c, we'd have to do a lot of work on it, and
a lot of performance measurements to make sure we didn't accidentally
introduce some performance regression.

Thanks for being willing to look at this!

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - March 7, 2014, 9:09 p.m.
On Mar 6, 2014, at 11:32 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
>> 
>> All that said, I was getting to rewrite this mess a long time ago,
>> it's just a reminder that it's something that needs to be done.
>> Especially since the bigger requests are getting split unnecessarily
>> which hurts especially in fallocate case.
> 
> We should try to get input from Andreas about what some of the more
> interesting hueristics in mballoc were trying to accomplish, since
> there's a lot going on that's not obvious, and one of the reasons why
> I've always been worried about trying to do cleanups was because
> something that looks ugly might be papering over some other dark
> corner of mballoc.c ---- and so I was fairly certain that one we
> started opening up mballoc.c, we'd have to do a lot of work on it, and
> a lot of performance measurements to make sure we didn't accidentally
> introduce some performance regression.

There is actually quite a lengthy description of mballoc at the start
of the file.  I guess it would make sense to turn anything in this
thread into comments for ext4_mb_normalize_request() once verified.

So, below is hopefully a summary of what ext4_mb_normalize_request()
is actually doing.  I've CC'd Alex to correct my mistakes.  I think
the first few cases are commented accurately and self explanatorily:

* don't prealloc blocks for non-regular files (!EXT4_MB_HINT_DATA)
  - should we reconsider this for larger directories?
* don't use prealloc if caller wants exact (EXT4_MB_HINT_GOAL_ONLY)
  - currently unused, but would be useful for defrag
* don't reserve blocks if caller doesn't want it (EXT4_MB_HINT_NOPREALLOC)
  - used for small files or if requested data fits exactly into extent
* if write is a small file, use group prealloc (EXT4_MB_HINT_GROUP_ALLOC)
  - this combines multiple small writes into a single prealloc region
    and avoids read-modify-write of RAID stripes

The rest of the function is about handling large file writes efficiently.
* round up small writes to a power-of-two value for better alignment
  - we have a patch that makes the preallocation region sizes tunable,
    if that is something of interest.  That said, we don't really use it.
* if the request is large, align it to a power-of-two boundary
  - the allocation goal is based on the logical file offset, so that if
    a file is written sparsely by multiple threads, it can coalesce into
    a densely packed file in the end.  This is common for HPC jobs, or
    applications like bittorrent.
* the list_for_each() loops align the prealloc region with other regions
  - this helps when the file becomes fully allocated that the regions
    will be contiguous on disk

I'm pretty sure some of this is not 100% accurate, hopefully Alex can
comment and correct any inconsistencies.

Cheers, Andreas
Theodore Ts'o - May 26, 2014, 4:50 p.m.
On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
> Yes it tries to align down the start_off of the bigger requests to the 512,
> 1024, 2048, or 4096 respectively. What the reason for it is really I have
> no idea. The fact is however that this will only affect file systems
> with bs smaller than 4k since the start_off will be always aligned to
> block size afterwards (obviously).
> 
> That said this alignment is only done when the request is "big
> enough". With your change we also do it when the block group is
> "small enough" which is the change in behaviour which I think was
> not really intended.
> 
> Honestly I do not think this really matters a lot but this alignment
> you've added is not necessary.
> 
> All that said, I was getting to rewrite this mess a long time ago,
> it's just a reminder that it's something that needs to be done.
> Especially since the bigger requests are getting split unnecessarily
> which hurts especially in fallocate case.

Hey Lukas, where are we with respect to your plans to fix up this
code?

I'm trying to figure out what we should do with Maurizio's bug fix.
Should we apply it even though it's making some changes to the
existing behavior.

As far as to why the existing code is trying to align the starting
offset to a power of two, I believe the idea is to avoid
fragmentation, since the normalize_request function will tend to round
up allocaiton requests to the same power of two.

   	      	       	      	   	    - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner - June 3, 2014, 6:43 p.m.
On Mon, 26 May 2014, Theodore Ts'o wrote:

> Date: Mon, 26 May 2014 12:50:10 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Maurizio Lombardi <mlombard@redhat.com>, adilger.kernel@dilger.ca,
>     linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
> 
> On Thu, Mar 06, 2014 at 06:54:05PM +0100, Lukáš Czerner wrote:
> > Yes it tries to align down the start_off of the bigger requests to the 512,
> > 1024, 2048, or 4096 respectively. What the reason for it is really I have
> > no idea. The fact is however that this will only affect file systems
> > with bs smaller than 4k since the start_off will be always aligned to
> > block size afterwards (obviously).
> > 
> > That said this alignment is only done when the request is "big
> > enough". With your change we also do it when the block group is
> > "small enough" which is the change in behaviour which I think was
> > not really intended.
> > 
> > Honestly I do not think this really matters a lot but this alignment
> > you've added is not necessary.
> > 
> > All that said, I was getting to rewrite this mess a long time ago,
> > it's just a reminder that it's something that needs to be done.
> > Especially since the bigger requests are getting split unnecessarily
> > which hurts especially in fallocate case.
> 
> Hey Lukas, where are we with respect to your plans to fix up this
> code?
> 
> I'm trying to figure out what we should do with Maurizio's bug fix.
> Should we apply it even though it's making some changes to the
> existing behavior.
> 
> As far as to why the existing code is trying to align the starting
> offset to a power of two, I believe the idea is to avoid
> fragmentation, since the normalize_request function will tend to round
> up allocaiton requests to the same power of two.

Hi Ted,

I think that leaving the alignment of the start offset for the small
files/allocation is not good idea. We might end up with suboptimal
file layout for smaller files. While this is not a big deal for
bigger files, with smaller ones it might cause some troubles.

Maurizio, can you resend the patch without the alignment ?

Also I started looking into normalize_request and hopefully I'll
have a patch soon. Ted, do you have any suggestion for a test to
make sure that I do not make things worse ? You mentioned something
simple on LSF, but I do not remember what it was.

Thanks!
-Lukas


> 
>    	      	       	      	   	    - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
Theodore Ts'o - June 3, 2014, 8:36 p.m.
On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
> 
> I think that leaving the alignment of the start offset for the small
> files/allocation is not good idea. We might end up with suboptimal
> file layout for smaller files. While this is not a big deal for
> bigger files, with smaller ones it might cause some troubles.

I thought we were only aligning the start offset for files > 2MB?

> Also I started looking into normalize_request and hopefully I'll
> have a patch soon. Ted, do you have any suggestion for a test to
> make sure that I do not make things worse ? You mentioned something
> simple on LSF, but I do not remember what it was.

The two mechanisms which we have currently are:

1) e2freefrag to measure free space fragmentation

2) e2fsck -fE fragcheck /dev/sdXX

The latter will give you a reports such as this:

142618(f): expecting 950605 actual extent phys 950800 log 89 len 26
142618(f): expecting 950826 actual extent phys 950550 log 282 len 10

Which would correspond to the following:

debugfs:  stat <142618>
Inode: 142618   Type: regular    Mode:  0666   Flags: 0x80000
Generation: 2697220261    Version: 0x00000000:00000001
User:     0   Group:     0   Size: 1194112
File ACL: 0    Directory ACL: 0
Links: 1   Blockcount: 360
Fragment:  Address: 0    Number: 0    Size: 0
 ctime: 0x53495656:aa328a78 -- Sat Apr 12 11:05:58 2014
 atime: 0x53495643:c20a0ea4 -- Sat Apr 12 11:05:39 2014
 mtime: 0x53495653:43ad6c78 -- Sat Apr 12 11:05:55 2014
crtime: 0x53495641:1332f128 -- Sat Apr 12 11:05:37 2014
Size of extra inode fields: 28
EXTENTS:
(68-76):950596-950604, (89-114):950800-950825, (282-291):950550-950559

This is admittedly imperfect.  First of all, it gives false positives
when the file has holes (as in the above case).  And even if it didn't
what we should do is to print the previous extent before the
contiunity, since what's interesting is how big was the previous
extent before we had a discontinuity, and because the length of the
current extent isn't all that interesting in the case of last extent
(the "tail") of the file.

Hmm... if I have time I might look into patching e2fsck so that e2fsck
-E fragcheck is more useful.

What's also missing is some way of taking all of this fine-grained
information is turning it into one to three figures of merit that
could be used to compare different tweaks to the block allocation
algorithm.

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Lukas Czerner - June 6, 2014, 7:09 a.m.
On Tue, 3 Jun 2014, Theodore Ts'o wrote:

> Date: Tue, 3 Jun 2014 16:36:39 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Maurizio Lombardi <mlombard@redhat.com>, adilger.kernel@dilger.ca,
>     linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org
> Subject: Re: [PATCH 2/2] ext4: fix bug in ext4_mb_normalize_request()
> 
> On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
> > 
> > I think that leaving the alignment of the start offset for the small
> > files/allocation is not good idea. We might end up with suboptimal
> > file layout for smaller files. While this is not a big deal for
> > bigger files, with smaller ones it might cause some troubles.
> 
> I thought we were only aligning the start offset for files > 2MB?
> 
> > Also I started looking into normalize_request and hopefully I'll
> > have a patch soon. Ted, do you have any suggestion for a test to
> > make sure that I do not make things worse ? You mentioned something
> > simple on LSF, but I do not remember what it was.
> 
> The two mechanisms which we have currently are:
> 
> 1) e2freefrag to measure free space fragmentation

Yes, that's what I am using along with e4defrag and debugfs -R
"dump_extents" to figure out what changed.

> 
> 2) e2fsck -fE fragcheck /dev/sdXX

I was not aware of that, thanks. However I was more interested in
workload to test on. Right now I have a simple script which is
doing:

        # Test fallocate time
        echo "[+] Fallocate test"
        time fallocate -l70G ${MNT}/file
        do_freefrag
        do_defrag ${MNT}/file
        do_debugfs /file
        rm -f ${MNT}/file
        sync

        # Copy linux source
        echo "[+] Copy linux source"
        cp -r $linux_source ${MNT}/linux1 &
        cp -r $linux_source ${MNT}/linux2 &
        time wait
        do_freefrag

        # Tar linux source
        echo "[+] Tar linux source"
        tar -cf ${MNT}/linux1_1.tar ${MNT}/linux1 &
        tar -cf ${MNT}/linux1_2.tar ${MNT}/linux1 &
        time wait
        do_freefrag
        do_defrag ${MNT}/linux1_1.tar
        do_debugfs /linux1_1.tar
        do_defrag ${MNT}/linux1_2.tar
        do_debugfs /linux1_2.tar

        # Untar linux source
        echo "[+] Untar linux source"
        mkdir ${MNT}/tt1
        mkdir ${MNT}/tt2
        tar -xf ${MNT}/linux1_1.tar -C ${MNT}/tt1 &
        tar -xf ${MNT}/linux1_2.tar -C ${MNT}/tt2 &
        time wait
        do_freefrag

        # Singe dd
        echo "[+] Single dd"
        time dd if=/dev/zero of=${MNT}/file bs=512 count=4194304
        do_freefrag
        do_defrag ${MNT}/file
        do_debugfs /file

        # Multiple dd test
        echo "[+] Multiple dd"
        # 2G
        dd if=/dev/zero of=${MNT}/file0 bs=4096 count=524288 &
        # 2G
        dd if=/dev/urandom of=${MNT}/file1 bs=512 count=4194304 &
        # 2G
        dd if=/dev/zero of=${MNT}/file2 bs=1M count=2048 &
        # 4M
        dd if=/dev/zero of=${MNT}/file3 bs=4096 count=1024 &
        # 16M
        dd if=/dev/zero of=${MNT}/file4 bs=4096 count=4096 &
        time wait
        do_freefrag
        do_defrag ${MNT}/file0
        do_debugfs /file0
        do_defrag ${MNT}/file1
        do_debugfs /file1
        do_defrag ${MNT}/file2
        do_debugfs /file2
        do_defrag ${MNT}/file3
        do_debugfs /file3
        do_defrag ${MNT}/file4
        do_debugfs /file4

        # Run fsstress
        echo "[+] Run fsstress"
        time $fsstress -s 123456 -p24 -n 999 -d $MNT
        do_freefrag
        do_defrag ${MNT}

And then do the same thing on the loop device and investigate the
underlying image file.

Thanks!
-Lukas
Maurizio Lombardi - June 11, 2014, 8:47 a.m.
On Tue, Jun 03, 2014 at 04:36:39PM -0400, Theodore Ts'o wrote:
> On Tue, Jun 03, 2014 at 08:43:40PM +0200, Lukáš Czerner wrote:
> > 
> > I think that leaving the alignment of the start offset for the small
> > files/allocation is not good idea. We might end up with suboptimal
> > file layout for smaller files. While this is not a big deal for
> > bigger files, with smaller ones it might cause some troubles.
> 
> I thought we were only aligning the start offset for files > 2MB?
>

That's true.

With my patch the behaviour changes slightly because it doesn't depend
only from the file size but also from the number of blocks per group.

So, if you create a filesystem with blocksize==1Kb and 1024 blocks per group, we will start
aligning the start offset for files with size > 1Mb (1Kb * 1024)

The alignment is not performed on small files unless you have
a very low number of blocks per group.

Maurizio Lombardi

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 08ddfda..546575a 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3059,6 +3059,21 @@  ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		size	  = ac->ac_o_ex.fe_len << bsbits;
 	}
 	size = size >> bsbits;
+
+	/* In any case, the size cannot be greater than the number
+	 * of maximum free blocks per group.
+	 */
+	if (size > EXT4_BLOCKS_PER_GROUP(ac->ac_sb)) {
+		int sz_log2;
+
+		size = EXT4_BLOCKS_PER_GROUP(ac->ac_sb);
+
+		/* Recalculate the start offset */
+		sz_log2 = __fls(size << bsbits);
+		start_off = ((loff_t) ac->ac_o_ex.fe_logical >>
+					(sz_log2 - bsbits)) << sz_log2;
+	}
+
 	start = start_off >> bsbits;
 
 	/* don't cover already allocated blocks in selected range */