Patchwork ext4 xfstest regression due to ext4_es_lookup_extent

login
register
mail settings
Submitter Dmitri Monakho
Date Feb. 22, 2013, 5:17 p.m.
Message ID <87obfcs1x6.fsf@openvz.org>
Download mbox | patch
Permalink /patch/222577/
State New
Headers show

Comments

Dmitri Monakho - Feb. 22, 2013, 5:17 p.m.
301'th xfstests are failed due to :
commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
Author: Zheng Liu <wenqing.lz@taobao.com>
Date:   Mon Feb 18 00:29:59 2013 -0500

    ext4: lookup block mapping in extent status tree

TESTCASE: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0

 ------------[ cut here ]------------
 kernel BUG at fs/ext4/inode.c:1452!
 invalid opcode: 0000 [#1] SMP 
 Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes\
l ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
 CPU 2 
 Pid: 2142, comm: fio Not tainted 3.8.0-rc3+ #41                  /DQ67SW
 RIP: 0010:[<ffffffff81333c7b>]  [<ffffffff81333c7b>] mpage_da_submit_io+0x41b/0x5b0
 RSP: 0018:ffff8801c2837a48  EFLAGS: 00010202
 RAX: ffff880231e64bb0 RBX: ffff880231e64bb0 RCX: 0000000000000000
 RDX: 0000000000000003 RSI: 0000000000000001 RDI: ffffffff821372d8
 RBP: ffff8801c2837bb8 R08: ffff8801c2837df8 R09: 0000000000000001
 R10: ffff880231e64bb1 R11: 0000000000000003 R12: ffffea00076f5c28
 R13: 00000000000026d0 R14: ffff8801c2837be8 R15: 00000000000ea6d0
 FS:  00007fefeaeb5700(0000) GS:ffff88023d800000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007fefdf10299c CR3: 0000000209154000 CR4: 00000000000407e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process fio (pid: 2142, threadinfo ffff8801c2836000, task ffff8801d3c48580)
 Stack:
  ffffffff82030f00 ffff8801c2837b58 ffff8801c2837df8 0000000000000000
  ffff880231e64bb1 0000000000000001 ffff880231e64bb0 000000002ba68688
  ffff8801c2837ae8 00000000000d4800 ffff880231f1cb98 00000000000026d0
 Call Trace:
  [<ffffffff81339769>] mpage_da_map_and_submit+0x369/0x3d0
  [<ffffffff81339acb>] write_cache_pages_da+0x1db/0x6f0
  [<ffffffff8133b07f>] ext4_da_writepages+0x49f/0x8b0
  [<ffffffff81105d9a>] ? __lock_acquire+0x4ca/0x560
  [<ffffffff811b4421>] do_writepages+0x51/0x70
  [<ffffffff8119fc24>] __filemap_fdatawrite_range+0x64/0x70
  [<ffffffff811a6a8b>] sys_fadvise64_64+0x25b/0x300
  [<ffffffff811a6b3e>] sys_fadvise64+0xe/0x10
  [<ffffffff818e9e99>] system_call_fastpath+0x16/0x1b
 Code: a0 fe ff ff 4c 8b 95 b0 fe ff ff 49 63 d1 48 83 c2 02 48 8b 34 d5 30 4c 2b 82 48 83 c6 01 45 85 c9 48 89 34 d5 30 4c 2b 82 74 04 <0f> 0b eb fe 41 0f b6 56 13 83 e2 01 48 63 f2 48 83 c6 02 48 8b 
 RIP  [<ffffffff81333c7b>] mpage_da_submit_io+0x41b/0x5b0
  RSP <ffff8801c2837a48>
 ---[ end trace 6f79fb4f46cf9f0e ]---
 ------------[ cut here ]------------

Commit (d100eef2) this is very first commit where we try to make
extent-status-tree usefull. Definitely this is because cached value
is out of sync with real on-disk structure. I'll try to find out exact
place where this happen, but i'm shure this is just a beginning.
So Zheng please add sane self-testing infastructure for extent status
tree,  for example like follows:
Theodore Ts'o - Feb. 22, 2013, 6:03 p.m.
On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
> 
> 301'th xfstests are failed due to :
> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> Author: Zheng Liu <wenqing.lz@taobao.com>
> Date:   Mon Feb 18 00:29:59 2013 -0500
> 
>     ext4: lookup block mapping in extent status tree
> 
> TESTCASE: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0

Thanks for the heads up.  I haven't updatied the xfstests I've been
using yet, since I want to make sure I'm comparing apples and oranges
during the merge window when I'm checking for regressions; I'll update
my xfstests in a week or two after the merge window settles down, and
then I'll rerun my baseline tests using the updated xfstests against
3.8.0 and 3.9-rc2 or 3.9-rc3.

(And furthermore, these new xfstests aren't yet in xfstests upstream
yet, right?  Any comments from the xfstests maintainer about whether
they are going to be willing to take your proposed new test cases?)

So when you say this is a regression, I take it that this test #301
doesn't fail on commit d100eef2440f^, but it does fail on d100eef2440f,
correct?

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Feb. 23, 2013, 5:36 a.m.
Hi Dmitry,

Thanks for pointing out.  I will fold your patch into this commit, and
take a close look at this bug right now.

Regards,
						- Zheng

On 02/23/2013 01:17 AM, Dmitry Monakhov wrote:
> 
> 301'th xfstests are failed due to :
> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> Author: Zheng Liu <wenqing.lz@taobao.com>
> Date:   Mon Feb 18 00:29:59 2013 -0500
> 
>     ext4: lookup block mapping in extent status tree
> 
> TESTCASE: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0
> 
>  ------------[ cut here ]------------
>  kernel BUG at fs/ext4/inode.c:1452!
>  invalid opcode: 0000 [#1] SMP 
>  Modules linked in: cpufreq_ondemand acpi_cpufreq freq_table mperf coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel microcode sg xhci_hcd button sd_mod crc_t10dif aesni_intel ablk_helper cryptd lrw aes\
> l ahci libahci pata_acpi ata_generic dm_mirror dm_region_hash dm_log dm_mod
>  CPU 2 
>  Pid: 2142, comm: fio Not tainted 3.8.0-rc3+ #41                  /DQ67SW
>  RIP: 0010:[<ffffffff81333c7b>]  [<ffffffff81333c7b>] mpage_da_submit_io+0x41b/0x5b0
>  RSP: 0018:ffff8801c2837a48  EFLAGS: 00010202
>  RAX: ffff880231e64bb0 RBX: ffff880231e64bb0 RCX: 0000000000000000
>  RDX: 0000000000000003 RSI: 0000000000000001 RDI: ffffffff821372d8
>  RBP: ffff8801c2837bb8 R08: ffff8801c2837df8 R09: 0000000000000001
>  R10: ffff880231e64bb1 R11: 0000000000000003 R12: ffffea00076f5c28
>  R13: 00000000000026d0 R14: ffff8801c2837be8 R15: 00000000000ea6d0
>  FS:  00007fefeaeb5700(0000) GS:ffff88023d800000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 00007fefdf10299c CR3: 0000000209154000 CR4: 00000000000407e0
>  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>  Process fio (pid: 2142, threadinfo ffff8801c2836000, task ffff8801d3c48580)
>  Stack:
>   ffffffff82030f00 ffff8801c2837b58 ffff8801c2837df8 0000000000000000
>   ffff880231e64bb1 0000000000000001 ffff880231e64bb0 000000002ba68688
>   ffff8801c2837ae8 00000000000d4800 ffff880231f1cb98 00000000000026d0
>  Call Trace:
>   [<ffffffff81339769>] mpage_da_map_and_submit+0x369/0x3d0
>   [<ffffffff81339acb>] write_cache_pages_da+0x1db/0x6f0
>   [<ffffffff8133b07f>] ext4_da_writepages+0x49f/0x8b0
>   [<ffffffff81105d9a>] ? __lock_acquire+0x4ca/0x560
>   [<ffffffff811b4421>] do_writepages+0x51/0x70
>   [<ffffffff8119fc24>] __filemap_fdatawrite_range+0x64/0x70
>   [<ffffffff811a6a8b>] sys_fadvise64_64+0x25b/0x300
>   [<ffffffff811a6b3e>] sys_fadvise64+0xe/0x10
>   [<ffffffff818e9e99>] system_call_fastpath+0x16/0x1b
>  Code: a0 fe ff ff 4c 8b 95 b0 fe ff ff 49 63 d1 48 83 c2 02 48 8b 34 d5 30 4c 2b 82 48 83 c6 01 45 85 c9 48 89 34 d5 30 4c 2b 82 74 04 <0f> 0b eb fe 41 0f b6 56 13 83 e2 01 48 63 f2 48 83 c6 02 48 8b 
>  RIP  [<ffffffff81333c7b>] mpage_da_submit_io+0x41b/0x5b0
>   RSP <ffff8801c2837a48>
>  ---[ end trace 6f79fb4f46cf9f0e ]---
>  ------------[ cut here ]------------
> 
> Commit (d100eef2) this is very first commit where we try to make
> extent-status-tree usefull. Definitely this is because cached value
> is out of sync with real on-disk structure. I'll try to find out exact
> place where this happen, but i'm shure this is just a beginning.
> So Zheng please add sane self-testing infastructure for extent status
> tree,  for example like follows:
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Feb. 23, 2013, 9:37 a.m.
On Fri, 22 Feb 2013 13:03:25 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
> On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
> > 
> > 301'th xfstests are failed due to :
> > commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
> > Author: Zheng Liu <wenqing.lz@taobao.com>
> > Date:   Mon Feb 18 00:29:59 2013 -0500
> > 
> >     ext4: lookup block mapping in extent status tree
> > 
> > TESTCASE: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0
> 
> Thanks for the heads up.  I haven't updatied the xfstests I've been
> using yet, since I want to make sure I'm comparing apples and oranges
> during the merge window when I'm checking for regressions; I'll update
> my xfstests in a week or two after the merge window settles down, and
> then I'll rerun my baseline tests using the updated xfstests against
> 3.8.0 and 3.9-rc2 or 3.9-rc3.
> 
> (And furthermore, these new xfstests aren't yet in xfstests upstream
> yet, right?  Any comments from the xfstests maintainer about whether
> they are going to be willing to take your proposed new test cases?)
I hope so. I think i've fixed things according to Dave's commit.
> So when you say this is a regression, I take it that this test #301
> doesn't fail on commit d100eef2440f^, but it does fail on d100eef2440f,
> correct?
Correct. d100ee is the first bad commit which trigger BUGON()
But issue was introduced earlier  es_cache was not updated
after extents was swapped between inodes.
I'll prepare patch soon.
Actually I think that the regression in 269'th you have found recently
caused by similar issue and commit which you foud by bisecting ( the one
which allow migration between indirect<->extent based inodes)
simply helps to spot real issue in es_caching code.

BUT my main idea is that we need robust self-testing infrastructure
similar one that we have at the time extents was introduced to ext4. 
> 
> 						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Feb. 23, 2013, 10 a.m.
On 02/23/2013 05:37 PM, Dmitry Monakhov wrote:
> On Fri, 22 Feb 2013 13:03:25 -0500, "Theodore Ts'o" <tytso@mit.edu> wrote:
>> On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
>>>
>>> 301'th xfstests are failed due to :
>>> commit d100eef2440fea13e4f09e88b1c8bcbca64beb9f
>>> Author: Zheng Liu <wenqing.lz@taobao.com>
>>> Date:   Mon Feb 18 00:29:59 2013 -0500
>>>
>>>     ext4: lookup block mapping in extent status tree
>>>
>>> TESTCASE: https://github.com/dmonakhov/xfstests/commit/7b7efeee30a41109201e2040034e71db9b66ddc0
>>
>> Thanks for the heads up.  I haven't updatied the xfstests I've been
>> using yet, since I want to make sure I'm comparing apples and oranges
>> during the merge window when I'm checking for regressions; I'll update
>> my xfstests in a week or two after the merge window settles down, and
>> then I'll rerun my baseline tests using the updated xfstests against
>> 3.8.0 and 3.9-rc2 or 3.9-rc3.
>>
>> (And furthermore, these new xfstests aren't yet in xfstests upstream
>> yet, right?  Any comments from the xfstests maintainer about whether
>> they are going to be willing to take your proposed new test cases?)
> I hope so. I think i've fixed things according to Dave's commit.
>> So when you say this is a regression, I take it that this test #301
>> doesn't fail on commit d100eef2440f^, but it does fail on d100eef2440f,
>> correct?
> Correct. d100ee is the first bad commit which trigger BUGON()
> But issue was introduced earlier  es_cache was not updated
> after extents was swapped between inodes.

Yes, you are right.  I forgot to update status tree after we do a
defragmentation.

> I'll prepare patch soon.

Ah, thanks.  So I will wait your patch.

> Actually I think that the regression in 269'th you have found recently
> caused by similar issue and commit which you foud by bisecting ( the one
> which allow migration between indirect<->extent based inodes)
> simply helps to spot real issue in es_caching code.

I will revise this patch.  IIRC, we forgot to update status tree after
an inode is migrated from extent-based to indirect-based.  Thanks for
pointing out.

> 
> BUT my main idea is that we need robust self-testing infrastructure
> similar one that we have at the time extents was introduced to ext4. 

Could you please share more detailed with me?  Extents had been
introduced for a long time ago.  I have missed too many things.

Thanks,
						- Zheng

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Feb. 24, 2013, 12:14 a.m.
On Sat, Feb 23, 2013 at 06:00:35PM +0800, Zheng Liu wrote:
> > Actually I think that the regression in 269'th you have found recently
> > caused by similar issue and commit which you foud by bisecting ( the one
> > which allow migration between indirect<->extent based inodes)
> > simply helps to spot real issue in es_caching code.
> 
> I will revise this patch.  IIRC, we forgot to update status tree after
> an inode is migrated from extent-based to indirect-based.  Thanks for
> pointing out.

Can you do this as a new commit?  I've already bumped the master
pointer up since I finished running xfstests and I'm seeing no
regressions (at least with my set of xfstests).  So given that
everything has been tested and things looks pretty stable, I pushed up
the master branch.

I did remember that you were still working on this regression, but
since we're already half-way through the merge window, I really want
to make things are ready for a merge request to Linus.  (Which I
probably will be sending to Linus by Monday or Tuesday.)

I do plan to collect bug fixes and any remaining regression fixes to
push to Linus by -rc2 or -rc3, so if don't rush fixing up defrag
functionality.

Thanks!!

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Feb. 24, 2013, 3:21 a.m.
On Sat, Feb 23, 2013 at 07:14:47PM -0500, Theodore Ts'o wrote:
> On Sat, Feb 23, 2013 at 06:00:35PM +0800, Zheng Liu wrote:
> > > Actually I think that the regression in 269'th you have found recently
> > > caused by similar issue and commit which you foud by bisecting ( the one
> > > which allow migration between indirect<->extent based inodes)
> > > simply helps to spot real issue in es_caching code.
> > 
> > I will revise this patch.  IIRC, we forgot to update status tree after
> > an inode is migrated from extent-based to indirect-based.  Thanks for
> > pointing out.
> 
> Can you do this as a new commit?  I've already bumped the master
> pointer up since I finished running xfstests and I'm seeing no
> regressions (at least with my set of xfstests).  So given that
> everything has been tested and things looks pretty stable, I pushed up
> the master branch.

Yes, I will prepare it as a new commit.  But I am not pretty sure that
the root cause is es_caching.

> 
> I did remember that you were still working on this regression, but
> since we're already half-way through the merge window, I really want
> to make things are ready for a merge request to Linus.  (Which I
> probably will be sending to Linus by Monday or Tuesday.)
> 
> I do plan to collect bug fixes and any remaining regression fixes to
> push to Linus by -rc2 or -rc3, so if don't rush fixing up defrag
> functionality.

For defrag regression, I have two choice to fix it.  One is a quick but
sub-optimal fix that we can invalidate all written/unwritten/hole extent
from status tree.  But it will decrease the performance because we need
to load extent into status tree again.  Further, one thing we need to
keep in our mind is that some extent is unwritten and delayed.  So it
makes thing complicated.  But now we don't need to worry about it
because a bigalloc file system doesn't support defrag.  So we are safe.

Another is to update all extent in status tree.  I think this is a
better choice and I think Dmirty is working on it.  Dmitry, I don't get
your response.  Could you confirm it?

TBH, we never use migration and defrag feature in our product system.
I admit that I almost don't pay a attention to them.  It's my fault.

I make a plan for the next works.

1. Try to prepare a patch that invalidates all cache in status tree to
fix defrag regression, and wait Dmitry's patch.

2. Revise migration patch.

3. Submit remain patches for extent status tree that try to convert
unwritten extent in end_io callback function and remove a bogus wait in
ext4_ind_direct_IO.  Now the patch has already done and still need to be
tested.

4. get_block_t and *map_blocks cleanup.

5. extent-level locking.

Any comment?

Thanks,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Feb. 24, 2013, 2:58 p.m.
Hi Dmirty,

On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
[snip]
> From 65c5fc212b1c684c76899c6e5e1f24d88550c6fc Mon Sep 17 00:00:00 2001
> From: Dmitry Monakhov <dmonakhov@openvz.org>
> Date: Fri, 22 Feb 2013 20:55:52 +0400
> Subject: [PATCH] ext4 add sanity ext4_es_lookup_extent
> 
> This patch does very simple thing: it recheck result returned from
> ext4_es_lookup_extent() by comparing it old-good lookup via
> ext4_{ind,ext}_map_blocks() under i_data_sem
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>

I try to apply the following patch in my tree, but I realize that it
seems that we couldn't add this sanity check in ext4_map_blocks and
ext4_da_map_blocks now.  The reason is that when we try to initialize an
unwritten extent this extent could be zeroed out.  But we could know it
in *_map_blocks, and status tree couldn't be updated.  So we will hit a
BUG_ON(1) after added this sanity check.

I agree with you that we need to add self-testing infrastructre after
applied status tree patch series.  But it seems that we need to mix
updating status tree code with extent tree code.  IMHO it is too
complicated.  Any thought?

Thanks,
                                                - Zheng

> ---
>  fs/ext4/inode.c |   42 ++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 42 insertions(+), 0 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 95a0c62..706db1f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -482,6 +482,41 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
>  	return num;
>  }
>  
> +#define ES_AGGRESSIVE_TEST 1
> +#ifdef ES_AGGRESSIVE_TEST
> +void ext4_map_blocks_es_recheck(handle_t *handle, struct inode *inode,
> +				struct ext4_map_blocks *es_map, int es_ret,
> +				struct ext4_map_blocks *map, int flags)
> +{
> +	int ret;
> +
> +	map->m_flags = 0;
> +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> +		down_read((&EXT4_I(inode)->i_data_sem));
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> +		ret = ext4_ext_map_blocks(handle, inode, map, flags &
> +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> +	} else {
> +		ret = ext4_ind_map_blocks(handle, inode, map, flags &
> +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> +	}
> +	if (es_map->m_lblk != map->m_lblk ||
> +	    es_map->m_flags != map->m_flags ||
> +	    es_map->m_len != map->m_len ||
> +	    es_map->m_pblk != map->m_pblk ||
> +	    es_ret != ret) {
> +		printk("Assertation failed for inode:%lu "
> +		       "es_cached_ex [%d/%d/%llu/%x]:%d != "
> +		       "found_ex [%d/%d/%llu/%x]:%d\n", inode->i_ino,
> +		       es_map->m_lblk, es_map->m_len, es_map->m_pblk,
> +		       es_map->m_flags, es_ret,
> +		       map->m_lblk, map->m_len, map->m_pblk, map->m_flags, ret);
> +		BUG();
> +	}
> +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> +		up_read((&EXT4_I(inode)->i_data_sem));
> +}
> +#endif
>  /*
>   * The ext4_map_blocks() function tries to look up the requested blocks,
>   * and returns if the blocks are already mapped.
> @@ -509,7 +544,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  {
>  	struct extent_status es;
>  	int retval;
> +#ifdef ES_AGGRESSIVE_TEST
> +	struct ext4_map_blocks orig_map;
>  
> +	memcpy(&orig_map, map, sizeof(*map));
> +#endif
>  	map->m_flags = 0;
>  	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
>  		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
> @@ -531,6 +570,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
>  		} else {
>  			BUG_ON(1);
>  		}
> +#ifdef ES_AGGRESSIVE_TEST
> +		ext4_map_blocks_es_recheck(handle,inode, map, retval, &orig_map, flags);
> +#endif
>  		goto found;
>  	}
>  
> -- 
> 1.7.1
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Feb. 25, 2013, 8:39 a.m.
On Sun, 24 Feb 2013 22:58:37 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> Hi Dmirty,
> 
> On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
> [snip]
> > From 65c5fc212b1c684c76899c6e5e1f24d88550c6fc Mon Sep 17 00:00:00 2001
> > From: Dmitry Monakhov <dmonakhov@openvz.org>
> > Date: Fri, 22 Feb 2013 20:55:52 +0400
> > Subject: [PATCH] ext4 add sanity ext4_es_lookup_extent
> > 
> > This patch does very simple thing: it recheck result returned from
> > ext4_es_lookup_extent() by comparing it old-good lookup via
> > ext4_{ind,ext}_map_blocks() under i_data_sem
> > 
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> 
> I try to apply the following patch in my tree, but I realize that it
> seems that we couldn't add this sanity check in ext4_map_blocks and
> ext4_da_map_blocks now.  The reason is that when we try to initialize an
> unwritten extent this extent could be zeroed out.  But we could know it
> in *_map_blocks, and status tree couldn't be updated.  So we will hit a
> BUG_ON(1) after added this sanity check.
This means that extent-status tree should be updated once zeroout
happen.

> 
> I agree with you that we need to add self-testing infrastructre after
> applied status tree patch series.  But it seems that we need to mix
> updating status tree code with extent tree code.  IMHO it is too
> complicated.  Any thought?
Many developers have invested significant amount of man-hours in to 
ext4's delay allocation state machine. It is now known as stable and reliable
So it reasonable to use it as a primary for self-testing of
extent-status tree. If es_cache != real_state this is means that we have
forget to update es_cache.
For example when extents was introduced it have many self testing things
such as:
1) CHECK_BINSEARCH__: Which recheck binarry search result.
        
2) AGGRESSIVE_TEST : Which force deep extent tree constructions and
                     allow to cover unusual branches.
3) DOUBLE_CHECK : Recheck block alloc bitmap one more time

I strongly believe in testing. Ext4 is production-grade filesystem
so we can not break it in the name of new features, unless we are sure
that features are safe and valuable, that's why investments in
self-testing infrastructure for ES should have very high priority.
But off course Theodore's decision whenever feature is looks stable
enough to get ready go upstream.
> 
> Thanks,
>                                                 - Zheng
> 
> > ---
> >  fs/ext4/inode.c |   42 ++++++++++++++++++++++++++++++++++++++++++
> >  1 files changed, 42 insertions(+), 0 deletions(-)
> > 
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 95a0c62..706db1f 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -482,6 +482,41 @@ static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
> >  	return num;
> >  }
> >  
> > +#define ES_AGGRESSIVE_TEST 1
> > +#ifdef ES_AGGRESSIVE_TEST
> > +void ext4_map_blocks_es_recheck(handle_t *handle, struct inode *inode,
> > +				struct ext4_map_blocks *es_map, int es_ret,
> > +				struct ext4_map_blocks *map, int flags)
> > +{
> > +	int ret;
> > +
> > +	map->m_flags = 0;
> > +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> > +		down_read((&EXT4_I(inode)->i_data_sem));
> > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
> > +		ret = ext4_ext_map_blocks(handle, inode, map, flags &
> > +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> > +	} else {
> > +		ret = ext4_ind_map_blocks(handle, inode, map, flags &
> > +					     EXT4_GET_BLOCKS_KEEP_SIZE);
> > +	}
> > +	if (es_map->m_lblk != map->m_lblk ||
> > +	    es_map->m_flags != map->m_flags ||
> > +	    es_map->m_len != map->m_len ||
> > +	    es_map->m_pblk != map->m_pblk ||
> > +	    es_ret != ret) {
> > +		printk("Assertation failed for inode:%lu "
> > +		       "es_cached_ex [%d/%d/%llu/%x]:%d != "
> > +		       "found_ex [%d/%d/%llu/%x]:%d\n", inode->i_ino,
> > +		       es_map->m_lblk, es_map->m_len, es_map->m_pblk,
> > +		       es_map->m_flags, es_ret,
> > +		       map->m_lblk, map->m_len, map->m_pblk, map->m_flags, ret);
> > +		BUG();
> > +	}
> > +	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
> > +		up_read((&EXT4_I(inode)->i_data_sem));
> > +}
> > +#endif
> >  /*
> >   * The ext4_map_blocks() function tries to look up the requested blocks,
> >   * and returns if the blocks are already mapped.
> > @@ -509,7 +544,11 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >  {
> >  	struct extent_status es;
> >  	int retval;
> > +#ifdef ES_AGGRESSIVE_TEST
> > +	struct ext4_map_blocks orig_map;
> >  
> > +	memcpy(&orig_map, map, sizeof(*map));
> > +#endif
> >  	map->m_flags = 0;
> >  	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
> >  		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
> > @@ -531,6 +570,9 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
> >  		} else {
> >  			BUG_ON(1);
> >  		}
> > +#ifdef ES_AGGRESSIVE_TEST
> > +		ext4_map_blocks_es_recheck(handle,inode, map, retval, &orig_map, flags);
> > +#endif
> >  		goto found;
> >  	}
> >  
> > -- 
> > 1.7.1
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - Feb. 25, 2013, 9:57 a.m.
On Mon, Feb 25, 2013 at 12:39:21PM +0400, Dmitry Monakhov wrote:
> On Sun, 24 Feb 2013 22:58:37 +0800, Zheng Liu <gnehzuil.liu@gmail.com> wrote:
> > Hi Dmirty,
> > 
> > On Fri, Feb 22, 2013 at 09:17:57PM +0400, Dmitry Monakhov wrote:
> > [snip]
> > > From 65c5fc212b1c684c76899c6e5e1f24d88550c6fc Mon Sep 17 00:00:00 2001
> > > From: Dmitry Monakhov <dmonakhov@openvz.org>
> > > Date: Fri, 22 Feb 2013 20:55:52 +0400
> > > Subject: [PATCH] ext4 add sanity ext4_es_lookup_extent
> > > 
> > > This patch does very simple thing: it recheck result returned from
> > > ext4_es_lookup_extent() by comparing it old-good lookup via
> > > ext4_{ind,ext}_map_blocks() under i_data_sem
> > > 
> > > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > 
> > I try to apply the following patch in my tree, but I realize that it
> > seems that we couldn't add this sanity check in ext4_map_blocks and
> > ext4_da_map_blocks now.  The reason is that when we try to initialize an
> > unwritten extent this extent could be zeroed out.  But we could know it
> > in *_map_blocks, and status tree couldn't be updated.  So we will hit a
> > BUG_ON(1) after added this sanity check.
> This means that extent-status tree should be updated once zeroout
> happen.
> 
> > 
> > I agree with you that we need to add self-testing infrastructre after
> > applied status tree patch series.  But it seems that we need to mix
> > updating status tree code with extent tree code.  IMHO it is too
> > complicated.  Any thought?
> Many developers have invested significant amount of man-hours in to 
> ext4's delay allocation state machine. It is now known as stable and reliable
> So it reasonable to use it as a primary for self-testing of
> extent-status tree. If es_cache != real_state this is means that we have
> forget to update es_cache.
> For example when extents was introduced it have many self testing things
> such as:
> 1) CHECK_BINSEARCH__: Which recheck binarry search result.
>         
> 2) AGGRESSIVE_TEST : Which force deep extent tree constructions and
>                      allow to cover unusual branches.
> 3) DOUBLE_CHECK : Recheck block alloc bitmap one more time

Thanks for your reminder.  I will add it.

> 
> I strongly believe in testing. Ext4 is production-grade filesystem
> so we can not break it in the name of new features, unless we are sure
> that features are safe and valuable, that's why investments in
> self-testing infrastructure for ES should have very high priority.
> But off course Theodore's decision whenever feature is looks stable
> enough to get ready go upstream.

Yes, I agree with you that we shouldn't break a production-grade file
system.  Certainly Ted has the final decision.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Feb. 26, 2013, 8:06 p.m.
On Mon, Feb 25, 2013 at 05:57:21PM +0800, Zheng Liu wrote:
> > I strongly believe in testing. Ext4 is production-grade filesystem
> > so we can not break it in the name of new features, unless we are sure
> > that features are safe and valuable, that's why investments in
> > self-testing infrastructure for ES should have very high priority.
> > But off course Theodore's decision whenever feature is looks stable
> > enough to get ready go upstream.
> 
> Yes, I agree with you that we shouldn't break a production-grade file
> system.  Certainly Ted has the final decision.

I agree that having better self-testing for sanity check features
would be a good thing to add, so that as we make changes in the
future, it's much more likely that we will find potential problems
sooner rather than later.

Given that amount of testing that we've done looking for regressions,
I'm feeling pretty good about the stability of the ext4 tree, so I'm
ready to send a pull request to Linus with the extent status tree. 

Yes, we do have the problem with the defrag ioctl, but I trust that
Zheng will have a fix for us before v3.9 releases.  Worst case, we can
just flush the extent status tree entirely which should
(inefficiently) solve the defrag bug.

Cheers,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

From 65c5fc212b1c684c76899c6e5e1f24d88550c6fc Mon Sep 17 00:00:00 2001
From: Dmitry Monakhov <dmonakhov@openvz.org>
Date: Fri, 22 Feb 2013 20:55:52 +0400
Subject: [PATCH] ext4 add sanity ext4_es_lookup_extent

This patch does very simple thing: it recheck result returned from
ext4_es_lookup_extent() by comparing it old-good lookup via
ext4_{ind,ext}_map_blocks() under i_data_sem

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/inode.c |   42 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 95a0c62..706db1f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -482,6 +482,41 @@  static pgoff_t ext4_num_dirty_pages(struct inode *inode, pgoff_t idx,
 	return num;
 }
 
+#define ES_AGGRESSIVE_TEST 1
+#ifdef ES_AGGRESSIVE_TEST
+void ext4_map_blocks_es_recheck(handle_t *handle, struct inode *inode,
+				struct ext4_map_blocks *es_map, int es_ret,
+				struct ext4_map_blocks *map, int flags)
+{
+	int ret;
+
+	map->m_flags = 0;
+	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
+		down_read((&EXT4_I(inode)->i_data_sem));
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
+		ret = ext4_ext_map_blocks(handle, inode, map, flags &
+					     EXT4_GET_BLOCKS_KEEP_SIZE);
+	} else {
+		ret = ext4_ind_map_blocks(handle, inode, map, flags &
+					     EXT4_GET_BLOCKS_KEEP_SIZE);
+	}
+	if (es_map->m_lblk != map->m_lblk ||
+	    es_map->m_flags != map->m_flags ||
+	    es_map->m_len != map->m_len ||
+	    es_map->m_pblk != map->m_pblk ||
+	    es_ret != ret) {
+		printk("Assertation failed for inode:%lu "
+		       "es_cached_ex [%d/%d/%llu/%x]:%d != "
+		       "found_ex [%d/%d/%llu/%x]:%d\n", inode->i_ino,
+		       es_map->m_lblk, es_map->m_len, es_map->m_pblk,
+		       es_map->m_flags, es_ret,
+		       map->m_lblk, map->m_len, map->m_pblk, map->m_flags, ret);
+		BUG();
+	}
+	if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
+		up_read((&EXT4_I(inode)->i_data_sem));
+}
+#endif
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -509,7 +544,11 @@  int ext4_map_blocks(handle_t *handle, struct inode *inode,
 {
 	struct extent_status es;
 	int retval;
+#ifdef ES_AGGRESSIVE_TEST
+	struct ext4_map_blocks orig_map;
 
+	memcpy(&orig_map, map, sizeof(*map));
+#endif
 	map->m_flags = 0;
 	ext_debug("ext4_map_blocks(): inode %lu, flag %d, max_blocks %u,"
 		  "logical block %lu\n", inode->i_ino, flags, map->m_len,
@@ -531,6 +570,9 @@  int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		} else {
 			BUG_ON(1);
 		}
+#ifdef ES_AGGRESSIVE_TEST
+		ext4_map_blocks_es_recheck(handle,inode, map, retval, &orig_map, flags);
+#endif
 		goto found;
 	}
 
-- 
1.7.1