fsstress-induced corruption reproduced

Message ID 4B427507.40004@redhat.com
State Superseded, archived

Commit Message

Eric Sandeen Jan. 4, 2010, 11:08 p.m. UTC
Eric Sandeen wrote:
> Theodore Ts'o wrote:
>> One of the things which has been annoying me for a while now is a
>> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes a
>> test failure because the file system is found to be inconsistent:
>>
>> Inode NNN, i_blocks is X, should be Y.
> 
> Interesting, this apparently has gotten much worse since 2.6.32.
> 
> I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
> off on 2.6.33-rc2.
> 
> Probably should find out why ;) I'll go take a look.

commit d21cd8f163ac44b15c465aab7306db931c606908
Author: Dmitry Monakhov <dmonakhov@openvz.org>
Date:   Thu Dec 10 03:31:45 2009 +0000

    ext4: Fix potential quota deadlock

seems to be the culprit.

(unfortunately this means that the error we saw before is something
-else- to be fixed, yet)  Anyway ...

This is because we used to do this in ext4_mb_mark_diskspace_used() :

        /*
         * Now reduce the dirty block count also. Should not go negative
         */
        if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
                /* release all the reserved blocks if non delalloc */
                percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                                                reserv_blks);
        else {
                percpu_counter_sub(&sbi->s_dirtyblocks_counter,
                                                ac->ac_b_ex.fe_len);
                /* convert reserved quota blocks to real quota blocks */
                vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
	}

i.e. the vfs_dq_claim_block was conditional based on
EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
because we had already preallocated the blocks.

But with the above quota deadlock commit it's now unconditional
in ext4_da_update_reserve_space and we always call
vfs_dq_claim_block, which over-accounts.

Of course with the above commit, we have no allocation context in
ext4_da_update_reserve_space... that's all long gone so we can't key
on that anymore.

However, I think the following change will fix it; I'll run it through
xfstests later on and be sure nothing else regresses.

-Eric

-Eric

> -Eric
> 
>> I finally reproduced it; the problem happens when we fallocate() a
>> region of the file which we had recently written, and which is still in
>> the page cache marked as delayed allocation blocks.  When we finally
>> write those blocks out, since they are marked BH_Delay,
>> ext4_get_blocks() calls ext4_da_update_reserve_space(), which ends up
>> bumping i_blocks a second time and charging the blocks against the
>> user's quota a second time.  Oops.
>>
>> Fortunately the fsck problem is one that will be fixed with a preen (and
>> if quota is enabled, a quotacheck), so it's not super serious, but we
>> should fix it when we have a chance.  If anyone has time to look at it,
>> please let me know.  Otherwise, I'll put it on my todo list.  I don't
>> consider it seriously urgent since the case is highly unlikely to occur in
>> real life, and it doesn't have any security implications; the worst an
>> attacker could do is end up charging excess quota to herself.
>>
>> I've included a simple reproduction case below; if you run this program,
>> it will create a file "test-file" in the current working directory which
>> will appear to be 32k, even though it is really only 16k long, and if
>> you then unmount the test file system and run e2fsck -p on it, you will get
>> the error message:
>>
>> Inode XXX, i_blocks is 64, should be 32.  FIXED.
>>
>> 	       	     	    	     	     - Ted
>>
>> #define _GNU_SOURCE
>>
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <unistd.h>
>> #include <string.h>
>> #include <sys/types.h>
>> #include <fcntl.h>
>>
>> #define BUFSIZE 1024
>>
>> int main(int argc, char **argv)
>> {
>> 	int	i, fd, ret;
>> 	char	buf[BUFSIZE];
>>
>> 	fd = open("test-file", O_RDWR|O_CREAT|O_TRUNC, 0644);
>> 	if (fd < 0) {
>> 		perror("open");
>> 		exit(1);
>> 	}
>> 	memset(&buf, 0, BUFSIZE);
>> 	for (i=0; i < 16; i++) {
>> 		ret = write(fd, &buf, BUFSIZE);
>> 		if (ret < 0) {
>> 			perror("write");
>> 			exit(1);
>> 		}
>> 		if (ret != BUFSIZE) {
>> 			fprintf(stderr, "Write return expected %d, got %d\n",
>> 				BUFSIZE, ret);
>> 			exit(1);
>> 		}
>> 	}
>> 	ret = fallocate(fd, 0, 0, 16384);
>> 	if (ret < 0) {
>> 		perror("fallocate");
>> 		exit(1);
>> 	}
>> 	ret = fsync(fd);
>> 	if (ret < 0) {
>> 		perror("fsync");
>> 		exit(1);
>> 	}
>> 	ret = close(fd);
>> 	if (ret < 0) {
>> 		perror("close");
>> 		exit(1);
>> 	}
>> 	exit(0);
>> }
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html

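The over-accounting Ted describes can also be seen before unmounting, because stat() reports i_blocks as st_blocks in 512-byte units. The helper below is a minimal sketch and not part of the original thread: expected_sectors() and check_blocks() are made-up names, the block size is passed in by the caller, and it assumes a file with no holes and no extra metadata blocks charged to the inode.

```c
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

/*
 * Sectors (512-byte units, as in st_blocks) needed to hold `size` bytes
 * when data is allocated in whole file system blocks of `blksize` bytes.
 */
static long long expected_sectors(long long size, long long blksize)
{
	long long fs_blocks = (size + blksize - 1) / blksize;

	return fs_blocks * (blksize / 512);
}

/*
 * Compare st_blocks with what the file size implies.
 * Returns 0 on a match, 1 on a mismatch, -1 if stat() fails.
 */
static int check_blocks(const char *path, long long blksize)
{
	struct stat st;
	long long want;

	if (stat(path, &st) < 0) {
		perror("stat");
		return -1;
	}
	want = expected_sectors((long long)st.st_size, blksize);
	printf("%s: st_blocks is %lld, expected %lld\n",
	       path, (long long)st.st_blocks, want);
	return (long long)st.st_blocks != want;
}
```

Calling check_blocks("test-file", 4096) right after the reproducer has run would, going by the numbers in Ted's message, be expected to show 64 against an expected 32 on an affected kernel.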
Comments

Aneesh Kumar K.V Jan. 5, 2010, 6:17 a.m. UTC | #1
On Mon, Jan 04, 2010 at 05:08:55PM -0600, Eric Sandeen wrote:
> Eric Sandeen wrote:
> > Theodore Ts'o wrote:
> >> One of the things which has been annoying me for a while now is a
> >> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes a
> >> test failure because the file system is found to be inconsistent:
> >>
> >> Inode NNN, i_blocks is X, should be Y.
> > 
> > Interesting, this apparently has gotten much worse since 2.6.32.
> > 
> > I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
> > off on 2.6.33-rc2.
> > 
> > Probably should find out why ;) I'll go take a look.
> 
> commit d21cd8f163ac44b15c465aab7306db931c606908
> Author: Dmitry Monakhov <dmonakhov@openvz.org>
> Date:   Thu Dec 10 03:31:45 2009 +0000
> 
>     ext4: Fix potential quota deadlock
> 
> seems to be the culprit.
> 
> (unfortunately this means that the error we saw before is something
> -else- to be fixed, yet)  Anyway ...
> 
> This is because we used to do this in ext4_mb_mark_diskspace_used() :
> 
>         /*
>          * Now reduce the dirty block count also. Should not go negative
>          */
>         if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
>                 /* release all the reserved blocks if non delalloc */
>                 percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>                                                 reserv_blks);
>         else {
>                 percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>                                                 ac->ac_b_ex.fe_len);
>                 /* convert reserved quota blocks to real quota blocks */
>                 vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
> 	}
> 
> i.e. the vfs_dq_claim_block was conditional based on
> EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
> because we had already preallocated the blocks.
> 
> But with the above quota deadlock commit it's now unconditional
> in ext4_da_update_reserve_space and we always call
> vfs_dq_claim_block, which over-accounts.
> 

It is still conditional, right? We call ext4_da_update_reserve_space
only if EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE is set. That will
happen only in the case of delayed allocation. I guess the problem is
the same as what Ted stated. But I am not sure why we are able to
reproduce it much more easily on 2.6.33-rc2.


> Of course with the above commit, we have no allocation context in
> ext4_da_update_reserve_space... that's all long gone so we can't key
> on that anymore.
> 
> However, I think the following change will fix it; I'll run it through
> xfstests later on and be sure nothing else regresses.
> 
> -Eric
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5352db1..28cd8d8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1257,9 +1257,10 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block,
>          * if the caller is from delayed allocation writeout path
>          * we have already reserved fs blocks for allocation
>          * let the underlying get_block() function know to
> -        * avoid double accounting
> +        * avoid double accounting.  Ditto for prealloc blocks.
>          */
> -       if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
> +       if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ||
> +           flags & EXT4_GET_BLOCKS_UNINIT_EXT)
>                 EXT4_I(inode)->i_delalloc_reserved_flag = 1;
>         /*
>          * We need to check for EXT4 here because migrate
> 


But we need to do the quota update during the fallocate call. Doing the
above will result in us not doing that. We will also get block accounting
wrong because we now won't be doing ext4_claim_free_blocks for fallocate.

I guess what we need is to make sure that if any buffer_heads map the
same block range allocated via fallocate, and they are marked BH_Delay,
we clear the delay flag and update the block reservation. Later, during
writepage, we will find these buffer_heads mapped/non-delay and will do
the right thing.

-aneesh
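
To make the sequence concrete, here is a toy user-space model of the double charge Ted describes and of the suggestion above (clear the delay flag and give back the reservation at fallocate time). It is only a sketch under loud assumptions: none of this is kernel code, every toy_* name is hypothetical, and "charged" counts file system blocks, with 16 blocks standing in for the reproducer's 16 writes of 1k.

```c
#include <stdbool.h>

#define NBLOCKS 16	/* the reproducer writes 16 chunks of 1k */

struct toy_inode {
	bool delay[NBLOCKS];	/* stands in for BH_Delay on a cached buffer */
	bool mapped[NBLOCKS];	/* block has a real allocation */
	long reserved;		/* blocks reserved for delayed allocation */
	long charged;		/* blocks charged to the inode and to quota */
};

/* Buffered write: nothing is allocated yet, just reserve and mark delayed. */
static void toy_write(struct toy_inode *ino, int blk)
{
	if (!ino->mapped[blk] && !ino->delay[blk]) {
		ino->delay[blk] = true;
		ino->reserved++;
	}
}

/*
 * fallocate: allocate and charge every unmapped block in [0, len).
 * With fix_delay set, also clear the delay flag and release the
 * reservation for blocks that were written but not yet flushed.
 */
static void toy_fallocate(struct toy_inode *ino, int len, bool fix_delay)
{
	for (int blk = 0; blk < len; blk++) {
		if (!ino->mapped[blk]) {
			ino->mapped[blk] = true;
			ino->charged++;
		}
		if (fix_delay && ino->delay[blk]) {
			ino->delay[blk] = false;
			ino->reserved--;
		}
	}
}

/*
 * Writeback: a buffer still marked delayed converts its reservation
 * into a charge, even if fallocate already charged that block.
 */
static void toy_writeback(struct toy_inode *ino)
{
	for (int blk = 0; blk < NBLOCKS; blk++) {
		if (ino->delay[blk]) {
			ino->delay[blk] = false;
			ino->mapped[blk] = true;
			ino->reserved--;
			ino->charged++;
		}
	}
}

/* Replay write, fallocate, fsync and return the final charge. */
static long toy_run(bool fix_delay)
{
	struct toy_inode ino = { 0 };

	for (int blk = 0; blk < NBLOCKS; blk++)
		toy_write(&ino, blk);
	toy_fallocate(&ino, NBLOCKS, fix_delay);
	toy_writeback(&ino);
	return ino.charged;
}
```

In this model toy_run(false) ends with twice the charge of toy_run(true), which mirrors the "is 64, should be 32" ratio in the e2fsck message; the real fix would of course have to deal with locking, partial ranges and the quota calls, none of which the model attempts.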

Eric Sandeen Jan. 5, 2010, 2:40 p.m. UTC | #2
Aneesh Kumar K.V wrote:
> On Mon, Jan 04, 2010 at 05:08:55PM -0600, Eric Sandeen wrote:
>> Eric Sandeen wrote:
>>> Theodore Ts'o wrote:
>>>> One of the things which has been annoying me for a while now is a
>>>> hard-to-reproduce xfsqa failure in test #13 (fsstress), which causes a
>>>> test failure because the file system is found to be inconsistent:
>>>>
>>>> Inode NNN, i_blocks is X, should be Y.
>>> Interesting, this apparently has gotten much worse since 2.6.32.
>>>
>>> I wrote an xfstests reproducer, and couldn't hit it on .32; hit it right
>>> off on 2.6.33-rc2.
>>>
>>> Probably should find out why ;) I'll go take a look.
>> commit d21cd8f163ac44b15c465aab7306db931c606908
>> Author: Dmitry Monakhov <dmonakhov@openvz.org>
>> Date:   Thu Dec 10 03:31:45 2009 +0000
>>
>>     ext4: Fix potential quota deadlock
>>
>> seems to be the culprit.
>>
>> (unfortunately this means that the error we saw before is something
>> -else- to be fixed, yet)  Anyway ...
>>
>> This is because we used to do this in ext4_mb_mark_diskspace_used() :
>>
>>         /*
>>          * Now reduce the dirty block count also. Should not go negative
>>          */
>>         if (!(ac->ac_flags & EXT4_MB_DELALLOC_RESERVED))
>>                 /* release all the reserved blocks if non delalloc */
>>                 percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>>                                                 reserv_blks);
>>         else {
>>                 percpu_counter_sub(&sbi->s_dirtyblocks_counter,
>>                                                 ac->ac_b_ex.fe_len);
>>                 /* convert reserved quota blocks to real quota blocks */
>>                 vfs_dq_claim_block(ac->ac_inode, ac->ac_b_ex.fe_len);
>> 	}
>>
>> i.e. the vfs_dq_claim_block was conditional based on
>> EXT4_MB_DELALLOC_RESERVED... and the testcase did not go that way,
>> because we had already preallocated the blocks.
>>
>> But with the above quota deadlock commit it's now unconditional
>> in ext4_da_update_reserve_space and we always call
>> vfs_dq_claim_block, which over-accounts.
>>
> 
> It is still conditional, right? We call ext4_da_update_reserve_space
> only if EXT4_GET_BLOCKS_UPDATE_RESERVE_SPACE is set. That will
> happen only in the case of delayed allocation. I guess the problem is
> the same as what Ted stated. But I am not sure why we are able to
> reproduce it much more easily on 2.6.33-rc2.
> 

Well, I'll take another look.  But back out the above commit and I think
you'll see that it changed things to make it 100% reproducible.

-Eric
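
For reference while comparing kernels, the old branch quoted in this thread can be restated in user space as below. This is a sketch that mirrors only the quoted snippet: plain longs stand in for the percpu counter and the quota calls, TOY_DELALLOC_RESERVED has a made-up value, and the toy_* names are hypothetical.

```c
/* Hypothetical value; stands in for EXT4_MB_DELALLOC_RESERVED. */
#define TOY_DELALLOC_RESERVED 0x1u

struct toy_counters {
	long dirtyblocks;	/* models sbi->s_dirtyblocks_counter */
	long quota_claimed;	/* blocks converted from reserved to real quota */
};

/*
 * Accounting done by the quoted branch: the quota claim happens only
 * on the delalloc path, never for a non-delalloc allocation.
 */
static void toy_mark_diskspace_used(struct toy_counters *c,
				    unsigned int ac_flags,
				    long reserv_blks, long fe_len)
{
	if (!(ac_flags & TOY_DELALLOC_RESERVED)) {
		/* release all the reserved blocks if non delalloc */
		c->dirtyblocks -= reserv_blks;
	} else {
		c->dirtyblocks -= fe_len;
		/* convert reserved quota blocks to real quota blocks */
		c->quota_claimed += fe_len;
	}
}
```

The point of contention in the thread is whether the claim stayed behind an equivalent condition after the quota deadlock commit; the sketch only pins down what the pre-commit code did.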

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5352db1..28cd8d8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1257,9 +1257,10 @@ int ext4_get_blocks(handle_t *handle, struct inode *inode, sector_t block,
         * if the caller is from delayed allocation writeout path
         * we have already reserved fs blocks for allocation
         * let the underlying get_block() function know to
-        * avoid double accounting
+        * avoid double accounting.  Ditto for prealloc blocks.
         */
-       if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE)
+       if (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE ||
+           flags & EXT4_GET_BLOCKS_UNINIT_EXT)
                EXT4_I(inode)->i_delalloc_reserved_flag = 1;
        /*
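
A note on the new condition: `flags & A || flags & B` parses as `(flags & A) || (flags & B)` because `&` binds tighter than `||`, so it is true when either bit is set. The sketch below checks that reading against the single-mask form. The flag values are hypothetical placeholders (the real ones live in the ext4 headers and are not reproduced here), and the function names are made up.

```c
/* Hypothetical bit values, for illustration only. */
#define TOY_DELALLOC_RESERVE	0x0004u
#define TOY_UNINIT_EXT		0x0002u

/* The condition exactly as written in the patch. */
static int toy_sets_flag(unsigned int flags)
{
	return (flags & TOY_DELALLOC_RESERVE ||
		flags & TOY_UNINIT_EXT) ? 1 : 0;
}

/* Equivalent form that tests both bits with one mask. */
static int toy_sets_flag_masked(unsigned int flags)
{
	return (flags & (TOY_DELALLOC_RESERVE | TOY_UNINIT_EXT)) != 0;
}
```

Either spelling sets i_delalloc_reserved_flag for both the delalloc and the uninitialized-extent case; the choice between them is purely stylistic.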