diff mbox series

[RFC] ext4: fix partial cluster initialization when splitting extent

Message ID 1589444097-38535-1-git-send-email-jefflexu@linux.alibaba.com
State New
Headers show
Series [RFC] ext4: fix partial cluster initialization when splitting extent | expand

Commit Message

JeffleXu May 14, 2020, 8:14 a.m. UTC
Hi Eric, would you mind explaining why the magic number '2' is used here
when calculating the physical cluster number of the partial cluster in
commit f4226d9ea400 ("ext4: fix partial cluster initialization") ?

```
+     /*
+      * If we're going to split the extent, note that
+      * the cluster containing the block after 'end' is
+      * in use to avoid freeing it when removing blocks.
+      */
+     if (sbi->s_cluster_ratio > 1) {
+             pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
+             partial_cluster =
+                     -(long long) EXT4_B2C(sbi, pblk);
+     }
```

As far as I understand, there we are initializing the partial cluster
describing the beginning of the split extent after 'end'. The
corrsponding physical block number of the first block in the split
extent should be 'ext4_ext_pblock(ex) + end - ee_block + 1'.

This bug will cause xfstests shared/298 failure on ext4 with bigalloc
enabled sometimes. Ext4 error messages indicate that previously freed
blocks are being freed again, and the following fsck will fail due to
the inconsistency of block bitmap and bg descriptor.

The following is an example case:

1. First, Initialize a ext4 filesystem with cluster size '16K', block size
'4K', in which case, one cluster contains four blocks.

2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
tree of this file is like:

...
36864:[0]4:220160
36868:[0]14332:145408
51200:[0]2:231424
...

3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
like:

..
ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
...

4. Then the extent tree of this file after punching is like

...
49507:[0]37:158047
49547:[0]58:158087
...

5. Detailed procedure of punching hole [49544, 49546]

5.1. The block address space:
```
lblk        ~49505  49506   49507~49543     49544~49546    49547~
	  ---------+------+-------------+----------------+--------
	    extent | hole |   extent	|	hole	 | extent
	  ---------+------+-------------+----------------+--------
pblk       ~158045  158046  158047~158083  158084~158086   158087~
```

5.2. The detailed layout of cluster 39521:
```
		cluster 39521
	<------------------------------->

		hole		  extent
	<----------------------><--------

lblk      49544   49545   49546   49547
	+-------+-------+-------+-------+
	|	|	|	|	|
	+-------+-------+-------+-------+
pblk     158084  1580845  158086  158087
```

5.3. The ftrace output when punching hole [49544, 49546]:
- ext4_ext_remove_space (start 49544, end 49546)
  - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
    - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
      - ext4_free_blocks: (block 158084 count 4)
        - ext4_mballoc_free (extent 1/6753/1)

In this case, the whole cluster 39521 is freed mistakenly when freeing
pblock 158084~158086 (i.e., the first three blocks of this cluster),
although pblock 158087 (the last remaining block of this cluster) has
not been freed yet.

The root cause of this isuue is that, the pclu of the partial cluster is
calculated mistakenly in ext4_ext_remove_space(). The correct
partial_cluster.pclu (i.e., the cluster number of the first block in the
next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
than 39522.

Fixes: f4226d9ea400 ("ext4: fix partial cluster initialization")
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
---
 fs/ext4/extents.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comments

Eric Whitney May 14, 2020, 10:21 p.m. UTC | #1
* Jeffle Xu <jefflexu@linux.alibaba.com>:
> Hi Eric, would you mind explaining why the magic number '2' is used here
> when calculating the physical cluster number of the partial cluster in
> commit f4226d9ea400 ("ext4: fix partial cluster initialization") ?
> 
> ```
> +     /*
> +      * If we're going to split the extent, note that
> +      * the cluster containing the block after 'end' is
> +      * in use to avoid freeing it when removing blocks.
> +      */
> +     if (sbi->s_cluster_ratio > 1) {
> +             pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
> +             partial_cluster =
> +                     -(long long) EXT4_B2C(sbi, pblk);
> +     }
> ```
> 
> As far as I understand, there we are initializing the partial cluster
> describing the beginning of the split extent after 'end'. The
> corrsponding physical block number of the first block in the split
> extent should be 'ext4_ext_pblock(ex) + end - ee_block + 1'.
> 
> This bug will cause xfstests shared/298 failure on ext4 with bigalloc
> enabled sometimes. Ext4 error messages indicate that previously freed
> blocks are being freed again, and the following fsck will fail due to
> the inconsistency of block bitmap and bg descriptor.
> 
> The following is an example case:
> 
> 1. First, Initialize a ext4 filesystem with cluster size '16K', block size
> '4K', in which case, one cluster contains four blocks.
> 
> 2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
> tree of this file is like:
> 
> ...
> 36864:[0]4:220160
> 36868:[0]14332:145408
> 51200:[0]2:231424
> ...
> 
> 3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
> like:
> 
> ..
> ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
> ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
> ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
> ...
> 
> 4. Then the extent tree of this file after punching is like
> 
> ...
> 49507:[0]37:158047
> 49547:[0]58:158087
> ...
> 
> 5. Detailed procedure of punching hole [49544, 49546]
> 
> 5.1. The block address space:
> ```
> lblk        ~49505  49506   49507~49543     49544~49546    49547~
> 	  ---------+------+-------------+----------------+--------
> 	    extent | hole |   extent	|	hole	 | extent
> 	  ---------+------+-------------+----------------+--------
> pblk       ~158045  158046  158047~158083  158084~158086   158087~
> ```
> 
> 5.2. The detailed layout of cluster 39521:
> ```
> 		cluster 39521
> 	<------------------------------->
> 
> 		hole		  extent
> 	<----------------------><--------
> 
> lblk      49544   49545   49546   49547
> 	+-------+-------+-------+-------+
> 	|	|	|	|	|
> 	+-------+-------+-------+-------+
> pblk     158084  1580845  158086  158087
> ```
> 
> 5.3. The ftrace output when punching hole [49544, 49546]:
> - ext4_ext_remove_space (start 49544, end 49546)
>   - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
>     - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
>       - ext4_free_blocks: (block 158084 count 4)
>         - ext4_mballoc_free (extent 1/6753/1)
> 
> In this case, the whole cluster 39521 is freed mistakenly when freeing
> pblock 158084~158086 (i.e., the first three blocks of this cluster),
> although pblock 158087 (the last remaining block of this cluster) has
> not been freed yet.
> 
> The root cause of this isuue is that, the pclu of the partial cluster is
> calculated mistakenly in ext4_ext_remove_space(). The correct
> partial_cluster.pclu (i.e., the cluster number of the first block in the
> next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
> than 39522.
> 
> Fixes: f4226d9ea400 ("ext4: fix partial cluster initialization")
> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> ---
>  fs/ext4/extents.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index f2b577b..cb74496 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -2828,7 +2828,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
>  			 * in use to avoid freeing it when removing blocks.
>  			 */
>  			if (sbi->s_cluster_ratio > 1) {
> -				pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
> +				pblk = ext4_ext_pblock(ex) + end - ee_block + 1;
>  				partial.pclu = EXT4_B2C(sbi, pblk);
>  				partial.state = nofree;
>  			}
> -- 
> 1.8.3.1
> 

Hi, Jeffle:

Thanks for the report and the patch.  At first glance I suspect the "2" is
simply a bug; logically we're just looking for the first block after the
extent split to set the partial cluster, as you suggest.  I'll post the
results of my review once I've had a chance to refresh my memory of the
code and run some more tests.

Thanks,
Eric
Eric Whitney May 18, 2020, 10:08 p.m. UTC | #2
* Eric Whitney <enwlinux@gmail.com>:
> * Jeffle Xu <jefflexu@linux.alibaba.com>:
> > Hi Eric, would you mind explaining why the magic number '2' is used here
> > when calculating the physical cluster number of the partial cluster in
> > commit f4226d9ea400 ("ext4: fix partial cluster initialization") ?
> > 
> > ```
> > +     /*
> > +      * If we're going to split the extent, note that
> > +      * the cluster containing the block after 'end' is
> > +      * in use to avoid freeing it when removing blocks.
> > +      */
> > +     if (sbi->s_cluster_ratio > 1) {
> > +             pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
> > +             partial_cluster =
> > +                     -(long long) EXT4_B2C(sbi, pblk);
> > +     }
> > ```
> > 
> > As far as I understand, there we are initializing the partial cluster
> > describing the beginning of the split extent after 'end'. The
> > corrsponding physical block number of the first block in the split
> > extent should be 'ext4_ext_pblock(ex) + end - ee_block + 1'.
> > 
> > This bug will cause xfstests shared/298 failure on ext4 with bigalloc
> > enabled sometimes. Ext4 error messages indicate that previously freed
> > blocks are being freed again, and the following fsck will fail due to
> > the inconsistency of block bitmap and bg descriptor.
> > 
> > The following is an example case:
> > 
> > 1. First, Initialize a ext4 filesystem with cluster size '16K', block size
> > '4K', in which case, one cluster contains four blocks.
> > 
> > 2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
> > tree of this file is like:
> > 
> > ...
> > 36864:[0]4:220160
> > 36868:[0]14332:145408
> > 51200:[0]2:231424
> > ...
> > 
> > 3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
> > like:
> > 
> > ..
> > ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
> > ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
> > ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
> > ...
> > 
> > 4. Then the extent tree of this file after punching is like
> > 
> > ...
> > 49507:[0]37:158047
> > 49547:[0]58:158087
> > ...
> > 
> > 5. Detailed procedure of punching hole [49544, 49546]
> > 
> > 5.1. The block address space:
> > ```
> > lblk        ~49505  49506   49507~49543     49544~49546    49547~
> > 	  ---------+------+-------------+----------------+--------
> > 	    extent | hole |   extent	|	hole	 | extent
> > 	  ---------+------+-------------+----------------+--------
> > pblk       ~158045  158046  158047~158083  158084~158086   158087~
> > ```
> > 
> > 5.2. The detailed layout of cluster 39521:
> > ```
> > 		cluster 39521
> > 	<------------------------------->
> > 
> > 		hole		  extent
> > 	<----------------------><--------
> > 
> > lblk      49544   49545   49546   49547
> > 	+-------+-------+-------+-------+
> > 	|	|	|	|	|
> > 	+-------+-------+-------+-------+
> > pblk     158084  1580845  158086  158087
> > ```
> > 
> > 5.3. The ftrace output when punching hole [49544, 49546]:
> > - ext4_ext_remove_space (start 49544, end 49546)
> >   - ext4_ext_rm_leaf (start 49544, end 49546, last_extent [49507(158047), 40], partial [pclu 39522 lblk 0 state 2])
> >     - ext4_remove_blocks (extent [49507(158047), 40], from 49544 to 49546, partial [pclu 39522 lblk 0 state 2]
> >       - ext4_free_blocks: (block 158084 count 4)
> >         - ext4_mballoc_free (extent 1/6753/1)
> > 
> > In this case, the whole cluster 39521 is freed mistakenly when freeing
> > pblock 158084~158086 (i.e., the first three blocks of this cluster),
> > although pblock 158087 (the last remaining block of this cluster) has
> > not been freed yet.
> > 
> > The root cause of this isuue is that, the pclu of the partial cluster is
> > calculated mistakenly in ext4_ext_remove_space(). The correct
> > partial_cluster.pclu (i.e., the cluster number of the first block in the
> > next extent, that is, lblock 49597 (pblock 158086)) should be 39521 rather
> > than 39522.
> > 
> > Fixes: f4226d9ea400 ("ext4: fix partial cluster initialization")
> > Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
> > ---
> >  fs/ext4/extents.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > index f2b577b..cb74496 100644
> > --- a/fs/ext4/extents.c
> > +++ b/fs/ext4/extents.c
> > @@ -2828,7 +2828,7 @@ int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
> >  			 * in use to avoid freeing it when removing blocks.
> >  			 */
> >  			if (sbi->s_cluster_ratio > 1) {
> > -				pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
> > +				pblk = ext4_ext_pblock(ex) + end - ee_block + 1;
> >  				partial.pclu = EXT4_B2C(sbi, pblk);
> >  				partial.state = nofree;
> >  			}
> > -- 
> > 1.8.3.1
> > 
> 
> Hi, Jeffle:
> 
> Thanks for the report and the patch.  At first glance I suspect the "2" is
> simply a bug; logically we're just looking for the first block after the
> extent split to set the partial cluster, as you suggest.  I'll post the
> results of my review once I've had a chance to refresh my memory of the
> code and run some more tests.
> 
> Thanks,
> Eric


Hi, Jeffle:

What kernel were you running when you observed your failures?  Does your
patch resolve all observed failures, or do any remain?  Do you have a
simple test script that reproduces the bug?

I've made almost 1000 runs of shared/298 on various bigalloc configurations
using Ted's test appliance on 5.7-rc5 and have not observed a failure.
Several auto group runs have also passed without failures.  Ideally, I'd
like to be able to reproduce your failure to be sure we fully understand
what's going on.  It's still the case that the "2" is wrong, but I think
that code in rm_leaf may be involved in an unexpected way.

Thanks,
Eric
JeffleXu May 19, 2020, 3:29 a.m. UTC | #3
On 5/19/20 6:08 AM, Eric Whitney wrote:
> Hi, Jeffle:
>
> What kernel were you running when you observed your failures?  Does your
> patch resolve all observed failures, or do any remain?  Do you have a
> simple test script that reproduces the bug?
>
> I've made almost 1000 runs of shared/298 on various bigalloc configurations
> using Ted's test appliance on 5.7-rc5 and have not observed a failure.
> Several auto group runs have also passed without failures.  Ideally, I'd
> like to be able to reproduce your failure to be sure we fully understand
> what's going on.  It's still the case that the "2" is wrong, but I think
> that code in rm_leaf may be involved in an unexpected way.
>
> Thanks,
> Eric

Hi Eric,

Following on is my test environment.


kernel: 5.7-rc4-git-eb24fdd8e6f5c6bb95129748a1801c6476492aba

e2fsprog: latest release version 1.45.6 (20-Mar-2020)

xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git, master 
branch, latest commit


1. Test device

I run the test in a VM and the VM is setup by qemu. The size of vdb is 1G,

```

#lsblk

NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
vdb    254:16   0   1G  0 disk

```


and is initialized by:

```

qemu-img create -f qcow2 /XX/disk1.qcow2 1G

qemu-kvm -drive file=/XX/disk1.qcow2,if=virtio,format=qcow2 ...

```


2. Test script


local.config of xfstests is like:

export TEST_DEV=/dev/vdb
export TEST_DIR=/mnt/test
export SCRATCH_DEV=/dev/vdc
export SCRATCH_MNT=/mnt/scratch


Following on is an example script to reproduce the failure:

```sh

#!/bin/bash

for i in `seq 100`; do
         echo y | mkfs.ext4 -O bigalloc -C 16K /dev/vdb

         ./check shared/298
         status=$?

         if [[ $status == 1 ]]; then
                 echo "$i exit"
                 exit
         fi
done

```


Indeed the failure occurs occasionally. Sometimes the script stops at 
iteration 4, or sometimes

at iteration 2, 7, 24.


The failure occurs with the following dmesg report:

```

[  387.471876] EXT4-fs error (device vdb): mb_free_blocks:1457: group 1, 
block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
[  387.473729] EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: 
group 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 
free clusters

```


3. About the applied patch

The applied patch does fix the failure in my test environment. At least 
the failure doesn't occur after running the full 100 iterations.


Thanks

Jeffle
Eric Whitney May 21, 2020, 9:26 p.m. UTC | #4
* JeffleXu <jefflexu@linux.alibaba.com>:
> 
> On 5/19/20 6:08 AM, Eric Whitney wrote:
> > Hi, Jeffle:
> > 
> > What kernel were you running when you observed your failures?  Does your
> > patch resolve all observed failures, or do any remain?  Do you have a
> > simple test script that reproduces the bug?
> > 
> > I've made almost 1000 runs of shared/298 on various bigalloc configurations
> > using Ted's test appliance on 5.7-rc5 and have not observed a failure.
> > Several auto group runs have also passed without failures.  Ideally, I'd
> > like to be able to reproduce your failure to be sure we fully understand
> > what's going on.  It's still the case that the "2" is wrong, but I think
> > that code in rm_leaf may be involved in an unexpected way.
> > 
> > Thanks,
> > Eric
> 
> Hi Eric,
> 
> Following on is my test environment.
> 
> 
> kernel: 5.7-rc4-git-eb24fdd8e6f5c6bb95129748a1801c6476492aba
> 
> e2fsprog: latest release version 1.45.6 (20-Mar-2020)
> 
> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git, master
> branch, latest commit
> 
> 
> 1. Test device
> 
> I run the test in a VM and the VM is setup by qemu. The size of vdb is 1G,
> 
> ```
> 
> #lsblk
> 
> NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> vdb    254:16   0   1G  0 disk
> 
> ```
> 
> 
> and is initialized by:
> 
> ```
> 
> qemu-img create -f qcow2 /XX/disk1.qcow2 1G
> 
> qemu-kvm -drive file=/XX/disk1.qcow2,if=virtio,format=qcow2 ...
> 
> ```
> 
> 
> 2. Test script
> 
> 
> local.config of xfstests is like:
> 
> export TEST_DEV=/dev/vdb
> export TEST_DIR=/mnt/test
> export SCRATCH_DEV=/dev/vdc
> export SCRATCH_MNT=/mnt/scratch
> 
> 
> Following on is an example script to reproduce the failure:
> 
> ```sh
> 
> #!/bin/bash
> 
> for i in `seq 100`; do
>         echo y | mkfs.ext4 -O bigalloc -C 16K /dev/vdb
> 
>         ./check shared/298
>         status=$?
> 
>         if [[ $status == 1 ]]; then
>                 echo "$i exit"
>                 exit
>         fi
> done
> 
> ```
> 
> 
> Indeed the failure occurs occasionally. Sometimes the script stops at
> iteration 4, or sometimes
> 
> at iteration 2, 7, 24.
> 
> 
> The failure occurs with the following dmesg report:
> 
> ```
> 
> [  387.471876] EXT4-fs error (device vdb): mb_free_blocks:1457: group 1,
> block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
> [  387.473729] EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group
> 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters
> 
> ```
> 
> 
> 3. About the applied patch
> 
> The applied patch does fix the failure in my test environment. At least the
> failure doesn't occur after running the full 100 iterations.
> 
> 
> Thanks
> 
> Jeffle
> 
> 
>

Hi, Jeffle:

Thanks for that information.  I'm still unable to reproduce your failure,
but by inspection your patch clearly fixes a bug, and of course, you're seeing
that.  I suspect the code in rm_leaf that also sets the partial cluster nofree
state is masking the bug in my testing.  In your case, my best guess is that
your testing is occasionally getting into the retry loop for EAGAIN in
remove_space.  This would effectively expose the bug again and could lead to
the failure you've described.

Your patch has survived all the heavy testing I've thrown at it.  So, please
repost your RFC patch as a fix, and feel free to add:
Reviewed-by: Eric Whitney <enwlinux@gmail.com>

This points out that the cluster freeing code really needs to be cleaned up,
so I'm working on a patch series that does that.

Thanks for your patience,
Eric
JeffleXu May 22, 2020, 3:09 a.m. UTC | #5
Thanks for reviewing. I will send a formal patch later ;)

Thanks,

Jeffle


On 5/22/20 5:26 AM, Eric Whitney wrote:
> * JeffleXu <jefflexu@linux.alibaba.com>:
>> On 5/19/20 6:08 AM, Eric Whitney wrote:
>>> Hi, Jeffle:
>>>
>>> What kernel were you running when you observed your failures?  Does your
>>> patch resolve all observed failures, or do any remain?  Do you have a
>>> simple test script that reproduces the bug?
>>>
>>> I've made almost 1000 runs of shared/298 on various bigalloc configurations
>>> using Ted's test appliance on 5.7-rc5 and have not observed a failure.
>>> Several auto group runs have also passed without failures.  Ideally, I'd
>>> like to be able to reproduce your failure to be sure we fully understand
>>> what's going on.  It's still the case that the "2" is wrong, but I think
>>> that code in rm_leaf may be involved in an unexpected way.
>>>
>>> Thanks,
>>> Eric
>> Hi Eric,
>>
>> Following on is my test environment.
>>
>>
>> kernel: 5.7-rc4-git-eb24fdd8e6f5c6bb95129748a1801c6476492aba
>>
>> e2fsprog: latest release version 1.45.6 (20-Mar-2020)
>>
>> xfstests: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git, master
>> branch, latest commit
>>
>>
>> 1. Test device
>>
>> I run the test in a VM and the VM is setup by qemu. The size of vdb is 1G,
>>
>> ```
>>
>> #lsblk
>>
>> NAME   MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
>> vdb    254:16   0   1G  0 disk
>>
>> ```
>>
>>
>> and is initialized by:
>>
>> ```
>>
>> qemu-img create -f qcow2 /XX/disk1.qcow2 1G
>>
>> qemu-kvm -drive file=/XX/disk1.qcow2,if=virtio,format=qcow2 ...
>>
>> ```
>>
>>
>> 2. Test script
>>
>>
>> local.config of xfstests is like:
>>
>> export TEST_DEV=/dev/vdb
>> export TEST_DIR=/mnt/test
>> export SCRATCH_DEV=/dev/vdc
>> export SCRATCH_MNT=/mnt/scratch
>>
>>
>> Following on is an example script to reproduce the failure:
>>
>> ```sh
>>
>> #!/bin/bash
>>
>> for i in `seq 100`; do
>>          echo y | mkfs.ext4 -O bigalloc -C 16K /dev/vdb
>>
>>          ./check shared/298
>>          status=$?
>>
>>          if [[ $status == 1 ]]; then
>>                  echo "$i exit"
>>                  exit
>>          fi
>> done
>>
>> ```
>>
>>
>> Indeed the failure occurs occasionally. Sometimes the script stops at
>> iteration 4, or sometimes
>>
>> at iteration 2, 7, 24.
>>
>>
>> The failure occurs with the following dmesg report:
>>
>> ```
>>
>> [  387.471876] EXT4-fs error (device vdb): mb_free_blocks:1457: group 1,
>> block 158084:freeing already freed block (bit 6753); block bitmap corrupt.
>> [  387.473729] EXT4-fs error (device vdb): ext4_mb_generate_buddy:747: group
>> 1, block bitmap and bg descriptor inconsistent: 19550 vs 19551 free clusters
>>
>> ```
>>
>>
>> 3. About the applied patch
>>
>> The applied patch does fix the failure in my test environment. At least the
>> failure doesn't occur after running the full 100 iterations.
>>
>>
>> Thanks
>>
>> Jeffle
>>
>>
>>
> Hi, Jeffle:
>
> Thanks for that information.  I'm still unable to reproduce your failure,
> but by inspection your patch clearly fixes a bug, and of course, you're seeing
> that.  I suspect the code in rm_leaf that also sets the partial cluster nofree
> state is masking the bug in my testing.  In your case, my best guess is that
> your testing is occasionally getting into the retry loop for EAGAIN in
> remove_space.  This would effectively expose the bug again and could lead to
> the failure you've described.
>
> Your patch has survived all the heavy testing I've thrown at it.  So, please
> repost your RFC patch as a fix, and feel free to add:
> Reviewed-by: Eric Whitney <enwlinux@gmail.com>
>
> This points out that the cluster freeing code really needs to be cleaned up,
> so I'm working on a patch series that does that.
>
> Thanks for your patience,
> Eric
diff mbox series

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f2b577b..cb74496 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -2828,7 +2828,7 @@  int ext4_ext_remove_space(struct inode *inode, ext4_lblk_t start,
 			 * in use to avoid freeing it when removing blocks.
 			 */
 			if (sbi->s_cluster_ratio > 1) {
-				pblk = ext4_ext_pblock(ex) + end - ee_block + 2;
+				pblk = ext4_ext_pblock(ex) + end - ee_block + 1;
 				partial.pclu = EXT4_B2C(sbi, pblk);
 				partial.state = nofree;
 			}