diff mbox series

ext4: only update i_reserved_data_blocks on successful block allocation

Message ID 20230325063443.1839558-1-libaokun1@huawei.com
State Superseded
Headers show
Series ext4: only update i_reserved_data_blocks on successful block allocation | expand

Commit Message

Baokun Li March 25, 2023, 6:34 a.m. UTC
In our fault injection test, we create an ext4 file, migrate it to
non-extent based file, then punch a hole and finally trigger a WARN_ON
in the ext4_da_update_reserve_space():

EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
ino 14, used 11 with only 10 reserved data blocks

When writing back a non-extent based file, if we enable delalloc, the
number of reserved blocks will be subtracted from the number of blocks
mapped by ext4_ind_map_blocks(), and the extent status tree will be
updated. We update the extent status tree by first removing the old
extent_status and then inserting the new extent_status. If the block range
we remove happens to be in an extent, then we need to allocate another
extent_status with ext4_es_alloc_extent().

       use old    to remove   to add new
    |----------|------------|------------|
              old extent_status

The problem is that the allocation of a new extent_status failed due to a
fault injection, and __es_shrink() did not get free memory, resulting in
a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
we map to the same extent again, and the number of reserved blocks is again
subtracted from the number of blocks in that extent. Since the blocks in
the same extent are subtracted twice, we end up triggering WARN_ON at
ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.

To fix this, we update the number of reserved blocks for non-extents inodes
only when the reserved blocks are allocated successfully, rather than every
time the blocks are mapped successfully.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
---
 fs/ext4/indirect.c |  8 ++++++++
 fs/ext4/inode.c    | 10 ----------
 2 files changed, 8 insertions(+), 10 deletions(-)

Comments

Jan Kara March 27, 2023, 12:47 p.m. UTC | #1
On Sat 25-03-23 14:34:43, Baokun Li wrote:
> In our fault injection test, we create an ext4 file, migrate it to
> non-extent based file, then punch a hole and finally trigger a WARN_ON
> in the ext4_da_update_reserve_space():
> 
> EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
> ino 14, used 11 with only 10 reserved data blocks
> 
> When writing back a non-extent based file, if we enable delalloc, the
> number of reserved blocks will be subtracted from the number of blocks
> mapped by ext4_ind_map_blocks(), and the extent status tree will be
> updated. We update the extent status tree by first removing the old
> extent_status and then inserting the new extent_status. If the block range
> we remove happens to be in an extent, then we need to allocate another
> extent_status with ext4_es_alloc_extent().
> 
>        use old    to remove   to add new
>     |----------|------------|------------|
>               old extent_status
> 
> The problem is that the allocation of a new extent_status failed due to a
> fault injection, and __es_shrink() did not get free memory, resulting in
> a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
> we map to the same extent again, and the number of reserved blocks is again
> subtracted from the number of blocks in that extent. Since the blocks in
> the same extent are subtracted twice, we end up triggering WARN_ON at
> ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.

Hum, but this second call to ext4_map_blocks() should find already allocated
blocks in the indirect block and thus should not be subtracting
ei->i_reserved_data_blocks for the second time. What am I missing?

								Honza

>
Baokun Li March 27, 2023, 1:09 p.m. UTC | #2
On 2023/3/27 20:47, Jan Kara wrote:
> On Sat 25-03-23 14:34:43, Baokun Li wrote:
>> In our fault injection test, we create an ext4 file, migrate it to
>> non-extent based file, then punch a hole and finally trigger a WARN_ON
>> in the ext4_da_update_reserve_space():
>>
>> EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
>> ino 14, used 11 with only 10 reserved data blocks
>>
>> When writing back a non-extent based file, if we enable delalloc, the
>> number of reserved blocks will be subtracted from the number of blocks
>> mapped by ext4_ind_map_blocks(), and the extent status tree will be
>> updated. We update the extent status tree by first removing the old
>> extent_status and then inserting the new extent_status. If the block range
>> we remove happens to be in an extent, then we need to allocate another
>> extent_status with ext4_es_alloc_extent().
>>
>>         use old    to remove   to add new
>>      |----------|------------|------------|
>>                old extent_status
>>
>> The problem is that the allocation of a new extent_status failed due to a
>> fault injection, and __es_shrink() did not get free memory, resulting in
>> a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
>> we map to the same extent again, and the number of reserved blocks is again
>> subtracted from the number of blocks in that extent. Since the blocks in
>> the same extent are subtracted twice, we end up triggering WARN_ON at
>> ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
> Hum, but this second call to ext4_map_blocks() should find already allocated
> blocks in the indirect block and thus should not be subtracting
> ei->i_reserved_data_blocks for the second time. What am I missing?
>
> 								Honza
>
ext4_map_blocks
   1. Lookup extent status tree firstly
        goto found;
   2. get the block without requesting a new file system block.
found:
   3. ceate and map the block

When we call ext4_map_blocks() for the second time, we directly find the
corresponding blocks in the extent status tree, and then go directly to 
step 3,
because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
but contains EXT4_GET_BLOCKS_CREATE, thus subtracting 
ei->i_reserved_data_blocks
for the second time.

Thanks!
Jan Kara March 28, 2023, 10 a.m. UTC | #3
On Mon 27-03-23 21:09:42, Baokun Li wrote:
> On 2023/3/27 20:47, Jan Kara wrote:
> > On Sat 25-03-23 14:34:43, Baokun Li wrote:
> > > In our fault injection test, we create an ext4 file, migrate it to
> > > non-extent based file, then punch a hole and finally trigger a WARN_ON
> > > in the ext4_da_update_reserve_space():
> > > 
> > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
> > > ino 14, used 11 with only 10 reserved data blocks
> > > 
> > > When writing back a non-extent based file, if we enable delalloc, the
> > > number of reserved blocks will be subtracted from the number of blocks
> > > mapped by ext4_ind_map_blocks(), and the extent status tree will be
> > > updated. We update the extent status tree by first removing the old
> > > extent_status and then inserting the new extent_status. If the block range
> > > we remove happens to be in an extent, then we need to allocate another
> > > extent_status with ext4_es_alloc_extent().
> > > 
> > >         use old    to remove   to add new
> > >      |----------|------------|------------|
> > >                old extent_status
> > > 
> > > The problem is that the allocation of a new extent_status failed due to a
> > > fault injection, and __es_shrink() did not get free memory, resulting in
> > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
> > > we map to the same extent again, and the number of reserved blocks is again
> > > subtracted from the number of blocks in that extent. Since the blocks in
> > > the same extent are subtracted twice, we end up triggering WARN_ON at
> > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
> > Hum, but this second call to ext4_map_blocks() should find already allocated
> > blocks in the indirect block and thus should not be subtracting
> > ei->i_reserved_data_blocks for the second time. What am I missing?
> > 
> > 								Honza
> > 
> ext4_map_blocks
>   1. Lookup extent status tree firstly
>        goto found;
>   2. get the block without requesting a new file system block.
> found:
>   3. ceate and map the block
> 
> When we call ext4_map_blocks() for the second time, we directly find the
> corresponding blocks in the extent status tree, and then go directly to step
> 3,
> because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
> but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
> ei->i_reserved_data_blocks
> for the second time.

Ah, I see. Thanks for explanation. But then the problem is deeper than just
a mismatch in number of reserved delalloc block. The problem really is that
if extent status tree update fails, we have inconsistency between what is
stored in the extent status tree and what is stored on disk. And that can
cause even data corruption issues in some cases.

So I think we rather need to work on handling of errors in extent status
tree operations. In the extent status tree, we have extents which we can
just drop without issues and extents we must not drop - this depends on the
extent's status - currently ext4_es_is_delayed() extents must stay, others
may be dropped but I'd wrap the decision in a helper function.

I'm currently inclined towards the following:

1) Removal must never fail. If we need to split extent, we use GFP_NOFAIL
if we cannot just drop the second part of the split extent in case of
allocation failure.

2) Similarly if inserting extent that cannot be dropped, we use GFP_NOFAIL.

3) We do not try to "undo" failed operations like we currently do - with
the above rules we never loose information that cannot be restored.

And this should also fix the problem you've hit because in case of
allocation failure we may just end up with removed extent from the extent
status tree and thus we refetch info from the disk and find out blocks are
already allocated.

								Honza
Baokun Li March 29, 2023, 7:23 a.m. UTC | #4
On 2023/3/28 18:00, Jan Kara wrote:
> On Mon 27-03-23 21:09:42, Baokun Li wrote:
>> On 2023/3/27 20:47, Jan Kara wrote:
>>> On Sat 25-03-23 14:34:43, Baokun Li wrote:
>>>> In our fault injection test, we create an ext4 file, migrate it to
>>>> non-extent based file, then punch a hole and finally trigger a WARN_ON
>>>> in the ext4_da_update_reserve_space():
>>>>
>>>> EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
>>>> ino 14, used 11 with only 10 reserved data blocks
>>>>
>>>> When writing back a non-extent based file, if we enable delalloc, the
>>>> number of reserved blocks will be subtracted from the number of blocks
>>>> mapped by ext4_ind_map_blocks(), and the extent status tree will be
>>>> updated. We update the extent status tree by first removing the old
>>>> extent_status and then inserting the new extent_status. If the block range
>>>> we remove happens to be in an extent, then we need to allocate another
>>>> extent_status with ext4_es_alloc_extent().
>>>>
>>>>          use old    to remove   to add new
>>>>       |----------|------------|------------|
>>>>                 old extent_status
>>>>
>>>> The problem is that the allocation of a new extent_status failed due to a
>>>> fault injection, and __es_shrink() did not get free memory, resulting in
>>>> a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
>>>> we map to the same extent again, and the number of reserved blocks is again
>>>> subtracted from the number of blocks in that extent. Since the blocks in
>>>> the same extent are subtracted twice, we end up triggering WARN_ON at
>>>> ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
>>> Hum, but this second call to ext4_map_blocks() should find already allocated
>>> blocks in the indirect block and thus should not be subtracting
>>> ei->i_reserved_data_blocks for the second time. What am I missing?
>>>
>>> 								Honza
>>>
>> ext4_map_blocks
>>    1. Lookup extent status tree firstly
>>         goto found;
>>    2. get the block without requesting a new file system block.
>> found:
>>    3. ceate and map the block
>>
>> When we call ext4_map_blocks() for the second time, we directly find the
>> corresponding blocks in the extent status tree, and then go directly to step
>> 3,
>> because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
>> but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
>> ei->i_reserved_data_blocks
>> for the second time.
> Ah, I see. Thanks for explanation. But then the problem is deeper than just
> a mismatch in number of reserved delalloc block. The problem really is that
> if extent status tree update fails, we have inconsistency between what is
> stored in the extent status tree and what is stored on disk. And that can
> cause even data corruption issues in some cases.
The scenario we encountered was this:
```
write:
     ext4_es_insert_delayed_block
     [0/16) 576460752303423487 (U,D)
writepages:
     alloc lblk 11 pblk 35328
     [0/16) 576460752303423487 (U,D)
     -- remove block 11 from extent
       [0/11) 576460752303423487 (U,D,R)  +  (Newly allocated)[12/4) 
549196775151 (U,D,R)
       --Failure to allocate memory for a new extent will undo as:
             [0/16) 576460752303423487 (U,D,R)
     -- if success insert block 11 to extent status tree
       [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 
(W) + [12/4) 549196775151 (U,D,R)

U: UNWRITTEN
D: DELAYED
W: WRITTEN
R: REFERENCED
```

When we fail to allocate a new extent, we don't map buffer and we don't do
io_submit, so why is the extent tree in memory inconsistent with the one
stored on disk? Am I missing something?

I would appreciate it if you could explain under what cases and what kind of
data corruption issues can be caused.
>
> So I think we rather need to work on handling of errors in extent status
> tree operations. In the extent status tree, we have extents which we can
> just drop without issues and extents we must not drop - this depends on the
> extent's status - currently ext4_es_is_delayed() extents must stay, others
> may be dropped but I'd wrap the decision in a helper function.
>
> I'm currently inclined towards the following:
>
> 1) Removal must never fail. If we need to split extent, we use GFP_NOFAIL
> if we cannot just drop the second part of the split extent in case of
> allocation failure.
>
> 2) Similarly if inserting extent that cannot be dropped, we use GFP_NOFAIL.
>
> 3) We do not try to "undo" failed operations like we currently do - with
> the above rules we never loose information that cannot be restored.

Totally agree!

This solution looks very effective and clear, I will try to implement it.

Thank you very much for your suggestion!

>
> And this should also fix the problem you've hit because in case of
> allocation failure we may just end up with removed extent from the extent
> status tree and thus we refetch info from the disk and find out blocks are
> already allocated.
>
> 								Honza
Reloading extent tree from disk I don't quite understand here, how do we 
handle
reserved blocks? could you explain it in more detail?

Logically, I think it is still necessary to update 
i_reserved_data_blocks only after
a successful allocation. This is also done in ext4_ext_map_blocks().

Thanks again!
Jan Kara March 29, 2023, 4:22 p.m. UTC | #5
On Wed 29-03-23 15:23:19, Baokun Li wrote:
> On 2023/3/28 18:00, Jan Kara wrote:
> > On Mon 27-03-23 21:09:42, Baokun Li wrote:
> > > On 2023/3/27 20:47, Jan Kara wrote:
> > > > On Sat 25-03-23 14:34:43, Baokun Li wrote:
> > > > > In our fault injection test, we create an ext4 file, migrate it to
> > > > > non-extent based file, then punch a hole and finally trigger a WARN_ON
> > > > > in the ext4_da_update_reserve_space():
> > > > > 
> > > > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
> > > > > ino 14, used 11 with only 10 reserved data blocks
> > > > > 
> > > > > When writing back a non-extent based file, if we enable delalloc, the
> > > > > number of reserved blocks will be subtracted from the number of blocks
> > > > > mapped by ext4_ind_map_blocks(), and the extent status tree will be
> > > > > updated. We update the extent status tree by first removing the old
> > > > > extent_status and then inserting the new extent_status. If the block range
> > > > > we remove happens to be in an extent, then we need to allocate another
> > > > > extent_status with ext4_es_alloc_extent().
> > > > > 
> > > > >          use old    to remove   to add new
> > > > >       |----------|------------|------------|
> > > > >                 old extent_status
> > > > > 
> > > > > The problem is that the allocation of a new extent_status failed due to a
> > > > > fault injection, and __es_shrink() did not get free memory, resulting in
> > > > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
> > > > > we map to the same extent again, and the number of reserved blocks is again
> > > > > subtracted from the number of blocks in that extent. Since the blocks in
> > > > > the same extent are subtracted twice, we end up triggering WARN_ON at
> > > > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
> > > > Hum, but this second call to ext4_map_blocks() should find already allocated
> > > > blocks in the indirect block and thus should not be subtracting
> > > > ei->i_reserved_data_blocks for the second time. What am I missing?
> > > > 
> > > > 								Honza
> > > > 
> > > ext4_map_blocks
> > >    1. Lookup extent status tree firstly
> > >         goto found;
> > >    2. get the block without requesting a new file system block.
> > > found:
> > >    3. ceate and map the block
> > > 
> > > When we call ext4_map_blocks() for the second time, we directly find the
> > > corresponding blocks in the extent status tree, and then go directly to step
> > > 3,
> > > because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
> > > but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
> > > ei->i_reserved_data_blocks
> > > for the second time.
> > Ah, I see. Thanks for explanation. But then the problem is deeper than just
> > a mismatch in number of reserved delalloc block. The problem really is that
> > if extent status tree update fails, we have inconsistency between what is
> > stored in the extent status tree and what is stored on disk. And that can
> > cause even data corruption issues in some cases.
> The scenario we encountered was this:
> ```
> write:
>     ext4_es_insert_delayed_block
>     [0/16) 576460752303423487 (U,D)
> writepages:
>     alloc lblk 11 pblk 35328
>     [0/16) 576460752303423487 (U,D)
>     -- remove block 11 from extent
>       [0/11) 576460752303423487 (U,D,R)  +  (Newly allocated)[12/4)
> 549196775151 (U,D,R)
>       --Failure to allocate memory for a new extent will undo as:
>             [0/16) 576460752303423487 (U,D,R)

Yes, this is what I was expecting. So now extent status tree is
inconsistent with the on-disk allocation info because the block 11 is
already allocated on disk but recorded as unallocated in the extent status
tree.

If the similar problem happened say when we punch a hole into a middle of a
written extent and so block on disk got freed but extent status tree would
still record it as allocated, user would be able to access freed block thus
potentially exposing sensitive data.

>     -- if success insert block 11 to extent status tree
>       [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W)
> + [12/4) 549196775151 (U,D,R)
> 
> U: UNWRITTEN
> D: DELAYED
> W: WRITTEN
> R: REFERENCED
> ```
> 
> When we fail to allocate a new extent, we don't map buffer and we don't do
> io_submit, so why is the extent tree in memory inconsistent with the one
> stored on disk? Am I missing something?
> 
> I would appreciate it if you could explain under what cases and what kind of
> data corruption issues can be caused.

See above.

> > And this should also fix the problem you've hit because in case of
> > allocation failure we may just end up with removed extent from the extent
> > status tree and thus we refetch info from the disk and find out blocks are
> > already allocated.
> 
> Reloading extent tree from disk I don't quite understand here, how do we
> handle reserved blocks? could you explain it in more detail?
> 
> Logically, I think it is still necessary to update i_reserved_data_blocks
> only after a successful allocation. This is also done in
> ext4_ext_map_blocks().

I guess there is some misunderstanding here. Both with
ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating
i_reserved_data_blocks only after the blocks are successfully allocated and
inserted in the respective data structure but *before* updating extent
status tree. If extent status tree operation fails, we currently get
inconsistency between extent status tree and on-disk info in both cases
AFAICS. Am I missing something?

								Honza
Baokun Li April 3, 2023, 2:02 p.m. UTC | #6
On 2023/3/30 0:22, Jan Kara wrote:
> On Wed 29-03-23 15:23:19, Baokun Li wrote:
>> On 2023/3/28 18:00, Jan Kara wrote:
>>> On Mon 27-03-23 21:09:42, Baokun Li wrote:
>>>> On 2023/3/27 20:47, Jan Kara wrote:
>>>>> On Sat 25-03-23 14:34:43, Baokun Li wrote:
>>>>>> In our fault injection test, we create an ext4 file, migrate it to
>>>>>> non-extent based file, then punch a hole and finally trigger a WARN_ON
>>>>>> in the ext4_da_update_reserve_space():
>>>>>>
>>>>>> EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
>>>>>> ino 14, used 11 with only 10 reserved data blocks
>>>>>>
>>>>>> When writing back a non-extent based file, if we enable delalloc, the
>>>>>> number of reserved blocks will be subtracted from the number of blocks
>>>>>> mapped by ext4_ind_map_blocks(), and the extent status tree will be
>>>>>> updated. We update the extent status tree by first removing the old
>>>>>> extent_status and then inserting the new extent_status. If the block range
>>>>>> we remove happens to be in an extent, then we need to allocate another
>>>>>> extent_status with ext4_es_alloc_extent().
>>>>>>
>>>>>>           use old    to remove   to add new
>>>>>>        |----------|------------|------------|
>>>>>>                  old extent_status
>>>>>>
>>>>>> The problem is that the allocation of a new extent_status failed due to a
>>>>>> fault injection, and __es_shrink() did not get free memory, resulting in
>>>>>> a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
>>>>>> we map to the same extent again, and the number of reserved blocks is again
>>>>>> subtracted from the number of blocks in that extent. Since the blocks in
>>>>>> the same extent are subtracted twice, we end up triggering WARN_ON at
>>>>>> ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
>>>>> Hum, but this second call to ext4_map_blocks() should find already allocated
>>>>> blocks in the indirect block and thus should not be subtracting
>>>>> ei->i_reserved_data_blocks for the second time. What am I missing?
>>>>>
>>>>> 								Honza
>>>>>
>>>> ext4_map_blocks
>>>>     1. Lookup extent status tree firstly
>>>>          goto found;
>>>>     2. get the block without requesting a new file system block.
>>>> found:
>>>>     3. ceate and map the block
>>>>
>>>> When we call ext4_map_blocks() for the second time, we directly find the
>>>> corresponding blocks in the extent status tree, and then go directly to step
>>>> 3,
>>>> because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
>>>> but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
>>>> ei->i_reserved_data_blocks
>>>> for the second time.
>>> Ah, I see. Thanks for explanation. But then the problem is deeper than just
>>> a mismatch in number of reserved delalloc block. The problem really is that
>>> if extent status tree update fails, we have inconsistency between what is
>>> stored in the extent status tree and what is stored on disk. And that can
>>> cause even data corruption issues in some cases.
>> The scenario we encountered was this:
>> ```
>> write:
>>      ext4_es_insert_delayed_block
>>      [0/16) 576460752303423487 (U,D)
>> writepages:
>>      alloc lblk 11 pblk 35328
>>      [0/16) 576460752303423487 (U,D)
>>      -- remove block 11 from extent
>>        [0/11) 576460752303423487 (U,D,R)  +  (Newly allocated)[12/4)
>> 549196775151 (U,D,R)
>>        --Failure to allocate memory for a new extent will undo as:
>>              [0/16) 576460752303423487 (U,D,R)
> Yes, this is what I was expecting. So now extent status tree is
> inconsistent with the on-disk allocation info because the block 11 is
> already allocated on disk but recorded as unallocated in the extent status
> tree.

Yes! There is an inconsistency here, but do_writepages finds that the 
writeback
returns -ENOMEM and keeps retrying until it succeeds, at which point the
above inconsistency does not exist.

> If the similar problem happened say when we punch a hole into a middle of a
> written extent and so block on disk got freed but extent status tree would
> still record it as allocated, user would be able to access freed block thus
> potentially exposing sensitive data.

ext4_punch_hole
   // remove extents in extents status tree
   ext4_es_remove_extent
   // remove extents tree on disk
   ext4_ext_remove_space

In this scenario, we always try to delete the extents in the in-memory 
extents
status tree first, and then delete the extents tree on disk. So even if 
we fail in
deleting extents in memory, there is no inconsistency, am I missing 
something?

>
>>      -- if success insert block 11 to extent status tree
>>        [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W)
>> + [12/4) 549196775151 (U,D,R)
>>
>> U: UNWRITTEN
>> D: DELAYED
>> W: WRITTEN
>> R: REFERENCED
>> ```
>>
>> When we fail to allocate a new extent, we don't map buffer and we don't do
>> io_submit, so why is the extent tree in memory inconsistent with the one
>> stored on disk? Am I missing something?
>>
>> I would appreciate it if you could explain under what cases and what kind of
>> data corruption issues can be caused.
> See above.
>
>>> And this should also fix the problem you've hit because in case of
>>> allocation failure we may just end up with removed extent from the extent
>>> status tree and thus we refetch info from the disk and find out blocks are
>>> already allocated.
>> Reloading extent tree from disk I don't quite understand here, how do we
>> handle reserved blocks? could you explain it in more detail?
>>
>> Logically, I think it is still necessary to update i_reserved_data_blocks
>> only after a successful allocation. This is also done in
>> ext4_ext_map_blocks().
> I guess there is some misunderstanding here. Both with
> ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating
> i_reserved_data_blocks only after the blocks are successfully allocated and
> inserted in the respective data structure but *before* updating extent
> status tree. If extent status tree operation fails, we currently get
> inconsistency between extent status tree and on-disk info in both cases
> AFAICS. Am I missing something?
>
> 								Honza

Yes, our code is indeed designed to only update the number of reserved 
blocks
after the block allocation is complete. We have different treatment for 
extent
based file and non-extent based file in commit 5f634d064c70 ("ext4: Fix 
quota
accounting error with fallocate").

For extent based file, we update the number of reserved blocks before the
"got_allocated_blocks" tag after the blocks are successfully allocated in
ext4_ext_map_blocks().

For the non-extent based file we update the number of reserved blocks after
ext4_ind_map_blocks() is executed, which leads to the problem that when 
we call
ext4_ind_map_blocks() to create a block, it does not always create a block.
For example, if the extents status tree we encountered earlier does not 
match
the extents tree on disk, this is of course a problem in itself, but in 
terms of code
logic, updating the number of reserved blocks as ext4_ext_map_blocks() does
can prevent us from trying to create a block and not creating it, 
resulting in an
incorrect number of reserved blocks.


Thank you very much for your patient explanation!
Jan Kara April 4, 2023, 10:04 a.m. UTC | #7
On Mon 03-04-23 22:02:56, Baokun Li wrote:
> On 2023/3/30 0:22, Jan Kara wrote:
> > On Wed 29-03-23 15:23:19, Baokun Li wrote:
> > > On 2023/3/28 18:00, Jan Kara wrote:
> > > > On Mon 27-03-23 21:09:42, Baokun Li wrote:
> > > > > On 2023/3/27 20:47, Jan Kara wrote:
> > > > > > On Sat 25-03-23 14:34:43, Baokun Li wrote:
> > > > > > > In our fault injection test, we create an ext4 file, migrate it to
> > > > > > > non-extent based file, then punch a hole and finally trigger a WARN_ON
> > > > > > > in the ext4_da_update_reserve_space():
> > > > > > > 
> > > > > > > EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
> > > > > > > ino 14, used 11 with only 10 reserved data blocks
> > > > > > > 
> > > > > > > When writing back a non-extent based file, if we enable delalloc, the
> > > > > > > number of reserved blocks will be subtracted from the number of blocks
> > > > > > > mapped by ext4_ind_map_blocks(), and the extent status tree will be
> > > > > > > updated. We update the extent status tree by first removing the old
> > > > > > > extent_status and then inserting the new extent_status. If the block range
> > > > > > > we remove happens to be in an extent, then we need to allocate another
> > > > > > > extent_status with ext4_es_alloc_extent().
> > > > > > > 
> > > > > > >           use old    to remove   to add new
> > > > > > >        |----------|------------|------------|
> > > > > > >                  old extent_status
> > > > > > > 
> > > > > > > The problem is that the allocation of a new extent_status failed due to a
> > > > > > > fault injection, and __es_shrink() did not get free memory, resulting in
> > > > > > > a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
> > > > > > > we map to the same extent again, and the number of reserved blocks is again
> > > > > > > subtracted from the number of blocks in that extent. Since the blocks in
> > > > > > > the same extent are subtracted twice, we end up triggering WARN_ON at
> > > > > > > ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
> > > > > > Hum, but this second call to ext4_map_blocks() should find already allocated
> > > > > > blocks in the indirect block and thus should not be subtracting
> > > > > > ei->i_reserved_data_blocks for the second time. What am I missing?
> > > > > > 
> > > > > > 								Honza
> > > > > > 
> > > > > ext4_map_blocks
> > > > >     1. Lookup extent status tree firstly
> > > > >          goto found;
> > > > >     2. get the block without requesting a new file system block.
> > > > > found:
> > > > >     3. ceate and map the block
> > > > > 
> > > > > When we call ext4_map_blocks() for the second time, we directly find the
> > > > > corresponding blocks in the extent status tree, and then go directly to step
> > > > > 3,
> > > > > because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
> > > > > but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
> > > > > ei->i_reserved_data_blocks
> > > > > for the second time.
> > > > Ah, I see. Thanks for explanation. But then the problem is deeper than just
> > > > a mismatch in number of reserved delalloc block. The problem really is that
> > > > if extent status tree update fails, we have inconsistency between what is
> > > > stored in the extent status tree and what is stored on disk. And that can
> > > > cause even data corruption issues in some cases.
> > > The scenario we encountered was this:
> > > ```
> > > write:
> > >      ext4_es_insert_delayed_block
> > >      [0/16) 576460752303423487 (U,D)
> > > writepages:
> > >      alloc lblk 11 pblk 35328
> > >      [0/16) 576460752303423487 (U,D)
> > >      -- remove block 11 from extent
> > >        [0/11) 576460752303423487 (U,D,R)  +  (Newly allocated)[12/4)
> > > 549196775151 (U,D,R)
> > >        --Failure to allocate memory for a new extent will undo as:
> > >              [0/16) 576460752303423487 (U,D,R)
> > Yes, this is what I was expecting. So now extent status tree is
> > inconsistent with the on-disk allocation info because the block 11 is
> > already allocated on disk but recorded as unallocated in the extent status
> > tree.
> 
> Yes! There is an inconsistency here, but do_writepages finds that the
> writeback returns -ENOMEM and keeps retrying until it succeeds, at which
> point the above inconsistency does not exist.

Well, do_writepages() will not retry if wbc->sync_mode == WB_SYNC_NONE. So
the inconsistency can stay for a long time.

> > If the similar problem happened say when we punch a hole into a middle of a
> > written extent and so block on disk got freed but extent status tree would
> > still record it as allocated, user would be able to access freed block thus
> > potentially exposing sensitive data.
> 
> ext4_punch_hole
>   // remove extents in extents status tree
>   ext4_es_remove_extent
>   // remove extents tree on disk
>   ext4_ext_remove_space
> 
> In this scenario, we always try to delete the extents in the in-memory
> extents status tree first, and then delete the extents tree on disk. So
> even if we fail in deleting extents in memory, there is no inconsistency,
> am I missing something?

No, you are right, this case is safe. Still I suspect inconsistencies with
extent status tree can cause more problems and possibly stale data
exposure.

> > >      -- if success insert block 11 to extent status tree
> > >        [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W)
> > > + [12/4) 549196775151 (U,D,R)
> > > 
> > > U: UNWRITTEN
> > > D: DELAYED
> > > W: WRITTEN
> > > R: REFERENCED
> > > ```
> > > 
> > > When we fail to allocate a new extent, we don't map buffer and we don't do
> > > io_submit, so why is the extent tree in memory inconsistent with the one
> > > stored on disk? Am I missing something?
> > > 
> > > I would appreciate it if you could explain under what cases and what kind of
> > > data corruption issues can be caused.
> > See above.
> > 
> > > > And this should also fix the problem you've hit because in case of
> > > > allocation failure we may just end up with removed extent from the extent
> > > > status tree and thus we refetch info from the disk and find out blocks are
> > > > already allocated.
> > > Reloading extent tree from disk I don't quite understand here, how do we
> > > handle reserved blocks? could you explain it in more detail?
> > > 
> > > Logically, I think it is still necessary to update i_reserved_data_blocks
> > > only after a successful allocation. This is also done in
> > > ext4_ext_map_blocks().
> > I guess there is some misunderstanding here. Both with
> > ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating
> > i_reserved_data_blocks only after the blocks are successfully allocated and
> > inserted in the respective data structure but *before* updating extent
> > status tree. If extent status tree operation fails, we currently get
> > inconsistency between extent status tree and on-disk info in both cases
> > AFAICS. Am I missing something?
> 
> Yes, our code is indeed designed to only update the number of reserved
> blocks after the block allocation is complete. We have different
> treatment for extent based file and non-extent based file in commit
> 5f634d064c70 ("ext4: Fix quota accounting error with fallocate").
> 
> For extent based file, we update the number of reserved blocks before the
> "got_allocated_blocks" tag after the blocks are successfully allocated in
> ext4_ext_map_blocks().
> 
> For the non-extent based file we update the number of reserved blocks
> after ext4_ind_map_blocks() is executed, which leads to the problem that
> when we call ext4_ind_map_blocks() to create a block, it does not always
> create a block.  For example, if the extents status tree we encountered
> earlier does not match the extents tree on disk, this is of course a
> problem in itself, but in terms of code logic, updating the number of
> reserved blocks as ext4_ext_map_blocks() does can prevent us from trying
> to create a block and not creating it, resulting in an incorrect number
> of reserved blocks.

I see, thanks for explanation! Indeed it may be good to cleanup this code
so that indirect block and extent based inodes are handled in the same way.

								Honza
Baokun Li April 4, 2023, 11:31 a.m. UTC | #8
On 2023/4/4 18:04, Jan Kara wrote:
> On Mon 03-04-23 22:02:56, Baokun Li wrote:
>> On 2023/3/30 0:22, Jan Kara wrote:
>>> On Wed 29-03-23 15:23:19, Baokun Li wrote:
>>>> On 2023/3/28 18:00, Jan Kara wrote:
>>>>> On Mon 27-03-23 21:09:42, Baokun Li wrote:
>>>>>> On 2023/3/27 20:47, Jan Kara wrote:
>>>>>>> On Sat 25-03-23 14:34:43, Baokun Li wrote:
>>>>>>>> In our fault injection test, we create an ext4 file, migrate it to
>>>>>>>> non-extent based file, then punch a hole and finally trigger a WARN_ON
>>>>>>>> in the ext4_da_update_reserve_space():
>>>>>>>>
>>>>>>>> EXT4-fs warning (device sda): ext4_da_update_reserve_space:369:
>>>>>>>> ino 14, used 11 with only 10 reserved data blocks
>>>>>>>>
>>>>>>>> When writing back a non-extent based file, if we enable delalloc, the
>>>>>>>> number of reserved blocks will be subtracted from the number of blocks
>>>>>>>> mapped by ext4_ind_map_blocks(), and the extent status tree will be
>>>>>>>> updated. We update the extent status tree by first removing the old
>>>>>>>> extent_status and then inserting the new extent_status. If the block range
>>>>>>>> we remove happens to be in an extent, then we need to allocate another
>>>>>>>> extent_status with ext4_es_alloc_extent().
>>>>>>>>
>>>>>>>>            use old    to remove   to add new
>>>>>>>>         |----------|------------|------------|
>>>>>>>>                   old extent_status
>>>>>>>>
>>>>>>>> The problem is that the allocation of a new extent_status failed due to a
>>>>>>>> fault injection, and __es_shrink() did not get free memory, resulting in
>>>>>>>> a return of -ENOMEM. Then do_writepages() retries after receiving -ENOMEM,
>>>>>>>> we map to the same extent again, and the number of reserved blocks is again
>>>>>>>> subtracted from the number of blocks in that extent. Since the blocks in
>>>>>>>> the same extent are subtracted twice, we end up triggering WARN_ON at
>>>>>>>> ext4_da_update_reserve_space() because used > ei->i_reserved_data_blocks.
>>>>>>> Hum, but this second call to ext4_map_blocks() should find already allocated
>>>>>>> blocks in the indirect block and thus should not be subtracting
>>>>>>> ei->i_reserved_data_blocks for the second time. What am I missing?
>>>>>>>
>>>>>>> 								Honza
>>>>>>>
>>>>>> ext4_map_blocks
>>>>>>      1. Lookup extent status tree firstly
>>>>>>           goto found;
>>>>>>      2. get the block without requesting a new file system block.
>>>>>> found:
>>>>>>      3. ceate and map the block
>>>>>>
>>>>>> When we call ext4_map_blocks() for the second time, we directly find the
>>>>>> corresponding blocks in the extent status tree, and then go directly to step
>>>>>> 3,
>>>>>> because our flag is brand new and therefore does not contain EXT4_MAP_MAPPED
>>>>>> but contains EXT4_GET_BLOCKS_CREATE, thus subtracting
>>>>>> ei->i_reserved_data_blocks
>>>>>> for the second time.
>>>>> Ah, I see. Thanks for explanation. But then the problem is deeper than just
>>>>> a mismatch in number of reserved delalloc block. The problem really is that
>>>>> if extent status tree update fails, we have inconsistency between what is
>>>>> stored in the extent status tree and what is stored on disk. And that can
>>>>> cause even data corruption issues in some cases.
>>>> The scenario we encountered was this:
>>>> ```
>>>> write:
>>>>       ext4_es_insert_delayed_block
>>>>       [0/16) 576460752303423487 (U,D)
>>>> writepages:
>>>>       alloc lblk 11 pblk 35328
>>>>       [0/16) 576460752303423487 (U,D)
>>>>       -- remove block 11 from extent
>>>>         [0/11) 576460752303423487 (U,D,R)  +  (Newly allocated)[12/4)
>>>> 549196775151 (U,D,R)
>>>>         --Failure to allocate memory for a new extent will undo as:
>>>>               [0/16) 576460752303423487 (U,D,R)
>>> Yes, this is what I was expecting. So now extent status tree is
>>> inconsistent with the on-disk allocation info because the block 11 is
>>> already allocated on disk but recorded as unallocated in the extent status
>>> tree.
>> Yes! There is an inconsistency here, but do_writepages finds that the
>> writeback returns -ENOMEM and keeps retrying until it succeeds, at which
>> point the above inconsistency does not exist.
> Well, do_writepages() will not retry if wbc->sync_mode == WB_SYNC_NONE. So
> the inconsistency can stay for a long time.

Indeed! There may still be problems in WB_SYNC_NONE mode.

>>> If the similar problem happened say when we punch a hole into a middle of a
>>> written extent and so block on disk got freed but extent status tree would
>>> still record it as allocated, user would be able to access freed block thus
>>> potentially exposing sensitive data.
>> ext4_punch_hole
>>    // remove extents in extents status tree
>>    ext4_es_remove_extent
>>    // remove extents tree on disk
>>    ext4_ext_remove_space
>>
>> In this scenario, we always try to delete the extents in the in-memory
>> extents status tree first, and then delete the extents tree on disk. So
>> even if we fail in deleting extents in memory, there is no inconsistency,
>> am I missing something?
> No, you are right, this case is safe. Still I suspect inconsistencies with
> extent status tree can cause more problems and possibly stale data
> exposure.

Yes, we also suspect that this inconsistency may cause problems, but we have
not found a scenario where the problem may occur, which is why we asked you
for advice on what scenarios would be problematic. But in any case, the 
fix you
suggested before would avoid this problem.

Thank you for your confirmation!

>>>>       -- if success insert block 11 to extent status tree
>>>>         [0/11) 576460752303423487 (U,D,R) + (Newly allocated)[11/1) 35328 (W)
>>>> + [12/4) 549196775151 (U,D,R)
>>>>
>>>> U: UNWRITTEN
>>>> D: DELAYED
>>>> W: WRITTEN
>>>> R: REFERENCED
>>>> ```
>>>>
>>>> When we fail to allocate a new extent, we don't map buffer and we don't do
>>>> io_submit, so why is the extent tree in memory inconsistent with the one
>>>> stored on disk? Am I missing something?
>>>>
>>>> I would appreciate it if you could explain under what cases and what kind of
>>>> data corruption issues can be caused.
>>> See above.
>>>
>>>>> And this should also fix the problem you've hit because in case of
>>>>> allocation failure we may just end up with removed extent from the extent
>>>>> status tree and thus we refetch info from the disk and find out blocks are
>>>>> already allocated.
>>>> Reloading extent tree from disk I don't quite understand here, how do we
>>>> handle reserved blocks? could you explain it in more detail?
>>>>
>>>> Logically, I think it is still necessary to update i_reserved_data_blocks
>>>> only after a successful allocation. This is also done in
>>>> ext4_ext_map_blocks().
>>> I guess there is some misunderstanding here. Both with
>>> ext4_ext_map_blocks() and ext4_ind_map_blocks() we end up updating
>>> i_reserved_data_blocks only after the blocks are successfully allocated and
>>> inserted in the respective data structure but *before* updating extent
>>> status tree. If extent status tree operation fails, we currently get
>>> inconsistency between extent status tree and on-disk info in both cases
>>> AFAICS. Am I missing something?
>> Yes, our code is indeed designed to only update the number of reserved
>> blocks after the block allocation is complete. We have different
>> treatment for extent based file and non-extent based file in commit
>> 5f634d064c70 ("ext4: Fix quota accounting error with fallocate").
>>
>> For extent based file, we update the number of reserved blocks before the
>> "got_allocated_blocks" tag after the blocks are successfully allocated in
>> ext4_ext_map_blocks().
>>
>> For the non-extent based file we update the number of reserved blocks
>> after ext4_ind_map_blocks() is executed, which leads to the problem that
>> when we call ext4_ind_map_blocks() to create a block, it does not always
>> create a block.  For example, if the extents status tree we encountered
>> earlier does not match the extents tree on disk, this is of course a
>> problem in itself, but in terms of code logic, updating the number of
>> reserved blocks as ext4_ext_map_blocks() does can prevent us from trying
>> to create a block and not creating it, resulting in an incorrect number
>> of reserved blocks.
> I see, thanks for explanation! Indeed it may be good to cleanup this code
> so that indirect block and extent based inodes are handled in the same way.
>
> 								Honza

Thanks for your support!

I will send a patch V2 with the changes suggested by you!
diff mbox series

Patch

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index c68bebe7ff4b..9acab70ddf5e 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -651,6 +651,14 @@  int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
 
 	ext4_update_inode_fsync_trans(handle, inode, 1);
 	count = ar.len;
+
+	/*
+	 * Update reserved blocks/metadata blocks after successful block
+	 * allocation which had been deferred till now.
+	 */
+	if ((count > 0) && (flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
+		ext4_da_update_reserve_space(inode, count, 1);
+
 got_it:
 	map->m_flags |= EXT4_MAP_MAPPED;
 	map->m_pblk = le32_to_cpu(chain[depth-1].key);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eaeec84ec1b0..21be018b6503 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -659,16 +659,6 @@  int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			 */
 			ext4_clear_inode_state(inode, EXT4_STATE_EXT_MIGRATE);
 		}
-
-		/*
-		 * Update reserved blocks/metadata blocks after successful
-		 * block allocation which had been deferred till now. We don't
-		 * support fallocate for non extent files. So we can update
-		 * reserve space here.
-		 */
-		if ((retval > 0) &&
-			(flags & EXT4_GET_BLOCKS_DELALLOC_RESERVE))
-			ext4_da_update_reserve_space(inode, retval, 1);
 	}
 
 	if (retval > 0) {