mbox series

[00/10] ext4: fix inconsistency since reading old metadata from disk

Message ID 20200526071754.33819-1-yi.zhang@huawei.com
Headers show
Series ext4: fix inconsistency since reading old metadata from disk | expand

Message

Zhang Yi May 26, 2020, 7:17 a.m. UTC
Background
==========

This patch set point to fix the inconsistency problem which has been
discussed and partial fixed in [1].

Now, the problem is on the unstable storage which has a flaky transport
(e.g. iSCSI transport may disconnect few seconds and reconnect due to
the bad network environment), if we failed to async write metadata in
background, the end write routine in block layer will clear the buffer's
uptodate flag, but the data in such buffer is actually uptodate. Finally
we may read "old && inconsistent" metadata from the disk when we get the
buffer later because not only the uptodate flag was cleared but also we
do not check the write io error flag, or even worse the buffer has been
freed due to memory presure.

Fortunately, if the jbd2 do checkpoint after async IO error happens,
the checkpoint routine will check the write_io_error flag and abort the
the journal if detect IO error. And in the journal recover case, the
recover code will invoke sync_blockdev() after recover complete, it will
also detect IO error and refuse to mount the filesystem.

Current ext4 have already deal with this problem in __ext4_get_inode_loc()
and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
containing valid data"), but it's not enough.

[1] https://lore.kernel.org/linux-ext4/20190823030207.GC8130@mit.edu/

Description
===========

This patch set add and rework 7 wrapper functions of getting metadata
blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
sb_breadahead*(). Add buffer_write_io_error() checking into them, if
the buffer isn't uptodate and write_io_error flag was set, which means
that the buffer has been failed to write out to disk, re-add the
uptodate flag to prevent subsequent read operation.

 - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
   sb_getblk() used for newly allocated blocks and getting buffers.
 - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
   fix buffer uotpdate flag, use to replace all sb_getblk() used for
   getting buffers to read.
 - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
 - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
 - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
   not uptodate.
 - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
 - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
   except skip submit read bio if failed to lock the buffer.

Patch 1-2: do some small change in ext4 inode eio simulation and add a
helper in buffer.c, just prepare for below patches.
Patch 3: add the ext4_sb_*() function to deal with the write_io_error
flag in buffer.
Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
shrinking a buffer with write_io_error flag.
Patch 10: just do some cleanup.

After this patch set, we need to use above 7 wrapper functions to
get/read metadata block instead of invoke sb_*() functions defined in
fs/buffer.h.

Test
====

This patch set is based on linux-5.7-rc7 and has been tests by xfstests
in auto mode.

Thanks,
Yi.


zhangyi (F) (10):
  ext4: move inode eio simulation behind io completeion
  fs: pick out ll_rw_one_block() helper function
  ext4: add ext4_sb_getblk*() wrapper functions
  ext4: replace sb_getblk() with ext4_sb_getblk_locked()
  ext4: replace sb_bread*() with ext4_sb_bread*()
  ext4: replace sb_getblk() with ext4_sb_getblk()
  ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
  ext4: replace sb_breadahead() with ext4_sb_breadahead()
  ext4: abort the filesystem while freeing the write error io buffer
  ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()

 fs/buffer.c                 |  41 ++++++----
 fs/ext4/balloc.c            |   6 +-
 fs/ext4/ext4.h              |  60 ++++++++++++---
 fs/ext4/extents.c           |  13 ++--
 fs/ext4/ialloc.c            |   6 +-
 fs/ext4/indirect.c          |  13 ++--
 fs/ext4/inline.c            |   2 +-
 fs/ext4/inode.c             |  53 +++++--------
 fs/ext4/mmp.c               |   2 +-
 fs/ext4/resize.c            |  24 +++---
 fs/ext4/super.c             | 145 +++++++++++++++++++++++++++++++-----
 fs/ext4/xattr.c             |   4 +-
 fs/jbd2/transaction.c       |  20 +++--
 include/linux/buffer_head.h |   1 +
 include/linux/jbd2.h        |   3 +-
 15 files changed, 277 insertions(+), 116 deletions(-)

Comments

Zhang Yi June 8, 2020, 3:32 a.m. UTC | #1
Hiļ¼ŒTed and Jan, any suggestions of this patch set?

Thanks,
Yi.

On 2020/5/26 15:17, zhangyi (F) wrote:
> Background
> ==========
> 
> This patch set point to fix the inconsistency problem which has been
> discussed and partial fixed in [1].
> 
> Now, the problem is on the unstable storage which has a flaky transport
> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> the bad network environment), if we failed to async write metadata in
> background, the end write routine in block layer will clear the buffer's
> uptodate flag, but the data in such buffer is actually uptodate. Finally
> we may read "old && inconsistent" metadata from the disk when we get the
> buffer later because not only the uptodate flag was cleared but also we
> do not check the write io error flag, or even worse the buffer has been
> freed due to memory presure.
> 
> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> the checkpoint routine will check the write_io_error flag and abort the
> the journal if detect IO error. And in the journal recover case, the
> recover code will invoke sync_blockdev() after recover complete, it will
> also detect IO error and refuse to mount the filesystem.
> 
> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> containing valid data"), but it's not enough.
> 
> [1] https://lore.kernel.org/linux-ext4/20190823030207.GC8130@mit.edu/
> 
> Description
> ===========
> 
> This patch set add and rework 7 wrapper functions of getting metadata
> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
> the buffer isn't uptodate and write_io_error flag was set, which means
> that the buffer has been failed to write out to disk, re-add the
> uptodate flag to prevent subsequent read operation.
> 
>  - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
>    sb_getblk() used for newly allocated blocks and getting buffers.
>  - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
>    fix buffer uotpdate flag, use to replace all sb_getblk() used for
>    getting buffers to read.
>  - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
>  - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
>  - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
>    not uptodate.
>  - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
>  - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
>    except skip submit read bio if failed to lock the buffer.
> 
> Patch 1-2: do some small change in ext4 inode eio simulation and add a
> helper in buffer.c, just prepare for below patches.
> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
> flag in buffer.
> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
> shrinking a buffer with write_io_error flag.
> Patch 10: just do some cleanup.
> 
> After this patch set, we need to use above 7 wrapper functions to
> get/read metadata block instead of invoke sb_*() functions defined in
> fs/buffer.h.
> 
> Test
> ====
> 
> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
> in auto mode.
> 
> Thanks,
> Yi.
> 
> 
> zhangyi (F) (10):
>   ext4: move inode eio simulation behind io completeion
>   fs: pick out ll_rw_one_block() helper function
>   ext4: add ext4_sb_getblk*() wrapper functions
>   ext4: replace sb_getblk() with ext4_sb_getblk_locked()
>   ext4: replace sb_bread*() with ext4_sb_bread*()
>   ext4: replace sb_getblk() with ext4_sb_getblk()
>   ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
>   ext4: replace sb_breadahead() with ext4_sb_breadahead()
>   ext4: abort the filesystem while freeing the write error io buffer
>   ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
> 
>  fs/buffer.c                 |  41 ++++++----
>  fs/ext4/balloc.c            |   6 +-
>  fs/ext4/ext4.h              |  60 ++++++++++++---
>  fs/ext4/extents.c           |  13 ++--
>  fs/ext4/ialloc.c            |   6 +-
>  fs/ext4/indirect.c          |  13 ++--
>  fs/ext4/inline.c            |   2 +-
>  fs/ext4/inode.c             |  53 +++++--------
>  fs/ext4/mmp.c               |   2 +-
>  fs/ext4/resize.c            |  24 +++---
>  fs/ext4/super.c             | 145 +++++++++++++++++++++++++++++++-----
>  fs/ext4/xattr.c             |   4 +-
>  fs/jbd2/transaction.c       |  20 +++--
>  include/linux/buffer_head.h |   1 +
>  include/linux/jbd2.h        |   3 +-
>  15 files changed, 277 insertions(+), 116 deletions(-)
>
Jan Kara June 8, 2020, 8:20 a.m. UTC | #2
Hello Yi!

On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> Background
> ==========
> 
> This patch set point to fix the inconsistency problem which has been
> discussed and partial fixed in [1].
> 
> Now, the problem is on the unstable storage which has a flaky transport
> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> the bad network environment), if we failed to async write metadata in
> background, the end write routine in block layer will clear the buffer's
> uptodate flag, but the data in such buffer is actually uptodate. Finally
> we may read "old && inconsistent" metadata from the disk when we get the
> buffer later because not only the uptodate flag was cleared but also we
> do not check the write io error flag, or even worse the buffer has been
> freed due to memory presure.
> 
> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> the checkpoint routine will check the write_io_error flag and abort the
> the journal if detect IO error. And in the journal recover case, the
> recover code will invoke sync_blockdev() after recover complete, it will
> also detect IO error and refuse to mount the filesystem.
> 
> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> containing valid data"), but it's not enough.

Before we go and complicate ext4 code like this, I'd like to understand
what is the desired outcome which doesn't seem to be mentioned here, in the
commit 7963e5ac90125, or in the discussion you reference. If you have a
flaky transport that gives you IO errors, IMO it is not a bussiness of the
filesystem to try to fix that. I just has to make sure it properly reports
errors to userspace and (depending of errors= configuration) shuts itself
down to limit further damage. This patch seems to try to mask those errors
and that's, in my opinion, rather futile (as in you can hardly ever deal
with all the cases). BTW are you running these systems on flaky iSCSI with
errors=continue so that the errors don't shut the filesystem down
immediately?

								Honza

> [1] https://lore.kernel.org/linux-ext4/20190823030207.GC8130@mit.edu/
> 
> Description
> ===========
> 
> This patch set add and rework 7 wrapper functions of getting metadata
> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
> the buffer isn't uptodate and write_io_error flag was set, which means
> that the buffer has been failed to write out to disk, re-add the
> uptodate flag to prevent subsequent read operation.
> 
>  - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
>    sb_getblk() used for newly allocated blocks and getting buffers.
>  - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
>    fix buffer uotpdate flag, use to replace all sb_getblk() used for
>    getting buffers to read.
>  - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
>  - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
>  - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
>    not uptodate.
>  - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
>  - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
>    except skip submit read bio if failed to lock the buffer.
> 
> Patch 1-2: do some small change in ext4 inode eio simulation and add a
> helper in buffer.c, just prepare for below patches.
> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
> flag in buffer.
> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
> shrinking a buffer with write_io_error flag.
> Patch 10: just do some cleanup.
> 
> After this patch set, we need to use above 7 wrapper functions to
> get/read metadata block instead of invoke sb_*() functions defined in
> fs/buffer.h.
> 
> Test
> ====
> 
> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
> in auto mode.
> 
> Thanks,
> Yi.
> 
> 
> zhangyi (F) (10):
>   ext4: move inode eio simulation behind io completeion
>   fs: pick out ll_rw_one_block() helper function
>   ext4: add ext4_sb_getblk*() wrapper functions
>   ext4: replace sb_getblk() with ext4_sb_getblk_locked()
>   ext4: replace sb_bread*() with ext4_sb_bread*()
>   ext4: replace sb_getblk() with ext4_sb_getblk()
>   ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
>   ext4: replace sb_breadahead() with ext4_sb_breadahead()
>   ext4: abort the filesystem while freeing the write error io buffer
>   ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
> 
>  fs/buffer.c                 |  41 ++++++----
>  fs/ext4/balloc.c            |   6 +-
>  fs/ext4/ext4.h              |  60 ++++++++++++---
>  fs/ext4/extents.c           |  13 ++--
>  fs/ext4/ialloc.c            |   6 +-
>  fs/ext4/indirect.c          |  13 ++--
>  fs/ext4/inline.c            |   2 +-
>  fs/ext4/inode.c             |  53 +++++--------
>  fs/ext4/mmp.c               |   2 +-
>  fs/ext4/resize.c            |  24 +++---
>  fs/ext4/super.c             | 145 +++++++++++++++++++++++++++++++-----
>  fs/ext4/xattr.c             |   4 +-
>  fs/jbd2/transaction.c       |  20 +++--
>  include/linux/buffer_head.h |   1 +
>  include/linux/jbd2.h        |   3 +-
>  15 files changed, 277 insertions(+), 116 deletions(-)
> 
> -- 
> 2.21.3
>
Zhang Yi June 8, 2020, 2:39 p.m. UTC | #3
Hi, Jan.

On 2020/6/8 16:20, Jan Kara wrote:
> Hello Yi!
> 
> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
>> Background
>> ==========
>>
>> This patch set point to fix the inconsistency problem which has been
>> discussed and partial fixed in [1].
>>
>> Now, the problem is on the unstable storage which has a flaky transport
>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
>> the bad network environment), if we failed to async write metadata in
>> background, the end write routine in block layer will clear the buffer's
>> uptodate flag, but the data in such buffer is actually uptodate. Finally
>> we may read "old && inconsistent" metadata from the disk when we get the
>> buffer later because not only the uptodate flag was cleared but also we
>> do not check the write io error flag, or even worse the buffer has been
>> freed due to memory presure.
>>
>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
>> the checkpoint routine will check the write_io_error flag and abort the
>> the journal if detect IO error. And in the journal recover case, the
>> recover code will invoke sync_blockdev() after recover complete, it will
>> also detect IO error and refuse to mount the filesystem.
>>
>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
>> containing valid data"), but it's not enough.
> 
> Before we go and complicate ext4 code like this, I'd like to understand
> what is the desired outcome which doesn't seem to be mentioned here, in the
> commit 7963e5ac90125, or in the discussion you reference. If you have a
> flaky transport that gives you IO errors, IMO it is not a bussiness of the
> filesystem to try to fix that. I just has to make sure it properly reports

If we meet some IO errors due to the flaky transport, IMO the desired outcome
is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
specified, if we set "errors=read-only || panic", we expect ext4 could remount
to read-only or panic immediately to avoid inconsistency. In brief, the kernel
should try best to guarantee the filesystem on disk is consistency, this will
reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
for most cases caused by the async error problem I mentioned), so we could
recover the fs automatically in next boot.

But now, in the case of metadata async writeback, (1) is done in
end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
metadata write error, and it also cannot remount the filesystem or invoke panic
immediately. Finally, if we read the metadata on disk and re-write again, it
may lead to on-disk filesystem inconsistency.

> errors to userspace and (depending of errors= configuration) shuts itself
> down to limit further damage. This patch seems to try to mask those errors
> and that's, in my opinion, rather futile (as in you can hardly ever deal
> with all the cases). BTW are you running these systems on flaky iSCSI with
> errors=continue so that the errors don't shut the filesystem down
> immediately?
> 
Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
with errors=read-only, in the cases mentioned above, the fs will not be
remount to read-only immediately or remount after it has already been
inconsistency.

Thinking about how to fix this, one method is to invoke ext4_error() or
jbd2_journal_abort() when we detect write error to prevent further use of
the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
I think although the metadata buffer was failed to write back to the disk due
to the occasional unstable network environment, but the data in the buffer
is actually uptodate, the filesystem could self-healing after the network
recovery. In the worst case, if the network is broken for a long time, the
jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
the journal to disk, both will abort the filesystem. So I think we could
re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.

Thanks,
Yi.

> 
>> [1] https://lore.kernel.org/linux-ext4/20190823030207.GC8130@mit.edu/
>>
>> Description
>> ===========
>>
>> This patch set add and rework 7 wrapper functions of getting metadata
>> blocks, replace all sb_bread() / sb_getblk*() / ext4_bread() and
>> sb_breadahead*(). Add buffer_write_io_error() checking into them, if
>> the buffer isn't uptodate and write_io_error flag was set, which means
>> that the buffer has been failed to write out to disk, re-add the
>> uptodate flag to prevent subsequent read operation.
>>
>>  - ext4_sb_getblk(): works the same as sb_getblk(), use to replace all
>>    sb_getblk() used for newly allocated blocks and getting buffers.
>>  - ext4_sb_getblk_locked(): works the same as sb_getblk() except check &
>>    fix buffer uotpdate flag, use to replace all sb_getblk() used for
>>    getting buffers to read.
>>  - ext4_sb_getblk_gfp(): gfp version of ext4_sb_getblk().
>>  - ext4_sb_getblk_locked_gfp(): gfp version of ext4_sb_getblk_locked().
>>  - ext4_sb_bread(): get buffer and submit read bio if buffer is actually
>>    not uptodate.
>>  - ext4_sb_bread_unmovable(): unmovable version of ext4_sb_bread().
>>  - ext4_sb_breadahead_unmovable(): works the same to ext4_sb_bread_unmovable()
>>    except skip submit read bio if failed to lock the buffer.
>>
>> Patch 1-2: do some small change in ext4 inode eio simulation and add a
>> helper in buffer.c, just prepare for below patches.
>> Patch 3: add the ext4_sb_*() function to deal with the write_io_error
>> flag in buffer.
>> Patch 4-8: replace all sb_*() with ext4_sb_*() in ext4.
>> Patch 9: deal with the buffer shrinking case, abort jbd2/fs when
>> shrinking a buffer with write_io_error flag.
>> Patch 10: just do some cleanup.
>>
>> After this patch set, we need to use above 7 wrapper functions to
>> get/read metadata block instead of invoke sb_*() functions defined in
>> fs/buffer.h.
>>
>> Test
>> ====
>>
>> This patch set is based on linux-5.7-rc7 and has been tests by xfstests
>> in auto mode.
>>
>> Thanks,
>> Yi.
>>
>>
>> zhangyi (F) (10):
>>   ext4: move inode eio simulation behind io completeion
>>   fs: pick out ll_rw_one_block() helper function
>>   ext4: add ext4_sb_getblk*() wrapper functions
>>   ext4: replace sb_getblk() with ext4_sb_getblk_locked()
>>   ext4: replace sb_bread*() with ext4_sb_bread*()
>>   ext4: replace sb_getblk() with ext4_sb_getblk()
>>   ext4: switch to use ext4_sb_getblk_locked() in ext4_getblk()
>>   ext4: replace sb_breadahead() with ext4_sb_breadahead()
>>   ext4: abort the filesystem while freeing the write error io buffer
>>   ext4: remove unused parameter in jbd2_journal_try_to_free_buffers()
>>
>>  fs/buffer.c                 |  41 ++++++----
>>  fs/ext4/balloc.c            |   6 +-
>>  fs/ext4/ext4.h              |  60 ++++++++++++---
>>  fs/ext4/extents.c           |  13 ++--
>>  fs/ext4/ialloc.c            |   6 +-
>>  fs/ext4/indirect.c          |  13 ++--
>>  fs/ext4/inline.c            |   2 +-
>>  fs/ext4/inode.c             |  53 +++++--------
>>  fs/ext4/mmp.c               |   2 +-
>>  fs/ext4/resize.c            |  24 +++---
>>  fs/ext4/super.c             | 145 +++++++++++++++++++++++++++++++-----
>>  fs/ext4/xattr.c             |   4 +-
>>  fs/jbd2/transaction.c       |  20 +++--
>>  include/linux/buffer_head.h |   1 +
>>  include/linux/jbd2.h        |   3 +-
>>  15 files changed, 277 insertions(+), 116 deletions(-)
>>
>> -- 
>> 2.21.3
>>
Jan Kara June 9, 2020, 12:19 p.m. UTC | #4
Hi Yi!

On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> > On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >> Background
> >> ==========
> >>
> >> This patch set point to fix the inconsistency problem which has been
> >> discussed and partial fixed in [1].
> >>
> >> Now, the problem is on the unstable storage which has a flaky transport
> >> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> >> the bad network environment), if we failed to async write metadata in
> >> background, the end write routine in block layer will clear the buffer's
> >> uptodate flag, but the data in such buffer is actually uptodate. Finally
> >> we may read "old && inconsistent" metadata from the disk when we get the
> >> buffer later because not only the uptodate flag was cleared but also we
> >> do not check the write io error flag, or even worse the buffer has been
> >> freed due to memory presure.
> >>
> >> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> >> the checkpoint routine will check the write_io_error flag and abort the
> >> the journal if detect IO error. And in the journal recover case, the
> >> recover code will invoke sync_blockdev() after recover complete, it will
> >> also detect IO error and refuse to mount the filesystem.
> >>
> >> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> >> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >> containing valid data"), but it's not enough.
> > 
> > Before we go and complicate ext4 code like this, I'd like to understand
> > what is the desired outcome which doesn't seem to be mentioned here, in the
> > commit 7963e5ac90125, or in the discussion you reference. If you have a
> > flaky transport that gives you IO errors, IMO it is not a bussiness of the
> > filesystem to try to fix that. I just has to make sure it properly reports
> 
> If we meet some IO errors due to the flaky transport, IMO the desired outcome
> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
> specified, if we set "errors=read-only || panic", we expect ext4 could remount
> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
> should try best to guarantee the filesystem on disk is consistency, this will
> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
> for most cases caused by the async error problem I mentioned), so we could
> recover the fs automatically in next boot.

Good, so I fully agree with your goals. Let's now talk about how to achieve
them :)

> But now, in the case of metadata async writeback, (1) is done in
> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
> metadata write error, and it also cannot remount the filesystem or invoke panic
> immediately. Finally, if we read the metadata on disk and re-write again, it
> may lead to on-disk filesystem inconsistency.

Ah, I see. This was the important bit I was missing. And I think the
real problem here is that ext4 cannot detect metadata write error from
async writeback. So my plan would be to detect metadata write errors early
and abort the journal and do appropriate errors=xxx handling. And a
relatively simple way how to do that these days would be to use errseq in
the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
gets incremented whenever there's writeback error in the block device
mapping so (probably in ext4_journal_check_start()) we could check whether
wb_err is different from the original value we sampled at mount time an if
yes, we know metadata writeback error has happened and we trigger the error
handling. What do you think?

> > errors to userspace and (depending of errors= configuration) shuts itself
> > down to limit further damage. This patch seems to try to mask those errors
> > and that's, in my opinion, rather futile (as in you can hardly ever deal
> > with all the cases). BTW are you running these systems on flaky iSCSI with
> > errors=continue so that the errors don't shut the filesystem down
> > immediately?
> > 
> Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
> with errors=read-only, in the cases mentioned above, the fs will not be
> remount to read-only immediately or remount after it has already been
> inconsistency.
> 
> Thinking about how to fix this, one method is to invoke ext4_error() or
> jbd2_journal_abort() when we detect write error to prevent further use of
> the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
> I think although the metadata buffer was failed to write back to the disk due
> to the occasional unstable network environment, but the data in the buffer
> is actually uptodate, the filesystem could self-healing after the network
> recovery. In the worst case, if the network is broken for a long time, the
> jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
> the journal to disk, both will abort the filesystem. So I think we could
> re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.

Yeah, but I'm actually against such self-healing logic. IMHO it is too
fragile and also fairly intrusive as your patches show. If we wanted
something like this, we'd need to put a hard thought into whether a
functionality like this belongs to ext4 or to some layer below it (e.g.
think how multipath handles temporary path failures). And even if we
decided it's worth the trouble in the filesystem, I'd rather go and change
how fs/buffer.c deals with buffer writeback errors than resetting uptodate
bits on buffers which just seems dangerous to me...

								Honza
Zhang Yi June 10, 2020, 8:55 a.m. UTC | #5
Hi, Jan.

On 2020/6/9 20:19, Jan Kara wrote> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
>>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
>>>> Background
>>>> ==========
>>>>
>>>> This patch set point to fix the inconsistency problem which has been
>>>> discussed and partial fixed in [1].
>>>>
>>>> Now, the problem is on the unstable storage which has a flaky transport
>>>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
>>>> the bad network environment), if we failed to async write metadata in
>>>> background, the end write routine in block layer will clear the buffer's
>>>> uptodate flag, but the data in such buffer is actually uptodate. Finally
>>>> we may read "old && inconsistent" metadata from the disk when we get the
>>>> buffer later because not only the uptodate flag was cleared but also we
>>>> do not check the write io error flag, or even worse the buffer has been
>>>> freed due to memory presure.
>>>>
>>>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
>>>> the checkpoint routine will check the write_io_error flag and abort the
>>>> the journal if detect IO error. And in the journal recover case, the
>>>> recover code will invoke sync_blockdev() after recover complete, it will
>>>> also detect IO error and refuse to mount the filesystem.
>>>>
>>>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
>>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
>>>> containing valid data"), but it's not enough.
>>>
>>> Before we go and complicate ext4 code like this, I'd like to understand
>>> what is the desired outcome which doesn't seem to be mentioned here, in the
>>> commit 7963e5ac90125, or in the discussion you reference. If you have a
>>> flaky transport that gives you IO errors, IMO it is not a bussiness of the
>>> filesystem to try to fix that. I just has to make sure it properly reports
>>
>> If we meet some IO errors due to the flaky transport, IMO the desired outcome
>> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
>> specified, if we set "errors=read-only || panic", we expect ext4 could remount
>> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
>> should try best to guarantee the filesystem on disk is consistency, this will
>> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
>> for most cases caused by the async error problem I mentioned), so we could
>> recover the fs automatically in next boot.
> 
> Good, so I fully agree with your goals. Let's now talk about how to achieve
> them :)
> 
>> But now, in the case of metadata async writeback, (1) is done in
>> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
>> metadata write error, and it also cannot remount the filesystem or invoke panic
>> immediately. Finally, if we read the metadata on disk and re-write again, it
>> may lead to on-disk filesystem inconsistency.
> 
> Ah, I see. This was the important bit I was missing. And I think the
> real problem here is that ext4 cannot detect metadata write error from
> async writeback. So my plan would be to detect metadata write errors early
> and abort the journal and do appropriate errors=xxx handling. And a
> relatively simple way how to do that these days would be to use errseq in
> the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
> gets incremented whenever there's writeback error in the block device
> mapping so (probably in ext4_journal_check_start()) we could check whether
> wb_err is different from the original value we sampled at mount time an if
> yes, we know metadata writeback error has happened and we trigger the error
> handling. What do you think?
> 

Thanks a lot for your suggestion, this solution looks good to me. But Ithink
add 'wb_err' checking into ext4_journal_check_start() maybe too early, see below
race condition (It's just theoretical analysis, I'm not test it):

ext4_journal_start()
 ext4_journal_check_start()  <-- pass checking
                                 |   end_buffer_async_write()
                                 |    mark_buffer_write_io_error() <-- set b_page
sb_getblk(bh)   <-- read old data from disk
ext4_journal_get_write_access(bh)
modify this bh  <-- modify data and lead to inconsistency
ext4_handle_dirty_metadata(bh)

So I guess it may still lead to inconsistency. How about add this checking
into ext4_journal_get_write_access() ?

>>> errors to userspace and (depending of errors= configuration) shuts itself
>>> down to limit further damage. This patch seems to try to mask those errors
>>> and that's, in my opinion, rather futile (as in you can hardly ever deal
>>> with all the cases). BTW are you running these systems on flaky iSCSI with
>>> errors=continue so that the errors don't shut the filesystem down
>>> immediately?
>>>
>> Yes, I run ext4 filesystem on a flaky iSCSI(it is stable most of the time)
>> with errors=read-only, in the cases mentioned above, the fs will not be
>> remount to read-only immediately or remount after it has already been
>> inconsistency.
>>
>> Thinking about how to fix this, one method is to invoke ext4_error() or
>> jbd2_journal_abort() when we detect write error to prevent further use of
>> the filesystem. But after looking at __ext4_get_inode_loc() and 7963e5ac90125,
>> I think although the metadata buffer was failed to write back to the disk due
>> to the occasional unstable network environment, but the data in the buffer
>> is actually uptodate, the filesystem could self-healing after the network
>> recovery. In the worst case, if the network is broken for a long time, the
>> jbd2's checkpoint routing will detect the error, or jbd2 will failed to write
>> the journal to disk, both will abort the filesystem. So I think we could
>> re-set the uptodate flag when we read buffer again as 7963e5ac90125 does.
> 
> Yeah, but I'm actually against such self-healing logic. IMHO it is too
> fragile and also fairly intrusive as your patches show. If we wanted
> something like this, we'd need to put a hard thought into whether a
> functionality like this belongs to ext4 or to some layer below it (e.g.
> think how multipath handles temporary path failures). And even if we
> decided it's worth the trouble in the filesystem, I'd rather go and change
> how fs/buffer.c deals with buffer writeback errors than resetting uptodate
> bits on buffers which just seems dangerous to me...
> 

Yeah, I see. Invoke error handlers as soon as we detect error flag could
minimize the risk.

Thanks,
Yi.
Jan Kara June 10, 2020, 9:57 a.m. UTC | #6
On Wed 10-06-20 16:55:15, zhangyi (F) wrote:
> Hi, Jan.
> 
> On 2020/6/9 20:19, Jan Kara wrote> On Mon 08-06-20 22:39:31, zhangyi (F) wrote:
> >>> On Tue 26-05-20 15:17:44, zhangyi (F) wrote:
> >>>> Background
> >>>> ==========
> >>>>
> >>>> This patch set point to fix the inconsistency problem which has been
> >>>> discussed and partial fixed in [1].
> >>>>
> >>>> Now, the problem is on the unstable storage which has a flaky transport
> >>>> (e.g. iSCSI transport may disconnect few seconds and reconnect due to
> >>>> the bad network environment), if we failed to async write metadata in
> >>>> background, the end write routine in block layer will clear the buffer's
> >>>> uptodate flag, but the data in such buffer is actually uptodate. Finally
> >>>> we may read "old && inconsistent" metadata from the disk when we get the
> >>>> buffer later because not only the uptodate flag was cleared but also we
> >>>> do not check the write io error flag, or even worse the buffer has been
> >>>> freed due to memory presure.
> >>>>
> >>>> Fortunately, if the jbd2 do checkpoint after async IO error happens,
> >>>> the checkpoint routine will check the write_io_error flag and abort the
> >>>> the journal if detect IO error. And in the journal recover case, the
> >>>> recover code will invoke sync_blockdev() after recover complete, it will
> >>>> also detect IO error and refuse to mount the filesystem.
> >>>>
> >>>> Current ext4 have already deal with this problem in __ext4_get_inode_loc()
> >>>> and commit 7963e5ac90125 ("ext4: treat buffers with write errors as
> >>>> containing valid data"), but it's not enough.
> >>>
> >>> Before we go and complicate ext4 code like this, I'd like to understand
> >>> what is the desired outcome which doesn't seem to be mentioned here, in the
> >>> commit 7963e5ac90125, or in the discussion you reference. If you have a
> >>> flaky transport that gives you IO errors, IMO it is not a bussiness of the
> >>> filesystem to try to fix that. I just has to make sure it properly reports
> >>
> >> If we meet some IO errors due to the flaky transport, IMO the desired outcome
> >> is 1) report IO error; 2) ext4 filesystem act as the "errors=xxx" configuration
> >> specified, if we set "errors=read-only || panic", we expect ext4 could remount
> >> to read-only or panic immediately to avoid inconsistency. In brief, the kernel
> >> should try best to guarantee the filesystem on disk is consistency, this will
> >> reduce fsck's work (AFAIK, the fsck cannot fix those inconsistent in auto mode
> >> for most cases caused by the async error problem I mentioned), so we could
> >> recover the fs automatically in next boot.
> > 
> > Good, so I fully agree with your goals. Let's now talk about how to achieve
> > them :)
> > 
> >> But now, in the case of metadata async writeback, (1) is done in
> >> end_buffer_async_write(), but (2) is not guaranteed, because ext4 cannot detect
> >> metadata write error, and it also cannot remount the filesystem or invoke panic
> >> immediately. Finally, if we read the metadata on disk and re-write again, it
> >> may lead to on-disk filesystem inconsistency.
> > 
> > Ah, I see. This was the important bit I was missing. And I think the
> > real problem here is that ext4 cannot detect metadata write error from
> > async writeback. So my plan would be to detect metadata write errors early
> > and abort the journal and do appropriate errors=xxx handling. And a
> > relatively simple way how to do that these days would be to use errseq in
> > the block device's mapping - sb->s_bdev->bd_inode->i_mapping->wb_err - that
> > gets incremented whenever there's writeback error in the block device
> > mapping so (probably in ext4_journal_check_start()) we could check whether
> > wb_err is different from the original value we sampled at mount time an if
> > yes, we know metadata writeback error has happened and we trigger the error
> > handling. What do you think?
> > 
> 
> Thanks a lot for your suggestion, this solution looks good to me. But Ithink
> add 'wb_err' checking into ext4_journal_check_start() maybe too early, see below
> race condition (It's just theoretical analysis, I'm not test it):
> 
> ext4_journal_start()
>  ext4_journal_check_start()  <-- pass checking
>                                  |   end_buffer_async_write()
>                                  |    mark_buffer_write_io_error() <-- set b_page
> sb_getblk(bh)   <-- read old data from disk
> ext4_journal_get_write_access(bh)
> modify this bh  <-- modify data and lead to inconsistency
> ext4_handle_dirty_metadata(bh)
> 
> So I guess it may still lead to inconsistency. How about add this checking
> into ext4_journal_get_write_access() ?

Yes, this also occured to me later. Adding the check to
ext4_journal_get_write_access() should be safer.

								Honza
Theodore Ts'o June 10, 2020, 3:45 p.m. UTC | #7
On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> > So I guess it may still lead to inconsistency. How about add this checking
> > into ext4_journal_get_write_access() ?
> 
> Yes, this also occured to me later. Adding the check to
> ext4_journal_get_write_access() should be safer.

There's another thing which we could do.  One of the issues is that we
allow buffered writeback for block devices once the change to the
block has been committed.  What if we add a change to block device
writeback code and in fs/buffer.c so that optionally, the file system
can specify a callback function can get called when an I/O error has
been reflected back up from the block layer?

It seems unfortunate that currently, we can immediately report the I/O
error for buffered writes to *files*, but for metadata blocks, we
would only be able to report the problem when we next try to modify
it.

Making changes to fs/buffer.c might be controversial, but I think it
might be result in a better solution.

     	       	       	    	      - Ted
Jan Kara June 10, 2020, 4:27 p.m. UTC | #8
On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> > > So I guess it may still lead to inconsistency. How about add this checking
> > > into ext4_journal_get_write_access() ?
> > 
> > Yes, this also occured to me later. Adding the check to
> > ext4_journal_get_write_access() should be safer.
> 
> There's another thing which we could do.  One of the issues is that we
> allow buffered writeback for block devices once the change to the
> block has been committed.  What if we add a change to block device
> writeback code and in fs/buffer.c so that optionally, the file system
> can specify a callback function can get called when an I/O error has
> been reflected back up from the block layer?
> 
> It seems unfortunate that currently, we can immediately report the I/O
> error for buffered writes to *files*, but for metadata blocks, we
> would only be able to report the problem when we next try to modify
> it.
> 
> Making changes to fs/buffer.c might be controversial, but I think it
> might be result in a better solution.

Yeah, what you propose certainly makes sence could be relatively easily
done by blkdev_writepage() using __block_write_full_page() with appropriate
endio handler which calls fs callback. I'm just not sure how propagate the
callback function from the fs to the blkdev...

								Honza
Zhang Yi June 11, 2020, 2:12 a.m. UTC | #9
On 2020/6/11 0:27, Jan Kara wrote:
> On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
>> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
>>>> So I guess it may still lead to inconsistency. How about add this checking
>>>> into ext4_journal_get_write_access() ?
>>>
>>> Yes, this also occured to me later. Adding the check to
>>> ext4_journal_get_write_access() should be safer.
>>
>> There's another thing which we could do.  One of the issues is that we
>> allow buffered writeback for block devices once the change to the
>> block has been committed.  What if we add a change to block device
>> writeback code and in fs/buffer.c so that optionally, the file system
>> can specify a callback function can get called when an I/O error has
>> been reflected back up from the block layer?
>>
>> It seems unfortunate that currently, we can immediately report the I/O
>> error for buffered writes to *files*, but for metadata blocks, we
>> would only be able to report the problem when we next try to modify
>> it.
>>
>> Making changes to fs/buffer.c might be controversial, but I think it
>> might be result in a better solution.
> 
> Yeah, what you propose certainly makes sence could be relatively easily
> done by blkdev_writepage() using __block_write_full_page() with appropriate
> endio handler which calls fs callback. I'm just not sure how propagate the
> callback function from the fs to the blkdev...
>

I have thought about this solution, we could add a hook in 'struct super_operations'
and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
wrapper from block_write_full_page() to pass our endio handler in, something like
this.

static const struct super_operations ext4_sops = {
...
	.bdev_write_page = ext4_bdev_write_page,
...
};

static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
{
	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;

	if (super && super->s_op->bdev_write_page)
		return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);

	return block_write_full_page(page, blkdev_get_block, wbc);
}

But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
solution now ?

Thanks,
Yi.
Jan Kara June 11, 2020, 8:21 a.m. UTC | #10
On Thu 11-06-20 10:12:45, zhangyi (F) wrote:
> On 2020/6/11 0:27, Jan Kara wrote:
> > On Wed 10-06-20 11:45:43, Theodore Y. Ts'o wrote:
> >> On Wed, Jun 10, 2020 at 11:57:39AM +0200, Jan Kara wrote:
> >>>> So I guess it may still lead to inconsistency. How about add this checking
> >>>> into ext4_journal_get_write_access() ?
> >>>
> >>> Yes, this also occured to me later. Adding the check to
> >>> ext4_journal_get_write_access() should be safer.
> >>
> >> There's another thing which we could do.  One of the issues is that we
> >> allow buffered writeback for block devices once the change to the
> >> block has been committed.  What if we add a change to block device
> >> writeback code and in fs/buffer.c so that optionally, the file system
> >> can specify a callback function can get called when an I/O error has
> >> been reflected back up from the block layer?
> >>
> >> It seems unfortunate that currently, we can immediately report the I/O
> >> error for buffered writes to *files*, but for metadata blocks, we
> >> would only be able to report the problem when we next try to modify
> >> it.
> >>
> >> Making changes to fs/buffer.c might be controversial, but I think it
> >> might be result in a better solution.
> > 
> > Yeah, what you propose certainly makes sence could be relatively easily
> > done by blkdev_writepage() using __block_write_full_page() with appropriate
> > endio handler which calls fs callback. I'm just not sure how propagate the
> > callback function from the fs to the blkdev...
> >
> 
> I have thought about this solution, we could add a hook in 'struct super_operations'
> and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
> wrapper from block_write_full_page() to pass our endio handler in, something like
> this.
> 
> static const struct super_operations ext4_sops = {
> ...
> 	.bdev_write_page = ext4_bdev_write_page,
> ...
> };
> 
> static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
> {
> 	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
> 
> 	if (super && super->s_op->bdev_write_page)
> 		return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
> 
> 	return block_write_full_page(page, blkdev_get_block, wbc);
> }
> 
> But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
> solution now ?

The above idea looks good to me. I'm fine with either that solution or
"wb_err" idea so maybe let's leave it for Ted to decide...

								Honza
Theodore Ts'o June 11, 2020, 4:55 p.m. UTC | #11
On Thu, Jun 11, 2020 at 10:21:03AM +0200, Jan Kara wrote:
> > I have thought about this solution, we could add a hook in 'struct super_operations'
> > and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
> > wrapper from block_write_full_page() to pass our endio handler in, something like
> > this.
> > 
> > static const struct super_operations ext4_sops = {
> > ...
> > 	.bdev_write_page = ext4_bdev_write_page,
> > ...
> > };
> > 
> > static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
> > {
> > 	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
> > 
> > 	if (super && super->s_op->bdev_write_page)
> > 		return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
> > 
> > 	return block_write_full_page(page, blkdev_get_block, wbc);
> > }
> > 
> > But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
> > solution now ?
> 
> The above idea looks good to me. I'm fine with either that solution or
> "wb_err" idea so maybe let's leave it for Ted to decide...

My preference would be to be able to get the (error from the callback
right away.  My reasoning behind that is (a) it allows the file system
to be notified about the problem right away, (b) in the case of a file
system resize, we _really_ want to know about the failure ASAP, so we
can fail the resize before we start allocating inodes and blocks to
use the new space, and (c) over time, we might be able to add some
more intelligence handling of some write errors.

For example, we already have a way of handling CRC errors when we are
reading an allocation bitmap; we simply avoid allocating blocks and
inodes from that blockgroup.  Over time, we could theoretically do
other things to try to recover from some write errors --- for example,
we could try allocating a new block for an extent tree block, and try
writing it, and if that succeeds, updating its parent node to point at
the new location.  Is it worth it to try to add that kind of
complexity?  I'm really not sure; at the end of the day, it might be
simpler to just call ext4_error() and abort using the entire file
system until a system administrator can sort out the mess.  But I
think (a) and (b) are still reasons for doing this by intercepting the
writeback error from the buffer head.

Cheers,

						- Ted
Zhang Yi June 12, 2020, 11:13 a.m. UTC | #12
On 2020/6/12 0:55, Theodore Y. Ts'o wrote:
> On Thu, Jun 11, 2020 at 10:21:03AM +0200, Jan Kara wrote:
>>> I have thought about this solution, we could add a hook in 'struct super_operations'
>>> and call it in blkdev_writepage() like blkdev_releasepage() does, and pick out a
>>> wrapper from block_write_full_page() to pass our endio handler in, something like
>>> this.
>>>
>>> static const struct super_operations ext4_sops = {
>>> ...
>>> 	.bdev_write_page = ext4_bdev_write_page,
>>> ...
>>> };
>>>
>>> static int blkdev_writepage(struct page *page, struct writeback_control *wbc)
>>> {
>>> 	struct super_block *super = BDEV_I(page->mapping->host)->bdev.bd_super;
>>>
>>> 	if (super && super->s_op->bdev_write_page)
>>> 		return super->s_op->bdev_write_page(page, blkdev_get_block, wbc);
>>>
>>> 	return block_write_full_page(page, blkdev_get_block, wbc);
>>> }
>>>
>>> But I'm not sure it's a optimal ieda. So I continue to realize the "wb_err"
>>> solution now ?
>>
>> The above idea looks good to me. I'm fine with either that solution or
>> "wb_err" idea so maybe let's leave it for Ted to decide...
> 
> My preference would be to be able to get the (error from the callback
> right away.  My reasoning behind that is (a) it allows the file system
> to be notified about the problem right away, (b) in the case of a file
> system resize, we _really_ want to know about the failure ASAP, so we
> can fail the resize before we start allocating inodes and blocks to
> use the new space, and (c) over time, we might be able to add some
> more intelligence handling of some write errors.
> 
> For example, we already have a way of handling CRC errors when we are
> reading an allocation bitmap; we simply avoid allocating blocks and
> inodes from that blockgroup.  Over time, we could theoretically do
> other things to try to recover from some write errors --- for example,
> we could try allocating a new block for an extent tree block, and try
> writing it, and if that succeeds, updating its parent node to point at
> the new location.  Is it worth it to try to add that kind of
> complexity?  I'm really not sure; at the end of the day, it might be
> simpler to just call ext4_error() and abort using the entire file
> system until a system administrator can sort out the mess.  But I
> think (a) and (b) are still reasons for doing this by intercepting the
> writeback error from the buffer head.
> 

Yeah, it make sense to me, I will realize this callback solution.

Thanks,
Yi.