diff mbox series

[v2] ext4: Fix potential data lost in recovering journal raced with synchronizing fs bdev

Message ID 20230908124317.2955345-1-chengzhihao1@huawei.com
State Superseded
Headers show
Series [v2] ext4: Fix potential data lost in recovering journal raced with synchronizing fs bdev | expand

Commit Message

Zhihao Cheng Sept. 8, 2023, 12:43 p.m. UTC
JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
however, other process could intercept the EIO information from bdev's
mapping, which leads journal recovering successful even EIO occurs during
data written back to fs device.

We found this problem in our product, iscsi + multipath is chosen for block
device of ext4. Unstable network may trigger kpartx to rescan partitions in
device mapper layer. Detailed process is shown as following:

  mount          kpartx          irq
jbd2_journal_recover
 do_one_pass
  memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
  mark_buffer_dirty // mark bh dirty
         vfs_read
	  generic_file_read_iter // dio
	   filemap_write_and_wait_range
	    __filemap_fdatawrite_range
	     do_writepages
	      block_write_full_folio
	       submit_bh_wbc
	            >>  EIO occurs in disk  <<
	                     end_buffer_async_write
			      mark_buffer_write_io_error
			       mapping_set_error
			        set_bit(AS_EIO, &mapping->flags) // set!
	    filemap_check_errors
	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
 err2 = sync_blockdev
  filemap_write_and_wait
   filemap_check_errors
    test_and_clear_bit(AS_EIO, &mapping->flags) // false
 err2 = 0

Filesystem is mounted successfully even data from journal is failed written
into disk, and ext4 could become corrupted.

Fix it by comparing 'sbi->s_bdev_wb_err' before loading journal and after
loading journal.

Fetch a reproducer in [Link].

Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
Cc: stable@vger.kernel.org
Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 v1->v2: Checks wb_err from block device only in ext4.
 fs/ext4/super.c | 22 +++++++++++++++-------
 1 file changed, 15 insertions(+), 7 deletions(-)

Comments

Zhang Yi Sept. 9, 2023, 3:41 a.m. UTC | #1
Hello!

On 2023/9/8 20:43, Zhihao Cheng wrote:
> JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
> however, other process could intercept the EIO information from bdev's
> mapping, which leads journal recovering successful even EIO occurs during
> data written back to fs device.
> 
> We found this problem in our product, iscsi + multipath is chosen for block
> device of ext4. Unstable network may trigger kpartx to rescan partitions in
> device mapper layer. Detailed process is shown as following:
> 
>   mount          kpartx          irq
> jbd2_journal_recover
>  do_one_pass
>   memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
>   mark_buffer_dirty // mark bh dirty
>          vfs_read
> 	  generic_file_read_iter // dio
> 	   filemap_write_and_wait_range
> 	    __filemap_fdatawrite_range
> 	     do_writepages
> 	      block_write_full_folio
> 	       submit_bh_wbc
> 	            >>  EIO occurs in disk  <<
> 	                     end_buffer_async_write
> 			      mark_buffer_write_io_error
> 			       mapping_set_error
> 			        set_bit(AS_EIO, &mapping->flags) // set!
> 	    filemap_check_errors
> 	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
>  err2 = sync_blockdev
>   filemap_write_and_wait
>    filemap_check_errors
>     test_and_clear_bit(AS_EIO, &mapping->flags) // false
>  err2 = 0
> 
> Filesystem is mounted successfully even data from journal is failed written
> into disk, and ext4 could become corrupted.
> 
> Fix it by comparing 'sbi->s_bdev_wb_err' before loading journal and after
> loading journal.
> 
> Fetch a reproducer in [Link].
> 
> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
> Cc: stable@vger.kernel.org
> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  v1->v2: Checks wb_err from block device only in ext4.
>  fs/ext4/super.c | 22 +++++++++++++++-------
>  1 file changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 38217422f938..4dcaad2403be 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -4907,6 +4907,14 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>  	if (err)
>  		return err;
>  
> +	err = errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> +				       &sbi->s_bdev_wb_err);
> +	if (err) {
> +		ext4_msg(sb, KERN_ERR, "Background error %d when loading journal",
> +			 err);
> +		goto out;
> +	}
> +

This solution cannot solve the problem, because the journal tail is
still updated in journal_reset() even if we detect the writeback error
and refuse to mount the ext4 filesystem here. So I suppose we have to
check the I/O error by jbd2 module itself like v1 does.

Thanks,
Yi.

>  	if (ext4_has_feature_64bit(sb) &&
>  	    !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
>  				       JBD2_FEATURE_INCOMPAT_64BIT)) {
> @@ -5365,6 +5373,13 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  			goto failed_mount3a;
>  	}
>  
> +	/*
> +	 * Save the original bdev mapping's wb_err value which could be
> +	 * used to detect the metadata async write error.
> +	 */
> +	spin_lock_init(&sbi->s_bdev_wb_lock);
> +	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> +				 &sbi->s_bdev_wb_err);
>  	err = -EINVAL;
>  	/*
>  	 * The first inode we look at is the journal inode.  Don't try
> @@ -5571,13 +5586,6 @@ static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
>  	}
>  #endif  /* CONFIG_QUOTA */
>  
> -	/*
> -	 * Save the original bdev mapping's wb_err value which could be
> -	 * used to detect the metadata async write error.
> -	 */
> -	spin_lock_init(&sbi->s_bdev_wb_lock);
> -	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> -				 &sbi->s_bdev_wb_err);
>  	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
>  	ext4_orphan_cleanup(sb, es);
>  	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;
>
Jan Kara Sept. 11, 2023, 4:18 p.m. UTC | #2
Hello!

On Sat 09-09-23 11:41:11, Zhang Yi wrote:
> On 2023/9/8 20:43, Zhihao Cheng wrote:
> > JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
> > however, other process could intercept the EIO information from bdev's
> > mapping, which leads journal recovering successful even EIO occurs during
> > data written back to fs device.
> > 
> > We found this problem in our product, iscsi + multipath is chosen for block
> > device of ext4. Unstable network may trigger kpartx to rescan partitions in
> > device mapper layer. Detailed process is shown as following:
> > 
> >   mount          kpartx          irq
> > jbd2_journal_recover
> >  do_one_pass
> >   memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
> >   mark_buffer_dirty // mark bh dirty
> >          vfs_read
> > 	  generic_file_read_iter // dio
> > 	   filemap_write_and_wait_range
> > 	    __filemap_fdatawrite_range
> > 	     do_writepages
> > 	      block_write_full_folio
> > 	       submit_bh_wbc
> > 	            >>  EIO occurs in disk  <<
> > 	                     end_buffer_async_write
> > 			      mark_buffer_write_io_error
> > 			       mapping_set_error
> > 			        set_bit(AS_EIO, &mapping->flags) // set!
> > 	    filemap_check_errors
> > 	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
> >  err2 = sync_blockdev
> >   filemap_write_and_wait
> >    filemap_check_errors
> >     test_and_clear_bit(AS_EIO, &mapping->flags) // false
> >  err2 = 0
> > 
> > Filesystem is mounted successfully even data from journal is failed written
> > into disk, and ext4 could become corrupted.
> > 
> > Fix it by comparing 'sbi->s_bdev_wb_err' before loading journal and after
> > loading journal.
> > 
> > Fetch a reproducer in [Link].
> > 
> > Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
> > Cc: stable@vger.kernel.org
> > Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  v1->v2: Checks wb_err from block device only in ext4.
> >  fs/ext4/super.c | 22 +++++++++++++++-------
> >  1 file changed, 15 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> > index 38217422f938..4dcaad2403be 100644
> > --- a/fs/ext4/super.c
> > +++ b/fs/ext4/super.c
> > @@ -4907,6 +4907,14 @@ static int ext4_load_and_init_journal(struct super_block *sb,
> >  	if (err)
> >  		return err;
> >  
> > +	err = errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
> > +				       &sbi->s_bdev_wb_err);
> > +	if (err) {
> > +		ext4_msg(sb, KERN_ERR, "Background error %d when loading journal",
> > +			 err);
> > +		goto out;
> > +	}
> > +
> 
> This solution cannot solve the problem, because the journal tail is
> still updated in journal_reset() even if we detect the writeback error
> and refuse to mount the ext4 filesystem here. So I suppose we have to
> check the I/O error by jbd2 module itself like v1 does.

Hum, that's a good point because next time we will try to mount the fs we
will not try to replay the journal anymore. So let's return to v1 and I'm
sorry for misguiding you Zhihao.

But when we are doing background IO error detection in jbd2 during journal
replay, I'm wondering whether we shouldn't be doing something similar in
checkpointing code - like when we are about to remove a transaction from
the journal. And as I'm checking we already do that using
JBD2_CHECKPOINT_IO_ERROR bit handling - maybe we could replace that with a
more standard errseq mechanism that is available these days as a cleanup?

And the ext4 handling in ext4_check_bdev_write_error() is useful only in
nojournal mode as otherwise jbd2 is taking care of all writeback errors
including the background ones. So maybe we can guard the
ext4_check_bdev_write_error() by a !ext4_handle_valid(handle) check to make
that obvious (and comment about that).

What do you think?

								Honza
Zhihao Cheng Sept. 12, 2023, 3:52 a.m. UTC | #3
在 2023/9/12 0:18, Jan Kara 写道:
> Hello!
> 
> On Sat 09-09-23 11:41:11, Zhang Yi wrote:
>> On 2023/9/8 20:43, Zhihao Cheng wrote:
>>> JBD2 makes sure journal data is fallen on fs device by sync_blockdev(),
>>> however, other process could intercept the EIO information from bdev's
>>> mapping, which leads journal recovering successful even EIO occurs during
>>> data written back to fs device.
>>>
>>> We found this problem in our product, iscsi + multipath is chosen for block
>>> device of ext4. Unstable network may trigger kpartx to rescan partitions in
>>> device mapper layer. Detailed process is shown as following:
>>>
>>>    mount          kpartx          irq
>>> jbd2_journal_recover
>>>   do_one_pass
>>>    memcpy(nbh->b_data, obh->b_data) // copy data to fs dev from journal
>>>    mark_buffer_dirty // mark bh dirty
>>>           vfs_read
>>> 	  generic_file_read_iter // dio
>>> 	   filemap_write_and_wait_range
>>> 	    __filemap_fdatawrite_range
>>> 	     do_writepages
>>> 	      block_write_full_folio
>>> 	       submit_bh_wbc
>>> 	            >>  EIO occurs in disk  <<
>>> 	                     end_buffer_async_write
>>> 			      mark_buffer_write_io_error
>>> 			       mapping_set_error
>>> 			        set_bit(AS_EIO, &mapping->flags) // set!
>>> 	    filemap_check_errors
>>> 	     test_and_clear_bit(AS_EIO, &mapping->flags) // clear!
>>>   err2 = sync_blockdev
>>>    filemap_write_and_wait
>>>     filemap_check_errors
>>>      test_and_clear_bit(AS_EIO, &mapping->flags) // false
>>>   err2 = 0
>>>
>>> Filesystem is mounted successfully even data from journal is failed written
>>> into disk, and ext4 could become corrupted.
>>>
>>> Fix it by comparing 'sbi->s_bdev_wb_err' before loading journal and after
>>> loading journal.
>>>
>>> Fetch a reproducer in [Link].
>>>
>>> Link: https://bugzilla.kernel.org/show_bug.cgi?id=217888
>>> Cc: stable@vger.kernel.org
>>> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>> ---
>>>   v1->v2: Checks wb_err from block device only in ext4.
>>>   fs/ext4/super.c | 22 +++++++++++++++-------
>>>   1 file changed, 15 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>>> index 38217422f938..4dcaad2403be 100644
>>> --- a/fs/ext4/super.c
>>> +++ b/fs/ext4/super.c
>>> @@ -4907,6 +4907,14 @@ static int ext4_load_and_init_journal(struct super_block *sb,
>>>   	if (err)
>>>   		return err;
>>>   
>>> +	err = errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
>>> +				       &sbi->s_bdev_wb_err);
>>> +	if (err) {
>>> +		ext4_msg(sb, KERN_ERR, "Background error %d when loading journal",
>>> +			 err);
>>> +		goto out;
>>> +	}
>>> +
>>
>> This solution cannot solve the problem, because the journal tail is
>> still updated in journal_reset() even if we detect the writeback error
>> and refuse to mount the ext4 filesystem here. So I suppose we have to
>> check the I/O error by jbd2 module itself like v1 does.
> 
> Hum, that's a good point because next time we will try to mount the fs we
> will not try to replay the journal anymore. So let's return to v1 and I'm
> sorry for misguiding you Zhihao.
> 
> But when we are doing background IO error detection in jbd2 during journal
> replay, I'm wondering whether we shouldn't be doing something similar in
> checkpointing code - like when we are about to remove a transaction from
> the journal. And as I'm checking we already do that using
> JBD2_CHECKPOINT_IO_ERROR bit handling - maybe we could replace that with a
> more standard errseq mechanism that is available these days as a cleanup?
>  > And the ext4 handling in ext4_check_bdev_write_error() is useful only in
> nojournal mode as otherwise jbd2 is taking care of all writeback errors
> including the background ones. So maybe we can guard the
> ext4_check_bdev_write_error() by a !ext4_handle_valid(handle) check to make
> that obvious (and comment about that). >
> What do you think?
> 
> 								Honza
> 

Ok, I will try it. Thanks for suggestions.
diff mbox series

Patch

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 38217422f938..4dcaad2403be 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -4907,6 +4907,14 @@  static int ext4_load_and_init_journal(struct super_block *sb,
 	if (err)
 		return err;
 
+	err = errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
+				       &sbi->s_bdev_wb_err);
+	if (err) {
+		ext4_msg(sb, KERN_ERR, "Background error %d when loading journal",
+			 err);
+		goto out;
+	}
+
 	if (ext4_has_feature_64bit(sb) &&
 	    !jbd2_journal_set_features(EXT4_SB(sb)->s_journal, 0, 0,
 				       JBD2_FEATURE_INCOMPAT_64BIT)) {
@@ -5365,6 +5373,13 @@  static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 			goto failed_mount3a;
 	}
 
+	/*
+	 * Save the original bdev mapping's wb_err value which could be
+	 * used to detect the metadata async write error.
+	 */
+	spin_lock_init(&sbi->s_bdev_wb_lock);
+	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
+				 &sbi->s_bdev_wb_err);
 	err = -EINVAL;
 	/*
 	 * The first inode we look at is the journal inode.  Don't try
@@ -5571,13 +5586,6 @@  static int __ext4_fill_super(struct fs_context *fc, struct super_block *sb)
 	}
 #endif  /* CONFIG_QUOTA */
 
-	/*
-	 * Save the original bdev mapping's wb_err value which could be
-	 * used to detect the metadata async write error.
-	 */
-	spin_lock_init(&sbi->s_bdev_wb_lock);
-	errseq_check_and_advance(&sb->s_bdev->bd_inode->i_mapping->wb_err,
-				 &sbi->s_bdev_wb_err);
 	EXT4_SB(sb)->s_mount_state |= EXT4_ORPHAN_FS;
 	ext4_orphan_cleanup(sb, es);
 	EXT4_SB(sb)->s_mount_state &= ~EXT4_ORPHAN_FS;