diff mbox series

fs/buffer.c: Fix data corruption when buffer write with IO error

Message ID 1554534793-31444-1-git-send-email-zhangxiaoxu5@huawei.com
State New
Headers show
Series fs/buffer.c: Fix data corruption when buffer write with IO error | expand

Commit Message

zhangxiaoxu (A) April 6, 2019, 7:13 a.m. UTC
When the buffer write failed, 'end_buffer_write_sync' and
'end_buffer_async_write' will clear the uptodate flag. But the
data in the buffer maybe newer than disk. In some case, this
will lead data corruption.

For example: ext4 flush metadata to disk failed, it will clear
the uptodate flag. when a new coming call want the buffer, it will
read it from the disk(because the buffer no uptodate flag). But
the journal not checkpoint now, it will read old data from disk.
If read successfully, ext4 will write the old data to the new
journal, the data will corruption.

So, don't clear the uptodate flag when write the buffer failed.

Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
---
 fs/buffer.c | 2 --
 1 file changed, 2 deletions(-)

Comments

Jan Kara April 8, 2019, 11:11 a.m. UTC | #1
On Sat 06-04-19 15:13:13, ZhangXiaoxu wrote:
> When the buffer write failed, 'end_buffer_write_sync' and
> 'end_buffer_async_write' will clear the uptodate flag. But the
> data in the buffer maybe newer than disk. In some case, this
> will lead data corruption.
> 
> For example: ext4 flush metadata to disk failed, it will clear
> the uptodate flag. when a new coming call want the buffer, it will
> read it from the disk(because the buffer no uptodate flag). But
> the journal not checkpoint now, it will read old data from disk.
> If read successfully, ext4 will write the old data to the new
> journal, the data will corruption.
> 
> So, don't clear the uptodate flag when write the buffer failed.
> 
> Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>

Thanks for the patch. But what are the chances that after the write has
failed the read will succeed? Also there were places that were using
buffer_uptodate() to detect IO errors. Did you check all those got
converted to using buffer_write_io_error() instead?

								Honza

> ---
>  fs/buffer.c | 2 --
>  1 file changed, 2 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index ce35760..9fe1827 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -172,7 +172,6 @@ void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
>  	} else {
>  		buffer_io_error(bh, ", lost sync page write");
>  		mark_buffer_write_io_error(bh);
> -		clear_buffer_uptodate(bh);
>  	}
>  	unlock_buffer(bh);
>  	put_bh(bh);
> @@ -325,7 +324,6 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
>  	} else {
>  		buffer_io_error(bh, ", lost async page write");
>  		mark_buffer_write_io_error(bh);
> -		clear_buffer_uptodate(bh);
>  		SetPageError(page);
>  	}
>  
> -- 
> 2.7.4
>
zhangxiaoxu (A) April 9, 2019, 1:11 p.m. UTC | #2
On 4/8/2019 7:11 PM, Jan Kara wrote:
> On Sat 06-04-19 15:13:13, ZhangXiaoxu wrote:
>> When the buffer write failed, 'end_buffer_write_sync' and
>> 'end_buffer_async_write' will clear the uptodate flag. But the
>> data in the buffer maybe newer than disk. In some case, this
>> will lead data corruption.
>>
>> For example: ext4 flush metadata to disk failed, it will clear
>> the uptodate flag. when a new coming call want the buffer, it will
>> read it from the disk(because the buffer no uptodate flag). But
>> the journal not checkpoint now, it will read old data from disk.
>> If read successfully, ext4 will write the old data to the new
>> journal, the data will corruption.
>>
>> So, don't clear the uptodate flag when write the buffer failed.
>>
>> Signed-off-by: ZhangXiaoxu<zhangxiaoxu5@huawei.com>
> Thanks for the patch. But what are the chances that after the write has
> failed the read will succeed? Also there were places that were using
> buffer_uptodate() to detect IO errors. Did you check all those got
> converted to using buffer_write_io_error() instead?
> 
> 								Honza
> 
I encountered this situation when using iscsi target as the ext4 volume,
if the network down a while when ext4 write metadata to disk, and maybe
read will success after the network recovered.
In my testcase, just create and delete files, when disconnect the network,
the dir entry maybe corruption. I add some logs when read entry buffer from
disk, the buffer stats maybe buffer_write_io_error() and !buffer_uptodate().
In this case, the ext4 will corruption high probability. Because the buffer
is read from disk, and some newer information just on journal. In this case,
we should not read buffer from the disk.

I don't know any other filesystem how to use the uptodate flag. But I think
uptodate means the buffer is valid and newer than disk, am I right?

I just test ext4 use the xfstest.

Any other idea about my problem?
Jan Kara April 9, 2019, 4:06 p.m. UTC | #3
On Tue 09-04-19 21:11:22, zhangxiaoxu (A) wrote:
> 
> 
> On 4/8/2019 7:11 PM, Jan Kara wrote:
> > On Sat 06-04-19 15:13:13, ZhangXiaoxu wrote:
> > > When the buffer write failed, 'end_buffer_write_sync' and
> > > 'end_buffer_async_write' will clear the uptodate flag. But the
> > > data in the buffer maybe newer than disk. In some case, this
> > > will lead data corruption.
> > > 
> > > For example: ext4 flush metadata to disk failed, it will clear
> > > the uptodate flag. when a new coming call want the buffer, it will
> > > read it from the disk(because the buffer no uptodate flag). But
> > > the journal not checkpoint now, it will read old data from disk.
> > > If read successfully, ext4 will write the old data to the new
> > > journal, the data will corruption.
> > > 
> > > So, don't clear the uptodate flag when write the buffer failed.
> > > 
> > > Signed-off-by: ZhangXiaoxu<zhangxiaoxu5@huawei.com>
> > Thanks for the patch. But what are the chances that after the write has
> > failed the read will succeed? Also there were places that were using
> > buffer_uptodate() to detect IO errors. Did you check all those got
> > converted to using buffer_write_io_error() instead?
> > 
> > 								Honza
> > 
> I encountered this situation when using iscsi target as the ext4 volume,
> if the network down a while when ext4 write metadata to disk, and maybe
> read will success after the network recovered.

OK, but then you are running in errors=continue mode, aren't you? Because
otherwise the filesystem would get remounted read-only or panic the system
on the first metadata IO error.

> In my testcase, just create and delete files, when disconnect the network,
> the dir entry maybe corruption. I add some logs when read entry buffer from
> disk, the buffer stats maybe buffer_write_io_error() and !buffer_uptodate().
> In this case, the ext4 will corruption high probability. Because the buffer
> is read from disk, and some newer information just on journal. In this case,
> we should not read buffer from the disk.

Well, this never worked reliably (e.g. the buffer may get evicted due to
memory pressure after IO error) and never was promised to work. Once you
hit the first metadata error, all bets are off, you have to unmount the
filesystem and run fsck. Sure the damage would be likely smaller if we kept
the buffer uptodate in your case but it isn't guaranteed the filesystem
will not get corrupted. But as I said, changing this is not just a matter
of deleting those two clear_buffer_uptodate() calls...

> I don't know any other filesystem how to use the uptodate flag. But I think
> uptodate means the buffer is valid and newer than disk, am I right?

Yes, uptodate means the buffer is valid and newer than disk. However after
IO error, it isn't clear what the actual disk state is. But generally I'm
not opposed to the change you suggest, just it means you have to first go
and check all filesystems using buffer heads and verify that none of them
uses buffer_uptodate() check to detect buffer write failure.

								Honza
diff mbox series

Patch

diff --git a/fs/buffer.c b/fs/buffer.c
index ce35760..9fe1827 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -172,7 +172,6 @@  void end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 	} else {
 		buffer_io_error(bh, ", lost sync page write");
 		mark_buffer_write_io_error(bh);
-		clear_buffer_uptodate(bh);
 	}
 	unlock_buffer(bh);
 	put_bh(bh);
@@ -325,7 +324,6 @@  void end_buffer_async_write(struct buffer_head *bh, int uptodate)
 	} else {
 		buffer_io_error(bh, ", lost async page write");
 		mark_buffer_write_io_error(bh);
-		clear_buffer_uptodate(bh);
 		SetPageError(page);
 	}