Message ID | 20210412102333.2676-3-jack@suse.cz
State      | Rejected
Series     | ext4: Fix data corruption when extending DIO write races with buffered read
On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> Eric has noticed that after pagecache read rework, generic/418 is
> occasionally failing for ext4 when blocksize < pagesize. In fact, the
> pagecache rework just made hard to hit race in ext4 more likely. The
> problem is that since ext4 conversion of direct IO writes to iomap
> framework (commit 378f32bab371), we update inode size after direct IO
> write only after invalidating page cache. Thus if buffered read sneaks
> at unfortunate moment like:
>
> CPU1 - write at offset 1k                  CPU2 - read from offset 0
> iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
>                                            ext4_readpage();
> ext4_handle_inode_extension()
>
> the read will zero out tail of the page as it still sees smaller inode
> size and thus page cache becomes inconsistent with on-disk contents with
> all the consequences.
>
> Fix the problem by moving inode size update into end_io handler which
> gets called before the page cache is invalidated.

Confused.

This moves all the inode extension stuff into the completion
handler, when all that really needs to be done is extending
inode->i_size to tell the world there is data up to where the
IO completed. Actually removing the inode from the orphan list
does not need to be done in the IO completion callback, because...

> 	if (ilock_shared)
> 		iomap_ops = &ext4_iomap_overwrite_ops;
> -	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
> -			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
> -	if (ret == -ENOTBLK)
> -		ret = 0;
> -
> 	if (extend)
> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +		dio_ops = &ext4_dio_extending_write_ops;
>
> +	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
> +			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
 			    ^^^^^^    ^^^^^^^^^^^^^^^^^^^

.... if we are doing an extending write, we force DIO to complete
before returning. Hence even AIO will block here on an extending
write, and hence we can -always- do the correct post-IO completion
orphan list cleanup here because we know a) the original IO size and
b) the amount of data that was actually written.

Hence all that remains is closing the buffered read vs invalidation
race. All this requires is for the dio write completion to behave
like XFS where it just does the inode->i_size update for extending
writes. This means the size is updated before the invalidation, and
hence any read that occurs after the invalidation but before the
post-eof blocks have been removed will see the correct size and read
the tail page(s) correctly. This closes the race window, and the
caller can still handle the post-eof block cleanup as it does now.

Hence I don't see any need for changing the iomap infrastructure to
solve this problem. This seems like the obvious solution to me, so
what am I missing?

Cheers,

Dave.
On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> On Mon, Apr 12, 2021 at 12:23:32PM +0200, Jan Kara wrote:
> > [...]
>
> Confused.
>
> This moves all the inode extension stuff into the completion
> handler, when all that really needs to be done is extending
> inode->i_size to tell the world there is data up to where the
> IO completed. Actually removing the inode from the orphan list
> does not need to be done in the IO completion callback, because
> [...] if we are doing an extending write, we force DIO to complete
> before returning. Hence even AIO will block here on an extending
> write, and hence we can -always- do the correct post-IO completion
> orphan list cleanup here because we know a) the original IO size and
> b) the amount of data that was actually written.
>
> Hence all that remains is closing the buffered read vs invalidation
> race. All this requires is for the dio write completion to behave
> like XFS where it just does the inode->i_size update for extending
> writes. This means the size is updated before the invalidation, and
> hence any read that occurs after the invalidation but before the
> post-eof blocks have been removed will see the correct size and read
> the tail page(s) correctly. This closes the race window, and the
> caller can still handle the post-eof block cleanup as it does now.
>
> Hence I don't see any need for changing the iomap infrastructure to
> solve this problem. This seems like the obvious solution to me, so
> what am I missing?

All that you write above is correct. The missing piece is: if everything
succeeded and all the cleanup we need is removing the inode from the
orphan list (the common case), we want to piggyback that orphan list
removal onto the same transaction handle as the update of the inode size.
This is just a performance thing; you are absolutely right we could also
do the orphan cleanup unconditionally in ext4_dio_write_iter() and thus
avoid any changes to the iomap framework.

OK, now that I write about this, maybe I was just too hung up on the
performance improvement. Probably a better way forward is that I just fix
the data corruption bug only inside ext4 (that will be also much easier
to backport) and then submit the performance improvement modifying iomap
if I can actually get performance data justifying it. Thanks for poking
into this :)

								Honza
On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> On Tue 13-04-21 07:50:24, Dave Chinner wrote:
> > [...]
>
> All that you write above is correct. The missing piece is: if everything
> succeeded and all the cleanup we need is removing the inode from the
> orphan list (the common case), we want to piggyback that orphan list
> removal onto the same transaction handle as the update of the inode
> size. This is just a performance thing; you are absolutely right we
> could also do the orphan cleanup unconditionally in
> ext4_dio_write_iter() and thus avoid any changes to the iomap framework.

Doesn't ext4, like XFS, keep two copies of the inode size? One for
the on-disk size and one for the in-memory size?

/me looks...

Yeah, there's ei->i_disksize that reflects the on-disk size.

And I note that the first thing ext4_handle_inode_extension() does is
check that the write extends past the current on-disk inode size
before running the extension transaction.

The page cache only cares about the inode->i_size value, not the
ei->i_disksize value, so you can update them independently and still
have things work correctly. That's what XFS does in
xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
runs a transaction to atomically update the on-disk inode size.
Updating the VFS inode size first protects against buffered read
races while updating the on-disk size...

So for ext4, the two separate size updates don't need to be done at
the same time - you have all the state you need in the ext4 dio
write path to extend the on-disk file size on a successful extending
write, and it is not dependent in any way on the current in-memory
VFS inode size that you'd update in the ->end_io callback....

> OK, now that I write about this, maybe I was just too hung up on the
> performance improvement. Probably a better way forward is that I just
> fix the data corruption bug only inside ext4 (that will be also much
> easier to backport) and then submit the performance improvement
> modifying iomap if I can actually get performance data justifying it.
> Thanks for poking into this :)

Sounds like a good plan :)

Cheers,

Dave.
On Wed 14-04-21 08:45:31, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 11:11:22AM +0200, Jan Kara wrote:
> > [...]
>
> Doesn't ext4, like XFS, keep two copies of the inode size? One for
> the on-disk size and one for the in-memory size?
>
> /me looks...
>
> Yeah, there's ei->i_disksize that reflects the on-disk size.
>
> And I note that the first thing ext4_handle_inode_extension() does is
> check that the write extends past the current on-disk inode size
> before running the extension transaction.

Yes.

> The page cache only cares about the inode->i_size value, not the
> ei->i_disksize value, so you can update them independently and still
> have things work correctly. That's what XFS does in
> xfs_dio_write_end_io - it updates the in-memory inode->i_size, then
> runs a transaction to atomically update the on-disk inode size.
> Updating the VFS inode size first protects against buffered read
> races while updating the on-disk size...
>
> So for ext4, the two separate size updates don't need to be done at
> the same time - you have all the state you need in the ext4 dio
> write path to extend the on-disk file size on a successful extending
> write, and it is not dependent in any way on the current in-memory
> VFS inode size that you'd update in the ->end_io callback....

Right, that's a nice trick that will allow us to keep the original
performance (I've verified that splitting the inode size and orphan
updates into separate transactions indeed costs us ~10% of performance
when appending 512-byte chunks) without touching the generic dio code.
Thanks for the idea!

								Honza
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2505313d96b0..ec59ea078f53 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -280,11 +280,67 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 	return ret;
 }
 
+static int ext4_prepare_dio_extend(struct inode *inode)
+{
+	handle_t *handle;
+	int ret;
+
+	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
+
+	ext4_fc_start_update(inode);
+	ret = ext4_orphan_add(handle, inode);
+	ext4_fc_stop_update(inode);
+	if (ret) {
+		ext4_journal_stop(handle);
+		return ret;
+	}
+	ext4_journal_stop(handle);
+	return 0;
+}
+
+static void ext4_cleanup_dio_extend(struct inode *inode, loff_t offset,
+				    ssize_t written, size_t count)
+{
+	handle_t *handle;
+	u8 blkbits = inode->i_blkbits;
+	ext4_lblk_t written_blk, end_blk;
+
+	written_blk = ALIGN(offset + written, 1 << blkbits);
+	end_blk = ALIGN(offset + count, 1 << blkbits);
+	if (written_blk < end_blk && ext4_can_truncate(inode)) {
+		ext4_truncate_failed_write(inode);
+		/*
+		 * If the truncate operation failed early, then the inode may
+		 * still be on the orphan list. In that case, we need to try
+		 * remove the inode from the in-memory linked list.
+		 */
+		if (inode->i_nlink)
+			ext4_orphan_del(NULL, inode);
+		return;
+	}
+
+	/*
+	 * We need to ensure that the inode is removed from the orphan
+	 * list if it has been added prematurely, due to writeback of
+	 * delalloc blocks.
+	 */
+	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+		if (IS_ERR(handle)) {
+			ext4_orphan_del(NULL, inode);
+			return;
+		}
+		ext4_orphan_del(handle, inode);
+		ext4_journal_stop(handle);
+	}
+}
+
 static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 					   ssize_t written, size_t count)
 {
 	handle_t *handle;
-	bool truncate = false;
 	u8 blkbits = inode->i_blkbits;
 	ext4_lblk_t written_blk, end_blk;
 	int ret;
@@ -298,73 +354,31 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 	 * as much as we intended.
 	 */
 	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
-	if (offset + count <= EXT4_I(inode)->i_disksize) {
-		/*
-		 * We need to ensure that the inode is removed from the orphan
-		 * list if it has been added prematurely, due to writeback of
-		 * delalloc blocks.
-		 */
-		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-
-			if (IS_ERR(handle)) {
-				ext4_orphan_del(NULL, inode);
-				return PTR_ERR(handle);
-			}
-
-			ext4_orphan_del(handle, inode);
-			ext4_journal_stop(handle);
-		}
-
+	if (offset + count <= EXT4_I(inode)->i_disksize)
 		return written;
-	}
-
-	if (written < 0)
-		goto truncate;
 
 	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-	if (IS_ERR(handle)) {
-		written = PTR_ERR(handle);
-		goto truncate;
-	}
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
 
 	if (ext4_update_inode_size(inode, offset + written)) {
 		ret = ext4_mark_inode_dirty(handle, inode);
 		if (unlikely(ret)) {
-			written = ret;
 			ext4_journal_stop(handle);
-			goto truncate;
+			return ret;
 		}
 	}
 
 	/*
-	 * We may need to truncate allocated but not written blocks beyond EOF.
+	 * If there cannot be allocated but unused blocks beyond EOF, remove
+	 * the inode from the orphan list as we are done with it.
 	 */
 	written_blk = ALIGN(offset + written, 1 << blkbits);
 	end_blk = ALIGN(offset + count, 1 << blkbits);
-	if (written_blk < end_blk && ext4_can_truncate(inode))
-		truncate = true;
-
-	/*
-	 * Remove the inode from the orphan list if it has been extended and
-	 * everything went OK.
-	 */
-	if (!truncate && inode->i_nlink)
+	if (written_blk == end_blk && inode->i_nlink)
 		ext4_orphan_del(handle, inode);
 	ext4_journal_stop(handle);
 
-	if (truncate) {
-truncate:
-		ext4_truncate_failed_write(inode);
-		/*
-		 * If the truncate operation failed early, then the inode may
-		 * still be on the orphan list. In that case, we need to try
-		 * remove the inode from the in-memory linked list.
-		 */
-		if (inode->i_nlink)
-			ext4_orphan_del(NULL, inode);
-	}
-
 	return written;
 }
 
@@ -385,10 +399,28 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 	return 0;
 }
 
+static int ext4_dio_extending_write_end_io(struct kiocb *iocb, ssize_t size,
+					   ssize_t orig_size, int error,
+					   unsigned int flags)
+{
+	struct inode *inode = file_inode(iocb->ki_filp);
+	int ret;
+
+	ret = ext4_dio_write_end_io(iocb, size, orig_size, error, flags);
+	if (ret < 0)
+		return ret;
+	return ext4_handle_inode_extension(inode, iocb->ki_pos, size,
+					   orig_size);
+}
+
 static const struct iomap_dio_ops ext4_dio_write_ops = {
 	.end_io = ext4_dio_write_end_io,
 };
 
+static const struct iomap_dio_ops ext4_dio_extending_write_ops = {
+	.end_io = ext4_dio_extending_write_end_io,
+};
+
 /*
  * The intention here is to start with shared lock acquired then see if any
  * condition requires an exclusive inode lock. If yes, then we restart the
@@ -455,11 +487,11 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 {
 	ssize_t ret;
-	handle_t *handle;
 	struct inode *inode = file_inode(iocb->ki_filp);
 	loff_t offset = iocb->ki_pos;
 	size_t count = iov_iter_count(from);
 	const struct iomap_ops *iomap_ops = &ext4_iomap_ops;
+	const struct iomap_dio_ops *dio_ops = &ext4_dio_write_ops;
 	bool extend = false, unaligned_io = false;
 	bool ilock_shared = true;
 
@@ -530,33 +562,21 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		inode_dio_wait(inode);
 
 	if (extend) {
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
+		ret = ext4_prepare_dio_extend(inode);
+		if (ret)
 			goto out;
-		}
-
-		ext4_fc_start_update(inode);
-		ret = ext4_orphan_add(handle, inode);
-		ext4_fc_stop_update(inode);
-		if (ret) {
-			ext4_journal_stop(handle);
-			goto out;
-		}
-
-		ext4_journal_stop(handle);
 	}
 
 	if (ilock_shared)
 		iomap_ops = &ext4_iomap_overwrite_ops;
-	ret = iomap_dio_rw(iocb, from, iomap_ops, &ext4_dio_write_ops,
-			   (unaligned_io || extend) ? IOMAP_DIO_FORCE_WAIT : 0);
-	if (ret == -ENOTBLK)
-		ret = 0;
-
 	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+		dio_ops = &ext4_dio_extending_write_ops;
 
+	ret = iomap_dio_rw(iocb, from, iomap_ops, dio_ops,
+			   (extend || unaligned_io) ? IOMAP_DIO_FORCE_WAIT : 0);
+	if (ret == -ENOTBLK)
+		ret = 0;
+	ext4_cleanup_dio_extend(inode, offset, ret, count);
 out:
 	if (ilock_shared)
 		inode_unlock_shared(inode);
@@ -599,7 +619,6 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	ssize_t ret;
 	size_t count;
 	loff_t offset;
-	handle_t *handle;
 	bool extend = false;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
@@ -618,26 +637,20 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 	count = iov_iter_count(from);
 
 	if (offset + count > EXT4_I(inode)->i_disksize) {
-		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-		if (IS_ERR(handle)) {
-			ret = PTR_ERR(handle);
-			goto out;
-		}
-
-		ret = ext4_orphan_add(handle, inode);
-		if (ret) {
-			ext4_journal_stop(handle);
+		ret = ext4_prepare_dio_extend(inode);
+		if (ret < 0)
 			goto out;
-		}
-
 		extend = true;
-		ext4_journal_stop(handle);
 	}
 
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
 
-	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+	if (extend) {
+		if (ret > 0)
+			ret = ext4_handle_inode_extension(inode, offset, ret,
+							  count);
+		ext4_cleanup_dio_extend(inode, offset, ret, count);
+	}
 out:
 	inode_unlock(inode);
 	if (ret > 0)
Eric has noticed that after the pagecache read rework, generic/418 is
occasionally failing for ext4 when blocksize < pagesize. In fact, the
pagecache rework just made a hard-to-hit race in ext4 more likely. The
problem is that since the conversion of ext4 direct IO writes to the
iomap framework (commit 378f32bab371), we update the inode size after a
direct IO write only after invalidating the page cache. Thus if a
buffered read sneaks in at an unfortunate moment like:

CPU1 - write at offset 1k                  CPU2 - read from offset 0
iomap_dio_rw(..., IOMAP_DIO_FORCE_WAIT);
                                           ext4_readpage();
ext4_handle_inode_extension()

the read will zero out the tail of the page as it still sees the smaller
inode size, and thus the page cache becomes inconsistent with the
on-disk contents, with all the consequences.

Fix the problem by moving the inode size update into the end_io handler,
which gets called before the page cache is invalidated.

Reported-by: Eric Whitney <enwlinux@gmail.com>
Fixes: 378f32bab371 ("ext4: introduce direct I/O write using iomap infrastructure")
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 185 ++++++++++++++++++++++++++-----------------------
 1 file changed, 99 insertions(+), 86 deletions(-)