diff mbox series

ext4: Properly sync file size update after O_SYNC direct IO

Message ID 20231011142155.19328-1-jack@suse.cz
State Superseded
Headers show
Series ext4: Properly sync file size update after O_SYNC direct IO | expand

Commit Message

Jan Kara Oct. 11, 2023, 2:21 p.m. UTC
Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
sync file size update and thus if we crash at unfortunate moment, the
file can have smaller size although O_SYNC IO has reported successful
completion. The problem happens because update of on-disk inode size is
handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
dio_complete() in particular) has returned and generic_file_sync() gets
called by dio_complete(). Fix the problem by handling on-disk inode size
update directly in our ->end_io completion handler.

References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
 1 file changed, 52 insertions(+), 87 deletions(-)

So finally I've hopefully got all the corner cases right ;) At least fstest
pass now.

Comments

Dave Chinner Oct. 12, 2023, 12:26 a.m. UTC | #1
On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> sync file size update and thus if we crash at unfortunate moment, the
> file can have smaller size although O_SYNC IO has reported successful
> completion. The problem happens because update of on-disk inode size is
> handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> dio_complete() in particular) has returned and generic_file_sync() gets
> called by dio_complete(). Fix the problem by handling on-disk inode size
> update directly in our ->end_io completion handler.
> 
> References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
>  1 file changed, 52 insertions(+), 87 deletions(-)
.....
> @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>  		 */
>  		if (inode->i_nlink)
>  			ext4_orphan_del(NULL, inode);
> +		return;
>  	}
> +	/*
> +	 * If i_disksize got extended due to writeback of delalloc blocks while
> +	 * the DIO was running we could fail to cleanup the orphan list in
> +	 * ext4_handle_inode_extension(). Do it now.
> +	 */
> +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);

So this has to be called after the DIO write completes and calls
ext4_handle_inode_extension()?

....

> @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			   dio_flags, NULL, 0);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
> -
>  	if (extend)
> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +		ext4_inode_extension_cleanup(inode, ret);

Because this doesn't wait for AIO DIO to complete and actually
extend the file before running the cleanup code...

Cheers,

Dave.
Gao Xiang Oct. 12, 2023, 2:21 a.m. UTC | #2
Hi Dave,

On 2023/10/12 08:26, Dave Chinner wrote:
> On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
>> Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
>> sync file size update and thus if we crash at unfortunate moment, the
>> file can have smaller size although O_SYNC IO has reported successful
>> completion. The problem happens because update of on-disk inode size is
>> handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
>> dio_complete() in particular) has returned and generic_file_sync() gets
>> called by dio_complete(). Fix the problem by handling on-disk inode size
>> update directly in our ->end_io completion handler.
>>
>> References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
>> Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
>> Signed-off-by: Jan Kara <jack@suse.cz>
>> ---
>>   fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
>>   1 file changed, 52 insertions(+), 87 deletions(-)
> .....
>> @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>>   		 */
>>   		if (inode->i_nlink)
>>   			ext4_orphan_del(NULL, inode);
>> +		return;
>>   	}
>> +	/*
>> +	 * If i_disksize got extended due to writeback of delalloc blocks while
>> +	 * the DIO was running we could fail to cleanup the orphan list in
>> +	 * ext4_handle_inode_extension(). Do it now.
>> +	 */
>> +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
>> +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> 
> So this has to be called after the DIO write completes and calls
> ext4_handle_inode_extension()?
> 
> ....
> 
>> @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>   			   dio_flags, NULL, 0);
>>   	if (ret == -ENOTBLK)
>>   		ret = 0;
>> -
>>   	if (extend)
>> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
>> +		ext4_inode_extension_cleanup(inode, ret);
> 
> Because this doesn't wait for AIO DIO to complete and actually
> extend the file before running the cleanup code...

As far as I know, for ext4 AIO DIO extension cases,
IOMAP_DIO_FORCE_WAIT will be set, thus no async DIO here.

So the timing for this case will be strictly:
   - ext4_handle_inode_extension()   --- record i_disksize in .end_io

   - generic_write_sync()            --- forcely do fsync()

   - ext4_inode_extension_cleanup()  --- drop orphan in another transaction
                                         as mentioned in [1]

Anyway, that is my current limited thoughts.

[1] https://lore.kernel.org/linux-ext4/20230920152005.7iowrlukd5zbvp43@quack3/

Thanks,
Gao Xiang

> 
> Cheers,
> 
> Dave.
Jan Kara Oct. 12, 2023, 8:59 a.m. UTC | #3
On Thu 12-10-23 11:26:15, Dave Chinner wrote:
> On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > sync file size update and thus if we crash at unfortunate moment, the
> > file can have smaller size although O_SYNC IO has reported successful
> > completion. The problem happens because update of on-disk inode size is
> > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > dio_complete() in particular) has returned and generic_file_sync() gets
> > called by dio_complete(). Fix the problem by handling on-disk inode size
> > update directly in our ->end_io completion handler.
> > 
> > References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> > Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> >  1 file changed, 52 insertions(+), 87 deletions(-)
> .....
> > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> >  		 */
> >  		if (inode->i_nlink)
> >  			ext4_orphan_del(NULL, inode);
> > +		return;
> >  	}
> > +	/*
> > +	 * If i_disksize got extended due to writeback of delalloc blocks while
> > +	 * the DIO was running we could fail to cleanup the orphan list in
> > +	 * ext4_handle_inode_extension(). Do it now.
> > +	 */
> > +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> 
> So this has to be called after the DIO write completes and calls
> ext4_handle_inode_extension()?

Yes, if the write was setup as extending one ('extend' is set to true in
ext4_dio_write_iter()).

> > @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> >  			   dio_flags, NULL, 0);
> >  	if (ret == -ENOTBLK)
> >  		ret = 0;
> > -
> >  	if (extend)
> > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > +		ext4_inode_extension_cleanup(inode, ret);
> 
> Because this doesn't wait for AIO DIO to complete and actually
> extend the file before running the cleanup code...

As Gao wrote, ext4 sets IOMAP_DIO_FORCE_WAIT if 'extend' is set (see
ext4_dio_write_checks()) so if we get to calling
ext4_inode_extension_cleanup() we are guaranteed the IO has already
completed.

								Honza
Ritesh Harjani (IBM) Oct. 12, 2023, 3:25 p.m. UTC | #4
Jan Kara <jack@suse.cz> writes:

> Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> sync file size update and thus if we crash at unfortunate moment, the
> file can have smaller size although O_SYNC IO has reported successful
> completion. The problem happens because update of on-disk inode size is
> handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> dio_complete() in particular) has returned and generic_file_sync() gets
> called by dio_complete(). Fix the problem by handling on-disk inode size
> update directly in our ->end_io completion handler.
>
> References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> Signed-off-by: Jan Kara <jack@suse.cz>

Guess we need a fixes tag.

> ---
>  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
>  1 file changed, 52 insertions(+), 87 deletions(-)
>
> So finally I've hopefully got all the corner cases right ;) At least fstest
> pass now.
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 1492b1ae21f4..d0711c1a9b06 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -306,80 +306,34 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>  }
>  
>  static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> -					   ssize_t written, size_t count)
> +					   ssize_t count)
>  {
>  	handle_t *handle;
> -	bool truncate = false;
> -	u8 blkbits = inode->i_blkbits;
> -	ext4_lblk_t written_blk, end_blk;
> -	int ret;
> -
> -	/*
> -	 * Note that EXT4_I(inode)->i_disksize can get extended up to
> -	 * inode->i_size while the I/O was running due to writeback of delalloc
> -	 * blocks. But, the code in ext4_iomap_alloc() is careful to use
> -	 * zeroed/unwritten extents if this is possible; thus we won't leave
> -	 * uninitialized blocks in a file even if we didn't succeed in writing
> -	 * as much as we intended.
> -	 */
> -	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> -	if (offset + count <= EXT4_I(inode)->i_disksize) {
> -		/*
> -		 * We need to ensure that the inode is removed from the orphan
> -		 * list if it has been added prematurely, due to writeback of
> -		 * delalloc blocks.
> -		 */
> -		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> -			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> -
> -			if (IS_ERR(handle)) {
> -				ext4_orphan_del(NULL, inode);
> -				return PTR_ERR(handle);
> -			}
> -
> -			ext4_orphan_del(handle, inode);
> -			ext4_journal_stop(handle);
> -		}
> -
> -		return written;
> -	}
> -
> -	if (written < 0)
> -		goto truncate;
>  
> +	lockdep_assert_held_write(&inode->i_rwsem);
>  	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> -	if (IS_ERR(handle)) {
> -		written = PTR_ERR(handle);
> -		goto truncate;
> -	}
> +	if (IS_ERR(handle))
> +		return PTR_ERR(handle);
>  
> -	if (ext4_update_inode_size(inode, offset + written)) {
> -		ret = ext4_mark_inode_dirty(handle, inode);
> +	if (ext4_update_inode_size(inode, offset + count)) {
> +		int ret = ext4_mark_inode_dirty(handle, inode);
>  		if (unlikely(ret)) {
> -			written = ret;
>  			ext4_journal_stop(handle);
> -			goto truncate;
> +			return ret;
>  		}
>  	}
>  
> -	/*
> -	 * We may need to truncate allocated but not written blocks beyond EOF.
> -	 */
> -	written_blk = ALIGN(offset + written, 1 << blkbits);
> -	end_blk = ALIGN(offset + count, 1 << blkbits);
> -	if (written_blk < end_blk && ext4_can_truncate(inode))
> -		truncate = true;
> -
> -	/*
> -	 * Remove the inode from the orphan list if it has been extended and
> -	 * everything went OK.
> -	 */
> -	if (!truncate && inode->i_nlink)
> +	if (inode->i_nlink)
>  		ext4_orphan_del(handle, inode);
>  	ext4_journal_stop(handle);
>  
> -	if (truncate) {
> -truncate:
> +	return count;
> +}
> +
> +static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
> +{
> +	lockdep_assert_held_write(&inode->i_rwsem);
> +	if (count < 0) {
>  		ext4_truncate_failed_write(inode);
>  		/*
>  		 * If the truncate operation failed early, then the inode may
> @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>  		 */
>  		if (inode->i_nlink)
>  			ext4_orphan_del(NULL, inode);
> +		return;
>  	}
> +	/*
> +	 * If i_disksize got extended due to writeback of delalloc blocks while
> +	 * the DIO was running we could fail to cleanup the orphan list in
> +	 * ext4_handle_inode_extension(). Do it now.
> +	 */
> +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>  
> -	return written;
> +		if (IS_ERR(handle)) {
> +			/*
> +			 * The write has successfully completed. Not much to
> +			 * do with the error here so just cleanup the orphan
> +			 * list and hope for the best.
> +			 */
> +			ext4_orphan_del(NULL, inode);
> +			return;
> +		}
> +		ext4_orphan_del(handle, inode);
> +		ext4_journal_stop(handle);
> +	}
>  }
>  
>  static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> @@ -399,31 +372,22 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>  	loff_t pos = iocb->ki_pos;
>  	struct inode *inode = file_inode(iocb->ki_filp);
>  
> +	if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)

Do we have IOMAP_DIO_END_UNWRITTEN? or should it be IOMAP_DIO_UNWRITTEN?
Also we don't need to check !error case if we return early in case of an error.

> +		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
>  	if (error)
>  		return error;
> -
> -	if (size && flags & IOMAP_DIO_END_UNWRITTEN) {

ditto.

> -		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> -		if (error < 0)
> -			return error;
> -	}
>  	/*
> -	 * If we are extending the file, we have to update i_size here before
> -	 * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
> -	 * buffered reads could zero out too much from page cache pages. Update
> -	 * of on-disk size will happen later in ext4_dio_write_iter() where
> -	 * we have enough information to also perform orphan list handling etc.
> -	 * Note that we perform all extending writes synchronously under
> -	 * i_rwsem held exclusively so i_size update is safe here in that case.
> -	 * If the write was not extending, we cannot see pos > i_size here
> -	 * because operations reducing i_size like truncate wait for all
> -	 * outstanding DIO before updating i_size.
> +	 * Note that EXT4_I(inode)->i_disksize can get extended up to
> +	 * inode->i_size while the I/O was running due to writeback of delalloc
> +	 * blocks. But the code in ext4_iomap_alloc() is careful to use
> +	 * zeroed/unwritten extents if this is possible; thus we won't leave
> +	 * uninitialized blocks in a file even if we didn't succeed in writing
> +	 * as much as we intended.
>  	 */
> -	pos += size;
> -	if (pos > i_size_read(inode))
> -		i_size_write(inode, pos);
> -
> -	return 0;
> +	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> +	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> +		return 0;
> +	return ext4_handle_inode_extension(inode, pos, size);
>  }

Although it is not a problem, but we are sometimes returning 0 and
sometimes count here.

>  
>  static const struct iomap_dio_ops ext4_dio_write_ops = {
> @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  			   dio_flags, NULL, 0);
>  	if (ret == -ENOTBLK)
>  		ret = 0;
> -
>  	if (extend)
> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +		ext4_inode_extension_cleanup(inode, ret);
>  
>  out:
>  	if (ilock_shared)
> @@ -689,8 +652,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>  
>  	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>  
> -	if (extend)
> -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +	if (extend) {
> +		ret = ext4_handle_inode_extension(inode, offset, ret);
> +		ext4_inode_extension_cleanup(inode, ret);

ok, looks like we are using that return value here.

> +	}
>  out:
>  	inode_unlock(inode);
>  	if (ret > 0)
> -- 
> 2.35.3

Otherwise it looks good to me.

-ritesh
Jan Kara Oct. 12, 2023, 4:17 p.m. UTC | #5
On Thu 12-10-23 20:55:29, Ritesh Harjani wrote:
> Jan Kara <jack@suse.cz> writes:
> 
> > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > sync file size update and thus if we crash at unfortunate moment, the
> > file can have smaller size although O_SYNC IO has reported successful
> > completion. The problem happens because update of on-disk inode size is
> > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > dio_complete() in particular) has returned and generic_file_sync() gets
> > called by dio_complete(). Fix the problem by handling on-disk inode size
> > update directly in our ->end_io completion handler.
> >
> > References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> > Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> Guess we need a fixes tag.

Good point, will add.

> > ---
> >  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> >  1 file changed, 52 insertions(+), 87 deletions(-)
> >
> > So finally I've hopefully got all the corner cases right ;) At least fstest
> > pass now.
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 1492b1ae21f4..d0711c1a9b06 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -306,80 +306,34 @@ static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> >  }
> >  
> >  static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > -					   ssize_t written, size_t count)
> > +					   ssize_t count)
> >  {
> >  	handle_t *handle;
> > -	bool truncate = false;
> > -	u8 blkbits = inode->i_blkbits;
> > -	ext4_lblk_t written_blk, end_blk;
> > -	int ret;
> > -
> > -	/*
> > -	 * Note that EXT4_I(inode)->i_disksize can get extended up to
> > -	 * inode->i_size while the I/O was running due to writeback of delalloc
> > -	 * blocks. But, the code in ext4_iomap_alloc() is careful to use
> > -	 * zeroed/unwritten extents if this is possible; thus we won't leave
> > -	 * uninitialized blocks in a file even if we didn't succeed in writing
> > -	 * as much as we intended.
> > -	 */
> > -	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > -	if (offset + count <= EXT4_I(inode)->i_disksize) {
> > -		/*
> > -		 * We need to ensure that the inode is removed from the orphan
> > -		 * list if it has been added prematurely, due to writeback of
> > -		 * delalloc blocks.
> > -		 */
> > -		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > -			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > -
> > -			if (IS_ERR(handle)) {
> > -				ext4_orphan_del(NULL, inode);
> > -				return PTR_ERR(handle);
> > -			}
> > -
> > -			ext4_orphan_del(handle, inode);
> > -			ext4_journal_stop(handle);
> > -		}
> > -
> > -		return written;
> > -	}
> > -
> > -	if (written < 0)
> > -		goto truncate;
> >  
> > +	lockdep_assert_held_write(&inode->i_rwsem);
> >  	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > -	if (IS_ERR(handle)) {
> > -		written = PTR_ERR(handle);
> > -		goto truncate;
> > -	}
> > +	if (IS_ERR(handle))
> > +		return PTR_ERR(handle);
> >  
> > -	if (ext4_update_inode_size(inode, offset + written)) {
> > -		ret = ext4_mark_inode_dirty(handle, inode);
> > +	if (ext4_update_inode_size(inode, offset + count)) {
> > +		int ret = ext4_mark_inode_dirty(handle, inode);
> >  		if (unlikely(ret)) {
> > -			written = ret;
> >  			ext4_journal_stop(handle);
> > -			goto truncate;
> > +			return ret;
> >  		}
> >  	}
> >  
> > -	/*
> > -	 * We may need to truncate allocated but not written blocks beyond EOF.
> > -	 */
> > -	written_blk = ALIGN(offset + written, 1 << blkbits);
> > -	end_blk = ALIGN(offset + count, 1 << blkbits);
> > -	if (written_blk < end_blk && ext4_can_truncate(inode))
> > -		truncate = true;
> > -
> > -	/*
> > -	 * Remove the inode from the orphan list if it has been extended and
> > -	 * everything went OK.
> > -	 */
> > -	if (!truncate && inode->i_nlink)
> > +	if (inode->i_nlink)
> >  		ext4_orphan_del(handle, inode);
> >  	ext4_journal_stop(handle);
> >  
> > -	if (truncate) {
> > -truncate:
> > +	return count;
> > +}
> > +
> > +static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
> > +{
> > +	lockdep_assert_held_write(&inode->i_rwsem);
> > +	if (count < 0) {
> >  		ext4_truncate_failed_write(inode);
> >  		/*
> >  		 * If the truncate operation failed early, then the inode may
> > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> >  		 */
> >  		if (inode->i_nlink)
> >  			ext4_orphan_del(NULL, inode);
> > +		return;
> >  	}
> > +	/*
> > +	 * If i_disksize got extended due to writeback of delalloc blocks while
> > +	 * the DIO was running we could fail to cleanup the orphan list in
> > +	 * ext4_handle_inode_extension(). Do it now.
> > +	 */
> > +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> >  
> > -	return written;
> > +		if (IS_ERR(handle)) {
> > +			/*
> > +			 * The write has successfully completed. Not much to
> > +			 * do with the error here so just cleanup the orphan
> > +			 * list and hope for the best.
> > +			 */
> > +			ext4_orphan_del(NULL, inode);
> > +			return;
> > +		}
> > +		ext4_orphan_del(handle, inode);
> > +		ext4_journal_stop(handle);
> > +	}
> >  }
> >  
> >  static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > @@ -399,31 +372,22 @@ static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> >  	loff_t pos = iocb->ki_pos;
> >  	struct inode *inode = file_inode(iocb->ki_filp);
> >  
> > +	if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)
> 
> Do we have IOMAP_DIO_END_UNWRITTEN? or should it be IOMAP_DIO_UNWRITTEN?
> Also we don't need to check !error case if we return early in case of an error.

It should be IOMAP_DIO_UNWRITTEN. Unrelated iomap cleanup snuck under this
patch in my tree (because older versions of the patch needed it).

> > +		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> >  	if (error)
> >  		return error;
> > -
> > -	if (size && flags & IOMAP_DIO_END_UNWRITTEN) {
> 
> ditto.
> 
> > -		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
> > -		if (error < 0)
> > -			return error;
> > -	}
> >  	/*
> > -	 * If we are extending the file, we have to update i_size here before
> > -	 * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
> > -	 * buffered reads could zero out too much from page cache pages. Update
> > -	 * of on-disk size will happen later in ext4_dio_write_iter() where
> > -	 * we have enough information to also perform orphan list handling etc.
> > -	 * Note that we perform all extending writes synchronously under
> > -	 * i_rwsem held exclusively so i_size update is safe here in that case.
> > -	 * If the write was not extending, we cannot see pos > i_size here
> > -	 * because operations reducing i_size like truncate wait for all
> > -	 * outstanding DIO before updating i_size.
> > +	 * Note that EXT4_I(inode)->i_disksize can get extended up to
> > +	 * inode->i_size while the I/O was running due to writeback of delalloc
> > +	 * blocks. But the code in ext4_iomap_alloc() is careful to use
> > +	 * zeroed/unwritten extents if this is possible; thus we won't leave
> > +	 * uninitialized blocks in a file even if we didn't succeed in writing
> > +	 * as much as we intended.
> >  	 */
> > -	pos += size;
> > -	if (pos > i_size_read(inode))
> > -		i_size_write(inode, pos);
> > -
> > -	return 0;
> > +	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
> > +	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
> > +		return 0;
> > +	return ext4_handle_inode_extension(inode, pos, size);
> >  }
> 
> Although it is not a problem, but we are sometimes returning 0 and
> sometimes count here.

Yeah. iomap_dio_complete() actually fixes this up but I agree it is
confusing a bit. Let's return 'size' here.

							Honza
Dave Chinner Oct. 12, 2023, 11:33 p.m. UTC | #6
On Thu, Oct 12, 2023 at 10:59:37AM +0200, Jan Kara wrote:
> On Thu 12-10-23 11:26:15, Dave Chinner wrote:
> > On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> > > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > > sync file size update and thus if we crash at unfortunate moment, the
> > > file can have smaller size although O_SYNC IO has reported successful
> > > completion. The problem happens because update of on-disk inode size is
> > > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > > dio_complete() in particular) has returned and generic_file_sync() gets
> > > called by dio_complete(). Fix the problem by handling on-disk inode size
> > > update directly in our ->end_io completion handler.
> > > 
> > > References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> > > Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> > >  1 file changed, 52 insertions(+), 87 deletions(-)
> > .....
> > > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > >  		 */
> > >  		if (inode->i_nlink)
> > >  			ext4_orphan_del(NULL, inode);
> > > +		return;
> > >  	}
> > > +	/*
> > > +	 * If i_disksize got extended due to writeback of delalloc blocks while
> > > +	 * the DIO was running we could fail to cleanup the orphan list in
> > > +	 * ext4_handle_inode_extension(). Do it now.
> > > +	 */
> > > +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > > +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > 
> > So this has to be called after the DIO write completes and calls
> > ext4_handle_inode_extension()?
> 
> Yes, if the write was setup as extending one ('extend' is set to true in
> ext4_dio_write_iter()).

Then that is worth a comment to document the constraint for anyone
that is trying to understand how ext4 is using the iomap DIO code.

> > > @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > >  			   dio_flags, NULL, 0);
> > >  	if (ret == -ENOTBLK)
> > >  		ret = 0;
> > > -
> > >  	if (extend)
> > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > +		ext4_inode_extension_cleanup(inode, ret);
> > 
> > Because this doesn't wait for AIO DIO to complete and actually
> > extend the file before running the cleanup code...
> 
> As Gao wrote, ext4 sets IOMAP_DIO_FORCE_WAIT if 'extend' is set (see
> ext4_dio_write_checks()) so if we get to calling
> ext4_inode_extension_cleanup() we are guaranteed the IO has already
> completed.

Ugh. That's a pretty nasty undocumented landmine. It definitely
needs a comment (or better, a WARN_ON_ONCE()) to document that this
code -only- works if AIO is disabled. This isn't for ext4
developers, it's for people working on the iomap code to understand
that ext4 has some really non-obvious constraints in it's DIO code
paths and that's why the landmine is not being stepped on....

Cheers,

Dave.
Jan Kara Oct. 13, 2023, 10:14 a.m. UTC | #7
On Fri 13-10-23 10:33:08, Dave Chinner wrote:
> On Thu, Oct 12, 2023 at 10:59:37AM +0200, Jan Kara wrote:
> > On Thu 12-10-23 11:26:15, Dave Chinner wrote:
> > > On Wed, Oct 11, 2023 at 04:21:55PM +0200, Jan Kara wrote:
> > > > Gao Xiang has reported that on ext4 O_SYNC direct IO does not properly
> > > > sync file size update and thus if we crash at unfortunate moment, the
> > > > file can have smaller size although O_SYNC IO has reported successful
> > > > completion. The problem happens because update of on-disk inode size is
> > > > handled in ext4_dio_write_iter() *after* iomap_dio_rw() (and thus
> > > > dio_complete() in particular) has returned and generic_file_sync() gets
> > > > called by dio_complete(). Fix the problem by handling on-disk inode size
> > > > update directly in our ->end_io completion handler.
> > > > 
> > > > References: https://lore.kernel.org/all/02d18236-26ef-09b0-90ad-030c4fe3ee20@linux.alibaba.com
> > > > Reported-by: Gao Xiang <hsiangkao@linux.alibaba.com>
> > > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > > ---
> > > >  fs/ext4/file.c | 139 ++++++++++++++++++-------------------------------
> > > >  1 file changed, 52 insertions(+), 87 deletions(-)
> > > .....
> > > > @@ -388,9 +342,28 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > > >  		 */
> > > >  		if (inode->i_nlink)
> > > >  			ext4_orphan_del(NULL, inode);
> > > > +		return;
> > > >  	}
> > > > +	/*
> > > > +	 * If i_disksize got extended due to writeback of delalloc blocks while
> > > > +	 * the DIO was running we could fail to cleanup the orphan list in
> > > > +	 * ext4_handle_inode_extension(). Do it now.
> > > > +	 */
> > > > +	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > > > +		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > > 
> > > So this has to be called after the DIO write completes and calls
> > > ext4_handle_inode_extension()?
> > 
> > Yes, if the write was setup as extending one ('extend' is set to true in
> > ext4_dio_write_iter()).
> 
> Then that is worth a comment to document the constraint for anyone
> that is trying to understand how ext4 is using the iomap DIO code.

Fair enough, comment added.

> > > > @@ -606,9 +570,8 @@ static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > >  			   dio_flags, NULL, 0);
> > > >  	if (ret == -ENOTBLK)
> > > >  		ret = 0;
> > > > -
> > > >  	if (extend)
> > > > -		ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > > > +		ext4_inode_extension_cleanup(inode, ret);
> > > 
> > > Because this doesn't wait for AIO DIO to complete and actually
> > > extend the file before running the cleanup code...
> > 
> > As Gao wrote, ext4 sets IOMAP_DIO_FORCE_WAIT if 'extend' is set (see
> > ext4_dio_write_checks()) so if we get to calling
> > ext4_inode_extension_cleanup() we are guaranteed the IO has already
> > completed.
> 
> Ugh. That's a pretty nasty undocumented landmine. It definitely
> needs a comment (or better, a WARN_ON_ONCE()) to document that this
> code -only- works if AIO is disabled. This isn't for ext4
> developers, it's for people working on the iomap code to understand
> that ext4 has some really non-obvious constraints in it's DIO code
> paths and that's why the landmine is not being stepped on....

OK. Ext4 has always been this way so I never felt the need to document it
but you're right. I've added:

	if (extend) {
                /*
                 * We always perform extending DIO write synchronously so by
                 * now the IO is completed and ext4_handle_inode_extension()
                 * was called. Cleanup the inode in case of error or race with
                 * writeback of delalloc blocks.
                 */
                WARN_ON_ONCE(ret == -EIOCBQUEUED);
                ext4_inode_extension_cleanup(inode, ret);
        }

Thanks for the suggestions!

								Honza
diff mbox series

Patch

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 1492b1ae21f4..d0711c1a9b06 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -306,80 +306,34 @@  static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
 }
 
 static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
-					   ssize_t written, size_t count)
+					   ssize_t count)
 {
 	handle_t *handle;
-	bool truncate = false;
-	u8 blkbits = inode->i_blkbits;
-	ext4_lblk_t written_blk, end_blk;
-	int ret;
-
-	/*
-	 * Note that EXT4_I(inode)->i_disksize can get extended up to
-	 * inode->i_size while the I/O was running due to writeback of delalloc
-	 * blocks. But, the code in ext4_iomap_alloc() is careful to use
-	 * zeroed/unwritten extents if this is possible; thus we won't leave
-	 * uninitialized blocks in a file even if we didn't succeed in writing
-	 * as much as we intended.
-	 */
-	WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
-	if (offset + count <= EXT4_I(inode)->i_disksize) {
-		/*
-		 * We need to ensure that the inode is removed from the orphan
-		 * list if it has been added prematurely, due to writeback of
-		 * delalloc blocks.
-		 */
-		if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
-			handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-
-			if (IS_ERR(handle)) {
-				ext4_orphan_del(NULL, inode);
-				return PTR_ERR(handle);
-			}
-
-			ext4_orphan_del(handle, inode);
-			ext4_journal_stop(handle);
-		}
-
-		return written;
-	}
-
-	if (written < 0)
-		goto truncate;
 
+	lockdep_assert_held_write(&inode->i_rwsem);
 	handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
-	if (IS_ERR(handle)) {
-		written = PTR_ERR(handle);
-		goto truncate;
-	}
+	if (IS_ERR(handle))
+		return PTR_ERR(handle);
 
-	if (ext4_update_inode_size(inode, offset + written)) {
-		ret = ext4_mark_inode_dirty(handle, inode);
+	if (ext4_update_inode_size(inode, offset + count)) {
+		int ret = ext4_mark_inode_dirty(handle, inode);
 		if (unlikely(ret)) {
-			written = ret;
 			ext4_journal_stop(handle);
-			goto truncate;
+			return ret;
 		}
 	}
 
-	/*
-	 * We may need to truncate allocated but not written blocks beyond EOF.
-	 */
-	written_blk = ALIGN(offset + written, 1 << blkbits);
-	end_blk = ALIGN(offset + count, 1 << blkbits);
-	if (written_blk < end_blk && ext4_can_truncate(inode))
-		truncate = true;
-
-	/*
-	 * Remove the inode from the orphan list if it has been extended and
-	 * everything went OK.
-	 */
-	if (!truncate && inode->i_nlink)
+	if (inode->i_nlink)
 		ext4_orphan_del(handle, inode);
 	ext4_journal_stop(handle);
 
-	if (truncate) {
-truncate:
+	return count;
+}
+
+static void ext4_inode_extension_cleanup(struct inode *inode, ssize_t count)
+{
+	lockdep_assert_held_write(&inode->i_rwsem);
+	if (count < 0) {
 		ext4_truncate_failed_write(inode);
 		/*
 		 * If the truncate operation failed early, then the inode may
@@ -388,9 +342,28 @@  static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
 		 */
 		if (inode->i_nlink)
 			ext4_orphan_del(NULL, inode);
+		return;
 	}
+	/*
+	 * If i_disksize got extended due to writeback of delalloc blocks while
+	 * the DIO was running we could fail to cleanup the orphan list in
+	 * ext4_handle_inode_extension(). Do it now.
+	 */
+	if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+		handle_t *handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
 
-	return written;
+		if (IS_ERR(handle)) {
+			/*
+			 * The write has successfully completed. Not much to
+			 * do with the error here so just cleanup the orphan
+			 * list and hope for the best.
+			 */
+			ext4_orphan_del(NULL, inode);
+			return;
+		}
+		ext4_orphan_del(handle, inode);
+		ext4_journal_stop(handle);
+	}
 }
 
 static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
@@ -399,31 +372,22 @@  static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
 	loff_t pos = iocb->ki_pos;
 	struct inode *inode = file_inode(iocb->ki_filp);
 
+	if (!error && size && flags & IOMAP_DIO_END_UNWRITTEN)
+		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
 	if (error)
 		return error;
-
-	if (size && flags & IOMAP_DIO_END_UNWRITTEN) {
-		error = ext4_convert_unwritten_extents(NULL, inode, pos, size);
-		if (error < 0)
-			return error;
-	}
 	/*
-	 * If we are extending the file, we have to update i_size here before
-	 * page cache gets invalidated in iomap_dio_rw(). Otherwise racing
-	 * buffered reads could zero out too much from page cache pages. Update
-	 * of on-disk size will happen later in ext4_dio_write_iter() where
-	 * we have enough information to also perform orphan list handling etc.
-	 * Note that we perform all extending writes synchronously under
-	 * i_rwsem held exclusively so i_size update is safe here in that case.
-	 * If the write was not extending, we cannot see pos > i_size here
-	 * because operations reducing i_size like truncate wait for all
-	 * outstanding DIO before updating i_size.
+	 * Note that EXT4_I(inode)->i_disksize can get extended up to
+	 * inode->i_size while the I/O was running due to writeback of delalloc
+	 * blocks. But the code in ext4_iomap_alloc() is careful to use
+	 * zeroed/unwritten extents if this is possible; thus we won't leave
+	 * uninitialized blocks in a file even if we didn't succeed in writing
+	 * as much as we intended.
 	 */
-	pos += size;
-	if (pos > i_size_read(inode))
-		i_size_write(inode, pos);
-
-	return 0;
+	WARN_ON_ONCE(i_size_read(inode) < READ_ONCE(EXT4_I(inode)->i_disksize));
+	if (pos + size <= READ_ONCE(EXT4_I(inode)->i_disksize))
+		return 0;
+	return ext4_handle_inode_extension(inode, pos, size);
 }
 
 static const struct iomap_dio_ops ext4_dio_write_ops = {
@@ -606,9 +570,8 @@  static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			   dio_flags, NULL, 0);
 	if (ret == -ENOTBLK)
 		ret = 0;
-
 	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+		ext4_inode_extension_cleanup(inode, ret);
 
 out:
 	if (ilock_shared)
@@ -689,8 +652,10 @@  ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
 
 	ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
 
-	if (extend)
-		ret = ext4_handle_inode_extension(inode, offset, ret, count);
+	if (extend) {
+		ret = ext4_handle_inode_extension(inode, offset, ret);
+		ext4_inode_extension_cleanup(inode, ret);
+	}
 out:
 	inode_unlock(inode);
 	if (ret > 0)