[v2,2/2] ext4: update i_disksize if direct write past ondisk size

Message ID 20180312152156.14034-2-guaneryu@gmail.com
State New
Headers show
Series
  • [v2,1/2] ext4: protect i_disksize update by i_data_sem in direct write path
Related show

Commit Message

Eryu Guan March 12, 2018, 3:21 p.m.
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.

Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.

Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.

This fixes occasional i_size corruption in fstests generic/475.

Signed-off-by: Eryu Guan <guaneryu@gmail.com>
---

v2:
- basically no change since v1, just fix the locking issue first, and
  reintroduce the "ei" definition in this patch.

I've tested this patchset by looping generic/475 200 times without
hitting a corruption, usually it fails within 5 iterations for me. Also
tested by full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations
and dio tests from LTP, all results looked good.

 fs/ext4/inode.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

Comments

Theodore Y. Ts'o March 22, 2018, 3:45 p.m. | #1
On Mon, Mar 12, 2018 at 11:21:56PM +0800, Eryu Guan wrote:
> Currently in ext4 direct write path, we update i_disksize only when
> new eof is greater than i_size, and don't update it even when new
> eof is greater than i_disksize but less than i_size. This doesn't
> work well with delalloc buffer write, which updates i_size and
> i_disksize only when delalloc blocks are resolved (at writeback
> time), the i_disksize from direct write can be lost if a previous
> buffer write succeeded at write time but failed at writeback time,
> then results in corrupted ondisk inode size.
> 
> Consider this case, first buffer write 4k data to a new file at
> offset 16k with delayed allocation, then direct write 4k data to the
> same file at offset 4k before delalloc blocks are resolved, which
> doesn't update i_disksize because it writes within i_size(20k), but
> the extent tree metadata has been committed in journal. Then
> writeback of the delalloc blocks fails (due to device error etc.),
> and i_size/i_disksize from buffer write can't be written to disk
> (still zero). A subsequent umount/mount cycle recovers journal and
> writes extent tree metadata from direct write to disk, but with
> i_disksize being zero.
> 
> Fix it by updating i_disksize too in direct write path when new eof
> is greater than i_disksize but less than i_size, so i_disksize is
> always consistent with direct write.
> 
> This fixes occasional i_size corruption in fstests generic/475.
> 
> Signed-off-by: Eryu Guan <guaneryu@gmail.com>

Updated, thanks.

					- Ted
Jan Kara March 23, 2018, 5:04 p.m. | #2
On Mon 12-03-18 23:21:56, Eryu Guan wrote:
> Currently in ext4 direct write path, we update i_disksize only when
> new eof is greater than i_size, and don't update it even when new
> eof is greater than i_disksize but less than i_size. This doesn't
> work well with delalloc buffer write, which updates i_size and
> i_disksize only when delalloc blocks are resolved (at writeback
> time), the i_disksize from direct write can be lost if a previous
> buffer write succeeded at write time but failed at writeback time,
> then results in corrupted ondisk inode size.
> 
> Consider this case, first buffer write 4k data to a new file at
> offset 16k with delayed allocation, then direct write 4k data to the
> same file at offset 4k before delalloc blocks are resolved, which
> doesn't update i_disksize because it writes within i_size(20k), but
> the extent tree metadata has been committed in journal. Then
> writeback of the delalloc blocks fails (due to device error etc.),
> and i_size/i_disksize from buffer write can't be written to disk
> (still zero). A subsequent umount/mount cycle recovers journal and
> writes extent tree metadata from direct write to disk, but with
> i_disksize being zero.
> 
> Fix it by updating i_disksize too in direct write path when new eof
> is greater than i_disksize but less than i_size, so i_disksize is
> always consistent with direct write.
> 
> This fixes occasional i_size corruption in fstests generic/475.
> 
> Signed-off-by: Eryu Guan <guaneryu@gmail.com>

Thanks for fixing this. The patch looks good. You can add:

Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
> 
> v2:
> - basically no change since v1, just fix the locking issue first, and
>   reintroduce the "ei" definition in this patch.
> 
> I've tested this patchset by looping generic/475 200 times without
> hitting a corruption, usually it fails within 5 iterations for me. Also
> tested by full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations
> and dio tests from LTP, all results looked good.
> 
>  fs/ext4/inode.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index bff44b4a0783..9acac476c15c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3658,6 +3658,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  {
>  	struct file *file = iocb->ki_filp;
>  	struct inode *inode = file->f_mapping->host;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
>  	ssize_t ret;
>  	loff_t offset = iocb->ki_pos;
>  	size_t count = iov_iter_count(iter);
> @@ -3668,7 +3669,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  	int orphan = 0;
>  	handle_t *handle;
>  
> -	if (final_size > inode->i_size) {
> +	if (final_size > inode->i_size || final_size > ei->i_disksize) {
>  		/* Credits for sb + inode write */
>  		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>  		if (IS_ERR(handle)) {
> @@ -3788,9 +3789,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  			ext4_orphan_del(handle, inode);
>  		if (ret > 0) {
>  			loff_t end = offset + ret;
> -			if (end > inode->i_size) {
> +			if (end > inode->i_size || end > ei->i_disksize) {
>  				ext4_update_i_disksize(inode, end);
> -				i_size_write(inode, end);
> +				if (end > inode->i_size)
> +					i_size_write(inode, end);
>  				/*
>  				 * We're going to return a positive `ret'
>  				 * here due to non-zero-length I/O, so there's
> -- 
> 2.14.3
>
Sasha Levin March 30, 2018, 4:16 p.m. | #3
Hi again EXT4 ML!

Here's a mail that would be sent as a result of an autoselection. It
will note the "score" of the patch, and will run the same tests we
do for patches tagged for stable.

As before, please let me know your thoughts on this. Thanks!

===

Hi Eryu Guan,

[This is an automated email]

This commit has been processed by the -stable helper bot and determined
to be a high probability candidate for -stable trees. (score: 59.259)

The bot has tested the following trees: v4.15.12, v4.14.29, v4.9.89, v4.4.123, v4.1.50, v3.18.101.

v4.15.12: Build OK!
v4.14.29: Build OK!
v4.9.89: Build OK!
v4.4.123: Failed to apply! Possible dependencies:
    914f82a32d02: ("ext4: refactor direct IO code")
    705965bd6dfa: ("ext4: rename and split get blocks functions")
    ba5843f51d46: ("ext4: use pre-zeroed blocks for DAX page faults")
    c86d8db33a92: ("ext4: implement allocation of pre-zeroed blocks")
    2dcba4781fa3: ("ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag")

v4.1.50: Failed to apply! Possible dependencies:
    914f82a32d02: ("ext4: refactor direct IO code")
    705965bd6dfa: ("ext4: rename and split get blocks functions")
    ba5843f51d46: ("ext4: use pre-zeroed blocks for DAX page faults")
    ed923b5776a2: ("ext4: add ext4_get_block_dax()")
    e676a4c19165: ("ext4: use ext4_get_block_write() for DAX")
    11bd1a9ecdd6: ("ext4: huge page fault support")
    e676a4c19165: ("ext4: use ext4_get_block_write() for DAX")
    01a33b4ace68: ("ext4: start transaction before calling into DAX")
    ed923b5776a2: ("ext4: add ext4_get_block_dax()")
    e676a4c19165: ("ext4: use ext4_get_block_write() for DAX")
    11bd1a9ecdd6: ("ext4: huge page fault support")
    11bd1a9ecdd6: ("ext4: huge page fault support")

v3.18.101: Failed to apply! Possible dependencies:
    914f82a32d02: ("ext4: refactor direct IO code")
    6f67376318ab: ("direct_IO: use iov_iter_rw() instead of rw everywhere")
    ef96fdddcd38: ("staging: lustre: lustre: llite: use DIV_ROUND_UP")
    42b1ab979d92: ("9p: get rid of v9fs_direct_file_read()")
    9565a5445240: ("9p: get rid of v9fs_direct_file_write()")
    9abb40830700: ("fs/affs/file.c: add support to O_DIRECT")
    92b20708f9f0: ("fs/affs/file.c: fix direct IO writes beyond EOF")
    9abb40830700: ("fs/affs/file.c: add support to O_DIRECT")
    17f8c842d24a: ("Remove rw from {,__,do_}blockdev_direct_IO()")
    92b20708f9f0: ("fs/affs/file.c: fix direct IO writes beyond EOF")
    9abb40830700: ("fs/affs/file.c: add support to O_DIRECT")
    9abb40830700: ("fs/affs/file.c: add support to O_DIRECT")


EXT4 Specific tests:

v4.15.12 (http://stable-bot.westus2.cloudapp.azure.com/ext4-test2/v4.15.12/tests/):
    Tests complete.
v4.14.29 (http://stable-bot.westus2.cloudapp.azure.com/ext4-test2/v4.14.29/tests/):
    Tests complete.
v4.9.89 (http://stable-bot.westus2.cloudapp.azure.com/ext4-test2/v4.9.89/tests/):
    Tests complete.

Please let us know if you'd like to have this patch included in a stable tree.

--
Thanks,
Sasha
Theodore Y. Ts'o March 30, 2018, 6:14 p.m. | #4
On Fri, Mar 30, 2018 at 04:16:10PM +0000, Sasha Levin wrote:
> 
> Here's a mail that would be sent as a result of an autoselection. It
> will note the "score" of the patch, and will run the same tests we
> do for patches tagged for stable.
> 
> As before, please let me know your thoughts on this. Thanks!
> ...
> 
> Please let us know if you'd like to have this patch included in a stable tree.

This patch is in the ext4 tree but has not been integrated into the
mainline kernel.  It will be shortly, once the merge window opens.

How are patches selected for potential autoselection?  If they are in
linux-ext but not yet in mainline, I would assume this would be a
reason not to select it, no?

Regards,

					- Ted

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index bff44b4a0783..9acac476c15c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3658,6 +3658,7 @@  static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 {
 	struct file *file = iocb->ki_filp;
 	struct inode *inode = file->f_mapping->host;
+	struct ext4_inode_info *ei = EXT4_I(inode);
 	ssize_t ret;
 	loff_t offset = iocb->ki_pos;
 	size_t count = iov_iter_count(iter);
@@ -3668,7 +3669,7 @@  static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 	int orphan = 0;
 	handle_t *handle;
 
-	if (final_size > inode->i_size) {
+	if (final_size > inode->i_size || final_size > ei->i_disksize) {
 		/* Credits for sb + inode write */
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
 		if (IS_ERR(handle)) {
@@ -3788,9 +3789,10 @@  static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 			ext4_orphan_del(handle, inode);
 		if (ret > 0) {
 			loff_t end = offset + ret;
-			if (end > inode->i_size) {
+			if (end > inode->i_size || end > ei->i_disksize) {
 				ext4_update_i_disksize(inode, end);
-				i_size_write(inode, end);
+				if (end > inode->i_size)
+					i_size_write(inode, end);
 				/*
 				 * We're going to return a positive `ret'
 				 * here due to non-zero-length I/O, so there's