ext4: update i_disksize if direct write past ondisk size

Message ID 20180123083723.24908-1-eguan@redhat.com
State New
Headers show
Series
  • ext4: update i_disksize if direct write past ondisk size
Related show

Commit Message

Eryu Guan Jan. 23, 2018, 8:37 a.m.
Currently in ext4 direct write path, we update i_disksize only when
new eof is greater than i_size, and don't update it even when new
eof is greater than i_disksize but less than i_size. This doesn't
work well with delalloc buffer write, which updates i_size and
i_disksize only when delalloc blocks are resolved (at writeback
time), the i_disksize from direct write can be lost if a previous
buffer write succeeded at write time but failed at writeback time,
then results in corrupted ondisk inode size.

Consider this case, first buffer write 4k data to a new file at
offset 16k with delayed allocation, then direct write 4k data to the
same file at offset 4k before delalloc blocks are resolved, which
doesn't update i_disksize because it writes within i_size(20k), but
the extent tree metadata has been committed in journal. Then
writeback of the delalloc blocks fails (due to device error etc.),
and i_size/i_disksize from buffer write can't be written to disk
(still zero). A subsequent umount/mount cycle recovers journal and
writes extent tree metadata from direct write to disk, but with
i_disksize being zero.

Fix it by updating i_disksize too in direct write path when new eof
is greater than i_disksize but less than i_size, so i_disksize is
always consistent with direct write.

This fixes occasional i_size corruption in fstests generic/475.

Signed-off-by: Eryu Guan <eguan@redhat.com>
---
I think this matches what XFS does in direct write too.

I've tested it by looping generic/475 200 times without hitting a
corruption, usually it fails within 5 iterations for me. Also tested by
full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
results looked good.

 fs/ext4/inode.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

Comments

Eryu Guan Feb. 18, 2018, noon | #1
On Tue, Jan 23, 2018 at 04:37:23PM +0800, Eryu Guan wrote:
> Currently in ext4 direct write path, we update i_disksize only when
> new eof is greater than i_size, and don't update it even when new
> eof is greater than i_disksize but less than i_size. This doesn't
> work well with delalloc buffer write, which updates i_size and
> i_disksize only when delalloc blocks are resolved (at writeback
> time), the i_disksize from direct write can be lost if a previous
> buffer write succeeded at write time but failed at writeback time,
> then results in corrupted ondisk inode size.
> 
> Consider this case, first buffer write 4k data to a new file at
> offset 16k with delayed allocation, then direct write 4k data to the
> same file at offset 4k before delalloc blocks are resolved, which
> doesn't update i_disksize because it writes within i_size(20k), but
> the extent tree metadata has been committed in journal. Then
> writeback of the delalloc blocks fails (due to device error etc.),
> and i_size/i_disksize from buffer write can't be written to disk
> (still zero). A subsequent umount/mount cycle recovers journal and
> writes extent tree metadata from direct write to disk, but with
> i_disksize being zero.
> 
> Fix it by updating i_disksize too in direct write path when new eof
> is greater than i_disksize but less than i_size, so i_disksize is
> always consistent with direct write.
> 
> This fixes occasional i_size corruption in fstests generic/475.
> 
> Signed-off-by: Eryu Guan <eguan@redhat.com>

Ping on this patch.

Thanks,
Eryu

> ---
> I think this matches what XFS does in direct write too.
> 
> I've tested it by looping generic/475 200 times without hitting a
> corruption, usually it fails within 5 iterations for me. Also tested by
> full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
> results looked good.
> 
>  fs/ext4/inode.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 534a9130f625..2a75b0aafd31 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  	int orphan = 0;
>  	handle_t *handle;
>  
> -	if (final_size > inode->i_size) {
> +	if (final_size > inode->i_size || final_size > ei->i_disksize) {
>  		/* Credits for sb + inode write */
>  		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>  		if (IS_ERR(handle)) {
> @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  			ext4_orphan_del(handle, inode);
>  		if (ret > 0) {
>  			loff_t end = offset + ret;
> -			if (end > inode->i_size) {
> +			if (end > inode->i_size || end > ei->i_disksize) {
>  				ei->i_disksize = end;
> -				i_size_write(inode, end);
> +				if (end > inode->i_size)
> +					i_size_write(inode, end);
>  				/*
>  				 * We're going to return a positive `ret'
>  				 * here due to non-zero-length I/O, so there's
> -- 
> 2.14.3
>
Jan Kara March 7, 2018, 10:11 a.m. | #2
On Tue 23-01-18 16:37:23, Eryu Guan wrote:
> Currently in ext4 direct write path, we update i_disksize only when
> new eof is greater than i_size, and don't update it even when new
> eof is greater than i_disksize but less than i_size. This doesn't
> work well with delalloc buffer write, which updates i_size and
> i_disksize only when delalloc blocks are resolved (at writeback
> time), the i_disksize from direct write can be lost if a previous
> buffer write succeeded at write time but failed at writeback time,
> then results in corrupted ondisk inode size.
> 
> Consider this case, first buffer write 4k data to a new file at
> offset 16k with delayed allocation, then direct write 4k data to the
> same file at offset 4k before delalloc blocks are resolved, which
> doesn't update i_disksize because it writes within i_size(20k), but
> the extent tree metadata has been committed in journal. Then
> writeback of the delalloc blocks fails (due to device error etc.),
> and i_size/i_disksize from buffer write can't be written to disk
> (still zero). A subsequent umount/mount cycle recovers journal and
> writes extent tree metadata from direct write to disk, but with
> i_disksize being zero.
> 
> Fix it by updating i_disksize too in direct write path when new eof
> is greater than i_disksize but less than i_size, so i_disksize is
> always consistent with direct write.
> 
> This fixes occasional i_size corruption in fstests generic/475.
> 
> Signed-off-by: Eryu Guan <eguan@redhat.com>
> ---
> I think this matches what XFS does in direct write too.
> 
> I've tested it by looping generic/475 200 times without hitting a
> corruption, usually it fails within 5 iterations for me. Also tested by
> full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
> results looked good.

Thanks for the patch! It looks good to me. Just when looking at these
i_disksize updates and thinking about mixing them with page writeback there
seems to be another bug that these i_disksize updates are not protected by
ei->i_data_sem (which is what protects i_disksize update in the writeback
path). So probably that should be fixed up as well as otherwise I'm not
sure we cannot corrupt i_disksize in some funny way when writeback and dio
write race...

								Honza

> 
>  fs/ext4/inode.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 534a9130f625..2a75b0aafd31 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  	int orphan = 0;
>  	handle_t *handle;
>  
> -	if (final_size > inode->i_size) {
> +	if (final_size > inode->i_size || final_size > ei->i_disksize) {
>  		/* Credits for sb + inode write */
>  		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>  		if (IS_ERR(handle)) {
> @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
>  			ext4_orphan_del(handle, inode);
>  		if (ret > 0) {
>  			loff_t end = offset + ret;
> -			if (end > inode->i_size) {
> +			if (end > inode->i_size || end > ei->i_disksize) {
>  				ei->i_disksize = end;
> -				i_size_write(inode, end);
> +				if (end > inode->i_size)
> +					i_size_write(inode, end);
>  				/*
>  				 * We're going to return a positive `ret'
>  				 * here due to non-zero-length I/O, so there's
> -- 
> 2.14.3
>
Eryu Guan March 8, 2018, 12:58 a.m. | #3
On Wed, Mar 07, 2018 at 11:11:10AM +0100, Jan Kara wrote:
> On Tue 23-01-18 16:37:23, Eryu Guan wrote:
> > Currently in ext4 direct write path, we update i_disksize only when
> > new eof is greater than i_size, and don't update it even when new
> > eof is greater than i_disksize but less than i_size. This doesn't
> > work well with delalloc buffer write, which updates i_size and
> > i_disksize only when delalloc blocks are resolved (at writeback
> > time), the i_disksize from direct write can be lost if a previous
> > buffer write succeeded at write time but failed at writeback time,
> > then results in corrupted ondisk inode size.
> > 
> > Consider this case, first buffer write 4k data to a new file at
> > offset 16k with delayed allocation, then direct write 4k data to the
> > same file at offset 4k before delalloc blocks are resolved, which
> > doesn't update i_disksize because it writes within i_size(20k), but
> > the extent tree metadata has been committed in journal. Then
> > writeback of the delalloc blocks fails (due to device error etc.),
> > and i_size/i_disksize from buffer write can't be written to disk
> > (still zero). A subsequent umount/mount cycle recovers journal and
> > writes extent tree metadata from direct write to disk, but with
> > i_disksize being zero.
> > 
> > Fix it by updating i_disksize too in direct write path when new eof
> > is greater than i_disksize but less than i_size, so i_disksize is
> > always consistent with direct write.
> > 
> > This fixes occasional i_size corruption in fstests generic/475.
> > 
> > Signed-off-by: Eryu Guan <eguan@redhat.com>
> > ---
> > I think this matches what XFS does in direct write too.
> > 
> > I've tested it by looping generic/475 200 times without hitting a
> > corruption, usually it fails within 5 iterations for me. Also tested by
> > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
> > results looked good.
> 
> Thanks for the patch! It looks good to me. Just when looking at these

Thanks for the review!

> i_disksize updates and thinking about mixing them with page writeback there
> seems to be another bug that these i_disksize updates are not protected by
> ei->i_data_sem (which is what protects i_disksize update in the writeback
> path). So probably that should be fixed up as well as otherwise I'm not
> sure we cannot corrupt i_disksize in some funny way when writeback and dio
> write race...

It's been a while and I'll get myself refamiliar with this code path and
look into it. Do you think we can apply this patch as is for now and fix
the race you mentioned in a follow-up patch? Or I should fix both of
them in v2?

Thanks,
Eryu
Jan Kara March 8, 2018, 9 a.m. | #4
On Thu 08-03-18 08:58:17, Eryu Guan wrote:
> On Wed, Mar 07, 2018 at 11:11:10AM +0100, Jan Kara wrote:
> > On Tue 23-01-18 16:37:23, Eryu Guan wrote:
> > > Currently in ext4 direct write path, we update i_disksize only when
> > > new eof is greater than i_size, and don't update it even when new
> > > eof is greater than i_disksize but less than i_size. This doesn't
> > > work well with delalloc buffer write, which updates i_size and
> > > i_disksize only when delalloc blocks are resolved (at writeback
> > > time), the i_disksize from direct write can be lost if a previous
> > > buffer write succeeded at write time but failed at writeback time,
> > > then results in corrupted ondisk inode size.
> > > 
> > > Consider this case, first buffer write 4k data to a new file at
> > > offset 16k with delayed allocation, then direct write 4k data to the
> > > same file at offset 4k before delalloc blocks are resolved, which
> > > doesn't update i_disksize because it writes within i_size(20k), but
> > > the extent tree metadata has been committed in journal. Then
> > > writeback of the delalloc blocks fails (due to device error etc.),
> > > and i_size/i_disksize from buffer write can't be written to disk
> > > (still zero). A subsequent umount/mount cycle recovers journal and
> > > writes extent tree metadata from direct write to disk, but with
> > > i_disksize being zero.
> > > 
> > > Fix it by updating i_disksize too in direct write path when new eof
> > > is greater than i_disksize but less than i_size, so i_disksize is
> > > always consistent with direct write.
> > > 
> > > This fixes occasional i_size corruption in fstests generic/475.
> > > 
> > > Signed-off-by: Eryu Guan <eguan@redhat.com>
> > > ---
> > > I think this matches what XFS does in direct write too.
> > > 
> > > I've tested it by looping generic/475 200 times without hitting a
> > > corruption, usually it fails within 5 iterations for me. Also tested by
> > > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all
> > > results looked good.
> > 
> > Thanks for the patch! It looks good to me. Just when looking at these
> 
> Thanks for the review!
> 
> > i_disksize updates and thinking about mixing them with page writeback there
> > seems to be another bug that these i_disksize updates are not protected by
> > ei->i_data_sem (which is what protects i_disksize update in the writeback
> > path). So probably that should be fixed up as well as otherwise I'm not
> > sure we cannot corrupt i_disksize in some funny way when writeback and dio
> > write race...
> 
> It's been a while and I'll get myself refamiliar with this code path and
> look into it. Do you think we can apply this patch as is for now and fix
> the race you mentioned in a follow-up patch? Or I should fix both of
> them in v2?

IMO it would be better to fix the locking in patch 1/2 (the best would be
to make that code use ext4_update_i_disksize() which uses proper locking)
and then apply your patch as 2/2. Because your patch makes i_disksize
updates more frequent especially in the problematic situations where both
DIO and writeback code want to update it and so chances these two updates
race in a wrong way are made higher by your fix. Thanks!

								Honza

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 534a9130f625..2a75b0aafd31 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3668,7 +3668,7 @@  static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 	int orphan = 0;
 	handle_t *handle;
 
-	if (final_size > inode->i_size) {
+	if (final_size > inode->i_size || final_size > ei->i_disksize) {
 		/* Credits for sb + inode write */
 		handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
 		if (IS_ERR(handle)) {
@@ -3780,9 +3780,10 @@  static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
 			ext4_orphan_del(handle, inode);
 		if (ret > 0) {
 			loff_t end = offset + ret;
-			if (end > inode->i_size) {
+			if (end > inode->i_size || end > ei->i_disksize) {
 				ei->i_disksize = end;
-				i_size_write(inode, end);
+				if (end > inode->i_size)
+					i_size_write(inode, end);
 				/*
 				 * We're going to return a positive `ret'
 				 * here due to non-zero-length I/O, so there's