Message ID | 20180123083723.24908-1-eguan@redhat.com |
---|---|
State | New, archived |
Headers | show |
Series | ext4: update i_disksize if direct write past ondisk size | expand |
On Tue, Jan 23, 2018 at 04:37:23PM +0800, Eryu Guan wrote: > Currently in ext4 direct write path, we update i_disksize only when > new eof is greater than i_size, and don't update it even when new > eof is greater than i_disksize but less than i_size. This doesn't > work well with delalloc buffer write, which updates i_size and > i_disksize only when delalloc blocks are resolved (at writeback > time), the i_disksize from direct write can be lost if a previous > buffer write succeeded at write time but failed at writeback time, > then results in corrupted ondisk inode size. > > Consider this case, first buffer write 4k data to a new file at > offset 16k with delayed allocation, then direct write 4k data to the > same file at offset 4k before delalloc blocks are resolved, which > doesn't update i_disksize because it writes within i_size(20k), but > the extent tree metadata has been committed in journal. Then > writeback of the delalloc blocks fails (due to device error etc.), > and i_size/i_disksize from buffer write can't be written to disk > (still zero). A subsequent umount/mount cycle recovers journal and > writes extent tree metadata from direct write to disk, but with > i_disksize being zero. > > Fix it by updating i_disksize too in direct write path when new eof > is greater than i_disksize but less than i_size, so i_disksize is > always consistent with direct write. > > This fixes occasional i_size corruption in fstests generic/475. > > Signed-off-by: Eryu Guan <eguan@redhat.com> Ping on this patch. Thanks, Eryu > --- > I think this matches what XFS does in direct write too. > > I've tested it by looping generic/475 200 times without hitting a > corruption, usually it fails within 5 iterations for me. Also tested by > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all > results looked good. > > fs/ext4/inode.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 534a9130f625..2a75b0aafd31 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > int orphan = 0; > handle_t *handle; > > - if (final_size > inode->i_size) { > + if (final_size > inode->i_size || final_size > ei->i_disksize) { > /* Credits for sb + inode write */ > handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > if (IS_ERR(handle)) { > @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > ext4_orphan_del(handle, inode); > if (ret > 0) { > loff_t end = offset + ret; > - if (end > inode->i_size) { > + if (end > inode->i_size || end > ei->i_disksize) { > ei->i_disksize = end; > - i_size_write(inode, end); > + if (end > inode->i_size) > + i_size_write(inode, end); > /* > * We're going to return a positive `ret' > * here due to non-zero-length I/O, so there's > -- > 2.14.3 >
On Tue 23-01-18 16:37:23, Eryu Guan wrote: > Currently in ext4 direct write path, we update i_disksize only when > new eof is greater than i_size, and don't update it even when new > eof is greater than i_disksize but less than i_size. This doesn't > work well with delalloc buffer write, which updates i_size and > i_disksize only when delalloc blocks are resolved (at writeback > time), the i_disksize from direct write can be lost if a previous > buffer write succeeded at write time but failed at writeback time, > then results in corrupted ondisk inode size. > > Consider this case, first buffer write 4k data to a new file at > offset 16k with delayed allocation, then direct write 4k data to the > same file at offset 4k before delalloc blocks are resolved, which > doesn't update i_disksize because it writes within i_size(20k), but > the extent tree metadata has been committed in journal. Then > writeback of the delalloc blocks fails (due to device error etc.), > and i_size/i_disksize from buffer write can't be written to disk > (still zero). A subsequent umount/mount cycle recovers journal and > writes extent tree metadata from direct write to disk, but with > i_disksize being zero. > > Fix it by updating i_disksize too in direct write path when new eof > is greater than i_disksize but less than i_size, so i_disksize is > always consistent with direct write. > > This fixes occasional i_size corruption in fstests generic/475. > > Signed-off-by: Eryu Guan <eguan@redhat.com> > --- > I think this matches what XFS does in direct write too. > > I've tested it by looping generic/475 200 times without hitting a > corruption, usually it fails within 5 iterations for me. Also tested by > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all > results looked good. Thanks for the patch! It looks good to me. Just when looking at these i_disksize updates and thinking about mixing them with page writeback there seems to be another bug that these i_disksize updates are not protected by ei->i_data_sem (which is what protects i_disksize update in the writeback path). So probably that should be fixed up as well as otherwise I'm not sure we cannot corrupt i_disksize in some funny way when writeback and dio write race... Honza > > fs/ext4/inode.c | 7 ++++--- > 1 file changed, 4 insertions(+), 3 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 534a9130f625..2a75b0aafd31 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > int orphan = 0; > handle_t *handle; > > - if (final_size > inode->i_size) { > + if (final_size > inode->i_size || final_size > ei->i_disksize) { > /* Credits for sb + inode write */ > handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); > if (IS_ERR(handle)) { > @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) > ext4_orphan_del(handle, inode); > if (ret > 0) { > loff_t end = offset + ret; > - if (end > inode->i_size) { > + if (end > inode->i_size || end > ei->i_disksize) { > ei->i_disksize = end; > - i_size_write(inode, end); > + if (end > inode->i_size) > + i_size_write(inode, end); > /* > * We're going to return a positive `ret' > * here due to non-zero-length I/O, so there's > -- > 2.14.3 >
On Wed, Mar 07, 2018 at 11:11:10AM +0100, Jan Kara wrote: > On Tue 23-01-18 16:37:23, Eryu Guan wrote: > > Currently in ext4 direct write path, we update i_disksize only when > > new eof is greater than i_size, and don't update it even when new > > eof is greater than i_disksize but less than i_size. This doesn't > > work well with delalloc buffer write, which updates i_size and > > i_disksize only when delalloc blocks are resolved (at writeback > > time), the i_disksize from direct write can be lost if a previous > > buffer write succeeded at write time but failed at writeback time, > > then results in corrupted ondisk inode size. > > > > Consider this case, first buffer write 4k data to a new file at > > offset 16k with delayed allocation, then direct write 4k data to the > > same file at offset 4k before delalloc blocks are resolved, which > > doesn't update i_disksize because it writes within i_size(20k), but > > the extent tree metadata has been committed in journal. Then > > writeback of the delalloc blocks fails (due to device error etc.), > > and i_size/i_disksize from buffer write can't be written to disk > > (still zero). A subsequent umount/mount cycle recovers journal and > > writes extent tree metadata from direct write to disk, but with > > i_disksize being zero. > > > > Fix it by updating i_disksize too in direct write path when new eof > > is greater than i_disksize but less than i_size, so i_disksize is > > always consistent with direct write. > > > > This fixes occasional i_size corruption in fstests generic/475. > > > > Signed-off-by: Eryu Guan <eguan@redhat.com> > > --- > > I think this matches what XFS does in direct write too. > > > > I've tested it by looping generic/475 200 times without hitting a > > corruption, usually it fails within 5 iterations for me. Also tested by > > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all > > results looked good. > > Thanks for the patch! It looks good to me. Just when looking at these Thanks for the review! > i_disksize updates and thinking about mixing them with page writeback there > seems to be another bug that these i_disksize updates are not protected by > ei->i_data_sem (which is what protects i_disksize update in the writeback > path). So probably that should be fixed up as well as otherwise I'm not > sure we cannot corrupt i_disksize in some funny way when writeback and dio > write race... It's been a while and I'll get myself refamiliar with this code path and look into it. Do you think we can apply this patch as is for now and fix the race you mentioned in a follow-up patch? Or I should fix both of them in v2? Thanks, Eryu
On Thu 08-03-18 08:58:17, Eryu Guan wrote: > On Wed, Mar 07, 2018 at 11:11:10AM +0100, Jan Kara wrote: > > On Tue 23-01-18 16:37:23, Eryu Guan wrote: > > > Currently in ext4 direct write path, we update i_disksize only when > > > new eof is greater than i_size, and don't update it even when new > > > eof is greater than i_disksize but less than i_size. This doesn't > > > work well with delalloc buffer write, which updates i_size and > > > i_disksize only when delalloc blocks are resolved (at writeback > > > time), the i_disksize from direct write can be lost if a previous > > > buffer write succeeded at write time but failed at writeback time, > > > then results in corrupted ondisk inode size. > > > > > > Consider this case, first buffer write 4k data to a new file at > > > offset 16k with delayed allocation, then direct write 4k data to the > > > same file at offset 4k before delalloc blocks are resolved, which > > > doesn't update i_disksize because it writes within i_size(20k), but > > > the extent tree metadata has been committed in journal. Then > > > writeback of the delalloc blocks fails (due to device error etc.), > > > and i_size/i_disksize from buffer write can't be written to disk > > > (still zero). A subsequent umount/mount cycle recovers journal and > > > writes extent tree metadata from direct write to disk, but with > > > i_disksize being zero. > > > > > > Fix it by updating i_disksize too in direct write path when new eof > > > is greater than i_disksize but less than i_size, so i_disksize is > > > always consistent with direct write. > > > > > > This fixes occasional i_size corruption in fstests generic/475. > > > > > > Signed-off-by: Eryu Guan <eguan@redhat.com> > > > --- > > > I think this matches what XFS does in direct write too. > > > > > > I've tested it by looping generic/475 200 times without hitting a > > > corruption, usually it fails within 5 iterations for me. Also tested by > > > full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all > > > results looked good. > > > > Thanks for the patch! It looks good to me. Just when looking at these > > Thanks for the review! > > > i_disksize updates and thinking about mixing them with page writeback there > > seems to be another bug that these i_disksize updates are not protected by > > ei->i_data_sem (which is what protects i_disksize update in the writeback > > path). So probably that should be fixed up as well as otherwise I'm not > > sure we cannot corrupt i_disksize in some funny way when writeback and dio > > write race... > > It's been a while and I'll get myself refamiliar with this code path and > look into it. Do you think we can apply this patch as is for now and fix > the race you mentioned in a follow-up patch? Or I should fix both of > them in v2? IMO it would be better to fix the locking in patch 1/2 (the best would be to make that code use ext4_update_i_disksize() which uses proper locking) and then apply your patch as 2/2. Because your patch makes i_disksize updates more frequent especially in the problematic situations where both DIO and writeback code want to update it and so chances these two updates race in a wrong way are made higher by your fix. Thanks! Honza
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 534a9130f625..2a75b0aafd31 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3668,7 +3668,7 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) int orphan = 0; handle_t *handle; - if (final_size > inode->i_size) { + if (final_size > inode->i_size || final_size > ei->i_disksize) { /* Credits for sb + inode write */ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2); if (IS_ERR(handle)) { @@ -3780,9 +3780,10 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter) ext4_orphan_del(handle, inode); if (ret > 0) { loff_t end = offset + ret; - if (end > inode->i_size) { + if (end > inode->i_size || end > ei->i_disksize) { ei->i_disksize = end; - i_size_write(inode, end); + if (end > inode->i_size) + i_size_write(inode, end); /* * We're going to return a positive `ret' * here due to non-zero-length I/O, so there's
Currently in ext4 direct write path, we update i_disksize only when new eof is greater than i_size, and don't update it even when new eof is greater than i_disksize but less than i_size. This doesn't work well with delalloc buffer write, which updates i_size and i_disksize only when delalloc blocks are resolved (at writeback time), the i_disksize from direct write can be lost if a previous buffer write succeeded at write time but failed at writeback time, then results in corrupted ondisk inode size. Consider this case, first buffer write 4k data to a new file at offset 16k with delayed allocation, then direct write 4k data to the same file at offset 4k before delalloc blocks are resolved, which doesn't update i_disksize because it writes within i_size(20k), but the extent tree metadata has been committed in journal. Then writeback of the delalloc blocks fails (due to device error etc.), and i_size/i_disksize from buffer write can't be written to disk (still zero). A subsequent umount/mount cycle recovers journal and writes extent tree metadata from direct write to disk, but with i_disksize being zero. Fix it by updating i_disksize too in direct write path when new eof is greater than i_disksize but less than i_size, so i_disksize is always consistent with direct write. This fixes occasional i_size corruption in fstests generic/475. Signed-off-by: Eryu Guan <eguan@redhat.com> --- I think this matches what XFS does in direct write too. I've tested it by looping generic/475 200 times without hitting a corruption, usually it fails within 5 iterations for me. Also tested by full fstests runs on ext2_4k, ext3_2k, ext4_1k configurations and all results looked good. fs/ext4/inode.c | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-)