Message ID | 1349289807-18811-1-git-send-email-dmonakhov@openvz.org |
---|---|
State | Superseded, archived |
Headers | show |
Are you planning on updating the 'ext4: completed_io locking cleanup V4' patch series, or is this an additional patch that should be tacked on to it? Thanks, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, 3 Oct 2012 14:55:58 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote: > Are you planning on updating the 'ext4: completed_io locking cleanup > V4' patch series, or is this an additional patch that should be tacked > on to it? This is addidional patch which should be placed on top of 'ext4: completed_io locking cleanup V4' I'll try to convince Jan Kara that this is the best way to fix both bugs. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 03, 2012 at 11:32:27PM +0400, Dmitry Monakhov wrote: > On Wed, 3 Oct 2012 14:55:58 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote: > > Are you planning on updating the 'ext4: completed_io locking cleanup > > V4' patch series, or is this an additional patch that should be tacked > > on to it? > This is addidional patch which should be placed on top of > 'ext4: completed_io locking cleanup V4' > I'll try to convince Jan Kara that this is the best way to fix both bugs. OK, thanks. I've been making the spelling/typo fixes suggested by Jan to the commits in my tree. So we're up to date on that front. I just wanted to know if there were any other changes you think were necessary to the patch series. I'll add the completed_io locking cleanup to my patch series, and start running tests.... - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 03-10-12 22:43:27, Dmitry Monakhov wrote: > BUG #1) All places where we call ext4_flush_completed_IO are broken > because buffered io and DIO/AIO goes through three stages > 1) submitted io, > 2) completed io (in i_completed_io_list) conversion pended > 3) finished io (conversion done) > And by calling ext4_flush_completed_IO we will flush only > requests which were in (2) stage, which is wrong because: > 1) punch_hole and truncate _must_ wait for all outstanding unwritten io > regardless to it's state. > 2) fsync and nolock_dio_read should also wait because there is > a time window between end_page_writeback() and ext4_add_complete_io() > As result integrity fsync is broken in case of buffered write > to fallocated region: > fsync blkdev_completion > ->filemap_write_and_wait_range > ->ext4_end_bio > ->end_page_writeback > <-- filemap_write_and_wait_range return > ->ext4_flush_completed_IO > sees empty i_completed_io_list but pended > conversion still exist > ->ext4_add_complete_io > > BUG #2) Race window becomes wider due to 'ext4: completed_io locking cleanup V4' > > This patch make following changes: > 1) ext4_flush_completed_io() now first try to flush completed io and when > wait for any outstanding unwritten io via ext4_unwritten_wait() > 2) Rename function to more appropriate name. > 3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to > prevent endless wait > > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> This patch looks good except for: > diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c > index 8d849da..37cd5a4 100644 > --- a/fs/ext4/indirect.c > +++ b/fs/ext4/indirect.c > @@ -807,9 +807,11 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, > > retry: > if (rw == READ && ext4_should_dioread_nolock(inode)) { > - if (unlikely(!list_empty(&ei->i_completed_io_list))) > - ext4_flush_completed_IO(inode); > - > + if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) { This condition which seems to be inverted... > + mutex_lock(&inode->i_mutex); > + ext4_flush_unwritten_io(inode); > + mutex_unlock(&inode->i_mutex); > + } > /* > * Nolock dioread optimization may be dynamically disabled > * via ext4_inode_block_unlocked_dio(). Check inode's state After fixing that, you can add: Reviewed-by: Jan Kara <jack@suse.cz> Honza
On Thu, Oct 04, 2012 at 12:11:06PM +0200, Jan Kara wrote: > > retry: > > if (rw == READ && ext4_should_dioread_nolock(inode)) { > > - if (unlikely(!list_empty(&ei->i_completed_io_list))) > > - ext4_flush_completed_IO(inode); > > - > > + if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) { > This condition which seems to be inverted... Nice catch, thanks! I've fixed this in my tree. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Oct 03, 2012 at 10:43:27PM +0400, Dmitry Monakhov wrote: > -int ext4_flush_completed_IO(struct inode *inode) > +int ext4_flush_unwritten_io(struct inode *inode) > { > - return ext4_do_flush_completed_IO(inode, NULL); > + int ret; > + WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex)); > + ret = ext4_do_flush_completed_IO(inode, NULL); > + ext4_unwritten_wait(inode); > + return ret; > } > This WARN_ON is triggering on the truncate path... ------------[ cut here ]------------ WARNING: at /usr/projects/linux/ext4/fs/ext4/page-io.c:232 ext4_flush_unwritten_io+0x2d/0x4c() Hardware name: Bochs Modules linked in: Pid: 1907, comm: findfs Not tainted 3.6.0-rc1-00071-gb1edc6d #441 Call Trace: [<c0159cb3>] warn_slowpath_common+0x68/0x7d [<c0272ebd>] ? ext4_flush_unwritten_io+0x2d/0x4c [<c0159cdc>] warn_slowpath_null+0x14/0x18 [<c0272ebd>] ext4_flush_unwritten_io+0x2d/0x4c [<c028fa09>] ext4_ext_truncate+0x21/0x174 [<c0270cbb>] ? ext4_mark_inode_dirty+0x172/0x1a9 [<c026ed9e>] ext4_truncate+0x7b/0xb9 [<c0272414>] ext4_evict_inode+0x1ec/0x2e3 [<c021a724>] evict+0x94/0x135 [<c021a939>] iput+0x174/0x17a [<c02122bd>] do_unlinkat+0xc8/0x106 [<c06d8b5e>] ? restore_all+0xf/0xf [<c0199909>] ? trace_hardirqs_on_caller+0x103/0x154 [<c02138f0>] sys_unlink+0x15/0x17 [<c06d8b25>] syscall_call+0x7/0xb ---[ end trace cdd8306e94494df8 ]--- - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, 5 Oct 2012 08:40:27 -0400, Theodore Ts'o <tytso@mit.edu> wrote: > On Wed, Oct 03, 2012 at 10:43:27PM +0400, Dmitry Monakhov wrote: > > -int ext4_flush_completed_IO(struct inode *inode) > > +int ext4_flush_unwritten_io(struct inode *inode) > > { > > - return ext4_do_flush_completed_IO(inode, NULL); > > + int ret; > > + WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex)); > > + ret = ext4_do_flush_completed_IO(inode, NULL); > > + ext4_unwritten_wait(inode); > > + return ret; > > } > > > > This WARN_ON is triggering on the truncate path... Yeap, this is false positive one. We skip i_mutex on ext4_evict_inode This is strange xfsstress 269'th should caught that for me. I'll try to prepare workaround ASAP. > > ------------[ cut here ]------------ > WARNING: at /usr/projects/linux/ext4/fs/ext4/page-io.c:232 ext4_flush_unwritten_io+0x2d/0x4c() > Hardware name: Bochs > Modules linked in: > Pid: 1907, comm: findfs Not tainted 3.6.0-rc1-00071-gb1edc6d #441 > Call Trace: > [<c0159cb3>] warn_slowpath_common+0x68/0x7d > [<c0272ebd>] ? ext4_flush_unwritten_io+0x2d/0x4c > [<c0159cdc>] warn_slowpath_null+0x14/0x18 > [<c0272ebd>] ext4_flush_unwritten_io+0x2d/0x4c > [<c028fa09>] ext4_ext_truncate+0x21/0x174 > [<c0270cbb>] ? ext4_mark_inode_dirty+0x172/0x1a9 > [<c026ed9e>] ext4_truncate+0x7b/0xb9 > [<c0272414>] ext4_evict_inode+0x1ec/0x2e3 > [<c021a724>] evict+0x94/0x135 > [<c021a939>] iput+0x174/0x17a > [<c02122bd>] do_unlinkat+0xc8/0x106 > [<c06d8b5e>] ? restore_all+0xf/0xf > [<c0199909>] ? trace_hardirqs_on_caller+0x103/0x154 > [<c02138f0>] sys_unlink+0x15/0x17 > [<c06d8b25>] syscall_call+0x7/0xb > ---[ end trace cdd8306e94494df8 ]--- > > - Ted > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h index 1be2b44..3ab2539 100644 --- a/fs/ext4/ext4.h +++ b/fs/ext4/ext4.h @@ -1947,7 +1947,7 @@ extern void ext4_htree_free_dir_info(struct dir_private_info *p); /* fsync.c */ extern int ext4_sync_file(struct file *, loff_t, loff_t, int); -extern int ext4_flush_completed_IO(struct inode *); +extern int ext4_flush_unwritten_io(struct inode *); /* hash.c */ extern int ext4fs_dirhash(const char *name, int len, struct @@ -2371,6 +2371,7 @@ extern const struct file_operations ext4_dir_operations; extern const struct inode_operations ext4_file_inode_operations; extern const struct file_operations ext4_file_operations; extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin); +extern void ext4_unwritten_wait(struct inode *inode); /* namei.c */ extern const struct inode_operations ext4_dir_inode_operations; diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c index c1fcf48..1c94cca 100644 --- a/fs/ext4/extents.c +++ b/fs/ext4/extents.c @@ -4268,7 +4268,7 @@ void ext4_ext_truncate(struct inode *inode) * finish any pending end_io work so we won't run the risk of * converting any truncated blocks to initialized later */ - ext4_flush_completed_IO(inode); + ext4_flush_unwritten_io(inode); /* * probably first extent we're gonna free will be last in block @@ -4847,10 +4847,10 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length) /* Wait all existing dio workers, newcomers will block on i_mutex */ ext4_inode_block_unlocked_dio(inode); - inode_dio_wait(inode); - err = ext4_flush_completed_IO(inode); + err = ext4_flush_unwritten_io(inode); if (err) goto out_dio; + inode_dio_wait(inode); credits = ext4_writepage_trans_blocks(inode); handle = ext4_journal_start(inode, credits); diff --git a/fs/ext4/file.c b/fs/ext4/file.c index 39335bd..ca6f07a 100644 --- a/fs/ext4/file.c +++ b/fs/ext4/file.c @@ -55,7 +55,7 @@ static int ext4_release_file(struct inode *inode, struct file *filp) return 0; } -static void ext4_unwritten_wait(struct inode *inode) +void ext4_unwritten_wait(struct inode *inode) { wait_queue_head_t *wq = ext4_ioend_wq(inode); diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c index 4600008..be1d89f 100644 --- a/fs/ext4/fsync.c +++ b/fs/ext4/fsync.c @@ -138,7 +138,7 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync) if (inode->i_sb->s_flags & MS_RDONLY) goto out; - ret = ext4_flush_completed_IO(inode); + ret = ext4_flush_unwritten_io(inode); if (ret < 0) goto out; diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 8d849da..37cd5a4 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -807,9 +807,11 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb, retry: if (rw == READ && ext4_should_dioread_nolock(inode)) { - if (unlikely(!list_empty(&ei->i_completed_io_list))) - ext4_flush_completed_IO(inode); - + if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) { + mutex_lock(&inode->i_mutex); + ext4_flush_unwritten_io(inode); + mutex_unlock(&inode->i_mutex); + } /* * Nolock dioread optimization may be dynamically disabled * via ext4_inode_block_unlocked_dio(). Check inode's state diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c index 5b24c40..1f5df21 100644 --- a/fs/ext4/page-io.c +++ b/fs/ext4/page-io.c @@ -189,8 +189,6 @@ static int ext4_do_flush_completed_IO(struct inode *inode, list_add_tail(&io->list, &complete); } - /* It is important to update all flags for all end_io in one shot w/o - * dropping the lock.*/ spin_lock_irqsave(&ei->i_completed_io_lock, flags); while (!list_empty(&complete)) { io = list_entry(complete.next, ext4_io_end_t, list); @@ -228,9 +226,13 @@ static void ext4_end_io_work(struct work_struct *work) ext4_do_flush_completed_IO(io->inode, io); } -int ext4_flush_completed_IO(struct inode *inode) +int ext4_flush_unwritten_io(struct inode *inode) { - return ext4_do_flush_completed_IO(inode, NULL); + int ret; + WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex)); + ret = ext4_do_flush_completed_IO(inode, NULL); + ext4_unwritten_wait(inode); + return ret; } ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
BUG #1) All places where we call ext4_flush_completed_IO are broken because buffered io and DIO/AIO goes through three stages 1) submitted io, 2) completed io (in i_completed_io_list) conversion pended 3) finished io (conversion done) And by calling ext4_flush_completed_IO we will flush only requests which were in (2) stage, which is wrong because: 1) punch_hole and truncate _must_ wait for all outstanding unwritten io regardless to it's state. 2) fsync and nolock_dio_read should also wait because there is a time window between end_page_writeback() and ext4_add_complete_io() As result integrity fsync is broken in case of buffered write to fallocated region: fsync blkdev_completion ->filemap_write_and_wait_range ->ext4_end_bio ->end_page_writeback <-- filemap_write_and_wait_range return ->ext4_flush_completed_IO sees empty i_completed_io_list but pended conversion still exist ->ext4_add_complete_io BUG #2) Race window becomes wider due to 'ext4: completed_io locking cleanup V4' This patch make following changes: 1) ext4_flush_completed_io() now first try to flush completed io and when wait for any outstanding unwritten io via ext4_unwritten_wait() 2) Rename function to more appropriate name. 3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to prevent endless wait Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org> --- fs/ext4/ext4.h | 3 ++- fs/ext4/extents.c | 6 +++--- fs/ext4/file.c | 2 +- fs/ext4/fsync.c | 2 +- fs/ext4/indirect.c | 8 +++++--- fs/ext4/page-io.c | 10 ++++++---- 6 files changed, 18 insertions(+), 13 deletions(-)