Message ID | 1373861479-15136-1-git-send-email-tytso@mit.edu |
---|---|
State | Accepted, archived |
Headers | show |
On Mon 15-07-13 00:11:19, Ted Tso wrote: > Just as in ext4_punch_hole() it is important that we block DIO writes > while the truncate is proceeding, since during the overwriting DIO > write, we drop i_mutex, which means a truncate could start while the > Direct I/O operation is still in progress. Do you have an actual test case? Because what you describe shouldn't be possible even now. Unlock DIO overwrite does (under i_mutex): if (rw == WRITE) atomic_inc(&inode->i_dio_count); /* If we do a overwrite dio, i_mutex locking can be released */ overwrite = *((int *)iocb->private); if (overwrite) { down_read(&EXT4_I(inode)->i_data_sem); mutex_unlock(&inode->i_mutex); } and ext4_setattr() does (again under i_mutex): ext4_inode_block_unlocked_dio(inode); inode_dio_wait(inode); ext4_inode_resume_unlocked_dio(inode); So either DIO gets i_mutex first and then ext4_setattr() waits for it to complete, or truncate completes before unlocked DIO is able to get & drop i_mutex. OTOH unlocked DIO *read* might be vulnerable to a race with truncate. That never acquires i_mutex so if the DIO read arrives after ext4_setattr() goes through inode_dio_wait(), we can have the read and truncate racing and read possibly submitting read of a just truncated block (which can get reallocated in theory while the IO is running). So something like what you do in the patch is likely needed, just the justification is somewhat different and you should also rip out / adjust the other synchronizations we have in ext4_setattr(), ext4_ext_direct_IO() and ext4_ind_direct_IO(). Honza > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> > Cc: stable@vger.kernel.org > --- > fs/ext4/inode.c | 10 ++++++++-- > 1 file changed, 8 insertions(+), 2 deletions(-) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 98b9bff..3c5edf2 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -3659,12 +3659,16 @@ void ext4_truncate(struct inode *inode) > if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC)) > ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE); > > + /* Wait all existing dio workers, newcomers will block on i_mutex */ > + ext4_inode_block_unlocked_dio(inode); > + inode_dio_wait(inode); > + > if (ext4_has_inline_data(inode)) { > int has_inline = 1; > > ext4_inline_data_truncate(inode, &has_inline); > if (has_inline) > - return; > + goto out_resume; > } > > if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) > @@ -3675,7 +3679,7 @@ void ext4_truncate(struct inode *inode) > handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits); > if (IS_ERR(handle)) { > ext4_std_error(inode->i_sb, PTR_ERR(handle)); > - return; > + goto out_resume; > } > > if (inode->i_size & (inode->i_sb->s_blocksize - 1)) > @@ -3722,6 +3726,8 @@ out_stop: > ext4_mark_inode_dirty(handle, inode); > ext4_journal_stop(handle); > > +out_resume: > + ext4_inode_resume_unlocked_dio(inode); > trace_ext4_truncate_exit(inode); > } > > -- > 1.7.12.rc0.22.gcdd159b > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote: > and ext4_setattr() does (again under i_mutex): > ext4_inode_block_unlocked_dio(inode); > inode_dio_wait(inode); > ext4_inode_resume_unlocked_dio(inode); Ah, I missed this; thanks for pointing this out. > So either DIO gets i_mutex first and then ext4_setattr() waits for it to > complete, or truncate completes before unlocked DIO is able to get & drop > i_mutex. > > OTOH unlocked DIO *read* might be vulnerable to a race with truncate. That > never acquires i_mutex so if the DIO read arrives after ext4_setattr() goes > through inode_dio_wait(), we can have the read and truncate racing and read > possibly submitting read of a just truncated block (which can get > reallocated in theory while the IO is running). > > So something like what you do in the patch is likely needed, just the > justification is somewhat different and you should also rip out / adjust > the other synchronizations we have in ext4_setattr(), ext4_ext_direct_IO() > and ext4_ind_direct_IO(). Ok, I'll drop my current patch for now, and revisit this when I have a bit more time. I agree with your analysis, but fortunately it sounds like this race is going to be pretty hard to hit in practice --- especially, since with the journal enabled, we won't allow the block to get reused until the next commit boundary. The situation where we would need to worry would be dioread_nolock combined with no journal mode. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Tue, Jul 16, 2013 at 12:49:42PM -0400, Theodore Ts'o wrote: > On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote: > > and ext4_setattr() does (again under i_mutex): > > ext4_inode_block_unlocked_dio(inode); > > inode_dio_wait(inode); > > ext4_inode_resume_unlocked_dio(inode); > > Ah, I missed this; thanks for pointing this out. I took a closer look, and there are other code paths which call ext4_truncate() which don't call ext4_inode_block_unlocked_dio() and inode_dio_wait(). In particular, the orphan cleanup code and the SETFLAGS ioctl. I suspect the right answer is to move these these calls, and possibly also the truncate_pagecache() call. The other thing that we probably will want to do as a cleanup for the next merge window is to look at merging more of the code paths for ext4_punch_hole() and ext4_truncate(). Cheeers, - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed 17-07-13 14:27:10, Ted Tso wrote: > On Tue, Jul 16, 2013 at 12:49:42PM -0400, Theodore Ts'o wrote: > > On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote: > > > and ext4_setattr() does (again under i_mutex): > > > ext4_inode_block_unlocked_dio(inode); > > > inode_dio_wait(inode); > > > ext4_inode_resume_unlocked_dio(inode); > > > > Ah, I missed this; thanks for pointing this out. > > I took a closer look, and there are other code paths which call > ext4_truncate() which don't call ext4_inode_block_unlocked_dio() and > inode_dio_wait(). In particular, the orphan cleanup code and the > SETFLAGS ioctl. True although both should be harmless. Also you have ext4_truncate_failed_write() which is harmless as well. > I suspect the right answer is to move these these calls, and possibly > also the truncate_pagecache() call. Could be. > The other thing that we probably will want to do as a cleanup for the > next merge window is to look at merging more of the code paths for > ext4_punch_hole() and ext4_truncate(). Yeah, that would be definitely worthwhile effort. Honza
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 98b9bff..3c5edf2 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3659,12 +3659,16 @@ void ext4_truncate(struct inode *inode) if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC)) ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE); + /* Wait all existing dio workers, newcomers will block on i_mutex */ + ext4_inode_block_unlocked_dio(inode); + inode_dio_wait(inode); + if (ext4_has_inline_data(inode)) { int has_inline = 1; ext4_inline_data_truncate(inode, &has_inline); if (has_inline) - return; + goto out_resume; } if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) @@ -3675,7 +3679,7 @@ void ext4_truncate(struct inode *inode) handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits); if (IS_ERR(handle)) { ext4_std_error(inode->i_sb, PTR_ERR(handle)); - return; + goto out_resume; } if (inode->i_size & (inode->i_sb->s_blocksize - 1)) @@ -3722,6 +3726,8 @@ out_stop: ext4_mark_inode_dirty(handle, inode); ext4_journal_stop(handle); +out_resume: + ext4_inode_resume_unlocked_dio(inode); trace_ext4_truncate_exit(inode); }
Just as in ext4_punch_hole() it is important that we block DIO writes while the truncate is proceeding, since during the overwriting DIO write, we drop i_mutex, which means a truncate could start while the Direct I/O operation is still in progress. Signed-off-by: "Theodore Ts'o" <tytso@mit.edu> Cc: stable@vger.kernel.org --- fs/ext4/inode.c | 10 ++++++++-- 1 file changed, 8 insertions(+), 2 deletions(-)