diff mbox

ext4: block direct I/O writes during ext4_truncate

Message ID 1373861479-15136-1-git-send-email-tytso@mit.edu
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o July 15, 2013, 4:11 a.m. UTC
Just as in ext4_punch_hole() it is important that we block DIO writes
while the truncate is proceeding, since during the overwriting DIO
write, we drop i_mutex, which means a truncate could start while the
Direct I/O operation is still in progress.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
Cc: stable@vger.kernel.org
---
 fs/ext4/inode.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

Comments

Jan Kara July 16, 2013, 3:46 p.m. UTC | #1
On Mon 15-07-13 00:11:19, Ted Tso wrote:
> Just as in ext4_punch_hole() it is important that we block DIO writes
> while the truncate is proceeding, since during the overwriting DIO
> write, we drop i_mutex, which means a truncate could start while the
> Direct I/O operation is still in progress.
  Do you have an actual test case? Because what you describe shouldn't be
possible even now.

Unlock DIO overwrite does (under i_mutex):
       if (rw == WRITE)
                atomic_inc(&inode->i_dio_count);

        /* If we do a overwrite dio, i_mutex locking can be released */
        overwrite = *((int *)iocb->private);

        if (overwrite) {
                down_read(&EXT4_I(inode)->i_data_sem);
                mutex_unlock(&inode->i_mutex);
        }


and ext4_setattr() does (again under i_mutex):
                                        ext4_inode_block_unlocked_dio(inode);
                                        inode_dio_wait(inode);
                                        ext4_inode_resume_unlocked_dio(inode);

So either DIO gets i_mutex first and then ext4_setattr() waits for it to
complete, or truncate completes before unlocked DIO is able to get & drop
i_mutex.

OTOH unlocked DIO *read* might be vulnerable to a race with truncate. That
never acquires i_mutex so if the DIO read arrives after ext4_setattr() goes
through inode_dio_wait(), we can have the read and truncate racing and read
possibly submitting read of a just truncated block (which can get
reallocated in theory while the IO is running).

So something like what you do in the patch is likely needed, just the
justification is somewhat different and you should also rip out / adjust
the other synchronizations we have in ext4_setattr(), ext4_ext_direct_IO()
and ext4_ind_direct_IO().

								Honza

> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> Cc: stable@vger.kernel.org
> ---
>  fs/ext4/inode.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 98b9bff..3c5edf2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3659,12 +3659,16 @@ void ext4_truncate(struct inode *inode)
>  	if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
>  		ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
>  
> +	/* Wait all existing dio workers, newcomers will block on i_mutex */
> +	ext4_inode_block_unlocked_dio(inode);
> +	inode_dio_wait(inode);
> +
>  	if (ext4_has_inline_data(inode)) {
>  		int has_inline = 1;
>  
>  		ext4_inline_data_truncate(inode, &has_inline);
>  		if (has_inline)
> -			return;
> +			goto out_resume;
>  	}
>  
>  	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> @@ -3675,7 +3679,7 @@ void ext4_truncate(struct inode *inode)
>  	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
>  	if (IS_ERR(handle)) {
>  		ext4_std_error(inode->i_sb, PTR_ERR(handle));
> -		return;
> +		goto out_resume;
>  	}
>  
>  	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
> @@ -3722,6 +3726,8 @@ out_stop:
>  	ext4_mark_inode_dirty(handle, inode);
>  	ext4_journal_stop(handle);
>  
> +out_resume:
> +	ext4_inode_resume_unlocked_dio(inode);
>  	trace_ext4_truncate_exit(inode);
>  }
>  
> -- 
> 1.7.12.rc0.22.gcdd159b
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 16, 2013, 4:49 p.m. UTC | #2
On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote:
> and ext4_setattr() does (again under i_mutex):
>                                         ext4_inode_block_unlocked_dio(inode);
>                                         inode_dio_wait(inode);
>                                         ext4_inode_resume_unlocked_dio(inode);

Ah, I missed this; thanks for pointing this out.

> So either DIO gets i_mutex first and then ext4_setattr() waits for it to
> complete, or truncate completes before unlocked DIO is able to get & drop
> i_mutex.
> 
> OTOH unlocked DIO *read* might be vulnerable to a race with truncate. That
> never acquires i_mutex so if the DIO read arrives after ext4_setattr() goes
> through inode_dio_wait(), we can have the read and truncate racing and read
> possibly submitting read of a just truncated block (which can get
> reallocated in theory while the IO is running).
> 
> So something like what you do in the patch is likely needed, just the
> justification is somewhat different and you should also rip out / adjust
> the other synchronizations we have in ext4_setattr(), ext4_ext_direct_IO()
> and ext4_ind_direct_IO().

Ok, I'll drop my current patch for now, and revisit this when I have a
bit more time.  I agree with your analysis, but fortunately it sounds
like this race is going to be pretty hard to hit in practice ---
especially, since with the journal enabled, we won't allow the block
to get reused until the next commit boundary.  The situation where we
would need to worry would be dioread_nolock combined with no journal
mode.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o July 17, 2013, 6:27 p.m. UTC | #3
On Tue, Jul 16, 2013 at 12:49:42PM -0400, Theodore Ts'o wrote:
> On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote:
> > and ext4_setattr() does (again under i_mutex):
> >                                         ext4_inode_block_unlocked_dio(inode);
> >                                         inode_dio_wait(inode);
> >                                         ext4_inode_resume_unlocked_dio(inode);
> 
> Ah, I missed this; thanks for pointing this out.

I took a closer look, and there are other code paths which call
ext4_truncate() which don't call ext4_inode_block_unlocked_dio() and
inode_dio_wait().  In particular, the orphan cleanup code and the
SETFLAGS ioctl.

I suspect the right answer is to move these these calls, and possibly
also the truncate_pagecache() call.

The other thing that we probably will want to do as a cleanup for the
next merge window is to look at merging more of the code paths for
ext4_punch_hole() and ext4_truncate().

Cheeers,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara July 18, 2013, 2:11 p.m. UTC | #4
On Wed 17-07-13 14:27:10, Ted Tso wrote:
> On Tue, Jul 16, 2013 at 12:49:42PM -0400, Theodore Ts'o wrote:
> > On Tue, Jul 16, 2013 at 05:46:58PM +0200, Jan Kara wrote:
> > > and ext4_setattr() does (again under i_mutex):
> > >                                         ext4_inode_block_unlocked_dio(inode);
> > >                                         inode_dio_wait(inode);
> > >                                         ext4_inode_resume_unlocked_dio(inode);
> > 
> > Ah, I missed this; thanks for pointing this out.
> 
> I took a closer look, and there are other code paths which call
> ext4_truncate() which don't call ext4_inode_block_unlocked_dio() and
> inode_dio_wait().  In particular, the orphan cleanup code and the
> SETFLAGS ioctl.
  True although both should be harmless. Also you have
ext4_truncate_failed_write() which is harmless as well.

> I suspect the right answer is to move these these calls, and possibly
> also the truncate_pagecache() call.
  Could be.

> The other thing that we probably will want to do as a cleanup for the
> next merge window is to look at merging more of the code paths for
> ext4_punch_hole() and ext4_truncate().
  Yeah, that would be definitely worthwhile effort.

								Honza
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 98b9bff..3c5edf2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3659,12 +3659,16 @@  void ext4_truncate(struct inode *inode)
 	if (inode->i_size == 0 && !test_opt(inode->i_sb, NO_AUTO_DA_ALLOC))
 		ext4_set_inode_state(inode, EXT4_STATE_DA_ALLOC_CLOSE);
 
+	/* Wait all existing dio workers, newcomers will block on i_mutex */
+	ext4_inode_block_unlocked_dio(inode);
+	inode_dio_wait(inode);
+
 	if (ext4_has_inline_data(inode)) {
 		int has_inline = 1;
 
 		ext4_inline_data_truncate(inode, &has_inline);
 		if (has_inline)
-			return;
+			goto out_resume;
 	}
 
 	if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
@@ -3675,7 +3679,7 @@  void ext4_truncate(struct inode *inode)
 	handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
 	if (IS_ERR(handle)) {
 		ext4_std_error(inode->i_sb, PTR_ERR(handle));
-		return;
+		goto out_resume;
 	}
 
 	if (inode->i_size & (inode->i_sb->s_blocksize - 1))
@@ -3722,6 +3726,8 @@  out_stop:
 	ext4_mark_inode_dirty(handle, inode);
 	ext4_journal_stop(handle);
 
+out_resume:
+	ext4_inode_resume_unlocked_dio(inode);
 	trace_ext4_truncate_exit(inode);
 }