diff mbox

ext4: fix ext4_flush_completed_IO wait semantics

Message ID 1349289807-18811-1-git-send-email-dmonakhov@openvz.org
State Superseded, archived
Headers show

Commit Message

Dmitry Monakhov Oct. 3, 2012, 6:43 p.m. UTC
BUG #1) All places where we call ext4_flush_completed_IO are broken
    because buffered io and DIO/AIO goes through three stages
    1) submitted io,
    2) completed io (in i_completed_io_list) conversion pended
    3) finished  io (conversion done)
    And by calling ext4_flush_completed_IO we will flush only
    requests which were in (2) stage, which is wrong because:
     1) punch_hole and truncate _must_ wait for all outstanding unwritten io
      regardless to it's state.
     2) fsync and nolock_dio_read should also wait because there is
        a time window between end_page_writeback() and ext4_add_complete_io()
        As result integrity fsync is broken in case of buffered write
        to fallocated region:
        fsync                                      blkdev_completion
	 ->filemap_write_and_wait_range
                                                   ->ext4_end_bio
                                                     ->end_page_writeback
          <-- filemap_write_and_wait_range return
	 ->ext4_flush_completed_IO
   	 sees empty i_completed_io_list but pended
   	 conversion still exist
                                                     ->ext4_add_complete_io

BUG #2) Race window becomes wider due to 'ext4: completed_io locking cleanup V4'

This patch make following changes:
1) ext4_flush_completed_io() now first try to flush completed io and when
   wait for any outstanding unwritten io via ext4_unwritten_wait()
2) Rename function to more appropriate name.
3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
   prevent endless wait

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4.h     |    3 ++-
 fs/ext4/extents.c  |    6 +++---
 fs/ext4/file.c     |    2 +-
 fs/ext4/fsync.c    |    2 +-
 fs/ext4/indirect.c |    8 +++++---
 fs/ext4/page-io.c  |   10 ++++++----
 6 files changed, 18 insertions(+), 13 deletions(-)

Comments

Theodore Ts'o Oct. 3, 2012, 6:55 p.m. UTC | #1
Are you planning on updating the 'ext4: completed_io locking cleanup
V4' patch series, or is this an additional patch that should be tacked
on to it?

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Monakhov Oct. 3, 2012, 7:32 p.m. UTC | #2
On Wed, 3 Oct 2012 14:55:58 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> Are you planning on updating the 'ext4: completed_io locking cleanup
> V4' patch series, or is this an additional patch that should be tacked
> on to it?
This is addidional patch which should be placed on top of
'ext4: completed_io locking cleanup V4'
I'll try to convince Jan Kara that this is the best way to fix both bugs.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Oct. 3, 2012, 8:24 p.m. UTC | #3
On Wed, Oct 03, 2012 at 11:32:27PM +0400, Dmitry Monakhov wrote:
> On Wed, 3 Oct 2012 14:55:58 -0400, "Theodore Ts'o" <tytso@mit.edu> wrote:
> > Are you planning on updating the 'ext4: completed_io locking cleanup
> > V4' patch series, or is this an additional patch that should be tacked
> > on to it?
> This is addidional patch which should be placed on top of
> 'ext4: completed_io locking cleanup V4'
> I'll try to convince Jan Kara that this is the best way to fix both bugs.

OK, thanks.  I've been making the spelling/typo fixes suggested by Jan
to the commits in my tree.  So we're up to date on that front.  I just
wanted to know if there were any other changes you think were
necessary to the patch series.

I'll add the completed_io locking cleanup to my patch series, and
start running tests....

						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Oct. 4, 2012, 10:11 a.m. UTC | #4
On Wed 03-10-12 22:43:27, Dmitry Monakhov wrote:
> BUG #1) All places where we call ext4_flush_completed_IO are broken
>     because buffered io and DIO/AIO goes through three stages
>     1) submitted io,
>     2) completed io (in i_completed_io_list) conversion pended
>     3) finished  io (conversion done)
>     And by calling ext4_flush_completed_IO we will flush only
>     requests which were in (2) stage, which is wrong because:
>      1) punch_hole and truncate _must_ wait for all outstanding unwritten io
>       regardless to it's state.
>      2) fsync and nolock_dio_read should also wait because there is
>         a time window between end_page_writeback() and ext4_add_complete_io()
>         As result integrity fsync is broken in case of buffered write
>         to fallocated region:
>         fsync                                      blkdev_completion
> 	 ->filemap_write_and_wait_range
>                                                    ->ext4_end_bio
>                                                      ->end_page_writeback
>           <-- filemap_write_and_wait_range return
> 	 ->ext4_flush_completed_IO
>    	 sees empty i_completed_io_list but pended
>    	 conversion still exist
>                                                      ->ext4_add_complete_io
> 
> BUG #2) Race window becomes wider due to 'ext4: completed_io locking cleanup V4'
> 
> This patch make following changes:
> 1) ext4_flush_completed_io() now first try to flush completed io and when
>    wait for any outstanding unwritten io via ext4_unwritten_wait()
> 2) Rename function to more appropriate name.
> 3) Assert that all callers of ext4_flush_unwritten_io should hold i_mutex to
>    prevent endless wait
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
  This patch looks good except for:

> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 8d849da..37cd5a4 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,9 +807,11 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(!list_empty(&ei->i_completed_io_list)))
> -			ext4_flush_completed_IO(inode);
> -
> +		if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) {
  This condition which seems to be inverted...

> +			mutex_lock(&inode->i_mutex);
> +			ext4_flush_unwritten_io(inode);
> +			mutex_unlock(&inode->i_mutex);
> +		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state

  After fixing that, you can add:
Reviewed-by: Jan Kara <jack@suse.cz>

									Honza
Theodore Ts'o Oct. 5, 2012, 4:20 a.m. UTC | #5
On Thu, Oct 04, 2012 at 12:11:06PM +0200, Jan Kara wrote:
> >  retry:
> >  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> > -		if (unlikely(!list_empty(&ei->i_completed_io_list)))
> > -			ext4_flush_completed_IO(inode);
> > -
> > +		if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) {
>   This condition which seems to be inverted...

Nice catch, thanks!  I've fixed this in my tree.

     	    	   	      	      - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Oct. 5, 2012, 12:40 p.m. UTC | #6
On Wed, Oct 03, 2012 at 10:43:27PM +0400, Dmitry Monakhov wrote:
> -int ext4_flush_completed_IO(struct inode *inode)
> +int ext4_flush_unwritten_io(struct inode *inode)
>  {
> -	return ext4_do_flush_completed_IO(inode, NULL);
> +	int ret;
> +	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
> +	ret = ext4_do_flush_completed_IO(inode, NULL);
> +	ext4_unwritten_wait(inode);
> +	return ret;
>  }
>  

This WARN_ON is triggering on the truncate path...

------------[ cut here ]------------
WARNING: at /usr/projects/linux/ext4/fs/ext4/page-io.c:232 ext4_flush_unwritten_io+0x2d/0x4c()
Hardware name: Bochs
Modules linked in:
Pid: 1907, comm: findfs Not tainted 3.6.0-rc1-00071-gb1edc6d #441
Call Trace:
 [<c0159cb3>] warn_slowpath_common+0x68/0x7d
 [<c0272ebd>] ? ext4_flush_unwritten_io+0x2d/0x4c
 [<c0159cdc>] warn_slowpath_null+0x14/0x18
 [<c0272ebd>] ext4_flush_unwritten_io+0x2d/0x4c
 [<c028fa09>] ext4_ext_truncate+0x21/0x174
 [<c0270cbb>] ? ext4_mark_inode_dirty+0x172/0x1a9
 [<c026ed9e>] ext4_truncate+0x7b/0xb9
 [<c0272414>] ext4_evict_inode+0x1ec/0x2e3
 [<c021a724>] evict+0x94/0x135
 [<c021a939>] iput+0x174/0x17a
 [<c02122bd>] do_unlinkat+0xc8/0x106
 [<c06d8b5e>] ? restore_all+0xf/0xf
 [<c0199909>] ? trace_hardirqs_on_caller+0x103/0x154
 [<c02138f0>] sys_unlink+0x15/0x17
 [<c06d8b25>] syscall_call+0x7/0xb
---[ end trace cdd8306e94494df8 ]---

							- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitry Monakhov Oct. 5, 2012, 1:01 p.m. UTC | #7
On Fri, 5 Oct 2012 08:40:27 -0400, Theodore Ts'o <tytso@mit.edu> wrote:
> On Wed, Oct 03, 2012 at 10:43:27PM +0400, Dmitry Monakhov wrote:
> > -int ext4_flush_completed_IO(struct inode *inode)
> > +int ext4_flush_unwritten_io(struct inode *inode)
> >  {
> > -	return ext4_do_flush_completed_IO(inode, NULL);
> > +	int ret;
> > +	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
> > +	ret = ext4_do_flush_completed_IO(inode, NULL);
> > +	ext4_unwritten_wait(inode);
> > +	return ret;
> >  }
> >  
> 
> This WARN_ON is triggering on the truncate path...
Yeap, this is false positive one. We skip i_mutex on ext4_evict_inode
This is strange xfsstress 269'th should caught that for me.
I'll try to prepare workaround ASAP.
> 
> ------------[ cut here ]------------
> WARNING: at /usr/projects/linux/ext4/fs/ext4/page-io.c:232 ext4_flush_unwritten_io+0x2d/0x4c()
> Hardware name: Bochs
> Modules linked in:
> Pid: 1907, comm: findfs Not tainted 3.6.0-rc1-00071-gb1edc6d #441
> Call Trace:
>  [<c0159cb3>] warn_slowpath_common+0x68/0x7d
>  [<c0272ebd>] ? ext4_flush_unwritten_io+0x2d/0x4c
>  [<c0159cdc>] warn_slowpath_null+0x14/0x18
>  [<c0272ebd>] ext4_flush_unwritten_io+0x2d/0x4c
>  [<c028fa09>] ext4_ext_truncate+0x21/0x174
>  [<c0270cbb>] ? ext4_mark_inode_dirty+0x172/0x1a9
>  [<c026ed9e>] ext4_truncate+0x7b/0xb9
>  [<c0272414>] ext4_evict_inode+0x1ec/0x2e3
>  [<c021a724>] evict+0x94/0x135
>  [<c021a939>] iput+0x174/0x17a
>  [<c02122bd>] do_unlinkat+0xc8/0x106
>  [<c06d8b5e>] ? restore_all+0xf/0xf
>  [<c0199909>] ? trace_hardirqs_on_caller+0x103/0x154
>  [<c02138f0>] sys_unlink+0x15/0x17
>  [<c06d8b25>] syscall_call+0x7/0xb
> ---[ end trace cdd8306e94494df8 ]---
> 
> 							- Ted
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1be2b44..3ab2539 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1947,7 +1947,7 @@  extern void ext4_htree_free_dir_info(struct dir_private_info *p);
 
 /* fsync.c */
 extern int ext4_sync_file(struct file *, loff_t, loff_t, int);
-extern int ext4_flush_completed_IO(struct inode *);
+extern int ext4_flush_unwritten_io(struct inode *);
 
 /* hash.c */
 extern int ext4fs_dirhash(const char *name, int len, struct
@@ -2371,6 +2371,7 @@  extern const struct file_operations ext4_dir_operations;
 extern const struct inode_operations ext4_file_inode_operations;
 extern const struct file_operations ext4_file_operations;
 extern loff_t ext4_llseek(struct file *file, loff_t offset, int origin);
+extern void ext4_unwritten_wait(struct inode *inode);
 
 /* namei.c */
 extern const struct inode_operations ext4_dir_inode_operations;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c1fcf48..1c94cca 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4268,7 +4268,7 @@  void ext4_ext_truncate(struct inode *inode)
 	 * finish any pending end_io work so we won't run the risk of
 	 * converting any truncated blocks to initialized later
 	 */
-	ext4_flush_completed_IO(inode);
+	ext4_flush_unwritten_io(inode);
 
 	/*
 	 * probably first extent we're gonna free will be last in block
@@ -4847,10 +4847,10 @@  int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
 
 	/* Wait all existing dio workers, newcomers will block on i_mutex */
 	ext4_inode_block_unlocked_dio(inode);
-	inode_dio_wait(inode);
-	err = ext4_flush_completed_IO(inode);
+	err = ext4_flush_unwritten_io(inode);
 	if (err)
 		goto out_dio;
+	inode_dio_wait(inode);
 
 	credits = ext4_writepage_trans_blocks(inode);
 	handle = ext4_journal_start(inode, credits);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 39335bd..ca6f07a 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -55,7 +55,7 @@  static int ext4_release_file(struct inode *inode, struct file *filp)
 	return 0;
 }
 
-static void ext4_unwritten_wait(struct inode *inode)
+void ext4_unwritten_wait(struct inode *inode)
 {
 	wait_queue_head_t *wq = ext4_ioend_wq(inode);
 
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 4600008..be1d89f 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -138,7 +138,7 @@  int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
 	if (inode->i_sb->s_flags & MS_RDONLY)
 		goto out;
 
-	ret = ext4_flush_completed_IO(inode);
+	ret = ext4_flush_unwritten_io(inode);
 	if (ret < 0)
 		goto out;
 
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 8d849da..37cd5a4 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,9 +807,11 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(!list_empty(&ei->i_completed_io_list)))
-			ext4_flush_completed_IO(inode);
-
+		if (unlikely(!atomic_read(&EXT4_I(inode)->i_unwritten))) {
+			mutex_lock(&inode->i_mutex);
+			ext4_flush_unwritten_io(inode);
+			mutex_unlock(&inode->i_mutex);
+		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 5b24c40..1f5df21 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -189,8 +189,6 @@  static int ext4_do_flush_completed_IO(struct inode *inode,
 
 		list_add_tail(&io->list, &complete);
 	}
-	/* It is important to update all flags for all end_io in one shot w/o
-	 * dropping the lock.*/
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
 	while (!list_empty(&complete)) {
 		io = list_entry(complete.next, ext4_io_end_t, list);
@@ -228,9 +226,13 @@  static void ext4_end_io_work(struct work_struct *work)
 	ext4_do_flush_completed_IO(io->inode, io);
 }
 
-int ext4_flush_completed_IO(struct inode *inode)
+int ext4_flush_unwritten_io(struct inode *inode)
 {
-	return ext4_do_flush_completed_IO(inode, NULL);
+	int ret;
+	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));
+	ret = ext4_do_flush_completed_IO(inode, NULL);
+	ext4_unwritten_wait(inode);
+	return ret;
 }
 
 ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)