Patchwork [04/11] ext4: completed_io locking cleanup V4

login
register
mail settings
Submitter Dmitri Monakho
Date Sept. 28, 2012, 3:44 p.m.
Message ID <1348847051-6746-5-git-send-email-dmonakhov@openvz.org>
Download mbox | patch
Permalink /patch/187812/
State Accepted
Headers show

Comments

Dmitri Monakho - Sept. 28, 2012, 3:44 p.m.
Current unwritten extent conversion state-machine is very fuzzy.
- By unknown reason it want perform conversion under i_mutex. What for?
  My diagnosis:
  We already protect extent tree with i_data_sem, truncate and punch_hole
  should wait for DIO, so the only data we have to protect is end_io->flags
  modification, but only flush_completed_IO and end_io_work modified this
  flags and we can serialize them via i_completed_io_lock.

  Currently all this games with mutex_trylock result in following deadlock
   truncate:                          kworker:
    ext4_setattr                       ext4_end_io_work
    mutex_lock(i_mutex)
    inode_dio_wait(inode)  ->BLOCK
                             DEADLOCK<- mutex_trylock()
                                        inode_dio_done()
  #TEST_CASE1_BEGIN
  MNT=/mnt_scrach
  unlink $MNT/file
  fallocate -l $((1024*1024*1024)) $MNT/file
  aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
  sleep 2
  truncate -s 0 $MNT/file
  #TEST_CASE1_END

Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286

This patch makes state machine simple and clean:

(1) xxx_end_io schedule final extent conversion simply by calling
    ext4_add_complete_io(), which append it to ei->i_completed_io_list
    NOTE1: because of (2A) work should be queued only if
    ->i_completed_io_list was empty, otherwise it work is scheduled already.

(2) ext4_flush_completed_IO is responsible for handling all pending
    end_io from ei->i_completed_io_list
    Flushing sequange consist of following stages:
    A) LOCKED: Atomically drain completed_io_list to local_list
    B) Perform extents conversion
    C) LOCKED: move conveted io's to to_free list for final deletion
       	     This logic depends on context which we was called from.
    D) Final end_io context destruction
    NOTE1: i_mutex is no longer required because end_io->flags modification
    protected by ei->ext4_complete_io_lock

Full list of changes:
- Move all completion end_io related routines to page-io.c in order to improve
  logic locality
- Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
- remove EXT4_IO_END_FSYNC
- Improve SMP scalability by removing useless i_mutex which does not
  protect io->flags anymore.
- Reduce lock contention on i_completed_io_lock by optimizing list walk.
- Rename ext4_end_io_nolock to end4_end_io and make it static
- Check flush completion status to ext4_ext_punch_hole(). Because it is
  not good idea to punch blocks from corrupted inode.

Changes since V3 (in request to Jan's comments):
  Fall back to active flush_completed_IO() approach in order to prevent
  performance issues with nolocked DIO reads.
Changes since V2:
  Fix use-after-free caused by race truncate vs end_io_work

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4.h     |    3 +-
 fs/ext4/extents.c  |    4 +-
 fs/ext4/fsync.c    |   81 -------------------------
 fs/ext4/indirect.c |    6 +-
 fs/ext4/inode.c    |   25 +-------
 fs/ext4/page-io.c  |  171 ++++++++++++++++++++++++++++++++++------------------
 6 files changed, 121 insertions(+), 169 deletions(-)
Jan Kara - Oct. 1, 2012, 6:38 p.m.
On Fri 28-09-12 19:44:04, Dmitry Monakhov wrote:
  Couple of language fixes:
> Current unwritten extent conversion state-machine is very fuzzy.
> - By unknown reason it want perform conversion under i_mutex. What for?
    ^^ For               ^^^^^^^^ I'd just make it "performs"
>   My diagnosis:
>   We already protect extent tree with i_data_sem, truncate and punch_hole
>   should wait for DIO, so the only data we have to protect is end_io->flags
>   modification, but only flush_completed_IO and end_io_work modified this
                                                                       ^^ these
>   flags and we can serialize them via i_completed_io_lock.
> 
>   Currently all this games with mutex_trylock result in following deadlock
                  ^^^ these                              ^ the
>    truncate:                          kworker:
>     ext4_setattr                       ext4_end_io_work
>     mutex_lock(i_mutex)
>     inode_dio_wait(inode)  ->BLOCK
>                              DEADLOCK<- mutex_trylock()
>                                         inode_dio_done()
>   #TEST_CASE1_BEGIN
>   MNT=/mnt_scrach
>   unlink $MNT/file
>   fallocate -l $((1024*1024*1024)) $MNT/file
>   aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
>   sleep 2
>   truncate -s 0 $MNT/file
>   #TEST_CASE1_END
> 
> Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
> 
> This patch makes state machine simple and clean:
> 
> (1) xxx_end_io schedule final extent conversion simply by calling
>     ext4_add_complete_io(), which append it to ei->i_completed_io_list
>     NOTE1: because of (2A) work should be queued only if
>     ->i_completed_io_list was empty, otherwise it work is scheduled already.
                                                 ^^ remove this
> 
> (2) ext4_flush_completed_IO is responsible for handling all pending
>     end_io from ei->i_completed_io_list
>     Flushing sequange consist of following stages:
               ^^ sequence ^ consists
>     A) LOCKED: Atomically drain completed_io_list to local_list
>     B) Perform extents conversion
>     C) LOCKED: move conveted io's to to_free list for final deletion
                      ^^ converted
>        	     This logic depends on context which we was called from.
                                                            ^^ were
>     D) Final end_io context destruction
>     NOTE1: i_mutex is no longer required because end_io->flags modification
>     protected by ei->ext4_complete_io_lock
      ^^ is protected
> 
> Full list of changes:
> - Move all completion end_io related routines to page-io.c in order to improve
>   logic locality
> - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
> - remove EXT4_IO_END_FSYNC
> - Improve SMP scalability by removing useless i_mutex which does not
>   protect io->flags anymore.
> - Reduce lock contention on i_completed_io_lock by optimizing list walk.
> - Rename ext4_end_io_nolock to end4_end_io and make it static
> - Check flush completion status to ext4_ext_punch_hole(). Because it is
>   not good idea to punch blocks from corrupted inode.
> 
> Changes since V3 (in request to Jan's comments):
>   Fall back to active flush_completed_IO() approach in order to prevent
>   performance issues with nolocked DIO reads.
> Changes since V2:
>   Fix use-after-free caused by race truncate vs end_io_work
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
>  fs/ext4/ext4.h     |    3 +-
>  fs/ext4/extents.c  |    4 +-
>  fs/ext4/fsync.c    |   81 -------------------------
>  fs/ext4/indirect.c |    6 +-
>  fs/ext4/inode.c    |   25 +-------
>  fs/ext4/page-io.c  |  171 ++++++++++++++++++++++++++++++++++------------------
>  6 files changed, 121 insertions(+), 169 deletions(-)
> 
...
> +static int ext4_do_flush_completed_IO(struct inode *inode,
> +				      ext4_io_end_t *work_io)
> +{
> +	ext4_io_end_t *io;
> +	struct list_head unwritten, complete, to_free;
> +	unsigned long flags;
> +	struct ext4_inode_info *ei = EXT4_I(inode);
> +	int err, ret = 0;
> +
> +	INIT_LIST_HEAD(&complete);
> +	INIT_LIST_HEAD(&to_free);
> +
> +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> +	dump_completed_IO(inode);
> +	list_replace_init(&ei->i_completed_io_list, &unwritten);
> +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> +
> +	while (!list_empty(&unwritten)) {
> +		io = list_entry(unwritten.next, ext4_io_end_t, list);
> +		BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> +		list_del_init(&io->list);
> +
> +		err = ext4_end_io(io);
> +		if (unlikely(!ret && err))
> +			ret = err;
> +
> +		list_add_tail(&io->list, &complete);
> +	}
> +	/* It is important to update all flags for all end_io in one shot w/o
> +	 * dropping the lock.*/
  Why? It would seem that once io structures are moved from
i_completed_io_list, they are unreachable by anyone else?

> +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> +	while (!list_empty(&complete)) {
> +		io = list_entry(complete.next, ext4_io_end_t, list);
> +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> +		/* end_io context can not be destroyed now because it still
> +		 * used by queued worker. Worker thread will destroy it later */
> +		if (io->flag & EXT4_IO_END_QUEUED)
> +			list_del_init(&io->list);
> +		else
> +			list_move(&io->list, &to_free);
> +	}
> +	/* If we are called from worker context, it is time to clear queued
> +	 * flag, and destroy it's end_io if it was converted already */
> +	if (work_io) {
> +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> +			list_add_tail(&work_io->list, &to_free);
>  	}
> -	list_del_init(&io->list);
>  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> -	(void) ext4_end_io_nolock(io);
> -	mutex_unlock(&inode->i_mutex);
> -free:
> -	ext4_free_io_end(io);
> +
> +	while (!list_empty(&to_free)) {
> +		io = list_entry(to_free.next, ext4_io_end_t, list);
> +		list_del_init(&io->list);
> +		ext4_free_io_end(io);
> +	}
> +	return ret;
> +}
> +
> +/*
> + * work on completed aio dio IO, to convert unwritten extents to extents
> + */
> +static void ext4_end_io_work(struct work_struct *work)
> +{
> +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> +	ext4_do_flush_completed_IO(io->inode, io);
> +}
> +
> +int ext4_flush_completed_IO(struct inode *inode)
> +{
> +	return ext4_do_flush_completed_IO(inode, NULL);
>  }
  Also it seems that when ext4_flush_completed_IO() is called, workqueue
can have several IO structures queued in its local lists thus we miss them
here and don't properly wait for all conversions?

								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Oct. 2, 2012, 7:16 a.m.
On Mon, 1 Oct 2012 20:38:33 +0200, Jan Kara <jack@suse.cz> wrote:
> On Fri 28-09-12 19:44:04, Dmitry Monakhov wrote:
>   Couple of language fixes:
> > Current unwritten extent conversion state-machine is very fuzzy.
> > - By unknown reason it want perform conversion under i_mutex. What for?
>     ^^ For               ^^^^^^^^ I'd just make it "performs"
> >   My diagnosis:
> >   We already protect extent tree with i_data_sem, truncate and punch_hole
> >   should wait for DIO, so the only data we have to protect is end_io->flags
> >   modification, but only flush_completed_IO and end_io_work modified this
>                                                                        ^^ these
> >   flags and we can serialize them via i_completed_io_lock.
> > 
> >   Currently all this games with mutex_trylock result in following deadlock
>                   ^^^ these                              ^ the
> >    truncate:                          kworker:
> >     ext4_setattr                       ext4_end_io_work
> >     mutex_lock(i_mutex)
> >     inode_dio_wait(inode)  ->BLOCK
> >                              DEADLOCK<- mutex_trylock()
> >                                         inode_dio_done()
> >   #TEST_CASE1_BEGIN
> >   MNT=/mnt_scrach
> >   unlink $MNT/file
> >   fallocate -l $((1024*1024*1024)) $MNT/file
> >   aio-stress -I 100000 -O -s 100m -n -t 1 -c 10 -o 2 -o 3 $MNT/file
> >   sleep 2
> >   truncate -s 0 $MNT/file
> >   #TEST_CASE1_END
> > 
> > Or use 286's xfstests https://github.com/dmonakhov/xfstests/blob/devel/286
> > 
> > This patch makes state machine simple and clean:
> > 
> > (1) xxx_end_io schedule final extent conversion simply by calling
> >     ext4_add_complete_io(), which append it to ei->i_completed_io_list
> >     NOTE1: because of (2A) work should be queued only if
> >     ->i_completed_io_list was empty, otherwise it work is scheduled already.
>                                                  ^^ remove this
> > 
> > (2) ext4_flush_completed_IO is responsible for handling all pending
> >     end_io from ei->i_completed_io_list
> >     Flushing sequange consist of following stages:
>                ^^ sequence ^ consists
> >     A) LOCKED: Atomically drain completed_io_list to local_list
> >     B) Perform extents conversion
> >     C) LOCKED: move conveted io's to to_free list for final deletion
>                       ^^ converted
> >        	     This logic depends on context which we was called from.
>                                                             ^^ were
> >     D) Final end_io context destruction
> >     NOTE1: i_mutex is no longer required because end_io->flags modification
> >     protected by ei->ext4_complete_io_lock
>       ^^ is protected
> > 
> > Full list of changes:
> > - Move all completion end_io related routines to page-io.c in order to improve
> >   logic locality
> > - Move open coded logic from various xx_end_xx routines to ext4_add_complete_io()
> > - remove EXT4_IO_END_FSYNC
> > - Improve SMP scalability by removing useless i_mutex which does not
> >   protect io->flags anymore.
> > - Reduce lock contention on i_completed_io_lock by optimizing list walk.
> > - Rename ext4_end_io_nolock to end4_end_io and make it static
> > - Check flush completion status to ext4_ext_punch_hole(). Because it is
> >   not good idea to punch blocks from corrupted inode.
> > 
> > Changes since V3 (in request to Jan's comments):
> >   Fall back to active flush_completed_IO() approach in order to prevent
> >   performance issues with nolocked DIO reads.
> > Changes since V2:
> >   Fix use-after-free caused by race truncate vs end_io_work
> > 
> > Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> > ---
> >  fs/ext4/ext4.h     |    3 +-
> >  fs/ext4/extents.c  |    4 +-
> >  fs/ext4/fsync.c    |   81 -------------------------
> >  fs/ext4/indirect.c |    6 +-
> >  fs/ext4/inode.c    |   25 +-------
> >  fs/ext4/page-io.c  |  171 ++++++++++++++++++++++++++++++++++------------------
> >  6 files changed, 121 insertions(+), 169 deletions(-)
> > 
> ...
> > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > +				      ext4_io_end_t *work_io)
> > +{
> > +	ext4_io_end_t *io;
> > +	struct list_head unwritten, complete, to_free;
> > +	unsigned long flags;
> > +	struct ext4_inode_info *ei = EXT4_I(inode);
> > +	int err, ret = 0;
> > +
> > +	INIT_LIST_HEAD(&complete);
> > +	INIT_LIST_HEAD(&to_free);
> > +
> > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > +	dump_completed_IO(inode);
> > +	list_replace_init(&ei->i_completed_io_list, &unwritten);
> > +	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > +
> > +	while (!list_empty(&unwritten)) {
> > +		io = list_entry(unwritten.next, ext4_io_end_t, list);
> > +		BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> > +		list_del_init(&io->list);
> > +
> > +		err = ext4_end_io(io);
> > +		if (unlikely(!ret && err))
> > +			ret = err;
> > +
> > +		list_add_tail(&io->list, &complete);
> > +	}
> > +	/* It is important to update all flags for all end_io in one shot w/o
> > +	 * dropping the lock.*/
>   Why? It would seem that once io structures are moved from
> i_completed_io_list, they are unreachable by anyone else?
> 
> > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > +	while (!list_empty(&complete)) {
> > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > +		/* end_io context can not be destroyed now because it still
> > +		 * used by queued worker. Worker thread will destroy it later */
> > +		if (io->flag & EXT4_IO_END_QUEUED)
> > +			list_del_init(&io->list);
> > +		else
> > +			list_move(&io->list, &to_free);
> > +	}
> > +	/* If we are called from worker context, it is time to clear queued
> > +	 * flag, and destroy it's end_io if it was converted already */
> > +	if (work_io) {
> > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > +			list_add_tail(&work_io->list, &to_free);
> >  	}
> > -	list_del_init(&io->list);
> >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > -	(void) ext4_end_io_nolock(io);
> > -	mutex_unlock(&inode->i_mutex);
> > -free:
> > -	ext4_free_io_end(io);
> > +
> > +	while (!list_empty(&to_free)) {
> > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > +		list_del_init(&io->list);
> > +		ext4_free_io_end(io);
> > +	}
> > +	return ret;
> > +}
> > +
> > +/*
> > + * work on completed aio dio IO, to convert unwritten extents to extents
> > + */
> > +static void ext4_end_io_work(struct work_struct *work)
> > +{
> > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > +	ext4_do_flush_completed_IO(io->inode, io);
> > +}
> > +
> > +int ext4_flush_completed_IO(struct inode *inode)
> > +{
> > +	return ext4_do_flush_completed_IO(inode, NULL);
> >  }
>   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> can have several IO structures queued in its local lists thus we miss them
> here and don't properly wait for all conversions?
No it is not. Because list drained atomically, and 
add_complete_io will queue work only if list is empty.

Race between conversion and dequeue-process is not possible because
we hold lock during entire walk of complete_list, so from external
point of view we mark list as conversed(clear unwritten flag)
happens atomically. I've drawn all possible situations and race not
happen. If you know any please let me know.

> 
> 								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Oct. 2, 2012, 10:31 a.m.
On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
...
> > > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > > +	while (!list_empty(&unwritten)) {
> > > +		io = list_entry(unwritten.next, ext4_io_end_t, list);
> > > +		BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> > > +		list_del_init(&io->list);
> > > +
> > > +		err = ext4_end_io(io);
> > > +		if (unlikely(!ret && err))
> > > +			ret = err;
> > > +
> > > +		list_add_tail(&io->list, &complete);
> > > +	}
> > > +	/* It is important to update all flags for all end_io in one shot w/o
> > > +	 * dropping the lock.*/
> >   Why? It would seem that once io structures are moved from
> > i_completed_io_list, they are unreachable by anyone else?
  You seem to have missed this comment?


> > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > +	while (!list_empty(&complete)) {
> > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > +		/* end_io context can not be destroyed now because it still
> > > +		 * used by queued worker. Worker thread will destroy it later */
> > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > +			list_del_init(&io->list);
> > > +		else
> > > +			list_move(&io->list, &to_free);
> > > +	}
> > > +	/* If we are called from worker context, it is time to clear queued
> > > +	 * flag, and destroy it's end_io if it was converted already */
> > > +	if (work_io) {
> > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > +			list_add_tail(&work_io->list, &to_free);
> > >  	}
> > > -	list_del_init(&io->list);
> > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > -	(void) ext4_end_io_nolock(io);
> > > -	mutex_unlock(&inode->i_mutex);
> > > -free:
> > > -	ext4_free_io_end(io);
> > > +
> > > +	while (!list_empty(&to_free)) {
> > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > +		list_del_init(&io->list);
> > > +		ext4_free_io_end(io);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +/*
> > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > + */
> > > +static void ext4_end_io_work(struct work_struct *work)
> > > +{
> > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > +}
> > > +
> > > +int ext4_flush_completed_IO(struct inode *inode)
> > > +{
> > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > >  }
> >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > can have several IO structures queued in its local lists thus we miss them
> > here and don't properly wait for all conversions?
> No it is not. Because list drained atomically, and 
> add_complete_io will queue work only if list is empty.
> 
> Race between conversion and dequeue-process is not possible because
> we hold lock during entire walk of complete_list, so from external
> point of view we mark list as conversed(clear unwritten flag)
> happens atomically. I've drawn all possible situations and race not
> happen. If you know any please let me know.
  I guess I'm missing something obvious. So let's go step by step:
1) ext4_flush_completed_IO() must make sure there is no outstanding
   conversion for the inode.
2) Now assume we have non-empty i_completed_io_list - thus work is queued.
3) The following situation seems to be possible:

CPU1					CPU2
(worker thread)				(truncate)
ext4_end_io_work()
  ext4_do_flush_completed_IO()
    spin_lock_irqsave(&ei->i_completed_io_lock, flags);
    dump_completed_IO(inode);
    list_replace_init(&ei->i_completed_io_list, &unwritten);
    spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);

					ext4_flush_completed_IO()
					  ext4_do_flush_completed_IO()
					    - sees empty i_completed_io_list
					     => exits

  But we still have some conversions pending in 'unwritten' list. What am
I missing?

								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Oct. 2, 2012, 10:57 a.m.
On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> ...
> > > > +static int ext4_do_flush_completed_IO(struct inode *inode,
> > > > +	while (!list_empty(&unwritten)) {
> > > > +		io = list_entry(unwritten.next, ext4_io_end_t, list);
> > > > +		BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
> > > > +		list_del_init(&io->list);
> > > > +
> > > > +		err = ext4_end_io(io);
> > > > +		if (unlikely(!ret && err))
> > > > +			ret = err;
> > > > +
> > > > +		list_add_tail(&io->list, &complete);
> > > > +	}
> > > > +	/* It is important to update all flags for all end_io in one shot w/o
> > > > +	 * dropping the lock.*/
> > >   Why? It would seem that once io structures are moved from
> > > i_completed_io_list, they are unreachable by anyone else?
>   You seem to have missed this comment?
Yep. i've missed that comment, and yes it is appeared to not important
to update whole list atomically, it may be dropped on each end_io.
> 
> 
> > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > +	while (!list_empty(&complete)) {
> > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > +		/* end_io context can not be destroyed now because it still
> > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > +			list_del_init(&io->list);
> > > > +		else
> > > > +			list_move(&io->list, &to_free);
> > > > +	}
> > > > +	/* If we are called from worker context, it is time to clear queued
> > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > +	if (work_io) {
> > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > +			list_add_tail(&work_io->list, &to_free);
> > > >  	}
> > > > -	list_del_init(&io->list);
> > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > -	(void) ext4_end_io_nolock(io);
> > > > -	mutex_unlock(&inode->i_mutex);
> > > > -free:
> > > > -	ext4_free_io_end(io);
> > > > +
> > > > +	while (!list_empty(&to_free)) {
> > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > +		list_del_init(&io->list);
> > > > +		ext4_free_io_end(io);
> > > > +	}
> > > > +	return ret;
> > > > +}
> > > > +
> > > > +/*
> > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > + */
> > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > +{
> > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > +}
> > > > +
> > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > +{
> > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > >  }
> > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > can have several IO structures queued in its local lists thus we miss them
> > > here and don't properly wait for all conversions?
> > No it is not. Because list drained atomically, and 
> > add_complete_io will queue work only if list is empty.
> > 
> > Race between conversion and dequeue-process is not possible because
> > we hold lock during entire walk of complete_list, so from external
> > point of view we mark list as conversed(clear unwritten flag)
> > happens atomically. I've drawn all possible situations and race not
> > happen. If you know any please let me know.
>   I guess I'm missing something obvious. So let's go step by step:
> 1) ext4_flush_completed_IO() must make sure there is no outstanding
>    conversion for the inode.
> 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> 3) The following situation seems to be possible:
> 
> CPU1					CPU2
> (worker thread)				(truncate)
> ext4_end_io_work()
>   ext4_do_flush_completed_IO()
>     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
>     dump_completed_IO(inode);
>     list_replace_init(&ei->i_completed_io_list, &unwritten);
>     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> 
> 					ext4_flush_completed_IO()
> 					  ext4_do_flush_completed_IO()
> 					    - sees empty i_completed_io_list
> 					     => exits
> 
>   But we still have some conversions pending in 'unwritten' list. What am
> I missing?
Indeed, I've simply missed that case. The case which result silently
broke integrity sync ;(
Thank you for spotting this. I'll be back with updated version.
> 
> 								Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Oct. 2, 2012, 11:11 a.m.
On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > +	while (!list_empty(&complete)) {
> > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > +		/* end_io context can not be destroyed now because it still
> > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > +			list_del_init(&io->list);
> > > > > +		else
> > > > > +			list_move(&io->list, &to_free);
> > > > > +	}
> > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > +	if (work_io) {
> > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > >  	}
> > > > > -	list_del_init(&io->list);
> > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > -	(void) ext4_end_io_nolock(io);
> > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > -free:
> > > > > -	ext4_free_io_end(io);
> > > > > +
> > > > > +	while (!list_empty(&to_free)) {
> > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > +		list_del_init(&io->list);
> > > > > +		ext4_free_io_end(io);
> > > > > +	}
> > > > > +	return ret;
> > > > > +}
> > > > > +
> > > > > +/*
> > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > + */
> > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > +{
> > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > +}
> > > > > +
> > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > +{
> > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > >  }
> > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > can have several IO structures queued in its local lists thus we miss them
> > > > here and don't properly wait for all conversions?
> > > No it is not. Because list drained atomically, and 
> > > add_complete_io will queue work only if list is empty.
> > > 
> > > Race between conversion and dequeue-process is not possible because
> > > we hold lock during entire walk of complete_list, so from external
> > > point of view we mark list as conversed(clear unwritten flag)
> > > happens atomically. I've drawn all possible situations and race not
> > > happen. If you know any please let me know.
> >   I guess I'm missing something obvious. So let's go step by step:
> > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> >    conversion for the inode.
> > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > 3) The following situation seems to be possible:
> > 
> > CPU1					CPU2
> > (worker thread)				(truncate)
> > ext4_end_io_work()
> >   ext4_do_flush_completed_IO()
> >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> >     dump_completed_IO(inode);
> >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > 
> > 					ext4_flush_completed_IO()
> > 					  ext4_do_flush_completed_IO()
> > 					    - sees empty i_completed_io_list
> > 					     => exits
> > 
> >   But we still have some conversions pending in 'unwritten' list. What am
> > I missing?
> Indeed, I've simply missed that case. The case which result silently
> broke integrity sync ;(
> Thank you for spotting this. I'll be back with updated version.
  Umm, actually, I was thinking about it and ext4_flush_completed_IO()
seems to be unnecessary in fsync these days. We don't call aio_complete()
until we perform the conversion so what fsync does to such IO is undefined.
Such optimization is a separate matter though.

But for truncate or punch hole, it is critical that all conversions are
really flushed.

								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Oct. 2, 2012, 12:42 p.m.
On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@suse.cz> wrote:
> On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > +	while (!list_empty(&complete)) {
> > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > +			list_del_init(&io->list);
> > > > > > +		else
> > > > > > +			list_move(&io->list, &to_free);
> > > > > > +	}
> > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > +	if (work_io) {
> > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > >  	}
> > > > > > -	list_del_init(&io->list);
> > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > -free:
> > > > > > -	ext4_free_io_end(io);
> > > > > > +
> > > > > > +	while (!list_empty(&to_free)) {
> > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > +		list_del_init(&io->list);
> > > > > > +		ext4_free_io_end(io);
> > > > > > +	}
> > > > > > +	return ret;
> > > > > > +}
> > > > > > +
> > > > > > +/*
> > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > + */
> > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > +{
> > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > +}
> > > > > > +
> > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > +{
> > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > >  }
> > > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > here and don't properly wait for all conversions?
> > > > No it is not. Because list drained atomically, and 
> > > > add_complete_io will queue work only if list is empty.
> > > > 
> > > > Race between conversion and dequeue-process is not possible because
> > > > we hold lock during entire walk of complete_list, so from external
> > > > point of view we mark list as conversed(clear unwritten flag)
> > > > happens atomically. I've drawn all possible situations and race not
> > > > happen. If you know any please let me know.
> > >   I guess I'm missing something obvious. So let's go step by step:
> > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > >    conversion for the inode.
> > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > 3) The following situation seems to be possible:
> > > 
> > > CPU1					CPU2
> > > (worker thread)				(truncate)
> > > ext4_end_io_work()
> > >   ext4_do_flush_completed_IO()
> > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > >     dump_completed_IO(inode);
> > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > 
> > > 					ext4_flush_completed_IO()
> > > 					  ext4_do_flush_completed_IO()
> > > 					    - sees empty i_completed_io_list
> > > 					     => exits
> > > 
> > >   But we still have some conversions pending in 'unwritten' list. What am
> > > I missing?
> > Indeed, I've simply missed that case. The case which result silently
> > broke integrity sync ;(
> > Thank you for spotting this. I'll be back with updated version.
>   Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> seems to be unnecessary in fsync these days. We don't call aio_complete()
> until we perform the conversion so what fsync does to such IO is undefined.
> Such optimization is a separate matter though.
Yes aio is ok, but integrity fsync after buffered write to unwritten
extent is broken.

fsync()                             blkdev_completion          kwork
->filemap_write_and_wait_range
                                    ->ext4_end_bio
                                     ->end_page_writeback
 <-- filemap_write_and_wait_range return  
                                      ->ext4_add_complete_io

                                                              ->ext4_do_flush_completed_IO
                                                                 ->list_replace_init
->ext4_flush_completed_IO
  sees empty i_comleted_io_list but pended
  conversion still exist
                                                                ->ext4_end_io


> 
> But for truncate or punch hole, it is critical that all conversions are
> really flushed.
> 
> 								Honza
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Oct. 2, 2012, 1:30 p.m.
On Tue 02-10-12 16:42:39, Dmitry Monakhov wrote:
> On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > > +	while (!list_empty(&complete)) {
> > > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > > +			list_del_init(&io->list);
> > > > > > > +		else
> > > > > > > +			list_move(&io->list, &to_free);
> > > > > > > +	}
> > > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > > +	if (work_io) {
> > > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > > >  	}
> > > > > > > -	list_del_init(&io->list);
> > > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > > -free:
> > > > > > > -	ext4_free_io_end(io);
> > > > > > > +
> > > > > > > +	while (!list_empty(&to_free)) {
> > > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > > +		list_del_init(&io->list);
> > > > > > > +		ext4_free_io_end(io);
> > > > > > > +	}
> > > > > > > +	return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > > + */
> > > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > > +{
> > > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > > +}
> > > > > > > +
> > > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > > +{
> > > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > > >  }
> > > > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > > here and don't properly wait for all conversions?
> > > > > No it is not. Because list drained atomically, and 
> > > > > add_complete_io will queue work only if list is empty.
> > > > > 
> > > > > Race between conversion and dequeue-process is not possible because
> > > > > we hold lock during entire walk of complete_list, so from external
> > > > > point of view we mark list as conversed(clear unwritten flag)
> > > > > happens atomically. I've drawn all possible situations and race not
> > > > > happen. If you know any please let me know.
> > > >   I guess I'm missing something obvious. So let's go step by step:
> > > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > > >    conversion for the inode.
> > > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > > 3) The following situation seems to be possible:
> > > > 
> > > > CPU1					CPU2
> > > > (worker thread)				(truncate)
> > > > ext4_end_io_work()
> > > >   ext4_do_flush_completed_IO()
> > > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > >     dump_completed_IO(inode);
> > > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > 
> > > > 					ext4_flush_completed_IO()
> > > > 					  ext4_do_flush_completed_IO()
> > > > 					    - sees empty i_completed_io_list
> > > > 					     => exits
> > > > 
> > > >   But we still have some conversions pending in 'unwritten' list. What am
> > > > I missing?
> > > Indeed, I've simply missed that case. The case which result silently
> > > broke integrity sync ;(
> > > Thank you for spotting this. I'll be back with updated version.
> >   Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> > seems to be unnecessary in fsync these days. We don't call aio_complete()
> > until we perform the conversion so what fsync does to such IO is undefined.
> > Such optimization is a separate matter though.
> Yes aio is ok, but integrity fsync after buffered write to unwritten
> extent is broken.
> 
> fsync()                             blkdev_completion          kwork
> ->filemap_write_and_wait_range
>                                     ->ext4_end_bio
>                                      ->end_page_writeback
>  <-- filemap_write_and_wait_range return  
>                                       ->ext4_add_complete_io
> 
>                                                               ->ext4_do_flush_completed_IO
>                                                                  ->list_replace_init
> ->ext4_flush_completed_IO
>   sees empty i_comleted_io_list but pended
>   conversion still exist
>                                                                 ->ext4_end_io
> 
  Correct. Thanks for pointing that out.

								Honza
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dmitri Monakho - Oct. 3, 2012, 11:21 a.m.
On Tue, 2 Oct 2012 15:30:19 +0200, Jan Kara <jack@suse.cz> wrote:
> On Tue 02-10-12 16:42:39, Dmitry Monakhov wrote:
> > On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@suse.cz> wrote:
> > > On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > > > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > > > +	while (!list_empty(&complete)) {
> > > > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > > > +			list_del_init(&io->list);
> > > > > > > > +		else
> > > > > > > > +			list_move(&io->list, &to_free);
> > > > > > > > +	}
> > > > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > > > +	if (work_io) {
> > > > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > > > >  	}
> > > > > > > > -	list_del_init(&io->list);
> > > > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > > > -free:
> > > > > > > > -	ext4_free_io_end(io);
> > > > > > > > +
> > > > > > > > +	while (!list_empty(&to_free)) {
> > > > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > > > +		list_del_init(&io->list);
> > > > > > > > +		ext4_free_io_end(io);
> > > > > > > > +	}
> > > > > > > > +	return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > > > + */
> > > > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > > > +{
> > > > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > > > +{
> > > > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > > > >  }
> > > > > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > > > here and don't properly wait for all conversions?
> > > > > > No it is not. Because list drained atomically, and 
> > > > > > add_complete_io will queue work only if list is empty.
> > > > > > 
> > > > > > Race between conversion and dequeue-process is not possible because
> > > > > > we hold lock during entire walk of complete_list, so from external
> > > > > > point of view we mark list as conversed(clear unwritten flag)
> > > > > > happens atomically. I've drawn all possible situations and race not
> > > > > > happen. If you know any please let me know.
> > > > >   I guess I'm missing something obvious. So let's go step by step:
> > > > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > > > >    conversion for the inode.
> > > > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > > > 3) The following situation seems to be possible:
> > > > > 
> > > > > CPU1					CPU2
> > > > > (worker thread)				(truncate)
> > > > > ext4_end_io_work()
> > > > >   ext4_do_flush_completed_IO()
> > > > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > >     dump_completed_IO(inode);
> > > > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > 
> > > > > 					ext4_flush_completed_IO()
> > > > > 					  ext4_do_flush_completed_IO()
> > > > > 					    - sees empty i_completed_io_list
> > > > > 					     => exits
> > > > > 
> > > > >   But we still have some conversions pending in 'unwritten' list. What am
> > > > > I missing?
> > > > Indeed, I've simply missed that case. The case which result silently
> > > > broke integrity sync ;(
> > > > Thank you for spotting this. I'll be back with updated version.
> > >   Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> > > seems to be unnecessary in fsync these days. We don't call aio_complete()
> > > until we perform the conversion so what fsync does to such IO is undefined.
> > > Such optimization is a separate matter though.
> > Yes aio is ok, but integrity fsync after buffered write to unwritten
> > extent is broken.
> > 
> > fsync()                             blkdev_completion          kwork
> > ->filemap_write_and_wait_range
> >                                     ->ext4_end_bio
> >                                      ->end_page_writeback
> >  <-- filemap_write_and_wait_range return  
> >                                       ->ext4_add_complete_io
> > 
> >                                                               ->ext4_do_flush_completed_IO
> >                                                                  ->list_replace_init
> > ->ext4_flush_completed_IO
> >   sees empty i_comleted_io_list but pended
> >   conversion still exist
> >                                                                 ->ext4_end_io
> > 
>   Correct. Thanks for pointing that out.
In my deference I should say that integrity fsync was broken before my patch in
case of buffered writes because end_page_writeback called before
end_io added to ei->i_comlete_io_list
 fsync()                                    blkdev_completion
 ->filemap_write_and_wait_range
                                            ->ext4_end_bio
                                             ->end_page_writeback
  <-- filemap_write_and_wait_range return  
 ->ext4_flush_completed_IO
   sees empty i_comleted_io_list but pended
   conversion still exist
                                               ->ext4_add_complete_io

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara - Oct. 4, 2012, 10:22 a.m.
On Wed 03-10-12 15:21:25, Dmitry Monakhov wrote:
> On Tue, 2 Oct 2012 15:30:19 +0200, Jan Kara <jack@suse.cz> wrote:
> > On Tue 02-10-12 16:42:39, Dmitry Monakhov wrote:
> > > On Tue, 2 Oct 2012 13:11:06 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > On Tue 02-10-12 14:57:22, Dmitry Monakhov wrote:
> > > > > On Tue, 2 Oct 2012 12:31:41 +0200, Jan Kara <jack@suse.cz> wrote:
> > > > > > On Tue 02-10-12 11:16:38, Dmitry Monakhov wrote:
> > > > > > > > > +	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > > > > > +	while (!list_empty(&complete)) {
> > > > > > > > > +		io = list_entry(complete.next, ext4_io_end_t, list);
> > > > > > > > > +		io->flag &= ~EXT4_IO_END_UNWRITTEN;
> > > > > > > > > +		/* end_io context can not be destroyed now because it still
> > > > > > > > > +		 * used by queued worker. Worker thread will destroy it later */
> > > > > > > > > +		if (io->flag & EXT4_IO_END_QUEUED)
> > > > > > > > > +			list_del_init(&io->list);
> > > > > > > > > +		else
> > > > > > > > > +			list_move(&io->list, &to_free);
> > > > > > > > > +	}
> > > > > > > > > +	/* If we are called from worker context, it is time to clear queued
> > > > > > > > > +	 * flag, and destroy it's end_io if it was converted already */
> > > > > > > > > +	if (work_io) {
> > > > > > > > > +		work_io->flag &= ~EXT4_IO_END_QUEUED;
> > > > > > > > > +		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
> > > > > > > > > +			list_add_tail(&work_io->list, &to_free);
> > > > > > > > >  	}
> > > > > > > > > -	list_del_init(&io->list);
> > > > > > > > >  	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > > > > -	(void) ext4_end_io_nolock(io);
> > > > > > > > > -	mutex_unlock(&inode->i_mutex);
> > > > > > > > > -free:
> > > > > > > > > -	ext4_free_io_end(io);
> > > > > > > > > +
> > > > > > > > > +	while (!list_empty(&to_free)) {
> > > > > > > > > +		io = list_entry(to_free.next, ext4_io_end_t, list);
> > > > > > > > > +		list_del_init(&io->list);
> > > > > > > > > +		ext4_free_io_end(io);
> > > > > > > > > +	}
> > > > > > > > > +	return ret;
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +/*
> > > > > > > > > + * work on completed aio dio IO, to convert unwritten extents to extents
> > > > > > > > > + */
> > > > > > > > > +static void ext4_end_io_work(struct work_struct *work)
> > > > > > > > > +{
> > > > > > > > > +	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
> > > > > > > > > +	ext4_do_flush_completed_IO(io->inode, io);
> > > > > > > > > +}
> > > > > > > > > +
> > > > > > > > > +int ext4_flush_completed_IO(struct inode *inode)
> > > > > > > > > +{
> > > > > > > > > +	return ext4_do_flush_completed_IO(inode, NULL);
> > > > > > > > >  }
> > > > > > > >   Also it seems that when ext4_flush_completed_IO() is called, workqueue
> > > > > > > > can have several IO structures queued in its local lists thus we miss them
> > > > > > > > here and don't properly wait for all conversions?
> > > > > > > No it is not. Because list drained atomically, and 
> > > > > > > add_complete_io will queue work only if list is empty.
> > > > > > > 
> > > > > > > Race between conversion and dequeue-process is not possible because
> > > > > > > we hold lock during entire walk of complete_list, so from external
> > > > > > > point of view we mark list as conversed(clear unwritten flag)
> > > > > > > happens atomically. I've drawn all possible situations and race not
> > > > > > > happen. If you know any please let me know.
> > > > > >   I guess I'm missing something obvious. So let's go step by step:
> > > > > > 1) ext4_flush_completed_IO() must make sure there is no outstanding
> > > > > >    conversion for the inode.
> > > > > > 2) Now assume we have non-empty i_completed_io_list - thus work is queued.
> > > > > > 3) The following situation seems to be possible:
> > > > > > 
> > > > > > CPU1					CPU2
> > > > > > (worker thread)				(truncate)
> > > > > > ext4_end_io_work()
> > > > > >   ext4_do_flush_completed_IO()
> > > > > >     spin_lock_irqsave(&ei->i_completed_io_lock, flags);
> > > > > >     dump_completed_IO(inode);
> > > > > >     list_replace_init(&ei->i_completed_io_list, &unwritten);
> > > > > >     spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
> > > > > > 
> > > > > > 					ext4_flush_completed_IO()
> > > > > > 					  ext4_do_flush_completed_IO()
> > > > > > 					    - sees empty i_completed_io_list
> > > > > > 					     => exits
> > > > > > 
> > > > > >   But we still have some conversions pending in 'unwritten' list. What am
> > > > > > I missing?
> > > > > Indeed, I've simply missed that case. The case which result silently
> > > > > broke integrity sync ;(
> > > > > Thank you for spotting this. I'll be back with updated version.
> > > >   Umm, actually, I was thinking about it and ext4_flush_completed_IO()
> > > > seems to be unnecessary in fsync these days. We don't call aio_complete()
> > > > until we perform the conversion so what fsync does to such IO is undefined.
> > > > Such optimization is a separate matter though.
> > > Yes aio is ok, but integrity fsync after buffered write to unwritten
> > > extent is broken.
> > > 
> > > fsync()                             blkdev_completion          kwork
> > > ->filemap_write_and_wait_range
> > >                                     ->ext4_end_bio
> > >                                      ->end_page_writeback
> > >  <-- filemap_write_and_wait_range return  
> > >                                       ->ext4_add_complete_io
> > > 
> > >                                                               ->ext4_do_flush_completed_IO
> > >                                                                  ->list_replace_init
> > > ->ext4_flush_completed_IO
> > >   sees empty i_comleted_io_list but pended
> > >   conversion still exist
> > >                                                                 ->ext4_end_io
> > > 
> >   Correct. Thanks for pointing that out.
> In my deference I should say that integrity fsync was broken before my patch in
> case of buffered writes because end_page_writeback called before
> end_io added to ei->i_comlete_io_list
>  fsync()                                    blkdev_completion
>  ->filemap_write_and_wait_range
>                                             ->ext4_end_bio
>                                              ->end_page_writeback
>   <-- filemap_write_and_wait_range return  
>  ->ext4_flush_completed_IO
>    sees empty i_comleted_io_list but pended
>    conversion still exist
>                                                ->ext4_add_complete_io
  Right. Actually calling end_page_writeback() before we are sure page can
be correctly reloaded from disk (i.e. before all extent manipulations are
done) is asking for trouble - see e.g. mail
http://lists.openwall.net/linux-ext4/2011/06/08/12 and further discussion.
The discussion was somewhat open-ended but at that time calling
end_page_writeback() after extent conversion was problematic because of
i_mutex. Now we don't need i_mutex for extent conversion, so it is save to
call end_page_writeback() after we convert the extents. So moving
end_page_writeback() there would be good and it would simplify come logic
as well I believe - in particular fsync() would be simpler.

								Honza

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0d393dc..9c3d004 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -186,7 +186,6 @@  struct mpage_da_data {
 #define EXT4_IO_END_ERROR	0x0002
 #define EXT4_IO_END_QUEUED	0x0004
 #define EXT4_IO_END_DIRECT	0x0008
-#define EXT4_IO_END_IN_FSYNC	0x0010
 
 struct ext4_io_page {
 	struct page	*p_page;
@@ -2424,11 +2423,11 @@  extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
 
 /* page-io.c */
 extern int __init ext4_init_pageio(void);
+extern void ext4_add_complete_io(ext4_io_end_t *io_end);
 extern void ext4_exit_pageio(void);
 extern void ext4_ioend_wait(struct inode *);
 extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
-extern int ext4_end_io_nolock(ext4_io_end_t *io);
 extern void ext4_io_submit(struct ext4_io_submit *io);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 69e2d13..70ba122 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4607,7 +4607,9 @@  int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
 	}
 
 	/* finish any pending end_io work */
-	ext4_flush_completed_IO(inode);
+	err = ext4_flush_completed_IO(inode);
+	if (err)
+		return err;
 
 	credits = ext4_writepage_trans_blocks(inode);
 	handle = ext4_journal_start(inode, credits);
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 323eb15..4600008 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -34,87 +34,6 @@ 
 
 #include <trace/events/ext4.h>
 
-static void dump_completed_IO(struct inode * inode)
-{
-#ifdef	EXT4FS_DEBUG
-	struct list_head *cur, *before, *after;
-	ext4_io_end_t *io, *io0, *io1;
-	unsigned long flags;
-
-	if (list_empty(&EXT4_I(inode)->i_completed_io_list)){
-		ext4_debug("inode %lu completed_io list is empty\n", inode->i_ino);
-		return;
-	}
-
-	ext4_debug("Dump inode %lu completed_io list \n", inode->i_ino);
-	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
-	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list){
-		cur = &io->list;
-		before = cur->prev;
-		io0 = container_of(before, ext4_io_end_t, list);
-		after = cur->next;
-		io1 = container_of(after, ext4_io_end_t, list);
-
-		ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
-			    io, inode->i_ino, io0, io1);
-	}
-	spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-#endif
-}
-
-/*
- * This function is called from ext4_sync_file().
- *
- * When IO is completed, the work to convert unwritten extents to
- * written is queued on workqueue but may not get immediately
- * scheduled. When fsync is called, we need to ensure the
- * conversion is complete before fsync returns.
- * The inode keeps track of a list of pending/completed IO that
- * might needs to do the conversion. This function walks through
- * the list and convert the related unwritten extents for completed IO
- * to written.
- * The function return the number of pending IOs on success.
- */
-int ext4_flush_completed_IO(struct inode *inode)
-{
-	ext4_io_end_t *io;
-	struct ext4_inode_info *ei = EXT4_I(inode);
-	unsigned long flags;
-	int ret = 0;
-	int ret2 = 0;
-
-	dump_completed_IO(inode);
-	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	while (!list_empty(&ei->i_completed_io_list)){
-		io = list_entry(ei->i_completed_io_list.next,
-				ext4_io_end_t, list);
-		list_del_init(&io->list);
-		io->flag |= EXT4_IO_END_IN_FSYNC;
-		/*
-		 * Calling ext4_end_io_nolock() to convert completed
-		 * IO to written.
-		 *
-		 * When ext4_sync_file() is called, run_queue() may already
-		 * about to flush the work corresponding to this io structure.
-		 * It will be upset if it founds the io structure related
-		 * to the work-to-be schedule is freed.
-		 *
-		 * Thus we need to keep the io structure still valid here after
-		 * conversion finished. The io structure has a flag to
-		 * avoid double converting from both fsync and background work
-		 * queue work.
-		 */
-		spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-		ret = ext4_end_io_nolock(io);
-		if (ret < 0)
-			ret2 = ret;
-		spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-		io->flag &= ~EXT4_IO_END_IN_FSYNC;
-	}
-	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-	return (ret2 < 0) ? ret2 : 0;
-}
-
 /*
  * If we're not journaling and this is a just-created file, we have to
  * sync our parent directory (if it was freshly created) since
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 9917c42..ebc7d83 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -808,11 +808,9 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(!list_empty(&ei->i_completed_io_list))) {
-			mutex_lock(&inode->i_mutex);
+		if (unlikely(!list_empty(&ei->i_completed_io_list)))
 			ext4_flush_completed_IO(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
+
 		ret = __blockdev_direct_IO(rw, iocb, inode,
 				 inode->i_sb->s_bdev, iov,
 				 offset, nr_segs,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3a28cf7..1ea34e5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2853,9 +2853,6 @@  static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
 {
 	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
         ext4_io_end_t *io_end = iocb->private;
-	struct workqueue_struct *wq;
-	unsigned long flags;
-	struct ext4_inode_info *ei;
 
 	/* if not async direct IO or dio with 0 bytes write, just return */
 	if (!io_end || !size)
@@ -2884,24 +2881,14 @@  out:
 		io_end->iocb = iocb;
 		io_end->result = ret;
 	}
-	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
-
-	/* Add the io_end to per-inode completed aio dio list*/
-	ei = EXT4_I(io_end->inode);
-	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	list_add_tail(&io_end->list, &ei->i_completed_io_list);
-	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
 
-	/* queue the work to convert unwritten extents to written */
-	queue_work(wq, &io_end->work);
+	ext4_add_complete_io(io_end);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 {
 	ext4_io_end_t *io_end = bh->b_private;
-	struct workqueue_struct *wq;
 	struct inode *inode;
-	unsigned long flags;
 
 	if (!test_clear_buffer_uninit(bh) || !io_end)
 		goto out;
@@ -2920,15 +2907,7 @@  static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
 	 */
 	inode = io_end->inode;
 	ext4_set_io_unwritten_flag(inode, io_end);
-
-	/* Add the io_end to per-inode completed io list*/
-	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
-	list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
-	spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-
-	wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
-	/* queue the work to convert unwritten extents to written */
-	queue_work(wq, &io_end->work);
+	ext4_add_complete_io(io_end);
 out:
 	bh->b_private = NULL;
 	bh->b_end_io = NULL;
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 9970022..5b24c40 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -71,6 +71,7 @@  void ext4_free_io_end(ext4_io_end_t *io)
 	int i;
 
 	BUG_ON(!io);
+	BUG_ON(!list_empty(&io->list));
 	BUG_ON(io->flag & EXT4_IO_END_UNWRITTEN);
 
 	if (io->page)
@@ -83,21 +84,14 @@  void ext4_free_io_end(ext4_io_end_t *io)
 	kmem_cache_free(io_end_cachep, io);
 }
 
-/*
- * check a range of space and convert unwritten extents to written.
- *
- * Called with inode->i_mutex; we depend on this when we manipulate
- * io->flag, since we could otherwise race with ext4_flush_completed_IO()
- */
-int ext4_end_io_nolock(ext4_io_end_t *io)
+/* check a range of space and convert unwritten extents to written. */
+static int ext4_end_io(ext4_io_end_t *io)
 {
 	struct inode *inode = io->inode;
 	loff_t offset = io->offset;
 	ssize_t size = io->size;
 	int ret = 0;
 
-	BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
-
 	ext4_debug("ext4_end_io_nolock: io 0x%p from inode %lu,list->next 0x%p,"
 		   "list->prev 0x%p\n",
 		   io, inode->i_ino, io->list.next, io->list.prev);
@@ -110,7 +104,6 @@  int ext4_end_io_nolock(ext4_io_end_t *io)
 			 "(inode %lu, offset %llu, size %zd, error %d)",
 			 inode->i_ino, offset, size, ret);
 	}
-	io->flag &= ~EXT4_IO_END_UNWRITTEN;
 	if (io->iocb)
 		aio_complete(io->iocb, io->result, 0);
 
@@ -122,51 +115,122 @@  int ext4_end_io_nolock(ext4_io_end_t *io)
 	return ret;
 }
 
-/*
- * work on completed aio dio IO, to convert unwritten extents to extents
- */
-static void ext4_end_io_work(struct work_struct *work)
+static void dump_completed_IO(struct inode *inode)
+{
+#ifdef	EXT4FS_DEBUG
+	struct list_head *cur, *before, *after;
+	ext4_io_end_t *io, *io0, *io1;
+	unsigned long flags;
+
+	if (list_empty(&EXT4_I(inode)->i_completed_io_list)) {
+		ext4_debug("inode %lu completed_io list is empty\n",
+			   inode->i_ino);
+		return;
+	}
+
+	ext4_debug("Dump inode %lu completed_io list\n", inode->i_ino);
+	list_for_each_entry(io, &EXT4_I(inode)->i_completed_io_list, list) {
+		cur = &io->list;
+		before = cur->prev;
+		io0 = container_of(before, ext4_io_end_t, list);
+		after = cur->next;
+		io1 = container_of(after, ext4_io_end_t, list);
+
+		ext4_debug("io 0x%p from inode %lu,prev 0x%p,next 0x%p\n",
+			    io, inode->i_ino, io0, io1);
+	}
+#endif
+}
+
+/* Add the io_end to per-inode completed end_io list. */
+void ext4_add_complete_io(ext4_io_end_t *io_end)
 {
-	ext4_io_end_t		*io = container_of(work, ext4_io_end_t, work);
-	struct inode		*inode = io->inode;
-	struct ext4_inode_info	*ei = EXT4_I(inode);
-	unsigned long		flags;
+	struct ext4_inode_info *ei = EXT4_I(io_end->inode);
+	struct workqueue_struct *wq;
+	unsigned long flags;
+
+	BUG_ON(!(io_end->flag & EXT4_IO_END_UNWRITTEN));
+	wq = EXT4_SB(io_end->inode->i_sb)->dio_unwritten_wq;
 
 	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
-	if (io->flag & EXT4_IO_END_IN_FSYNC)
-		goto requeue;
-	if (list_empty(&io->list)) {
-		spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-		goto free;
+	if (list_empty(&ei->i_completed_io_list)) {
+		io_end->flag |= EXT4_IO_END_QUEUED;
+		queue_work(wq, &io_end->work);
 	}
+	list_add_tail(&io_end->list, &ei->i_completed_io_list);
+	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+}
 
-	if (!mutex_trylock(&inode->i_mutex)) {
-		bool was_queued;
-requeue:
-		was_queued = !!(io->flag & EXT4_IO_END_QUEUED);
-		io->flag |= EXT4_IO_END_QUEUED;
-		spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-		/*
-		 * Requeue the work instead of waiting so that the work
-		 * items queued after this can be processed.
-		 */
-		queue_work(EXT4_SB(inode->i_sb)->dio_unwritten_wq, &io->work);
-		/*
-		 * To prevent the ext4-dio-unwritten thread from keeping
-		 * requeueing end_io requests and occupying cpu for too long,
-		 * yield the cpu if it sees an end_io request that has already
-		 * been requeued.
-		 */
-		if (was_queued)
-			yield();
-		return;
+static int ext4_do_flush_completed_IO(struct inode *inode,
+				      ext4_io_end_t *work_io)
+{
+	ext4_io_end_t *io;
+	struct list_head unwritten, complete, to_free;
+	unsigned long flags;
+	struct ext4_inode_info *ei = EXT4_I(inode);
+	int err, ret = 0;
+
+	INIT_LIST_HEAD(&complete);
+	INIT_LIST_HEAD(&to_free);
+
+	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+	dump_completed_IO(inode);
+	list_replace_init(&ei->i_completed_io_list, &unwritten);
+	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
+
+	while (!list_empty(&unwritten)) {
+		io = list_entry(unwritten.next, ext4_io_end_t, list);
+		BUG_ON(!(io->flag & EXT4_IO_END_UNWRITTEN));
+		list_del_init(&io->list);
+
+		err = ext4_end_io(io);
+		if (unlikely(!ret && err))
+			ret = err;
+
+		list_add_tail(&io->list, &complete);
+	}
+	/* It is important to update all flags for all end_io in one shot w/o
+	 * dropping the lock.*/
+	spin_lock_irqsave(&ei->i_completed_io_lock, flags);
+	while (!list_empty(&complete)) {
+		io = list_entry(complete.next, ext4_io_end_t, list);
+		io->flag &= ~EXT4_IO_END_UNWRITTEN;
+		/* end_io context can not be destroyed now because it still
+		 * used by queued worker. Worker thread will destroy it later */
+		if (io->flag & EXT4_IO_END_QUEUED)
+			list_del_init(&io->list);
+		else
+			list_move(&io->list, &to_free);
+	}
+	/* If we are called from worker context, it is time to clear queued
+	 * flag, and destroy it's end_io if it was converted already */
+	if (work_io) {
+		work_io->flag &= ~EXT4_IO_END_QUEUED;
+		if (!(work_io->flag & EXT4_IO_END_UNWRITTEN))
+			list_add_tail(&work_io->list, &to_free);
 	}
-	list_del_init(&io->list);
 	spin_unlock_irqrestore(&ei->i_completed_io_lock, flags);
-	(void) ext4_end_io_nolock(io);
-	mutex_unlock(&inode->i_mutex);
-free:
-	ext4_free_io_end(io);
+
+	while (!list_empty(&to_free)) {
+		io = list_entry(to_free.next, ext4_io_end_t, list);
+		list_del_init(&io->list);
+		ext4_free_io_end(io);
+	}
+	return ret;
+}
+
+/*
+ * work on completed aio dio IO, to convert unwritten extents to extents
+ */
+static void ext4_end_io_work(struct work_struct *work)
+{
+	ext4_io_end_t *io = container_of(work, ext4_io_end_t, work);
+	ext4_do_flush_completed_IO(io->inode, io);
+}
+
+int ext4_flush_completed_IO(struct inode *inode)
+{
+	return ext4_do_flush_completed_IO(inode, NULL);
 }
 
 ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags)
@@ -199,9 +263,7 @@  static void buffer_io_error(struct buffer_head *bh)
 static void ext4_end_bio(struct bio *bio, int error)
 {
 	ext4_io_end_t *io_end = bio->bi_private;
-	struct workqueue_struct *wq;
 	struct inode *inode;
-	unsigned long flags;
 	int i;
 	sector_t bi_sector = bio->bi_sector;
 
@@ -259,14 +321,7 @@  static void ext4_end_bio(struct bio *bio, int error)
 		return;
 	}
 
-	/* Add the io_end to per-inode completed io list*/
-	spin_lock_irqsave(&EXT4_I(inode)->i_completed_io_lock, flags);
-	list_add_tail(&io_end->list, &EXT4_I(inode)->i_completed_io_list);
-	spin_unlock_irqrestore(&EXT4_I(inode)->i_completed_io_lock, flags);
-
-	wq = EXT4_SB(inode->i_sb)->dio_unwritten_wq;
-	/* queue the work to convert unwritten extents to written */
-	queue_work(wq, &io_end->work);
+	ext4_add_complete_io(io_end);
 }
 
 void ext4_io_submit(struct ext4_io_submit *io)