diff mbox

[RFC,8/9,v1] ext4: refine unwritten extent conversion

Message ID 1356335742-11793-9-git-send-email-wenqing.lz@taobao.com
State Superseded, archived
Headers show

Commit Message

Zheng Liu Dec. 24, 2012, 7:55 a.m. UTC
From: Zheng Liu <wenqing.lz@taobao.com>

Currently all unwritten extent conversion work is pushed into a workqueue to be
done because DIO end_io is in a irq context and this conversion needs to take
i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
tracking all extent status, we can first convert this unwritten extent in extent
status tree, and call aio_complete() and inode_dio_done() to notify upper level
that this dio has done.  We don't need to be worried about exposing a stale data
because we first try to lookup in extent status tree.  So it makes us to see the
latest extent status.  Meanwhile we queue a work to convert this unwritten
extent in extent tree.  After this improvement, reader also needn't wait this
conversion to be done when dioread_nolock is enabled.

CC: Jan Kara <jack@suse.cz>
CC: "Darrick J. Wong" <darrick.wong@oracle.com>
CC: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
[Cc' to Jan, Darrick, and Christoph because Christoph is trying to handle
O_(D)SYNC for AIO]

Hi Jan, Darrick, and Christoph,

This patch refines the unwritten extent conversion in ext4.  Now we can call
aio_complete() and inode_dio_done() in end_io function.  I believe Christoph's
patch also can work well after applied this patch.  Could you please review
this patch?

Regards,
					- Zheng

 fs/ext4/ext4.h     |  2 +-
 fs/ext4/indirect.c | 11 ++++-------
 fs/ext4/inode.c    | 11 ++++-------
 fs/ext4/page-io.c  | 26 ++++++++++++++++++++++----
 4 files changed, 31 insertions(+), 19 deletions(-)

Comments

Jan Kara Dec. 31, 2012, 4:36 p.m. UTC | #1
On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently all unwritten extent conversion work is pushed into a workqueue to be
> done because DIO end_io is in a irq context and this conversion needs to take
> i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> tracking all extent status, we can first convert this unwritten extent in extent
> status tree, and call aio_complete() and inode_dio_done() to notify upper level
> that this dio has done.  We don't need to be worried about exposing a stale data
> because we first try to lookup in extent status tree.  So it makes us to see the
> latest extent status.  Meanwhile we queue a work to convert this unwritten
> extent in extent tree.  After this improvement, reader also needn't wait this
> conversion to be done when dioread_nolock is enabled.
> 
> CC: Jan Kara <jack@suse.cz>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> ---
> [Cc' to Jan, Darrick, and Christoph because Christoph is trying to handle
> O_(D)SYNC for AIO]
> 
> Hi Jan, Darrick, and Christoph,
> 
> This patch refines the unwritten extent conversion in ext4.  Now we can call
> aio_complete() and inode_dio_done() in end_io function.  I believe Christoph's
> patch also can work well after applied this patch.  Could you please review
> this patch?
  Umm, I don't understand one thing (please bear with me, I've not followed
extent status tree work in detail): After you report IO completion, you
must make sure subsequent read returns data you wrote. Thus you would need
to track also physical location of all extents that are written but not yet
converted in the extent status tree. I'm not sure which patches are in
flight but it's definitely not happening right now and it seems to me it
would complicate the extent status tree (effectively making a full extent
cache out of it, making it considerably heavier etc.). If extent tree is
really going that way, then what you propose is probably a good idea. I'm
still somewhat uneasy about completing the IO before it's really on disk
(we still need flushing of conversions in various places) but doing the
conversion before completing the IO has its own (locking) issues especially
for writeback path. So the solution using extent status tree is fine.

								Honza
Jan Kara Dec. 31, 2012, 5:04 p.m. UTC | #2
On Mon 31-12-12 17:36:21, Jan Kara wrote:
> On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > done because DIO end_io is in a irq context and this conversion needs to take
> > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > tracking all extent status, we can first convert this unwritten extent in extent
> > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > that this dio has done.  We don't need to be worried about exposing a stale data
> > because we first try to lookup in extent status tree.  So it makes us to see the
> > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > extent in extent tree.  After this improvement, reader also needn't wait this
> > conversion to be done when dioread_nolock is enabled.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > CC: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > ---
> > [Cc' to Jan, Darrick, and Christoph because Christoph is trying to handle
> > O_(D)SYNC for AIO]
> > 
> > Hi Jan, Darrick, and Christoph,
> > 
> > This patch refines the unwritten extent conversion in ext4.  Now we can call
> > aio_complete() and inode_dio_done() in end_io function.  I believe Christoph's
> > patch also can work well after applied this patch.  Could you please review
> > this patch?
>   Umm, I don't understand one thing (please bear with me, I've not followed
> extent status tree work in detail): After you report IO completion, you
> must make sure subsequent read returns data you wrote. Thus you would need
> to track also physical location of all extents that are written but not yet
> converted in the extent status tree. I'm not sure which patches are in
> flight but it's definitely not happening right now and it seems to me it
> would complicate the extent status tree (effectively making a full extent
> cache out of it, making it considerably heavier etc.). If extent tree is
> really going that way, then what you propose is probably a good idea. I'm
> still somewhat uneasy about completing the IO before it's really on disk
> (we still need flushing of conversions in various places) but doing the
> conversion before completing the IO has its own (locking) issues especially
> for writeback path. So the solution using extent status tree is fine.
  Ah, the patch is part of a series which changes the extent tree :) OK,
I'm looking into the patches...

								Honza
Jan Kara Dec. 31, 2012, 9:58 p.m. UTC | #3
On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> Currently all unwritten extent conversion work is pushed into a workqueue to be
> done because DIO end_io is in a irq context and this conversion needs to take
> i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> tracking all extent status, we can first convert this unwritten extent in extent
> status tree, and call aio_complete() and inode_dio_done() to notify upper level
> that this dio has done.  We don't need to be worried about exposing a stale data
> because we first try to lookup in extent status tree.  So it makes us to see the
> latest extent status.  Meanwhile we queue a work to convert this unwritten
> extent in extent tree.  After this improvement, reader also needn't wait this
> conversion to be done when dioread_nolock is enabled.
> 
> CC: Jan Kara <jack@suse.cz>
> CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> CC: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
  OK, so after reading other patches this should work fine. Just I think we
should somehow mark in the extent status tree that the extent tree is
inconsistent with what's on disk - something like extent dirty bit. It will
be set for UNWRITTEN extents where conversion is pending logically it would
also make sence to have it set for DELAYED extents. Then if we need to
reclaim some extents due to memory pressure we know we have to keep dirty
extents because those cache irreplacible information. What do you think?

								Honza
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 8462eb3..b76dc49 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -195,7 +195,6 @@ struct mpage_da_data {
>  #define	EXT4_IO_END_UNWRITTEN	0x0001
>  #define EXT4_IO_END_ERROR	0x0002
>  #define EXT4_IO_END_QUEUED	0x0004
> -#define EXT4_IO_END_DIRECT	0x0008
>  
>  struct ext4_io_page {
>  	struct page	*p_page;
> @@ -2538,6 +2537,7 @@ extern void ext4_ioend_wait(struct inode *);
>  extern void ext4_free_io_end(ext4_io_end_t *io);
>  extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>  extern void ext4_io_submit(struct ext4_io_submit *io);
> +extern void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio);
>  extern int ext4_bio_write_page(struct ext4_io_submit *io,
>  			       struct page *page,
>  			       int len,
> diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
> index 20862f9..c6d7f7f 100644
> --- a/fs/ext4/indirect.c
> +++ b/fs/ext4/indirect.c
> @@ -807,11 +807,6 @@ ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
>  
>  retry:
>  	if (rw == READ && ext4_should_dioread_nolock(inode)) {
> -		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
> -			mutex_lock(&inode->i_mutex);
> -			ext4_flush_unwritten_io(inode);
> -			mutex_unlock(&inode->i_mutex);
> -		}
>  		/*
>  		 * Nolock dioread optimization may be dynamically disabled
>  		 * via ext4_inode_block_unlocked_dio(). Check inode's state
> @@ -831,8 +826,10 @@ retry:
>  		inode_dio_done(inode);
>  	} else {
>  locked:
> -		ret = blockdev_direct_IO(rw, iocb, inode, iov,
> -				 offset, nr_segs, ext4_get_block);
> +		ret = __blockdev_direct_IO(rw, iocb, inode,
> +				 inode->i_sb->s_bdev, iov,
> +				 offset, nr_segs,
> +				 ext4_get_block, NULL, NULL, DIO_LOCKING);
>  
>  		if (unlikely((rw & WRITE) && ret < 0)) {
>  			loff_t isize = i_size_read(inode);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 6610dc7..4549103 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3024,9 +3024,8 @@ static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
>  			       EXT4_GET_BLOCKS_NO_LOCK);
>  }
>  
> -static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> -			    ssize_t size, void *private, int ret,
> -			    bool is_async)
> +static void ext4_end_dio(struct kiocb *iocb, loff_t offset, ssize_t size,
> +			 void *private, int ret, bool is_async)
>  {
>  	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
>          ext4_io_end_t *io_end = iocb->private;
> @@ -3058,8 +3057,7 @@ out:
>  		io_end->iocb = iocb;
>  		io_end->result = ret;
>  	}
> -
> -	ext4_add_complete_io(io_end);
> +	ext4_end_io_bh(io_end, 1);
>  }
>  
>  static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
> @@ -3195,7 +3193,6 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  			ret = -ENOMEM;
>  			goto retake_lock;
>  		}
> -		io_end->flag |= EXT4_IO_END_DIRECT;
>  		iocb->private = io_end;
>  		/*
>  		 * we save the io structure for current async direct
> @@ -3216,7 +3213,7 @@ static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
>  				   inode->i_sb->s_bdev, iov,
>  				   offset, nr_segs,
>  				   get_block_func,
> -				   ext4_end_io_dio,
> +				   ext4_end_dio,
>  				   NULL,
>  				   dio_flags);
>  
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index 0016fbc..6b2d88d 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -103,17 +103,35 @@ static int ext4_end_io(ext4_io_end_t *io)
>  			 "(inode %lu, offset %llu, size %zd, error %d)",
>  			 inode->i_ino, offset, size, ret);
>  	}
> -	if (io->iocb)
> -		aio_complete(io->iocb, io->result, 0);
>  
> -	if (io->flag & EXT4_IO_END_DIRECT)
> -		inode_dio_done(inode);
>  	/* Wake up anyone waiting on unwritten extent conversion */
>  	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
>  		wake_up_all(ext4_ioend_wq(inode));
> +
>  	return ret;
>  }
>  
> +void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio)
> +{
> +	struct inode *inode;
> +
> +	inode = io_end->inode;
> +	(void)ext4_es_convert_unwritten_extents(inode, io_end->offset,
> +						io_end->size);
> +	ext4_add_complete_io(io_end);
> +
> +	/*
> +	 * Here we can safely notify upper level that aio has done because
> +	 * unwritten extent in extent status tree has been converted.  Thus,
> +	 * others won't get a stale data because we always lookup extent status
> +	 * tree firstly in get_block_t.
> +	 */
> +	if (io_end->iocb)
> +		aio_complete(io_end->iocb, io_end->result, 0);
> +	if (is_dio)
> +		inode_dio_done(inode);
> +}
> +
>  static void dump_completed_IO(struct inode *inode)
>  {
>  #ifdef	EXT4FS_DEBUG
> -- 
> 1.7.12.rc2.18.g61b472e
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu Jan. 1, 2013, 5:24 a.m. UTC | #4
On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > done because DIO end_io is in a irq context and this conversion needs to take
> > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > tracking all extent status, we can first convert this unwritten extent in extent
> > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > that this dio has done.  We don't need to be worried about exposing a stale data
> > because we first try to lookup in extent status tree.  So it makes us to see the
> > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > extent in extent tree.  After this improvement, reader also needn't wait this
> > conversion to be done when dioread_nolock is enabled.
> > 
> > CC: Jan Kara <jack@suse.cz>
> > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > CC: Christoph Hellwig <hch@infradead.org>
> > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
>   OK, so after reading other patches this should work fine. Just I think we
> should somehow mark in the extent status tree that the extent tree is
> inconsistent with what's on disk - something like extent dirty bit. It will
> be set for UNWRITTEN extents where conversion is pending logically it would
> also make sence to have it set for DELAYED extents. Then if we need to
> reclaim some extents due to memory pressure we know we have to keep dirty
> extents because those cache irreplacible information. What do you think?

Dirty bit is a good idea for UNWRITTEN extent because we can feel free
to reclaim all WRITTEN extents and all UNWRITTEN extents that are
without dirty bit.  But we can not reclaim DEALYED extents no matter
whether they are dirty or not because they are used to lookup an delayed
extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
extent must be kept in status tree.  That is why in step 1 status tree
only tracks all DELAYED extents in the tree.

Thanks,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara Jan. 3, 2013, 10:56 a.m. UTC | #5
On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > done because DIO end_io is in a irq context and this conversion needs to take
> > > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > > tracking all extent status, we can first convert this unwritten extent in extent
> > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > that this dio has done.  We don't need to be worried about exposing a stale data
> > > because we first try to lookup in extent status tree.  So it makes us to see the
> > > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > > extent in extent tree.  After this improvement, reader also needn't wait this
> > > conversion to be done when dioread_nolock is enabled.
> > > 
> > > CC: Jan Kara <jack@suse.cz>
> > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > CC: Christoph Hellwig <hch@infradead.org>
> > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> >   OK, so after reading other patches this should work fine. Just I think we
> > should somehow mark in the extent status tree that the extent tree is
> > inconsistent with what's on disk - something like extent dirty bit. It will
> > be set for UNWRITTEN extents where conversion is pending logically it would
> > also make sence to have it set for DELAYED extents. Then if we need to
> > reclaim some extents due to memory pressure we know we have to keep dirty
> > extents because those cache irreplacible information. What do you think?
> 
> Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> without dirty bit.  But we can not reclaim DEALYED extents no matter
> whether they are dirty or not because they are used to lookup an delayed
> extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
> extent must be kept in status tree.  That is why in step 1 status tree
> only tracks all DELAYED extents in the tree.
  So I was thinking about this some more and also testing some code and I
realized that using extent status tree won't be enough (sadly). In case of
AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
before we can complete the AIO. So extent status tree won't help us in any
way.

Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
which currently waits for all unwritten extents to convert and that
effectively deadlocks the conversion thread if there are more DIO
conversions pending. To fix this we would have to hack around
ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
gets nasty quickly. So I'm currently back to my original plan of completing
IO only after extent conversion happens... I'll see how that works out.

Eventually we could *optimize* that by doing the extent conversion only in
the extent status tree if possible but I wouldn't bother with it right now.
For one thing, I also realized we would probably have to somehow throttle
writers so that there are not too many outstanding conversions (when we
complete AIO only after the conversion is finished, writer is naturally
limited by the amount of AIOs allowed).

								Honza
Zheng Liu Jan. 4, 2013, 4:26 a.m. UTC | #6
On Thu, Jan 03, 2013 at 11:56:13AM +0100, Jan Kara wrote:
> On Tue 01-01-13 13:24:45, Zheng Liu wrote:
> > On Mon, Dec 31, 2012 at 10:58:15PM +0100, Jan Kara wrote:
> > > On Mon 24-12-12 15:55:41, Zheng Liu wrote:
> > > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > > 
> > > > Currently all unwritten extent conversion work is pushed into a workqueue to be
> > > > done because DIO end_io is in a irq context and this conversion needs to take
> > > > i_data_sem locking.  But we couldn't take a semaphore in a irq context.  After
> > > > tracking all extent status, we can first convert this unwritten extent in extent
> > > > status tree, and call aio_complete() and inode_dio_done() to notify upper level
> > > > that this dio has done.  We don't need to be worried about exposing a stale data
> > > > because we first try to lookup in extent status tree.  So it makes us to see the
> > > > latest extent status.  Meanwhile we queue a work to convert this unwritten
> > > > extent in extent tree.  After this improvement, reader also needn't wait this
> > > > conversion to be done when dioread_nolock is enabled.
> > > > 
> > > > CC: Jan Kara <jack@suse.cz>
> > > > CC: "Darrick J. Wong" <darrick.wong@oracle.com>
> > > > CC: Christoph Hellwig <hch@infradead.org>
> > > > Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
> > >   OK, so after reading other patches this should work fine. Just I think we
> > > should somehow mark in the extent status tree that the extent tree is
> > > inconsistent with what's on disk - something like extent dirty bit. It will
> > > be set for UNWRITTEN extents where conversion is pending logically it would
> > > also make sence to have it set for DELAYED extents. Then if we need to
> > > reclaim some extents due to memory pressure we know we have to keep dirty
> > > extents because those cache irreplacible information. What do you think?
> > 
> > Dirty bit is a good idea for UNWRITTEN extent because we can feel free
> > to reclaim all WRITTEN extents and all UNWRITTEN extents that are
> > without dirty bit.  But we can not reclaim DEALYED extents no matter
> > whether they are dirty or not because they are used to lookup an delayed
> > extent in fiemap, seek_data/hole, and bigalloc.  So at least DEALYED
> > extent must be kept in status tree.  That is why in step 1 status tree
> > only tracks all DELAYED extents in the tree.
>   So I was thinking about this some more and also testing some code and I
> realized that using extent status tree won't be enough (sadly). In case of
> AIO DIO with O_SYNC set, we have to perform extent conversion on *disk*
> before we can complete the AIO. So extent status tree won't help us in any
> way.

Ah, I see.  Conversion must be done in disk before aio_complete() is
called if AIO DIO with O_SYNC set.  So now we must call aio_complete()
after ext4_convert_unwritten_extents() in ext4_end_io().

> 
> Furthermore O_SYNC writes end up calling ->fsync() after IO is finished
> which currently waits for all unwritten extents to convert and that
> effectively deadlocks the conversion thread if there are more DIO
> conversions pending. To fix this we would have to hack around
> ext4_file_fsync() to avoid waiting in case of O_SYNC AIO writes and that
> gets nasty quickly. So I'm currently back to my original plan of completing
> IO only after extent conversion happens... I'll see how that works out.
> 
> Eventually we could *optimize* that by doing the extent conversion only in
> the extent status tree if possible but I wouldn't bother with it right now.
> For one thing, I also realized we would probably have to somehow throttle
> writers so that there are not too many outstanding conversions (when we
> complete AIO only after the conversion is finished, writer is naturally
> limited by the amount of AIOs allowed).

Sorry, currently no any idea is in my mind.  I will think about it. :-(

Thanks,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8462eb3..b76dc49 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -195,7 +195,6 @@  struct mpage_da_data {
 #define	EXT4_IO_END_UNWRITTEN	0x0001
 #define EXT4_IO_END_ERROR	0x0002
 #define EXT4_IO_END_QUEUED	0x0004
-#define EXT4_IO_END_DIRECT	0x0008
 
 struct ext4_io_page {
 	struct page	*p_page;
@@ -2538,6 +2537,7 @@  extern void ext4_ioend_wait(struct inode *);
 extern void ext4_free_io_end(ext4_io_end_t *io);
 extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
 extern void ext4_io_submit(struct ext4_io_submit *io);
+extern void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio);
 extern int ext4_bio_write_page(struct ext4_io_submit *io,
 			       struct page *page,
 			       int len,
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 20862f9..c6d7f7f 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -807,11 +807,6 @@  ssize_t ext4_ind_direct_IO(int rw, struct kiocb *iocb,
 
 retry:
 	if (rw == READ && ext4_should_dioread_nolock(inode)) {
-		if (unlikely(atomic_read(&EXT4_I(inode)->i_unwritten))) {
-			mutex_lock(&inode->i_mutex);
-			ext4_flush_unwritten_io(inode);
-			mutex_unlock(&inode->i_mutex);
-		}
 		/*
 		 * Nolock dioread optimization may be dynamically disabled
 		 * via ext4_inode_block_unlocked_dio(). Check inode's state
@@ -831,8 +826,10 @@  retry:
 		inode_dio_done(inode);
 	} else {
 locked:
-		ret = blockdev_direct_IO(rw, iocb, inode, iov,
-				 offset, nr_segs, ext4_get_block);
+		ret = __blockdev_direct_IO(rw, iocb, inode,
+				 inode->i_sb->s_bdev, iov,
+				 offset, nr_segs,
+				 ext4_get_block, NULL, NULL, DIO_LOCKING);
 
 		if (unlikely((rw & WRITE) && ret < 0)) {
 			loff_t isize = i_size_read(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 6610dc7..4549103 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3024,9 +3024,8 @@  static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
 			       EXT4_GET_BLOCKS_NO_LOCK);
 }
 
-static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
-			    ssize_t size, void *private, int ret,
-			    bool is_async)
+static void ext4_end_dio(struct kiocb *iocb, loff_t offset, ssize_t size,
+			 void *private, int ret, bool is_async)
 {
 	struct inode *inode = iocb->ki_filp->f_path.dentry->d_inode;
         ext4_io_end_t *io_end = iocb->private;
@@ -3058,8 +3057,7 @@  out:
 		io_end->iocb = iocb;
 		io_end->result = ret;
 	}
-
-	ext4_add_complete_io(io_end);
+	ext4_end_io_bh(io_end, 1);
 }
 
 static void ext4_end_io_buffer_write(struct buffer_head *bh, int uptodate)
@@ -3195,7 +3193,6 @@  static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 			ret = -ENOMEM;
 			goto retake_lock;
 		}
-		io_end->flag |= EXT4_IO_END_DIRECT;
 		iocb->private = io_end;
 		/*
 		 * we save the io structure for current async direct
@@ -3216,7 +3213,7 @@  static ssize_t ext4_ext_direct_IO(int rw, struct kiocb *iocb,
 				   inode->i_sb->s_bdev, iov,
 				   offset, nr_segs,
 				   get_block_func,
-				   ext4_end_io_dio,
+				   ext4_end_dio,
 				   NULL,
 				   dio_flags);
 
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 0016fbc..6b2d88d 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -103,17 +103,35 @@  static int ext4_end_io(ext4_io_end_t *io)
 			 "(inode %lu, offset %llu, size %zd, error %d)",
 			 inode->i_ino, offset, size, ret);
 	}
-	if (io->iocb)
-		aio_complete(io->iocb, io->result, 0);
 
-	if (io->flag & EXT4_IO_END_DIRECT)
-		inode_dio_done(inode);
 	/* Wake up anyone waiting on unwritten extent conversion */
 	if (atomic_dec_and_test(&EXT4_I(inode)->i_unwritten))
 		wake_up_all(ext4_ioend_wq(inode));
+
 	return ret;
 }
 
+void ext4_end_io_bh(ext4_io_end_t *io_end, int is_dio)
+{
+	struct inode *inode;
+
+	inode = io_end->inode;
+	(void)ext4_es_convert_unwritten_extents(inode, io_end->offset,
+						io_end->size);
+	ext4_add_complete_io(io_end);
+
+	/*
+	 * Here we can safely notify upper level that aio has done because
+	 * unwritten extent in extent status tree has been converted.  Thus,
+	 * others won't get a stale data because we always lookup extent status
+	 * tree firstly in get_block_t.
+	 */
+	if (io_end->iocb)
+		aio_complete(io_end->iocb, io_end->result, 0);
+	if (is_dio)
+		inode_dio_done(inode);
+}
+
 static void dump_completed_IO(struct inode *inode)
 {
 #ifdef	EXT4FS_DEBUG