diff mbox

[v3,1/2] ext4: Pass in DIO_SKIP_DIO_COUNT flag if inode_dio_begin() called

Message ID 1460484775-33359-2-git-send-email-Waiman.Long@hpe.com
State Superseded, archived
Headers show

Commit Message

Waiman Long April 12, 2016, 6:12 p.m. UTC
When performing direct I/O, the current ext4 code does
not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
inode_dio_begin()/inode_dio_end() internally.  This doubling of
inode_dio_begin()/inode_dio_end() calls are wasteful.

This patch removes the extra internal inode_dio_begin()/inode_dio_end()
calls when those calls are being issued by the caller directly. For
really fast storage systems like NVDIMM, the removal of the extra
inode_dio_begin()/inode_dio_end() can give a meaningful boost to
I/O performance.

On a 4-socket Haswell-EX system (72 cores) running 4.6-rc1 kernel,
fio with 38 threads doing parallel I/O on two shared files on an
NVDIMM with DAX gave the following aggregrate bandwidth with and
without the patch:

  Test          W/O patch       With patch      % change
  ----          ---------       ----------      --------
  Read-only      8688MB/s       10173MB/s        +17.1%
  Read-write     2687MB/s        2830MB/s         +5.3%

Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/ext4/indirect.c |   10 ++++++++--
 fs/ext4/inode.c    |   12 +++++++++---
 2 files changed, 17 insertions(+), 5 deletions(-)

Comments

Dave Chinner April 14, 2016, 3:16 a.m. UTC | #1
On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> When performing direct I/O, the current ext4 code does
> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> inode_dio_begin()/inode_dio_end() internally.  This doubling of
> inode_dio_begin()/inode_dio_end() calls are wasteful.
> 
> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> calls when those calls are being issued by the caller directly. For
> really fast storage systems like NVDIMM, the removal of the extra
> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> I/O performance.

Doesn't this break truncate IO serialisation?

i.e. it appears to me that the ext4 use of inode_dio_begin()/
inode_dio_end() does not cover AIO, where the IO is still in flight
when submission returns. i.e. the inode_dio_end() call
needs to be in IO completion, not in the submitter context. The only
reason it doesn't break right now is that the duplicate accounting
in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
accounting will cause AIO writes to race with truncate.

Same AIO vs truncate problem occurs with the indirect read case you
modified to skip the direct IO layer accounting.

Cheers,

Dave.
Waiman Long April 14, 2016, 4:21 p.m. UTC | #2
On 04/13/2016 11:16 PM, Dave Chinner wrote:
> On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
>> When performing direct I/O, the current ext4 code does
>> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
>> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
>> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
>> inode_dio_begin()/inode_dio_end() internally.  This doubling of
>> inode_dio_begin()/inode_dio_end() calls are wasteful.
>>
>> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
>> calls when those calls are being issued by the caller directly. For
>> really fast storage systems like NVDIMM, the removal of the extra
>> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
>> I/O performance.
> Doesn't this break truncate IO serialisation?
>
> i.e. it appears to me that the ext4 use of inode_dio_begin()/
> inode_dio_end() does not cover AIO, where the IO is still in flight
> when submission returns. i.e. the inode_dio_end() call
> needs to be in IO completion, not in the submitter context. The only
> reason it doesn't break right now is that the duplicate accounting
> in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> accounting will cause AIO writes to race with truncate.
>
> Same AIO vs truncate problem occurs with the indirect read case you
> modified to skip the direct IO layer accounting.

I don't quite understand how the duplicate accounting is correct wrt 
AIO. Both the direct and indirect paths are something like:

     inode_dio_begin()
     ...
         inode_dio_begin()
         ...
         inode_dio_end()
     ...
     inode_dio_end()

What the patch does is to eliminate the innermost inode_dio_begin/end 
pair. Unless there is a difference between a dio count of 2 vs. 1, I 
can't see how the code correctness differ with and without my patch.

Cheers,
Longman


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 15, 2016, 8:17 a.m. UTC | #3
On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> On 04/13/2016 11:16 PM, Dave Chinner wrote:
> >On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> >>When performing direct I/O, the current ext4 code does
> >>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> >>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> >>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> >>inode_dio_begin()/inode_dio_end() internally.  This doubling of
> >>inode_dio_begin()/inode_dio_end() calls are wasteful.
> >>
> >>This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> >>calls when those calls are being issued by the caller directly. For
> >>really fast storage systems like NVDIMM, the removal of the extra
> >>inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> >>I/O performance.
> >Doesn't this break truncate IO serialisation?
> >
> >i.e. it appears to me that the ext4 use of inode_dio_begin()/
> >inode_dio_end() does not cover AIO, where the IO is still in flight
> >when submission returns. i.e. the inode_dio_end() call
> >needs to be in IO completion, not in the submitter context. The only
> >reason it doesn't break right now is that the duplicate accounting
> >in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> >accounting will cause AIO writes to race with truncate.
> >
> >Same AIO vs truncate problem occurs with the indirect read case you
> >modified to skip the direct IO layer accounting.
> 
> I don't quite understand how the duplicate accounting is correct wrt
> AIO. Both the direct and indirect paths are something like:
> 
>     inode_dio_begin()
>     ...
>         inode_dio_begin()
>         ...
>         inode_dio_end()
>     ...
>     inode_dio_end()

With AIO:

	inode_dio_begin()
	...
		inode_dio_begin()
		<submit IO, no wait>
	...
	inode_dio_end()
<ext4 returns to userspace with AIO+DIO in progress>

<some time later DIO completes>
	        dio_complete
		  inode_dio_end()

IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
does not wait for IO completion before returning.

> What the patch does is to eliminate the innermost
> inode_dio_begin/end pair.

Yes, and with that change inode_dio_wait() no longer waits for
AIO+DIO writes on ext4, hence breaking truncate IO barrier
requirements of inode_dio_wait().

Cheers,

Dave.
Waiman Long April 15, 2016, 5:17 p.m. UTC | #4
On 04/15/2016 04:17 AM, Dave Chinner wrote:
> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>> On 04/13/2016 11:16 PM, Dave Chinner wrote:
>>> On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
>>>> When performing direct I/O, the current ext4 code does
>>>> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
>>>> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been
>>>> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
>>>> inode_dio_begin()/inode_dio_end() internally.  This doubling of
>>>> inode_dio_begin()/inode_dio_end() calls are wasteful.
>>>>
>>>> This patch removes the extra internal inode_dio_begin()/inode_dio_end()
>>>> calls when those calls are being issued by the caller directly. For
>>>> really fast storage systems like NVDIMM, the removal of the extra
>>>> inode_dio_begin()/inode_dio_end() can give a meaningful boost to
>>>> I/O performance.
>>> Doesn't this break truncate IO serialisation?
>>>
>>> i.e. it appears to me that the ext4 use of inode_dio_begin()/
>>> inode_dio_end() does not cover AIO, where the IO is still in flight
>>> when submission returns. i.e. the inode_dio_end() call
>>> needs to be in IO completion, not in the submitter context. The only
>>> reason it doesn't break right now is that the duplicate accounting
>>> in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
>>> accounting will cause AIO writes to race with truncate.
>>>
>>> Same AIO vs truncate problem occurs with the indirect read case you
>>> modified to skip the direct IO layer accounting.
>> I don't quite understand how the duplicate accounting is correct wrt
>> AIO. Both the direct and indirect paths are something like:
>>
>>      inode_dio_begin()
>>      ...
>>          inode_dio_begin()
>>          ...
>>          inode_dio_end()
>>      ...
>>      inode_dio_end()
> With AIO:
>
> 	inode_dio_begin()
> 	...
> 		inode_dio_begin()
> 		<submit IO, no wait>
> 	...
> 	inode_dio_end()
> <ext4 returns to userspace with AIO+DIO in progress>
>
> <some time later DIO completes>
> 	dio_complete
> 		  inode_dio_end()
>
> IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
> does not wait for IO completion before returning.
>
>> What the patch does is to eliminate the innermost
>> inode_dio_begin/end pair.
> Yes, and with that change inode_dio_wait() no longer waits for
> AIO+DIO writes on ext4, hence breaking truncate IO barrier
> requirements of inode_dio_wait().
>
> Cheers,
>
> Dave.

You are right and thank for pointing this out to me. I think I focus too 
much on the dax_do_io() internal and didn't realize that inode_dio_end() 
can be deferred in __blockdev_direct_IO(). I will update my patch to 
eliminate the extra inode_dio_begin/end pair only for dax_do_io().

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 15, 2016, 10:19 p.m. UTC | #5
On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
> On 04/15/2016 04:17 AM, Dave Chinner wrote:
> >On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> >>On 04/13/2016 11:16 PM, Dave Chinner wrote:
> >>>On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote:
> >>>>When performing direct I/O, the current ext4 code does
> >>>>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or
> >>>>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been
> >>>>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke
> >>>>inode_dio_begin()/inode_dio_end() internally.  This doubling of
> >>>>inode_dio_begin()/inode_dio_end() calls are wasteful.
> >>>>
> >>>>This patch removes the extra internal inode_dio_begin()/inode_dio_end()
> >>>>calls when those calls are being issued by the caller directly. For
> >>>>really fast storage systems like NVDIMM, the removal of the extra
> >>>>inode_dio_begin()/inode_dio_end() can give a meaningful boost to
> >>>>I/O performance.
> >>>Doesn't this break truncate IO serialisation?
> >>>
> >>>i.e. it appears to me that the ext4 use of inode_dio_begin()/
> >>>inode_dio_end() does not cover AIO, where the IO is still in flight
> >>>when submission returns. i.e. the inode_dio_end() call
> >>>needs to be in IO completion, not in the submitter context. The only
> >>>reason it doesn't break right now is that the duplicate accounting
> >>>in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO
> >>>accounting will cause AIO writes to race with truncate.
> >>>
> >>>Same AIO vs truncate problem occurs with the indirect read case you
> >>>modified to skip the direct IO layer accounting.
> >>I don't quite understand how the duplicate accounting is correct wrt
> >>AIO. Both the direct and indirect paths are something like:
> >>
> >>     inode_dio_begin()
> >>     ...
> >>         inode_dio_begin()
> >>         ...
> >>         inode_dio_end()
> >>     ...
> >>     inode_dio_end()
> >With AIO:
> >
> >	inode_dio_begin()
> >	...
> >		inode_dio_begin()
> >		<submit IO, no wait>
> >	...
> >	inode_dio_end()
> ><ext4 returns to userspace with AIO+DIO in progress>
> >
> ><some time later DIO completes>
> >	dio_complete
> >		  inode_dio_end()
> >
> >IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission
> >does not wait for IO completion before returning.
> >
> >>What the patch does is to eliminate the innermost
> >>inode_dio_begin/end pair.
> >Yes, and with that change inode_dio_wait() no longer waits for
> >AIO+DIO writes on ext4, hence breaking truncate IO barrier
> >requirements of inode_dio_wait().
> >
> >Cheers,
> >
> >Dave.
> 
> You are right and thank for pointing this out to me. I think I focus too
> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
> the extra inode_dio_begin/end pair only for dax_do_io().

Even there there is the risk that a future change will break ext4.
the ext4 code needs fixing first, then you can look at skipping the
DIO based counting everywhere.

i.e. fix the root cause of the problem, don't hack around it or
throw band-aids over it.

Cheers,

Dave.
Waiman Long April 18, 2016, 7:46 p.m. UTC | #6
On 04/15/2016 06:19 PM, Dave Chinner wrote:
> On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
>> On 04/15/2016 04:17 AM, Dave Chinner wrote:
>>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>>>> What the patch does is to eliminate the innermost
>>>> inode_dio_begin/end pair.
>>> Yes, and with that change inode_dio_wait() no longer waits for
>>> AIO+DIO writes on ext4, hence breaking truncate IO barrier
>>> requirements of inode_dio_wait().
>>>
>>> Cheers,
>>>
>>> Dave.
>> You are right and thank for pointing this out to me. I think I focus too
>> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
>> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
>> the extra inode_dio_begin/end pair only for dax_do_io().
> Even there there is the risk that a future change will break ext4.
> the ext4 code needs fixing first, then you can look at skipping the
> DIO based counting everywhere.
>
> i.e. fix the root cause of the problem, don't hack around it or
> throw band-aids over it.
>
> Cheers,
>
> Dave.

I agree that the ext4 code needs fixing w.r.t. the problem that you 
found. That will take more time and testing. In the mean time, I think 
it is OK to pick the low-hanging fruits that are handled by my patch.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner April 19, 2016, 11:01 p.m. UTC | #7
On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote:
> On 04/15/2016 06:19 PM, Dave Chinner wrote:
> >On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
> >>On 04/15/2016 04:17 AM, Dave Chinner wrote:
> >>>On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
> >>>>What the patch does is to eliminate the innermost
> >>>>inode_dio_begin/end pair.
> >>>Yes, and with that change inode_dio_wait() no longer waits for
> >>>AIO+DIO writes on ext4, hence breaking truncate IO barrier
> >>>requirements of inode_dio_wait().
> >>>
> >>>Cheers,
> >>>
> >>>Dave.
> >>You are right and thank for pointing this out to me. I think I focus too
> >>much on the dax_do_io() internal and didn't realize that inode_dio_end() can
> >>be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
> >>the extra inode_dio_begin/end pair only for dax_do_io().
> >Even there there is the risk that a future change will break ext4.
> >the ext4 code needs fixing first, then you can look at skipping the
> >DIO based counting everywhere.
> >
> >i.e. fix the root cause of the problem, don't hack around it or
> >throw band-aids over it.
> 
> I agree that the ext4 code needs fixing w.r.t. the problem that you
> found. That will take more time and testing. In the mean time, I
> think it is OK to pick the low-hanging fruits that are handled by my
> patch.

IOWs, you're saying that you won't fix the problem, because all you
care about is scalability results. This is how we end up with code
that breaks randomly in future because if it doesn't get fixed now,
nobody will fix the underlying problem. So, fix it now, fix it
properly and you still get your scalability improvement without
leaving a landmine that will explode on someone else in future.

Fix it now, fix it properly.

Cheers,

Dave.
Waiman Long April 20, 2016, 3:59 p.m. UTC | #8
On 04/19/2016 07:01 PM, Dave Chinner wrote:
> On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote:
>> On 04/15/2016 06:19 PM, Dave Chinner wrote:
>>> On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote:
>>>> On 04/15/2016 04:17 AM, Dave Chinner wrote:
>>>>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote:
>>>>>> What the patch does is to eliminate the innermost
>>>>>> inode_dio_begin/end pair.
>>>>> Yes, and with that change inode_dio_wait() no longer waits for
>>>>> AIO+DIO writes on ext4, hence breaking truncate IO barrier
>>>>> requirements of inode_dio_wait().
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Dave.
>>>> You are right and thank for pointing this out to me. I think I focus too
>>>> much on the dax_do_io() internal and didn't realize that inode_dio_end() can
>>>> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate
>>>> the extra inode_dio_begin/end pair only for dax_do_io().
>>> Even there there is the risk that a future change will break ext4.
>>> the ext4 code needs fixing first, then you can look at skipping the
>>> DIO based counting everywhere.
>>>
>>> i.e. fix the root cause of the problem, don't hack around it or
>>> throw band-aids over it.
>> I agree that the ext4 code needs fixing w.r.t. the problem that you
>> found. That will take more time and testing. In the mean time, I
>> think it is OK to pick the low-hanging fruits that are handled by my
>> patch.
> IOWs, you're saying that you won't fix the problem, because all you
> care about is scalability results. This is how we end up with code
> that breaks randomly in future because if it doesn't get fixed now,
> nobody will fix the underlying problem. So, fix it now, fix it
> properly and you still get your scalability improvement without
> leaving a landmine that will explode on someone else in future.
>
> Fix it now, fix it properly.

I am not saying that I will not fix it. I am just saying that I need 
more time to fully understand what code changes need to be done. I am 
not that well versed in the filesystem internal, though it will be a 
good learning experience for me.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig April 20, 2016, 8:58 p.m. UTC | #9
FYI, none of the Dax code even needs to ever touch the dio_count,
as dax I/O can't be asynchronous, and we thus don't need it to protect
against truncate.  I'd suggest to remove it and then end_io callback
from the DAX code entirely as a start and then move from there.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long April 21, 2016, 6:15 p.m. UTC | #10
On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
> FYI, none of the Dax code even needs to ever touch the dio_count,
> as dax I/O can't be asynchronous, and we thus don't need it to protect
> against truncate.  I'd suggest to remove it and then end_io callback
> from the DAX code entirely as a start and then move from there.

Yes, it seems like we may not need to change the dio_count in 
dax_do_io() after all. BTW, what do mean by using end_io callback as a 
start?

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig April 25, 2016, 11:48 a.m. UTC | #11
On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote:
> On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
> >FYI, none of the Dax code even needs to ever touch the dio_count,
> >as dax I/O can't be asynchronous, and we thus don't need it to protect
> >against truncate.  I'd suggest to remove it and then end_io callback
> >from the DAX code entirely as a start and then move from there.
> 
> Yes, it seems like we may not need to change the dio_count in dax_do_io()
> after all. BTW, what do mean by using end_io callback as a start?

I mean to remove both the i_dio_count manipulation, and the unessecary
end_io callback from dax_do_io.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Waiman Long April 26, 2016, 4:32 p.m. UTC | #12
On 04/25/2016 07:48 AM, Christoph Hellwig wrote:
> On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote:
>> On 04/20/2016 04:58 PM, Christoph Hellwig wrote:
>>> FYI, none of the Dax code even needs to ever touch the dio_count,
>>> as dax I/O can't be asynchronous, and we thus don't need it to protect
>>> against truncate.  I'd suggest to remove it and then end_io callback
>> >from the DAX code entirely as a start and then move from there.
>>
>> Yes, it seems like we may not need to change the dio_count in dax_do_io()
>> after all. BTW, what do mean by using end_io callback as a start?
> I mean to remove both the i_dio_count manipulation, and the unessecary
> end_io callback from dax_do_io.

Thanks for the clarification.

Since DAX I/O is always synchronous, the locking done by the caller or 
in the dax_do_Io() for read should be enough to prevent truncation from 
happening at the same time. So we don't need to use i_dio_count for that 
purpose.

However, I have not understood enough of the block IO layer to determine 
if the end_io callback is really redundant. I am not confident enough to 
touch the end_io callback.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa6..4304be6 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -706,14 +706,20 @@  retry:
 			inode_dio_end(inode);
 			goto locked;
 		}
+		/*
+		 * Need to pass in DIO_SKIP_DIO_COUNT to prevent
+		 * duplicated inode_dio_begin/inode_dio_end sequence.
+		 */
 		if (IS_DAX(inode))
 			ret = dax_do_io(iocb, inode, iter, offset,
-					ext4_dio_get_block, NULL, 0);
+					ext4_dio_get_block, NULL,
+					DIO_SKIP_DIO_COUNT);
 		else
 			ret = __blockdev_direct_IO(iocb, inode,
 						   inode->i_sb->s_bdev, iter,
 						   offset, ext4_dio_get_block,
-						   NULL, NULL, 0);
+						   NULL, NULL,
+						   DIO_SKIP_DIO_COUNT);
 		inode_dio_end(inode);
 	} else {
 locked:
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dab84a2..779aa33 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3358,9 +3358,15 @@  static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 	 * Make all waiters for direct IO properly wait also for extent
 	 * conversion. This also disallows race between truncate() and
 	 * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
+	 *
+	 * Both dax_do_io() and __blockdev_direct_IO() will unnecessarily
+	 * call inode_dio_begin()/inode_dio_end() again if the
+	 * DIO_SKIP_DIO_COUNT flag is not set.
 	 */
-	if (iov_iter_rw(iter) == WRITE)
+	if (iov_iter_rw(iter) == WRITE) {
+		dio_flags = DIO_SKIP_DIO_COUNT;
 		inode_dio_begin(inode);
+	}
 
 	/* If we do a overwrite dio, i_mutex locking can be released */
 	overwrite = *((int *)iocb->private);
@@ -3393,10 +3399,10 @@  static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
 		get_block_func = ext4_dio_get_block_overwrite;
 	else if (is_sync_kiocb(iocb)) {
 		get_block_func = ext4_dio_get_block_unwritten_sync;
-		dio_flags = DIO_LOCKING;
+		dio_flags |= DIO_LOCKING;
 	} else {
 		get_block_func = ext4_dio_get_block_unwritten_async;
-		dio_flags = DIO_LOCKING;
+		dio_flags |= DIO_LOCKING;
 	}
 #ifdef CONFIG_EXT4_FS_ENCRYPTION
 	BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));