Message ID | 1460484775-33359-2-git-send-email-Waiman.Long@hpe.com |
---|---|
State | Superseded, archived |
Headers | show |
On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote: > When performing direct I/O, the current ext4 code does > not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or > __blockdev_direct_IO() when inode_dio_begin() has, in fact, been > called. This causes dax_do_io()/__blockdev_direct_IO() to invoke > inode_dio_begin()/inode_dio_end() internally. This doubling of > inode_dio_begin()/inode_dio_end() calls are wasteful. > > This patch removes the extra internal inode_dio_begin()/inode_dio_end() > calls when those calls are being issued by the caller directly. For > really fast storage systems like NVDIMM, the removal of the extra > inode_dio_begin()/inode_dio_end() can give a meaningful boost to > I/O performance. Doesn't this break truncate IO serialisation? i.e. it appears to me that the ext4 use of inode_dio_begin()/ inode_dio_end() does not cover AIO, where the IO is still in flight when submission returns. i.e. the inode_dio_end() call needs to be in IO completion, not in the submitter context. The only reason it doesn't break right now is that the duplicate accounting in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO accounting will cause AIO writes to race with truncate. Same AIO vs truncate problem occurs with the indirect read case you modified to skip the direct IO layer accounting. Cheers, Dave.
On 04/13/2016 11:16 PM, Dave Chinner wrote: > On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote: >> When performing direct I/O, the current ext4 code does >> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or >> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been >> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke >> inode_dio_begin()/inode_dio_end() internally. This doubling of >> inode_dio_begin()/inode_dio_end() calls are wasteful. >> >> This patch removes the extra internal inode_dio_begin()/inode_dio_end() >> calls when those calls are being issued by the caller directly. For >> really fast storage systems like NVDIMM, the removal of the extra >> inode_dio_begin()/inode_dio_end() can give a meaningful boost to >> I/O performance. > Doesn't this break truncate IO serialisation? > > i.e. it appears to me that the ext4 use of inode_dio_begin()/ > inode_dio_end() does not cover AIO, where the IO is still in flight > when submission returns. i.e. the inode_dio_end() call > needs to be in IO completion, not in the submitter context. The only > reason it doesn't break right now is that the duplicate accounting > in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO > accounting will cause AIO writes to race with truncate. > > Same AIO vs truncate problem occurs with the indirect read case you > modified to skip the direct IO layer accounting. I don't quite understand how the duplicate accounting is correct wrt AIO. Both the direct and indirect paths are something like: inode_dio_begin() ... inode_dio_begin() ... inode_dio_end() ... inode_dio_end() What the patch does is to eliminate the innermost inode_dio_begin/end pair. Unless there is a difference between a dio count of 2 vs. 1, I can't see how the code correctness differ with and without my patch. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: > On 04/13/2016 11:16 PM, Dave Chinner wrote: > >On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote: > >>When performing direct I/O, the current ext4 code does > >>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or > >>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been > >>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke > >>inode_dio_begin()/inode_dio_end() internally. This doubling of > >>inode_dio_begin()/inode_dio_end() calls are wasteful. > >> > >>This patch removes the extra internal inode_dio_begin()/inode_dio_end() > >>calls when those calls are being issued by the caller directly. For > >>really fast storage systems like NVDIMM, the removal of the extra > >>inode_dio_begin()/inode_dio_end() can give a meaningful boost to > >>I/O performance. > >Doesn't this break truncate IO serialisation? > > > >i.e. it appears to me that the ext4 use of inode_dio_begin()/ > >inode_dio_end() does not cover AIO, where the IO is still in flight > >when submission returns. i.e. the inode_dio_end() call > >needs to be in IO completion, not in the submitter context. The only > >reason it doesn't break right now is that the duplicate accounting > >in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO > >accounting will cause AIO writes to race with truncate. > > > >Same AIO vs truncate problem occurs with the indirect read case you > >modified to skip the direct IO layer accounting. > > I don't quite understand how the duplicate accounting is correct wrt > AIO. Both the direct and indirect paths are something like: > > inode_dio_begin() > ... > inode_dio_begin() > ... > inode_dio_end() > ... > inode_dio_end() With AIO: inode_dio_begin() ... inode_dio_begin() <submit IO, no wait> ... inode_dio_end() <ext4 returns to userspace with AIO+DIO in progress> <some time later DIO completes> dio_complete inode_dio_end() IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission does not wait for IO completion before returning. > What the patch does is to eliminate the innermost > inode_dio_begin/end pair. Yes, and with that change inode_dio_wait() no longer waits for AIO+DIO writes on ext4, hence breaking truncate IO barrier requirements of inode_dio_wait(). Cheers, Dave.
On 04/15/2016 04:17 AM, Dave Chinner wrote: > On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: >> On 04/13/2016 11:16 PM, Dave Chinner wrote: >>> On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote: >>>> When performing direct I/O, the current ext4 code does >>>> not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or >>>> __blockdev_direct_IO() when inode_dio_begin() has, in fact, been >>>> called. This causes dax_do_io()/__blockdev_direct_IO() to invoke >>>> inode_dio_begin()/inode_dio_end() internally. This doubling of >>>> inode_dio_begin()/inode_dio_end() calls are wasteful. >>>> >>>> This patch removes the extra internal inode_dio_begin()/inode_dio_end() >>>> calls when those calls are being issued by the caller directly. For >>>> really fast storage systems like NVDIMM, the removal of the extra >>>> inode_dio_begin()/inode_dio_end() can give a meaningful boost to >>>> I/O performance. >>> Doesn't this break truncate IO serialisation? >>> >>> i.e. it appears to me that the ext4 use of inode_dio_begin()/ >>> inode_dio_end() does not cover AIO, where the IO is still in flight >>> when submission returns. i.e. the inode_dio_end() call >>> needs to be in IO completion, not in the submitter context. The only >>> reason it doesn't break right now is that the duplicate accounting >>> in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO >>> accounting will cause AIO writes to race with truncate. >>> >>> Same AIO vs truncate problem occurs with the indirect read case you >>> modified to skip the direct IO layer accounting. >> I don't quite understand how the duplicate accounting is correct wrt >> AIO. Both the direct and indirect paths are something like: >> >> inode_dio_begin() >> ... >> inode_dio_begin() >> ... >> inode_dio_end() >> ... >> inode_dio_end() > With AIO: > > inode_dio_begin() > ... > inode_dio_begin() > <submit IO, no wait> > ... > inode_dio_end() > <ext4 returns to userspace with AIO+DIO in progress> > > <some time later DIO completes> > dio_complete > inode_dio_end() > > IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission > does not wait for IO completion before returning. > >> What the patch does is to eliminate the innermost >> inode_dio_begin/end pair. > Yes, and with that change inode_dio_wait() no longer waits for > AIO+DIO writes on ext4, hence breaking truncate IO barrier > requirements of inode_dio_wait(). > > Cheers, > > Dave. You are right and thank for pointing this out to me. I think I focus too much on the dax_do_io() internal and didn't realize that inode_dio_end() can be deferred in __blockdev_direct_IO(). I will update my patch to eliminate the extra inode_dio_begin/end pair only for dax_do_io(). Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote: > On 04/15/2016 04:17 AM, Dave Chinner wrote: > >On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: > >>On 04/13/2016 11:16 PM, Dave Chinner wrote: > >>>On Tue, Apr 12, 2016 at 02:12:54PM -0400, Waiman Long wrote: > >>>>When performing direct I/O, the current ext4 code does > >>>>not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or > >>>>__blockdev_direct_IO() when inode_dio_begin() has, in fact, been > >>>>called. This causes dax_do_io()/__blockdev_direct_IO() to invoke > >>>>inode_dio_begin()/inode_dio_end() internally. This doubling of > >>>>inode_dio_begin()/inode_dio_end() calls are wasteful. > >>>> > >>>>This patch removes the extra internal inode_dio_begin()/inode_dio_end() > >>>>calls when those calls are being issued by the caller directly. For > >>>>really fast storage systems like NVDIMM, the removal of the extra > >>>>inode_dio_begin()/inode_dio_end() can give a meaningful boost to > >>>>I/O performance. > >>>Doesn't this break truncate IO serialisation? > >>> > >>>i.e. it appears to me that the ext4 use of inode_dio_begin()/ > >>>inode_dio_end() does not cover AIO, where the IO is still in flight > >>>when submission returns. i.e. the inode_dio_end() call > >>>needs to be in IO completion, not in the submitter context. The only > >>>reason it doesn't break right now is that the duplicate accounting > >>>in the DIO code is correct w.r.t. AIO. Hence bypassing the DIO > >>>accounting will cause AIO writes to race with truncate. > >>> > >>>Same AIO vs truncate problem occurs with the indirect read case you > >>>modified to skip the direct IO layer accounting. > >>I don't quite understand how the duplicate accounting is correct wrt > >>AIO. Both the direct and indirect paths are something like: > >> > >> inode_dio_begin() > >> ... > >> inode_dio_begin() > >> ... > >> inode_dio_end() > >> ... > >> inode_dio_end() > >With AIO: > > > > inode_dio_begin() > > ... > > inode_dio_begin() > > <submit IO, no wait> > > ... > > inode_dio_end() > ><ext4 returns to userspace with AIO+DIO in progress> > > > ><some time later DIO completes> > > dio_complete > > inode_dio_end() > > > >IOWs, the ext4 accounting is broken w.r.t. AIO, where IO submission > >does not wait for IO completion before returning. > > > >>What the patch does is to eliminate the innermost > >>inode_dio_begin/end pair. > >Yes, and with that change inode_dio_wait() no longer waits for > >AIO+DIO writes on ext4, hence breaking truncate IO barrier > >requirements of inode_dio_wait(). > > > >Cheers, > > > >Dave. > > You are right and thank for pointing this out to me. I think I focus too > much on the dax_do_io() internal and didn't realize that inode_dio_end() can > be deferred in __blockdev_direct_IO(). I will update my patch to eliminate > the extra inode_dio_begin/end pair only for dax_do_io(). Even there there is the risk that a future change will break ext4. the ext4 code needs fixing first, then you can look at skipping the DIO based counting everywhere. i.e. fix the root cause of the problem, don't hack around it or throw band-aids over it. Cheers, Dave.
On 04/15/2016 06:19 PM, Dave Chinner wrote: > On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote: >> On 04/15/2016 04:17 AM, Dave Chinner wrote: >>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: >>>> What the patch does is to eliminate the innermost >>>> inode_dio_begin/end pair. >>> Yes, and with that change inode_dio_wait() no longer waits for >>> AIO+DIO writes on ext4, hence breaking truncate IO barrier >>> requirements of inode_dio_wait(). >>> >>> Cheers, >>> >>> Dave. >> You are right and thank for pointing this out to me. I think I focus too >> much on the dax_do_io() internal and didn't realize that inode_dio_end() can >> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate >> the extra inode_dio_begin/end pair only for dax_do_io(). > Even there there is the risk that a future change will break ext4. > the ext4 code needs fixing first, then you can look at skipping the > DIO based counting everywhere. > > i.e. fix the root cause of the problem, don't hack around it or > throw band-aids over it. > > Cheers, > > Dave. I agree that the ext4 code needs fixing w.r.t. the problem that you found. That will take more time and testing. In the mean time, I think it is OK to pick the low-hanging fruits that are handled by my patch. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote: > On 04/15/2016 06:19 PM, Dave Chinner wrote: > >On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote: > >>On 04/15/2016 04:17 AM, Dave Chinner wrote: > >>>On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: > >>>>What the patch does is to eliminate the innermost > >>>>inode_dio_begin/end pair. > >>>Yes, and with that change inode_dio_wait() no longer waits for > >>>AIO+DIO writes on ext4, hence breaking truncate IO barrier > >>>requirements of inode_dio_wait(). > >>> > >>>Cheers, > >>> > >>>Dave. > >>You are right and thank for pointing this out to me. I think I focus too > >>much on the dax_do_io() internal and didn't realize that inode_dio_end() can > >>be deferred in __blockdev_direct_IO(). I will update my patch to eliminate > >>the extra inode_dio_begin/end pair only for dax_do_io(). > >Even there there is the risk that a future change will break ext4. > >the ext4 code needs fixing first, then you can look at skipping the > >DIO based counting everywhere. > > > >i.e. fix the root cause of the problem, don't hack around it or > >throw band-aids over it. > > I agree that the ext4 code needs fixing w.r.t. the problem that you > found. That will take more time and testing. In the mean time, I > think it is OK to pick the low-hanging fruits that are handled by my > patch. IOWs, you're saying that you won't fix the problem, because all you care about is scalability results. This is how we end up with code that breaks randomly in future because if it doesn't get fixed now, nobody will fix the underlying problem. So, fix it now, fix it properly and you still get your scalability improvement without leaving a landmine that will explode on someone else in future. Fix it now, fix it properly. Cheers, Dave.
On 04/19/2016 07:01 PM, Dave Chinner wrote: > On Mon, Apr 18, 2016 at 03:46:46PM -0400, Waiman Long wrote: >> On 04/15/2016 06:19 PM, Dave Chinner wrote: >>> On Fri, Apr 15, 2016 at 01:17:41PM -0400, Waiman Long wrote: >>>> On 04/15/2016 04:17 AM, Dave Chinner wrote: >>>>> On Thu, Apr 14, 2016 at 12:21:13PM -0400, Waiman Long wrote: >>>>>> What the patch does is to eliminate the innermost >>>>>> inode_dio_begin/end pair. >>>>> Yes, and with that change inode_dio_wait() no longer waits for >>>>> AIO+DIO writes on ext4, hence breaking truncate IO barrier >>>>> requirements of inode_dio_wait(). >>>>> >>>>> Cheers, >>>>> >>>>> Dave. >>>> You are right and thank for pointing this out to me. I think I focus too >>>> much on the dax_do_io() internal and didn't realize that inode_dio_end() can >>>> be deferred in __blockdev_direct_IO(). I will update my patch to eliminate >>>> the extra inode_dio_begin/end pair only for dax_do_io(). >>> Even there there is the risk that a future change will break ext4. >>> the ext4 code needs fixing first, then you can look at skipping the >>> DIO based counting everywhere. >>> >>> i.e. fix the root cause of the problem, don't hack around it or >>> throw band-aids over it. >> I agree that the ext4 code needs fixing w.r.t. the problem that you >> found. That will take more time and testing. In the mean time, I >> think it is OK to pick the low-hanging fruits that are handled by my >> patch. > IOWs, you're saying that you won't fix the problem, because all you > care about is scalability results. This is how we end up with code > that breaks randomly in future because if it doesn't get fixed now, > nobody will fix the underlying problem. So, fix it now, fix it > properly and you still get your scalability improvement without > leaving a landmine that will explode on someone else in future. > > Fix it now, fix it properly. I am not saying that I will not fix it. I am just saying that I need more time to fully understand what code changes need to be done. I am not that well versed in the filesystem internal, though it will be a good learning experience for me. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
FYI, none of the Dax code even needs to ever touch the dio_count, as dax I/O can't be asynchronous, and we thus don't need it to protect against truncate. I'd suggest to remove it and then end_io callback from the DAX code entirely as a start and then move from there. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/20/2016 04:58 PM, Christoph Hellwig wrote: > FYI, none of the Dax code even needs to ever touch the dio_count, > as dax I/O can't be asynchronous, and we thus don't need it to protect > against truncate. I'd suggest to remove it and then end_io callback > from the DAX code entirely as a start and then move from there. Yes, it seems like we may not need to change the dio_count in dax_do_io() after all. BTW, what do mean by using end_io callback as a start? Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote: > On 04/20/2016 04:58 PM, Christoph Hellwig wrote: > >FYI, none of the Dax code even needs to ever touch the dio_count, > >as dax I/O can't be asynchronous, and we thus don't need it to protect > >against truncate. I'd suggest to remove it and then end_io callback > >from the DAX code entirely as a start and then move from there. > > Yes, it seems like we may not need to change the dio_count in dax_do_io() > after all. BTW, what do mean by using end_io callback as a start? I mean to remove both the i_dio_count manipulation, and the unessecary end_io callback from dax_do_io. -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On 04/25/2016 07:48 AM, Christoph Hellwig wrote: > On Thu, Apr 21, 2016 at 02:15:24PM -0400, Waiman Long wrote: >> On 04/20/2016 04:58 PM, Christoph Hellwig wrote: >>> FYI, none of the Dax code even needs to ever touch the dio_count, >>> as dax I/O can't be asynchronous, and we thus don't need it to protect >>> against truncate. I'd suggest to remove it and then end_io callback >> >from the DAX code entirely as a start and then move from there. >> >> Yes, it seems like we may not need to change the dio_count in dax_do_io() >> after all. BTW, what do mean by using end_io callback as a start? > I mean to remove both the i_dio_count manipulation, and the unessecary > end_io callback from dax_do_io. Thanks for the clarification. Since DAX I/O is always synchronous, the locking done by the caller or in the dax_do_Io() for read should be enough to prevent truncation from happening at the same time. So we don't need to use i_dio_count for that purpose. However, I have not understood enough of the block IO layer to determine if the end_io callback is really redundant. I am not confident enough to touch the end_io callback. Cheers, Longman -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c index 3027fa6..4304be6 100644 --- a/fs/ext4/indirect.c +++ b/fs/ext4/indirect.c @@ -706,14 +706,20 @@ retry: inode_dio_end(inode); goto locked; } + /* + * Need to pass in DIO_SKIP_DIO_COUNT to prevent + * duplicated inode_dio_begin/inode_dio_end sequence. + */ if (IS_DAX(inode)) ret = dax_do_io(iocb, inode, iter, offset, - ext4_dio_get_block, NULL, 0); + ext4_dio_get_block, NULL, + DIO_SKIP_DIO_COUNT); else ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter, offset, ext4_dio_get_block, - NULL, NULL, 0); + NULL, NULL, + DIO_SKIP_DIO_COUNT); inode_dio_end(inode); } else { locked: diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index dab84a2..779aa33 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -3358,9 +3358,15 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter, * Make all waiters for direct IO properly wait also for extent * conversion. This also disallows race between truncate() and * overwrite DIO as i_dio_count needs to be incremented under i_mutex. + * + * Both dax_do_io() and __blockdev_direct_IO() will unnecessarily + * call inode_dio_begin()/inode_dio_end() again if the + * DIO_SKIP_DIO_COUNT flag is not set. */ - if (iov_iter_rw(iter) == WRITE) + if (iov_iter_rw(iter) == WRITE) { + dio_flags = DIO_SKIP_DIO_COUNT; inode_dio_begin(inode); + } /* If we do a overwrite dio, i_mutex locking can be released */ overwrite = *((int *)iocb->private); @@ -3393,10 +3399,10 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter, get_block_func = ext4_dio_get_block_overwrite; else if (is_sync_kiocb(iocb)) { get_block_func = ext4_dio_get_block_unwritten_sync; - dio_flags = DIO_LOCKING; + dio_flags |= DIO_LOCKING; } else { get_block_func = ext4_dio_get_block_unwritten_async; - dio_flags = DIO_LOCKING; + dio_flags |= DIO_LOCKING; } #ifdef CONFIG_EXT4_FS_ENCRYPTION BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
When performing direct I/O, the current ext4 code does not pass in the DIO_SKIP_DIO_COUNT flag to dax_do_io() or __blockdev_direct_IO() when inode_dio_begin() has, in fact, been called. This causes dax_do_io()/__blockdev_direct_IO() to invoke inode_dio_begin()/inode_dio_end() internally. This doubling of inode_dio_begin()/inode_dio_end() calls are wasteful. This patch removes the extra internal inode_dio_begin()/inode_dio_end() calls when those calls are being issued by the caller directly. For really fast storage systems like NVDIMM, the removal of the extra inode_dio_begin()/inode_dio_end() can give a meaningful boost to I/O performance. On a 4-socket Haswell-EX system (72 cores) running 4.6-rc1 kernel, fio with 38 threads doing parallel I/O on two shared files on an NVDIMM with DAX gave the following aggregrate bandwidth with and without the patch: Test W/O patch With patch % change ---- --------- ---------- -------- Read-only 8688MB/s 10173MB/s +17.1% Read-write 2687MB/s 2830MB/s +5.3% Signed-off-by: Waiman Long <Waiman.Long@hpe.com> --- fs/ext4/indirect.c | 10 ++++++++-- fs/ext4/inode.c | 12 +++++++++--- 2 files changed, 17 insertions(+), 5 deletions(-)