Message ID: 1461947276-25988-2-git-send-email-Waiman.Long@hpe.com
State: Superseded, archived
On Fri 29-04-16 12:27:55, Waiman Long wrote:
> The purpose of the i_dio_count is to protect against truncation while
> the I/O operation is in progress. As dax_do_io() only does synchronous
> I/O, the locking performed by the caller or within dax_do_io() for
> read should be enough to protect it against truncation. There is no
> need to touch the i_dio_count.
>
> Eliminating two atomic operations can sometimes give a noticeable
> improvement in I/O performance as NVDIMM is much faster than other
> disk devices.
>
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>

We cannot easily do this currently - the reason is that in several places
we wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
logic with this patch.

If we indeed put all writes under i_mutex, this problem would go away but
as Dave explains in his email, we consciously do as much IO as we can
without i_mutex to allow reasonable scalability of multiple writers into
the same file. The downside of that is that overwrites and writes vs reads
are not atomic wrt each other as POSIX requires. It has been that way for
direct IO in the XFS case for a long time; with DAX this non-conforming
behavior is proliferating more. I agree that's not ideal but serializing
all writes on a file is rather harsh for persistent memory as well...

								Honza
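The i_dio_count scheme Jan describes can be modeled in plain C. The sketch below is a simplified userspace analogue (hypothetical names, pthread-based; it is not the kernel implementation), showing why a truncate that calls inode_dio_wait() under i_mutex depends on every I/O path pairing inode_dio_begin() with inode_dio_end() -- exactly the pairing the patch removes:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Simplified userspace model of the kernel's i_dio_count scheme: each
 * direct I/O bumps a per-inode counter, and truncate waits (with i_mutex
 * held, so no new I/O can start) until the counter drops back to zero. */
struct toy_inode {
	pthread_mutex_t i_mutex;      /* held by truncate while it waits  */
	pthread_mutex_t wait_lock;    /* protects the condvar handshake   */
	pthread_cond_t  dio_done;
	atomic_int      i_dio_count;  /* in-flight direct I/O operations  */
};

static void inode_dio_begin(struct toy_inode *inode)
{
	atomic_fetch_add(&inode->i_dio_count, 1);
}

static void inode_dio_end(struct toy_inode *inode)
{
	/* Last finisher wakes any waiter under wait_lock to avoid a
	 * lost wakeup against the check in inode_dio_wait(). */
	if (atomic_fetch_sub(&inode->i_dio_count, 1) == 1) {
		pthread_mutex_lock(&inode->wait_lock);
		pthread_cond_broadcast(&inode->dio_done);
		pthread_mutex_unlock(&inode->wait_lock);
	}
}

/* Called with i_mutex held: drain all outstanding direct I/O. */
static void inode_dio_wait(struct toy_inode *inode)
{
	pthread_mutex_lock(&inode->wait_lock);
	while (atomic_load(&inode->i_dio_count) > 0)
		pthread_cond_wait(&inode->dio_done, &inode->wait_lock);
	pthread_mutex_unlock(&inode->wait_lock);
}
```

If an I/O path stops calling inode_dio_begin()/inode_dio_end(), the model shows the failure directly: inode_dio_wait() returns while that I/O is still in flight, which is the truncation race Jan is pointing at.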
On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> We cannot easily do this currently - the reason is that in several places
> we wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break
> this logic with this patch.
>
> If we indeed put all writes under i_mutex, this problem would go away but
> as Dave explains in his email, we consciously do as much IO as we can
> without i_mutex to allow reasonable scalability of multiple writers into
> the same file.

So the above should be fine for xfs, but you're telling me that ext4
is doing DAX I/O without any inode lock at all? In that case it's
indeed not going to work.

> The downside of that is that overwrites and writes vs reads are not
> atomic wrt each other as POSIX requires. It has been that way for direct
> IO in the XFS case for a long time; with DAX this non-conforming behavior
> is proliferating more. I agree that's not ideal but serializing all
> writes on a file is rather harsh for persistent memory as well...

For non-O_DIRECT I/O it's simply required.
On Thu 05-05-16 07:27:48, Christoph Hellwig wrote:
> On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> > We cannot easily do this currently - the reason is that in several
> > places we wait for i_dio_count to drop to 0 (look for
> > inode_dio_wait()) while holding i_mutex to wait for all outstanding
> > DIO / DAX IO. You'd break this logic with this patch.
> >
> > If we indeed put all writes under i_mutex, this problem would go away
> > but as Dave explains in his email, we consciously do as much IO as we
> > can without i_mutex to allow reasonable scalability of multiple
> > writers into the same file.
>
> So the above should be fine for xfs, but you're telling me that ext4
> is doing DAX I/O without any inode lock at all? In that case it's
> indeed not going to work.

By default ext4 uses i_mutex to serialize both direct (and thus dax) reads
and writes. However, with the dioread_nolock mount option we use only
i_data_sem (an ext4-local rwsem) for direct reads and overwrites. That is
enough to guarantee ext4 metadata consistency and gives you better
scalability, but you lose write vs read and write vs write atomicity
(essentially you get the same behavior as for XFS direct IO).

> > The downside of that is that overwrites and writes vs reads are not
> > atomic wrt each other as POSIX requires. It has been that way for
> > direct IO in the XFS case for a long time; with DAX this
> > non-conforming behavior is proliferating more. I agree that's not
> > ideal but serializing all writes on a file is rather harsh for
> > persistent memory as well...
>
> For non-O_DIRECT I/O it's simply required.

Well, we already break write vs read atomicity for buffered IO for all
filesystems except XFS, which has its special locking. So that's not a new
thing. I agree that also breaking write vs write atomicity for 'normal' IO
is a new thing, and in a way more serious, as the corrupted result ends up
being stored on disk and some applications may be broken by that. So we
should fix that.

I was hoping that Davidlohr would come up with a more scalable
range-locking implementation than my original RB-tree based one and we
could use that, but that seems to be taking longer than I originally
expected...

								Honza
diff --git a/fs/dax.c b/fs/dax.c
index 90322eb..1b4b500 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -253,8 +253,12 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
  * caller for writes. For reads, we take and release the i_mutex ourselves.
  * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
- * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
- * is in progress.
+ *
+ * The do_blockdev_direct_IO() function increments i_dio_count while the I/O
+ * is in progress. However, dax_do_io() always does synchronous I/O. The
+ * locking done by the caller or within dax_do_io() for reads (DIO_LOCKING)
+ * should be enough to protect against concurrent truncation. We don't really
+ * need to touch i_dio_count here.
  */
 ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 		struct iov_iter *iter, loff_t pos, get_block_t get_block,
@@ -277,10 +281,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 		}
 	}
 
-	/* Protects against truncate */
-	if (!(flags & DIO_SKIP_DIO_COUNT))
-		inode_dio_begin(inode);
-
 	retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
 	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
@@ -294,8 +294,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 		retval = err;
 	}
 
-	if (!(flags & DIO_SKIP_DIO_COUNT))
-		inode_dio_end(inode);
 out:
 	return retval;
 }
The purpose of the i_dio_count is to protect against truncation while the
I/O operation is in progress. As dax_do_io() only does synchronous I/O,
the locking performed by the caller or within dax_do_io() for read should
be enough to protect it against truncation. There is no need to touch the
i_dio_count.

Eliminating two atomic operations can sometimes give a noticeable
improvement in I/O performance as NVDIMM is much faster than other disk
devices.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/dax.c | 14 ++++++--------
 1 files changed, 6 insertions(+), 8 deletions(-)