
[v5,1/2] dax: Don't touch i_dio_count in dax_do_io()

Message ID 1461947276-25988-2-git-send-email-Waiman.Long@hpe.com
State Superseded, archived

Commit Message

Waiman Long April 29, 2016, 4:27 p.m. UTC
The purpose of i_dio_count is to protect against truncation while an
I/O operation is in progress. As dax_do_io() only performs synchronous
I/O, the locking done by the caller, or within dax_do_io() itself for
reads, should be enough to protect against truncation. There is no
need to touch i_dio_count.

Eliminating two atomic operations can sometimes give a noticeable
improvement in I/O performance as NVDIMM is much faster than other
disk devices.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
 fs/dax.c |   14 ++++++--------
 1 files changed, 6 insertions(+), 8 deletions(-)

Comments

Jan Kara May 5, 2016, 2:16 p.m. UTC | #1
On Fri 29-04-16 12:27:55, Waiman Long wrote:
> The purpose of the i_dio_count is to protect against truncation while
> the I/O operation is in progress. As dax_do_io() only does synchronous
> I/O, the locking performed by the caller or within dax_do_io() for
> read should be enough to protect it against truncation. There is no
> need to touch the i_dio_count.
> 
> Eliminating two atomic operations can sometimes give a noticeable
> improvement in I/O performance as NVDIMM is much faster than other
> disk devices.
> 
> Suggested-by: Christoph Hellwig <hch@infradead.org>
> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>

We cannot easily do this currently - the reason is that in several
places we wait, while holding i_mutex, for i_dio_count to drop to 0
(look for inode_dio_wait()) so that all outstanding DIO / DAX IO has
completed. This patch would break that logic.

If we indeed put all writes under i_mutex, this problem would go away but
as Dave explains in his email, we consciously do as much IO as we can
without i_mutex to allow reasonable scalability of multiple writers into
the same file.

The downside of that is that overwrites, and writes vs reads, are not
atomic wrt each other as POSIX requires. It has been that way for
direct IO in the XFS case for a long time; with DAX this
non-conforming behavior is proliferating further. I agree that's not
ideal, but serializing all writes on a file is rather harsh for
persistent memory as well...

								Honza
Christoph Hellwig May 5, 2016, 2:27 p.m. UTC | #2
On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> We cannot easily do this currently - the reason is that in several places we
> wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
> logic with this patch.
> 
> If we indeed put all writes under i_mutex, this problem would go away but
> as Dave explains in his email, we consciously do as much IO as we can
> without i_mutex to allow reasonable scalability of multiple writers into
> the same file.

So the above should be fine for xfs, but you're telling me that ext4
is doing DAX I/O without any inode lock at all?  In that case it's
indeed not going to work.

> The downside of that is that overwrites and writes vs reads are not atomic
> wrt each other as POSIX requires. It has been that way for direct IO in XFS
> case for a long time, with DAX this non-conforming behavior is proliferating
> more. I agree that's not ideal but serializing all writes on a file is
> rather harsh for persistent memory as well...

For non-O_DIRECT I/O it's simply required...
Jan Kara May 5, 2016, 3:48 p.m. UTC | #3
On Thu 05-05-16 07:27:48, Christoph Hellwig wrote:
> On Thu, May 05, 2016 at 04:16:37PM +0200, Jan Kara wrote:
> > We cannot easily do this currently - the reason is that in several places we
> > wait for i_dio_count to drop to 0 (look for inode_dio_wait()) while
> > holding i_mutex to wait for all outstanding DIO / DAX IO. You'd break this
> > logic with this patch.
> > 
> > If we indeed put all writes under i_mutex, this problem would go away but
> > as Dave explains in his email, we consciously do as much IO as we can
> > without i_mutex to allow reasonable scalability of multiple writers into
> > the same file.
> 
> So the above should be fine for xfs, but you're telling me that ext4
> is doing DAX I/O without any inode lock at all?  In that case it's
> indeed not going to work.

By default, ext4 uses i_mutex to serialize both direct (and thus DAX)
reads and writes. However, with the dioread_nolock mount option, we use
only i_data_sem (an ext4-local rwsem) for direct reads and overwrites.
That is enough to guarantee ext4 metadata consistency and gives you
better scalability, but you lose write vs read and write vs write
atomicity (essentially you get the same behavior as for XFS direct IO).

> > The downside of that is that overwrites and writes vs reads are not atomic
> > wrt each other as POSIX requires. It has been that way for direct IO in XFS
> > case for a long time, with DAX this non-conforming behavior is proliferating
> > more. I agree that's not ideal but serializing all writes on a file is
> > rather harsh for persistent memory as well...
> 
> For non-O_DIRECT I/O it's simply required..

Well, we already break write vs read atomicity for buffered IO on all
filesystems except XFS, which has its special locking, so that part is
not new. I agree that also breaking write vs write atomicity for
'normal' IO is new, and in a way more serious, as the corrupted result
ends up being stored on disk and some applications may be broken by
that. So we should fix that.

I was hoping that Davidlohr would come up with a more scalable
range-locking implementation than my original RB-tree based one and we
could use that but that seems to be taking longer than I originally
expected...

								Honza

Patch

diff --git a/fs/dax.c b/fs/dax.c
index 90322eb..1b4b500 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -253,8 +253,12 @@  static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
  * If @flags has DIO_LOCKING set, we assume that the i_mutex is held by the
  * caller for writes.  For reads, we take and release the i_mutex ourselves.
  * If DIO_LOCKING is not set, the filesystem takes care of its own locking.
- * As with do_blockdev_direct_IO(), we increment i_dio_count while the I/O
- * is in progress.
+ *
+ * The do_blockdev_direct_IO() function increments i_dio_count while the I/O
+ * is in progress. However, dax_do_io() always does synchronous I/O. The
+ * locking done by the caller, or within dax_do_io() for reads (DIO_LOCKING),
+ * should be enough to protect against concurrent truncation, so we do
+ * not need to touch i_dio_count here.
  */
 ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 		  struct iov_iter *iter, loff_t pos, get_block_t get_block,
@@ -277,10 +281,6 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 		}
 	}
 
-	/* Protects against truncate */
-	if (!(flags & DIO_SKIP_DIO_COUNT))
-		inode_dio_begin(inode);
-
 	retval = dax_io(inode, iter, pos, end, get_block, &bh);
 
 	if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
@@ -294,8 +294,6 @@  ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
 			retval = err;
 	}
 
-	if (!(flags & DIO_SKIP_DIO_COUNT))
-		inode_dio_end(inode);
  out:
 	return retval;
 }