diff mbox series

[RFC,1/2] iomap: direct-io: Move inode_dio_begin before filemap_write_and_wait_range

Message ID 27607a16327fe9664f32d09abe565af0d1ae56c9.1578907891.git.riteshh@linux.ibm.com
State Rejected
Headers show
Series ext4: Fix stale data read exposure problem with DIO read/page_mkwrite | expand

Commit Message

Ritesh Harjani Jan. 13, 2020, 11:04 a.m. UTC
Some filesystems (e.g. ext4) need to know in it's writeback path, that
whether DIO is in progress or not. This info may be needed to avoid the
stale data exposure race with DIO reads.

Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
---
 fs/iomap/direct-io.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

Comments

Darrick Wong Jan. 13, 2020, 9:51 p.m. UTC | #1
On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote:
> Some filesystems (e.g. ext4) need to know in it's writeback path, that
> whether DIO is in progress or not. This info may be needed to avoid the
> stale data exposure race with DIO reads.

Does XFS have this problem too?

Admittedly dio read during mmap write is probably not well supported. ;)

> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/iomap/direct-io.c | 17 +++++++++++++----

Might want to cc fsdevel and the iomap maintainers...

--D

>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..d1c159bd3854 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> +	/*
> +	 * Call inode_dio_begin() before we write out and wait for writeback to
> +	 * complete. This may be needed by some filesystems to prevent race
> +	 * like stale data exposure by DIO reads.
> +	 */
> +	inode_dio_begin(inode);
> +	/* So that i_dio_count is incremented before below operation */
> +	smp_mb__after_atomic();
> +
>  	ret = filemap_write_and_wait_range(mapping, pos, end);
>  	if (ret)
> -		goto out_free_dio;
> +		goto out_end_dio;
>  
>  	/*
>  	 * Try to invalidate cache pages for the range we're direct
> @@ -488,11 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  	    !inode->i_sb->s_dio_done_wq) {
>  		ret = sb_init_dio_done_wq(inode->i_sb);
>  		if (ret < 0)
> -			goto out_free_dio;
> +			goto out_end_dio;
>  	}
>  
> -	inode_dio_begin(inode);
> -
>  	blk_start_plug(&plug);
>  	do {
>  		ret = iomap_apply(inode, pos, count, flags, ops, dio,
> @@ -568,6 +575,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  
>  	return iomap_dio_complete(dio);
>  
> +out_end_dio:
> +	inode_dio_end(inode);
>  out_free_dio:
>  	kfree(dio);
>  	return ret;
> -- 
> 2.21.0
>
Jan Kara Jan. 14, 2020, 9:05 a.m. UTC | #2
On Mon 13-01-20 13:51:59, Darrick J. Wong wrote:
> On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote:
> > Some filesystems (e.g. ext4) need to know in it's writeback path, that
> > whether DIO is in progress or not. This info may be needed to avoid the
> > stale data exposure race with DIO reads.
> 
> Does XFS have this problem too?
> 
> Admittedly dio read during mmap write is probably not well supported. ;)

Well, XFS always performs buffered writeback using unwritten extents so at
least the immediate problem of stale data exposure ext4 has does not happen
there AFAICT. 

								Honza
Jan Kara Jan. 14, 2020, 9:12 a.m. UTC | #3
On Mon 13-01-20 16:34:21, Ritesh Harjani wrote:
> Some filesystems (e.g. ext4) need to know in it's writeback path, that
> whether DIO is in progress or not. This info may be needed to avoid the
> stale data exposure race with DIO reads.
> 
> Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com>
> ---
>  fs/iomap/direct-io.c | 17 +++++++++++++----
>  1 file changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 23837926c0c5..d1c159bd3854 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  		flags |= IOMAP_NOWAIT;
>  	}
>  
> +	/*
> +	 * Call inode_dio_begin() before we write out and wait for writeback to
> +	 * complete. This may be needed by some filesystems to prevent race
> +	 * like stale data exposure by DIO reads.
> +	 */
> +	inode_dio_begin(inode);
> +	/* So that i_dio_count is incremented before below operation */
> +	smp_mb__after_atomic();

I wonder if the barrier shouldn't go into inode_dio_begin() - as a sepatare
patch. Because people just treat this as a lock-kind-of-thingy. E.g. btrfs
or ceph use inode_dio_begin() in places which I'd consider prone to CPU
reordering issues without this barrier...

Otherwise the patch looks good to me.

								Honza
Christoph Hellwig Jan. 14, 2020, 4:37 p.m. UTC | #4
Please add at very least the fsdevel and xfs lists to iomap patches.

Using i_dio_count for any kind of detection is bogus.  If you want to
pass flags to the writeback code please do so explicitly through
struct writeback_control.
Christoph Hellwig Jan. 14, 2020, 4:38 p.m. UTC | #5
On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> 
> Well, XFS always performs buffered writeback using unwritten extents so at
> least the immediate problem of stale data exposure ext4 has does not happen
> there AFAICT. 

Currently XFS never uses unwritten extents when converting delalloc
extents.
Jan Kara Jan. 14, 2020, 5:19 p.m. UTC | #6
On Tue 14-01-20 08:37:02, Christoph Hellwig wrote:
> Using i_dio_count for any kind of detection is bogus.  If you want to
> pass flags to the writeback code please do so explicitly through
> struct writeback_control.

We want to detect in the writeback path whether there's direct IO (read)
currently running for the inode. Not for the writeback issued from
iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
racing with - so struct writeback_control won't help. Now if you want to
see the ugly details why this hack is needed, see my other email to Ritesh
in this thread with details of the race.

								Honza
Christoph Hellwig Jan. 14, 2020, 6:27 p.m. UTC | #7
On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote:
> We want to detect in the writeback path whether there's direct IO (read)
> currently running for the inode. Not for the writeback issued from
> iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
> racing with - so struct writeback_control won't help. Now if you want to
> see the ugly details why this hack is needed, see my other email to Ritesh
> in this thread with details of the race.

How do we get other writeback after iomap_dio_rw wrote everything out?
Either way I'm trying to kill i_dio_count as it has all kinds of
problems, see the patch sent out earlier today.
Jan Kara Jan. 15, 2020, 9:08 a.m. UTC | #8
On Tue 14-01-20 10:27:36, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote:
> > We want to detect in the writeback path whether there's direct IO (read)
> > currently running for the inode. Not for the writeback issued from
> > iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be
> > racing with - so struct writeback_control won't help. Now if you want to
> > see the ugly details why this hack is needed, see my other email to Ritesh
> > in this thread with details of the race.
> 
> How do we get other writeback after iomap_dio_rw wrote everything out?

You create dirty page using mmap in the range read by iomap_dio_rw() and then
background writeback happens at unfortunate time... Email [1] has the exact
traces.

> Either way I'm trying to kill i_dio_count as it has all kinds of
> problems, see the patch sent out earlier today.

OK, I'll see that patch.

								Honza

[1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz
Jan Kara Jan. 15, 2020, 9:19 a.m. UTC | #9
On Tue 14-01-20 08:38:18, Christoph Hellwig wrote:
> On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> > 
> > Well, XFS always performs buffered writeback using unwritten extents so at
> > least the immediate problem of stale data exposure ext4 has does not happen
> > there AFAICT. 
> 
> Currently XFS never uses unwritten extents when converting delalloc
> extents.

I see, it is a long time since I last looked at that part of XFS code. So
then I think XFS might be prone to the same kind of race and data exposure
as I outlined in [1]...

								Honza

[1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz
Christoph Hellwig Jan. 15, 2020, 2:56 p.m. UTC | #10
On Wed, Jan 15, 2020 at 10:19:25AM +0100, Jan Kara wrote:
> On Tue 14-01-20 08:38:18, Christoph Hellwig wrote:
> > On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote:
> > > 
> > > Well, XFS always performs buffered writeback using unwritten extents so at
> > > least the immediate problem of stale data exposure ext4 has does not happen
> > > there AFAICT. 
> > 
> > Currently XFS never uses unwritten extents when converting delalloc
> > extents.
> 
> I see, it is a long time since I last looked at that part of XFS code. So
> then I think XFS might be prone to the same kind of race and data exposure
> as I outlined in [1]...

I think not using unwritten extents for filling holes inside i_size will
always lead to the potential for stale data exposure in one form or
another.  Because of that Darrick has started looking into always using
unwritten extents for buffered writes inside i_size.
diff mbox series

Patch

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 23837926c0c5..d1c159bd3854 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -468,9 +468,18 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 		flags |= IOMAP_NOWAIT;
 	}
 
+	/*
+	 * Call inode_dio_begin() before we write out and wait for writeback to
+	 * complete. This may be needed by some filesystems to prevent race
+	 * like stale data exposure by DIO reads.
+	 */
+	inode_dio_begin(inode);
+	/* So that i_dio_count is incremented before below operation */
+	smp_mb__after_atomic();
+
 	ret = filemap_write_and_wait_range(mapping, pos, end);
 	if (ret)
-		goto out_free_dio;
+		goto out_end_dio;
 
 	/*
 	 * Try to invalidate cache pages for the range we're direct
@@ -488,11 +497,9 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 	    !inode->i_sb->s_dio_done_wq) {
 		ret = sb_init_dio_done_wq(inode->i_sb);
 		if (ret < 0)
-			goto out_free_dio;
+			goto out_end_dio;
 	}
 
-	inode_dio_begin(inode);
-
 	blk_start_plug(&plug);
 	do {
 		ret = iomap_apply(inode, pos, count, flags, ops, dio,
@@ -568,6 +575,8 @@  iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 
 	return iomap_dio_complete(dio);
 
+out_end_dio:
+	inode_dio_end(inode);
 out_free_dio:
 	kfree(dio);
 	return ret;