Message ID | 27607a16327fe9664f32d09abe565af0d1ae56c9.1578907891.git.riteshh@linux.ibm.com |
---|---|
State | Rejected |
Headers | show |
Series | ext4: Fix stale data read exposure problem with DIO read/page_mkwrite | expand |
On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote: > Some filesystems (e.g. ext4) need to know in it's writeback path, that > whether DIO is in progress or not. This info may be needed to avoid the > stale data exposure race with DIO reads. Does XFS have this problem too? Admittedly dio read during mmap write is probably not well supported. ;) > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> > --- > fs/iomap/direct-io.c | 17 +++++++++++++---- Might want to cc fsdevel and the iomap maintainers... --D > 1 file changed, 13 insertions(+), 4 deletions(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 23837926c0c5..d1c159bd3854 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > flags |= IOMAP_NOWAIT; > } > > + /* > + * Call inode_dio_begin() before we write out and wait for writeback to > + * complete. This may be needed by some filesystems to prevent race > + * like stale data exposure by DIO reads. > + */ > + inode_dio_begin(inode); > + /* So that i_dio_count is incremented before below operation */ > + smp_mb__after_atomic(); > + > ret = filemap_write_and_wait_range(mapping, pos, end); > if (ret) > - goto out_free_dio; > + goto out_end_dio; > > /* > * Try to invalidate cache pages for the range we're direct > @@ -488,11 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > !inode->i_sb->s_dio_done_wq) { > ret = sb_init_dio_done_wq(inode->i_sb); > if (ret < 0) > - goto out_free_dio; > + goto out_end_dio; > } > > - inode_dio_begin(inode); > - > blk_start_plug(&plug); > do { > ret = iomap_apply(inode, pos, count, flags, ops, dio, > @@ -568,6 +575,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > > return iomap_dio_complete(dio); > > +out_end_dio: > + inode_dio_end(inode); > out_free_dio: > kfree(dio); > return ret; > -- > 2.21.0 >
On Mon 13-01-20 13:51:59, Darrick J. Wong wrote: > On Mon, Jan 13, 2020 at 04:34:21PM +0530, Ritesh Harjani wrote: > > Some filesystems (e.g. ext4) need to know in it's writeback path, that > > whether DIO is in progress or not. This info may be needed to avoid the > > stale data exposure race with DIO reads. > > Does XFS have this problem too? > > Admittedly dio read during mmap write is probably not well supported. ;) Well, XFS always performs buffered writeback using unwritten extents so at least the immediate problem of stale data exposure ext4 has does not happen there AFAICT. Honza
On Mon 13-01-20 16:34:21, Ritesh Harjani wrote: > Some filesystems (e.g. ext4) need to know in it's writeback path, that > whether DIO is in progress or not. This info may be needed to avoid the > stale data exposure race with DIO reads. > > Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> > --- > fs/iomap/direct-io.c | 17 +++++++++++++---- > 1 file changed, 13 insertions(+), 4 deletions(-) > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c > index 23837926c0c5..d1c159bd3854 100644 > --- a/fs/iomap/direct-io.c > +++ b/fs/iomap/direct-io.c > @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, > flags |= IOMAP_NOWAIT; > } > > + /* > + * Call inode_dio_begin() before we write out and wait for writeback to > + * complete. This may be needed by some filesystems to prevent race > + * like stale data exposure by DIO reads. > + */ > + inode_dio_begin(inode); > + /* So that i_dio_count is incremented before below operation */ > + smp_mb__after_atomic(); I wonder if the barrier shouldn't go into inode_dio_begin() - as a sepatare patch. Because people just treat this as a lock-kind-of-thingy. E.g. btrfs or ceph use inode_dio_begin() in places which I'd consider prone to CPU reordering issues without this barrier... Otherwise the patch looks good to me. Honza
Please add at very least the fsdevel and xfs lists to iomap patches. Using i_dio_count for any kind of detection is bogus. If you want to pass flags to the writeback code please do so explicitly through struct writeback_control.
On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote: > > Well, XFS always performs buffered writeback using unwritten extents so at > least the immediate problem of stale data exposure ext4 has does not happen > there AFAICT. Currently XFS never uses unwritten extents when converting delalloc extents.
On Tue 14-01-20 08:37:02, Christoph Hellwig wrote: > Using i_dio_count for any kind of detection is bogus. If you want to > pass flags to the writeback code please do so explicitly through > struct writeback_control. We want to detect in the writeback path whether there's direct IO (read) currently running for the inode. Not for the writeback issued from iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be racing with - so struct writeback_control won't help. Now if you want to see the ugly details why this hack is needed, see my other email to Ritesh in this thread with details of the race. Honza
On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote: > We want to detect in the writeback path whether there's direct IO (read) > currently running for the inode. Not for the writeback issued from > iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be > racing with - so struct writeback_control won't help. Now if you want to > see the ugly details why this hack is needed, see my other email to Ritesh > in this thread with details of the race. How do we get other writeback after iomap_dio_rw wrote everything out? Either way I'm trying to kill i_dio_count as it has all kinds of problems, see the patch sent out earlier today.
On Tue 14-01-20 10:27:36, Christoph Hellwig wrote: > On Tue, Jan 14, 2020 at 06:19:34PM +0100, Jan Kara wrote: > > We want to detect in the writeback path whether there's direct IO (read) > > currently running for the inode. Not for the writeback issued from > > iomap_dio_rw() but for any arbitrary writeback that iomap_dio_rw() can be > > racing with - so struct writeback_control won't help. Now if you want to > > see the ugly details why this hack is needed, see my other email to Ritesh > > in this thread with details of the race. > > How do we get other writeback after iomap_dio_rw wrote everything out? You create dirty page using mmap in the range read by iomap_dio_rw() and then background writeback happens at unfortunate time... Email [1] has the exact traces. > Either way I'm trying to kill i_dio_count as it has all kinds of > problems, see the patch sent out earlier today. OK, I'll see that patch. Honza [1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz
On Tue 14-01-20 08:38:18, Christoph Hellwig wrote: > On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote: > > > > Well, XFS always performs buffered writeback using unwritten extents so at > > least the immediate problem of stale data exposure ext4 has does not happen > > there AFAICT. > > Currently XFS never uses unwritten extents when converting delalloc > extents. I see, it is a long time since I last looked at that part of XFS code. So then I think XFS might be prone to the same kind of race and data exposure as I outlined in [1]... Honza [1] https://lore.kernel.org/linux-ext4/20200114094741.GC6466@quack2.suse.cz
On Wed, Jan 15, 2020 at 10:19:25AM +0100, Jan Kara wrote: > On Tue 14-01-20 08:38:18, Christoph Hellwig wrote: > > On Tue, Jan 14, 2020 at 10:05:07AM +0100, Jan Kara wrote: > > > > > > Well, XFS always performs buffered writeback using unwritten extents so at > > > least the immediate problem of stale data exposure ext4 has does not happen > > > there AFAICT. > > > > Currently XFS never uses unwritten extents when converting delalloc > > extents. > > I see, it is a long time since I last looked at that part of XFS code. So > then I think XFS might be prone to the same kind of race and data exposure > as I outlined in [1]... I think not using unwritten extents for filling holes inside i_size will always lead to the potential for stale data exposure in one form or another. Because of that Darrick has started looking into always using unwritten extents for buffered writes inside i_size.
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c index 23837926c0c5..d1c159bd3854 100644 --- a/fs/iomap/direct-io.c +++ b/fs/iomap/direct-io.c @@ -468,9 +468,18 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, flags |= IOMAP_NOWAIT; } + /* + * Call inode_dio_begin() before we write out and wait for writeback to + * complete. This may be needed by some filesystems to prevent race + * like stale data exposure by DIO reads. + */ + inode_dio_begin(inode); + /* So that i_dio_count is incremented before below operation */ + smp_mb__after_atomic(); + ret = filemap_write_and_wait_range(mapping, pos, end); if (ret) - goto out_free_dio; + goto out_end_dio; /* * Try to invalidate cache pages for the range we're direct @@ -488,11 +497,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, !inode->i_sb->s_dio_done_wq) { ret = sb_init_dio_done_wq(inode->i_sb); if (ret < 0) - goto out_free_dio; + goto out_end_dio; } - inode_dio_begin(inode); - blk_start_plug(&plug); do { ret = iomap_apply(inode, pos, count, flags, ops, dio, @@ -568,6 +575,8 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter, return iomap_dio_complete(dio); +out_end_dio: + inode_dio_end(inode); out_free_dio: kfree(dio); return ret;
Some filesystems (e.g. ext4) need to know in it's writeback path, that whether DIO is in progress or not. This info may be needed to avoid the stale data exposure race with DIO reads. Signed-off-by: Ritesh Harjani <riteshh@linux.ibm.com> --- fs/iomap/direct-io.c | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-)