
ext4: fix race between writepages and enabling EXT4_EXTENTS_FL

Message ID 20200218002151.1581441-1-ebiggers@kernel.org
State Not Applicable
Series ext4: fix race between writepages and enabling EXT4_EXTENTS_FL

Commit Message

Eric Biggers Feb. 18, 2020, 12:21 a.m. UTC
From: Eric Biggers <ebiggers@google.com>

If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
on it, the following warning in ext4_add_complete_io() can be hit:

WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120

Here's a minimal reproducer (not 100% reliable; root isn't required):

	while true; do
		sync
	done &
	while true; do
		rm -f file
		touch file
		chattr -e file
		echo X >> file
		chattr +e file
	done

The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
(which only returns true on extent-based files) is checked once to set
the number of reserved journal credits, and also again later to select
the flags for ext4_map_blocks() and copy the reserved journal handle to
ext4_io_end::handle.  But if EXT4_EXTENTS_FL is being concurrently set,
the first check can see dioread_nolock disabled while the later one can
see it enabled, causing the reserved handle to unexpectedly be NULL.

Fix this by checking ext4_should_dioread_nolock() only once and storing
the result in struct mpage_da_data.  This way, each ext4_writepages()
call uses a consistent dioread_nolock setting.

This was originally reported by syzbot without a reproducer at
https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
but now that dioread_nolock is the default I also started seeing this
when running syzkaller locally.

Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
Cc: stable@kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/ext4/inode.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Jan Kara Feb. 18, 2020, 7:49 a.m. UTC
On Mon 17-02-20 16:21:51, Eric Biggers wrote:
> [...]
> Fix this by checking ext4_should_dioread_nolock() only once and storing
> the result in struct mpage_da_data.  This way, each ext4_writepages()
> call uses a consistent dioread_nolock setting.
> [...]

What you propose is probably enough to stop this particular race but I
think there are other races that can get triggered by inode conversion
to/from extent format. So I think we rather need to make inode
format conversion much more careful (or we could just remove that
functionality because I'm not sure if anybody actually uses it).

WRT making inode format conversion more careful you can have a look at how
ext4_change_inode_journal_flag() works. It uses EXT4_I(inode)->i_mmap_sem to
block page faults, it also uses sbi->s_journal_flag_rwsem to avoid races
with writepages and I believe the migration code should do the same.

								Honza
Eric Biggers Feb. 19, 2020, 4:56 a.m. UTC
On Tue, Feb 18, 2020 at 08:49:14AM +0100, Jan Kara wrote:
> On Mon 17-02-20 16:21:51, Eric Biggers wrote:
> > [...]
> 
> What you propose is probably enough to stop this particular race but I
> think there are other races that can get triggered by inode conversion
> to/from extent format. So I think we rather need to make inode
> format conversion much more careful (or we could just remove that
> functionality because I'm not sure if anybody actually uses it).
> 
> WRT making inode format conversion more careful you can have a look at how
> ext4_change_inode_journal_flag() works. It uses EXT4_I(inode)->i_mmap_sem to
> block page faults, it also uses sbi->s_journal_flag_rwsem to avoid races
> with writepages and I believe the migration code should do the same.

I was looking at that earlier, but I was a bit concerned that people could
complain about a performance regression due to EXTENTS_FL no longer being
settable/clearable on different files concurrently.

But if we think this functionality is rarely used and that no one would care,
then sure, we should go with that solution instead.  I'll probably rename
s_journal_flag_rwsem to s_writepages_rwsem since it will be used for both the
EXTENTS and JOURNAL_DATA flags.

- Eric

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e60aca791d3f1..7e02851043bca 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1527,6 +1527,7 @@  struct mpage_da_data {
 	struct ext4_map_blocks map;
 	struct ext4_io_submit io_submit;	/* IO submission data */
 	unsigned int do_map:1;
+	unsigned int dioread_nolock:1;
 };
 
 static void mpage_release_unused_pages(struct mpage_da_data *mpd,
@@ -2335,7 +2336,7 @@  static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	struct inode *inode = mpd->inode;
 	struct ext4_map_blocks *map = &mpd->map;
 	int get_blocks_flags;
-	int err, dioread_nolock;
+	int err;
 
 	trace_ext4_da_write_pages_extent(inode, map);
 	/*
@@ -2356,8 +2357,7 @@  static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	get_blocks_flags = EXT4_GET_BLOCKS_CREATE |
 			   EXT4_GET_BLOCKS_METADATA_NOFAIL |
 			   EXT4_GET_BLOCKS_IO_SUBMIT;
-	dioread_nolock = ext4_should_dioread_nolock(inode);
-	if (dioread_nolock)
+	if (mpd->dioread_nolock)
 		get_blocks_flags |= EXT4_GET_BLOCKS_IO_CREATE_EXT;
 	if (map->m_flags & (1 << BH_Delay))
 		get_blocks_flags |= EXT4_GET_BLOCKS_DELALLOC_RESERVE;
@@ -2365,7 +2365,7 @@  static int mpage_map_one_extent(handle_t *handle, struct mpage_da_data *mpd)
 	err = ext4_map_blocks(handle, inode, map, get_blocks_flags);
 	if (err < 0)
 		return err;
-	if (dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
+	if (mpd->dioread_nolock && (map->m_flags & EXT4_MAP_UNWRITTEN)) {
 		if (!mpd->io_submit.io_end->handle &&
 		    ext4_handle_valid(handle)) {
 			mpd->io_submit.io_end->handle = handle->h_rsv_handle;
@@ -2685,6 +2685,9 @@  static int ext4_writepages(struct address_space *mapping,
 		 */
 		rsv_blocks = 1 + ext4_chunk_trans_blocks(inode,
 						PAGE_SIZE >> inode->i_blkbits);
+		mpd.dioread_nolock = 1;
+	} else {
+		mpd.dioread_nolock = 0;
 	}
 
 	if (wbc->range_start == 0 && wbc->range_end == LLONG_MAX)