ext4: evict inline data when writing to memory map

Submitted by Eric Biggers on March 13, 2017, 12:47 a.m.

Details

Message ID 20170313004708.29838-1-ebiggers3@gmail.com
State Awaiting Upstream
Headers show

Commit Message

Eric Biggers March 13, 2017, 12:47 a.m.
From: Eric Biggers <ebiggers@google.com>

Currently the case of writing via mmap to a file with inline data is not
handled.  This is maybe a rare case since it requires a writable memory
map of a very small file, but it is trivial to trigger with on
inline_data filesystem, and it causes the
'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
ext4_writepages() to be hit:

    mkfs.ext4 -O inline_data /dev/vdb
    mount /dev/vdb /mnt
    xfs_io -f /mnt/file \
	-c 'pwrite 0 1' \
	-c 'mmap -w 0 1m' \
	-c 'mwrite 0 1' \
	-c 'fsync'

	kernel BUG at fs/ext4/inode.c:2723!
	invalid opcode: 0000 [#1] SMP
	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
	task: ffff88003d3a8040 task.stack: ffffc90000300000
	RIP: 0010:ext4_writepages+0xc89/0xf8a
	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
	Call Trace:
	 ? _raw_spin_unlock+0x27/0x2a
	 ? kvm_clock_read+0x1e/0x20
	 do_writepages+0x23/0x2c
	 ? do_writepages+0x23/0x2c
	 __filemap_fdatawrite_range+0x80/0x87
	 filemap_write_and_wait_range+0x67/0x8c
	 ext4_sync_file+0x20e/0x472
	 vfs_fsync_range+0x8e/0x9f
	 ? syscall_trace_enter+0x25b/0x2d0
	 vfs_fsync+0x1c/0x1e
	 do_fsync+0x31/0x4a
	 SyS_fsync+0x10/0x14
	 do_syscall_64+0x69/0x131
	 entry_SYSCALL64_slow_path+0x25/0x25

We could try to be smart and keep the inline data in this case, or at
least support delayed allocation when allocating the block, but these
solutions would be more complicated and don't seem worthwhile given how
rare this case seems to be.  So just fix the bug by calling
ext4_convert_inline_data() when we're asked to make a page writable, so
that any inline data gets evicted, with the block allocated immediately.

Reported-by: Nick Alcock <nick.alcock@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
---
 fs/ext4/inode.c | 5 +++++
 1 file changed, 5 insertions(+)

Comments

Andreas Dilger March 13, 2017, 4:39 a.m.
> On Mar 12, 2017, at 6:47 PM, Eric Biggers <ebiggers3@gmail.com> wrote:
> 
> From: Eric Biggers <ebiggers@google.com>
> 
> Currently the case of writing via mmap to a file with inline data is not
> handled.  This is maybe a rare case since it requires a writable memory
> map of a very small file, but it is trivial to trigger with on
> inline_data filesystem, and it causes the
> 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in
> ext4_writepages() to be hit:
> 
>    mkfs.ext4 -O inline_data /dev/vdb
>    mount /dev/vdb /mnt
>    xfs_io -f /mnt/file \
> 	-c 'pwrite 0 1' \
> 	-c 'mmap -w 0 1m' \
> 	-c 'mwrite 0 1' \
> 	-c 'fsync'
> 
> 	kernel BUG at fs/ext4/inode.c:2723!
> 	invalid opcode: 0000 [#1] SMP
> 	CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633
> 	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
> 	task: ffff88003d3a8040 task.stack: ffffc90000300000
> 	RIP: 0010:ext4_writepages+0xc89/0xf8a
> 	RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283
> 	RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc
> 	RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246
> 	RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5
> 	R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698
> 	R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff
> 	FS:  00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
> 	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> 	CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0
> 	Call Trace:
> 	 ? _raw_spin_unlock+0x27/0x2a
> 	 ? kvm_clock_read+0x1e/0x20
> 	 do_writepages+0x23/0x2c
> 	 ? do_writepages+0x23/0x2c
> 	 __filemap_fdatawrite_range+0x80/0x87
> 	 filemap_write_and_wait_range+0x67/0x8c
> 	 ext4_sync_file+0x20e/0x472
> 	 vfs_fsync_range+0x8e/0x9f
> 	 ? syscall_trace_enter+0x25b/0x2d0
> 	 vfs_fsync+0x1c/0x1e
> 	 do_fsync+0x31/0x4a
> 	 SyS_fsync+0x10/0x14
> 	 do_syscall_64+0x69/0x131
> 	 entry_SYSCALL64_slow_path+0x25/0x25
> 
> We could try to be smart and keep the inline data in this case, or at
> least support delayed allocation when allocating the block, but these
> solutions would be more complicated and don't seem worthwhile given how
> rare this case seems to be.  So just fix the bug by calling
> ext4_convert_inline_data() when we're asked to make a page writable, so
> that any inline data gets evicted, with the block allocated immediately.

That is what I was going to suggest also, before I somehow lost this
thread in my inbox.  Keeping inline file data is a tiny improvement,
and any complexity to make this work is not worth it.

Reviewed-by: Andreas Dilger <adilger@dilger.ca>

> Reported-by: Nick Alcock <nick.alcock@oracle.com>
> Cc: stable@vger.kernel.org
> Signed-off-by: Eric Biggers <ebiggers@google.com>
> ---
> fs/ext4/inode.c | 5 +++++
> 1 file changed, 5 insertions(+)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 7385e6a6b6cb..0fc3e25ef531 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf)
> 	file_update_time(vma->vm_file);
> 
> 	down_read(&EXT4_I(inode)->i_mmap_sem);
> +
> +	ret = ext4_convert_inline_data(inode);
> +	if (ret)
> +		goto out_ret;
> +
> 	/* Delalloc case is easy... */
> 	if (test_opt(inode->i_sb, DELALLOC) &&
> 	    !ext4_should_journal_data(inode) &&
> --
> 2.12.0
> 


Cheers, Andreas
Christoph Hellwig March 13, 2017, 11:33 p.m.
>     mkfs.ext4 -O inline_data /dev/vdb
>     mount /dev/vdb /mnt
>     xfs_io -f /mnt/file \
> 	-c 'pwrite 0 1' \
> 	-c 'mmap -w 0 1m' \
> 	-c 'mwrite 0 1' \
> 	-c 'fsync'

Please add this test case to xfstests.
Eric Biggers March 14, 2017, 11:32 p.m.
On Mon, Mar 13, 2017 at 04:33:13PM -0700, Christoph Hellwig wrote:
> >     mkfs.ext4 -O inline_data /dev/vdb
> >     mount /dev/vdb /mnt
> >     xfs_io -f /mnt/file \
> > 	-c 'pwrite 0 1' \
> > 	-c 'mmap -w 0 1m' \
> > 	-c 'mwrite 0 1' \
> > 	-c 'fsync'
> 
> Please add this test case to xfstests.

I'm working on this, and I discovered there's still a bug.  After the data is
written with mwrite, if the filesystem is then mount-cycled, the contents of the
file are the old contents rather than the new contents.

I believe this is caused by a bug in ext4_convert_inline_data().  Specifically,
the new block containing the evicted data is journalled using a buffer_head
associated with the block device.  This is wrong because it can overwrite data
that is later written through non-journalled writeback.

I'll look into this more when I have time, but in any case it appears this fix
isn't complete.

Eric
Theodore Ts'o April 30, 2017, 4:13 a.m.
On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote:
> I'm working on this, and I discovered there's still a bug.  After the data is
> written with mwrite, if the filesystem is then mount-cycled, the contents of the
> file are the old contents rather than the new contents.
> 
> I believe this is caused by a bug in ext4_convert_inline_data().  Specifically,
> the new block containing the evicted data is journalled using a buffer_head
> associated with the block device.  This is wrong because it can overwrite data
> that is later written through non-journalled writeback.

I'll apply this patch for now, since it fixes a userspace-triggerable
BUG, but you're right.  ext4_convert_inline_data() is busted; it
should not be using data journalling, but rather it should check to
make sure the page is already in memory (and creating it if
necessary), and just write it out to memory.

This is a separate bug, and we should fix it, but the first patch is
correct and should go in.

					- Ted

Patch hide | download patch | download mbox

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7385e6a6b6cb..0fc3e25ef531 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5839,6 +5839,11 @@  int ext4_page_mkwrite(struct vm_fault *vmf)
 	file_update_time(vma->vm_file);
 
 	down_read(&EXT4_I(inode)->i_mmap_sem);
+
+	ret = ext4_convert_inline_data(inode);
+	if (ret)
+		goto out_ret;
+
 	/* Delalloc case is easy... */
 	if (test_opt(inode->i_sb, DELALLOC) &&
 	    !ext4_should_journal_data(inode) &&