Message ID | 20170313004708.29838-1-ebiggers3@gmail.com |
---|---|
State | Awaiting Upstream, archived |
Headers | show |
> On Mar 12, 2017, at 6:47 PM, Eric Biggers <ebiggers3@gmail.com> wrote: > > From: Eric Biggers <ebiggers@google.com> > > Currently the case of writing via mmap to a file with inline data is not > handled. This is maybe a rare case since it requires a writable memory > map of a very small file, but it is trivial to trigger with on > inline_data filesystem, and it causes the > 'BUG_ON(ext4_test_inode_state(inode, EXT4_STATE_MAY_INLINE_DATA));' in > ext4_writepages() to be hit: > > mkfs.ext4 -O inline_data /dev/vdb > mount /dev/vdb /mnt > xfs_io -f /mnt/file \ > -c 'pwrite 0 1' \ > -c 'mmap -w 0 1m' \ > -c 'mwrite 0 1' \ > -c 'fsync' > > kernel BUG at fs/ext4/inode.c:2723! > invalid opcode: 0000 [#1] SMP > CPU: 1 PID: 2532 Comm: xfs_io Not tainted 4.11.0-rc1-xfstests-00301-g071d9acf3d1f #633 > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014 > task: ffff88003d3a8040 task.stack: ffffc90000300000 > RIP: 0010:ext4_writepages+0xc89/0xf8a > RSP: 0018:ffffc90000303ca0 EFLAGS: 00010283 > RAX: 0000028410000000 RBX: ffff8800383fa3b0 RCX: ffffffff812afcdc > RDX: 00000a9d00000246 RSI: ffffffff81e660e0 RDI: 0000000000000246 > RBP: ffffc90000303dc0 R08: 0000000000000002 R09: 869618e8f99b4fa5 > R10: 00000000852287a2 R11: 00000000a03b49f4 R12: ffff88003808e698 > R13: 0000000000000000 R14: 7fffffffffffffff R15: 7fffffffffffffff > FS: 00007fd3e53094c0(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: 00007fd3e4c51000 CR3: 000000003d554000 CR4: 00000000003406e0 > Call Trace: > ? _raw_spin_unlock+0x27/0x2a > ? kvm_clock_read+0x1e/0x20 > do_writepages+0x23/0x2c > ? do_writepages+0x23/0x2c > __filemap_fdatawrite_range+0x80/0x87 > filemap_write_and_wait_range+0x67/0x8c > ext4_sync_file+0x20e/0x472 > vfs_fsync_range+0x8e/0x9f > ? syscall_trace_enter+0x25b/0x2d0 > vfs_fsync+0x1c/0x1e > do_fsync+0x31/0x4a > SyS_fsync+0x10/0x14 > do_syscall_64+0x69/0x131 > entry_SYSCALL64_slow_path+0x25/0x25 > > We could try to be smart and keep the inline data in this case, or at > least support delayed allocation when allocating the block, but these > solutions would be more complicated and don't seem worthwhile given how > rare this case seems to be. So just fix the bug by calling > ext4_convert_inline_data() when we're asked to make a page writable, so > that any inline data gets evicted, with the block allocated immediately. That is what I was going to suggest also, before I somehow lost this thread in my inbox. Keeping inline file data is a tiny improvement, and any complexity to make this work is not worth it. Reviewed-by: Andreas Dilger <adilger@dilger.ca> > Reported-by: Nick Alcock <nick.alcock@oracle.com> > Cc: stable@vger.kernel.org > Signed-off-by: Eric Biggers <ebiggers@google.com> > --- > fs/ext4/inode.c | 5 +++++ > 1 file changed, 5 insertions(+) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 7385e6a6b6cb..0fc3e25ef531 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf) > file_update_time(vma->vm_file); > > down_read(&EXT4_I(inode)->i_mmap_sem); > + > + ret = ext4_convert_inline_data(inode); > + if (ret) > + goto out_ret; > + > /* Delalloc case is easy... */ > if (test_opt(inode->i_sb, DELALLOC) && > !ext4_should_journal_data(inode) && > -- > 2.12.0 > Cheers, Andreas
> mkfs.ext4 -O inline_data /dev/vdb > mount /dev/vdb /mnt > xfs_io -f /mnt/file \ > -c 'pwrite 0 1' \ > -c 'mmap -w 0 1m' \ > -c 'mwrite 0 1' \ > -c 'fsync' Please add this test case to xfstests.
On Mon, Mar 13, 2017 at 04:33:13PM -0700, Christoph Hellwig wrote: > > mkfs.ext4 -O inline_data /dev/vdb > > mount /dev/vdb /mnt > > xfs_io -f /mnt/file \ > > -c 'pwrite 0 1' \ > > -c 'mmap -w 0 1m' \ > > -c 'mwrite 0 1' \ > > -c 'fsync' > > Please add this test case to xfstests. I'm working on this, and I discovered there's still a bug. After the data is written with mwrite, if the filesystem is then mount-cycled, the contents of the file are the old contents rather than the new contents. I believe this is caused by a bug in ext4_convert_inline_data(). Specifically, the new block containing the evicted data is journalled using a buffer_head associated with the block device. This is wrong because it can overwrite data that is later written through non-journalled writeback. I'll look into this more when I have time, but in any case it appears this fix isn't complete. Eric
On Tue, Mar 14, 2017 at 04:32:39PM -0700, Eric Biggers wrote: > I'm working on this, and I discovered there's still a bug. After the data is > written with mwrite, if the filesystem is then mount-cycled, the contents of the > file are the old contents rather than the new contents. > > I believe this is caused by a bug in ext4_convert_inline_data(). Specifically, > the new block containing the evicted data is journalled using a buffer_head > associated with the block device. This is wrong because it can overwrite data > that is later written through non-journalled writeback. I'll apply this patch for now, since it fixes a userspace-triggerable BUG, but you're right. ext4_convert_inline_data() is busted; it should not be using data journalling, but rather it should check to make sure the page is already in memory (and creating it if necessary), and just write it out to memory. This is a separate bug, and we should fix it, but the first patch is correct and should go in. - Ted
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 7385e6a6b6cb..0fc3e25ef531 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5839,6 +5839,11 @@ int ext4_page_mkwrite(struct vm_fault *vmf) file_update_time(vma->vm_file); down_read(&EXT4_I(inode)->i_mmap_sem); + + ret = ext4_convert_inline_data(inode); + if (ret) + goto out_ret; + /* Delalloc case is easy... */ if (test_opt(inode->i_sb, DELALLOC) && !ext4_should_journal_data(inode) &&