Message ID | 20220326065351.761952-1-yebin10@huawei.com |
---|---|
State | Awaiting Upstream |
Headers | show |
Series | [-next] ext4: fix warning in ext4_handle_inode_extension | expand |
Ye Bin <yebin10@huawei.com> writes: > We got issue as follows: > EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory > EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error > EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory > EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc > ------------[ cut here ]------------ > WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 > Modules linked in: > CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 > RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 > RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 > RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 > RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd > RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f > R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a > R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b > FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > do_iter_readv_writev+0x2e5/0x360 > do_iter_write+0x112/0x4c0 > do_pwritev+0x1e5/0x390 > __x64_sys_pwritev2+0x7e/0xa0 > do_syscall_64+0x37/0x50 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Above issue may happen as follows: > Assume > inode.i_size=4096 > EXT4_I(inode)->i_disksize=4096 > > step 1: set inode->i_isize = 8192 > ext4_setattr > if (attr->ia_size != inode->i_size) > EXT4_I(inode)->i_disksize = attr->ia_size; > rc = ext4_mark_inode_dirty > ext4_reserve_inode_write > ext4_get_inode_loc > __ext4_get_inode_loc > sb_getblk --> return -ENOMEM > ... > if (!error) ->will not update i_size > i_size_write(inode, attr->ia_size); > Now: > inode.i_size=4096 > EXT4_I(inode)->i_disksize=8192 > > step 2: Direct write 4096 bytes > ext4_file_write_iter > ext4_dio_write_iter > iomap_dio_rw ->return error > if (extend) > ext4_handle_inode_extension > WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); > ->Then trigger warning. > > To solve above issue, if mark inode dirty failed in ext4_setattr just > set 'EXT4_I(inode)->i_disksize' with old value. > > Signed-off-by: Ye Bin <yebin10@huawei.com> > --- > fs/ext4/inode.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 90fd6f7b6209..8adf1f802f6c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > if (attr->ia_valid & ATTR_SIZE) { > handle_t *handle; > loff_t oldsize = inode->i_size; > + loff_t old_disksize; > int shrink = (attr->ia_size < inode->i_size); > > if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > inode->i_sb->s_blocksize_bits); > > down_write(&EXT4_I(inode)->i_data_sem); > + old_disksize = EXT4_I(inode)->i_disksize; > EXT4_I(inode)->i_disksize = attr->ia_size; > rc = ext4_mark_inode_dirty(handle, inode); > if (!error) > @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > */ > if (!error) > i_size_write(inode, attr->ia_size); > + else > + EXT4_I(inode)->i_disksize = old_disksize; Shouldn't this always be done if ext4_mark_inode_dirty fails? if (rc) EXT4_I(inode)->i_disksize = old_disksize; Otherwise you hit the same issue if (!error && rc), no?
On Sat 26-03-22 14:53:51, Ye Bin wrote: > We got issue as follows: > EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory > EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error > EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory > EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc > ------------[ cut here ]------------ > WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 > Modules linked in: > CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 > RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 > RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 > RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 > RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd > RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f > R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a > R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b > FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > do_iter_readv_writev+0x2e5/0x360 > do_iter_write+0x112/0x4c0 > do_pwritev+0x1e5/0x390 > __x64_sys_pwritev2+0x7e/0xa0 > do_syscall_64+0x37/0x50 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > Above issue may happen as follows: > Assume > inode.i_size=4096 > EXT4_I(inode)->i_disksize=4096 > > step 1: set inode->i_isize = 8192 > ext4_setattr > if (attr->ia_size != inode->i_size) > EXT4_I(inode)->i_disksize = attr->ia_size; > rc = ext4_mark_inode_dirty > ext4_reserve_inode_write > ext4_get_inode_loc > __ext4_get_inode_loc > sb_getblk --> return -ENOMEM > ... > if (!error) ->will not update i_size > i_size_write(inode, attr->ia_size); > Now: > inode.i_size=4096 > EXT4_I(inode)->i_disksize=8192 > > step 2: Direct write 4096 bytes > ext4_file_write_iter > ext4_dio_write_iter > iomap_dio_rw ->return error > if (extend) > ext4_handle_inode_extension > WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); > ->Then trigger warning. > > To solve above issue, if mark inode dirty failed in ext4_setattr just > set 'EXT4_I(inode)->i_disksize' with old value. > > Signed-off-by: Ye Bin <yebin10@huawei.com> Thanks for the fix! So I think this deserves a further debate. I have two points here: 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback) we must abort the journal because metadata is not guaranteed to be consistent anymore. In this particular callsite of ext4_mark_inode_dirty() you were able to undo the changes but there are many more where it is not sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs to call ext4_journal_abort_handle() (as already happens inside __ext4_journal_get_write_access()) and not just ext4_std_error(). 2) The assertion in ext4_handle_inode_extension() should be conditioned on !is_journal_aborted() to avoid useless warnings for filesystems we know are inconsistent anyway. Thoughts? Honza > --- > fs/ext4/inode.c | 4 ++++ > 1 file changed, 4 insertions(+) > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > index 90fd6f7b6209..8adf1f802f6c 100644 > --- a/fs/ext4/inode.c > +++ b/fs/ext4/inode.c > @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > if (attr->ia_valid & ATTR_SIZE) { > handle_t *handle; > loff_t oldsize = inode->i_size; > + loff_t old_disksize; > int shrink = (attr->ia_size < inode->i_size); > > if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > inode->i_sb->s_blocksize_bits); > > down_write(&EXT4_I(inode)->i_data_sem); > + old_disksize = EXT4_I(inode)->i_disksize; > EXT4_I(inode)->i_disksize = attr->ia_size; > rc = ext4_mark_inode_dirty(handle, inode); > if (!error) > @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > */ > if (!error) > i_size_write(inode, attr->ia_size); > + else > + EXT4_I(inode)->i_disksize = old_disksize; > up_write(&EXT4_I(inode)->i_data_sem); > ext4_journal_stop(handle); > if (error) > -- > 2.31.1 >
yebin <yebin10@huawei.com> writes: > On 2022/3/28 23:57, Gabriel Krisman Bertazi wrote: >> Ye Bin <yebin10@huawei.com> writes: >> >>> We got issue as follows: >>> EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory >>> EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error >>> EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory >>> EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc >>> ------------[ cut here ]------------ >>> WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 >>> Modules linked in: >>> CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 >>> RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 >>> RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 >>> RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 >>> RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd >>> RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f >>> R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a >>> R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b >>> FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 >>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>> CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>> Call Trace: >>> do_iter_readv_writev+0x2e5/0x360 >>> do_iter_write+0x112/0x4c0 >>> do_pwritev+0x1e5/0x390 >>> __x64_sys_pwritev2+0x7e/0xa0 >>> do_syscall_64+0x37/0x50 >>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>> >>> Above issue may happen as follows: >>> Assume >>> inode.i_size=4096 >>> EXT4_I(inode)->i_disksize=4096 >>> >>> step 1: set inode->i_isize = 8192 >>> ext4_setattr >>> if (attr->ia_size != inode->i_size) >>> EXT4_I(inode)->i_disksize = attr->ia_size; >>> rc = ext4_mark_inode_dirty >>> ext4_reserve_inode_write >>> ext4_get_inode_loc >>> __ext4_get_inode_loc >>> sb_getblk --> return -ENOMEM >>> ... >>> if (!error) ->will not update i_size >>> i_size_write(inode, attr->ia_size); >>> Now: >>> inode.i_size=4096 >>> EXT4_I(inode)->i_disksize=8192 >>> >>> step 2: Direct write 4096 bytes >>> ext4_file_write_iter >>> ext4_dio_write_iter >>> iomap_dio_rw ->return error >>> if (extend) >>> ext4_handle_inode_extension >>> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); >>> ->Then trigger warning. >>> >>> To solve above issue, if mark inode dirty failed in ext4_setattr just >>> set 'EXT4_I(inode)->i_disksize' with old value. >>> >>> Signed-off-by: Ye Bin <yebin10@huawei.com> >>> --- >>> fs/ext4/inode.c | 4 ++++ >>> 1 file changed, 4 insertions(+) >>> >>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>> index 90fd6f7b6209..8adf1f802f6c 100644 >>> --- a/fs/ext4/inode.c >>> +++ b/fs/ext4/inode.c >>> @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>> if (attr->ia_valid & ATTR_SIZE) { >>> handle_t *handle; >>> loff_t oldsize = inode->i_size; >>> + loff_t old_disksize; >>> int shrink = (attr->ia_size < inode->i_size); >>> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) >>> { >>> @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>> inode->i_sb->s_blocksize_bits); >>> down_write(&EXT4_I(inode)->i_data_sem); >>> + old_disksize = EXT4_I(inode)->i_disksize; >>> EXT4_I(inode)->i_disksize = attr->ia_size; >>> rc = ext4_mark_inode_dirty(handle, inode); >>> if (!error) >>> @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>> */ >>> if (!error) >>> i_size_write(inode, attr->ia_size); >>> + else >>> + EXT4_I(inode)->i_disksize = old_disksize; >> Shouldn't this always be done if ext4_mark_inode_dirty fails? >> >> if (rc) >> EXT4_I(inode)->i_disksize = old_disksize; >> >> Otherwise you hit the same issue if (!error && rc), no? > > In fact, '(!error && rc)' condition is always unsatisfied. > > ext4_setattr > ... > down_write(&EXT4_I(inode)->i_data_sem); > EXT4_I(inode)->i_disksize = attr->ia_size; > rc = ext4_mark_inode_dirty(handle, inode); > *if (!error) ** > ** error = rc;* Oh right. sorry for the noise.
On 2022/3/29 17:28, Jan Kara wrote: > On Sat 26-03-22 14:53:51, Ye Bin wrote: >> We got issue as follows: >> EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory >> EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error >> EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory >> EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc >> ------------[ cut here ]------------ >> WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 >> Modules linked in: >> CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 >> RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 >> RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 >> RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 >> RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd >> RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f >> R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a >> R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b >> FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >> CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >> Call Trace: >> do_iter_readv_writev+0x2e5/0x360 >> do_iter_write+0x112/0x4c0 >> do_pwritev+0x1e5/0x390 >> __x64_sys_pwritev2+0x7e/0xa0 >> do_syscall_64+0x37/0x50 >> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >> >> Above issue may happen as follows: >> Assume >> inode.i_size=4096 >> EXT4_I(inode)->i_disksize=4096 >> >> step 1: set inode->i_isize = 8192 >> ext4_setattr >> if (attr->ia_size != inode->i_size) >> EXT4_I(inode)->i_disksize = attr->ia_size; >> rc = ext4_mark_inode_dirty >> ext4_reserve_inode_write >> ext4_get_inode_loc >> __ext4_get_inode_loc >> sb_getblk --> return -ENOMEM >> ... >> if (!error) ->will not update i_size >> i_size_write(inode, attr->ia_size); >> Now: >> inode.i_size=4096 >> EXT4_I(inode)->i_disksize=8192 >> >> step 2: Direct write 4096 bytes >> ext4_file_write_iter >> ext4_dio_write_iter >> iomap_dio_rw ->return error >> if (extend) >> ext4_handle_inode_extension >> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); >> ->Then trigger warning. >> >> To solve above issue, if mark inode dirty failed in ext4_setattr just >> set 'EXT4_I(inode)->i_disksize' with old value. >> >> Signed-off-by: Ye Bin <yebin10@huawei.com> > Thanks for the fix! So I think this deserves a further debate. I have two > points here: > > 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback) > we must abort the journal because metadata is not guaranteed to be > consistent anymore. In this particular callsite of ext4_mark_inode_dirty() > you were able to undo the changes but there are many more where it is not > sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs > to call ext4_journal_abort_handle() (as already happens inside > __ext4_journal_get_write_access()) and not just ext4_std_error(). > > 2) The assertion in ext4_handle_inode_extension() should be conditioned on > !is_journal_aborted() to avoid useless warnings for filesystems we know are > inconsistent anyway. > > Thoughts? > > Honza Do you mean call jbd2_abort in ext4_reserve_inode_write() ? If we abort journal when metadata is not guaranteed to be consistent. The mode of ‘errors=continue’ is unnecessary. >> --- >> fs/ext4/inode.c | 4 ++++ >> 1 file changed, 4 insertions(+) >> >> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >> index 90fd6f7b6209..8adf1f802f6c 100644 >> --- a/fs/ext4/inode.c >> +++ b/fs/ext4/inode.c >> @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >> if (attr->ia_valid & ATTR_SIZE) { >> handle_t *handle; >> loff_t oldsize = inode->i_size; >> + loff_t old_disksize; >> int shrink = (attr->ia_size < inode->i_size); >> >> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { >> @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >> inode->i_sb->s_blocksize_bits); >> >> down_write(&EXT4_I(inode)->i_data_sem); >> + old_disksize = EXT4_I(inode)->i_disksize; >> EXT4_I(inode)->i_disksize = attr->ia_size; >> rc = ext4_mark_inode_dirty(handle, inode); >> if (!error) >> @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >> */ >> if (!error) >> i_size_write(inode, attr->ia_size); >> + else >> + EXT4_I(inode)->i_disksize = old_disksize; >> up_write(&EXT4_I(inode)->i_data_sem); >> ext4_journal_stop(handle); >> if (error) >> -- >> 2.31.1 >>
On Wed 30-03-22 20:08:13, yebin wrote: > On 2022/3/29 17:28, Jan Kara wrote: > > On Sat 26-03-22 14:53:51, Ye Bin wrote: > > > We got issue as follows: > > > EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory > > > EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error > > > EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory > > > EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc > > > ------------[ cut here ]------------ > > > WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 > > > Modules linked in: > > > CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 > > > RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 > > > RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 > > > RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 > > > RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd > > > RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f > > > R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a > > > R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b > > > FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 > > > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > > > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > > > Call Trace: > > > do_iter_readv_writev+0x2e5/0x360 > > > do_iter_write+0x112/0x4c0 > > > do_pwritev+0x1e5/0x390 > > > __x64_sys_pwritev2+0x7e/0xa0 > > > do_syscall_64+0x37/0x50 > > > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > Above issue may happen as follows: > > > Assume > > > inode.i_size=4096 > > > EXT4_I(inode)->i_disksize=4096 > > > > > > step 1: set inode->i_isize = 8192 > > > ext4_setattr > > > if (attr->ia_size != inode->i_size) > > > EXT4_I(inode)->i_disksize = attr->ia_size; > > > rc = ext4_mark_inode_dirty > > > ext4_reserve_inode_write > > > ext4_get_inode_loc > > > __ext4_get_inode_loc > > > sb_getblk --> return -ENOMEM > > > ... > > > if (!error) ->will not update i_size > > > i_size_write(inode, attr->ia_size); > > > Now: > > > inode.i_size=4096 > > > EXT4_I(inode)->i_disksize=8192 > > > > > > step 2: Direct write 4096 bytes > > > ext4_file_write_iter > > > ext4_dio_write_iter > > > iomap_dio_rw ->return error > > > if (extend) > > > ext4_handle_inode_extension > > > WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); > > > ->Then trigger warning. > > > > > > To solve above issue, if mark inode dirty failed in ext4_setattr just > > > set 'EXT4_I(inode)->i_disksize' with old value. > > > > > > Signed-off-by: Ye Bin <yebin10@huawei.com> > > Thanks for the fix! So I think this deserves a further debate. I have two > > points here: > > > > 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback) > > we must abort the journal because metadata is not guaranteed to be > > consistent anymore. In this particular callsite of ext4_mark_inode_dirty() > > you were able to undo the changes but there are many more where it is not > > sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs > > to call ext4_journal_abort_handle() (as already happens inside > > __ext4_journal_get_write_access()) and not just ext4_std_error(). > > > > 2) The assertion in ext4_handle_inode_extension() should be conditioned on > > !is_journal_aborted() to avoid useless warnings for filesystems we know are > > inconsistent anyway. > > > > Thoughts? > > > > Honza > Do you mean call jbd2_abort in ext4_reserve_inode_write() ? Yes. > If we abort journal when metadata is not guaranteed to be consistent. The > mode of ‘errors=continue’ is unnecessary. Well, firstly, errors=continue was always the best effort. There are no guarantees which failures we are able to withstand and which not. Generally, I think we try to withstand on-disk filesystem inconsistency but not inconsistency coming from programming errors or other external factors like out-of-memory conditions. Secondly, we already do abort the journal when e.g. jbd2_journal_get_write_access() fails (although that generally means some internal inconsistency) or when say revoke handling fails to allocate memory for a revoke record. So it won't be a new thing. Thirdly, and perhaps most importantly, you have found and fixed just one fairly innocent problem happening due to in memory inode state getting inconsistent after we fail to record the inode in the journal. There are almost 80 callsites of ext4_mark_inode_dirty() and honestly I suspect that e.g. inconsistent states resulting from extent tree manipulations being aborted in the middle due to ext4_ext_dirty() failing due to ENOMEM will also trigger all sorts of "interesting" behavior. So that's why I'd rather abort the journal than try to continue when we almost certainly now we cannot. One other possibility I could see dealing with this class of problems is using __GFP_NOFAIL for inode buffer head allocation (through sb_getblk_gfp()) in __ext4_get_inode_loc(). BTW, how did you trigger NULL sb_getblk() return in the first place? AFAICS grow_buffers() already uses __GFP_NOFAIL for all the allocations? Honza > > > --- > > > fs/ext4/inode.c | 4 ++++ > > > 1 file changed, 4 insertions(+) > > > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c > > > index 90fd6f7b6209..8adf1f802f6c 100644 > > > --- a/fs/ext4/inode.c > > > +++ b/fs/ext4/inode.c > > > @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > > > if (attr->ia_valid & ATTR_SIZE) { > > > handle_t *handle; > > > loff_t oldsize = inode->i_size; > > > + loff_t old_disksize; > > > int shrink = (attr->ia_size < inode->i_size); > > > if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { > > > @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > > > inode->i_sb->s_blocksize_bits); > > > down_write(&EXT4_I(inode)->i_data_sem); > > > + old_disksize = EXT4_I(inode)->i_disksize; > > > EXT4_I(inode)->i_disksize = attr->ia_size; > > > rc = ext4_mark_inode_dirty(handle, inode); > > > if (!error) > > > @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, > > > */ > > > if (!error) > > > i_size_write(inode, attr->ia_size); > > > + else > > > + EXT4_I(inode)->i_disksize = old_disksize; > > > up_write(&EXT4_I(inode)->i_data_sem); > > > ext4_journal_stop(handle); > > > if (error) > > > -- > > > 2.31.1 > > > >
On 2022/3/30 21:30, Jan Kara wrote: > On Wed 30-03-22 20:08:13, yebin wrote: >> On 2022/3/29 17:28, Jan Kara wrote: >>> On Sat 26-03-22 14:53:51, Ye Bin wrote: >>>> We got issue as follows: >>>> EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory >>>> EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error >>>> EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory >>>> EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc >>>> ------------[ cut here ]------------ >>>> WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 >>>> Modules linked in: >>>> CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 >>>> RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 >>>> RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 >>>> RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 >>>> RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd >>>> RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f >>>> R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a >>>> R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b >>>> FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 >>>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 >>>> CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 >>>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 >>>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 >>>> Call Trace: >>>> do_iter_readv_writev+0x2e5/0x360 >>>> do_iter_write+0x112/0x4c0 >>>> do_pwritev+0x1e5/0x390 >>>> __x64_sys_pwritev2+0x7e/0xa0 >>>> do_syscall_64+0x37/0x50 >>>> entry_SYSCALL_64_after_hwframe+0x44/0xa9 >>>> >>>> Above issue may happen as follows: >>>> Assume >>>> inode.i_size=4096 >>>> EXT4_I(inode)->i_disksize=4096 >>>> >>>> step 1: set inode->i_isize = 8192 >>>> ext4_setattr >>>> if (attr->ia_size != inode->i_size) >>>> EXT4_I(inode)->i_disksize = attr->ia_size; >>>> rc = ext4_mark_inode_dirty >>>> ext4_reserve_inode_write >>>> ext4_get_inode_loc >>>> __ext4_get_inode_loc >>>> sb_getblk --> return -ENOMEM >>>> ... >>>> if (!error) ->will not update i_size >>>> i_size_write(inode, attr->ia_size); >>>> Now: >>>> inode.i_size=4096 >>>> EXT4_I(inode)->i_disksize=8192 >>>> >>>> step 2: Direct write 4096 bytes >>>> ext4_file_write_iter >>>> ext4_dio_write_iter >>>> iomap_dio_rw ->return error >>>> if (extend) >>>> ext4_handle_inode_extension >>>> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); >>>> ->Then trigger warning. >>>> >>>> To solve above issue, if mark inode dirty failed in ext4_setattr just >>>> set 'EXT4_I(inode)->i_disksize' with old value. >>>> >>>> Signed-off-by: Ye Bin <yebin10@huawei.com> >>> Thanks for the fix! So I think this deserves a further debate. I have two >>> points here: >>> >>> 1) If ext4_mark_inode_dirty() fails (or basically any metadata writeback) >>> we must abort the journal because metadata is not guaranteed to be >>> consistent anymore. In this particular callsite of ext4_mark_inode_dirty() >>> you were able to undo the changes but there are many more where it is not >>> sanely possible AFAICT. Hence I think that ext4_reserve_inode_write() needs >>> to call ext4_journal_abort_handle() (as already happens inside >>> __ext4_journal_get_write_access()) and not just ext4_std_error(). >>> >>> 2) The assertion in ext4_handle_inode_extension() should be conditioned on >>> !is_journal_aborted() to avoid useless warnings for filesystems we know are >>> inconsistent anyway. >>> >>> Thoughts? >>> >>> Honza >> Do you mean call jbd2_abort in ext4_reserve_inode_write() ? > Yes. > >> If we abort journal when metadata is not guaranteed to be consistent. The >> mode of ‘errors=continue’ is unnecessary. > Well, firstly, errors=continue was always the best effort. There are no > guarantees which failures we are able to withstand and which not. > Generally, I think we try to withstand on-disk filesystem inconsistency but > not inconsistency coming from programming errors or other external factors > like out-of-memory conditions. Secondly, we already do abort the journal > when e.g. jbd2_journal_get_write_access() fails (although that generally > means some internal inconsistency) or when say revoke handling fails to > allocate memory for a revoke record. So it won't be a new thing. Thirdly, > and perhaps most importantly, you have found and fixed just one fairly > innocent problem happening due to in memory inode state getting > inconsistent after we fail to record the inode in the journal. There are > almost 80 callsites of ext4_mark_inode_dirty() and honestly I suspect that > e.g. inconsistent states resulting from extent tree manipulations being > aborted in the middle due to ext4_ext_dirty() failing due to ENOMEM will > also trigger all sorts of "interesting" behavior. So that's why I'd rather > abort the journal than try to continue when we almost certainly now we > cannot. > > One other possibility I could see dealing with this class of problems is > using __GFP_NOFAIL for inode buffer head allocation (through > sb_getblk_gfp()) in __ext4_get_inode_loc(). BTW, how did you trigger NULL > sb_getblk() return in the first place? AFAICS grow_buffers() already uses > __GFP_NOFAIL for all the allocations? > > Honza To be honest, I don't know syzkaller how to inject the NOMEM fault. If syzkaller rely on the memory fault injection mode provided by the kernel, should report null pointer access. Anyway, If inject a single point of IO fault, we still have to face the same situation. > >>>> --- >>>> fs/ext4/inode.c | 4 ++++ >>>> 1 file changed, 4 insertions(+) >>>> >>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c >>>> index 90fd6f7b6209..8adf1f802f6c 100644 >>>> --- a/fs/ext4/inode.c >>>> +++ b/fs/ext4/inode.c >>>> @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> if (attr->ia_valid & ATTR_SIZE) { >>>> handle_t *handle; >>>> loff_t oldsize = inode->i_size; >>>> + loff_t old_disksize; >>>> int shrink = (attr->ia_size < inode->i_size); >>>> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { >>>> @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> inode->i_sb->s_blocksize_bits); >>>> down_write(&EXT4_I(inode)->i_data_sem); >>>> + old_disksize = EXT4_I(inode)->i_disksize; >>>> EXT4_I(inode)->i_disksize = attr->ia_size; >>>> rc = ext4_mark_inode_dirty(handle, inode); >>>> if (!error) >>>> @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, >>>> */ >>>> if (!error) >>>> i_size_write(inode, attr->ia_size); >>>> + else >>>> + EXT4_I(inode)->i_disksize = old_disksize; >>>> up_write(&EXT4_I(inode)->i_data_sem); >>>> ext4_journal_stop(handle); >>>> if (error) >>>> -- >>>> 2.31.1 >>>>
On Thu, Apr 14, 2022 at 10:46:36AM +0800, yebin wrote: > To be honest, I don't know syzkaller how to inject the NOMEM > fault. If syzkaller rely on the memory fault injection mode provided > by the kernel, should report null pointer access. Anyway, If inject > a single point of IO fault, we still have to face the same > situation. Was this patch in response to a syzkaller report? There wasn't a Reported-by tag indicating that this came from syzkaller. If it did, and it came from syzkaller run by the syzkaller team (e.g., at https://syzkaller.appspot.com/upstream) could you include a reference to the syzkaller report? > On 2022/3/30 21:30, Jan Kara wrote: > > > Do you mean call jbd2_abort in ext4_reserve_inode_write() ? > > Yes. > > > > > If we abort journal when metadata is not guaranteed to be consistent. The > > > mode of ‘errors=continue’ is unnecessary. > > Well, firstly, errors=continue was always the best effort. There are no > > guarantees which failures we are able to withstand and which not. That's true; however, in general, if we can back out and recover from the error, we should, so that errors=continue can work. If we think that continuing will result in far more file system corruption and/or the error is from the journalling infrastructure itself, then aborting the journal makes sense. > > There are > > almost 80 callsites of ext4_mark_inode_dirty() and honestly I suspect that > > e.g. inconsistent states resulting from extent tree manipulations being > > aborted in the middle due to ext4_ext_dirty() failing due to ENOMEM will > > also trigger all sorts of "interesting" behavior. So that's why I'd rather > > abort the journal than try to continue when we almost certainly now we > > cannot. It would not be a bad thing for us to audit all of the callers of ext4_mark_inode_dirty() and ext4_reserve_inode_write(). We are getting things right in at least some of the callers (for example: ext4_mkdir). In any case, I'll take this patch, but if this was in response to a syzkaller report, please let me know with the syzkaller ID is, so I can update the commit before I send a pull request to Linus. Thanks! - Ted
On Sat, 26 Mar 2022 14:53:51 +0800, Ye Bin wrote: > We got issue as follows: > EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory > EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error > EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory > EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc > ------------[ cut here ]------------ > WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 > Modules linked in: > CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 > RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 > RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 > RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 > RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd > RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f > R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a > R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b > FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 > CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 > DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Call Trace: > do_iter_readv_writev+0x2e5/0x360 > do_iter_write+0x112/0x4c0 > do_pwritev+0x1e5/0x390 > __x64_sys_pwritev2+0x7e/0xa0 > do_syscall_64+0x37/0x50 > entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > [...] Applied, thanks! [1/1] ext4: fix warning in ext4_handle_inode_extension commit: f3e076d26874f4d54126c07ba51d8dd407008697 Best regards,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c index 90fd6f7b6209..8adf1f802f6c 100644 --- a/fs/ext4/inode.c +++ b/fs/ext4/inode.c @@ -5384,6 +5384,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, if (attr->ia_valid & ATTR_SIZE) { handle_t *handle; loff_t oldsize = inode->i_size; + loff_t old_disksize; int shrink = (attr->ia_size < inode->i_size); if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) { @@ -5455,6 +5456,7 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, inode->i_sb->s_blocksize_bits); down_write(&EXT4_I(inode)->i_data_sem); + old_disksize = EXT4_I(inode)->i_disksize; EXT4_I(inode)->i_disksize = attr->ia_size; rc = ext4_mark_inode_dirty(handle, inode); if (!error) @@ -5466,6 +5468,8 @@ int ext4_setattr(struct user_namespace *mnt_userns, struct dentry *dentry, */ if (!error) i_size_write(inode, attr->ia_size); + else + EXT4_I(inode)->i_disksize = old_disksize; up_write(&EXT4_I(inode)->i_data_sem); ext4_journal_stop(handle); if (error)
We got issue as follows: EXT4-fs error (device loop0) in ext4_reserve_inode_write:5741: Out of memory EXT4-fs error (device loop0): ext4_setattr:5462: inode #13: comm syz-executor.0: mark_inode_dirty error EXT4-fs error (device loop0) in ext4_setattr:5519: Out of memory EXT4-fs error (device loop0): ext4_ind_map_blocks:595: inode #13: comm syz-executor.0: Can't allocate blocks for non-extent mapped inodes with bigalloc ------------[ cut here ]------------ WARNING: CPU: 1 PID: 4361 at fs/ext4/file.c:301 ext4_file_write_iter+0x11c9/0x1220 Modules linked in: CPU: 1 PID: 4361 Comm: syz-executor.0 Not tainted 5.10.0+ #1 RIP: 0010:ext4_file_write_iter+0x11c9/0x1220 RSP: 0018:ffff924d80b27c00 EFLAGS: 00010282 RAX: ffffffff815a3379 RBX: 0000000000000000 RCX: 000000003b000000 RDX: ffff924d81601000 RSI: 00000000000009cc RDI: 00000000000009cd RBP: 000000000000000d R08: ffffffffbc5a2c6b R09: 0000902e0e52a96f R10: ffff902e2b7c1b40 R11: ffff902e2b7c1b40 R12: 000000000000000a R13: 0000000000000001 R14: ffff902e0e52aa10 R15: ffffffffffffff8b FS: 00007f81a7f65700(0000) GS:ffff902e3bc80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffff600400 CR3: 000000012db88001 CR4: 00000000003706e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: do_iter_readv_writev+0x2e5/0x360 do_iter_write+0x112/0x4c0 do_pwritev+0x1e5/0x390 __x64_sys_pwritev2+0x7e/0xa0 do_syscall_64+0x37/0x50 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Above issue may happen as follows: Assume inode.i_size=4096 EXT4_I(inode)->i_disksize=4096 step 1: set inode->i_isize = 8192 ext4_setattr if (attr->ia_size != inode->i_size) EXT4_I(inode)->i_disksize = attr->ia_size; rc = ext4_mark_inode_dirty ext4_reserve_inode_write ext4_get_inode_loc __ext4_get_inode_loc sb_getblk --> return -ENOMEM ... if (!error) ->will not update i_size i_size_write(inode, attr->ia_size); Now: inode.i_size=4096 EXT4_I(inode)->i_disksize=8192 step 2: Direct write 4096 bytes ext4_file_write_iter ext4_dio_write_iter iomap_dio_rw ->return error if (extend) ext4_handle_inode_extension WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize); ->Then trigger warning. To solve above issue, if mark inode dirty failed in ext4_setattr just set 'EXT4_I(inode)->i_disksize' with old value. Signed-off-by: Ye Bin <yebin10@huawei.com> --- fs/ext4/inode.c | 4 ++++ 1 file changed, 4 insertions(+)