diff mbox series

[03/22] ext4: Do not iput inode under running transaction in ext4_mkdir()

Message ID 20191003220613.10791-3-jack@suse.cz
State Superseded
Headers show
Series ext4: Fix transaction overflow due to revoke descriptors | expand

Commit Message

Jan Kara Oct. 3, 2019, 10:05 p.m. UTC
When ext4_mkdir() fails to add entry into directory, it ends up dropping
freshly created inode under the running transaction and thus inode
truncation happens under that transaction. That breaks assumptions that
ext4_evict_inode() does not get called from a transaction context
(although I'm not aware of any real issue) and is completely
unnecessary. Just stop the transaction before dropping inode reference.

CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/namei.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Comments

Theodore Ts'o Oct. 21, 2019, 1:21 a.m. UTC | #1
On Fri, Oct 04, 2019 at 12:05:49AM +0200, Jan Kara wrote:
> When ext4_mkdir() fails to add entry into directory, it ends up dropping
> freshly created inode under the running transaction and thus inode
> truncation happens under that transaction. That breaks assumptions that
> ext4_evict_inode() does not get called from a transaction context
> (although I'm not aware of any real issue) and is completely
> unnecessary. Just stop the transaction before dropping inode reference.
> 
> CC: stable@vger.kernel.org
> Signed-off-by: Jan Kara <jack@suse.cz>

If we call ext4_journal_stop(handle) before calling iput(inode),
there's a chance that we could crash with the inode with i_link_counts
== 0, but we won't have yet call ext4_evict_inode() to mark the inode
as free in the inode bitmap.  This would result in a inode leak.

Also, this isn't the only place where we can enter ext4_evict_inode()
with an active handle; the same situation arise in ext4_add_nondir(),
and for the same reason.

So I think the code is right as is.  Do you agree?

	     	       	       	     	- Ted
Jan Kara Oct. 24, 2019, 10:19 a.m. UTC | #2
On Sun 20-10-19 21:21:05, Theodore Y. Ts'o wrote:
> On Fri, Oct 04, 2019 at 12:05:49AM +0200, Jan Kara wrote:
> > When ext4_mkdir() fails to add entry into directory, it ends up dropping
> > freshly created inode under the running transaction and thus inode
> > truncation happens under that transaction. That breaks assumptions that
> > ext4_evict_inode() does not get called from a transaction context
> > (although I'm not aware of any real issue) and is completely
> > unnecessary. Just stop the transaction before dropping inode reference.
> > 
> > CC: stable@vger.kernel.org
> > Signed-off-by: Jan Kara <jack@suse.cz>
> 
> If we call ext4_journal_stop(handle) before calling iput(inode),
> there's a chance that we could crash with the inode with i_link_counts
> == 0, but we won't have yet call ext4_evict_inode() to mark the inode
> as free in the inode bitmap.  This would result in a inode leak.
> 
> Also, this isn't the only place where we can enter ext4_evict_inode()
> with an active handle; the same situation arise in ext4_add_nondir(),
> and for the same reason.
> 
> So I think the code is right as is.  Do you agree?

Correct on both points. Thanks for spotting this! Now I still don't think
that calling iput() with running transaction is good. It complicates
matters with revoke record reservation but it is also fragile for other
reasons - e.g. flush worker could find the allocated inode just before we
will call iput() on it, try to write it out, block on starting transaction
and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
or similar, that's why I say above it's just fragile, not an outright bug.

So I'd still prefer to do the iput() outside of transaction and we can
protect from leaking the inode in case of crash by adding it to orphan
list. I'll update the patch. Thanks for review!

								Honza
Theodore Ts'o Oct. 24, 2019, 12:09 p.m. UTC | #3
On Thu, Oct 24, 2019 at 12:19:06PM +0200, Jan Kara wrote:
> Correct on both points. Thanks for spotting this! Now I still don't think
> that calling iput() with running transaction is good. It complicates
> matters with revoke record reservation but it is also fragile for other
> reasons - e.g. flush worker could find the allocated inode just before we
> will call iput() on it, try to write it out, block on starting transaction
> and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
> inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
> or similar, that's why I say above it's just fragile, not an outright bug.

But we don't ever write the inode itself via
inode_wait_for_writeback(), because how ext4 journalling works.  (See
the comments before ext4_mark_inode_dirty()).  And for the special
inodes (directories, device nodes, etc.) there's no data dirtyness to
worry about.  For regular files, we hit this code path when have just
created the inode, but were not able to add a link to the parent
directory; the fd wasn't been released to userspace yet, so it can't
be data dirty either.

So unless I'm missing something, I don't think the deadlock described
above is possible?

We can certainly add it to the orphan list if it's necessary, but it's
extra overhead and adds a global contention point.  So if it's not
necessary, I'd rather avoid it if possible, and I think it's safe to
do so in this case.

						- Ted
Jan Kara Oct. 24, 2019, 1:37 p.m. UTC | #4
On Thu 24-10-19 08:09:58, Theodore Y. Ts'o wrote:
> On Thu, Oct 24, 2019 at 12:19:06PM +0200, Jan Kara wrote:
> > Correct on both points. Thanks for spotting this! Now I still don't think
> > that calling iput() with running transaction is good. It complicates
> > matters with revoke record reservation but it is also fragile for other
> > reasons - e.g. flush worker could find the allocated inode just before we
> > will call iput() on it, try to write it out, block on starting transaction
> > and we get a deadlock with inode_wait_for_writeback() inside evict(). Now
> > inode *probably* won't be dirty yet by the time we get to ext4_add_nondir()
> > or similar, that's why I say above it's just fragile, not an outright bug.
> 
> But we don't ever write the inode itself via
> inode_wait_for_writeback(), because how ext4 journalling works.  (See
> the comments before ext4_mark_inode_dirty()).  And for the special
> inodes (directories, device nodes, etc.) there's no data dirtyness to
> worry about.  For regular files, we hit this code path when have just
> created the inode, but were not able to add a link to the parent
> directory; the fd wasn't been released to userspace yet, so it can't
> be data dirty either.
> 
> So unless I'm missing something, I don't think the deadlock described
> above is possible?

Actually, now that I look at it, large symlinks may be prone to this
deadlock. There we create unlinked inode, add it to orphan list, stop
transaction, call __page_symlink() which will dirty the inode through
mark_inode_dirty(), then we start transaction and call ext4_add_nondir()
which may call iput() while the transaction is started.

Granted we can fix just ext4_symlink() but it kind of demonstrates my point
that calling iput() under transaction is fragile - some of the stuff done
on last iput generaly ranks above transaction start, just in cases we clean
up failed create none of them happens to block currently (except for the
symlink case mentioned above). And also lockdep does not track dependencies
like inode_wait_for_writeback() as otherwise it would complain as well.

> We can certainly add it to the orphan list if it's necessary, but it's
> extra overhead and adds a global contention point.  So if it's not
> necessary, I'd rather avoid it if possible, and I think it's safe to
> do so in this case.

As this is error cleanup path (only EIO and ENOSPC are realistic failure
cases AFAICT) I don't think performance really matters here. I certainly
don't want to add inode to orphan list in the fast path. I agree that would
be non-starter. I'll try to write a patch and we'll see how bad it will be.
If you still hate it, I can have a look into how bad it would be to fix
ext4_symlink() and somehow deal with revoke reservation issues.

								Honza
Theodore Ts'o Nov. 4, 2019, 12:35 p.m. UTC | #5
On Thu, Oct 24, 2019 at 03:37:01PM +0200, Jan Kara wrote:
> > We can certainly add it to the orphan list if it's necessary, but it's
> > extra overhead and adds a global contention point.  So if it's not
> > necessary, I'd rather avoid it if possible, and I think it's safe to
> > do so in this case.
> 
> As this is error cleanup path (only EIO and ENOSPC are realistic failure
> cases AFAICT) I don't think performance really matters here. I certainly
> don't want to add inode to orphan list in the fast path. I agree that would
> be non-starter. I'll try to write a patch and we'll see how bad it will be.
> If you still hate it, I can have a look into how bad it would be to fix
> ext4_symlink() and somehow deal with revoke reservation issues.

That seems fair; I agree, adding inodes to the list in the error path
shouldn't be an issue.

					- Ted
diff mbox series

Patch

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index a427d2031a8d..9c872a33aea7 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2781,8 +2781,9 @@  static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		clear_nlink(inode);
 		unlock_new_inode(inode);
 		ext4_mark_inode_dirty(handle, inode);
+		ext4_journal_stop(handle);
 		iput(inode);
-		goto out_stop;
+		goto out_retry;
 	}
 	ext4_inc_count(handle, dir);
 	ext4_update_dx_flag(dir);
@@ -2796,6 +2797,7 @@  static int ext4_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 out_stop:
 	if (handle)
 		ext4_journal_stop(handle);
+out_retry:
 	if (err == -ENOSPC && ext4_should_retry_alloc(dir->i_sb, &retries))
 		goto retry;
 	return err;