Patchwork [URGENT] ext4: fix potential deadlock in ext4_evict_inode()

login
register
mail settings
Submitter Theodore Ts'o
Date Aug. 26, 2011, 3:33 a.m.
Message ID <E1QwnAu-00087H-8X@tytso-glaptop.cam.corp.google.com>
Download mbox | patch
Permalink /patch/111702/
State Superseded
Headers show

Comments

Theodore Ts'o - Aug. 26, 2011, 3:33 a.m.
Note: this will probably need to be sent to Linus as an emergency
bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
regression.

Jiayingz, I'd appreciate if you could review this, since this is a
partial undo of commit 2581fdc810, which you authored.  I don't think
taking out the call to ext4_flush_complted_IO() should should cause any
problems, since it should only delay how long it takes for an inode to
be evicted, and in some cases we are already waiting for a truncate or
journal commit to complete.  But I don't want to take any chances, so a
second pair of eyes would be appreciated.  Thanks!!

	      	     	     	  	   - Ted

From 18271e31ece46955c0fd61e726fa7540fddf8924 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Thu, 25 Aug 2011 23:26:01 -0400
Subject: [PATCH] ext4: fix potential deadlock in ext4_evict_inode()

Commit 2581fdc810 moved ext4_ioend_wait() from ext4_destroy_inode() to
ext4_evict_inode().  It also added code to explicitly call
ext4_flush_completed_IO(inode):

	mutex_lock(&inode->i_mutex);
	ext4_flush_completed_IO(inode);
	mutex_unlock(&inode->i_mutex);

Unfortunately, we can't take the i_mutex lock in ext4_evict_inode()
without potentially causing a deadlock.

Fix this by removing the code sequence altogether.  This may result in
ext4_evict_inode() taking longer to complete, but that's ok, we're not
in a rush here.  That just means we have to wait until the workqueue
is scheduled, which is OK; there's nothing that says we have to do
this work on the current thread, which would require taking a lock
that might lead to a deadlock condition.

See Kernel Bugzilla #41682 for one example of the circular locking
problem that arise.  Another one can be seen here:
Tao Ma - Aug. 26, 2011, 3:56 a.m.
Hi Ted,
On 08/26/2011 11:33 AM, Theodore Ts'o wrote:
> Note: this will probably need to be sent to Linus as an emergency
> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> regression.
> 
> Jiayingz, I'd appreciate if you could review this, since this is a
> partial undo of commit 2581fdc810, which you authored.  I don't think
> taking out the call to ext4_flush_complted_IO() should should cause any
> problems, since it should only delay how long it takes for an inode to
> be evicted, and in some cases we are already waiting for a truncate or
> journal commit to complete.  But I don't want to take any chances, so a
> second pair of eyes would be appreciated.  Thanks!!
I do agree that the revert can help to resolve that lockdep issue, but I
think jiaying's patch and the deadlock described in her commit log does
make sense. So I am working on another way to resolve it and hope to
send it out today. Please review the patch when it is ready.

Thanks
Tao
> 
> 	      	     	     	  	   - Ted
> 
> From 18271e31ece46955c0fd61e726fa7540fddf8924 Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Thu, 25 Aug 2011 23:26:01 -0400
> Subject: [PATCH] ext4: fix potential deadlock in ext4_evict_inode()
> 
> Commit 2581fdc810 moved ext4_ioend_wait() from ext4_destroy_inode() to
> ext4_evict_inode().  It also added code to explicitly call
> ext4_flush_completed_IO(inode):
> 
> 	mutex_lock(&inode->i_mutex);
> 	ext4_flush_completed_IO(inode);
> 	mutex_unlock(&inode->i_mutex);
> 
> Unfortunately, we can't take the i_mutex lock in ext4_evict_inode()
> without potentially causing a deadlock.
> 
> Fix this by removing the code sequence altogether.  This may result in
> ext4_evict_inode() taking longer to complete, but that's ok, we're not
> in a rush here.  That just means we have to wait until the workqueue
> is scheduled, which is OK; there's nothing that says we have to do
> this work on the current thread, which would require taking a lock
> that might lead to a deadlock condition.
> 
> See Kernel Bugzilla #41682 for one example of the circular locking
> problem that arise.  Another one can be seen here:
> 
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.1.0-rc3-00012-g2a22fc1 #1839
> -------------------------------------------------------
> dd/7677 is trying to acquire lock:
>  (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
> 
> but task is already holding lock:
>  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
>        [<c06a572b>] mutex_lock_nested+0x26/0x2f
>        [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
>        [<c0214bb0>] evict+0x8e/0x131
>        [<c0214de6>] dispose_list+0x36/0x40
>        [<c0215239>] evict_inodes+0xcd/0xd5
>        [<c0204a23>] generic_shutdown_super+0x3d/0xaa
>        [<c0204ab2>] kill_block_super+0x22/0x5e
>        [<c0204cb8>] deactivate_locked_super+0x22/0x4e
>        [<c02055b2>] deactivate_super+0x3d/0x43
>        [<c0218427>] mntput_no_expire+0xda/0xdf
>        [<c0219486>] sys_umount+0x286/0x2ab
>        [<c02194bd>] sys_oldumount+0x12/0x14
>        [<c06a6ac5>] syscall_call+0x7/0xb
> 
> -> #0 (&type->s_umount_key#18){++++..}:
>        [<c018e262>] __lock_acquire+0x967/0xbd2
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a5991>] down_read+0x28/0x65
>        [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>        [<c0269630>] ext4_nonda_switch+0xd0/0xe1
>        [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
>        [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
>        [<c01d58d3>] __generic_file_aio_write+0x254/0x285
>        [<c01d596e>] generic_file_aio_write+0x6a/0xba
>        [<c026732f>] ext4_file_write+0x1d6/0x227
>        [<c0202789>] do_sync_write+0x8f/0xca
>        [<c02030d5>] vfs_write+0x85/0xe3
>        [<c02031d4>] sys_write+0x40/0x65
>        [<c06a6ac5>] syscall_call+0x7/0xb
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=41682
> 
> Cc: stable@kernel.org
> Cc: Jiaying Zhang <jiayingz@google.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/inode.c |    3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 29b7148..cf0b515 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -121,9 +121,6 @@ void ext4_evict_inode(struct inode *inode)
>  
>  	trace_ext4_evict_inode(inode);
>  
> -	mutex_lock(&inode->i_mutex);
> -	ext4_flush_completed_IO(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	ext4_ioend_wait(inode);
>  
>  	if (inode->i_nlink) {

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 26, 2011, 7:35 a.m.
On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
> 
> Note: this will probably need to be sent to Linus as an emergency
> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> regression.

It doesn't appear to be a bug. All of the new ext4 lockdep reports
in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
positives....

.....
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.1.0-rc3-00012-g2a22fc1 #1839
> -------------------------------------------------------
> dd/7677 is trying to acquire lock:
>  (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
> 
> but task is already holding lock:
>  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba
> 
> which lock already depends on the new lock.
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
>        [<c06a572b>] mutex_lock_nested+0x26/0x2f
>        [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
>        [<c0214bb0>] evict+0x8e/0x131
>        [<c0214de6>] dispose_list+0x36/0x40
>        [<c0215239>] evict_inodes+0xcd/0xd5
>        [<c0204a23>] generic_shutdown_super+0x3d/0xaa
>        [<c0204ab2>] kill_block_super+0x22/0x5e
>        [<c0204cb8>] deactivate_locked_super+0x22/0x4e
>        [<c02055b2>] deactivate_super+0x3d/0x43
>        [<c0218427>] mntput_no_expire+0xda/0xdf
>        [<c0219486>] sys_umount+0x286/0x2ab
>        [<c02194bd>] sys_oldumount+0x12/0x14
>        [<c06a6ac5>] syscall_call+0x7/0xb
> 
> -> #0 (&type->s_umount_key#18){++++..}:
>        [<c018e262>] __lock_acquire+0x967/0xbd2
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a5991>] down_read+0x28/0x65
>        [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>        [<c0269630>] ext4_nonda_switch+0xd0/0xe1
>        [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
>        [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
>        [<c01d58d3>] __generic_file_aio_write+0x254/0x285
>        [<c01d596e>] generic_file_aio_write+0x6a/0xba
>        [<c026732f>] ext4_file_write+0x1d6/0x227
>        [<c0202789>] do_sync_write+0x8f/0xca
>        [<c02030d5>] vfs_write+0x85/0xe3
>        [<c02031d4>] sys_write+0x40/0x65
>        [<c06a6ac5>] syscall_call+0x7/0xb

That's definitely a false positive - sys_write() will have an active
reference to the inode, and evict is only called on inodes without
active references. Hence you can never get a deadlock between an
inode context with an active reference and the same inode in the
evict/dispose path because inode cannot be in both places at once...

This is why XFS changes the lockdep context for the its iolock as
soon as .evict is called on the inode - to stop these false
positives from being emitted whenever memory reclaim or unmount
evicts inodes.

Cheers,

dave.
Tao Ma - Aug. 26, 2011, 7:42 a.m.
Hi Dave,
On 08/26/2011 03:35 PM, Dave Chinner wrote:
> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>
>> Note: this will probably need to be sent to Linus as an emergency
>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>> regression.
> 
> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> positives....
> 
> .....
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 3.1.0-rc3-00012-g2a22fc1 #1839
>> -------------------------------------------------------
>> dd/7677 is trying to acquire lock:
>>  (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>>
>> but task is already holding lock:
>>  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
>>        [<c018eb02>] lock_acquire+0x99/0xbd
>>        [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
>>        [<c06a572b>] mutex_lock_nested+0x26/0x2f
>>        [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
>>        [<c0214bb0>] evict+0x8e/0x131
>>        [<c0214de6>] dispose_list+0x36/0x40
>>        [<c0215239>] evict_inodes+0xcd/0xd5
>>        [<c0204a23>] generic_shutdown_super+0x3d/0xaa
>>        [<c0204ab2>] kill_block_super+0x22/0x5e
>>        [<c0204cb8>] deactivate_locked_super+0x22/0x4e
>>        [<c02055b2>] deactivate_super+0x3d/0x43
>>        [<c0218427>] mntput_no_expire+0xda/0xdf
>>        [<c0219486>] sys_umount+0x286/0x2ab
>>        [<c02194bd>] sys_oldumount+0x12/0x14
>>        [<c06a6ac5>] syscall_call+0x7/0xb
>>
>> -> #0 (&type->s_umount_key#18){++++..}:
>>        [<c018e262>] __lock_acquire+0x967/0xbd2
>>        [<c018eb02>] lock_acquire+0x99/0xbd
>>        [<c06a5991>] down_read+0x28/0x65
>>        [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>>        [<c0269630>] ext4_nonda_switch+0xd0/0xe1
>>        [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
>>        [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
>>        [<c01d58d3>] __generic_file_aio_write+0x254/0x285
>>        [<c01d596e>] generic_file_aio_write+0x6a/0xba
>>        [<c026732f>] ext4_file_write+0x1d6/0x227
>>        [<c0202789>] do_sync_write+0x8f/0xca
>>        [<c02030d5>] vfs_write+0x85/0xe3
>>        [<c02031d4>] sys_write+0x40/0x65
>>        [<c06a6ac5>] syscall_call+0x7/0xb
> 
> That's definitely a false positive - sys_write() will have an active
> reference to the inode, and evict is only called on inodes without
> active references. Hence you can never get a deadlock between an
> inode context with an active reference and the same inode in the
> evict/dispose path because inode cannot be in both places at once...
yeah, the fact is that lockdep isn't that smart. ;)
> 
> This is why XFS changes the lockdep context for the its iolock as
> soon as .evict is called on the inode - to stop these false
> positives from being emitted whenever memory reclaim or unmount
> evicts inodes.
I don't think ext4 can change the lockdep context here since we have
another ext4_end_io_work which can work and have mutex_lock(i_mutex)
with inode->i_count=0. It isn't safe for us to abruptly change the
lockdep here I guess.

What I am trying to do here is to avoid the ext4_end_io_work in case of
i_count = 0, so that we can either change the lockdep context or remove
the mutex_lock here completely.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 26, 2011, 8:44 a.m.
On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
> > 
> > Note: this will probably need to be sent to Linus as an emergency
> > bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> > regression.
> 
> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> positives....

While the lockdep report is false positive, I agree that your
change is the right fix to make - the IO completions are already
queued on the workqueue, so they don't need to be flushed to get
them to complete. All that needs to be done is call
ext4_ioend_wait() for them to complete, and that gets rid of the
i_mutex altogether. (*)

Sorry for the noise, Ted.

Cheers,

Dave.

(*) Even better would be to get rid of that flush list altogether
and just use ext4_ioend_wait() everywhere....
Christoph Hellwig - Aug. 26, 2011, 8:50 a.m.
On Fri, Aug 26, 2011 at 06:44:03PM +1000, Dave Chinner wrote:
> While the lockdep report is false positive, I agree that your
> change is the right fix to make - the IO completions are already
> queued on the workqueue, so they don't need to be flushed to get
> them to complete. All that needs to be done is call
> ext4_ioend_wait() for them to complete, and that gets rid of the
> i_mutex altogether. (*)

The really correct fix is to stop using the ioend wait like I did
in my pending series for XFS.  The only thing preventing ext4 from
doing that is that it marks pages uptodate from irq context,
before finishing the ioends.  This has a lot of nasty implications
as the the page might be marked uptodate without actually having
that state on disk.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 26, 2011, 9:03 a.m.
On 08/26/2011 04:44 PM, Dave Chinner wrote:
> On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
>> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>>
>>> Note: this will probably need to be sent to Linus as an emergency
>>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>>> regression.
>>
>> It doesn't appear to be a bug. All of the new ext4 lockdep reports
>> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
>> positives....
> 
> While the lockdep report is false positive, I agree that your
> change is the right fix to make - the IO completions are already
> queued on the workqueue, so they don't need to be flushed to get
> them to complete. All that needs to be done is call
> ext4_ioend_wait() for them to complete, and that gets rid of the
> i_mutex altogether. (*)
ext4_ioend_wait can't work here for a nasty bug. Please see the commit
log of 2581fdc8.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 26, 2011, 9:10 a.m.
Hi Christoph,
On 08/26/2011 04:50 PM, Christoph Hellwig wrote:
> On Fri, Aug 26, 2011 at 06:44:03PM +1000, Dave Chinner wrote:
>> While the lockdep report is false positive, I agree that your
>> change is the right fix to make - the IO completions are already
>> queued on the workqueue, so they don't need to be flushed to get
>> them to complete. All that needs to be done is call
>> ext4_ioend_wait() for them to complete, and that gets rid of the
>> i_mutex altogether. (*)
> 
> The really correct fix is to stop using the ioend wait like I did
> in my pending series for XFS.  The only thing preventing ext4 from
> doing that is that it marks pages uptodate from irq context,
> before finishing the ioends.  This has a lot of nasty implications
> as the the page might be marked uptodate without actually having
> that state on disk.
yeah, the DIO read on a buffer write file issue is also another side
effect of it.[1]
So could you please talk a little more about how xfs handle it now?

[1] http://www.spinics.net/lists/linux-ext4/msg27139.html

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Christoph Hellwig - Aug. 26, 2011, 9:17 a.m.
On Fri, Aug 26, 2011 at 05:10:35PM +0800, Tao Ma wrote:
> yeah, the DIO read on a buffer write file issue is also another side
> effect of it.[1]
> So could you please talk a little more about how xfs handle it now?

The thing I have queued up for 3.2 makes it very simple:  we do not
track I/O ends any more at all, outside of the workqueue.

For buffered I/O we only mark the page uptodate when all unwritten
extent conversion and size updates have finished.  All data integrity
callers and inode eviction wait for the pages to be update so we are
covered.

For direct I/O we only call inode_dio_done and aio_complete once all
unwritten extent size updates are done.  Inodes can't be evicted until
we drop a reference to the inode, which can't happen until the
sync or async dio is done and we drop the inode reference the VFS
holds for it.  Sync and fsync are only guaranteed to pick up I/O
that has returned to userspace, so we are covered for that as well.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Chinner - Aug. 26, 2011, 9:24 a.m.
On Fri, Aug 26, 2011 at 05:03:14PM +0800, Tao Ma wrote:
> On 08/26/2011 04:44 PM, Dave Chinner wrote:
> > On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
> >> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
> >>>
> >>> Note: this will probably need to be sent to Linus as an emergency
> >>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> >>> regression.
> >>
> >> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> >> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> >> positives....
> > 
> > While the lockdep report is false positive, I agree that your
> > change is the right fix to make - the IO completions are already
> > queued on the workqueue, so they don't need to be flushed to get
> > them to complete. All that needs to be done is call
> > ext4_ioend_wait() for them to complete, and that gets rid of the
> > i_mutex altogether. (*)
> ext4_ioend_wait can't work here for a nasty bug. Please see the commit
> log of 2581fdc8.

Unless I'm missing something, the described race with
ext4_truncate() flushing completions without the i_mutex lock held
cannot occur if you've already waited for all pending completions to
drain by calling ext4_ioend_wait()....

Cheers,

Dave.
Tao Ma - Aug. 26, 2011, 9:27 a.m.
On 08/26/2011 05:24 PM, Dave Chinner wrote:
> On Fri, Aug 26, 2011 at 05:03:14PM +0800, Tao Ma wrote:
>> On 08/26/2011 04:44 PM, Dave Chinner wrote:
>>> On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
>>>> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>>>>
>>>>> Note: this will probably need to be sent to Linus as an emergency
>>>>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>>>>> regression.
>>>>
>>>> It doesn't appear to be a bug. All of the new ext4 lockdep reports
>>>> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
>>>> positives....
>>>
>>> While the lockdep report is false positive, I agree that your
>>> change is the right fix to make - the IO completions are already
>>> queued on the workqueue, so they don't need to be flushed to get
>>> them to complete. All that needs to be done is call
>>> ext4_ioend_wait() for them to complete, and that gets rid of the
>>> i_mutex altogether. (*)
>> ext4_ioend_wait can't work here for a nasty bug. Please see the commit
>> log of 2581fdc8.
> 
> Unless I'm missing something, the described race with
> ext4_truncate() flushing completions without the i_mutex lock held
> cannot occur if you've already waited for all pending completions to
> drain by calling ext4_ioend_wait()....
No, it doesn't mean the ext4_truncate. But another race pasted below.

Flush inode's i_completed_io_list before calling ext4_io_wait to
prevent the following deadlock scenario: A page fault happens while
some process is writing inode A. During page fault,
shrink_icache_memory is called that in turn evicts another inode
B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
that waits for inode B's i_ioend_count to become zero. However, inode
B's ioend work was queued behind some of inode A's ioend work on the
same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
thread on that cpu is processing inode A's ioend work, it tries to
grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
still hold before the page fault happened, we enter a deadlock.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 26, 2011, 9:28 a.m.
Hi Ted,
On 08/26/2011 11:56 AM, Tao Ma wrote:
> Hi Ted,
> On 08/26/2011 11:33 AM, Theodore Ts'o wrote:
>> Note: this will probably need to be sent to Linus as an emergency
>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>> regression.
>>
>> Jiayingz, I'd appreciate if you could review this, since this is a
>> partial undo of commit 2581fdc810, which you authored.  I don't think
>> taking out the call to ext4_flush_complted_IO() should should cause any
>> problems, since it should only delay how long it takes for an inode to
>> be evicted, and in some cases we are already waiting for a truncate or
>> journal commit to complete.  But I don't want to take any chances, so a
>> second pair of eyes would be appreciated.  Thanks!!
> I do agree that the revert can help to resolve that lockdep issue, but I
> think jiaying's patch and the deadlock described in her commit log does
> make sense. So I am working on another way to resolve it and hope to
> send it out today. Please review the patch when it is ready.
I am afraid I have to take my word back. The change is too complicated
and I don't think it is OK now for such a later rc. Sorry.

As there are so many users complain about this lockdep warning, we'd
better send it out to Linus ASAP although it can't resolve the nasty bug
described in the commit log by Jiaying.

Thanks
Tao
> 
> Thanks
> Tao
>>
>> 	      	     	     	  	   - Ted
>>
>> From 18271e31ece46955c0fd61e726fa7540fddf8924 Mon Sep 17 00:00:00 2001
>> From: Theodore Ts'o <tytso@mit.edu>
>> Date: Thu, 25 Aug 2011 23:26:01 -0400
>> Subject: [PATCH] ext4: fix potential deadlock in ext4_evict_inode()
>>
>> Commit 2581fdc810 moved ext4_ioend_wait() from ext4_destroy_inode() to
>> ext4_evict_inode().  It also added code to explicitly call
>> ext4_flush_completed_IO(inode):
>>
>> 	mutex_lock(&inode->i_mutex);
>> 	ext4_flush_completed_IO(inode);
>> 	mutex_unlock(&inode->i_mutex);
>>
>> Unfortunately, we can't take the i_mutex lock in ext4_evict_inode()
>> without potentially causing a deadlock.
>>
>> Fix this by removing the code sequence altogether.  This may result in
>> ext4_evict_inode() taking longer to complete, but that's ok, we're not
>> in a rush here.  That just means we have to wait until the workqueue
>> is scheduled, which is OK; there's nothing that says we have to do
>> this work on the current thread, which would require taking a lock
>> that might lead to a deadlock condition.
>>
>> See Kernel Bugzilla #41682 for one example of the circular locking
>> problem that arise.  Another one can be seen here:
>>
>> =======================================================
>> [ INFO: possible circular locking dependency detected ]
>> 3.1.0-rc3-00012-g2a22fc1 #1839
>> -------------------------------------------------------
>> dd/7677 is trying to acquire lock:
>>  (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>>
>> but task is already holding lock:
>>  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba
>>
>> which lock already depends on the new lock.
>>
>> the existing dependency chain (in reverse order) is:
>>
>> -> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
>>        [<c018eb02>] lock_acquire+0x99/0xbd
>>        [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
>>        [<c06a572b>] mutex_lock_nested+0x26/0x2f
>>        [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
>>        [<c0214bb0>] evict+0x8e/0x131
>>        [<c0214de6>] dispose_list+0x36/0x40
>>        [<c0215239>] evict_inodes+0xcd/0xd5
>>        [<c0204a23>] generic_shutdown_super+0x3d/0xaa
>>        [<c0204ab2>] kill_block_super+0x22/0x5e
>>        [<c0204cb8>] deactivate_locked_super+0x22/0x4e
>>        [<c02055b2>] deactivate_super+0x3d/0x43
>>        [<c0218427>] mntput_no_expire+0xda/0xdf
>>        [<c0219486>] sys_umount+0x286/0x2ab
>>        [<c02194bd>] sys_oldumount+0x12/0x14
>>        [<c06a6ac5>] syscall_call+0x7/0xb
>>
>> -> #0 (&type->s_umount_key#18){++++..}:
>>        [<c018e262>] __lock_acquire+0x967/0xbd2
>>        [<c018eb02>] lock_acquire+0x99/0xbd
>>        [<c06a5991>] down_read+0x28/0x65
>>        [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>>        [<c0269630>] ext4_nonda_switch+0xd0/0xe1
>>        [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
>>        [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
>>        [<c01d58d3>] __generic_file_aio_write+0x254/0x285
>>        [<c01d596e>] generic_file_aio_write+0x6a/0xba
>>        [<c026732f>] ext4_file_write+0x1d6/0x227
>>        [<c0202789>] do_sync_write+0x8f/0xca
>>        [<c02030d5>] vfs_write+0x85/0xe3
>>        [<c02031d4>] sys_write+0x40/0x65
>>        [<c06a6ac5>] syscall_call+0x7/0xb
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=41682
>>
>> Cc: stable@kernel.org
>> Cc: Jiaying Zhang <jiayingz@google.com>
>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>> ---
>>  fs/ext4/inode.c |    3 ---
>>  1 files changed, 0 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index 29b7148..cf0b515 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -121,9 +121,6 @@ void ext4_evict_inode(struct inode *inode)
>>  
>>  	trace_ext4_evict_inode(inode);
>>  
>> -	mutex_lock(&inode->i_mutex);
>> -	ext4_flush_completed_IO(inode);
>> -	mutex_unlock(&inode->i_mutex);
>>  	ext4_ioend_wait(inode);
>>  
>>  	if (inode->i_nlink) {
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Aug. 26, 2011, 11:35 a.m.
On Aug 26, 2011, at 5:17 AM, Christoph Hellwig wrote:

> The thing I have queued up for 3.2 makes it very simple:  we do not
> track I/O ends any more at all, outside of the workqueue.
> 
> For buffered I/O we only mark the page uptodate when all unwritten
> extent conversion and size updates have finished.  All data integrity
> callers and inode eviction wait for the pages to be update so we are
> covered.
> 
> For direct I/O we only call inode_dio_done and aio_complete once all
> unwritten extent size updates are done.  Inodes can't be evicted until
> we drop a reference to the inode, which can't happen until the
> sync or async dio is done and we drop the inode reference the VFS
> holds for it.  Sync and fsync are only guaranteed to pick up I/O
> that has returned to userspace, so we are covered for that as well.

Long term, I definitely want to make ext4 do something similar. 
What we have now is just way too fragile…

-- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Tao Ma - Aug. 26, 2011, 2:53 p.m.
On 08/26/2011 07:35 PM, Theodore Tso wrote:
> 
> On Aug 26, 2011, at 5:17 AM, Christoph Hellwig wrote:
> 
>> The thing I have queued up for 3.2 makes it very simple:  we do not
>> track I/O ends any more at all, outside of the workqueue.
>>
>> For buffered I/O we only mark the page uptodate when all unwritten
>> extent conversion and size updates have finished.  All data integrity
>> callers and inode eviction wait for the pages to be update so we are
>> covered.
>>
>> For direct I/O we only call inode_dio_done and aio_complete once all
>> unwritten extent size updates are done.  Inodes can't be evicted until
>> we drop a reference to the inode, which can't happen until the
>> sync or async dio is done and we drop the inode reference the VFS
>> holds for it.  Sync and fsync are only guaranteed to pick up I/O
>> that has returned to userspace, so we are covered for that as well.
> 
> Long term, I definitely want to make ext4 do something similar. 
> What we have now is just way too fragile…
yeah, actually I have done some basic tests about letting
ext4_free_io_end to clear the page writeback flag for us after the
unwritten extent conversion, and it does have several problems with both
ext4 and jbd2. I will try to write up some solution for review.

Thanks
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Aug. 26, 2011, 3:52 p.m.
On Fri, Aug 26, 2011 at 05:27:39PM +0800, Tao Ma wrote:
> No, it doesn't mean the ext4_truncate. But another race pasted below.
> 
> Flush inode's i_completed_io_list before calling ext4_io_wait to
> prevent the following deadlock scenario: A page fault happens while
> some process is writing inode A. During page fault,
> shrink_icache_memory is called that in turn evicts another inode
> B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
> that waits for inode B's i_ioend_count to become zero. However, inode
> B's ioend work was queued behind some of inode A's ioend work on the
> same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
> thread on that cpu is processing inode A's ioend work, it tries to
> grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
> still hold before the page fault happened, we enter a deadlock.

... but that shouldn't be a problem since we're not holding A's
i_mutex at this point, right?  Or am I missing something?

	   	       	       	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 26, 2011, 4:58 p.m.
Hi Ted,

On Fri, Aug 26, 2011 at 8:52 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Fri, Aug 26, 2011 at 05:27:39PM +0800, Tao Ma wrote:
>> No, it doesn't mean the ext4_truncate. But another race pasted below.
>>
>> Flush inode's i_completed_io_list before calling ext4_io_wait to
>> prevent the following deadlock scenario: A page fault happens while
>> some process is writing inode A. During page fault,
>> shrink_icache_memory is called that in turn evicts another inode
>> B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
>> that waits for inode B's i_ioend_count to become zero. However, inode
>> B's ioend work was queued behind some of inode A's ioend work on the
>> same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
>> thread on that cpu is processing inode A's ioend work, it tries to
>> grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
>> still hold before the page fault happened, we enter a deadlock.
>
> ... but that shouldn't be a problem since we're not holding A's
> i_mutex at this point, right?  Or am I missing something?
I think it is possible that we are holding A's i_mutex lock if the page
fault happens while we are writing inode A. The problem is if we call
ext4_evict_inode for inode B during the page fault handling and we
just call ext4_ioend_wait() to wait for inode B's i_ioend_count to
become zero, we rely on the ext4-dio-unwritten worker thread to
finish any queued work at some time. But as mentioned in the change
commit log, B's io_end work may be queued after A's work on the
same cpu. Since A's i_mutex lock may be still hold during the page
fault time, the ext4-dio-unwritten worker thread can't make progress.

Now thinking about an alternative approach to resolve the deadlock
mentioned above, maybe we can use mutex_trylock() in
ext4_end_io_work() and if we can't grab the mutex lock for an inode,
just requeue the work to the end of workqueue?

Jiaying
>
>                                       - Ted
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Aug. 26, 2011, 8:22 p.m.
On Fri, Aug 26, 2011 at 09:58:45AM -0700, Jiaying Zhang wrote:
> Now thinking about an alternative approach to resolve the deadlock
> mentioned above, maybe we can use mutex_trylock() in
> ext4_end_io_work() and if we can't grab the mutex lock for an inode,
> just requeue the work to the end of workqueue?

Good idea!  That should speed up work queue processing in general, I
think.

The downside is that inodes that currently locked might take longer to
complete.  In the case of fsync() we'll just force the I/O completion
to happen in the context of the fsync'ing process, so I don't think it
should be a problem in practice I think.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 27, 2011, 5:17 a.m.
On Fri, Aug 26, 2011 at 1:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Fri, Aug 26, 2011 at 09:58:45AM -0700, Jiaying Zhang wrote:
>> Now thinking about an alternative approach to resolve the deadlock
>> mentioned above, maybe we can use mutex_trylock() in
>> ext4_end_io_work() and if we can't grab the mutex lock for an inode,
>> just requeue the work to the end of workqueue?
>
> Good idea!  That should speed up work queue processing in general, I
> think.
>
> The downside is that inodes that currently locked might take longer to
> complete.  In the case of fsync() we'll just force the I/O completion
> to happen in the context of the fsync'ing process, so I don't think it
> should be a problem in practice I think.
>
Ted,

I am working on a patch to use mutex_trylock() in ext4_end_io_work()
so that we can fix the described deadlock without needing to call
mutex_lock() and  ext4_flush_completed_IO() in ext4_evict_inode().
I run into some problem while testing it. Given that the deadlock my
original patch intended to fix only exists with dioread_nolock enabled
but the lockdep issue happens in all of cases, I think we should roll
that part back as you have planned. I am going to send a separate patch
later to fix the deadlock issue once I resolved the problem found in
my test.

Jiaying

>                                        - Ted
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jiaying Zhang - Aug. 31, 2011, 1:15 a.m.
FYI, I just sent out a patch to the linux-ext4@ that uses
mutex_trylock() in ext4_end_io_work() and removes the i_mutex
lock and ext4_flush_completed_IO() in ext4_evict_inode():
http://www.spinics.net/lists/linux-ext4/msg27407.html

Jiaying

On Fri, Aug 26, 2011 at 10:17 PM, Jiaying Zhang <jiayingz@google.com> wrote:
> On Fri, Aug 26, 2011 at 1:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
>> On Fri, Aug 26, 2011 at 09:58:45AM -0700, Jiaying Zhang wrote:
>>> Now thinking about an alternative approach to resolve the deadlock
>>> mentioned above, maybe we can use mutex_trylock() in
>>> ext4_end_io_work() and if we can't grab the mutex lock for an inode,
>>> just requeue the work to the end of workqueue?
>>
>> Good idea!  That should speed up work queue processing in general, I
>> think.
>>
>> The downside is that inodes that currently locked might take longer to
>> complete.  In the case of fsync() we'll just force the I/O completion
>> to happen in the context of the fsync'ing process, so I don't think it
>> should be a problem in practice I think.
>>
> Ted,
>
> I am working on a patch to use mutex_trylock() in ext4_end_io_work()
> so that we can fix the described deadlock without needing to call
> mutex_lock() and  ext4_flush_completed_IO() in ext4_evict_inode().
> I run into some problem while testing it. Given that the deadlock my
> original patch intended to fix only exists with dioread_nolock enabled
> but the lockdep issue happens in all of cases, I think we should roll
> that part back as you have planned. I am going to send a separate patch
> later to fix the deadlock issue once I resolved the problem found in
> my test.
>
> Jiaying
>
>>                                        - Ted
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

=======================================================
[ INFO: possible circular locking dependency detected ]
3.1.0-rc3-00012-g2a22fc1 #1839
-------------------------------------------------------
dd/7677 is trying to acquire lock:
 (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d

but task is already holding lock:
 (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
       [<c018eb02>] lock_acquire+0x99/0xbd
       [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
       [<c06a572b>] mutex_lock_nested+0x26/0x2f
       [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
       [<c0214bb0>] evict+0x8e/0x131
       [<c0214de6>] dispose_list+0x36/0x40
       [<c0215239>] evict_inodes+0xcd/0xd5
       [<c0204a23>] generic_shutdown_super+0x3d/0xaa
       [<c0204ab2>] kill_block_super+0x22/0x5e
       [<c0204cb8>] deactivate_locked_super+0x22/0x4e
       [<c02055b2>] deactivate_super+0x3d/0x43
       [<c0218427>] mntput_no_expire+0xda/0xdf
       [<c0219486>] sys_umount+0x286/0x2ab
       [<c02194bd>] sys_oldumount+0x12/0x14
       [<c06a6ac5>] syscall_call+0x7/0xb

-> #0 (&type->s_umount_key#18){++++..}:
       [<c018e262>] __lock_acquire+0x967/0xbd2
       [<c018eb02>] lock_acquire+0x99/0xbd
       [<c06a5991>] down_read+0x28/0x65
       [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
       [<c0269630>] ext4_nonda_switch+0xd0/0xe1
       [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
       [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
       [<c01d58d3>] __generic_file_aio_write+0x254/0x285
       [<c01d596e>] generic_file_aio_write+0x6a/0xba
       [<c026732f>] ext4_file_write+0x1d6/0x227
       [<c0202789>] do_sync_write+0x8f/0xca
       [<c02030d5>] vfs_write+0x85/0xe3
       [<c02031d4>] sys_write+0x40/0x65
       [<c06a6ac5>] syscall_call+0x7/0xb

https://bugzilla.kernel.org/show_bug.cgi?id=41682

Cc: stable@kernel.org
Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/inode.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 29b7148..cf0b515 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -121,9 +121,6 @@  void ext4_evict_inode(struct inode *inode)
 
 	trace_ext4_evict_inode(inode);
 
-	mutex_lock(&inode->i_mutex);
-	ext4_flush_completed_IO(inode);
-	mutex_unlock(&inode->i_mutex);
 	ext4_ioend_wait(inode);
 
 	if (inode->i_nlink) {