Message ID: E1QwnAu-00087H-8X@tytso-glaptop.cam.corp.google.com
State: Superseded, archived
Hi Ted,

On 08/26/2011 11:33 AM, Theodore Ts'o wrote:
> Note: this will probably need to be sent to Linus as an emergency
> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> regression.
>
> Jiayingz, I'd appreciate it if you could review this, since this is a
> partial undo of commit 2581fdc810, which you authored. I don't think
> taking out the call to ext4_flush_completed_IO() should cause any
> problems, since it should only delay how long it takes for an inode to
> be evicted, and in some cases we are already waiting for a truncate or
> journal commit to complete. But I don't want to take any chances, so a
> second pair of eyes would be appreciated. Thanks!!

I do agree that the revert can help to resolve that lockdep issue, but I
think Jiaying's patch and the deadlock described in her commit log do
make sense. So I am working on another way to resolve it and hope to
send it out today. Please review the patch when it is ready.

Thanks,
Tao

>
> - Ted
>
> From 18271e31ece46955c0fd61e726fa7540fddf8924 Mon Sep 17 00:00:00 2001
> From: Theodore Ts'o <tytso@mit.edu>
> Date: Thu, 25 Aug 2011 23:26:01 -0400
> Subject: [PATCH] ext4: fix potential deadlock in ext4_evict_inode()
>
> Commit 2581fdc810 moved ext4_ioend_wait() from ext4_destroy_inode() to
> ext4_evict_inode(). It also added code to explicitly call
> ext4_flush_completed_IO(inode):
>
>	mutex_lock(&inode->i_mutex);
>	ext4_flush_completed_IO(inode);
>	mutex_unlock(&inode->i_mutex);
>
> Unfortunately, we can't take the i_mutex lock in ext4_evict_inode()
> without potentially causing a deadlock.
>
> Fix this by removing the code sequence altogether. This may result in
> ext4_evict_inode() taking longer to complete, but that's OK; we're not
> in a rush here. It just means we have to wait until the workqueue
> is scheduled, which is fine; there's nothing that says we have to do
> this work on the current thread, which would require taking a lock
> that might lead to a deadlock condition.
>
> See Kernel Bugzilla #41682 for one example of the circular locking
> problem that can arise. Another one can be seen here:
>
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.1.0-rc3-00012-g2a22fc1 #1839
> -------------------------------------------------------
> dd/7677 is trying to acquire lock:
>  (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>
> but task is already holding lock:
>  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba
>
> which lock already depends on the new lock.
>
> the existing dependency chain (in reverse order) is:
>
> -> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
>        [<c06a572b>] mutex_lock_nested+0x26/0x2f
>        [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
>        [<c0214bb0>] evict+0x8e/0x131
>        [<c0214de6>] dispose_list+0x36/0x40
>        [<c0215239>] evict_inodes+0xcd/0xd5
>        [<c0204a23>] generic_shutdown_super+0x3d/0xaa
>        [<c0204ab2>] kill_block_super+0x22/0x5e
>        [<c0204cb8>] deactivate_locked_super+0x22/0x4e
>        [<c02055b2>] deactivate_super+0x3d/0x43
>        [<c0218427>] mntput_no_expire+0xda/0xdf
>        [<c0219486>] sys_umount+0x286/0x2ab
>        [<c02194bd>] sys_oldumount+0x12/0x14
>        [<c06a6ac5>] syscall_call+0x7/0xb
>
> -> #0 (&type->s_umount_key#18){++++..}:
>        [<c018e262>] __lock_acquire+0x967/0xbd2
>        [<c018eb02>] lock_acquire+0x99/0xbd
>        [<c06a5991>] down_read+0x28/0x65
>        [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
>        [<c0269630>] ext4_nonda_switch+0xd0/0xe1
>        [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
>        [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
>        [<c01d58d3>] __generic_file_aio_write+0x254/0x285
>        [<c01d596e>] generic_file_aio_write+0x6a/0xba
>        [<c026732f>] ext4_file_write+0x1d6/0x227
>        [<c0202789>] do_sync_write+0x8f/0xca
>        [<c02030d5>] vfs_write+0x85/0xe3
>        [<c02031d4>] sys_write+0x40/0x65
>        [<c06a6ac5>] syscall_call+0x7/0xb
>
> https://bugzilla.kernel.org/show_bug.cgi?id=41682
>
> Cc: stable@kernel.org
> Cc: Jiaying Zhang <jiayingz@google.com>
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> ---
>  fs/ext4/inode.c |    3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 29b7148..cf0b515 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -121,9 +121,6 @@ void ext4_evict_inode(struct inode *inode)
>
>  	trace_ext4_evict_inode(inode);
>
> -	mutex_lock(&inode->i_mutex);
> -	ext4_flush_completed_IO(inode);
> -	mutex_unlock(&inode->i_mutex);
>  	ext4_ioend_wait(inode);
>
>  	if (inode->i_nlink) {
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>
> Note: this will probably need to be sent to Linus as an emergency
> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> regression.

It doesn't appear to be a bug. All of the new ext4 lockdep reports
in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
positives....

.....
> =======================================================
> [ INFO: possible circular locking dependency detected ]
> [lockdep trace quoted in full above; snipped]

That's definitely a false positive - sys_write() will have an active
reference to the inode, and evict is only called on inodes without
active references. Hence you can never get a deadlock between an
inode context with an active reference and the same inode in the
evict/dispose path, because the inode cannot be in both places at
once...

This is why XFS changes the lockdep context for its iolock as soon as
.evict is called on the inode - to stop these false positives from
being emitted whenever memory reclaim or unmount evicts inodes.

Cheers,

Dave.
Hi Dave,

On 08/26/2011 03:35 PM, Dave Chinner wrote:
> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>
>> Note: this will probably need to be sent to Linus as an emergency
>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>> regression.
>
> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> positives....
>
> .....
>> [lockdep trace snipped]
>
> That's definitely a false positive - sys_write() will have an active
> reference to the inode, and evict is only called on inodes without
> active references. Hence you can never get a deadlock between an
> inode context with an active reference and the same inode in the
> evict/dispose path because inode cannot be in both places at once...

Yeah, the fact is that lockdep isn't that smart. ;)

> This is why XFS changes the lockdep context for its iolock as
> soon as .evict is called on the inode - to stop these false
> positives from being emitted whenever memory reclaim or unmount
> evicts inodes.

I don't think ext4 can change the lockdep context here, since we have
another path, ext4_end_io_work, which can run and take
mutex_lock(i_mutex) with inode->i_count == 0. It isn't safe for us to
abruptly change the lockdep context here, I guess. What I am trying to
do is to avoid the ext4_end_io_work in the i_count == 0 case, so that
we can either change the lockdep context or remove the mutex_lock here
completely.

Thanks,
Tao
On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
> >
> > Note: this will probably need to be sent to Linus as an emergency
> > bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> > regression.
>
> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> positives....

While the lockdep report is a false positive, I agree that your
change is the right fix to make - the IO completions are already
queued on the workqueue, so they don't need to be flushed to get
them to complete. All that needs to be done is to call
ext4_ioend_wait() for them to complete, and that gets rid of the
i_mutex altogether. (*)

Sorry for the noise, Ted.

Cheers,

Dave.

(*) Even better would be to get rid of that flush list altogether and
just use ext4_ioend_wait() everywhere....
On Fri, Aug 26, 2011 at 06:44:03PM +1000, Dave Chinner wrote:
> While the lockdep report is a false positive, I agree that your
> change is the right fix to make - the IO completions are already
> queued on the workqueue, so they don't need to be flushed to get
> them to complete. All that needs to be done is call
> ext4_ioend_wait() for them to complete, and that gets rid of the
> i_mutex altogether. (*)

The really correct fix is to stop using the ioend wait like I did in
my pending series for XFS. The only thing preventing ext4 from doing
that is that it marks pages uptodate from irq context, before
finishing the ioends. This has a lot of nasty implications, as the
page might be marked uptodate without actually having that state on
disk.
On 08/26/2011 04:44 PM, Dave Chinner wrote:
> On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
>> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>>
>>> Note: this will probably need to be sent to Linus as an emergency
>>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>>> regression.
>>
>> It doesn't appear to be a bug. All of the new ext4 lockdep reports
>> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
>> positives....
>
> While the lockdep report is a false positive, I agree that your
> change is the right fix to make - the IO completions are already
> queued on the workqueue, so they don't need to be flushed to get
> them to complete. All that needs to be done is call
> ext4_ioend_wait() for them to complete, and that gets rid of the
> i_mutex altogether. (*)

ext4_ioend_wait() can't work here because of a nasty bug. Please see
the commit log of 2581fdc8.

Thanks,
Tao
Hi Christoph,

On 08/26/2011 04:50 PM, Christoph Hellwig wrote:
> On Fri, Aug 26, 2011 at 06:44:03PM +1000, Dave Chinner wrote:
>> While the lockdep report is a false positive, I agree that your
>> change is the right fix to make - the IO completions are already
>> queued on the workqueue, so they don't need to be flushed to get
>> them to complete. All that needs to be done is call
>> ext4_ioend_wait() for them to complete, and that gets rid of the
>> i_mutex altogether. (*)
>
> The really correct fix is to stop using the ioend wait like I did
> in my pending series for XFS. The only thing preventing ext4 from
> doing that is that it marks pages uptodate from irq context,
> before finishing the ioends. This has a lot of nasty implications
> as the page might be marked uptodate without actually having
> that state on disk.

Yeah, the DIO-read-after-buffered-write issue is also another side
effect of it. [1] So could you please talk a little more about how XFS
handles it now?

[1] http://www.spinics.net/lists/linux-ext4/msg27139.html

Thanks,
Tao
On Fri, Aug 26, 2011 at 05:10:35PM +0800, Tao Ma wrote:
> Yeah, the DIO-read-after-buffered-write issue is also another side
> effect of it. [1]
> So could you please talk a little more about how XFS handles it now?

The thing I have queued up for 3.2 makes it very simple: we do not
track I/O ends any more at all, outside of the workqueue.

For buffered I/O we only mark the page uptodate when all unwritten
extent conversion and size updates have finished. All data integrity
callers and inode eviction wait for the pages to be uptodate, so we
are covered.

For direct I/O we only call inode_dio_done and aio_complete once all
unwritten extent size updates are done. Inodes can't be evicted until
we drop a reference to the inode, which can't happen until the sync or
async dio is done and we drop the inode reference the VFS holds for
it. Sync and fsync are only guaranteed to pick up I/O that has
returned to userspace, so we are covered for that as well.
On Fri, Aug 26, 2011 at 05:03:14PM +0800, Tao Ma wrote:
> On 08/26/2011 04:44 PM, Dave Chinner wrote:
> > On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
> >> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
> >>>
> >>> Note: this will probably need to be sent to Linus as an emergency
> >>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
> >>> regression.
> >>
> >> It doesn't appear to be a bug. All of the new ext4 lockdep reports
> >> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
> >> positives....
> >
> > While the lockdep report is a false positive, I agree that your
> > change is the right fix to make - the IO completions are already
> > queued on the workqueue, so they don't need to be flushed to get
> > them to complete. All that needs to be done is call
> > ext4_ioend_wait() for them to complete, and that gets rid of the
> > i_mutex altogether. (*)
> ext4_ioend_wait can't work here for a nasty bug. Please see the commit
> log of 2581fdc8.

Unless I'm missing something, the described race with ext4_truncate()
flushing completions without the i_mutex lock held cannot occur if
you've already waited for all pending completions to drain by calling
ext4_ioend_wait()....

Cheers,

Dave.
On 08/26/2011 05:24 PM, Dave Chinner wrote:
> On Fri, Aug 26, 2011 at 05:03:14PM +0800, Tao Ma wrote:
>> On 08/26/2011 04:44 PM, Dave Chinner wrote:
>>> On Fri, Aug 26, 2011 at 05:35:07PM +1000, Dave Chinner wrote:
>>>> On Thu, Aug 25, 2011 at 11:33:44PM -0400, Theodore Ts'o wrote:
>>>>>
>>>>> Note: this will probably need to be sent to Linus as an emergency
>>>>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>>>>> regression.
>>>>
>>>> It doesn't appear to be a bug. All of the new ext4 lockdep reports
>>>> in 3.1 I've seen (except for the mmap_sem/i_mutex one) are false
>>>> positives....
>>>
>>> While the lockdep report is a false positive, I agree that your
>>> change is the right fix to make - the IO completions are already
>>> queued on the workqueue, so they don't need to be flushed to get
>>> them to complete. All that needs to be done is call
>>> ext4_ioend_wait() for them to complete, and that gets rid of the
>>> i_mutex altogether. (*)
>> ext4_ioend_wait can't work here for a nasty bug. Please see the commit
>> log of 2581fdc8.
>
> Unless I'm missing something, the described race with
> ext4_truncate() flushing completions without the i_mutex lock held
> cannot occur if you've already waited for all pending completions to
> drain by calling ext4_ioend_wait()....

No, it doesn't mean ext4_truncate, but another race, pasted below:

  Flush inode's i_completed_io_list before calling ext4_io_wait to
  prevent the following deadlock scenario: A page fault happens while
  some process is writing inode A. During page fault,
  shrink_icache_memory is called that in turn evicts another inode
  B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
  that waits for inode B's i_ioend_count to become zero. However, inode
  B's ioend work was queued behind some of inode A's ioend work on the
  same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
  thread on that cpu is processing inode A's ioend work, it tries to
  grab inode A's i_mutex lock. Since the i_mutex lock of inode A was
  still held before the page fault happened, we enter a deadlock.

Thanks,
Tao
Hi Ted,

On 08/26/2011 11:56 AM, Tao Ma wrote:
> Hi Ted,
> On 08/26/2011 11:33 AM, Theodore Ts'o wrote:
>> Note: this will probably need to be sent to Linus as an emergency
>> bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
>> regression.
>>
>> Jiayingz, I'd appreciate it if you could review this, since this is a
>> partial undo of commit 2581fdc810, which you authored. I don't think
>> taking out the call to ext4_flush_completed_IO() should cause any
>> problems, since it should only delay how long it takes for an inode to
>> be evicted, and in some cases we are already waiting for a truncate or
>> journal commit to complete. But I don't want to take any chances, so a
>> second pair of eyes would be appreciated. Thanks!!
> I do agree that the revert can help to resolve that lockdep issue, but I
> think Jiaying's patch and the deadlock described in her commit log do
> make sense. So I am working on another way to resolve it and hope to
> send it out today. Please review the patch when it is ready.

I am afraid I have to take my words back. The change is too
complicated, and I don't think it is OK for such a late rc. Sorry. As
there are so many users complaining about this lockdep warning, we'd
better send it out to Linus ASAP, although it can't resolve the nasty
bug described in the commit log by Jiaying.

Thanks,
Tao

>> [quoted patch and lockdep trace snipped]
On Aug 26, 2011, at 5:17 AM, Christoph Hellwig wrote:
> The thing I have queued up for 3.2 makes it very simple: we do not
> track I/O ends any more at all, outside of the workqueue.
>
> For buffered I/O we only mark the page uptodate when all unwritten
> extent conversion and size updates have finished. All data integrity
> callers and inode eviction wait for the pages to be uptodate so we
> are covered.
>
> For direct I/O we only call inode_dio_done and aio_complete once all
> unwritten extent size updates are done. Inodes can't be evicted until
> we drop a reference to the inode, which can't happen until the
> sync or async dio is done and we drop the inode reference the VFS
> holds for it. Sync and fsync are only guaranteed to pick up I/O
> that has returned to userspace, so we are covered for that as well.

Long term, I definitely want to make ext4 do something similar.
What we have now is just way too fragile...

-- Ted
On 08/26/2011 07:35 PM, Theodore Tso wrote:
>
> On Aug 26, 2011, at 5:17 AM, Christoph Hellwig wrote:
>
>> The thing I have queued up for 3.2 makes it very simple: we do not
>> track I/O ends any more at all, outside of the workqueue.
>>
>> For buffered I/O we only mark the page uptodate when all unwritten
>> extent conversion and size updates have finished. All data integrity
>> callers and inode eviction wait for the pages to be uptodate so we
>> are covered.
>>
>> For direct I/O we only call inode_dio_done and aio_complete once all
>> unwritten extent size updates are done. Inodes can't be evicted until
>> we drop a reference to the inode, which can't happen until the
>> sync or async dio is done and we drop the inode reference the VFS
>> holds for it. Sync and fsync are only guaranteed to pick up I/O
>> that has returned to userspace, so we are covered for that as well.
>
> Long term, I definitely want to make ext4 do something similar.
> What we have now is just way too fragile...

Yeah, actually I have done some basic tests on letting
ext4_free_io_end clear the page writeback flag for us after the
unwritten extent conversion, and it does have several problems with
both ext4 and jbd2. I will try to write up some solution for review.

Thanks,
Tao
On Fri, Aug 26, 2011 at 05:27:39PM +0800, Tao Ma wrote:
> No, it doesn't mean the ext4_truncate. But another race pasted below.
>
>   Flush inode's i_completed_io_list before calling ext4_io_wait to
>   prevent the following deadlock scenario: A page fault happens while
>   some process is writing inode A. During page fault,
>   shrink_icache_memory is called that in turn evicts another inode
>   B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
>   that waits for inode B's i_ioend_count to become zero. However, inode
>   B's ioend work was queued behind some of inode A's ioend work on the
>   same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
>   thread on that cpu is processing inode A's ioend work, it tries to
>   grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
>   still held before the page fault happened, we enter a deadlock.

... but that shouldn't be a problem since we're not holding A's
i_mutex at this point, right? Or am I missing something?

- Ted
Hi Ted,

On Fri, Aug 26, 2011 at 8:52 AM, Ted Ts'o <tytso@mit.edu> wrote:
> On Fri, Aug 26, 2011 at 05:27:39PM +0800, Tao Ma wrote:
>> No, it doesn't mean the ext4_truncate. But another race pasted below.
>>
>> Flush inode's i_completed_io_list before calling ext4_io_wait to
>> prevent the following deadlock scenario: A page fault happens while
>> some process is writing inode A. During page fault,
>> shrink_icache_memory is called that in turn evicts another inode
>> B. Inode B has some pending io_end work so it calls ext4_ioend_wait()
>> that waits for inode B's i_ioend_count to become zero. However, inode
>> B's ioend work was queued behind some of inode A's ioend work on the
>> same cpu's ext4-dio-unwritten workqueue. As the ext4-dio-unwritten
>> thread on that cpu is processing inode A's ioend work, it tries to
>> grab inode A's i_mutex lock. Since the i_mutex lock of inode A is
>> still held before the page fault happened, we enter a deadlock.
>
> ... but that shouldn't be a problem since we're not holding A's
> i_mutex at this point, right? Or am I missing something?

I think it is possible that we are holding A's i_mutex lock if the
page fault happens while we are writing inode A. The problem is, if we
call ext4_evict_inode for inode B during the page fault handling and
just call ext4_ioend_wait() to wait for inode B's i_ioend_count to
become zero, we rely on the ext4-dio-unwritten worker thread to finish
any queued work at some point. But as mentioned in the commit log, B's
io_end work may be queued after A's work on the same cpu. Since A's
i_mutex lock may still be held during the page fault, the
ext4-dio-unwritten worker thread can't make progress.

Now, thinking about an alternative approach to resolve the deadlock
mentioned above: maybe we can use mutex_trylock() in
ext4_end_io_work(), and if we can't grab the mutex lock for an inode,
just requeue the work to the end of the workqueue?

Jiaying
On Fri, Aug 26, 2011 at 09:58:45AM -0700, Jiaying Zhang wrote:
> Now thinking about an alternative approach to resolve the deadlock
> mentioned above, maybe we can use mutex_trylock() in
> ext4_end_io_work() and if we can't grab the mutex lock for an inode,
> just requeue the work to the end of workqueue?

Good idea! That should speed up work queue processing in general, I
think.

The downside is that inodes that are currently locked might take
longer to complete. In the case of fsync() we'll just force the I/O
completion to happen in the context of the fsync'ing process, so I
don't think it should be a problem in practice.

- Ted
On Fri, Aug 26, 2011 at 1:22 PM, Ted Ts'o <tytso@mit.edu> wrote:
> On Fri, Aug 26, 2011 at 09:58:45AM -0700, Jiaying Zhang wrote:
>> Now, thinking about an alternative approach to resolve the deadlock
>> described above: maybe we can use mutex_trylock() in
>> ext4_end_io_work(), and if we can't grab the mutex lock for an inode,
>> just requeue the work to the end of the workqueue?
>
> Good idea!  That should speed up work queue processing in general, I
> think.
>
> The downside is that inodes that are currently locked might take longer
> to complete.  But in the case of fsync() we'll just force the I/O
> completion to happen in the context of the fsync'ing process, so I
> don't think it should be a problem in practice.
>
Ted,

I am working on a patch to use mutex_trylock() in ext4_end_io_work()
so that we can fix the described deadlock without needing to call
mutex_lock() and ext4_flush_completed_IO() in ext4_evict_inode(), but
I ran into some problems while testing it. Given that the deadlock my
original patch intended to fix only exists with dioread_nolock enabled,
while the lockdep issue happens in all cases, I think we should roll
that part back as you have planned. I will send a separate patch later
to fix the deadlock issue once I have resolved the problem found in my
testing.

Jiaying

> - Ted
>
FYI, I just sent out a patch to linux-ext4@ that uses mutex_trylock()
in ext4_end_io_work() and removes the i_mutex lock and the
ext4_flush_completed_IO() call in ext4_evict_inode():
http://www.spinics.net/lists/linux-ext4/msg27407.html

Jiaying

On Fri, Aug 26, 2011 at 10:17 PM, Jiaying Zhang <jiayingz@google.com> wrote:
> Ted,
>
> I am working on a patch to use mutex_trylock() in ext4_end_io_work()
> so that we can fix the described deadlock without needing to call
> mutex_lock() and ext4_flush_completed_IO() in ext4_evict_inode(), but
> I ran into some problems while testing it. Given that the deadlock my
> original patch intended to fix only exists with dioread_nolock enabled,
> while the lockdep issue happens in all cases, I think we should roll
> that part back as you have planned. I will send a separate patch later
> to fix the deadlock issue once I have resolved the problem found in my
> testing.
>
> Jiaying
>
=======================================================
[ INFO: possible circular locking dependency detected ]
3.1.0-rc3-00012-g2a22fc1 #1839
-------------------------------------------------------
dd/7677 is trying to acquire lock:
 (&type->s_umount_key#18){++++..}, at: [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d

but task is already holding lock:
 (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c01d5956>] generic_file_aio_write+0x52/0xba

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:

-> #1 (&sb->s_type->i_mutex_key#3){+.+.+.}:
       [<c018eb02>] lock_acquire+0x99/0xbd
       [<c06a53b5>] __mutex_lock_common+0x33/0x2fb
       [<c06a572b>] mutex_lock_nested+0x26/0x2f
       [<c026c2db>] ext4_evict_inode+0x3e/0x2bd
       [<c0214bb0>] evict+0x8e/0x131
       [<c0214de6>] dispose_list+0x36/0x40
       [<c0215239>] evict_inodes+0xcd/0xd5
       [<c0204a23>] generic_shutdown_super+0x3d/0xaa
       [<c0204ab2>] kill_block_super+0x22/0x5e
       [<c0204cb8>] deactivate_locked_super+0x22/0x4e
       [<c02055b2>] deactivate_super+0x3d/0x43
       [<c0218427>] mntput_no_expire+0xda/0xdf
       [<c0219486>] sys_umount+0x286/0x2ab
       [<c02194bd>] sys_oldumount+0x12/0x14
       [<c06a6ac5>] syscall_call+0x7/0xb

-> #0 (&type->s_umount_key#18){++++..}:
       [<c018e262>] __lock_acquire+0x967/0xbd2
       [<c018eb02>] lock_acquire+0x99/0xbd
       [<c06a5991>] down_read+0x28/0x65
       [<c021ea77>] writeback_inodes_sb_if_idle+0x26/0x3d
       [<c0269630>] ext4_nonda_switch+0xd0/0xe1
       [<c026e953>] ext4_da_write_begin+0x3c/0x1cf
       [<c01d46ad>] generic_file_buffered_write+0xc0/0x1b4
       [<c01d58d3>] __generic_file_aio_write+0x254/0x285
       [<c01d596e>] generic_file_aio_write+0x6a/0xba
       [<c026732f>] ext4_file_write+0x1d6/0x227
       [<c0202789>] do_sync_write+0x8f/0xca
       [<c02030d5>] vfs_write+0x85/0xe3
       [<c02031d4>] sys_write+0x40/0x65
       [<c06a6ac5>] syscall_call+0x7/0xb

https://bugzilla.kernel.org/show_bug.cgi?id=41682

Cc: stable@kernel.org
Cc: Jiaying Zhang <jiayingz@google.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/inode.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 29b7148..cf0b515 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -121,9 +121,6 @@ void ext4_evict_inode(struct inode *inode)
 {
 	trace_ext4_evict_inode(inode);

-	mutex_lock(&inode->i_mutex);
-	ext4_flush_completed_IO(inode);
-	mutex_unlock(&inode->i_mutex);
 	ext4_ioend_wait(inode);

 	if (inode->i_nlink) {
Note: this will probably need to be sent to Linus as an emergency
bugfix ASAP, since it was introduced in 3.1-rc1, so it represents a
regression.

Jiayingz, I'd appreciate it if you could review this, since it is a
partial undo of commit 2581fdc810, which you authored. I don't think
taking out the call to ext4_flush_completed_IO() should cause any
problems, since it should only delay how long it takes for an inode to
be evicted, and in some cases we are already waiting for a truncate or
journal commit to complete. But I don't want to take any chances, so a
second pair of eyes would be appreciated. Thanks!!

					- Ted

From 18271e31ece46955c0fd61e726fa7540fddf8924 Mon Sep 17 00:00:00 2001
From: Theodore Ts'o <tytso@mit.edu>
Date: Thu, 25 Aug 2011 23:26:01 -0400
Subject: [PATCH] ext4: fix potential deadlock in ext4_evict_inode()

Commit 2581fdc810 moved ext4_ioend_wait() from ext4_destroy_inode() to
ext4_evict_inode(). It also added code to explicitly call
ext4_flush_completed_IO(inode):

	mutex_lock(&inode->i_mutex);
	ext4_flush_completed_IO(inode);
	mutex_unlock(&inode->i_mutex);

Unfortunately, we can't take the i_mutex lock in ext4_evict_inode()
without potentially causing a deadlock.

Fix this by removing the code sequence altogether. This may result in
ext4_evict_inode() taking longer to complete, but that's OK; we're not
in a rush here. It just means we have to wait until the workqueue is
scheduled, which is fine: there is nothing that says we have to do this
work on the current thread, and doing so would require taking a lock
that might lead to a deadlock condition.

See Kernel Bugzilla #41682 for one example of the circular locking
problem that can arise. Another one can be seen here: