Message ID: 1332314639-22875-2-git-send-email-lczerner@redhat.com
State: Superseded, archived
On Wed, Mar 21, 2012 at 08:23:55AM +0100, Lukas Czerner wrote:
> +	/*
> +	 * This is fugly, but even though we're going to get rid of the
> +	 * EOFBLOCKS_LF in the future, we have to handle it correctly now
> +	 * because there are still versions of e2fsck out there which
> +	 * would scream otherwise. Once the new e2fsck code ignoring
> +	 * this flag is common enough this can be removed entirely.
> +	 */
> +	if (ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)) {
> +		struct ext4_ext_path *path;
> +		ext4_lblk_t last_block;
> +
> +		mutex_lock(&inode->i_mutex);
> +		down_read(&EXT4_I(inode)->i_data_sem);

I was looking at this patch, and I was wondering why we weren't taking
i_mutex earlier in ext4_ext_punch_hole().  The primary use of i_mutex
is to protect writes racing with each other and with truncate.  Given
that punch essentially works like truncate, that all of ext4_truncate()
is run with i_mutex down, and that ext4_ext_punch_hole() (before
applying this patch) isn't taking i_mutex at all, I'm wondering if we
can run into problems where punch is racing against a write --- if the
pages are already mapped, then the write might not even need to take
i_data_sem.

Lukas, Allison --- am I missing something here?

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
On Wed, 21 Mar 2012, Ted Ts'o wrote:

> On Wed, Mar 21, 2012 at 08:23:55AM +0100, Lukas Czerner wrote:
> > +	/*
> > +	 * This is fugly, but even though we're going to get rid of the
> > +	 * EOFBLOCKS_LF in the future, we have to handle it correctly now
> > +	 * because there are still versions of e2fsck out there which
> > +	 * would scream otherwise. Once the new e2fsck code ignoring
> > +	 * this flag is common enough this can be removed entirely.
> > +	 */
> > +	if (ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)) {
> > +		struct ext4_ext_path *path;
> > +		ext4_lblk_t last_block;
> > +
> > +		mutex_lock(&inode->i_mutex);
> > +		down_read(&EXT4_I(inode)->i_data_sem);
> 
> I was looking at this patch, and I was wondering why we weren't taking
> i_mutex earlier in ext4_ext_punch_hole().  The primary use of i_mutex
> is to protect writes racing with each other and with truncate.  Given
> that punch essentially works like truncate, that all of ext4_truncate()
> is run with i_mutex down, and that ext4_ext_punch_hole() (before
> applying this patch) isn't taking i_mutex at all, I'm wondering if we
> can run into problems where punch is racing against a write --- if the
> pages are already mapped, then the write might not even need to take
> i_data_sem.
> 
> Lukas, Allison --- am I missing something here?
> 
> 					- Ted

Hmm, so we cannot race with truncate, since we do not touch i_size in
punch_hole and the extent tree modification is done with i_data_sem
held.

As far as write is concerned, I do not think that race is possible.
If for example we're trying to write into the same range we're
punching out, the buffer we're trying to write into is either mapped
or not, right?  If it's mapped, then punch_hole has not gotten to that
block yet, but it will eventually within the current transaction.

If the buffer is unmapped, then we're going to get a new block, which
means we're going to have to take i_data_sem; but since punch_hole is
already holding it, we have to wait until it finishes.

The worst that can happen is that after a write spanning several
blocks we'll have the first part of the write punched out, but the
second part written correctly, since in this case it might hit an
already punched block and need to wait for punch_hole to finish; after
that the rest of the range is written.  However, the write should
remain consistent at block granularity, which is all we guarantee
anyway, right?

-Lukas
On Thu, Mar 22, 2012 at 09:25:15AM +0100, Lukas Czerner wrote:
> 
> The worst that can happen is that after a write spanning several
> blocks we'll have the first part of the write punched out, but the
> second part written correctly, since in this case it might hit an
> already punched block and need to wait for punch_hole to finish;
> after that the rest of the range is written.  However, the write
> should remain consistent at block granularity, which is all we
> guarantee anyway, right?

I need to look more closely at this, but the thing that was worrying
me was the part of truncate/punch where we have to invalidate the
parts of the page cache where we've unmapped the blocks, i.e., the
call to truncate_inode_pages_range() racing with the write.  I think
we're OK, since truncate_inode_pages_range() grabs the page spinlock
and then checks for PageWriteback, which ought to be sufficient, but
truncate does take that codepath with i_mutex down, and so my spidey
sense is tingling.  I may just be too paranoid, though.

Still, that's not a criticism of your patch.

More serious is the following lockdep warning that I got.  Grabbing
i_mutex after the transaction handle is started can lead to a circular
locking deadlock...
					- Ted

BEGIN TEST: Ext4 4k block Wed Mar 21 22:47:17 EDT 2012
Device: /dev/vdb
mke2fs options: -q
mount options: -o block_validity
000 - unknown test, ignored
FSTYP         -- ext4
PLATFORM      -- Linux/i686 candygram 3.3.0-rc2-00592-gc56a0b2
MKFS_OPTIONS  -- -q /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc
075	[  808.872903] 
[  808.873567] ======================================================
[  808.875933] [ INFO: possible circular locking dependency detected ]
[  808.875933] 3.3.0-rc2-00592-gc56a0b2 #32 Not tainted
[  808.875933] -------------------------------------------------------
[  808.875933] fsx/13769 is trying to acquire lock:
[  808.875933]  (&sb->s_type->i_mutex_key#3){+.+.+.}, at: [<c028d900>] ext4_ext_punch_hole+0x2b8/0x382
[  808.875933] 
[  808.875933] but task is already holding lock:
[  808.875933]  (jbd2_handle){+.+...}, at: [<c02a5995>] start_this_handle+0x4e4/0x51a
[  808.875933] 
[  808.875933] which lock already depends on the new lock.
[  808.875933] 
[  808.875933] 
[  808.875933] the existing dependency chain (in reverse order) is:
[  808.875933] 
[  808.875933] -> #1 (jbd2_handle){+.+...}:
[  808.875933]        [<c019789d>] lock_acquire+0x99/0xbd
[  808.875933]        [<c02a59b7>] start_this_handle+0x506/0x51a
[  808.875933]        [<c02a5ba6>] jbd2__journal_start+0xae/0xda
[  808.875933]        [<c02a5be4>] jbd2_journal_start+0x12/0x14
[  808.875933]        [<c0284fb8>] ext4_journal_start_sb+0x11e/0x126
[  808.875933]        [<c0277661>] ext4_unlink+0x82/0x1e5
[  808.875933]        [<c02127e1>] vfs_unlink+0x61/0xaf
[  808.875933]        [<c02147b5>] do_unlinkat+0xa0/0x112
[  808.875933]        [<c0214946>] sys_unlinkat+0x30/0x37
[  808.875933]        [<c06d8c5d>] syscall_call+0x7/0xb
[  808.875933] 
[  808.875933] -> #0 (&sb->s_type->i_mutex_key#3){+.+.+.}:
[  808.875933]        [<c0197598>] __lock_acquire+0x989/0xbf5
[  808.875933]        [<c019789d>] lock_acquire+0x99/0xbd
[  808.875933]        [<c06d65f4>] __mutex_lock_common+0x30/0x316
[  808.875933]        [<c06d6988>] mutex_lock_nested+0x26/0x2f
[  808.875933]        [<c028d900>] ext4_ext_punch_hole+0x2b8/0x382
[  808.875933]        [<c026e316>] ext4_punch_hole+0x5f/0x70
[  808.875933]        [<c028fbce>] ext4_fallocate+0x63/0x469
[  808.875933]        [<c0208974>] do_fallocate+0xe7/0x105
[  808.875933]        [<c02089c3>] sys_fallocate+0x31/0x46
[  808.875933]        [<c06d8c5d>] syscall_call+0x7/0xb
[  808.875933] 
[  808.875933] other info that might help us debug this:
[  808.875933] 
[  808.875933]  Possible unsafe locking scenario:
[  808.875933] 
[  808.875933]        CPU0                    CPU1
[  808.875933]        ----                    ----
[  808.875933]   lock(jbd2_handle);
[  808.875933]                                lock(&sb->s_type->i_mutex_key#3);
[  808.875933]                                lock(jbd2_handle);
[  808.875933]   lock(&sb->s_type->i_mutex_key#3);
[  808.875933] 
[  808.875933]  *** DEADLOCK ***
[  808.875933] 
[  808.875933] 1 lock held by fsx/13769:
[  808.875933]  #0:  (jbd2_handle){+.+...}, at: [<c02a5995>] start_this_handle+0x4e4/0x51a
[  808.875933] 
[  808.875933] stack backtrace:
[  808.875933] Pid: 13769, comm: fsx Not tainted 3.3.0-rc2-00592-gc56a0b2 #32
[  808.875933] Call Trace:
[  808.875933]  [<c01954fb>] print_circular_bug+0x194/0x1a1
[  808.875933]  [<c0197598>] __lock_acquire+0x989/0xbf5
[  808.875933]  [<c019789d>] lock_acquire+0x99/0xbd
[  808.875933]  [<c028d900>] ? ext4_ext_punch_hole+0x2b8/0x382
[  808.875933]  [<c06d65f4>] __mutex_lock_common+0x30/0x316
[  808.875933]  [<c028d900>] ? ext4_ext_punch_hole+0x2b8/0x382
[  808.875933]  [<c017d53a>] ? local_clock+0x3d/0x55
[  808.875933]  [<c01942de>] ? lock_release_holdtime+0x2b/0xcd
[  808.875933]  [<c028d8d9>] ? ext4_ext_punch_hole+0x291/0x382
[  808.875933]  [<c06d6988>] mutex_lock_nested+0x26/0x2f
[  808.875933]  [<c028d900>] ? ext4_ext_punch_hole+0x2b8/0x382
[  808.875933]  [<c028d900>] ext4_ext_punch_hole+0x2b8/0x382
[  808.875933]  [<c026e316>] ext4_punch_hole+0x5f/0x70
[  808.875933]  [<c028fbce>] ext4_fallocate+0x63/0x469
[  808.875933]  [<c017d4ed>] ? sched_clock_cpu+0x134/0x144
[  808.875933]  [<c023473e>] ? fsnotify+0x1e8/0x202
[  808.875933]  [<c01940d5>] ? trace_hardirqs_off+0xb/0xd
[  808.875933]  [<c017d53a>] ? local_clock+0x3d/0x55
[  808.875933]  [<c020a873>] ? fget+0x57/0x71
[  808.875933]  [<c0208974>] do_fallocate+0xe7/0x105
[  808.875933]  [<c02089c3>] sys_fallocate+0x31/0x46
[  808.875933]  [<c06d8c5d>] syscall_call+0x7/0xb
[  808.875933]  [<c06d0000>] ? init_intel+0x1aa/0x370
On Thu, 22 Mar 2012, Ted Ts'o wrote:

> On Thu, Mar 22, 2012 at 09:25:15AM +0100, Lukas Czerner wrote:
> > 
> > The worst that can happen is that after a write spanning several
> > blocks we'll have the first part of the write punched out, but the
> > second part written correctly, since in this case it might hit an
> > already punched block and need to wait for punch_hole to finish;
> > after that the rest of the range is written.  However, the write
> > should remain consistent at block granularity, which is all we
> > guarantee anyway, right?
> 
> I need to look more closely at this, but the thing that was worrying
> me was the part of truncate/punch where we have to invalidate the
> parts of the page cache where we've unmapped the blocks, i.e., the
> call to truncate_inode_pages_range() racing with the write.  I think
> we're OK, since truncate_inode_pages_range() grabs the page spinlock
> and then checks for PageWriteback, which ought to be sufficient, but
> truncate does take that codepath with i_mutex down, and so my spidey
> sense is tingling.  I may just be too paranoid, though.
> 
> Still, that's not a criticism of your patch.
> 
> More serious is the following lockdep warning that I got.  Grabbing
> i_mutex after the transaction handle is started can lead to a
> circular locking deadlock...
> 
> 					- Ted

Hrm, that's not very good.  So we probably need to take the i_mutex
for the whole transaction.  It's not a pretty solution, but I do not
see any other way around it.  Maybe we could clear the flag after the
punch_hole in a different transaction, but then the race window
between fallocate with KEEP_SIZE and punch_hole would be much bigger.
-Lukas

> [lockdep trace snipped]
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 06b0792..7b822c0 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4822,6 +4822,41 @@ int ext4_ext_punch_hole(struct file *file, loff_t offset, loff_t length)
 
 	up_write(&EXT4_I(inode)->i_data_sem);
 
+	/*
+	 * This is fugly, but even though we're going to get rid of the
+	 * EOFBLOCKS_LF in the future, we have to handle it correctly now
+	 * because there are still versions of e2fsck out there which
+	 * would scream otherwise. Once the new e2fsck code ignoring
+	 * this flag is common enough this can be removed entirely.
+	 */
+	if (ext4_test_inode_flag(inode, EXT4_INODE_EOFBLOCKS)) {
+		struct ext4_ext_path *path;
+		ext4_lblk_t last_block;
+
+		mutex_lock(&inode->i_mutex);
+		down_read(&EXT4_I(inode)->i_data_sem);
+
+		last_block = inode->i_size >> EXT4_BLOCK_SIZE_BITS(sb);
+
+		/*
+		 * We have to check whether there is any extent past the
+		 * i_size. If not, we probably punched that out, so we need
+		 * to clear the EOFBLOCKS flag
+		 */
+		path = ext4_ext_find_extent(inode, last_block, NULL);
+		if (IS_ERR(path))
+			err = PTR_ERR(path);
+		else {
+			err = check_eofblocks_fl(handle, inode, last_block,
+						 path, 1);
+			ext4_ext_drop_refs(path);
+			kfree(path);
+		}
+
+		up_read(&EXT4_I(inode)->i_data_sem);
+		mutex_unlock(&inode->i_mutex);
+	}
+
 out:
 	ext4_orphan_del(handle, inode);
 	inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
We have to clear the EOFBLOCKS flag in the case that we just punched
out all blocks allocated past i_size, so that e2fsck does not complain
about it.

Even though we're going to remove the EOFBLOCKS flag in the future, we
still have to handle it correctly for now, as there are e2fsck
versions out there which would complain about it.  We can remove this
after the new e2fsck code ignoring the EOFBLOCKS flag is common
enough.

Signed-off-by: Lukas Czerner <lczerner@redhat.com>
---
 fs/ext4/extents.c |   35 +++++++++++++++++++++++++++++++++++
 1 files changed, 35 insertions(+), 0 deletions(-)