diff mbox

ext4: fix deadlock of i_data_sem in ext4_mark_inode_dirty()

Message ID CAPTn0cAPYMMDx-_RvL22Mf8YMcGib65YbiwfzDhCkEO7-OtHjw@mail.gmail.com
State Rejected, archived
Headers show

Commit Message

Li Xi Sept. 4, 2014, 8:49 a.m. UTC
There are multiple places where ext4_mark_inode_dirty() is called holding
write lock of EXT4_I(inode)->i_data_sem. However, if
ext4_mark_inode_dirty() needs to expand inode size, this will cause
deadlock when ext4_xattr_block_set() tries to get read lock of
EXT4_I(inode)->i_data_sem.

Following was the messages dumped when this problem happened:

INFO: task plymouthd:124 blocked for more than 120 seconds.
      Not tainted 3.16.0-rc5+ #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
plymouthd       D 0000000000000000     0   124      1 0x00000000
 ffff8800378c38b0 0000000000000082 ffffffff81285998 ffff8800378c0010
 0000000000012b80 0000000000012b80 ffff88011a49e010 ffff88011a55cf50
 ffffffff8108e210 ffff88011a49e010 ffff880037d050a0 fffffffeffffffff
Call Trace:
 [<ffffffff81285998>] ? cfq_dispatch_requests+0x48/0x2b0
 [<ffffffff8108e210>] ? account_entity_dequeue+0x80/0xa0
 [<ffffffff81592ef9>] schedule+0x29/0x70
 [<ffffffff815955dd>] rwsem_down_read_failed+0x9d/0xf0
 [<ffffffff812978d4>] call_rwsem_down_read_failed+0x14/0x30
 [<ffffffff81595054>] ? down_read+0x24/0x30
 [<ffffffffa00cefb6>] ext4_xattr_block_set+0x546/0x6a0 [ext4]
 [<ffffffffa00cfc13>] ext4_expand_extra_isize_ea+0x4b3/0x8b0 [ext4]
 [<ffffffffa0090b06>] ext4_mark_inode_dirty+0x1a6/0x240 [ext4]
 [<ffffffffa0091c12>] ext4_setattr+0x452/0x5f0 [ext4]
 [<ffffffff811b0baa>] notify_change+0x1ca/0x330
 [<ffffffff81193a36>] do_truncate+0x66/0xa0
 [<ffffffff8124ab16>] ? ima_file_check+0x36/0x50
 [<ffffffff811a548d>] do_last+0x6bd/0x8c0
 [<ffffffff8119f700>] ? __inode_permission+0x20/0xc0
 [<ffffffff811a5754>] path_openat+0xc4/0x480
 [<ffffffff811d4fe0>] ? ep_read_events_proc+0xc0/0xc0
 [<ffffffff811a5c4a>] do_filp_open+0x4a/0xa0
 [<ffffffff811b215d>] ? __alloc_fd+0xcd/0x140
 [<ffffffff8119435a>] do_sys_open+0x11a/0x230
 [<ffffffff811944ae>] SyS_open+0x1e/0x20
 [<ffffffff81596592>] system_call_fastpath+0x16/0x1b

Signed-off-by: Li Xi <lixi <at> ddn.com>
---
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Sept. 5, 2014, 1:59 a.m. UTC | #1
On Thu, Sep 04, 2014 at 04:49:58PM +0800, Li Xi wrote:
> There are multiple places where ext4_mark_inode_dirty() is called holding
> write lock of EXT4_I(inode)->i_data_sem. However, if
> ext4_mark_inode_dirty() needs to expand inode size, this will cause
> deadlock when ext4_xattr_block_set() tries to get read lock of
> EXT4_I(inode)->i_data_sem.

This was with inline data enabled, right?

The problem with your change is that the reason why the locking is the
way it is was to fix a bug which Jan Kara identified in commit
90e775b71ac4e68: "ext4: fix lost truncate due to race with writeback".

    ext4: fix lost truncate due to race with writeback
    
    The following race can lead to a loss of i_disksize update from truncate
    thus resulting in a wrong inode size if the inode size isn't updated
    again before inode is reclaimed:
    
    ext4_setattr()                              mpage_map_and_submit_extent()
      EXT4_I(inode)->i_disksize = attr->ia_size;
      ...                                         ...
                                          disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
                                          /* False because i_size isn't
                                           * updated yet */
                                          if (disksize > i_size_read(inode))
                                          /* True, because i_disksize is
                                           * already truncated */
                                          if (disksize > EXT4_I(inode)->i_disksize)
                                            /* Overwrite i_disksize
                                             * update from truncate */
                                            ext4_update_i_disksize()
      i_size_write(inode, attr->ia_size);
    
    For other places updating i_disksize such race cannot happen because
    i_mutex prevents these races. Writeback is the only place where we do
    not hold i_mutex and we cannot grab it there because of lock ordering.
    
    We fix the race by doing both i_disksize and i_size update in truncate
    atomically under i_data_sem and in mpage_map_and_submit_extent() we move
    the check against i_size under i_data_sem as well.
    
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
    Cc: stable@vger.kernel.org

So I think we need to find another way to fix this problem.  There are
a limited number of places before we call ext4_mark_inode_dirty()
where i_size will grow such that the inline data code might need to
move the data out from i_blocks[].

It might make more sense to have a helper function which checks to see
if this condition holds, and do the converation away from using
inline_data for that inode *before* we call ext4_mark_inode_dirty().

Does that make sense to you?

Regards,

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Li Xi Sept. 5, 2014, 2:29 a.m. UTC | #2
On Fri, Sep 5, 2014 at 9:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Sep 04, 2014 at 04:49:58PM +0800, Li Xi wrote:
>> There are multiple places where ext4_mark_inode_dirty() is called holding
>> write lock of EXT4_I(inode)->i_data_sem. However, if
>> ext4_mark_inode_dirty() needs to expand inode size, this will cause
>> deadlock when ext4_xattr_block_set() tries to get read lock of
>> EXT4_I(inode)->i_data_sem.
>
> This was with inline data enabled, right?
I hit this problem when starting a kernel with project quota support for ext4.
The ext4 file system was not formated with project quota feature so it tried
to extend the space for project ID. And this problem happened every time
when the kernel was rebooted. Inline data was not enable on that file
system. I am not sure whether this problem will happen under other
circumstances. :)
>
>
> So I think we need to find another way to fix this problem.  There are
> a limited number of places before we call ext4_mark_inode_dirty()
> where i_size will grow such that the inline data code might need to
> move the data out from i_blocks[].
>
> It might make more sense to have a helper function which checks to see
> if this condition holds, and do the converation away from using
> inline_data for that inode *before* we call ext4_mark_inode_dirty().
>
> Does that make sense to you?
Sure, I agree there should be other better solution for this problem.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Sept. 5, 2014, 3:30 a.m. UTC | #3
On Fri, Sep 05, 2014 at 10:29:16AM +0800, Li Xi wrote:
> On Fri, Sep 5, 2014 at 9:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Thu, Sep 04, 2014 at 04:49:58PM +0800, Li Xi wrote:
> >> There are multiple places where ext4_mark_inode_dirty() is called holding
> >> write lock of EXT4_I(inode)->i_data_sem. However, if
> >> ext4_mark_inode_dirty() needs to expand inode size, this will cause
> >> deadlock when ext4_xattr_block_set() tries to get read lock of
> >> EXT4_I(inode)->i_data_sem.
> >
> > This was with inline data enabled, right?
> I hit this problem when starting a kernel with project quota support for ext4.
> The ext4 file system was not formated with project quota feature so it tried
> to extend the space for project ID. And this problem happened every time
> when the kernel was rebooted. Inline data was not enable on that file
> system. I am not sure whether this problem will happen under other
> circumstances. :)

I'm not understanding why expanding the inode size would result in
needing to call ext4_xattr_block_set.  Was that becuse you were
storing the project quota in the xattr?  I'm just trying to understand
the context.

Please also note that a recent set of patches (sent to the ext4 list
and in the ext4 git tree) has removed the need for taking i_data_sem
in xattr.c:

http://patchwork.ozlabs.org/patch/385347
http://patchwork.ozlabs.org/patch/385348
http://patchwork.ozlabs.org/patch/385346

(It's the 2/3 patch that removes taking i_data_sem read lock in xattr.c.)

Regards,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Li Xi Sept. 5, 2014, 4:44 a.m. UTC | #4
On Fri, Sep 5, 2014 at 11:30 AM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, Sep 05, 2014 at 10:29:16AM +0800, Li Xi wrote:
>> On Fri, Sep 5, 2014 at 9:59 AM, Theodore Ts'o <tytso@mit.edu> wrote:
>> > On Thu, Sep 04, 2014 at 04:49:58PM +0800, Li Xi wrote:
>> >> There are multiple places where ext4_mark_inode_dirty() is called holding
>> >> write lock of EXT4_I(inode)->i_data_sem. However, if
>> >> ext4_mark_inode_dirty() needs to expand inode size, this will cause
>> >> deadlock when ext4_xattr_block_set() tries to get read lock of
>> >> EXT4_I(inode)->i_data_sem.
>> >
>> > This was with inline data enabled, right?
>> I hit this problem when starting a kernel with project quota support for ext4.
>> The ext4 file system was not formated with project quota feature so it tried
>> to extend the space for project ID. And this problem happened every time
>> when the kernel was rebooted. Inline data was not enable on that file
>> system. I am not sure whether this problem will happen under other
>> circumstances. :)
>
> I'm not understanding why expanding the inode size would result in
> needing to call ext4_xattr_block_set.  Was that becuse you were
> storing the project quota in the xattr?  I'm just trying to understand
> the context.
Yeah, you are right. The problem happened when the project ID was
saved as xattr. And I can't reproduce the same problem with the new
patches.
>
> Please also note that a recent set of patches (sent to the ext4 list
> and in the ext4 git tree) has removed the need for taking i_data_sem
> in xattr.c:
>
> http://patchwork.ozlabs.org/patch/385347
> http://patchwork.ozlabs.org/patch/385348
> http://patchwork.ozlabs.org/patch/385346
>
> (It's the 2/3 patch that removes taking i_data_sem read lock in xattr.c.)
>
Great! Please ignore this patch since it won't be a problem
any more.

 Regards,

                                                 - Li Xi
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

Index: linux.git/fs/ext4/inode.c
===================================================================
--- linux.git.orig/fs/ext4/inode.c
+++ linux.git/fs/ext4/inode.c
@@ -2270,8 +2270,8 @@  static int mpage_map_and_submit_extent(h
             disksize = i_size;
         if (disksize > EXT4_I(inode)->i_disksize)
             EXT4_I(inode)->i_disksize = disksize;
-        err2 = ext4_mark_inode_dirty(handle, inode);
         up_write(&EXT4_I(inode)->i_data_sem);
+        err2 = ext4_mark_inode_dirty(handle, inode);
         if (err2)
             ext4_error(inode->i_sb,
                    "Failed to mark inode %lu dirty",
@@ -4650,9 +4650,11 @@  int ext4_setattr(struct dentry *dentry,
             }
             down_write(&EXT4_I(inode)->i_data_sem);
             EXT4_I(inode)->i_disksize = attr->ia_size;
+            up_write(&EXT4_I(inode)->i_data_sem);
             rc = ext4_mark_inode_dirty(handle, inode);
             if (!error)
                 error = rc;
+            down_write(&EXT4_I(inode)->i_data_sem);
             /*
              * We have to update i_size under i_data_sem together
              * with i_disksize to avoid races with writeback code
Index: linux.git/fs/ext4/ioctl.c
===================================================================
--- linux.git.orig/fs/ext4/ioctl.c
+++ linux.git/fs/ext4/ioctl.c
@@ -169,7 +169,9 @@  static long swap_inode_boot_loader(struc

     ext4_discard_preallocations(inode);

+    ext4_double_up_write_data_sem(inode, inode_bl);
     err = ext4_mark_inode_dirty(handle, inode);
+    ext4_double_down_write_data_sem(inode, inode_bl);
     if (err < 0) {
         ext4_warning(inode->i_sb,
             "couldn't mark inode #%lu dirty (err %d)",
@@ -177,14 +179,18 @@  static long swap_inode_boot_loader(struc
         /* Revert all changes: */
         swap_inode_data(inode, inode_bl);
     } else {
+        ext4_double_up_write_data_sem(inode, inode_bl);
         err = ext4_mark_inode_dirty(handle, inode_bl);
+        ext4_double_down_write_data_sem(inode, inode_bl);
         if (err < 0) {
             ext4_warning(inode_bl->i_sb,
                 "couldn't mark inode #%lu dirty (err %d)",
                 inode_bl->i_ino, err);
             /* Revert all changes: */
             swap_inode_data(inode, inode_bl);
+            ext4_double_up_write_data_sem(inode, inode_bl);
             ext4_mark_inode_dirty(handle, inode);
+            ext4_double_down_write_data_sem(inode, inode_bl);
         }
     }
     ext4_journal_stop(handle);
Index: linux.git/fs/ext4/migrate.c
===================================================================
--- linux.git.orig/fs/ext4/migrate.c
+++ linux.git/fs/ext4/migrate.c
@@ -635,14 +635,17 @@  int ext4_ind_migrate(struct inode *inode

     down_write(&EXT4_I(inode)->i_data_sem);
     ret = ext4_ext_check_inode(inode);
-    if (ret)
+    if (ret) {
+        up_write(&EXT4_I(inode)->i_data_sem);
         goto errout;
+    }

     eh = ext_inode_hdr(inode);
     ex  = EXT_FIRST_EXTENT(eh);
     if (ext4_blocks_count(es) > EXT4_MAX_BLOCK_FILE_PHYS ||
         eh->eh_depth != 0 || le16_to_cpu(eh->eh_entries) > 1) {
         ret = -EOPNOTSUPP;
+        up_write(&EXT4_I(inode)->i_data_sem);
         goto errout;
     }
     if (eh->eh_entries == 0)
@@ -652,6 +655,7 @@  int ext4_ind_migrate(struct inode *inode
         blk = ext4_ext_pblock(ex);
         if (len > EXT4_NDIR_BLOCKS) {
             ret = -EOPNOTSUPP;
+            up_write(&EXT4_I(inode)->i_data_sem);
             goto errout;
         }
     }
@@ -660,9 +664,9 @@  int ext4_ind_migrate(struct inode *inode
     memset(ei->i_data, 0, sizeof(ei->i_data));
     for (i=0; i < len; i++)
         ei->i_data[i] = cpu_to_le32(blk++);
+    up_write(&EXT4_I(inode)->i_data_sem);
     ext4_mark_inode_dirty(handle, inode);
 errout:
     ext4_journal_stop(handle);
-    up_write(&EXT4_I(inode)->i_data_sem);
     return ret;
 }