Patchwork ext4: take i_mutex in ext4_symlink to eliminate a warning from ext4_truncate

login
register
mail settings
Submitter Zheng Liu
Date March 27, 2013, 1:19 p.m.
Message ID <1364390347-4360-1-git-send-email-wenqing.lz@taobao.com>
Download mbox | patch
Permalink /patch/231679/
State Superseded
Headers show

Comments

Zheng Liu - March 27, 2013, 1:19 p.m.
From: Zheng Liu <wenqing.lz@taobao.com>

After applied this commit (8e4061cb), we will get a warning from
ext4_truncate when i_mutex isn't taken.  Here the assumption is that
i_mutex should be taken when we do a truncation.  In ext4_symlink we
could need to call ext4_truncate to trim some blocks beyond i_size, but
the i_mutex isn't taken.

Signed-off-by: Zheng Liu <wenqing.lz@taobao.com>
---
 fs/ext4/namei.c | 2 ++
 1 file changed, 2 insertions(+)
Theodore Ts'o - March 27, 2013, 1:41 p.m.
On Wed, Mar 27, 2013 at 09:19:07PM +0800, Zheng Liu wrote:
> From: Zheng Liu <wenqing.lz@taobao.com>
> 
> After applied this commit (8e4061cb), we will get a warning from
> ext4_truncate when i_mutex isn't taken.  Here the assumption is that
> i_mutex should be taken when we do a truncation.  In ext4_symlink we
> could need to call ext4_truncate to trim some blocks beyond i_size, but
> the i_mutex isn't taken.

Hmm, and this is why I added the warning.  Even after looking your
patch, I'm having trouble finding the codepath that results in
ext4_truncate() getting called from __page_symlink().  Can you send
the stack trace from the WARN_ON, just so I can see what I missed?

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 27, 2013, 1:51 p.m.
Ah, now I see.  Thanks for sending the stack trace.  On the failure
path, we're calling the inline function ext4_truncate_filaed_write()
and this is calling ext4_truncate().

But I'm now wondering if we need to take the i_data_sem mutex in
ext4_truncate_failed_write().

Otherwise, couldn't we end up with problems where a failed write calls
ext4_truncate() without i_data_sem(), and that races with something
else --- say, a punch or truncate call to that same inode?

     	      	      	 	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - March 27, 2013, 2:02 p.m.
On Wed, Mar 27, 2013 at 09:41:10AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 09:19:07PM +0800, Zheng Liu wrote:
> > From: Zheng Liu <wenqing.lz@taobao.com>
> > 
> > After applied this commit (8e4061cb), we will get a warning from
> > ext4_truncate when i_mutex isn't taken.  Here the assumption is that
> > i_mutex should be taken when we do a truncation.  In ext4_symlink we
> > could need to call ext4_truncate to trim some blocks beyond i_size, but
> > the i_mutex isn't taken.
> 
> Hmm, and this is why I added the warning.  Even after looking your
> patch, I'm having trouble finding the codepath that results in
> ext4_truncate() getting called from __page_symlink().  Can you send
> the stack trace from the WARN_ON, just so I can see what I missed?

Here it is.

kernel: ------------[ cut here ]------------
kernel: WARNING: at fs/ext4/inode.c:3803 ext4_truncate+0x36/0x2d7 [ext4]()
kernel: Hardware name: OptiPlex 780
kernel: Modules linked in: ext4(O) jbd2 crc16 cpufreq_ondemand ipv6
dm_mirror dm_region_hash dm_log dm_mod parport_pc parport dcdbas
acpi_cpufreq mperf serio_raw button pcspkr i2c_i801 sg ehci_pci
ehci_hcd e1000e ptp pps_core ext3 jbd sd_mod ahci libahci libata
scsi_mod uhci_hcd radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2
c_core [last unloaded: crc16]
kernel: Pid: 4152, comm: fsstress Tainted: G W  O 3.9.0-rc4+ #6
kernel: Call Trace:
kernel: [<ffffffff82031e14>] warn_slowpath_common+0x85/0x9f
kernel: [<ffffffff82031e48>] warn_slowpath_null+0x1a/0x1c
kernel: [<ffffffffa044b789>] ext4_truncate+0x36/0x2d7 [ext4]
kernel: [<ffffffffa044c9a5>] ext4_write_begin+0x35f/0x3b9 [ext4]
kernel: [<ffffffff820af514>] pagecache_write_begin+0x1c/0x1e
kernel: [<ffffffff820f5f80>] __page_symlink+0x6d/0x10f
kernel: [<ffffffffa0450806>] ?  ext4_orphan_add+0x1ea/0x219 [ext4]
kernel: [<ffffffffa0451a7f>] ext4_symlink+0x1ca/0x33e [ext4]
kernel: [<ffffffff820f8c41>] vfs_symlink+0x7c/0xd6
kernel: [<ffffffff820fb634>] sys_symlinkat+0x68/0xb9
kernel: [<ffffffff820fb69b>] sys_symlink+0x16/0x18
kernel: [<ffffffff8238fac2>] system_call_fastpath+0x16/0x1b
kernel: ---[ end trace 05d179cc296c4f3a ]---

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - March 27, 2013, 2:04 p.m.
On Wed, Mar 27, 2013 at 10:02:50PM +0800, Zheng Liu wrote:
> On Wed, Mar 27, 2013 at 09:41:10AM -0400, Theodore Ts'o wrote:
> > On Wed, Mar 27, 2013 at 09:19:07PM +0800, Zheng Liu wrote:
> > > From: Zheng Liu <wenqing.lz@taobao.com>
> > > 
> > > After applied this commit (8e4061cb), we will get a warning from
> > > ext4_truncate when i_mutex isn't taken.  Here the assumption is that
> > > i_mutex should be taken when we do a truncation.  In ext4_symlink we
> > > could need to call ext4_truncate to trim some blocks beyond i_size, but
> > > the i_mutex isn't taken.
> > 
> > Hmm, and this is why I added the warning.  Even after looking your
> > patch, I'm having trouble finding the codepath that results in
> > ext4_truncate() getting called from __page_symlink().  Can you send
> > the stack trace from the WARN_ON, just so I can see what I missed?
> 
> Here it is.
> 
> kernel: ------------[ cut here ]------------
> kernel: WARNING: at fs/ext4/inode.c:3803 ext4_truncate+0x36/0x2d7 [ext4]()
> kernel: Hardware name: OptiPlex 780
> kernel: Modules linked in: ext4(O) jbd2 crc16 cpufreq_ondemand ipv6
> dm_mirror dm_region_hash dm_log dm_mod parport_pc parport dcdbas
> acpi_cpufreq mperf serio_raw button pcspkr i2c_i801 sg ehci_pci
> ehci_hcd e1000e ptp pps_core ext3 jbd sd_mod ahci libahci libata
> scsi_mod uhci_hcd radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2
> c_core [last unloaded: crc16]
> kernel: Pid: 4152, comm: fsstress Tainted: G W  O 3.9.0-rc4+ #6
> kernel: Call Trace:
> kernel: [<ffffffff82031e14>] warn_slowpath_common+0x85/0x9f
> kernel: [<ffffffff82031e48>] warn_slowpath_null+0x1a/0x1c
> kernel: [<ffffffffa044b789>] ext4_truncate+0x36/0x2d7 [ext4]
> kernel: [<ffffffffa044c9a5>] ext4_write_begin+0x35f/0x3b9 [ext4]
> kernel: [<ffffffff820af514>] pagecache_write_begin+0x1c/0x1e
> kernel: [<ffffffff820f5f80>] __page_symlink+0x6d/0x10f
> kernel: [<ffffffffa0450806>] ?  ext4_orphan_add+0x1ea/0x219 [ext4]
> kernel: [<ffffffffa0451a7f>] ext4_symlink+0x1ca/0x33e [ext4]
> kernel: [<ffffffff820f8c41>] vfs_symlink+0x7c/0xd6
> kernel: [<ffffffff820fb634>] sys_symlinkat+0x68/0xb9
> kernel: [<ffffffff820fb69b>] sys_symlink+0x16/0x18
> kernel: [<ffffffff8238fac2>] system_call_fastpath+0x16/0x1b
> kernel: ---[ end trace 05d179cc296c4f3a ]---

Sorry, maybe you want to reproduce this warning.  xfstests #083 can
trigger it.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - March 27, 2013, 3:07 p.m.
On Wed, Mar 27, 2013 at 09:51:55AM -0400, Theodore Ts'o wrote:
> Ah, now I see.  Thanks for sending the stack trace.  On the failure
> path, we're calling the inline function ext4_truncate_filaed_write()
> and this is calling ext4_truncate().
> 
> But I'm now wondering if we need to take the i_data_sem mutex in
> ext4_truncate_failed_write().
> 
> Otherwise, couldn't we end up with problems where a failed write calls
> ext4_truncate() without i_data_sem(), and that races with something
> else --- say, a punch or truncate call to that same inode?

I don't think we need to take i_mutex lock honestly.  In ext4_symlink
when we call __page_symlink() the new inode doesn't access yet.  So no
one can do a punching hole or truncation to this inode.  But I also
think we need to add WARN_ON in ext4_truncate because i_mutex lock is
used to serialize truncate/punch hole and buffered io.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 27, 2013, 3:12 p.m.
On Wed, Mar 27, 2013 at 11:19:22PM +0800, Zheng Liu wrote:
> > > Otherwise, couldn't we end up with problems where a failed write calls
> > > ext4_truncate() without i_data_sem(), and that races with something
> > > else --- say, a punch or truncate call to that same inode?
> 
> Let me think about it.  I need to take a close look at it.

Note that I'm not so concerned when we are creating symlink --- you
are quite right in pointing out in that case the inode isn't in the
namespace yet, so that prevents races --- but also what might happen
in an ENOSPC write(2) failure racing against a punch/truncate call.

But again, this is why I added the warning --- it was to find these
edge cases that we might not have considered.  :-)

     	   	   	     	  	       - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - March 27, 2013, 3:19 p.m.
On Wed, Mar 27, 2013 at 11:07:35PM +0800, Zheng Liu wrote:
> On Wed, Mar 27, 2013 at 09:51:55AM -0400, Theodore Ts'o wrote:
> > Ah, now I see.  Thanks for sending the stack trace.  On the failure
> > path, we're calling the inline function ext4_truncate_filaed_write()
> > and this is calling ext4_truncate().
> > 
> > But I'm now wondering if we need to take the i_data_sem mutex in
                                                 ^^^
Sigh, I misread i_data_sem and i_mutex.  Really sorry about that. :-/

> > ext4_truncate_failed_write().
> > 
> > Otherwise, couldn't we end up with problems where a failed write calls
> > ext4_truncate() without i_data_sem(), and that races with something
> > else --- say, a punch or truncate call to that same inode?

Let me think about it.  I need to take a close look at it.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Zheng Liu - March 27, 2013, 3:35 p.m.
On Wed, Mar 27, 2013 at 11:12:48AM -0400, Theodore Ts'o wrote:
> On Wed, Mar 27, 2013 at 11:19:22PM +0800, Zheng Liu wrote:
> > > > Otherwise, couldn't we end up with problems where a failed write calls
> > > > ext4_truncate() without i_data_sem(), and that races with something
> > > > else --- say, a punch or truncate call to that same inode?
> > 
> > Let me think about it.  I need to take a close look at it.
> 
> Note that I'm not so concerned when we are creating symlink --- you
> are quite right in pointing out in that case the inode isn't in the
> namespace yet, so that prevents races --- but also what might happen
> in an ENOSPC write(2) failure racing against a punch/truncate call.
> 
> But again, this is why I added the warning --- it was to find these
> edge cases that we might not have considered.  :-)

ext4_truncate_failed_write() is called by the following functions:
 - ext4_ind_direct_IO
 - ext4_convert_inline_data_to_extent
 - ext4_da_convert_inline_data_to_extent
 - ext4_write_begin
 - ext4_write_end
 - ext4_journalled_write_end
 - ext4_da_write_begin

All these functions are protected by i_mutex.  So we can serialize it
with truncate/punch hole.

Regards,
                                                - Zheng
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - March 28, 2013, 2:06 p.m.
I looked more closely at the assumption that ext4_write_begin() holds
i_mutex.  This is guaranteed by Documentation/filesystems/Locking,
which notes that write_begin() and write_end() functions hold i_mutex:

			PageLocked(page)	i_mutex
write_begin:		locks the page		yes
write_end:		yes, unlocks		yes

So the bug is that ext4_symlink() calls __page_symlink();
__page_symlink() calls pagecache_write_begin() which calls
write_begin(), without taking i_mutex.

So we can fix this by taking i_mutex in ext4_symlink(), but I think it
would be better to take the i_mutex in __page_symlink(), since it
would then address a violation of the locking rules for all file
systems.

Al, do you agree?

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index 3825d6a..d75f91a 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -2856,7 +2856,9 @@  retry:
 		ext4_journal_stop(handle);
 		if (err)
 			goto err_drop_inode;
+		mutex_lock(&inode->i_mutex);
 		err = __page_symlink(inode, symname, l, 1);
+		mutex_unlock(&inode->i_mutex);
 		if (err)
 			goto err_drop_inode;
 		/*