Patchwork [4/4] ext4: Fix lost truncate due to race with writeback

Submitter Jan Kara
Date Aug. 5, 2013, 1:52 p.m.
Message ID <1375710744-29329-5-git-send-email-jack@suse.cz>
Permalink /patch/264680/
State Accepted

Comments

Jan Kara - Aug. 5, 2013, 1:52 p.m.
The following race can lead to a loss of i_disksize update from truncate
thus resulting in a wrong inode size if the inode size isn't updated
again before inode is reclaimed:

ext4_setattr()				mpage_map_and_submit_extent()
  EXT4_I(inode)->i_disksize = attr->ia_size;
  ...					  ...
					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
					  /* False because i_size isn't
					   * updated yet */
					  if (disksize > i_size_read(inode))
					  /* True, because i_disksize is
					   * already truncated */
					  if (disksize > EXT4_I(inode)->i_disksize)
					    /* Overwrite i_disksize
					     * update from truncate */
					    ext4_update_i_disksize()
  i_size_write(inode, attr->ia_size);

For other places updating i_disksize such race cannot happen because
i_mutex prevents these races. Writeback is the only place where we do
not hold i_mutex and we cannot grab it there because of lock ordering.

We fix the race by doing both i_disksize and i_size update in truncate
atomically under i_data_sem and in mpage_map_and_submit_extent() we move
the check against i_size under i_data_sem as well.

Signed-off-by: Jan Kara <jack@suse.cz>
---
 fs/ext4/ext4.h  | 24 ++++++++++++++++++++----
 fs/ext4/inode.c | 17 ++++++++++++-----
 2 files changed, 32 insertions(+), 9 deletions(-)
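
The interleaving described above can be replayed in a small userspace model.
This is only an illustrative sketch, not kernel code: struct model_inode and
the model_* helpers are hypothetical names, and a pthread mutex stands in for
i_data_sem (a rw_semaphore that both sides take for writing in the patch).
model_replay_old_race() walks through the pre-patch steps in the racy order;
the other two helpers mirror the locking the patch introduces.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode fields involved in the race. */
struct model_inode {
	pthread_mutex_t i_data_sem;	/* models EXT4_I(inode)->i_data_sem */
	int64_t i_size;			/* models inode->i_size */
	int64_t i_disksize;		/* models EXT4_I(inode)->i_disksize */
};

static void model_init(struct model_inode *in, int64_t size)
{
	pthread_mutex_init(&in->i_data_sem, NULL);
	in->i_size = size;
	in->i_disksize = size;
}

/* Truncate side after the patch: both sizes change under one lock. */
static void model_truncate(struct model_inode *in, int64_t new_size)
{
	pthread_mutex_lock(&in->i_data_sem);
	in->i_disksize = new_size;
	in->i_size = new_size;
	pthread_mutex_unlock(&in->i_data_sem);
}

/* Writeback side after the patch: clamp to i_size under the same lock. */
static void model_wb_update_disksize(struct model_inode *in, int64_t newsize)
{
	pthread_mutex_lock(&in->i_data_sem);
	if (newsize > in->i_size)
		newsize = in->i_size;
	if (newsize > in->i_disksize)
		in->i_disksize = newsize;
	pthread_mutex_unlock(&in->i_data_sem);
}

/*
 * Single-threaded replay of the pre-patch interleaving. Returns the
 * final i_disksize, which should have been new_size.
 */
static int64_t model_replay_old_race(int64_t old_size, int64_t new_size,
				     int64_t wb_end)
{
	int64_t i_size = old_size, i_disksize = old_size;
	int64_t disksize = wb_end;

	i_disksize = new_size;		/* truncate: first half */
	if (disksize > i_size)		/* writeback: sees stale i_size */
		disksize = i_size;
	if (disksize > i_disksize)	/* writeback: undoes the truncate */
		i_disksize = disksize;
	i_size = new_size;		/* truncate: second half */
	(void)i_size;
	return i_disksize;
}
```

With an 8192-byte file truncated to 1024 bytes while writeback has submitted
pages up to offset 8192, the replay leaves i_disksize at 8192. The locked
helpers end at 1024 whichever side takes the lock first, because writeback
can no longer observe a truncated i_disksize alongside a stale i_size.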
Theodore Ts'o - Aug. 17, 2013, 2:12 p.m.
On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
> The following race can lead to a loss of i_disksize update from truncate
> thus resulting in a wrong inode size if the inode size isn't updated
> again before inode is reclaimed:
> 
> ext4_setattr()				mpage_map_and_submit_extent()
>   EXT4_I(inode)->i_disksize = attr->ia_size;
>   ...					  ...
> 					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
> 					  /* False because i_size isn't
> 					   * updated yet */
> 					  if (disksize > i_size_read(inode))
> 					  /* True, because i_disksize is
> 					   * already truncated */
> 					  if (disksize > EXT4_I(inode)->i_disksize)
> 					    /* Overwrite i_disksize
> 					     * update from truncate */
> 					    ext4_update_i_disksize()
>   i_size_write(inode, attr->ia_size);
> 
> For other places updating i_disksize such race cannot happen because
> i_mutex prevents these races. Writeback is the only place where we do
> not hold i_mutex and we cannot grab it there because of lock ordering.
> 
> We fix the race by doing both i_disksize and i_size update in truncate
> atomically under i_data_sem and in mpage_map_and_submit_extent() we move
> the check against i_size under i_data_sem as well.
> 
> Signed-off-by: Jan Kara <jack@suse.cz>

Applied, thanks.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Dave Jones - Aug. 26, 2013, 7:01 p.m.
On Sat, Aug 17, 2013 at 10:12:27AM -0400, Theodore Ts'o wrote:
 > On Mon, Aug 05, 2013 at 03:52:24PM +0200, Jan Kara wrote:
 > > The following race can lead to a loss of i_disksize update from truncate
 > > thus resulting in a wrong inode size if the inode size isn't updated
 > > again before inode is reclaimed:
 > > 
 > > ext4_setattr()				mpage_map_and_submit_extent()
 > >   EXT4_I(inode)->i_disksize = attr->ia_size;
 > >   ...					  ...
 > > 					  disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT
 > > 					  /* False because i_size isn't
 > > 					   * updated yet */
 > > 					  if (disksize > i_size_read(inode))
 > > 					  /* True, because i_disksize is
 > > 					   * already truncated */
 > > 					  if (disksize > EXT4_I(inode)->i_disksize)
 > > 					    /* Overwrite i_disksize
 > > 					     * update from truncate */
 > > 					    ext4_update_i_disksize()
 > >   i_size_write(inode, attr->ia_size);
 > > 
 > > For other places updating i_disksize such race cannot happen because
 > > i_mutex prevents these races. Writeback is the only place where we do
 > > not hold i_mutex and we cannot grab it there because of lock ordering.
 > > 
 > > We fix the race by doing both i_disksize and i_size update in truncate
 > > atomically under i_data_sem and in mpage_map_and_submit_extent() we move
 > > the check against i_size under i_data_sem as well.
 > > 
 > > Signed-off-by: Jan Kara <jack@suse.cz>
 > 
 > Applied, thanks.

Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

	Dave

Theodore Ts'o - Aug. 28, 2013, 10:55 p.m.
On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
> 
> Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.

These patches fixed races that have been around for a while; it's not
a regression.  Given that they are fairly involved, I was nervous
sending them to Linus for 3.11, given the late date.

They are queued for the next merge window, and I'll mark them cc:
stable@vger.kernel.org.

						- Ted
Dave Jones - Aug. 28, 2013, 11:01 p.m.
On Wed, Aug 28, 2013 at 06:55:04PM -0400, Theodore Ts'o wrote:
 > On Mon, Aug 26, 2013 at 03:01:48PM -0400, Dave Jones wrote:
 > > 
 > > Is this queued for 3.11 ? 1k blocksize fs's are still broken in rc7.
 > 
 > These patches fixed races that have been around for a while; it's not
 > a regression.  Given that they are fairly involved, I was nervous
 > sending them to Linus for 3.11, given the late date.

That's odd, because the problem I'm seeing doesn't reproduce on 3.10.

 > They are queued for the next merge window, and I'll mark them cc:
 > stable@vger.kernel.org.
 
Fair enough.

	Dave


Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index b577e45..648c5e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2416,16 +2416,32 @@  do {								\
 #define EXT4_FREECLUSTERS_WATERMARK 0
 #endif
 
+/* Update i_disksize. Requires i_mutex to avoid races with truncate */
 static inline void ext4_update_i_disksize(struct inode *inode, loff_t newsize)
 {
-	/*
-	 * XXX: replace with spinlock if seen contended -bzzz
-	 */
+	WARN_ON_ONCE(S_ISREG(inode->i_mode) &&
+		     !mutex_is_locked(&inode->i_mutex));
+	down_write(&EXT4_I(inode)->i_data_sem);
+	if (newsize > EXT4_I(inode)->i_disksize)
+		EXT4_I(inode)->i_disksize = newsize;
+	up_write(&EXT4_I(inode)->i_data_sem);
+}
+
+/*
+ * Update i_disksize after writeback has been started. Races with truncate
+ * are avoided by checking i_size under i_data_sem.
+ */
+static inline void ext4_wb_update_i_disksize(struct inode *inode, loff_t newsize)
+{
+	loff_t i_size;
+
 	down_write(&EXT4_I(inode)->i_data_sem);
+	i_size = i_size_read(inode);
+	if (newsize > i_size)
+		newsize = i_size;
 	if (newsize > EXT4_I(inode)->i_disksize)
 		EXT4_I(inode)->i_disksize = newsize;
 	up_write(&EXT4_I(inode)->i_data_sem);
-	return ;
 }
 
 struct ext4_group_info {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e7d98d2..5d3706e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2240,12 +2240,10 @@  static int mpage_map_and_submit_extent(handle_t *handle,
 
 	/* Update on-disk size after IO is submitted */
 	disksize = ((loff_t)mpd->first_page) << PAGE_CACHE_SHIFT;
-	if (disksize > i_size_read(inode))
-		disksize = i_size_read(inode);
 	if (disksize > EXT4_I(inode)->i_disksize) {
 		int err2;
 
-		ext4_update_i_disksize(inode, disksize);
+		ext4_wb_update_i_disksize(inode, disksize);
 		err2 = ext4_mark_inode_dirty(handle, inode);
 		if (err2)
 			ext4_error(inode->i_sb,
@@ -4587,18 +4585,27 @@  int ext4_setattr(struct dentry *dentry, struct iattr *attr)
 				error = ext4_orphan_add(handle, inode);
 				orphan = 1;
 			}
+			down_write(&EXT4_I(inode)->i_data_sem);
 			EXT4_I(inode)->i_disksize = attr->ia_size;
 			rc = ext4_mark_inode_dirty(handle, inode);
 			if (!error)
 				error = rc;
+			/*
+			 * We have to update i_size under i_data_sem together
+			 * with i_disksize to avoid races with writeback code
+			 * running ext4_wb_update_i_disksize().
+			 */
+			if (!error)
+				i_size_write(inode, attr->ia_size);
+			up_write(&EXT4_I(inode)->i_data_sem);
 			ext4_journal_stop(handle);
 			if (error) {
 				ext4_orphan_del(NULL, inode);
 				goto err_out;
 			}
-		}
+		} else
+			i_size_write(inode, attr->ia_size);
 
-		i_size_write(inode, attr->ia_size);
 		/*
 		 * Blocks are going to be removed from the inode. Wait
 		 * for dio in flight.  Temporarily disable