diff mbox

[1/2] ext4: Add inode to the orphan list during block allocation failure

Message ID 1238491766-13182-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
State Accepted, archived
Headers show

Commit Message

Aneesh Kumar K.V March 31, 2009, 9:29 a.m. UTC
We should add inode to the orphan list in the same transaction
as block allocation. This ensures that if we crash after a failed
block allocation and before we do a vmtruncate we don't leak block
(ie block marked as used in bitmap but not claimed by the inode).

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
CC:  Jan Kara <jack@suse.cz>
---
 fs/ext4/inode.c |   15 +++++++++++++--
 1 files changed, 13 insertions(+), 2 deletions(-)

Comments

Jan Kara March 31, 2009, 9:34 a.m. UTC | #1
On Tue 31-03-09 14:59:25, Aneesh Kumar K.V wrote:
> We should add inode to the orphan list in the same transaction
> as block allocation. This ensures that if we crash after a failed
> block allocation and before we do a vmtruncate we don't leak block
> (ie block marked as used in bitmap but not claimed by the inode).
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> CC:  Jan Kara <jack@suse.cz>
  Looks fine.

  Reviewed-by: Jan Kara <jack@suse.cz>

								Honza

> ---
>  fs/ext4/inode.c |   15 +++++++++++++--
>  1 files changed, 13 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2231a65..074185f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -1424,7 +1424,7 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  				struct page **pagep, void **fsdata)
>  {
>  	struct inode *inode = mapping->host;
> -	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
> +	int ret, needed_blocks;
>  	handle_t *handle;
>  	int retries = 0;
>  	struct page *page;
> @@ -1435,6 +1435,11 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  		   "dev %s ino %lu pos %llu len %u flags %u",
>  		   inode->i_sb->s_id, inode->i_ino,
>  		   (unsigned long long) pos, len, flags);
> +	/*
> +	 * Reserve one block more for addition to orphan list in case
> +	 * we allocate blocks but write fails for some reason
> +	 */
> +	needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
>   	index = pos >> PAGE_CACHE_SHIFT;
>  	from = pos & (PAGE_CACHE_SIZE - 1);
>  	to = from + len;
> @@ -1468,14 +1473,20 @@ static int ext4_write_begin(struct file *file, struct address_space *mapping,
>  
>  	if (ret) {
>  		unlock_page(page);
> -		ext4_journal_stop(handle);
>  		page_cache_release(page);
>  		/*
>  		 * block_write_begin may have instantiated a few blocks
>  		 * outside i_size.  Trim these off again. Don't need
>  		 * i_size_read because we hold i_mutex.
> +		 *
> +		 * Add inode to orphan list in case we crash before
> +		 * truncate finishes
>  		 */
>  		if (pos + len > inode->i_size)
> +			ext4_orphan_add(handle, inode);
> +
> +		ext4_journal_stop(handle);
> +		if (pos + len > inode->i_size)
>  			vmtruncate(inode, inode->i_size);
>  	}
>  
> -- 
> 1.6.2.1.404.gb0085.dirty
>
Theodore Ts'o April 5, 2009, 3:11 a.m. UTC | #2
On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote:
> We should add inode to the orphan list in the same transaction
> as block allocation. This ensures that if we crash after a failed
> block allocation and before we do a vmtruncate we don't leak block
> (ie block marked as used in bitmap but not claimed by the inode).

How likely is this going to happen?  If it's a failure in the block
allocation, we will have really end up with blocks outside i_size?
And in that case, I'm missing how this ends up being a leaked block,
since presumably the block is still being referenced by the inode.  Am
I missing something here?

I guess it can happen if do_journal_get_write_access() returns an
error, which could potentially happen if there is a ENOMEM error
coming from the jbd2 layer.  But that brings up another potential
problem.  It's possible for vmtruncate() to fail; if ext4_truncate()
fails too early (particularly in ext4_ext_truncate, due to a memory
error in a jbd2 routines), it's possible for it to return without
calling ext4_orphan_del() --- and stray inodes on the orphan list will
cause the kernel to panic on umount.

I think this can be fixed by making sure that ext4_truncate() and
ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
paths.  That *should* the problem, since at the moment, it doesn't
look vmtruncate() will return without calling inode->i_op->truncate().
But could you double check this carefully?

Thanks,

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara April 6, 2009, 10:05 a.m. UTC | #3
Hi,

On Sat 04-04-09 23:11:16, Theodore Tso wrote:
> On Tue, Mar 31, 2009 at 02:59:25PM +0530, Aneesh Kumar K.V wrote:
> > We should add inode to the orphan list in the same transaction
> > as block allocation. This ensures that if we crash after a failed
> > block allocation and before we do a vmtruncate we don't leak block
> > (ie block marked as used in bitmap but not claimed by the inode).
> 
> How likely is this going to happen?  If it's a failure in the block
> allocation, we will have really end up with blocks outside i_size?
> And in that case, I'm missing how this ends up being a leaked block,
> since presumably the block is still being referenced by the inode.  Am
> I missing something here?
  Aneesh's changelog was not completely precise here. First note that some
problems (namely those in ext4_write_begin()) can happen only if blocksize
< pagesize - there we can succeed in allocating some blocks for the page
but not all of them. Since we then decide to redo the whole write, we have
to truncate blocks we have already allocated.
  The problem in ext4_..._write_end() is different (and rather easy to hit
under heavy memory pressure) - the problem is that generic_perform_write()
fails to copy data into our kernel page because the user page from which
we should copy the data has been paged-out. In this situation we again
decide to redo the write and so we should truncate the already allocated
blocks since i_size is still set to the value before write. Otherwise
we'd have blocks allocated beyond i_size and so a crash before we
successfully redo the write would "leak" blocks (you're right the inode
would be still referencing them but they'll be above i_size and I guess
it could confuse some code).

> I guess it can happen if do_journal_get_write_access() returns an
> error, which could potentially happen if there is a ENOMEM error
> coming from the jbd2 layer.  But that brings up another potential
> problem.  It's possible for vmtruncate() to fail; if ext4_truncate()
> fails too early (particularly in ext4_ext_truncate, due to a memory
> error in a jbd2 routines), it's possible for it to return without
> calling ext4_orphan_del() --- and stray inodes on the orphan list will
> cause the kernel to panic on umount.
> 
> I think this can be fixed by making sure that ext4_truncate() and
> ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> paths.  That *should* the problem, since at the moment, it doesn't
> look vmtruncate() will return without calling inode->i_op->truncate().
> But could you double check this carefully?
  Ah, OK, that should be fixed. But note that current ext4_setattr()
does exactly the same thing on standard truncates - it adds inode to
orphan list and calls inode_setattr() which end's up calling vmtruncate().

								Honza
Theodore Ts'o June 5, 2009, 4:31 a.m. UTC | #4
On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote:
> > I think this can be fixed by making sure that ext4_truncate() and
> > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> > paths.  That *should* the problem, since at the moment, it doesn't
> > look vmtruncate() will return without calling inode->i_op->truncate().
> > But could you double check this carefully?
>
>   Ah, OK, that should be fixed. But note that current ext4_setattr()
> does exactly the same thing on standard truncates - it adds inode to
> orphan list and calls inode_setattr() which end's up calling vmtruncate().

I finally had a chance to take a closer look at this.  ext4_setattr()
is safe, because it does this after calling inode_setattr():

	/* If inode_setattr's call to ext4_truncate failed to get a
	 * transaction handle at all, we need to clean up the in-core
	 * orphan list manually. */
	if (inode->i_nlink)
		ext4_orphan_del(NULL, inode);

So if we put the same thing into the ext4_write_begin() and
ext4_writeback_write_end() in these patches, it should be OK.  The key
is that if the inode is already is on the orphan list, it's harmless
to call ext4_orphan_add() --- and if the inode has already been
removed from the orphan list, it's harmless to call ext4_orphan_del()
on it.

					- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 5, 2009, 6:22 a.m. UTC | #5
I've tried applying these two patches to the ext4 patch queue.  On a
metadata-heavy workload (specifically, fsx), it causes a 5%
degradation in the wall clock run-time of the fsx run-time.

Maybe there's no way around that, but it's rather unfortunate....

	       	   	      	       	  - Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o June 5, 2009, 7:24 a.m. UTC | #6
On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote:
> I've tried applying these two patches to the ext4 patch queue.  On a
> metadata-heavy workload (specifically, fsx), it causes a 5%
> degradation in the wall clock run-time of the fsx run-time.

I just did some more timing tests, and it looks like I can't confirm
it.  The test times are noisy enough I need to run some more under very
controlled circumsntances.  So ignore this for now, it might not be a
problem after all.

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Jan Kara June 5, 2009, 11:42 p.m. UTC | #7
On Fri 05-06-09 03:24:17, Theodore Tso wrote:
> On Fri, Jun 05, 2009 at 02:22:34AM -0400, Theodore Tso wrote:
> > I've tried applying these two patches to the ext4 patch queue.  On a
> > metadata-heavy workload (specifically, fsx), it causes a 5%
> > degradation in the wall clock run-time of the fsx run-time.
> 
> I just did some more timing tests, and it looks like I can't confirm
> it.  The test times are noisy enough I need to run some more under very
> controlled circumsntances.  So ignore this for now, it might not be a
> problem after all.
  It would be strange if the patches caused a measurable slowdown. They
change something only in the failure path where we fail to allocate some
blocks or fail to copy in all the data and these should better be
exceptional cases...

								Honza
Jan Kara June 5, 2009, 11:44 p.m. UTC | #8
On Fri 05-06-09 00:31:17, Theodore Tso wrote:
> On Mon, Apr 06, 2009 at 12:05:09PM +0200, Jan Kara wrote:
> > > I think this can be fixed by making sure that ext4_truncate() and
> > > ext4_ext_truncate() calls ext4_orphan_del() in *all* of their error
> > > paths.  That *should* the problem, since at the moment, it doesn't
> > > look vmtruncate() will return without calling inode->i_op->truncate().
> > > But could you double check this carefully?
> >
> >   Ah, OK, that should be fixed. But note that current ext4_setattr()
> > does exactly the same thing on standard truncates - it adds inode to
> > orphan list and calls inode_setattr() which end's up calling vmtruncate().
> 
> I finally had a chance to take a closer look at this.  ext4_setattr()
> is safe, because it does this after calling inode_setattr():
> 
> 	/* If inode_setattr's call to ext4_truncate failed to get a
> 	 * transaction handle at all, we need to clean up the in-core
> 	 * orphan list manually. */
> 	if (inode->i_nlink)
> 		ext4_orphan_del(NULL, inode);
> 
> So if we put the same thing into the ext4_write_begin() and
> ext4_writeback_write_end() in these patches, it should be OK.  The key
> is that if the inode is already is on the orphan list, it's harmless
> to call ext4_orphan_add() --- and if the inode has already been
> removed from the orphan list, it's harmless to call ext4_orphan_del()
> on it.
  Ah, good point. I haven't noticed this in ext4_inode_setattr when I
was checking it. Aneesh, will you take care of it for ext4? I'll cook up
an appropriate change for ext3... Thanks Ted for looking into this.

								Honza
diff mbox

Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2231a65..074185f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1424,7 +1424,7 @@  static int ext4_write_begin(struct file *file, struct address_space *mapping,
 				struct page **pagep, void **fsdata)
 {
 	struct inode *inode = mapping->host;
-	int ret, needed_blocks = ext4_writepage_trans_blocks(inode);
+	int ret, needed_blocks;
 	handle_t *handle;
 	int retries = 0;
 	struct page *page;
@@ -1435,6 +1435,11 @@  static int ext4_write_begin(struct file *file, struct address_space *mapping,
 		   "dev %s ino %lu pos %llu len %u flags %u",
 		   inode->i_sb->s_id, inode->i_ino,
 		   (unsigned long long) pos, len, flags);
+	/*
+	 * Reserve one block more for addition to orphan list in case
+	 * we allocate blocks but write fails for some reason
+	 */
+	needed_blocks = ext4_writepage_trans_blocks(inode) + 1;
  	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
@@ -1468,14 +1473,20 @@  static int ext4_write_begin(struct file *file, struct address_space *mapping,
 
 	if (ret) {
 		unlock_page(page);
-		ext4_journal_stop(handle);
 		page_cache_release(page);
 		/*
 		 * block_write_begin may have instantiated a few blocks
 		 * outside i_size.  Trim these off again. Don't need
 		 * i_size_read because we hold i_mutex.
+		 *
+		 * Add inode to orphan list in case we crash before
+		 * truncate finishes
 		 */
 		if (pos + len > inode->i_size)
+			ext4_orphan_add(handle, inode);
+
+		ext4_journal_stop(handle);
+		if (pos + len > inode->i_size)
 			vmtruncate(inode, inode->i_size);
 	}