diff mbox

discard an inode's preallocated blocks after failed allocation

Message ID 20101130232238.4F5A74274C@ruihe.smo.corp.google.com
State New, archived
Headers show

Commit Message

Jiaying Zhang Nov. 30, 2010, 11:22 p.m. UTC
We have seen kernel crashes caused by the BUG_ON in 
ext4_mb_return_to_preallocation() that checks whether the inode's
i_prealloc_list is empty. As I understand, the assumption is that
when ext4_mb_return_to_preallocation() is called from ext4_free_blocks(),
the inode's preallocation list should have been already discarded.
However, although we call ext4_discard_preallocations() during ext4
truncate, we don't always call that function in various failure
cases before calling ext4_free_blocks(). So it is likely to hit this
BUG_ON with disk errors or corrupted fs etc.

To fix the problem, the following patch adds ext4_discard_preallocation()
before ext4_free_blocks() in failed allocation cases. This will discard
any preallocated block extent attached to the inode, but I think it is 
probably what we should be doing with failed allocation.

I am also curious whether we can drop the ext4_mb_return_to_preallocation()
call from ext4_free_blocks(). From the comments above that function, it
seems to intent to discard the specified blocks only but all it is currently
doing is the BUG_ON check on whether the inode's preallocation list is empty.
Is there any plan to extend this function later?


ext4: discard an inode's preallocated blocks after failed allocation so that
we won't hit on the BUG_ON check in ext4_mb_return_to_preallocation() later.

Signed-off-by: Jiaying Zhang <jiayingz@google.com>

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Jan. 9, 2011, 3:04 a.m. UTC | #1
On Tue, Nov 30, 2010 at 03:22:38PM -0800, Jiaying Zhang wrote:
> We have seen kernel crashes caused by the BUG_ON in 
> ext4_mb_return_to_preallocation() that checks whether the inode's
> i_prealloc_list is empty. As I understand, the assumption is that
> when ext4_mb_return_to_preallocation() is called from ext4_free_blocks(),
> the inode's preallocation list should have been already discarded.
> However, although we call ext4_discard_preallocations() during ext4
> truncate, we don't always call that function in various failure
> cases before calling ext4_free_blocks(). So it is likely to hit this
> BUG_ON with disk errors or corrupted fs etc.
> 
> To fix the problem, the following patch adds ext4_discard_preallocation()
> before ext4_free_blocks() in failed allocation cases. This will discard
> any preallocated block extent attached to the inode, but I think it is 
> probably what we should be doing with failed allocation.
> 
> I am also curious whether we can drop the ext4_mb_return_to_preallocation()
> call from ext4_free_blocks(). From the comments above that function, it
> seems to intent to discard the specified blocks only but all it is currently
> doing is the BUG_ON check on whether the inode's preallocation list is empty.
> Is there any plan to extend this function later?
> 
> 
> ext4: discard an inode's preallocated blocks after failed allocation so that
> we won't hit on the BUG_ON check in ext4_mb_return_to_preallocation() later.
> 
> Signed-off-by: Jiaying Zhang <jiayingz@google.com>

Sorry, I haven't had a chance to go diving into the "legislative
intent" behind the code in question, until now.  The reason why I
haven't is because it's the mballoc code is a bit of a mess (and some
of it is my fault).

The basic idea behind ext4_mb_return_to_preallocation() is to take the
blocks that had been allocated, and if they are adjacent to the
preallocation regions reserved for the inode, to add them to back to
the preallocation region.  This is especially a good thing if the
blocks had just been taking from a preallocation region, and we
errored out trying to insert them into the extent tree, for example.
In that case, it would be good to be able to "put back" the blocks to
their original preallocation region.

There are only a few problems with this scheme:

1)  It was never implemented.

2) There is a BUG_ON which triggers if the inode has any
preallocations.  After staring at it for a long time, I've concluded
that it makes no sense whatsoever.  If I'm right about why it was
there, then in the case where we had just allocated some blocks from
the preallocation region, and now we need to put them back, of course
we inode would have some prellocated regions.

3) If the journal is enabled, we treat all data blocks as if they were
metadata blocks (that is, we don't actually release the blocks until
the current transaction has committed); this is necessary since we
need to make sure we don't actually use some data block until we know
it has been released.  This is fine, but (3a) in the code where we
finally get around to freeing the data blocks, we don't actually call
ext4_mb_return_to_preallcation(), which means the bug which Jiaying
was tracking down can never happen except in no-journal mod, and (3b)
if the data block has never been used (i.e., in the case where we
allocated the data block, but then ever got a chance to install the
blocks so they could have never have gotten used), we don't need to
wait until the next journal transaction --- and in fact in that case
there's no point calling trim on the blocks in question, since they
were never written into.

Argh; what a mess.

So it seems like the best thing to do for now is an alternate patch
which Jiaying had suggested to me privately, which is to simply remove
the BUG_ON in ext4_mb_return_to_preallocation(), since in the long
run, when we get aroundt to actually implementing
ext4_mb_return_to_preallocation(), it does the right thing.

I'm actually not sure how much of an optimization it is to implement
ext4_mb_return_to_preallocation(), so the other alternative is to
remove the function entirely.  How often are we likely to return an
error to userspace, and then have the userspace continue writing to
the file?

							- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 29a4adf..2164df6 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1056,6 +1056,7 @@  cleanup:
 	}
 
 	if (err) {
+		ext4_discard_preallocations(inode);
 		/* free all allocated blocks in error case */
 		for (i = 0; i < depth; i++) {
 			if (!ablocks[i])
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4f4362c..456cb4a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -690,6 +690,7 @@  allocated:
 	*err = 0;
 	return ret;
 failed_out:
+	ext4_discard_preallocations(inode);
 	for (i = 0; i < index; i++)
 		ext4_free_blocks(handle, inode, 0, new_blocks[i], 1, 0);
 	return ret;
@@ -787,6 +788,7 @@  static int ext4_alloc_branch(handle_t *handle, struct inode *inode,
 	return err;
 failed:
 	/* Allocation failed, free what we already allocated */
+	ext4_discard_preallocations(inode);
 	ext4_free_blocks(handle, inode, 0, new_blocks[0], 1, 0);
 	for (i = 1; i <= n ; i++) {
 		/*
@@ -878,6 +880,7 @@  static int ext4_splice_branch(handle_t *handle, struct inode *inode,
 	return err;
 
 err_out:
+	ext4_discard_preallocations(inode);
 	for (i = 1; i <= num; i++) {
 		/*
 		 * branch[i].bh is newly allocated, so there is no