diff mbox

[RFC,-V3,8/9] ext4: Fix double free of blocks

Message ID 1225997374-10846-8-git-send-email-aneesh.kumar@linux.vnet.ibm.com
State Rejected, archived
Headers show

Commit Message

Aneesh Kumar K.V Nov. 6, 2008, 6:49 p.m. UTC
Blocks freed but not yet committed will be marked free in disk bitmap.
We need to consider them as used when releasing inode prealloc
space. Otherwise we would double free them via mb_free_blocks
ext4_mb_release_inode_pa looks at the block bitmap and mark the
block found free in the bitmap withing the inode prealloc space
as unused. If we have blocks allocated out of a prealloc space
and later freed we would have called ext4_mb_free_blocks on them
That would mark blocks as free in bitmap and later on journal commit
we would call mb_free_blocks on them, thus resuling in double free

ext4_lock_group is used to make sure nothing gets added to the free
list when we are doing a release_inode_pa

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---
 fs/ext4/mballoc.c |   75 ++++++++++++++++++++++++++++++++++-------------------
 1 files changed, 48 insertions(+), 27 deletions(-)

Comments

Theodore Ts'o Nov. 6, 2008, 10:37 p.m. UTC | #1
On Fri, Nov 07, 2008 at 12:19:33AM +0530, Aneesh Kumar K.V wrote:
> Blocks freed but not yet committed will be marked free in disk bitmap.

That should probably read, "marked free in the on-disk block
allocation bitmap".

And I'm having real trouble parsing this below

> ext4_mb_release_inode_pa looks at the block bitmap and mark the
> block found free in the bitmap withing the inode prealloc space

"the block"  should that be plural?   Which block is it?

"found free in the bitmap"?  which bitmap?  The block allocation
bitmap?

"withing"?  Is that supposed to be within?

"withing the prealloc space"?  What is that supposed to be modifying?
It is cloest to "the bitmap", but that doesn't make any sense; is this
phrase supposed to be modifying "the block"?

> as unused. If we have blocks allocated out of a prealloc space

"mark as unused" Mark as unused where?  In the block allocation
bitmap?  But I thought these were "blocks found free in the bitmap",
so aren't they already unused?

It's better to try to explain things at a higher level.  What what
you've said on the conference call, I *think* what happens is that the
preallocation range is range of blocks which is reserved for
allocation for that particular inode.  To support this, the blocks in
question are made to appear as though they are not available in the
mballoc's in-memory buddy bitmap, which is what mballoc consults when
making allocations.

For some strange reason which I don't understand, when blocks are
served out of the preallocation area and allocated to the inode, they
are not removed from the preallocation area.  (This seems like the
real bug, but whatever...)  So when we release a preallocation area
for an inode, for each block in the inode's preallocation area, we
check and see if the block is marked as free according to the on-disk
disk allocation bitmap, and if so we make it look available in
mballoc's in-memory buddy bitmap.

Unfortunately, if the disk blocks in question are freed before the
commit transmit, the blocks would look free in the on-disk block
allocation bitmap, and so there would be an attempt to mark them as
available twice in mballoc's buddy bitmap. 

To get around this problem, the patch allocates a temporary bitmap
which also includes the blocks in the "release on commit" linked list.

Did I get this right?  If so, we should put an abbreviated version of
the above in the commit, and more of this needs to be explicitly
documented in comments in mballoc.c

So ---- stupid question.  Instead of creating the temporary bitmap,
which I fear will be very expensive --- why not remove the block from
the inode's preallocation area when it is served up?  Then, it becomes
easier when releasing the inode preallocation area; we wouldn't need
to consult the on-disk block allocation bitmap at all, and merely just
iterate over all of the blocks in the inode preallocation space, and
mark them as available in the mballoc buddy bitmap?

						- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V Nov. 7, 2008, 4:01 p.m. UTC | #2
On Thu, Nov 06, 2008 at 05:37:02PM -0500, Theodore Tso wrote:
> On Fri, Nov 07, 2008 at 12:19:33AM +0530, Aneesh Kumar K.V wrote:
> > Blocks freed but not yet committed will be marked free in disk bitmap.
> 
> That should probably read, "marked free in the on-disk block
> allocation bitmap".
> 
> And I'm having real trouble parsing this below
> 
> > ext4_mb_release_inode_pa looks at the block bitmap and mark the
> > block found free in the bitmap withing the inode prealloc space
> 
> "the block"  should that be plural?   Which block is it?
> 
> "found free in the bitmap"?  which bitmap?  The block allocation
> bitmap?
> 
> "withing"?  Is that supposed to be within?
> 
> "withing the prealloc space"?  What is that supposed to be modifying?
> It is cloest to "the bitmap", but that doesn't make any sense; is this
> phrase supposed to be modifying "the block"?
> 
> > as unused. If we have blocks allocated out of a prealloc space
> 
> "mark as unused" Mark as unused where?  In the block allocation
> bitmap?  But I thought these were "blocks found free in the bitmap",
> so aren't they already unused?
> 
> It's better to try to explain things at a higher level.  What what
> you've said on the conference call, I *think* what happens is that the
> preallocation range is range of blocks which is reserved for
> allocation for that particular inode.  To support this, the blocks in
> question are made to appear as though they are not available in the
> mballoc's in-memory buddy bitmap, which is what mballoc consults when
> making allocations.


Yes. That is correct

> 
> For some strange reason which I don't understand, when blocks are
> served out of the preallocation area and allocated to the inode, they
> are not removed from the preallocation area.  (This seems like the
> real bug, but whatever...) 

This happens for inode prealloc space. Because i node prealloc space
is mapped using the logical block number. Ie if we have a prealloc space
of contiguous 'x' blocks starting at physical block 'p'.  When creating
this prealloc space we would have been requesting for blocks for logical
block 'l'. ie 

[ p .........(p+k).............p+x]
  l

Now request for 2 blocks at logical file block 'l' get allocated
from above prealloc space. Again request for 3 blocks at logical block
'l+k' get allocated from the above prealloc space. Also request for 
3 blocks at logical block 'p+x+1' WON't use the above prealloc space.
Inshort inode prealloc space use logical block number to find which
prealloc space to use. We also don't reduce the pa_len of the prealloc
space. But we reduce pa_free. Now when we discard this prealloc space
we need to look at on-block bitmap and find out how many blocks actually
got allocated out of the prealloc space. So we start with the first zero
block on the block bitmap starting from physical block 'p'. We search
for next set bit. All the blocks in that range are free.  We them mark
them free in the buddy bitmap.


>So when we release a preallocation area
> for an inode, for each block in the inode's preallocation area, we
> check and see if the block is marked as free according to the on-disk
> disk allocation bitmap, and if so we make it look available in
> mballoc's in-memory buddy bitmap.
> 
> Unfortunately, if the disk blocks in question are freed before the
> commit transmit, the blocks would look free in the on-disk block
> allocation bitmap, and so there would be an attempt to mark them as
> available twice in mballoc's buddy bitmap. 

exactly.

> 
> To get around this problem, the patch allocates a temporary bitmap
> which also includes the blocks in the "release on commit" linked list.
> 
> Did I get this right?  If so, we should put an abbreviated version of
> the above in the commit, and more of this needs to be explicitly
> documented in comments in mballoc.c

I am tempted to add your mail above as it is :)

> 
> So ---- stupid question.  Instead of creating the temporary bitmap,
> which I fear will be very expensive --- why not remove the block from
> the inode's preallocation area when it is served up? 

We can't do that because inode prealloc space look at logical block
of the file.


>Then, it becomes
> easier when releasing the inode preallocation area; we wouldn't need
> to consult the on-disk block allocation bitmap at all, and merely just
> iterate over all of the blocks in the inode preallocation space, and
> mark them as available in the mballoc buddy bitmap?

-aneesh
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 7293209..f58de20 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -3725,19 +3725,19 @@  static int ext4_mb_new_preallocation(struct ext4_allocation_context *ac)
  * TODO: optimize the case when there are no in-core structures yet
  */
 static noinline_for_stack int
-ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
-			struct ext4_prealloc_space *pa,
-			struct ext4_allocation_context *ac)
+ext4_mb_release_inode_pa(struct ext4_buddy *e4b, void *bitmap,
+			 struct ext4_prealloc_space *pa,
+			 struct ext4_allocation_context *ac)
 {
-	struct super_block *sb = e4b->bd_sb;
-	struct ext4_sb_info *sbi = EXT4_SB(sb);
+	int err = 0;
+	int free = 0;
+	sector_t start;
 	unsigned int end;
 	unsigned int next;
 	ext4_group_t group;
 	ext4_grpblk_t bit;
-	sector_t start;
-	int err = 0;
-	int free = 0;
+	struct super_block *sb = e4b->bd_sb;
+	struct ext4_sb_info *sbi = EXT4_SB(sb);
 
 	BUG_ON(pa->pa_deleted == 0);
 	ext4_get_group_no_and_offset(sb, pa->pa_pstart, &group, &bit);
@@ -3749,12 +3749,11 @@  ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
 		ac->ac_inode = pa->pa_inode;
 		ac->ac_op = EXT4_MB_HISTORY_DISCARD;
 	}
-
 	while (bit < end) {
-		bit = mb_find_next_zero_bit(bitmap_bh->b_data, end, bit);
+		bit = mb_find_next_zero_bit(bitmap, end, bit);
 		if (bit >= end)
 			break;
-		next = mb_find_next_bit(bitmap_bh->b_data, end, bit);
+		next = mb_find_next_bit(bitmap, end, bit);
 		start = group * EXT4_BLOCKS_PER_GROUP(sb) + bit +
 				le32_to_cpu(sbi->s_es->s_first_data_block);
 		mb_debug("    free preallocated %u/%u in group %u\n",
@@ -3773,18 +3772,12 @@  ext4_mb_release_inode_pa(struct ext4_buddy *e4b, struct buffer_head *bitmap_bh,
 		mb_free_blocks(pa->pa_inode, e4b, bit, next - bit);
 		bit = next + 1;
 	}
-	if (free != pa->pa_free) {
-		printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n",
-			pa, (unsigned long) pa->pa_lstart,
-			(unsigned long) pa->pa_pstart,
-			(unsigned long) pa->pa_len);
-		ext4_error(sb, __func__, "free %u, pa_free %u\n",
-						free, pa->pa_free);
-		/*
-		 * pa is already deleted so we use the value obtained
-		 * from the bitmap and continue.
-		 */
-	}
+	/*
+	 * The blocks allocated and later freed from this pa
+	 * can result in pa_free being different from the
+	 * bitmap free block count. This is because we don't
+	 * update pa_len on releasing blocks.
+	 */
 	atomic_add(free, &sbi->s_mb_discarded);
 
 	return err;
@@ -3840,6 +3833,7 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 	struct ext4_allocation_context *ac;
 	struct list_head list;
 	struct ext4_buddy e4b;
+	void *bitmap;
 	int err;
 	int busy = 0;
 	int free = 0;
@@ -3855,12 +3849,21 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 				"bitmap for %u\n", group);
 		return 0;
 	}
-
+	/*
+	 * Blocks freed but not yet committed will be marked free in
+	 * disk bitmap.  We need to consider them as used when
+	 * releasing inode pa.  Otherwise we would double free them via
+	 * mb_free_blocks
+	 */
+	bitmap = kmalloc(sb->s_blocksize, GFP_NOFS);
+	if (!bitmap)
+		return 0;
 	err = ext4_mb_load_buddy(sb, group, &e4b);
 	if (err) {
 		__release(e4b->alloc_semp);
 		ext4_error(sb, __func__, "Error in loading buddy "
 				"information for %u\n", group);
+		kfree(bitmap);
 		put_bh(bitmap_bh);
 		return 0;
 	}
@@ -3915,6 +3918,8 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 		goto out;
 	}
 
+	memcpy(bitmap, bitmap_bh->b_data, sb->s_blocksize);
+	ext4_mb_generate_from_freelist(sb, bitmap, group);
 	/* now free all selected PAs */
 	list_for_each_entry_safe(pa, tmp, &list, u.pa_tmp_list) {
 
@@ -3926,7 +3931,7 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 		if (pa->pa_linear)
 			ext4_mb_release_group_pa(&e4b, pa, ac);
 		else
-			ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa, ac);
+			ext4_mb_release_inode_pa(&e4b, bitmap, pa, ac);
 
 		list_del(&pa->u.pa_tmp_list);
 		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);
@@ -3937,6 +3942,7 @@  ext4_mb_discard_group_preallocations(struct super_block *sb,
 	if (ac)
 		kmem_cache_free(ext4_ac_cachep, ac);
 	ext4_mb_release_desc(&e4b);
+	kfree(bitmap);
 	put_bh(bitmap_bh);
 	return free;
 }
@@ -3961,6 +3967,7 @@  void ext4_discard_preallocations(struct inode *inode)
 	struct list_head list;
 	struct ext4_buddy e4b;
 	int err;
+	void *bitmap;
 
 	if (!S_ISREG(inode->i_mode)) {
 		/*BUG_ON(!list_empty(&ei->i_prealloc_list));*/
@@ -4039,14 +4046,28 @@  void ext4_discard_preallocations(struct inode *inode)
 			ext4_mb_release_desc(&e4b);
 			continue;
 		}
-
+		/* blocks freed but not yet committed will
+		 * be marked free in disk bitmap. We need to
+		 * consider them as used when releasing inode
+		 * pa. Otherwise we would double free them
+		 * via mb_free_blocks
+		 */
+		bitmap = kmalloc(sb->s_blocksize, GFP_NOFS);
+		if (!bitmap) {
+			ext4_mb_release_desc(&e4b);
+			put_bh(bitmap_bh);
+			continue;
+		}
+		memcpy(bitmap, bitmap_bh->b_data, sb->s_blocksize);
 		ext4_lock_group(sb, group);
+		ext4_mb_generate_from_freelist(sb, bitmap, group);
 		list_del(&pa->pa_group_list);
-		ext4_mb_release_inode_pa(&e4b, bitmap_bh, pa, ac);
+		ext4_mb_release_inode_pa(&e4b, bitmap, pa, ac);
 		ext4_unlock_group(sb, group);
 
 		ext4_mb_release_desc(&e4b);
 		put_bh(bitmap_bh);
+		kfree(bitmap);
 
 		list_del(&pa->u.pa_tmp_list);
 		call_rcu(&(pa)->u.pa_rcu, ext4_mb_pa_callback);