[-V2,3/5] ext4: Fix the race between read_block_bitmap and mark_diskspace_used

Message ID 1227285875-18011-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
State Superseded, archived
Commit Message

Aneesh Kumar K.V Nov. 21, 2008, 4:44 p.m. UTC
We need to make sure we update the block bitmap and clear
EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
time in ext4_read_block_bitmap (introduced by
c806e68f5647109350ec546fee5b526962970fd2 )

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/ext4/mballoc.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

Comments

Eric Sandeen Nov. 21, 2008, 5:22 p.m. UTC | #1
Aneesh Kumar K.V wrote:
> We need to make sure we update the block bitmap and clear
> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> time in ext4_read_block_bitmap (introduced by
> c806e68f5647109350ec546fee5b526962970fd2 )

Can you add details about the failure mode(s) of this race, so people
(i.e. me) have an idea which bugs they've seen that it might address?

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V Nov. 21, 2008, 5:31 p.m. UTC | #2
On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > We need to make sure we update the block bitmap and clear
> > EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> > at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> > time in ext4_read_block_bitmap (introduced by
> > c806e68f5647109350ec546fee5b526962970fd2 )
> 
> Can you add details about the failure mode(s) of this race, so people
> (i.e. me) have an idea which bugs they've seen that it might address?
> 

ext4_read_block_bitmap does

	spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
		ext4_init_block_bitmap(sb, bh, block_group, desc);

The above ext4_init_block_bitmap call actually zeroes out the block bitmap.

Now on the block allocation side we do

	mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);

	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);

i.e. on allocation we update the bitmap, and only then take the
sb_bgl_lock and clear the EXT4_BG_BLOCK_UNINIT flag. A parallel
ext4_read_block_bitmap can therefore zero out the bitmap in between
the above mb_set_bits and spin_lock(sb_bgl_lock..)

Result of this race is
a) blocks getting allocated multiple times
b) File corruption because two files have same blocks allocated
c) mb_free_blocks called multiple times on the same block

....

Same is true with inode bitmap also.
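
The ordering described above can be sketched in userspace. This is only
an illustration under stated assumptions: a pthread mutex stands in for
sb_bgl_lock, a plain struct stands in for the group descriptor and the
bitmap buffer, and the names group_model, model_read_bitmap and
model_mark_used are hypothetical, not taken from mballoc.c. The point is
that the bits are set and the uninit flag is cleared inside one critical
section, so a reader can never observe "bits set, flag still uninit".

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define BG_BLOCK_UNINIT 0x1u   /* stand-in for EXT4_BG_BLOCK_UNINIT */
#define BITMAP_BYTES    16

/* Hypothetical model of one block group: lock, flags and bitmap. */
struct group_model {
	pthread_mutex_t lock;          /* stand-in for sb_bgl_lock */
	unsigned int flags;
	uint8_t bitmap[BITMAP_BYTES];
};

static void model_init(struct group_model *g)
{
	pthread_mutex_init(&g->lock, NULL);
	g->flags = BG_BLOCK_UNINIT;
	memset(g->bitmap, 0, sizeof(g->bitmap));
}

/* Reader side: re-initialise the bitmap only while the flag is still set. */
static void model_read_bitmap(struct group_model *g)
{
	pthread_mutex_lock(&g->lock);
	if (g->flags & BG_BLOCK_UNINIT)
		memset(g->bitmap, 0, sizeof(g->bitmap));
	pthread_mutex_unlock(&g->lock);
}

/* Allocator side: set the bits and clear the flag under the same lock. */
static void model_mark_used(struct group_model *g, int start, int len)
{
	assert(start >= 0 && len >= 0 && start + len <= BITMAP_BYTES * 8);
	pthread_mutex_lock(&g->lock);
	for (int i = start; i < start + len; i++)
		g->bitmap[i >> 3] |= (uint8_t)(1u << (i & 7));
	g->flags &= ~BG_BLOCK_UNINIT;
	pthread_mutex_unlock(&g->lock);
}

/* Count the set bits, taking the lock for a consistent view. */
static int model_count_set(struct group_model *g)
{
	int n = 0;

	pthread_mutex_lock(&g->lock);
	for (int i = 0; i < BITMAP_BYTES * 8; i++)
		n += (g->bitmap[i >> 3] >> (i & 7)) & 1;
	pthread_mutex_unlock(&g->lock);
	return n;
}
```

With this ordering, a model_read_bitmap() call that runs after
model_mark_used() sees the flag already cleared and leaves the bits
alone; one that runs before it zeroes a bitmap that has no allocations
yet.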

-aneesh
Aneesh Kumar K.V Nov. 21, 2008, 5:39 p.m. UTC | #3
On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
> On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
> > Aneesh Kumar K.V wrote:
> > > We need to make sure we update the block bitmap and clear
> > > EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> > > at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> > > time in ext4_read_block_bitmap (introduced by
> > > c806e68f5647109350ec546fee5b526962970fd2 )
> > 
> > Can you add details about the failure mode(s) of this race, so people
> > (i.e. me) have an idea which bugs they've seen that it might address?
> > 
> 

The errors I have seen are

a)
	if (free != pa->pa_free) {
		printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n",
			pa, (unsigned long) pa->pa_lstart,
			(unsigned long) pa->pa_pstart,
			(unsigned long) pa->pa_len);

b)

	if (!mb_test_bit(block, EXT4_MB_BITMAP(e4b))) {
		ext4_fsblk_t blocknr;
		blocknr = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb);
		blocknr += block;
		blocknr +=
			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
		ext4_unlock_group(sb, e4b->bd_group);
		ext4_error(sb, __func__, "double-free of inode"
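
The block-number arithmetic in snippet (b) can be written out as a small
standalone helper. This is a sketch only: group_block_to_fsblk is a
hypothetical name, the parameters are plain host-endian integers rather
than fields read from the superblock, and widening to 64 bits before the
multiply is a choice made here, not something the snippet above shows.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Convert a (group, offset-in-group) pair to a filesystem block number:
 *   blocknr = group * blocks_per_group + block + first_data_block
 */
static uint64_t group_block_to_fsblk(uint32_t group,
				     uint32_t blocks_per_group,
				     uint32_t block,
				     uint32_t first_data_block)
{
	/* The offset must fall inside the group. */
	assert(block < blocks_per_group);
	return (uint64_t)group * blocks_per_group + block + first_data_block;
}
```

For example, group 2 with 32768 blocks per group, offset 5 and a first
data block of 0 gives block 65541.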

For the inode bitmap I have seen:

[root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71
ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle
[root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/
total 411
drwxrwxrwx 3 689933 root    1024 Nov 18 05:59 .
drwxrwxrwx 3   8391 root    1024 Nov 18 05:59 ..
drwxrwxrwx 2 root   root    1024 Nov 18 05:33 d83
-rw-rw-rw- 1 root   root       0 Nov 18 05:06 fb4
-rw-rw-rw- 1 root   root 3350138 Nov 18 05:33 fb9
?--------- ? ?      ?          ?            ? l71
lrwxrwxrwx 1 root   root     509 Nov 18 05:23 ld9 -> xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxx
[root@llm19 tmp]# 

dmesg gives:

EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449


Some other messages I got earlier, but I didn't capture the info fully:

a) "Deleting nonexistent file ..." warning in ext4_unlink

b) "Empty directory has too many links..." in ext4_rmdir

Eric Sandeen Nov. 21, 2008, 5:39 p.m. UTC | #4
Aneesh Kumar K.V wrote:
> On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
>> Aneesh Kumar K.V wrote:
>>> We need to make sure we update the block bitmap and clear
>>> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
>>> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
>>> time in ext4_read_block_bitmap (introduced by
>>> c806e68f5647109350ec546fee5b526962970fd2 )
>> Can you add details about the failure mode(s) of this race, so people
>> (i.e. me) have an idea which bugs they've seen that it might address?
>>
> 
> ext4_read_block_bitmap does
> 
> 	spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
> 	if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> 		ext4_init_block_bitmap(sb, bh, block_group, desc);
> 
> The above ext4_init_block_bitmap call actually zeroes out the block bitmap.
> 
> Now on the block allocation side we do
> 
> 	mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
> 				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
> 
> 	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
> 	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
> 		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
> 
> i.e. on allocation we update the bitmap, and only then take the
> sb_bgl_lock and clear the EXT4_BG_BLOCK_UNINIT flag. A parallel
> ext4_read_block_bitmap can therefore zero out the bitmap in between
> the above mb_set_bits and spin_lock(sb_bgl_lock..)
> 
> Result of this race is
> a) blocks getting allocated multiple times
> b) File corruption because two files have same blocks allocated
> c) mb_free_blocks called multiple times on the same block

Thanks -

And do any of these cases lead to BUG(), WARNING(), ext3_error(), etc
messages that people may one day google for?

-Eric
Eric Sandeen Nov. 21, 2008, 5:40 p.m. UTC | #5
Aneesh Kumar K.V wrote:
> On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
>> On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
>>> Aneesh Kumar K.V wrote:
>>>> We need to make sure we update the block bitmap and clear
>>>> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
>>>> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
>>>> time in ext4_read_block_bitmap (introduced by
>>>> c806e68f5647109350ec546fee5b526962970fd2 )
>>> Can you add details about the failure mode(s) of this race, so people
>>> (i.e. me) have an idea which bugs they've seen that it might address?
>>>
> 
> The errors I have seen are

Ah, there we go.  IMHO, putting a few of these errors into the commit
would be helpful.

Thanks,
-Eric

> a)
> 	if (free != pa->pa_free) {
> 		printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n",
> 			pa, (unsigned long) pa->pa_lstart,
> 			(unsigned long) pa->pa_pstart,
> 			(unsigned long) pa->pa_len);
> 
> b)
> 
> 	if (!mb_test_bit(block, EXT4_MB_BITMAP(e4b))) {
> 		ext4_fsblk_t blocknr;
> 		blocknr = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb);
> 		blocknr += block;
> 		blocknr +=
> 			le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block);
> 		ext4_unlock_group(sb, e4b->bd_group);
> 		ext4_error(sb, __func__, "double-free of inode"
> 
> For the inode bitmap I have seen:
> 
> [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71
> ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle
> [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/
> total 411
> drwxrwxrwx 3 689933 root    1024 Nov 18 05:59 .
> drwxrwxrwx 3   8391 root    1024 Nov 18 05:59 ..
> drwxrwxrwx 2 root   root    1024 Nov 18 05:33 d83
> -rw-rw-rw- 1 root   root       0 Nov 18 05:06 fb4
> -rw-rw-rw- 1 root   root 3350138 Nov 18 05:33 fb9
> ?--------- ? ?      ?          ?            ? l71
> lrwxrwxrwx 1 root   root     509 Nov 18 05:23 ld9 -> xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxx
> [root@llm19 tmp]# 
> 
> dmesg gives:
> 
> EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449
> 
> 
> Some other messages I got earlier, but I didn't capture the info fully:
> 
> a) "Deleting nonexistent file ..." warning in ext4_unlink
> 
> b) "Empty directory has too many links..." in ext4_rmdir
> 

Theodore Ts'o Nov. 23, 2008, 2 p.m. UTC | #6
On Fri, Nov 21, 2008 at 10:14:33PM +0530, Aneesh Kumar K.V wrote:
> We need to make sure we update the block bitmap and clear
> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> time in ext4_read_block_bitmap (introduced by
> c806e68f5647109350ec546fee5b526962970fd2 )

You are changing mb_clear_bits() and mb_set_bits() so they take
the spinlock over the entire operation, instead of over each
particular bit.  These functions are used in a largish number of
places, not just for updating the block bitmap, but also the mb buddy
bitmaps, etc.  So there may be a scalability impact here, although
taking the spinlock once instead of multiple times is probably a win.

My bigger concern is given that we are playing games like *this*:

		if ((cur & 31) == 0 && (len - cur) >= 32) {
			/* fast path: set whole word at once */
			addr = bm + (cur >> 3);
			*addr = 0xffffffff;
			cur += 32;
			continue;
		}

without taking a lock, I'm a little surprised we haven't been
seriously burned by other race conditions.  What's the point of
calling mb_set_bit_atomic() and passing in a spinlock if we are doing
this kind of check without the protection of the same spinlock?!?
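
For reference, the shape of the patched loop can be sketched in
userspace: the lock is taken once around the whole range, so both the
whole-word fast path and the per-bit slow path run under it. This is an
illustration only. The name set_bits_locked is hypothetical, a pthread
mutex stands in for the spinlock, and the bitmap is indexed as an array
of 32-bit words in host byte order, which differs from the byte-pointer
arithmetic in the kernel code quoted above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/*
 * Set bits [cur, cur + len) in bm.  The lock may be NULL when the caller
 * already holds it, mirroring the mb_set_bits(NULL, ...) call in the patch.
 */
static void set_bits_locked(pthread_mutex_t *lock, uint32_t *bm,
			    int cur, int len)
{
	int end = cur + len;

	assert(cur >= 0 && len >= 0);
	if (lock)
		pthread_mutex_lock(lock);
	while (cur < end) {
		if ((cur & 31) == 0 && (end - cur) >= 32) {
			/* fast path: set a whole aligned word at once */
			bm[cur >> 5] = 0xffffffffu;
			cur += 32;
			continue;
		}
		/* slow path: plain (non-atomic) single-bit update */
		bm[cur >> 5] |= (uint32_t)1 << (cur & 31);
		cur++;
	}
	if (lock)
		pthread_mutex_unlock(lock);
}
```

Because every store to the bitmap now happens inside the critical
section, the per-bit helper no longer needs to be atomic on its own,
provided every other writer of the same bitmap takes the same lock.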

Andreas, if you are using mb_clear_bits() and mb_set_bits() in
Lustre's mballoc.c with this in production, you may want to take a
look at this patch.

						- Ted
Theodore Ts'o Nov. 23, 2008, 7:02 p.m. UTC | #7
On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
> Same is true with inode bitmap also.
> 

I don't see how this patch could affect any races with the inode
bitmaps; the patch only affects mballoc.c, and only changes the
locking around bitmaps.

This is one that I think we may want to send to Linus soon as a 2.6.28
bugfix.

							- Ted
Aneesh Kumar K.V Nov. 24, 2008, 6:40 a.m. UTC | #8
On Sun, Nov 23, 2008 at 02:02:24PM -0500, Theodore Tso wrote:
> On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
> > Same is true with inode bitmap also.
> > 
> 
> I don't see how this patch could affect any races with the inode
> bitmaps; the patch only affects mballoc.c, and only changes the
> locking around bitmaps.

The intent was to explain why we need
[PATCH -V2 5/5] ext4: Fix the race between read_inode_bitmap
and ext4_new_inode


> 
> This is one that I think we may want to send to Linus soon as a 2.6.28
> bugfix.
> 

Yes.

-aneesh
Alex Zhuravlev Nov. 24, 2008, 7:14 a.m. UTC | #9
Theodore Tso wrote:
> My bigger concern is given that we are playing games like *this*:
> 
> 		if ((cur & 31) == 0 && (len - cur) >= 32) {
> 			/* fast path: set whole word at once */
> 			addr = bm + (cur >> 3);
> 			*addr = 0xffffffff;
> 			cur += 32;
> 			continue;
> 		}

this is to avoid expensive LOCK prefix in some cases.

> without taking a lock, I'm a little surprised we haven't been
> seriously burned by other race conditions.  What's the point of
> calling mb_set_bit_atomic() and passing in a spinlock if we are doing
> this kind of check without the protection of the same spinlock?!?

why would we need a lock for a whole word bitop ?

thanks, Alex

Patch

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c5dcdf0..1ed949c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1065,6 +1065,8 @@  static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
 	__u32 *addr;
 
 	len = cur + len;
+	if (lock)
+		spin_lock(lock);
 	while (cur < len) {
 		if ((cur & 31) == 0 && (len - cur) >= 32) {
 			/* fast path: clear whole word at once */
@@ -1073,9 +1075,11 @@  static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
 			cur += 32;
 			continue;
 		}
-		mb_clear_bit_atomic(lock, cur, bm);
+		mb_clear_bit(cur, bm);
 		cur++;
 	}
+	if (lock)
+		spin_unlock(lock);
 }
 
 static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
@@ -1083,6 +1087,8 @@  static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
 	__u32 *addr;
 
 	len = cur + len;
+	if (lock)
+		spin_lock(lock);
 	while (cur < len) {
 		if ((cur & 31) == 0 && (len - cur) >= 32) {
 			/* fast path: set whole word at once */
@@ -1091,9 +1097,11 @@  static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
 			cur += 32;
 			continue;
 		}
-		mb_set_bit_atomic(lock, cur, bm);
+		mb_set_bit(cur, bm);
 		cur++;
 	}
+	if (lock)
+		spin_unlock(lock);
 }
 
 static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
@@ -3004,10 +3012,9 @@  ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
 		}
 	}
 #endif
-	mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
-				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
-
 	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
+	mb_set_bits(NULL, bitmap_bh->b_data,
+				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
 	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
 		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
 		gdp->bg_free_blocks_count =