Message ID | 1227285875-18011-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com |
---|---|
State | Superseded, archived |
Aneesh Kumar K.V wrote:
> We need to make sure we update the block bitmap and clear
> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> time in ext4_read_block_bitmap (introduced by
> c806e68f5647109350ec546fee5b526962970fd2 )

Can you add details about the failure mode(s) of this race, so people
(i.e. me) have an idea which bugs they've seen that it might address?

Thanks,
-Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
> Aneesh Kumar K.V wrote:
> > We need to make sure we update the block bitmap and clear
> > EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> > at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> > time in ext4_read_block_bitmap (introduced by
> > c806e68f5647109350ec546fee5b526962970fd2 )
>
> Can you add details about the failure mode(s) of this race, so people
> (i.e. me) have an idea which bugs they've seen that it might address?
>

ext4_read_block_bitmap does

    spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
    if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        ext4_init_block_bitmap(sb, bh, block_group, desc);

The above ext4_init_block_bitmap actually zeroes out the block bitmap.

Now on the block allocation side we do

    mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
                ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);

    spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
    if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
        gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);

i.e. on allocation we update the bitmap, then we take the sb_bgl_lock
and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is that a
parallel ext4_read_block_bitmap zeroes out the bitmap in between
the above mb_set_bits and spin_lock(sb_bgl_lock..).

The result of this race is
a) blocks getting allocated multiple times
b) file corruption because two files have the same blocks allocated
c) mb_free_blocks called multiple times on the same block
....

The same is true for the inode bitmap.

-aneesh

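The interleaving described in this message can be replayed step by step in
user space. What follows is only a sketch under stated assumptions, not kernel
code: toy_group, toy_spin_lock, reader_reinit, racy_alloc_set_bits,
racy_alloc_clear_flag and fixed_alloc are hypothetical names, a C11 atomic_flag
stands in for sb_bgl_lock, and a memset to zero stands in for
ext4_init_block_bitmap as the message describes it.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TOY_GROUP_BLOCKS 64
#define TOY_BLOCK_UNINIT 0x1u   /* stand-in for EXT4_BG_BLOCK_UNINIT */

/* Hypothetical stand-in for one block group and its sb_bgl_lock. */
struct toy_group {
    atomic_flag lock;
    unsigned flags;
    uint8_t bitmap[TOY_GROUP_BLOCKS / 8];
};

static void toy_spin_lock(struct toy_group *g)
{
    while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire))
        ;   /* spin until the flag was previously clear */
}

static void toy_spin_unlock(struct toy_group *g)
{
    atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

static void toy_group_init(struct toy_group *g)
{
    atomic_flag_clear(&g->lock);
    g->flags = TOY_BLOCK_UNINIT;
    memset(g->bitmap, 0, sizeof(g->bitmap));
}

static void toy_set_bits(uint8_t *bm, int start, int len)
{
    assert(start >= 0 && len >= 0 && start + len <= TOY_GROUP_BLOCKS);
    for (int i = start; i < start + len; i++)
        bm[i >> 3] |= (uint8_t)(1u << (i & 7));
}

static int toy_test_bit(const uint8_t *bm, int bit)
{
    return (bm[bit >> 3] >> (bit & 7)) & 1;
}

/* Reader side: reinitialise the bitmap while the UNINIT flag is still set. */
static void reader_reinit(struct toy_group *g)
{
    toy_spin_lock(g);
    if (g->flags & TOY_BLOCK_UNINIT)
        memset(g->bitmap, 0, sizeof(g->bitmap));
    toy_spin_unlock(g);
}

/* Racy allocator, split in two so the bad interleaving can be replayed. */
static void racy_alloc_set_bits(struct toy_group *g, int start, int len)
{
    toy_set_bits(g->bitmap, start, len);   /* bitmap updated, lock not held */
}

static void racy_alloc_clear_flag(struct toy_group *g)
{
    toy_spin_lock(g);
    g->flags &= ~TOY_BLOCK_UNINIT;
    toy_spin_unlock(g);
}

/* Fixed allocator: bitmap update and flag clear under a single lock hold. */
static void fixed_alloc(struct toy_group *g, int start, int len)
{
    toy_spin_lock(g);
    toy_set_bits(g->bitmap, start, len);
    g->flags &= ~TOY_BLOCK_UNINIT;
    toy_spin_unlock(g);
}
```

Calling racy_alloc_set_bits, reader_reinit and racy_alloc_clear_flag in that
order leaves the flag cleared but the freshly set bits wiped, so the same
blocks look free to the next allocation. With fixed_alloc the reader either
runs before the allocation or finds the flag already cleared and leaves the
bitmap alone.
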
On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote: > On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote: > > Aneesh Kumar K.V wrote: > > > We need to make sure we update the block bitmap and clear > > > EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look > > > at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each > > > time in ext4_read_block_bitmap (introduced by > > > c806e68f5647109350ec546fee5b526962970fd2 ) > > > > Can you add details about the failure mode(s) of this race, so people > > (i.e. me) have an idea which bugs they've seen that it might address? > > > The errors I have seen are a) 3795 if (free != pa->pa_free) { 3796 printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n", 3797 pa, (unsigned long) pa->pa_lstart, 3798 (unsigned long) pa->pa_pstart, 3799 (unsigned long) pa->pa_len); b) 1091 if (!mb_test_bit(block, EXT4_MB_BITMAP(e4b))) { 1092 ext4_fsblk_t blocknr; 1093 blocknr = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb); 1094 blocknr += block; 1095 blocknr += 1096 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); 1097 ext4_unlock_group(sb, e4b->bd_group); 1098 ext4_error(sb, __func__, "double-free of inode" For inode bitmap i have seen [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71 ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/ total 411 drwxrwxrwx 3 689933 root 1024 Nov 18 05:59 . drwxrwxrwx 3 8391 root 1024 Nov 18 05:59 .. drwxrwxrwx 2 root root 1024 Nov 18 05:33 d83 -rw-rw-rw- 1 root root 0 Nov 18 05:06 fb4 -rw-rw-rw- 1 root root 3350138 Nov 18 05:33 fb9 ?--------- ? ? ? ? ? 
l71 lrwxrwxrwx 1 root root 509 Nov 18 05:23 ld9 -> xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxx [root@llm19 tmp]# dmesg gives: EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449 Some other message i got before. But i didn't capture the info fully a) "Deleting nonexistent file ..." warning in ext4_unlink b) "Empty directory has too many links..." in ext4_rmdir -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Aneesh Kumar K.V wrote:
> On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote:
>> Aneesh Kumar K.V wrote:
>>> We need to make sure we update the block bitmap and clear
>>> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
>>> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
>>> time in ext4_read_block_bitmap (introduced by
>>> c806e68f5647109350ec546fee5b526962970fd2 )
>> Can you add details about the failure mode(s) of this race, so people
>> (i.e. me) have an idea which bugs they've seen that it might address?
>>
>
> ext4_read_block_bitmap does
>
>     spin_lock(sb_bgl_lock(EXT4_SB(sb), block_group));
>     if (desc->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
>         ext4_init_block_bitmap(sb, bh, block_group, desc);
>
> the above ext4_init_block_bitmap actually zero out the block bitmap.
>
> Now on the block allocation side we do
>
>     mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
>                 ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
>
>     spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
>     if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
>         gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
>
> ie on allocation we update the bitmap then we take the sb_bgl_lock
> and clear the EXT4_BG_BLOCK_UNINIT flag. What can happen is a
> parallel ext4_read_block_bitmap can zero out the bitmap in between
> the above mb_set_bits and spin_lock(sb_bg_lock..)
>
> Result of this race is
> a) blocks getting allocated multiple times
> b) File corruption because two files have same blocks allocated
> c) mb_free_blocks called multiple times on the same block

Thanks -

And do any of these cases lead to BUG(), WARNING(), ext3_error(), etc
messages that people may one day google for?

-Eric

Aneesh Kumar K.V wrote: > On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote: >> On Fri, Nov 21, 2008 at 11:22:04AM -0600, Eric Sandeen wrote: >>> Aneesh Kumar K.V wrote: >>>> We need to make sure we update the block bitmap and clear >>>> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look >>>> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each >>>> time in ext4_read_block_bitmap (introduced by >>>> c806e68f5647109350ec546fee5b526962970fd2 ) >>> Can you add details about the failure mode(s) of this race, so people >>> (i.e. me) have an idea which bugs they've seen that it might address? >>> > > The errors I have seen are Ah, there we go. IMHO, putting a few of these errors into the commit would be helpful. Thanks, -Eric > a) > 3795 if (free != pa->pa_free) { > 3796 printk(KERN_CRIT "pa %p: logic %lu, phys. %lu, len %lu\n", > 3797 pa, (unsigned long) pa->pa_lstart, > 3798 (unsigned long) pa->pa_pstart, > 3799 (unsigned long) pa->pa_len); > > b) > > 1091 if (!mb_test_bit(block, EXT4_MB_BITMAP(e4b))) { > 1092 ext4_fsblk_t blocknr; > 1093 blocknr = e4b->bd_group * EXT4_BLOCKS_PER_GROUP(sb); > 1094 blocknr += block; > 1095 blocknr += > 1096 le32_to_cpu(EXT4_SB(sb)->s_es->s_first_data_block); > 1097 ext4_unlock_group(sb, e4b->bd_group); > 1098 ext4_error(sb, __func__, "double-free of > inode" > > For inode bitmap i have seen > > [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71 > ls: /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/l71: Stale NFS file handle > [root@llm19 tmp]# ls -al /mnt/tmp/f/p369/d3/d6/d39/db2/dee/d10f/d3f/ > total 411 > drwxrwxrwx 3 689933 root 1024 Nov 18 05:59 . > drwxrwxrwx 3 8391 root 1024 Nov 18 05:59 .. > drwxrwxrwx 2 root root 1024 Nov 18 05:33 d83 > -rw-rw-rw- 1 root root 0 Nov 18 05:06 fb4 > -rw-rw-rw- 1 root root 3350138 Nov 18 05:33 fb9 > ?--------- ? ? ? ? ? 
l71 > lrwxrwxrwx 1 root root 509 Nov 18 05:23 ld9 -> xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxxx/xxxxxxxx > [root@llm19 tmp]# > > dmesg gives: > > EXT4-fs error (device sdb1): ext4_free_inode: bit already cleared for inode 168449 > > > Some other message i got before. But i didn't capture the info fully > > a) "Deleting nonexistent file ..." warning in ext4_unlink > > b) "Empty directory has too many links..." in ext4_rmdir > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Nov 21, 2008 at 10:14:33PM +0530, Aneesh Kumar K.V wrote:
> We need to make sure we update the block bitmap and clear
> EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
> at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
> time in ext4_read_block_bitmap (introduced by
> c806e68f5647109350ec546fee5b526962970fd2 )

You are changing mb_clear_bits() and mb_set_bits() so they take the
spinlock over the entire operation, instead of over each particular
bit. These functions are used in a largish number of places, not just
for updating the block bitmap, but also the mb buddy bitmaps, etc. So
there may be a scalability impact here, although taking the spinlock
once instead of multiple times is probably a win.

My bigger concern is given that we are playing games like *this*:

        if ((cur & 31) == 0 && (len - cur) >= 32) {
            /* fast path: set whole word at once */
            addr = bm + (cur >> 3);
            *addr = 0xffffffff;
            cur += 32;
            continue;
        }

without taking a lock, I'm a little surprised we haven't been
seriously burned by other race conditions. What's the point of
calling mb_set_bit_atomic() and passing in a spinlock if we are doing
this kind of check without the protection of the same spinlock?!?

Andreas, if you are using mb_clear_bits() and mb_set_bits() in
Lustre's mballoc.c with this in production, you may want to take a
look at this patch.

                                                - Ted

On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
> Same is true with inode bitmap also.
>

I don't see how this patch could affect any races with the inode
bitmaps; the patch only affects mballoc.c, and only changes the
locking around bitmaps.

This is one that I think we may want to send to Linus soon as a 2.6.28
bugfix.

                                                - Ted

On Sun, Nov 23, 2008 at 02:02:24PM -0500, Theodore Tso wrote:
> On Fri, Nov 21, 2008 at 11:01:35PM +0530, Aneesh Kumar K.V wrote:
> > Same is true with inode bitmap also.
> >
>
> I don't see how this patch could affect any races with the inode
> bitmaps; the patch only affects mballoc.c, and only changes the
> locking around bitmaps.

The intent was to explain why we need
[PATCH -V2 5/5] ext4: Fix the race between read_inode_bitmap and ext4_new_inode

>
> This is one that I think we may want to send to Linus soon as a 2.6.28
> bugfix.
>

Yes.

-aneesh

Theodore Tso wrote:
> My bigger concern is given that we are playing games like *this*:
>
>         if ((cur & 31) == 0 && (len - cur) >= 32) {
>             /* fast path: set whole word at once */
>             addr = bm + (cur >> 3);
>             *addr = 0xffffffff;
>             cur += 32;
>             continue;
>         }

This is to avoid the expensive LOCK prefix in some cases.

> without taking a lock, I'm a little surprised we haven't been
> seriously burned by other race conditions. What's the point of
> calling mb_set_bit_atomic() and passing in a spinlock if we are doing
> this kind of check without the protection of the same spinlock?!?

Why would we need a lock for a whole-word bitop?

thanks, Alex

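For readers following the exchange, the fast path under discussion can be
sketched in portable user-space C. This is an illustration under assumptions,
not the mballoc.c code: set_bits_slow, set_bits_fast and same_result are
hypothetical names, and a four-byte memset stands in for the
`*addr = 0xffffffff` store. The sketch only shows that, in a single thread, the
word-at-a-time path produces the same bitmap as the bit-at-a-time path. It
says nothing about whether the plain store is safe against concurrent
updates, which is the question being debated here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BITS 256    /* size of the toy bitmaps used for comparison */

/* Reference version: set bits [cur, cur + len) one at a time. */
static void set_bits_slow(uint8_t *bm, int cur, int len)
{
    for (int end = cur + len; cur < end; cur++)
        bm[cur >> 3] |= (uint8_t)(1u << (cur & 7));
}

/* Same result, but whole 32-bit-aligned runs are filled in one step. */
static void set_bits_fast(uint8_t *bm, int cur, int len)
{
    len = cur + len;    /* as in the quoted loop, len becomes the end index */
    while (cur < len) {
        if ((cur & 31) == 0 && (len - cur) >= 32) {
            /* fast path: set whole word at once (plain, non-atomic store) */
            memset(bm + (cur >> 3), 0xff, 4);
            cur += 32;
            continue;
        }
        bm[cur >> 3] |= (uint8_t)(1u << (cur & 7));
        cur++;
    }
}

/* Returns 1 when both versions produce identical bitmaps for the range. */
static int same_result(int cur, int len)
{
    uint8_t a[TOY_BITS / 8] = {0};
    uint8_t b[TOY_BITS / 8] = {0};

    assert(cur >= 0 && len >= 0 && cur + len <= TOY_BITS);
    set_bits_slow(a, cur, len);
    set_bits_fast(b, cur, len);
    return memcmp(a, b, sizeof(a)) == 0;
}
```

A range such as same_result(31, 34) exercises all three cases: one leading bit
set individually, one full word through the fast path, and one trailing bit.
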
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index c5dcdf0..1ed949c 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1065,6 +1065,8 @@ static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
 	__u32 *addr;
 
 	len = cur + len;
+	if (lock)
+		spin_lock(lock); \
 	while (cur < len) {
 		if ((cur & 31) == 0 && (len - cur) >= 32) {
 			/* fast path: clear whole word at once */
@@ -1073,9 +1075,11 @@ static void mb_clear_bits(spinlock_t *lock, void *bm, int cur, int len)
 			cur += 32;
 			continue;
 		}
-		mb_clear_bit_atomic(lock, cur, bm);
+		mb_clear_bit(cur, bm);
 		cur++;
 	}
+	if (lock)
+		spin_unlock(lock); \
 }
 
 static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
@@ -1083,6 +1087,8 @@ static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
 	__u32 *addr;
 
 	len = cur + len;
+	if (lock)
+		spin_lock(lock); \
 	while (cur < len) {
 		if ((cur & 31) == 0 && (len - cur) >= 32) {
 			/* fast path: set whole word at once */
@@ -1091,9 +1097,11 @@ static void mb_set_bits(spinlock_t *lock, void *bm, int cur, int len)
 			cur += 32;
 			continue;
 		}
-		mb_set_bit_atomic(lock, cur, bm);
+		mb_set_bit(cur, bm);
 		cur++;
 	}
+	if (lock)
+		spin_unlock(lock); \
 }
 
 static void mb_free_blocks(struct inode *inode, struct ext4_buddy *e4b,
@@ -3004,10 +3012,9 @@ ext4_mb_mark_diskspace_used(struct ext4_allocation_context *ac,
 		}
 	}
 #endif
-	mb_set_bits(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group), bitmap_bh->b_data,
-				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
-
 	spin_lock(sb_bgl_lock(sbi, ac->ac_b_ex.fe_group));
+	mb_set_bits(NULL, bitmap_bh->b_data,
+				ac->ac_b_ex.fe_start, ac->ac_b_ex.fe_len);
 	if (gdp->bg_flags & cpu_to_le16(EXT4_BG_BLOCK_UNINIT)) {
 		gdp->bg_flags &= cpu_to_le16(~EXT4_BG_BLOCK_UNINIT);
 		gdp->bg_free_blocks_count =

We need to make sure we update the block bitmap and clear
EXT4_BG_BLOCK_UNINIT flag with sb_bgl_lock held. We look
at EXT4_BG_BLOCK_UNINIT and reinit the block bitmap each
time in ext4_read_block_bitmap (introduced by
c806e68f5647109350ec546fee5b526962970fd2 )

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 fs/ext4/mballoc.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)