Message ID | 4AA92307.4010304@redhat.com |
---|---|
State | Superseded, archived |
On Thu, Sep 10, 2009 at 11:02:15AM -0500, Eric Sandeen wrote:
> Today, the ext4 allocator will happily allocate blocks past
> 232 for indirect-block files, which results in the block
> numbers getting truncated, and corruption ensues.

Everywhere where you say 232, you mean 2^32, right?

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 11:02:15AM -0500, Eric Sandeen wrote:
>> Today, the ext4 allocator will happily allocate blocks past
>> 232 for indirect-block files, which results in the block
>> numbers getting truncated, and corruption ensues.
>
> Everywhere where you say 232, you mean 2^32, right?

sorry, cut and paste error, yes.

(email client helpfully turned 2^32 into something prettier, but then
didn't copy it right on the resend)

-Eric

On Sep 10, 2009 11:02 -0500, Eric Sandeen wrote:
> This patch limits such allocations to < 232, and adds
> WARN_ONs (maybe should be BUG_ONs) if we do get blocks
> larger than that.

Given that this may corrupt the filesystem (e.g. block 2^32 turning into
block 0 and overwriting the superblock) I think a BUG_ON() is probably
more appropriate. This should only happen with software bugs, so it is
more appropriate than ext4_error() I think.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

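The "block 2^32 turning into block 0" case can be reproduced outside the kernel. The following is only a sketch: `store_block_32()` and `block_fits_32()` are hypothetical helpers, not functions from the patch, and they assume that a block-mapped file keeps each physical block number in a 32-bit slot, as the thread describes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): store a 64-bit physical block
 * number in a 32-bit slot, as an indirect block entry would hold it.
 */
uint32_t store_block_32(uint64_t phys_block)
{
	/* The cast keeps only the low 32 bits; the high bits are dropped. */
	return (uint32_t)phys_block;
}

/* Returns 1 if the block number survives the 32-bit round trip, else 0. */
int block_fits_32(uint64_t phys_block)
{
	return (uint64_t)store_block_32(phys_block) == phys_block;
}

/* Self-check for the two cases discussed in the thread. */
void truncation_selfcheck(void)
{
	assert(store_block_32(UINT64_C(1) << 32) == 0); /* 2^32 wraps to block 0 */
	assert(block_fits_32(UINT64_C(0xFFFFFFFF)));    /* last addressable block */
}
```

With these helpers, block 0x100000005 would be recorded as block 5, which is why a silent wrap is treated here as corruption rather than as a recoverable error.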
Andreas Dilger wrote:
> On Sep 10, 2009 11:02 -0500, Eric Sandeen wrote:
>> This patch limits such allocations to < 232, and adds
>> WARN_ONs (maybe should be BUG_ONs) if we do get blocks
>> larger than that.
>
> Given that this may corrupt the filesystem (e.g. block
> 2^32 turning into block 0 and overwriting the superblock)
> I think a BUG_ON() is probably more appropriate. This
> should only happen with software bugs, so it is more
> appropriate than ext4_error() I think.

Ok, fine by me. I can send an update.

Any suggestions on the naming issues? (what's the official name for a
"not-extent-based-file?")

I ran it a lot through a mkfs/mount/fsstress/unmount/fsck cycle, and all
seemed well. mkfs was without extents, so I was thinking we were in
good shape.

However, Ric just ran a massive fs_mark test on a 60T filesystem that he
created with "mke2fs" (no extents and no journal - accidentally) and we
got no corruption even without this patch.

I need to see if a filesystem w/o the extents feature (at all, vs. some
old-format files on an extents fs) never even tries to allocate past
2^32; I didn't think so, but now not so sure.

I probably need to do more testing ...

-Eric

On Thu, Sep 10, 2009 at 04:16:32PM -0500, Eric Sandeen wrote:
> Ok, fine by me. I can send an update.
>
> Any suggestions on the naming issues? (what's the official name for a
> "not-extent-based-file?")

What I normally use is "extent-mapped file" and "indirect block mapped
file", or "non-extent-mapped file".

- Ted

Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:16:32PM -0500, Eric Sandeen wrote:
>> Ok, fine by me. I can send an update.
>>
>> Any suggestions on the naming issues? (what's the official name for a
>> "not-extent-based-file?")
>
> What I normally use is "extent-mapped file" and "indirect block mapped
> file", or "non-extent-mapped file".

I'll see if that can fit in a macro name nicely :)

Thanks,
-Eric

> - Ted

On Sep 10, 2009 16:16 -0500, Eric Sandeen wrote:
> Any suggestions on the naming issues? (what's the official name for a
> "not-extent-based-file?")

I've always used "block mapped" (i.e. mapped block-by-block) vs.
"extent mapped".

> However, Ric just ran a massive fs_mark test on a 60T filesystem that he
> created with "mke2fs" (no extents and no journal - accidentally) and we
> got no corruption even without this patch.
>
> I need to see if a filesystem w/o the extents feature (at all, vs. some
> old-format files on an extents fs) never even tries to allocate past
> 2^32; I didn't think so, but now not so sure.

Well, it may depend a lot on which inodes are in use. That will set the
goal block, and may prevent any above-16TB allocations. Either you could
fill the bitmaps with 0xff (and zero the free blocks counters, to avoid
problems with mballoc), or actually fill the first 16TB of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

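The 16TB figure follows from the 2^32 block limit multiplied by the block size. A small sketch of that arithmetic, assuming a 4 KiB block size (the thread never states the block size, so that value is an assumption, and `blockfile_limit_bytes()` is a hypothetical helper):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: size in bytes of the region reachable with 32-bit
 * block numbers, i.e. 2^32 blocks of block_size bytes each.
 */
uint64_t blockfile_limit_bytes(uint32_t block_size)
{
	assert(block_size != 0);
	/* Widen to 64 bits before multiplying so the product cannot wrap. */
	return (UINT64_C(1) << 32) * block_size;
}
```

Under that assumption, `blockfile_limit_bytes(4096)` is 2^44 bytes, i.e. 16 TiB, so on a 60T filesystem roughly the first quarter of the device is reachable by block-mapped files. With 1 KiB blocks the same limit would fall at 4 TiB.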
Andreas Dilger wrote:
> On Sep 10, 2009 16:16 -0500, Eric Sandeen wrote:
>> Any suggestions on the naming issues? (what's the official name for a
>> "not-extent-based-file?")
>
> I've always used "block mapped" (i.e. mapped block-by-block) vs.
> "extent mapped".
>
>> However, Ric just ran a massive fs_mark test on a 60T filesystem that he
>> created with "mke2fs" (no extents and no journal - accidentally) and we
>> got no corruption even without this patch.
>>
>> I need to see if a filesystem w/o the extents feature (at all, vs. some
>> old-format files on an extents fs) never even tries to allocate past
>> 2^32; I didn't think so, but now not so sure.
>
> Well, it may depend a lot on which inodes are in use. That will set the
> goal block, and may prevent any above-16TB allocations. Either you could

yep, though I had many, many inodes in the high groups ...

Problem is I don't quite trust debugfs etc to get it right, so when I
see < 32 bits, I'm not sure if it's really there, or if the
reporting/debug tool wrapped it ;)

-Eric

On Sep 10, 2009 23:51 +0200, Andreas Dilger wrote:
> Well, it may depend a lot on which inodes are in use. That will set the
> goal block, and may prevent any above-16TB allocations. Either you could
> fill the bitmaps with 0xff (and zero the free blocks counters, to avoid
> problems with mballoc), or actually fill the first 16TB of the filesystem.

Or, just start creating top-level directories until you get one past 16TB
and use that for your test...

We have a patch for allowing a goal inode to be specified, and it might
make sense to add a mount option to allow setting the inode goal for
testing... Hey, look, I even posted that patch, I now recall:

http://osdir.com/ml/linux-ext4/2009-06/msg00233.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

On Thu, Sep 10, 2009 at 04:57:38PM -0500, Eric Sandeen wrote:
>
> Problem is I don't quite trust debugfs etc to get it right, so when I
> see < 32 bits, I'm not sure if it's really there, or if the
> reporting/debug tool wrapped it ;)

Debugfs from pu branch definitely does get it right --- I've checked.

- Ted

Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:57:38PM -0500, Eric Sandeen wrote:
>> Problem is I don't quite trust debugfs etc to get it right, so when I
>> see < 32 bits, I'm not sure if it's really there, or if the
>> reporting/debug tool wrapped it ;)
>
> Debugfs from pu branch definitely does get it right --- I've checked.
>
> - Ted

I couldn't get debugfs from pu to even load a large filesystem, odd...

# debugfs/debugfs ../bigfile
debugfs 1.41.9 (22-Aug-2009)
../bigfile: Filesystem too large to use legacy bitmaps while reading block bitmap

I haven't yet looked into this...

-Eric

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..1147994 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -386,6 +386,9 @@ struct ext4_mount_options {
 #endif
 };
 
+/* Max physical block we can address w/o extents */
+#define EXT4_MAX_BLOCK_FILE_PHYS	0xFFFFFFFF
+
 /*
  * Structure of an inode on the disk
  */
@@ -841,6 +844,7 @@ struct ext4_sb_info {
 	unsigned long s_gdb_count;	/* Number of group descriptor blocks */
 	unsigned long s_desc_per_block;	/* Number of group descriptors per block */
 	ext4_group_t s_groups_count;	/* Number of groups in the fs */
+	ext4_group_t s_blockfile_groups;/* Groups acceptable for non-extent files */
 	unsigned long s_overhead_last;  /* Last calculated overhead */
 	unsigned long s_blocks_last;    /* Last seen block count */
 	loff_t s_bitmap_maxbytes;	/* max bytes for bitmap files */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f9c642b..f716d49 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -551,15 +551,21 @@ static ext4_fsblk_t ext4_find_near(struct inode *inode, Indirect *ind)
  *
  *	Normally this function find the preferred place for block allocation,
  *	returns it.
+ *	Because this is only used for non-extent files, we limit the block nr
+ *	to 32 bits.
  */
 static ext4_fsblk_t ext4_find_goal(struct inode *inode, ext4_lblk_t block,
 				   Indirect *partial)
 {
+	ext4_fsblk_t goal;
+
 	/*
 	 * XXX need to get goal block from mballoc's data structures
 	 */
 
-	return ext4_find_near(inode, partial);
+	goal = ext4_find_near(inode, partial);
+	goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
+	return goal;
 }
 
 /**
@@ -640,6 +646,8 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
 		if (*err)
 			goto failed_out;
 
+		WARN_ON(current_block + count > EXT4_MAX_BLOCK_FILE_PHYS);
+
 		target -= count;
 		/* allocate blocks for indirect blocks */
 		while (index < indirect_blks && count) {
@@ -674,6 +682,7 @@ static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
 		ar.flags = EXT4_MB_HINT_DATA;
 
 		current_block = ext4_mb_new_blocks(handle, &ar, err);
+		WARN_ON(current_block + ar.len > EXT4_MAX_BLOCK_FILE_PHYS);
 
 		if (*err && (target == blks)) {
 			/*
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cd25846..b87854b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1943,6 +1943,10 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	sb = ac->ac_sb;
 	sbi = EXT4_SB(sb);
 	ngroups = ext4_get_groups_count(sb);
+	/* non-extent files are limited to low blocks/groups */
+	if (!(EXT4_I(ac->ac_inode)->i_flags & EXT4_EXTENTS_FL))
+		ngroups = sbi->s_blockfile_groups;
+
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
 	/* first, try the goal */
@@ -3382,6 +3386,11 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		    ac->ac_o_ex.fe_logical >= pa->pa_lstart + pa->pa_len)
 			continue;
 
+		/* non-extent files can't have physical blocks past 2^32 */
+		if (!(EXT4_I(ac->ac_inode)->i_flags & EXT4_EXTENTS_FL) &&
+			pa->pa_pstart + pa->pa_len > EXT4_MAX_BLOCK_FILE_PHYS)
+			continue;
+
 		/* found preallocated blocks, use them */
 		spin_lock(&pa->pa_lock);
 		if (pa->pa_deleted == 0 && pa->pa_free) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8f4f079..8dcdded 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2595,6 +2595,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 			goto failed_mount;
 	}
 	sbi->s_groups_count = blocks_count;
+	sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
+			(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
 	db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
 		   EXT4_DESC_PER_BLOCK(sb);
 	sbi->s_group_desc = kmalloc(db_count * sizeof(struct buffer_head *),
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 62b31c2..6bce3f8 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -810,12 +810,23 @@ inserted:
 				get_bh(new_bh);
 		} else {
 			/* We need to allocate a new block */
-			ext4_fsblk_t goal = ext4_group_first_block_no(sb,
+			ext4_fsblk_t goal, block;
+
+			goal = ext4_group_first_block_no(sb,
 						EXT4_I(inode)->i_block_group);
-			ext4_fsblk_t block = ext4_new_meta_blocks(handle, inode,
+
+			/* non-extent files can't have physical blocks past 2^32 */
+			if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+				goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
+
+			block = ext4_new_meta_blocks(handle, inode,
 						  goal, NULL, &error);
 			if (error)
 				goto cleanup;
+
+			if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+				WARN_ON(block > EXT4_MAX_BLOCK_FILE_PHYS);
+
 			ea_idebug(inode, "creating block %d", block);
 
 			new_bh = sb_getblk(sb, block);

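To make the s_blockfile_groups computation in the super.c hunk concrete, here is a userspace sketch of the same min/divide step. `blockfile_groups()` is a hypothetical stand-in for the `min_t()` expression, and the 4 KiB block size with 32768 blocks per group used in the worked numbers are assumptions, not values taken from the thread.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCK_FILE_PHYS 0xFFFFFFFFu /* mirrors EXT4_MAX_BLOCK_FILE_PHYS */

/*
 * Hypothetical stand-in for the min_t() expression in ext4_fill_super():
 * the number of block groups a block-mapped file may allocate from.
 */
uint32_t blockfile_groups(uint32_t groups_count, uint32_t blocks_per_group)
{
	uint32_t limit;

	assert(blocks_per_group != 0);
	/* Integer division rounds down, so a partial last group is excluded. */
	limit = MAX_BLOCK_FILE_PHYS / blocks_per_group;
	return groups_count < limit ? groups_count : limit;
}
```

Under those assumptions a 60 TiB filesystem has 491520 groups, and `blockfile_groups(491520, 32768)` yields 131071. Those groups cover blocks 0 through 4294934527, so the rounding leaves the final 32768 blocks below 2^32 unused by block-mapped files. A filesystem with fewer groups than the limit is unaffected, since the function then returns its own group count.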
Today, the ext4 allocator will happily allocate blocks past
2^32 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 2^32, and adds
WARN_ONs (maybe should be BUG_ONs) if we do get blocks
larger than that.

This should address RH Bug 519471, ext4 bitmap allocator
must limit blocks to < 2^32

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some WARN_ONs

No attempt has been made to limit inode locations to < 2^32,
so we may wind up with blocks far from their inodes. Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Perhaps an ext4-specific #define would be better than UINT_MAX?

The allocator being what it is, I may have missed some spots,
so I'd welcome review.

Thanks,
-Eric

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

V2: got modulo-happy in ext4_mb_regular_allocator, just limit ngroups
to no more than UINT_MAX.

V3: address some of Andreas' review points

But I think we need some better macro & sb info member names...
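The description talks about a goal of "the original, % UINT_MAX", while the posted diff masks the goal with `& EXT4_MAX_BLOCK_FILE_PHYS`. Both keep the goal within 32 bits but give different values. The sketch below compares them and also mirrors the preallocation range test. All three function names are hypothetical, and UINT_MAX is assumed to be 0xFFFFFFFF (a 32-bit unsigned int).

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCK_FILE_PHYS UINT64_C(0xFFFFFFFF) /* mirrors EXT4_MAX_BLOCK_FILE_PHYS */

/* Goal clamping as the posted diff does it: keep the low 32 bits. */
uint64_t goal_mask(uint64_t goal)
{
	uint64_t clamped = goal & MAX_BLOCK_FILE_PHYS;

	assert(clamped <= MAX_BLOCK_FILE_PHYS);
	return clamped;
}

/* Goal clamping as the description words it: remainder modulo UINT_MAX. */
uint64_t goal_mod(uint64_t goal)
{
	return goal % MAX_BLOCK_FILE_PHYS;
}

/*
 * Inverse of the test added to ext4_mb_use_preallocated(): returns 1 when
 * a preallocation starting at pa_pstart with pa_len blocks may be used by
 * a block-mapped file, 0 when the patch would skip it.
 */
int pa_usable_for_blockfile(uint64_t pa_pstart, uint32_t pa_len)
{
	return !(pa_pstart + pa_len > MAX_BLOCK_FILE_PHYS);
}
```

For a goal of 0x100000000 the mask gives 0 and the modulo gives 1; for 0xFFFFFFFF the mask leaves the goal unchanged and the modulo gives 0. The range test compares the first block past the preallocation against the limit, so a preallocation whose last block is exactly 0xFFFFFFFF is also skipped, which errs on the safe side.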