
[RFC,V3] ext4: limit block allocations for indirect-block files to < 2^32

Message ID 4AA92307.4010304@redhat.com
State Superseded, archived

Commit Message

Eric Sandeen Sept. 10, 2009, 4:02 p.m. UTC
Today, the ext4 allocator will happily allocate blocks past
232 for indirect-block files, which results in the block
numbers getting truncated, and corruption ensues.

This patch limits such allocations to < 232, and adds
WARN_ONs (maybe should be BUG_ONs) if we do get blocks
larger than that.

This should address RH Bug 519471, ext4 bitmap allocator 
must limit blocks to < 232

* ext4_find_goal() is modified to choose a goal < UINT_MAX,
  so that our starting point is in an acceptable range.

* ext4_xattr_block_set() is modified such that the goal block
  is < UINT_MAX, as above.

* ext4_mb_regular_allocator() is modified so that the group
  search does not continue into groups which are too high

* ext4_mb_use_preallocated() has a check that we don't use
  preallocated space which is too far out

* ext4_alloc_blocks() and ext4_xattr_block_set() add some WARN_ONs

No attempt has been made to limit inode locations to < 232,
so we may wind up with blocks far from their inodes.  Doing
this much already will lead to some odd ENOSPC issues when the
"lower 32" gets full, and further restricting inodes could
make that even weirder.

For high inodes, choosing a goal of the original, % UINT_MAX,
may be a bit odd, but then we're in an odd situation anyway,
and I don't know of a better heuristic.

Perhaps an ext4-specific #define would be better than UINT_MAX?

The allocator being what it is, I may have missed some spots,
so I'd welcome review.

Thanks,
-Eric

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---

V2: got modulo-happy in ext4_mb_regular_allocator, just limit
ngroups to no more than UINT_MAX.

V3: address some of Andreas' review points
But I think we need some better macro & sb info member names...



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Sept. 10, 2009, 4:53 p.m. UTC | #1
On Thu, Sep 10, 2009 at 11:02:15AM -0500, Eric Sandeen wrote:
> Today, the ext4 allocator will happily allocate blocks past
> 232 for indirect-block files, which results in the block
> numbers getting truncated, and corruption ensues.

Everywhere where you say 232, you mean 2^32, right?

						- Ted
Eric Sandeen Sept. 10, 2009, 4:56 p.m. UTC | #2
Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 11:02:15AM -0500, Eric Sandeen wrote:
>> Today, the ext4 allocator will happily allocate blocks past
>> 232 for indirect-block files, which results in the block
>> numbers getting truncated, and corruption ensues.
> 
> Everywhere where you say 232, you mean 2^32, right?

sorry, cut and paste error, yes.

(email client helpfully turned 2^32 into something prettier, but then
didn't copy it right on the resend)

-Eric
Andreas Dilger Sept. 10, 2009, 9:10 p.m. UTC | #3
On Sep 10, 2009  11:02 -0500, Eric Sandeen wrote:
> This patch limits such allocations to < 232, and adds
> WARN_ONs (maybe should be BUG_ONs) if we do get blocks
> larger than that.

Given that this may corrupt the filesystem (e.g. block
2^32 turning into block 0 and overwriting the superblock)
I think a BUG_ON() is probably more appropriate.  This
should only happen with software bugs, so it is more
appropriate than ext4_error() I think.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Eric Sandeen Sept. 10, 2009, 9:16 p.m. UTC | #4
Andreas Dilger wrote:
> On Sep 10, 2009  11:02 -0500, Eric Sandeen wrote:
>> This patch limits such allocations to < 232, and adds
>> WARN_ONs (maybe should be BUG_ONs) if we do get blocks
>> larger than that.
> 
> Given that this may corrupt the filesystem (e.g. block
> 2^32 turning into block 0 and overwriting the superblock)
> I think a BUG_ON() is probably more appropriate.  This
> should only happen with software bugs, so it is more
> appropriate than ext4_error() I think.

Ok, fine by me.  I can send an update.

Any suggestions on the naming issues?  (what's the official name for a
"not-extent-based-file?")

I ran it a lot through a mkfs/mount/fsstress/unmount/fsck cycle, and all
seemed well.  mkfs was without extents, so I was thinking we were in
good shape.

However, Ric just ran a massive fs_mark test on a 60T filesystem that he
created with "mke2fs" (no extents and no journal - accidentally) and we
got no corruption even without this patch.

I need to see if a filesystem w/o the extents feature (at all, vs. some
old-format files on an extents fs) never even tries to allocate past
2^32; I didn't think so, but now not so sure.

I probably need to do more testing ...

-Eric
Theodore Ts'o Sept. 10, 2009, 9:33 p.m. UTC | #5
On Thu, Sep 10, 2009 at 04:16:32PM -0500, Eric Sandeen wrote:
> Ok, fine by me.  I can send an update.
> 
> Any suggestions on the naming issues?  (what's the official name for a
> "not-extent-based-file?")

What I normally use is "extent-mapped file" and "indirect block mapped
file", or "non-extent-mapped file".

						- Ted
Eric Sandeen Sept. 10, 2009, 9:42 p.m. UTC | #6
Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:16:32PM -0500, Eric Sandeen wrote:
>> Ok, fine by me.  I can send an update.
>>
>> Any suggestions on the naming issues?  (what's the official name for a
>> "not-extent-based-file?")
> 
> What I normally use is "extent-mapped file" and "indirect block mapped
> file", or "non-extent-mapped file".

I'll see if that can fit in a macro name nicely :)

Thanks,
-Eric

> 						- Ted

Andreas Dilger Sept. 10, 2009, 9:51 p.m. UTC | #7
On Sep 10, 2009  16:16 -0500, Eric Sandeen wrote:
> Any suggestions on the naming issues?  (what's the official name for a
> "not-extent-based-file?")

I've always used "block mapped" (i.e. mapped block-by-block) vs.
"extent mapped".

> However, Ric just ran a massive fs_mark test on a 60T filesystem that he
> created with "mke2fs" (no extents and no journal - accidentally) and we
> got no corruption even without this patch.
> 
> I need to see if a filesystem w/o the extents feature (at all, vs. some
> old-format files on an extents fs) never even tries to allocate past
> 2^32; I didn't think so, but now not so sure.

Well, it may depend a lot on which inodes are in use.  That will set the
goal block, and may prevent any above-16TB allocations.  Either you could
fill the bitmaps with 0xff (and zero the free blocks counters, to avoid
problems with mballoc), or actually fill the first 16TB of the filesystem.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Eric Sandeen Sept. 10, 2009, 9:57 p.m. UTC | #8
Andreas Dilger wrote:
> On Sep 10, 2009  16:16 -0500, Eric Sandeen wrote:
>> Any suggestions on the naming issues?  (what's the official name for a
>> "not-extent-based-file?")
> 
> I've always used "block mapped" (i.e. mapped block-by-block) vs.
> "extent mapped".
> 
>> However, Ric just ran a massive fs_mark test on a 60T filesystem that he
>> created with "mke2fs" (no extents and no journal - accidentally) and we
>> got no corruption even without this patch.
>>
>> I need to see if a filesystem w/o the extents feature (at all, vs. some
>> old-format files on an extents fs) never even tries to allocate past
>> 2^32; I didn't think so, but now not so sure.
> 
> Well, it may depend a lot on which inodes are in use.  That will set the
> goal block, and may prevent any above-16TB allocations.  Either you could

yep, though I had many, many inodes in the high groups ...

Problem is I don't quite trust debugfs etc to get it right, so when I
see < 32 bits, I'm not sure if it's really there, or if the
reporting/debug tool wrapped it ;)

-Eric
Andreas Dilger Sept. 10, 2009, 10:01 p.m. UTC | #9
On Sep 10, 2009  23:51 +0200, Andreas Dilger wrote:
> Well, it may depend a lot on which inodes are in use.  That will set the
> goal block, and may prevent any above-16TB allocations.  Either you could
> fill the bitmaps with 0xff (and zero the free blocks counters, to avoid
> problems with mballoc), or actually fill the first 16TB of the filesystem.

Or, just start creating top-level directories until you get one past
16TB and use that for your test...  We have a patch for allowing a
goal inode to be specified, and it might make sense to add a mount
option to allow setting the inode goal for testing...

Hey, look, I even posted that patch, I now recall:
http://osdir.com/ml/linux-ext4/2009-06/msg00233.html

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Theodore Ts'o Sept. 10, 2009, 11:19 p.m. UTC | #10
On Thu, Sep 10, 2009 at 04:57:38PM -0500, Eric Sandeen wrote:
> 
> Problem is I don't quite trust debugfs etc to get it right, so when I
> see < 32 bits, I'm not sure if it's really there, or if the
> reporting/debug tool wrapped it ;)

Debugfs from pu branch definitely does get it right --- I've checked.

						- Ted
Eric Sandeen Sept. 11, 2009, 2:15 p.m. UTC | #11
Theodore Tso wrote:
> On Thu, Sep 10, 2009 at 04:57:38PM -0500, Eric Sandeen wrote:
>> Problem is I don't quite trust debugfs etc to get it right, so when I
>> see < 32 bits, I'm not sure if it's really there, or if the
>> reporting/debug tool wrapped it ;)
> 
> Debugfs from pu branch definitely does get it right --- I've checked.
> 
> 	     	       		       	   - Ted

I couldn't get debugfs from pu to even load a large filesystem, odd...

# debugfs/debugfs ../bigfile
debugfs 1.41.9 (22-Aug-2009)
../bigfile: Filesystem too large to use legacy bitmaps while reading 
block bitmap

I haven't yet looked into this...

-Eric

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..1147994 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -386,6 +386,9 @@  struct ext4_mount_options {
 #endif
 };
 
+/* Max physical block we can address w/o extents */
+#define EXT4_MAX_BLOCK_FILE_PHYS	0xFFFFFFFF
+
 /*
  * Structure of an inode on the disk
  */
@@ -841,6 +844,7 @@  struct ext4_sb_info {
 	unsigned long s_gdb_count;	/* Number of group descriptor blocks */
 	unsigned long s_desc_per_block;	/* Number of group descriptors per block */
 	ext4_group_t s_groups_count;	/* Number of groups in the fs */
+	ext4_group_t s_blockfile_groups;/* Groups acceptable for non-extent files */
 	unsigned long s_overhead_last;  /* Last calculated overhead */
 	unsigned long s_blocks_last;    /* Last seen block count */
 	loff_t s_bitmap_maxbytes;	/* max bytes for bitmap files */
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f9c642b..f716d49 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -551,15 +551,21 @@  static ext4_fsblk_t ext4_find_near(struct inode *inode, Indirect *ind)
  *
  *	Normally this function find the preferred place for block allocation,
  *	returns it.
+ *	Because this is only used for non-extent files, we limit the block nr
+ *	to 32 bits.
  */
 static ext4_fsblk_t ext4_find_goal(struct inode *inode, ext4_lblk_t block,
 				   Indirect *partial)
 {
+	ext4_fsblk_t goal;
+
 	/*
 	 * XXX need to get goal block from mballoc's data structures
 	 */
 
-	return ext4_find_near(inode, partial);
+	goal = ext4_find_near(inode, partial);
+	goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
+	return goal;
 }
 
 /**
@@ -640,6 +646,8 @@  static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
 		if (*err)
 			goto failed_out;
 
+		WARN_ON(current_block + count > EXT4_MAX_BLOCK_FILE_PHYS);
+
 		target -= count;
 		/* allocate blocks for indirect blocks */
 		while (index < indirect_blks && count) {
@@ -674,6 +682,7 @@  static int ext4_alloc_blocks(handle_t *handle, struct inode *inode,
 		ar.flags = EXT4_MB_HINT_DATA;
 
 	current_block = ext4_mb_new_blocks(handle, &ar, err);
+	WARN_ON(current_block + ar.len > EXT4_MAX_BLOCK_FILE_PHYS);
 
 	if (*err && (target == blks)) {
 		/*
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index cd25846..b87854b 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -1943,6 +1943,10 @@  ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	sb = ac->ac_sb;
 	sbi = EXT4_SB(sb);
 	ngroups = ext4_get_groups_count(sb);
+	/* non-extent files are limited to low blocks/groups */
+	if (!(EXT4_I(ac->ac_inode)->i_flags & EXT4_EXTENTS_FL))
+		ngroups = sbi->s_blockfile_groups;
+
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
 	/* first, try the goal */
@@ -3382,6 +3386,11 @@  ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 			ac->ac_o_ex.fe_logical >= pa->pa_lstart + pa->pa_len)
 			continue;
 
+		/* non-extent files can't have physical blocks past 2^32 */
+		if (!(EXT4_I(ac->ac_inode)->i_flags & EXT4_EXTENTS_FL) &&
+			pa->pa_pstart + pa->pa_len > EXT4_MAX_BLOCK_FILE_PHYS)
+			continue;
+
 		/* found preallocated blocks, use them */
 		spin_lock(&pa->pa_lock);
 		if (pa->pa_deleted == 0 && pa->pa_free) {
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 8f4f079..8dcdded 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2595,6 +2595,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 		goto failed_mount;
 	}
 	sbi->s_groups_count = blocks_count;
+	sbi->s_blockfile_groups = min_t(ext4_group_t, sbi->s_groups_count,
+			(EXT4_MAX_BLOCK_FILE_PHYS / EXT4_BLOCKS_PER_GROUP(sb)));
 	db_count = (sbi->s_groups_count + EXT4_DESC_PER_BLOCK(sb) - 1) /
 		   EXT4_DESC_PER_BLOCK(sb);
 	sbi->s_group_desc = kmalloc(db_count * sizeof(struct buffer_head *),
diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index 62b31c2..6bce3f8 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -810,12 +810,23 @@  inserted:
 			get_bh(new_bh);
 		} else {
 			/* We need to allocate a new block */
-			ext4_fsblk_t goal = ext4_group_first_block_no(sb,
+			ext4_fsblk_t goal, block;
+
+			goal = ext4_group_first_block_no(sb,
 						EXT4_I(inode)->i_block_group);
-			ext4_fsblk_t block = ext4_new_meta_blocks(handle, inode,
+
+			/* non-extent files can't have physical blocks past 2^32 */
+			if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+				goal = goal & EXT4_MAX_BLOCK_FILE_PHYS;
+
+			block = ext4_new_meta_blocks(handle, inode,
 						  goal, NULL, &error);
 			if (error)
 				goto cleanup;
+
+			if (!(EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL))
+				WARN_ON(block > EXT4_MAX_BLOCK_FILE_PHYS);
+
 			ea_idebug(inode, "creating block %d", block);
 
 			new_bh = sb_getblk(sb, block);