Patchwork [RFC] resize2fs and uninit_bg questions

login
register
mail settings
Submitter Will Drewry
Date Sept. 23, 2009, 7:28 p.m.
Message ID <2359eed20909231228m2050cf65pa9029f931f655b10@mail.gmail.com>
Download mbox | patch
Permalink /patch/34188/
State New
Headers show

Comments

Will Drewry - Sept. 23, 2009, 7:28 p.m.
On Wed, Sep 23, 2009 at 4:34 AM, Andreas Dilger <adilger@sun.com> wrote:
> On Sep 19, 2009  07:19 -0500, Will Drewry wrote:
>> @@ -482,6 +482,8 @@ retry:
>>
>>       csum_flag = EXT2_HAS_RO_COMPAT_FEATURE(fs->super,
>>                                              EXT4_FEATURE_RO_COMPAT_GDT_CSUM);
>> +     lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
>> +                                            EXT2_FEATURE_COMPAT_LAZY_BG);
>>       adj = old_fs->group_desc_count;
>>       max_group = fs->group_desc_count - adj;
>>       if (fs->super->s_feature_incompat & EXT2_FEATURE_INCOMPAT_META_BG)
>> @@ -496,9 +498,12 @@ retry:
>>               adjblocks = 0;
>>
>>               fs->group_desc[i].bg_flags = 0;
>> -             if (csum_flag)
>> -                     fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT |
>> -                             EXT2_BG_INODE_ZEROED;
>> +             if (csum_flag) {
>> +                     fs->group_desc[i].bg_flags |= EXT2_BG_INODE_UNINIT;
>> +                     if (!lazy_flag) {
>> +                             fs->group_desc[i].bg_flags |= EXT2_BG_INODE_ZEROED;
>> +                     }
>> +             }
>
> This code could be cleaned up a bit by assigning "EXT2_BG_INODE_ZEROED" to
> lazy_flag so that you don't need to check this each time:
>
>        lazy_flag = EXT2_HAS_COMPAT_FEATURE(fs->super,
>                                            EXT2_FEATURE_COMPAT_LAZY_BG) ?
>                                EXT2_BG_INODE_ZEROED : 0;
>
>                if (csum_flag)
>                        fs->group_desc[i].bg_flags |=
>                                EXT2_BG_INODE_UNINIT | lazy_flag;
>
> Still waiting to head from Ted on the use of COMPAT_LAZY_BG.

Ah nice - I can definitely do that.  I also think I mangled the spaces
v tabs on the ext4.h part of the patch.  If COMPAT_LAZY_BG is off limits,
I can easily use some other flag/indicator.

That aside, I've also got a barebones kernel patch which supports lazy
online resizing which accompanies the e2fsprogs patch above. I realize
it is probably less-than-practical until there is an initializing
thread, but I'd appreciate any feedback if possible -- even if just to
ensure I'm understanding things correctly.

Thanks!
will


Signed-off-by: Will Drewry <redpig@dataspill.org>
---
fs/ext4/ext4.h   |    1 +
fs/ext4/resize.c |   38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 37 insertions(+), 2 deletions(-)

 	/*
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Andreas Dilger - Sept. 25, 2009, 6:21 a.m.
On Sep 23, 2009  14:28 -0500, Will Drewry wrote:
> That aside, I've also got a barebones kernel patch which supports lazy
> online resizing which accompanies the e2fsprogs patch above. I realize
> it is probably less-than-practical until there is an initializing
> thread, but I'd appreciate any feedback if possible -- even if just to
> ensure I'm understanding things correctly.

This is looking very good.  For prototyping the initializing thread, I
would suggest to create a "per-group initialization and check" function
(maybe just stealing the inside of the ext4_check_descriptors() loop as
a starting point) that is called inline for each group at mount time.
Add in the check for a missing INODE_ZEROED and do the zeroing there.

While that will initially just push the mke2fs time into the kernel, it
would likely be somewhat faster than running it from mke2fs because it
will not need any userspace memory, nor will there be any user->kernel
data copying going on.

Once you have that working, you can work on adding this to a kernel
thread that starts at mount time and stops when it has checked all
of the groups, or if the filesystem is being unmounted.  It probably
makes sense to only have one of these threads for the entire system
(instead of one per filesystem), so that they don't start crazy seeking
IO when there are multiple new partitions on a single disk.  That means
some kind of work queue with a list of superblocks and their groups to
check (so that when online resizing only the new groups are checked).


One recent thought I had was that we might want to distinguish between
filesystems that want lazy itable zeroing (i.e. real production filesystems)
and ones that don't want itable zeroing at all (i.e. filesystems for
testing or ones created on sparse loop devices that will always return
0s for unwritten blocks).  The latter is very useful for keeping image
size small, for VM images, etc.

IMHO for filesystems that want itable zeroing could just be a mke2fs
option, and the kernel detects this from groups that do not have
INODE_ZEROED set.  Filesystems that don't want any itable zeroing
should set COMPAT_LAZY_BG to tell the kernel not to zero the itable
(though the flag would be misnamed in that case).

Maybe it makes sense to keep COMPAT_LAZY_BG for filesystems that haven't
had their itables zeroed, but want it, and add a new COMPAT_NO_ZERO flag
that indicates the kernel shouldn't zero itables at all?   Ted?

> Signed-off-by: Will Drewry <redpig@dataspill.org>
> diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
> index 68b0351..faf9263 100644
> --- a/fs/ext4/resize.c
> +++ b/fs/ext4/resize.c
> @@ -261,13 +261,31 @@ static int setup_new_group_blocks(struct super_block *sb,
>  		   input->inode_bitmap - start);
>  	ext4_set_bit(input->inode_bitmap - start, bh->b_data);
> 
> -	/* Zero out all of the inode table blocks */
> +	/* Zero out all of the inode table blocks except if the fs supports lazy
> +	 * itables. */
> +	lazy = EXT4_HAS_RO_COMPAT_FEATURE(sb,
> +	         EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&

(style) This continuation should be aligned with the '(' on the previous line.

> +	       EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG);
> +	if (lazy) {
> +		ext4_debug("lazy_bg for inode blocks "
> +			   "%#04llx (+%d) - %#04llx (+%d) ",
> +			   input->inode_table, input->inode_table - start,
> +			  input->inode_table + sbi->s_itb_per_group - 1,

(style) the indenting doesn't match (first 2 lines are correct), last one not

> +			  (input->inode_table - start) + sbi->s_itb_per_group - 1);

(style) wrap at 80 columns, or in this case, remove a few spaces to make
line 80 columns wide.

(suggestion) It might make sense to print this debug message with the
inode table geometry regardless of whether it is lazy or not, and just
indicate in the message whether it will be zeroed or not.

>  	for (i = 0, block = input->inode_table, bit = block - start;
>  	     i < sbi->s_itb_per_group; i++, bit++, block++) {
>  		struct buffer_head *it;
> 
>  		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);
> 
> +		/* Even though we don't initialize the inode table, we need
> +		 * to mark the blocks used by it for later init. */
> +		if (lazy) {
> +			ext4_set_bit(bit, bh->b_data);
> +			continue;
> +		}

(suggestion) I prefer not to do the same operation in two different parts
of the loop, as this does, since if there are other changes to the code
later the "if (lazy)" clause will get increasing code duplication.

 Better would be a conditional here that only did the loop internals if
not lazy, like:


 	for (i = 0, block = input->inode_table, bit = block - start;
 	     i < sbi->s_itb_per_group; i++, bit++, block++) {
 		struct buffer_head *it;

 		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);

		/* If the filesystem is not using lazy initialization then we
		 * need zero the inode table blocks to avoid e2fsck issues. */
		if (!lazy) {
			if ((err = extend_or_restart_transaction(handle, 1,bh)))
				goto exit_bh;

			if (IS_ERR(it = bclean(handle, sb, block))) {
				err = PTR_ERR(it);
				goto exit_bh;
			}
			ext4_handle_dirty_metadata(handle, NULL, it);
			brelse(it);
		}
		ext4_set_bit(bit, bh->b_data);
	}

It could also be an "if (lazy) goto set_bit;", but given the small size
of the conditional I think the above is still pretty readable and not
too much indented.

The rest of it looks good.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9714db3..24d2880 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1043,6 +1043,7 @@  static inline int ext4_valid_inum(struct
super_block *sb, unsigned long ino)
 #define EXT4_FEATURE_COMPAT_EXT_ATTR		0x0008
 #define EXT4_FEATURE_COMPAT_RESIZE_INODE	0x0010
 #define EXT4_FEATURE_COMPAT_DIR_INDEX		0x0020
+#define EXT4_FEATURE_COMPAT_LAZY_BG		0x0040

 #define EXT4_FEATURE_RO_COMPAT_SPARSE_SUPER	0x0001
 #define EXT4_FEATURE_RO_COMPAT_LARGE_FILE	0x0002
diff --git a/fs/ext4/resize.c b/fs/ext4/resize.c
index 68b0351..faf9263 100644
--- a/fs/ext4/resize.c
+++ b/fs/ext4/resize.c
@@ -183,7 +183,7 @@  static int setup_new_group_blocks(struct super_block *sb,
 	handle_t *handle;
 	ext4_fsblk_t block;
 	ext4_grpblk_t bit;
-	int i;
+	int i, lazy = 0;
 	int err = 0, err2;

 	/* This transaction may be extended/restarted along the way */
@@ -261,13 +261,31 @@  static int setup_new_group_blocks(struct super_block *sb,
 		   input->inode_bitmap - start);
 	ext4_set_bit(input->inode_bitmap - start, bh->b_data);

-	/* Zero out all of the inode table blocks */
+	/* Zero out all of the inode table blocks except if the fs supports lazy
+	 * itables. */
+	lazy = EXT4_HAS_RO_COMPAT_FEATURE(sb,
+	         EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+	       EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG);
+	if (lazy) {
+		ext4_debug("lazy_bg for inode blocks "
+			   "%#04llx (+%d) - %#04llx (+%d) ",
+			   input->inode_table, input->inode_table - start,
+			  input->inode_table + sbi->s_itb_per_group - 1,
+			  (input->inode_table - start) + sbi->s_itb_per_group - 1);
+	}
+
 	for (i = 0, block = input->inode_table, bit = block - start;
 	     i < sbi->s_itb_per_group; i++, bit++, block++) {
 		struct buffer_head *it;

 		ext4_debug("clear inode block %#04llx (+%d)\n", block, bit);

+		/* Even though we don't initialize the inode table, we need
+		 * to mark the blocks used by it for later init. */
+		if (lazy) {
+			ext4_set_bit(bit, bh->b_data);
+			continue;
+		}
 		if ((err = extend_or_restart_transaction(handle, 1, bh)))
 			goto exit_bh;

@@ -286,6 +304,11 @@  static int setup_new_group_blocks(struct super_block *sb,
 	mark_bitmap_end(input->blocks_count, sb->s_blocksize * 8, bh->b_data);
 	ext4_handle_dirty_metadata(handle, NULL, bh);
 	brelse(bh);
+	/* If lazy, we're done since we are marked INODE_UNINIT and that
+	 * includes the inode bitmap. (ext4_init_inode_bitmap will do
+	 * this for us later). */
+	if (lazy)
+		goto exit_journal;
 	/* Mark unused entries in inode bitmap used */
 	ext4_debug("clear inode bitmap %#04llx (+%llu)\n",
 		   input->inode_bitmap, input->inode_bitmap - start);
@@ -868,6 +891,17 @@  int ext4_group_add(struct super_block *sb, struct
ext4_new_group_data *input)
 	ext4_free_blks_set(sb, gdp, input->free_blocks_count);
 	ext4_free_inodes_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
 	gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_ZEROED);
+	/* If we can do lazy initialization, we do. */
+	if (EXT4_HAS_RO_COMPAT_FEATURE(sb, EXT4_FEATURE_RO_COMPAT_GDT_CSUM) &&
+	    EXT4_HAS_COMPAT_FEATURE(sb, EXT4_FEATURE_COMPAT_LAZY_BG)) {
+		/* Only use EXT4_BG_INODE_UNINIT and not BLOCK_UNINIT
+		 * because we purposefully initialize the block bitmaps
+		 * to avoid managing super block backup decisions, making
+		 * sure the last block is init'd, etc.
+		 */
+		gdp->bg_flags = cpu_to_le16(EXT4_BG_INODE_UNINIT);
+		ext4_itable_unused_set(sb, gdp, EXT4_INODES_PER_GROUP(sb));
+	}
 	gdp->bg_checksum = ext4_group_desc_csum(sbi, input->group, gdp);