diff mbox

[3/3] ext4: Use readahead when reading an inode from the inode table

Message ID 20080928042749.GA8711@mit.edu
State Accepted, archived
Headers show

Commit Message

Theodore Ts'o Sept. 28, 2008, 4:27 a.m. UTC
On Fri, Sep 26, 2008 at 02:21:07AM -0600, Andreas Dilger wrote:
> As a general UI interface, specifying the "bits to shift the filesystem
> block number" seems like an easy-to-implement but is fairly bad from
> a usability point of view.  I'd much prefer to specify this as a
> number of kB to readahead, and it can be converted internally to the
> number of blocks to readahead.  It isn't fatal if we do a bit of rounding
> on the input value to match a full blocksize.

Good idea.  Here's a new patch which does this.

ext4: Use readahead when reading an inode from the inode table

With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request readahead for adjacent inode table
blocks to reduce the time it takes when iterating over directories
(especially when doing this in htree sort order) in a cold cache case.
With this patch, the time it takes to run "git status" on a kernel
tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Andreas Dilger Oct. 2, 2008, 2:17 a.m. UTC | #1
On Sep 28, 2008  00:27 -0400, Theodore Ts'o wrote:
> ext4: Use readahead when reading an inode from the inode table
> 
> With modern hard drives, reading 64k takes roughly the same time as
> reading a 4k block.  So request readahead for adjacent inode table
> blocks to reduce the time it takes when iterating over directories
> (especially when doing this in htree sort order) in a cold cache case.
> With this patch, the time it takes to run "git status" on a kernel
> tree after flushing the caches via "echo 3 > /proc/sys/vm/drop_caches"
> is reduced by 21%.

I'd actually thought that having a tunable in units of "kB" is better
than blocks, since userspace shouldn't have to know the filesystem
block size to tune readahead for a device.  Depending on the block size
this tunable can vary by 64x the amount of readahead (1kB vs. 64kB blocks).

> @@ -3969,6 +3934,36 @@ static int __ext4_get_inode_loc(struct inode *inode,
>  
>  make_io:
>  		/*
> +		 * If we need to do any I/O, try to readahead up to 16
> +		 * blocks from the inode table.

Comment is out of date.

> +		if (EXT4_SB(sb)->s_inode_readahead_blks) {
> +			/* Make sure s_inode_readahead_blks is a power of 2 */
> +			while (EXT4_SB(sb)->s_inode_readahead_blks &
> +			       (EXT4_SB(sb)->s_inode_readahead_blks-1))
> +				EXT4_SB(sb)->s_inode_readahead_blks = 
> +				   (EXT4_SB(sb)->s_inode_readahead_blks &
> +				    (EXT4_SB(sb)->s_inode_readahead_blks-1));

Is there a good reason why the readahead blocks is a power of 2?  Given
that the blocks are likely NOT contiguous for a directory, nor are they
aligned to the underlying LUN offsets, I don't think this is a benefit.
In any case, any tweaking of s_inode_readahead_blks should probably be
done at the time it is set instead of each time an inode is read.

> +			ext4_error(sb, "ext4_get_inode_loc",

s/ext4_get_inode_loc/__func__/?

> +		case Opt_inode_readahead_blks:
> +			if (option < 0 || option > 31)
> +				return 0;
> +			sbi->s_inode_readahead_blks = option;

This would appear to limit the inode_readahead_blks to 31 blocks, yet the
default is 32?  I suspect this is left over from when it was a shift?

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o Oct. 3, 2008, 3:56 a.m. UTC | #2
On Wed, Oct 01, 2008 at 07:17:46PM -0700, Andreas Dilger wrote:
> I'd actually thought that having a tunable in units of "kB" is better
> than blocks, since userspace shouldn't have to know the filesystem
> block size to tune readahead for a device.  Depending on the block size
> this tunable can vary by 64x the amount of readahead (1kB vs. 64kB blocks).

True, but realistically, except for benchmarks and die-hards that want
to get the last bit of performance out, I really doubt that many
people will be tweaking this tunable anyway.  Mainly I'm just feeling
rather lazy and am not sure it's worth it.  :-)

> > +		 * If we need to do any I/O, try to readahead up to 16
> > +		 * blocks from the inode table.
> 
> Comment is out of date.

Good point, thanks, I'll fix it.

> 
> > +		if (EXT4_SB(sb)->s_inode_readahead_blks) {
> > +			/* Make sure s_inode_readahead_blks is a power of 2 */
> > +			while (EXT4_SB(sb)->s_inode_readahead_blks &
> > +			       (EXT4_SB(sb)->s_inode_readahead_blks-1))
> > +				EXT4_SB(sb)->s_inode_readahead_blks = 
> > +				   (EXT4_SB(sb)->s_inode_readahead_blks &
> > +				    (EXT4_SB(sb)->s_inode_readahead_blks-1));
> 
> Is there a good reason why the readahead blocks is a power of 2?  Given
> that the blocks are likely NOT contiguous for a directory, nor are they
> aligned to the underlying LUN offsets, I don't think this is a benefit.

What we are reading ahead here are not directory blocks, but the inode
table.  So these are real, physical block numbers that we are dealing
with here, and so aligning the read requests to powers of two can be very
helpful.

> In any case, any tweaking of s_inode_readahead_blks should probably be
> done at the time it is set instead of each time an inode is read.

The parameter s_inode_readahead_blks can be set via a proc setting, so
it would be a lot more trouble to tweak the parameter at the time when
it is set.  (It could be done, but it means we could no longer use the
generic proc functions.)  As it turns out, if s_inode_readahead_blks
is a power of two, the while loop becomes a single conditional branch
instruction, so it's really not that expensive.

> > +			ext4_error(sb, "ext4_get_inode_loc",
> 
> s/ext4_get_inode_loc/__func__/?

Sure.

> > +		case Opt_inode_readahead_blks:
> > +			if (option < 0 || option > 31)
> > +				return 0;
> > +			sbi->s_inode_readahead_blks = option;
> 
> This would appear to limit the inode_readahead_blks to 31 blocks, yet the
> default is 32?  I suspect this is left over from when it was a shift?

Yup, good point.  I didn't notice becuase it's much more convenient to
use the /proc interface, and that's how I did all of my testing.

    	      		     	    	- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 163c445..922d187 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -790,6 +790,8 @@  static inline int ext4_valid_inum(struct super_block *sb, unsigned long ino)
 #define	EXT4_DEF_RESUID		0
 #define	EXT4_DEF_RESGID		0
 
+#define EXT4_DEF_INODE_READAHEAD_BLKS	32
+
 /*
  * Default mount options
  */
diff --git a/fs/ext4/ext4_sb.h b/fs/ext4/ext4_sb.h
index f92af01..94e0757 100644
--- a/fs/ext4/ext4_sb.h
+++ b/fs/ext4/ext4_sb.h
@@ -52,6 +52,7 @@  struct ext4_sb_info {
 	int s_desc_per_block_bits;
 	int s_inode_size;
 	int s_first_ino;
+	unsigned int s_inode_readahead_blks;
 	spinlock_t s_next_gen_lock;
 	u32 s_next_generation;
 	u32 s_hash_seed[4];
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eed1265..5bd700f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3833,41 +3833,6 @@  out_stop:
 	ext4_journal_stop(handle);
 }
 
-static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
-		unsigned long ino, struct ext4_iloc *iloc)
-{
-	ext4_group_t block_group;
-	unsigned long offset;
-	ext4_fsblk_t block;
-	struct ext4_group_desc *gdp;
-
-	if (!ext4_valid_inum(sb, ino)) {
-		/*
-		 * This error is already checked for in namei.c unless we are
-		 * looking at an NFS filehandle, in which case no error
-		 * report is needed
-		 */
-		return 0;
-	}
-
-	block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
-	gdp = ext4_get_group_desc(sb, block_group, NULL);
-	if (!gdp)
-		return 0;
-
-	/*
-	 * Figure out the offset within the block group inode table
-	 */
-	offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb)) *
-		EXT4_INODE_SIZE(sb);
-	block = ext4_inode_table(sb, gdp) +
-		(offset >> EXT4_BLOCK_SIZE_BITS(sb));
-
-	iloc->block_group = block_group;
-	iloc->offset = offset & (EXT4_BLOCK_SIZE(sb) - 1);
-	return block;
-}
-
 /*
  * ext4_get_inode_loc returns with an extra refcount against the inode's
  * underlying buffer_head on success. If 'in_mem' is true, we have all
@@ -3877,19 +3842,35 @@  static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
 static int __ext4_get_inode_loc(struct inode *inode,
 				struct ext4_iloc *iloc, int in_mem)
 {
-	ext4_fsblk_t block;
-	struct buffer_head *bh;
+	struct ext4_group_desc	*gdp;
+	struct buffer_head	*bh;
+	struct super_block	*sb = inode->i_sb;
+	ext4_fsblk_t		block;
+	int			inodes_per_block, inode_offset;
+
+	iloc->bh = 0;
+	if (!ext4_valid_inum(sb, inode->i_ino))
+		return -EIO;
 
-	block = ext4_get_inode_block(inode->i_sb, inode->i_ino, iloc);
-	if (!block)
+	iloc->block_group = (inode->i_ino - 1) / EXT4_INODES_PER_GROUP(sb);
+	gdp = ext4_get_group_desc(sb, iloc->block_group, NULL);
+	if (!gdp)
 		return -EIO;
 
-	bh = sb_getblk(inode->i_sb, block);
+	/*
+	 * Figure out the offset within the block group inode table
+	 */
+	inodes_per_block = (EXT4_BLOCK_SIZE(sb) / EXT4_INODE_SIZE(sb));
+	inode_offset = ((inode->i_ino - 1) %
+			EXT4_INODES_PER_GROUP(sb));
+	block = ext4_inode_table(sb, gdp) + (inode_offset / inodes_per_block);
+	iloc->offset = (inode_offset % inodes_per_block) * EXT4_INODE_SIZE(sb);
+
+	bh = sb_getblk(sb, block);
 	if (!bh) {
-		ext4_error (inode->i_sb, "ext4_get_inode_loc",
-				"unable to read inode block - "
-				"inode=%lu, block=%llu",
-				 inode->i_ino, block);
+		ext4_error(sb, "ext4_get_inode_loc", "unable to read "
+			   "inode block - inode=%lu, block=%llu",
+			   inode->i_ino, block);
 		return -EIO;
 	}
 	if (!buffer_uptodate(bh)) {
@@ -3917,28 +3898,12 @@  static int __ext4_get_inode_loc(struct inode *inode,
 		 */
 		if (in_mem) {
 			struct buffer_head *bitmap_bh;
-			struct ext4_group_desc *desc;
-			int inodes_per_buffer;
-			int inode_offset, i;
-			ext4_group_t block_group;
-			int start;
-
-			block_group = (inode->i_ino - 1) /
-					EXT4_INODES_PER_GROUP(inode->i_sb);
-			inodes_per_buffer = bh->b_size /
-				EXT4_INODE_SIZE(inode->i_sb);
-			inode_offset = ((inode->i_ino - 1) %
-					EXT4_INODES_PER_GROUP(inode->i_sb));
-			start = inode_offset & ~(inodes_per_buffer - 1);
+			int i, start;
 
-			/* Is the inode bitmap in cache? */
-			desc = ext4_get_group_desc(inode->i_sb,
-						block_group, NULL);
-			if (!desc)
-				goto make_io;
+			start = inode_offset & ~(inodes_per_block - 1);
 
-			bitmap_bh = sb_getblk(inode->i_sb,
-				ext4_inode_bitmap(inode->i_sb, desc));
+			/* Is the inode bitmap in cache? */
+			bitmap_bh = sb_getblk(sb, ext4_inode_bitmap(sb, gdp));
 			if (!bitmap_bh)
 				goto make_io;
 
@@ -3951,14 +3916,14 @@  static int __ext4_get_inode_loc(struct inode *inode,
 				brelse(bitmap_bh);
 				goto make_io;
 			}
-			for (i = start; i < start + inodes_per_buffer; i++) {
+			for (i = start; i < start + inodes_per_block; i++) {
 				if (i == inode_offset)
 					continue;
 				if (ext4_test_bit(i, bitmap_bh->b_data))
 					break;
 			}
 			brelse(bitmap_bh);
-			if (i == start + inodes_per_buffer) {
+			if (i == start + inodes_per_block) {
 				/* all other inodes are free, so skip I/O */
 				memset(bh->b_data, 0, bh->b_size);
 				set_buffer_uptodate(bh);
@@ -3969,6 +3934,36 @@  static int __ext4_get_inode_loc(struct inode *inode,
 
 make_io:
 		/*
+		 * If we need to do any I/O, try to readahead up to 16
+		 * blocks from the inode table.
+		 */
+		if (EXT4_SB(sb)->s_inode_readahead_blks) {
+			ext4_fsblk_t b, end, table;
+			unsigned num;
+
+			table = ext4_inode_table(sb, gdp);
+			/* Make sure s_inode_readahead_blks is a power of 2 */
+			while (EXT4_SB(sb)->s_inode_readahead_blks &
+			       (EXT4_SB(sb)->s_inode_readahead_blks-1))
+				EXT4_SB(sb)->s_inode_readahead_blks = 
+				   (EXT4_SB(sb)->s_inode_readahead_blks &
+				    (EXT4_SB(sb)->s_inode_readahead_blks-1));
+			b = block & ~(EXT4_SB(sb)->s_inode_readahead_blks-1);
+			if (table > b)
+				b = table;
+			end = b + EXT4_SB(sb)->s_inode_readahead_blks;
+			num = EXT4_INODES_PER_GROUP(sb);
+			if (EXT4_HAS_RO_COMPAT_FEATURE(sb,
+				       EXT4_FEATURE_RO_COMPAT_GDT_CSUM))
+				num -= le16_to_cpu(gdp->bg_itable_unused);
+			table += num / inodes_per_block;
+			if (end > table)
+				end = table;
+			while (b <= end)
+				sb_breadahead(sb, b++);
+		}
+
+		/*
 		 * There are other valid inodes in the buffer, this inode
 		 * has in-inode xattrs, or we don't have this inode in memory.
 		 * Read the block from disk.
@@ -3978,10 +3973,9 @@  make_io:
 		submit_bh(READ_META, bh);
 		wait_on_buffer(bh);
 		if (!buffer_uptodate(bh)) {
-			ext4_error(inode->i_sb, "ext4_get_inode_loc",
-					"unable to read inode block - "
-					"inode=%lu, block=%llu",
-					inode->i_ino, block);
+			ext4_error(sb, "ext4_get_inode_loc",
+				   "unable to read inode block - inode=%lu, "
+				   "block=%llu", inode->i_ino, block);
 			brelse(bh);
 			return -EIO;
 		}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6dee26d..9094095 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -514,8 +514,10 @@  static void ext4_put_super(struct super_block *sb)
 		BUFFER_TRACE(sbi->s_sbh, "marking dirty");
 		ext4_commit_super(sb, es, 1);
 	}
-	if (sbi->s_proc)
+	if (sbi->s_proc) {
+		remove_proc_entry("inode_readahead_blks", sbi->s_proc);
 		remove_proc_entry(sb->s_id, ext4_proc_root);
+	}
 
 	for (i = 0; i < sbi->s_gdb_count; i++)
 		brelse(sbi->s_group_desc[i]);
@@ -778,6 +780,10 @@  static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 	else if (test_opt(sb, DATA_FLAGS) == EXT4_MOUNT_WRITEBACK_DATA)
 		seq_puts(seq, ",data=writeback");
 
+	if (sbi->s_inode_readahead_blks != EXT4_DEF_INODE_READAHEAD_BLKS)
+		seq_printf(seq, ",inode_readahead_blks=%u",
+			   sbi->s_inode_readahead_blks);
+
 	ext4_show_quota_options(seq, sb);
 	return 0;
 }
@@ -912,6 +918,7 @@  enum {
 	Opt_ignore, Opt_barrier, Opt_err, Opt_resize, Opt_usrquota,
 	Opt_grpquota, Opt_extents, Opt_noextents, Opt_i_version,
 	Opt_mballoc, Opt_nomballoc, Opt_stripe, Opt_delalloc, Opt_nodelalloc,
+	Opt_inode_readahead_blks
 };
 
 static match_table_t tokens = {
@@ -972,6 +979,7 @@  static match_table_t tokens = {
 	{Opt_resize, "resize"},
 	{Opt_delalloc, "delalloc"},
 	{Opt_nodelalloc, "nodelalloc"},
+	{Opt_inode_readahead_blks, "inode_readahead_blks=%u"},
 	{Opt_err, NULL},
 };
 
@@ -1380,6 +1388,13 @@  set_qf_format:
 		case Opt_delalloc:
 			set_opt(sbi->s_mount_opt, DELALLOC);
 			break;
+		case Opt_inode_readahead_blks:
+			if (match_int(&args[0], &option))
+				return 0;
+			if (option < 0 || option > 31)
+				return 0;
+			sbi->s_inode_readahead_blks = option;
+			break;
 		default:
 			printk(KERN_ERR
 			       "EXT4-fs: Unrecognized mount option \"%s\" "
@@ -1937,6 +1952,7 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	sbi->s_mount_opt = 0;
 	sbi->s_resuid = EXT4_DEF_RESUID;
 	sbi->s_resgid = EXT4_DEF_RESGID;
+	sbi->s_inode_readahead_blks = EXT4_DEF_INODE_READAHEAD_BLKS;
 	sbi->s_sb_block = sb_block;
 
 	unlock_kernel();
@@ -2233,6 +2249,11 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	if (ext4_proc_root)
 		sbi->s_proc = proc_mkdir(sb->s_id, ext4_proc_root);
 
+	if (sbi->s_proc)
+		proc_create_data("inode_readahead_blks", 0644, sbi->s_proc,
+				 &ext4_ui_proc_fops,
+				 &sbi->s_inode_readahead_blks);
+
 	bgl_lock_init(&sbi->s_blockgroup_lock);
 
 	for (i = 0; i < db_count; i++) {
@@ -2512,8 +2533,10 @@  failed_mount2:
 		brelse(sbi->s_group_desc[i]);
 	kfree(sbi->s_group_desc);
 failed_mount:
-	if (sbi->s_proc)
+	if (sbi->s_proc) {
+		remove_proc_entry("inode_readahead_blks", sbi->s_proc);
 		remove_proc_entry(sb->s_id, ext4_proc_root);
+	}
 #ifdef CONFIG_QUOTA
 	for (i = 0; i < MAXQUOTAS; i++)
 		kfree(sbi->s_qf_names[i]);