
[RFC] ext4: Use preallocation when reading from the inode table

Message ID E1KhvsB-0002Ik-Ex@closure.thunk.org
State Superseded, archived
Delegated to: Theodore Ts'o
Headers show

Commit Message

Theodore Ts'o Sept. 23, 2008, 12:35 a.m. UTC
With modern hard drives, reading 64k takes roughly the same time as
reading a 4k block.  So request adjacent inode table blocks to reduce
the time it takes when iterating over directories (especially when doing
this in htree sort order) in a cold cache case.  With this patch, the
time it takes to run "git status" on a kernel tree after flushing the
caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.

Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
---

Note: this patch could work for ext3 as well.  Anyone see any downsides?
A 20% improvement seems too easy...  I guess it does increase what we
hold in the buffer cache by a small amount, but once the inodes are
cached, we'll stop needing to do any I/O, and we only try to do the
readahead when we know that we need to do some I/O anyway.

This patch also eliminates ext4_get_inode_block(), since it's a static
function which is only called once, and we needed access to the block
group descriptor, so it simplified the code to move the code into
__ext4_get_inode_loc().  The interesting bits are in the very last hunk
of the patch.


--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Alan Cox Sept. 23, 2008, 9:16 a.m. UTC | #1
On Mon, 22 Sep 2008 20:35:23 -0400
"Theodore Ts'o" <tytso@mit.edu> wrote:

> 
> With modern hard drives, reading 64k takes roughly the same time as
> reading a 4k block.  So request adjacent inode table blocks to reduce
> the time it takes when iterating over directories (especially when doing
> this in htree sort order) in a cold cache case.  With this patch, the
> time it takes to run "git status" on a kernel tree after flushing the
> caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
> 
> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>

Acked-by: Alan Cox <alan@redhat.com>

I'm actually surprised that 16 is the magic tuning number you've used and
a bigger one isn't even more of a win.

Alan
Andreas Dilger Sept. 23, 2008, 11:50 a.m. UTC | #2
On Sep 23, 2008  10:16 +0100, Alan Cox wrote:
> On Mon, 22 Sep 2008 20:35:23 -0400
> "Theodore Ts'o" <tytso@mit.edu> wrote:
> > With modern hard drives, reading 64k takes roughly the same time as
> > reading a 4k block.  So request adjacent inode table blocks to reduce
> > the time it takes when iterating over directories (especially when doing
> > this in htree sort order) in a cold cache case.  With this patch, the
> > time it takes to run "git status" on a kernel tree after flushing the
> > caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
> > 
> > Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
> 
> Acked-by: Alan Cox <alan@redhat.com>
> 
> I'm actually surprised that 16 is the magic tuning number you've used and
> a bigger one isn't even more of a win

I was going to suggest making this at least a #defined constant instead
of hard coding the values there.  Making it a mount option and/or /proc
value would allow further testing.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

Ric Wheeler Sept. 23, 2008, 12:18 p.m. UTC | #3
Andreas Dilger wrote:
> On Sep 23, 2008  10:16 +0100, Alan Cox wrote:
>   
>> On Mon, 22 Sep 2008 20:35:23 -0400
>> "Theodore Ts'o" <tytso@mit.edu> wrote:
>>     
>>> With modern hard drives, reading 64k takes roughly the same time as
>>> reading a 4k block.  So request adjacent inode table blocks to reduce
>>> the time it takes when iterating over directories (especially when doing
>>> this in htree sort order) in a cold cache case.  With this patch, the
>>> time it takes to run "git status" on a kernel tree after flushing the
>>> caches via "echo 3 > /proc/sys/vm/drop_caches", is reduced by 21%.
>>>
>>> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
>>>       
>> Acked-by: Alan Cox <alan@redhat.com>
>>
>> I'm actually surprised that 16 is the magic tuning number you've used and
>> a bigger one isn't even more of a win
>>     
>
> I was going to suggest making this at least a #defined constant instead
> of hard coding the values there.  Making it a mount option and/or /proc
> value would allow further testing.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
>
>   

I think that Alan is probably right - the magic number for modern drives 
is probably closer to 256K. Having it be a /sys tunable (with a larger 
default) would be a nice way to verify this.

Ric

Theodore Ts'o Sept. 24, 2008, 1:30 a.m. UTC | #4
On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
> I think that Alan is probably right - the magic number for modern drives  
> is probably closer to 256K. Having it be a /sys tunable (with a larger  
> default) would be a nice way to verify this.

I've played with this a bit, and with the "git status" workload,
increasing the magic number beyond 16 (64k) doesn't actually help,
because the number of inodes we need to touch isn't big enough.

So I switched to a different workload, which ran "find /path -size 0
-print" with a much larger directory hierarchy.  With that workload I
got the following results:

ra_bits	ra_blocks  ra_kb  seconds  % improvement
0	   1	     4	  53.3		 -
1	   2	     8	  47.3		11.3%
2	   4	    16	  41.7		21.8%
3	   8	    32	  37.5		29.6%
4	  16	    64	  34.4		35.5%
5	  32	   128	  32		40.0%
6	  64	   256	  30.7		42.4%
7	 128	   512	  28.8		46.0%
8	 256	  1024	  28.3		46.9%
9	 512	  2048	  27.5		48.4%

Given these numbers, I'm using a default inode_readahead_bits of 5
(i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
workload that is 100% stat-based, without any I/O, it is possible to
get better results by using a higher number, yes, but I'm concerned
that a larger readahead may end up interfering with other reads.  We
need to run some other workloads to be sure a larger number won't
cause problems before we go more aggressive on this parameter.

I'll send the revised patch in another message.

					- Ted
Ric Wheeler Sept. 24, 2008, 1:23 p.m. UTC | #5
Theodore Tso wrote:
> On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:
>   
>> I think that Alan is probably right - the magic number for modern drives  
>> is probably closer to 256K. Having it be a /sys tunable (with a larger  
>> default) would be a nice way to verify this.
>>     
>
> I've played with this a bit, and with the "git status" workload,
> increasing the magic number beyond 16 (64k) doesn't actually help,
> because the number of inodes we need to touch wasn't big enough.
>
> So I switched to a different workload, which ran "find /path -size 0
> -print" with a much larger directory hierarchy.  With that workload I
> got the following results:
>
> ra_bits	ra_blocks  ra_kb  seconds  % improvement
> 0	   1	     4	  53.3		 -
> 1	   2	     8	  47.3		11.3%
> 2	   4	    16	  41.7		21.8%
> 3	   8	    32	  37.5		29.6%
> 4	  16	    64	  34.4		35.5%
> 5	  32	   128	  32		40.0%
> 6	  64	   256	  30.7		42.4%
> 7	 128	   512	  28.8		46.0%
> 8	 256	  1024	  28.3		46.9%
> 9	 512	  2048	  27.5		48.4%
>
> Given these numbers, I'm using a default inode_readahead_bits of 5
> (i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
> workload that is 100% stat-based, without any I/O, it is possible to
> get better results by using a higher number, yes, but I'm concerned
> that a larger readahead may end up interfering with other reads.  We
> need to run some other workloads to be sure a larger number won't
> cause problems before we go more aggressive on this parameter.
>
> I'll send the revised patch in another message.
>
> 					- Ted
>   

That sounds about right for modern S-ATA/SAS drives. I would expect that 
having this be a tunable knob might help for some types of storage (SSD 
might not care, but should be faster in any case?).

ric

Chris Mason Sept. 24, 2008, 2:20 p.m. UTC | #6
On Wed, 2008-09-24 at 09:23 -0400, Ric Wheeler wrote:
> Theodore Tso wrote:
> > On Tue, Sep 23, 2008 at 08:18:54AM -0400, Ric Wheeler wrote:

[ numbers ]

> > Given these numbers, I'm using a default inode_readahead_bits of 5
> > (i.e., 32 blocks, or 128k for 4k blocksize filesystems).  For a
> > workload that is 100% stat-based, without any I/O, it is possible to
> > get better results by using a higher number, yes, but I'm concerned
> > that a larger readahead may end up interfering with other reads.  We
> > need to run some other workloads to be sure a larger number won't
> > cause problems before we go more aggressive on this parameter.
>
> That sounds about right for modern S-ATA/SAS drives. I would expect that 
> having this be a tunable knob might help for some types of storage (SSD 
> might not care, but should be faster in any case?).

For the test runs being done here, there's a pretty high chance that all
of the inodes you read ahead will get used before the pages are dropped,
so we want to find a balance between those and the worst case workloads
where inode reads are basically random.  One good data point is the
completion time for IOs of different sizes.

I used fio to measure the latencies on O_DIRECT randomreads of given
sizes on a fast 500GB sata drive.  Here is the output for a 4k run (I
used elevator=noop, but cfq was about the same):

f4k: (groupid=6, jobs=1): err= 0: pid=22877
  read : io=15816KiB, bw=539KiB/s, iops=131, runt= 30004msec
    clat (usec): min=555, max=20909, avg=7581.38, stdev=2475.88
     issued r/w: total=3954/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 2=0.03%, 4=7.08%, 10=71.60%, 20=21.24%, 50=0.03%

clat is completion latency, but note fio switches between usec and msec
just to keep us on our toes.  Other important numbers are iop/s and
total issued ios.  The test limits the run on each IO size to 30
seconds.

The 4k run gets 131 iop/s, so my SATA drive can read 131 inodes/second
in a worst case random workload.  iop rates for the others:

4k	131
8k	130
16k	128
32k	126
64k	121
128k	113
256k	100

A slightly trimmed job output is below, and the fio job file I used is
attached if anyone wants to try this on their own machines.  I'd stick
with either 32k or 64k as the sweet spots, but a tunable is definitely a
good idea.

-chris

f256k: (groupid=0, jobs=1): err= 0: pid=22871
  read : io=770816KiB, bw=26309KiB/s, iops=100, runt= 30001msec
    clat (msec): min=1, max=45, avg= 9.96, stdev= 2.63
     issued r/w: total=3011/0, short=0/0
     lat (msec): 2=0.03%, 10=50.35%, 20=49.58%, 50=0.03%

f128k: (groupid=1, jobs=1): err= 0: pid=22872
  read : io=434560KiB, bw=14830KiB/s, iops=113, runt= 30005msec
    clat (msec): min=1, max=72, avg= 8.83, stdev= 2.82
     issued r/w: total=3395/0, short=0/0
     lat (msec): 2=0.06%, 4=0.62%, 10=63.62%, 20=35.64%, 100=0.06%

f64k: (groupid=2, jobs=1): err= 0: pid=22873
  read : io=233280KiB, bw=7961KiB/s, iops=121, runt= 30006msec
    clat (usec): min=815, max=14931, avg=8225.21, stdev=2471.22
     issued r/w: total=3645/0, short=0/0
     lat (usec): 1000=0.05%
     lat (msec): 4=2.50%, 10=69.11%, 20=28.34%

f32k: (groupid=3, jobs=1): err= 0: pid=22874
  read : io=121472KiB, bw=4144KiB/s, iops=126, runt= 30010msec
    clat (usec): min=715, max=53124, avg=7898.75, stdev=2613.35
     issued r/w: total=3796/0, short=0/0
     lat (usec): 750=0.03%
     lat (msec): 4=4.77%, 10=70.10%, 20=25.08%, 100=0.03%

f16k: (groupid=4, jobs=1): err= 0: pid=22875
  read : io=61584KiB, bw=2101KiB/s, iops=128, runt= 30001msec
    clat (msec): min=1, max=16, avg= 7.79, stdev= 2.46
     issued r/w: total=3849/0, short=0/0
[global]
filename=/dev/sdb
numjobs=1
size=16g
rw=randread
direct=1

[f256k]
bs=256k
runtime=30
stonewall

[f128k]
bs=128k
runtime=30
stonewall

[f64k]
bs=64k
runtime=30
stonewall

[f32k]
bs=32k
runtime=30
stonewall

[f16k]
bs=16k
runtime=30
stonewall

[f8k]
bs=8k
runtime=30
stonewall

[f4k]
bs=4k
runtime=30
stonewall
Theodore Ts'o Sept. 24, 2008, 8:35 p.m. UTC | #7
On Wed, Sep 24, 2008 at 09:23:34AM -0400, Ric Wheeler wrote:
>
> That sounds about right for modern S-ATA/SAS drives. I would expect that  
> having this be a tunable knob might help for some types of storage (SSD  
> might not care, but should be faster in any case?).
>

Well, for SSDs, seek time shouldn't matter; the limiting factor should
instead be the overhead of the read transaction and the I/O bandwidth
of the SSD.  So prefetching too much might hurt even more for SSDs,
although in comparison with spinning rust platters, it would probably
still be faster.  :-)

So I'm pretty sure that with an SSD we'll want to turn the tunable
down, not up.

On Wed, Sep 24, 2008 at 10:20:34AM -0400, Chris Mason wrote:
>For the test runs being done here, there's a pretty high chance that all
>of the inodes you read ahead will get used before the pages are dropped,
>so we want to find a balance between those and the worst case workloads
>where inode reads are basically random.  

Yep, agreed.

On the other hand, if we take your iop/s numbers and translate them to
milliseconds so we can measure the latency in the case where the
workload is essentially doing random reads, and then cross-correlate
them with my measurements, we get this table:

i/o size  iops/s  ms latency  % degradation        % improvement
                              of random inodes     of related inodes I/O
   4k	  131       7.634      
   8k	  130	    7.692	0.77%		    11.3%
  16k	  128	    7.813	2.34%		    21.8%
  32k	  126	    7.937	3.97%		    29.6%
  64k	  121	    8.264	8.26%		    35.5%
 128k	  113	    8.850      15.93%		    40.0%
 256k	  100	   10.000      31.00%		    42.4%

Depending on whether you believe that workloads involving random inode
reads are more common compared to related inodes I/O, the sweet spot
is probably somewhere between 32k and 128k.  I'm open to opinions
(preferably backed up with more benchmarks of likely workloads) of
whether we should use a default value of inode_readahead_bits of 4 or
5 (i.e., 64k, my original guess, or 128k, in v2 of the patch).  But
yes, making it tunable is definitely going to be necessary, since
different workloads (e.g., squid vs. git repositories) will have very
different requirements.

The other thought which comes to mind is whether we should use a
similarly large readahead when we are doing writes rather than reads.
For example, if we are just updating a single inode, and we are
reading a block to do a read/modify/write cycle, maybe we shouldn't be
doing as much readahead.

						- Ted

P.S.  One caveat is that my numbers were taken from a laptop SATA
drive, and if Chris's were taken from a desktop/server SATA drive the
numbers might not be properly comparable.
Andreas Dilger Sept. 25, 2008, 11:40 p.m. UTC | #8
On Sep 24, 2008  16:35 -0400, Theodore Ts'o wrote:
> On the other hand, if we take your iop/s numbers and translate them to
> milliseconds so we can measure the latency in the case where the
> workload is essentially doing random reads, and then cross-correlate
> them with my measurements, we get this table:

Comparing the incremental benefit of each step:

> i/o size  iops/s  ms latency  % degradation        % improvement
>                               of random inodes     of related inodes I/O
>    4k	  131       7.634      
>    8k	  130	    7.692	0.77%		    11.3%
                                       1.57%              10.5%
>   16k	  128	    7.813	2.34%		    21.8%
                                       1.63%               7.8%
>   32k	  126	    7.937	3.97%		    29.6%
                                       4.29%               5.9%
>   64k	  121	    8.264	8.26%		    35.5%
                                       7.67%               4.5%
>  128k	  113	    8.850      15.93%		    40.0%
                                      16.07%               2.4%
>  256k	  100	   10.000      31.00%		    42.4%
> 
> Depending on whether you believe that workloads involving random inode
> reads are more common compared to related inodes I/O, the sweet spot
> is probably somewhere between 32k and 128k.  I'm open to opinions
> (preferably backed up with more benchmarks of likely workloads) of
> whether we should use a default value of inode_readahead_bits of 4 or
> 5 (i.e., 64k, my original guess, or 128k, in v2 of the patch).  But
> yes, making it tunable is definitely going to be necessary, since
> different workloads (e.g., squid vs. git repositories) will have very
> different requirements.

It looks like moving from 64kB to 128kB readahead might be a loss for
"unknown" workloads, since that increases latency by 7.67% for the random
inode case, but we only get 4.5% improvement in the sequential inode case.
Also recall that at large scale "htree" breaks down to random inode
lookup so that isn't exactly a fringe case (though readahead may still
help if the cache is large enough).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.


Patch

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index eed1265..9f6c10e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3833,41 +3833,6 @@  out_stop:
 	ext4_journal_stop(handle);
 }
 
-static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
-		unsigned long ino, struct ext4_iloc *iloc)
-{
-	ext4_group_t block_group;
-	unsigned long offset;
-	ext4_fsblk_t block;
-	struct ext4_group_desc *gdp;
-
-	if (!ext4_valid_inum(sb, ino)) {
-		/*
-		 * This error is already checked for in namei.c unless we are
-		 * looking at an NFS filehandle, in which case no error
-		 * report is needed
-		 */
-		return 0;
-	}
-
-	block_group = (ino - 1) / EXT4_INODES_PER_GROUP(sb);
-	gdp = ext4_get_group_desc(sb, block_group, NULL);
-	if (!gdp)
-		return 0;
-
-	/*
-	 * Figure out the offset within the block group inode table
-	 */
-	offset = ((ino - 1) % EXT4_INODES_PER_GROUP(sb)) *
-		EXT4_INODE_SIZE(sb);
-	block = ext4_inode_table(sb, gdp) +
-		(offset >> EXT4_BLOCK_SIZE_BITS(sb));
-
-	iloc->block_group = block_group;
-	iloc->offset = offset & (EXT4_BLOCK_SIZE(sb) - 1);
-	return block;
-}
-
 /*
  * ext4_get_inode_loc returns with an extra refcount against the inode's
  * underlying buffer_head on success. If 'in_mem' is true, we have all
@@ -3877,13 +3842,30 @@  static ext4_fsblk_t ext4_get_inode_block(struct super_block *sb,
 static int __ext4_get_inode_loc(struct inode *inode,
 				struct ext4_iloc *iloc, int in_mem)
 {
+	struct ext4_group_desc *gdp;
 	ext4_fsblk_t block;
 	struct buffer_head *bh;
 
-	block = ext4_get_inode_block(inode->i_sb, inode->i_ino, iloc);
-	if (!block)
+	iloc->bh = 0;
+	if (!ext4_valid_inum(inode->i_sb, inode->i_ino))
 		return -EIO;
 
+	iloc->block_group = (inode->i_ino - 1) / 
+		EXT4_INODES_PER_GROUP(inode->i_sb);
+	gdp = ext4_get_group_desc(inode->i_sb, iloc->block_group, NULL);
+	if (!gdp)
+		return -EIO;
+
+	/*
+	 * Figure out the offset within the block group inode table
+	 */
+	iloc->offset = ((inode->i_ino - 1) % 
+			EXT4_INODES_PER_GROUP(inode->i_sb)) *
+		EXT4_INODE_SIZE(inode->i_sb);
+	block = ext4_inode_table(inode->i_sb, gdp) +
+		(iloc->offset >> EXT4_BLOCK_SIZE_BITS(inode->i_sb));
+	iloc->offset = iloc->offset & (EXT4_BLOCK_SIZE(inode->i_sb) - 1);
+
 	bh = sb_getblk(inode->i_sb, block);
 	if (!bh) {
 		ext4_error (inode->i_sb, "ext4_get_inode_loc",
@@ -3969,6 +3951,31 @@  static int __ext4_get_inode_loc(struct inode *inode,
 
 make_io:
 		/*
+		 * If we need to do any I/O, try to readahead up to 16
+		 * blocks from the inode table.
+		 */
+		{
+			ext4_fsblk_t b, end, table;
+			unsigned num;
+
+			table = ext4_inode_table(inode->i_sb, gdp);
+			b = block & ~15;
+			if (table > b)
+				b = table;
+			end = b+16;
+			num = EXT4_INODES_PER_GROUP(inode->i_sb);
+			if (EXT4_HAS_RO_COMPAT_FEATURE(inode->i_sb, 
+				       EXT4_FEATURE_RO_COMPAT_GDT_CSUM))
+				num -= le16_to_cpu(gdp->bg_itable_unused);
+			table += num / (EXT4_BLOCK_SIZE(inode->i_sb) /
+					EXT4_INODE_SIZE(inode->i_sb));
+			if (end > table)
+				end = table;
+			while (b <= end)
+				sb_breadahead(inode->i_sb, b++);
+		}
+
+		/*
 		 * There are other valid inodes in the buffer, this inode
 		 * has in-inode xattrs, or we don't have this inode in memory.
 		 * Read the block from disk.