
[2/2] libext2fs/e2fsck: implement metadata prefetching

Message ID 20140130235058.31064.21096.stgit@birch.djwong.org
State Superseded, archived

Commit Message

Darrick Wong Jan. 30, 2014, 11:50 p.m. UTC
Use threads with our new mmap IO manager to prefetch metadata.  This
results in a major e2fsck run time speedup.  There's also a stupider
multiprocess version that works with the good old UNIX IO manager to
get pages into the page cache.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 e2fsck/unix.c          |   13 +
 lib/ext2fs/Makefile.in |    8 +
 lib/ext2fs/ext2fs.h    |   13 +
 lib/ext2fs/prefetch.c  |  456 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 490 insertions(+)
 create mode 100644 lib/ext2fs/prefetch.c



--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Theodore Ts'o Jan. 31, 2014, 1:53 p.m. UTC | #1
On Fri, Jan 31, 2014 at 03:10:00AM -0700, Andreas Dilger wrote:
> We implemented something like this for a metadata scanning tool called
> "e2scan".  At the time, the fastest method of prefetching data was
> posix_fadvise(POSIX_FADV_WILLNEED). We also tried the readahead() syscall. 

I think using posix_fadvise() and readahead() is probably the best way
to go, at least initially.  If we can avoid needing to add an extra
I/O manager, I think that would be better.

As far as a single HDD prefetching its brains out, it would be good if
we can figure out a way to enable the right amount of prefetching
automatically.  Something that might be worth trying is to instrument
unix_io so we can measure the time waiting for disk I/O, and then
compare that to the wall clock time running on a single disk file
system, without doing any prefetching.

If there isn't any delta, then the only way prefetching could help us
is if we can optimize the head motion and remove some seeks (i.e., if
we know that we will need block #23 in the future, and we get a
request for block #24, we can read both at one go).  If we measure the
difference between time spent for disk I/O during each e2fsck pass,
and wall clock time during each I/O, we'll also know which e2fsck pass
would benefit the most from smarter prefetching.

What that might imply is that for HDD's --- which is where we would
want something like this the most --- what we might want to do is to
create a list of blocks that e2fsck knows it will need next (for
example, during pass 1, we are collecting the blocks for directories,
so we could use that to populate the prefetch list for pass 2).

Something that we might need to go to in the future is instead of
using mmap(), to maintain our own explicit buffer cache inside
unix_io, and use direct I/O to avoid caching the disk blocks twice.
Then when we use a single-threaded disk prefetcher, managed by the
unix_io, it will know when a particular I/O request has completed, and
more importantly, if there is a synchronous read request coming in
from the main body of the program, it can stop prefetching and allow the
higher priority read to complete.  We can also experiment with how
many threads might make sense --- even with an HDD, using multiple
threads so that we can take advantage of NCQ might still be a win.

One other thought.  The location of the dynamic metadata blocks (i.e.,
directory and extent tree blocks) is a hint that we could potentially
store in the file system.  Since it is only for optimization purposes,
e2fsck can try to find it in the file system, and if it is there, and
it looks valid it can use it.  If the file system is too corrupted, or
the data looks suspect, it can always ignore the cached hints.

Finally, if we are managing our own buffer cache, we should consider
adding a bforget method to the I/O manager.  That way e2fsck can give
hints to the caching layer that a block isn't needed any more.  If it
is in the cache, it can be dropped, to free memory, and if it is still
on the to-be-prefetched list it should also be dropped.  (Of course,
if a block is on the to-be-prefetched list, and a synchronous read
request comes in for that block, we should have dropped it from the
to-be-prefetched list at that point.)  The main use for a bforget
method is that, for the most part, once we are done scanning a
non-directory extent tree block, we won't need it again.

Slightly more complex (so it might not be worth it), we could also
drop inode table blocks from our buffer cache that do not contain any
directory inodes once we are done scanning them in pass 1.

Cheers,

						- Ted
Darrick Wong Feb. 1, 2014, 8:16 a.m. UTC | #2
On Fri, Jan 31, 2014 at 08:53:25AM -0500, Theodore Ts'o wrote:
> On Fri, Jan 31, 2014 at 03:10:00AM -0700, Andreas Dilger wrote:
> > We implemented something like this for a metadata scanning tool called
> > "e2scan".  At the time, the fastest method of prefetching data was
> > posix_fadvise(POSIX_FADV_WILLNEED). We also tried the readahead() syscall. 

Thanks, Andreas! :)

Hmm.  I pulled in that patch and rewrote the prefetcher routines to
spawn a single thread to try to pull in the bitmaps and itable via
posix_fadvise.  I also noticed that e2fsck pass1 creates a list of
directory blocks to scan in pass2, so I restructured the prefetch code
to prefetch an arbitrary dblist.  The bitmap/itable prefetcher now
simply creates a dblist and hands it off to the dblist fetch routine.
It's also smart enough to find contiguous runs of blocks and
fadvise the whole run in with a single call.

The initial naive e2fsck implementation starts separate threads to
prefetch everything without caring if it blows out the cache.  This
seems to reduce e2fsck run times on SSD and RAIDs by 20-40%.

On the plus side the code became much easier, and fsck seems able to
repair a broken filesystem without crashing.  Welp, I guess that
teaches me not to go overboard with complexity until it's needed.

I made a few changes to Andreas' patch -- first, I changed the
parameters to ULL types to be consistent with the other IO manager
function prototypes; second, I made it a bit more robust for IO
managers that don't provide a readahead function.  (Does anyone still
build this on NT?)

> I think using posix_fadvise() and readahead() is probably the best way
> to go, at least initially.  If we can avoid needing to add an extra
> I/O manager, I think that would be better.

Agreed.  I will send new patches shortly.

> As far as a single HDD prefetching its brains out, it would be good if
> we can figure out a way to enable the right amount of prefetching
> automatically.  Something that might be worth trying is to instrument
> unix_io so we can measure the time waiting for disk I/O, and then
> compare that to the wall clock time running on a single disk file
> system, without doing any prefetching.

Changing to this fadvise mechanism seems to have eliminated the
horrible blowout on USB HDDs.  There's a slight speed decrease.

> If there isn't any delta, then the only way prefetching could help us
> is if we can optimize the head motion and remove some seeks (i.e., if
> we know that we will need block #23 in the future, and we get a
> request for block #24, we can read both at one go).  If we measure the
> difference between time spent for disk I/O during each e2fsck pass,
> and wall clock time during each I/O, we'll also know which e2fsck pass
> would benefit the most from smarter prefetching.
> 
> What that might imply is that for HDD's --- which is where we would
> want something like this the most --- what we might want to do is to
> create a list of blocks that e2fsck knows it will need next (for
> example, during pass 1, we are collecting the blocks for directories,
> so we could use that to populate the prefetch list for pass 2).
> 
> Something that we might need to go to in the future is instead of
> using mmap(), to maintain our own explicit buffer cache inside
> unix_io, and use direct I/O to avoid caching the disk blocks twice.
> Then when we use a single-threaded disk prefetcher, managed by the
> unix_io, it will know when a particular I/O request has completed, and
> more importantly, if there is a synchronous read request coming in
> from main body of the program, it can stop prefetching and allow the
> higher priority read to complete.  We can also experiment with how
> many threads might make sense --- even with an HDD, using multiple
> threads so that we can take advantage of NCQ might still be a win.
> 
> One other thought.  The location of the dynamic metadata blocks (i.e.,
> directory and extent tree blocks) is a hint that we could potentially
> store in the file system.  Since it is only for optimization purposes,
> e2fsck can try to find it in the file system, and if it is there, and
> it looks valid it can use it.  If the file system is too corrupted, or
> the data looks suspect, it can always ignore the cached hints.
> 
> Finally, if we are managing our own buffer cache, we should consider
> adding a bforget method to the I/O manager.  That way e2fsck can give
> hints to the caching layer that a block isn't needed any more.  If it
> is in the cache, it can be dropped, to free memory, and if it is still
> on the to-be-prefetched list it should also be dropped.  (Of course,
> if a block is on the to-be-prefetched list, and a synchronous read
> request comes in for that block, we should have dropped it from the
> to-be-prefetched list at that point.)  The main use for having a
> bforget method is for the most part, once we are done scanning a
> non-directory extent tree block, we won't be needing it again.  
> 
> Slightly more complex (so it might not be worth it), we could also
> drop inode table blocks from our buffer cache that do not contain any
> directory inodes once we are done scanning them in pass 1.

Let's play around with my silly prefetch patches and decide if we need
this extra complexity.

I thought about prefetching the extent tree too, but I'm not sure if
that's worth the trouble.  It might be worth it to go for the ETBs
that the inode points to directly, since that's more or less free.

--D
Phillip Susi Feb. 27, 2014, 5:03 p.m. UTC | #3

On 1/31/2014 8:53 AM, Theodore Ts'o wrote:
> Something that we might need to go to in the future is instead of 
> using mmap(), to maintain our own explicit buffer cache inside 
> unix_io, and use direct I/O to avoid caching the disk blocks
> twice. Then when we use a single-threaded disk prefetcher, managed
> by the unix_io, it will know when a particular I/O request has
> completed, and more importantly, if there is a synchronous read
> request coming in from main body of the program, it can stop
> prefetching and allow the higher priority read to complete.  We can
> also experiment with how many threads might make sense --- even
> with an HDD, using multiple threads so that we can take advantage
> of NCQ might still be a win.

Why build your own cache instead of letting the kernel take care of
it?  I  believe the IO elevator already gives preferential treatment
to blocking reads so just using readahead() to prefetch and sticking
with plain old read() should work nicely.
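For reference, a sketch of that combination; note that readahead(2) is
Linux-specific and, unlike POSIX_FADV_WILLNEED, blocks until the requested
range has been read into the page cache:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE	/* readahead() is a GNU extension */
#endif
#include <fcntl.h>

/* Populate the page cache for [offset, offset + count); subsequent
 * plain read()/pread() calls on the range should not touch the disk. */
static int prefetch_region(int fd, off64_t offset, size_t count)
{
	return (int)readahead(fd, offset, count);
}
```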

> Finally, if we are managing our own buffer cache, we should
> consider adding a bforget method to the I/O manager.  That way
> e2fsck can give hints to the caching layer that a block isn't
> needed any more.  If it is in the cache, it can be dropped, to free
> memory, and if it is still on the to-be-prefetched list it should
> also be dropped.  (Of course, if a block is on the to-be-prefetched
> list, and a synchronous read request comes in for that block, we
> should have dropped it from the to-be-prefetched list at that
> point.)  The main use for having a bforget method is for the most
> part, once we are done scanning a non-directory extent tree block,
> we won't be needing it again.

Good idea, but this also could just be translated to posix_fadvise to
have the kernel drop the pages from the cache.

Darrick Wong Feb. 27, 2014, 6:31 p.m. UTC | #4
On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
> On 1/31/2014 8:53 AM, Theodore Ts'o wrote:
> > Something that we might need to go to in the future is instead of 
> > using mmap(), to maintain our own explicit buffer cache inside 
> > unix_io, and use direct I/O to avoid caching the disk blocks
> > twice. Then when we use a single-threaded disk prefetcher, managed
> > by the unix_io, it will know when a particular I/O request has
> > completed, and more importantly, if there is a synchronous read
> > request coming in from main body of the program, it can stop
> > prefetching and allow the higher priority read to complete.  We can
> > also experiment with how many threads might make sense --- even
> > with an HDD, using multiple threads so that we can take advantage
> > of NCQ might still be a win.
> 
> Why build your own cache instead of letting the kernel take care of
> it?  I  believe the IO elevator already gives preferential treatment
> to blocking reads so just using readahead() to prefetch and sticking
> with plain old read() should work nicely.
> 
> > Finally, if we are managing our own buffer cache, we should
> > consider adding a bforget method to the I/O manager.  That way
> > e2fsck can give hints to the caching layer that a block isn't
> > needed any more.  If it is in the cache, it can be dropped, to free
> > memory, and if it is still on the to-be-prefetched list it should
> > also be dropped.  (Of course, if a block is on the to-be-prefetched
> > list, and a synchronous read request comes in for that block, we
> > should have dropped it from the to-be-prefetched list at that
> > point.)  The main use for having a bforget method is for the most
> > part, once we are done scanning a non-directory extent tree block,
> > we won't be needing it again.
> 
> Good idea, but this also could just be translated to posix_fadvise to
> have the kernel drop the pages from the cache.

That's more or less what I've done with the e2fsck patchset I'm working on.
It's not terribly smarter than Val's old thing from ~2007, but nowadays more
people have the kind of hardware where prefetching is noticeable.

Well, pass2 tells the kernel to drop each dir block as soon as it thinks it's
finished with it.  I haven't studied the effect of dropping the
itable during pass1.

--D
Theodore Ts'o Feb. 28, 2014, 2:28 a.m. UTC | #5
On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
> 
> Why build your own cache instead of letting the kernel take care of
> it?  I  believe the IO elevator already gives preferential treatment
> to blocking reads so just using readahead() to prefetch and sticking
> with plain old read() should work nicely.

The reason why it might be better for us to use our own cache is
because we can more accurately know when we're done with the block,
and we can drop it from the cache.

I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and
hopefully this works on block devices for the buffer cache, but it
wouldn't at all surprise me if we can get finer-grained control if
we use O_DIRECT and manage the buffers ourselves.  Whether it's worth
the extra complexity is a fair question --- but simply adding
metadata prefetching is going to add a fair amount of complexity
already, and we should test to make sure that readahead() and
posix_fadvise() actually work correctly on block devices --- a couple
of years ago, I had explored readahead() precisely as a cheap way of
adding metadata precaching for e2fsck, and it was a no-op when I tried
the test back then.

						- Ted
Andreas Dilger Feb. 28, 2014, 6:54 p.m. UTC | #6
On Feb 27, 2014, at 7:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
>> 
>> Why build your own cache instead of letting the kernel take care of
>> it?  I  believe the IO elevator already gives preferential treatment
>> to blocking reads so just using readahead() to prefetch and sticking
>> with plain old read() should work nicely.
> 
> The reason why it might be better for us to use our own cache is
> because we can more accurately know when we're done with the block,
> and we can drop it from the cache.

One argument in favour of using the kernel buffer cache is that the
common case of e2fsck followed by mounting the filesystem would be
much faster because e2fsck has already populated the kernel cache.
Otherwise, all of the IO done to populate the userspace cache would
be lost when e2fsck exits.  Similarly, repeated runs of e2fsck would
not see any benefit of the userspace cache.

> I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and
> hopefully this works on block devices for the buffer cache, but it
> wouldn't at all surprise me if we can get finer-grained control if
> we use O_DIRECT and manage the buffers ourselves.  Whether it's worth
> the extra complexity is a fair question --- but simply adding
> metadata prefetching is going to add a fair amount of complexity
> already, and we should test to make sure that readahead() and
> posix_fadvise() actually work correctly on block devices --- a couple
> of years ago, I had explored readahead() precisely as a cheap way of
> adding metadata precaching for e2fsck, and it was a no-op when I tried
> the test back then.

We tested several different mechanisms for readahead a few years ago
for the e2scan tool, and that resulted in the readahead patch that
Darrick updated recently.  It definitely shows performance improvement.

Whether POSIX_FADV_DONTNEED actually flushes pages from cache is a
separate question.  My preference would be that if this is currently
a no-op that we work to fix it in the kernel so that it is working
for everyone rather than investing time and effort into code that is
only useful for e2fsprogs.

Cheers, Andreas
Darrick Wong Feb. 28, 2014, 8:18 p.m. UTC | #7
On Fri, Feb 28, 2014 at 11:54:55AM -0700, Andreas Dilger wrote:
> On Feb 27, 2014, at 7:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
> >> 
> >> Why build your own cache instead of letting the kernel take care of
> >> it?  I  believe the IO elevator already gives preferential treatment
> >> to blocking reads so just using readahead() to prefetch and sticking
> >> with plain old read() should work nicely.
> > 
> > The reason why it might be better for us to use our own cache is
> > because we can more accurately know when we're done with the block,
> > and we can drop it from the cache.

But if we accurately know when we're done with the block, we can also ask the
kernel to drop the block from its cache.

> One argument in favour of using the kernel buffer cache is that the
> common case of e2fsck followed by mounting the filesystem would be
> much faster because e2fsck has already populated the kernel cache.
> Otherwise, all of the IO done to populate the userspace cache would
> be lost when e2fsck exits.  Similarly, repeated runs of e2fsck would
> not see any benefit of the userspace cache.

I tend to agree with this.

> > I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and
> > hopefully this works on block devices for the buffer cache, but it
> > wouldn't at all surprise me if we can get finer-grained control if
> > we use O_DIRECT and manage the buffers ourselves.  Whether it's worth
> > the extra complexity is a fair question --- but simply adding
> > metadata prefetching is going to add a fair amount of complexity

It's not too much churn; here's the diffstat from my patches:
21 files changed, 723 insertions(+), 8 deletions(-)

I was attempting to write the smallest readahead implementation I could get
away with.  I noticed that unix_io.c already implements a simplistic 8-block
cache, but it doesn't have the notion of readahead.  The kernel provides the
fadvise knob for readahead, so if it really worked as advertised, why not use
that?

I further thought that regardless of whether I use fadvise or build out
unix_io's cache, we'd probably want a way to asynchronously (pre)fetch large
chunks of metadata anyway.  unix_io is pretty terrible about threaded IO (the
lseek+read/write sequences are not thread safe), which means I'd probably have to make
them thread safe if I wanted any significant threaded RA ability.  I actually
converted unix_io to use pread/pwrite if available as part of that work, and
I'll send that patch along in case anyone wants to reduce system call
overhead.

The fadvise approach avoids most of that thread safety requirement.  I tried to
write the prefetch code carefully enough not to modify any of the data
structures fed into it, and let the kernel take care of all the details of
threaded IO.  At worst, fadvise ignores e2fsck and we lose nothing.

> > already, and we should test to make sure that readahead() and
> > posix_fadvise() actually work correctly on block devices --- a couple
> > of years ago, I had explored readahead() precisely as a cheap way of
> > adding metadata precaching for e2fsck, and it was a no-op when I tried
> > the test back then.

(See below)

> We tested several different mechanisms for readahead a few years ago
> for the e2scan tool, and that resulted in the readahead patch that
> Darrick updated recently.  It definitely shows performance improvement.
> 
> Whether POSIX_FADV_DONTNEED actually flushes pages from cache is a
> separate question.  My preference would be that if this is currently
> a no-op that we work to fix it in the kernel so that it is working
> for everyone rather than investing time and effort into code that is
> only useful for e2fsprogs.

I wrote a test program[1] that WILLNEED's 256M and then DONTNEED's it, and saw
these results (on 3.14-rc4):

Before WILLNEED
             total       used       free     shared    buffers     cached
Mem:       2049412      83764    1965648          0          0      14124
-/+ buffers/cache:      69640    1979772
Swap:            0          0          0
After WILLNEED
             total       used       free     shared    buffers     cached
Mem:       2049412     346304    1703108          0     262144      14148
-/+ buffers/cache:      70012    1979400
Swap:            0          0          0
Sleeping for 30 seconds...
             total       used       free     shared    buffers     cached
Mem:       2049412     346236    1703176          0     262144      14180
-/+ buffers/cache:      69912    1979500
Swap:            0          0          0
After DONTNEED
             total       used       free     shared    buffers     cached
Mem:       2049412      77112    1972300          0          0      14160
-/+ buffers/cache:      62952    1986460
Swap:            0          0          0
Sleeping for another 30 seconds...
             total       used       free     shared    buffers     cached
Mem:       2049412      76752    1972660          0          0      14180
-/+ buffers/cache:      62572    1986840
Swap:            0          0          0

Based on that, it looks as though the two fadvise calls actually work, at least
on a quiet system.

--D

[1] test program:

/* Test RA */
#define _XOPEN_SOURCE 600
#define _DARWIN_C_SOURCE
#define _FILE_OFFSET_BITS 64
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int fd, ret = 0;

	if (argc != 2 || strcmp(argv[1], "--help") == 0) {
		printf("Usage: %s device\n", argv[0]);
		return 0;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		ret = 1;
		goto fail;
	}

	ret = system("echo 3 > /proc/sys/vm/drop_caches");
	if (ret)
		goto fail2;

	printf("Before WILLNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	ret = posix_fadvise(fd, 0, 256 * 1048576, POSIX_FADV_WILLNEED);
	if (ret)
		goto fail2;

	printf("After WILLNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	printf("Sleeping for 30 seconds...\n");
	sleep(30);
	ret = system("free");
	if (ret)
		goto fail2;

	ret = posix_fadvise(fd, 0, 256 * 1048576, POSIX_FADV_DONTNEED);
	if (ret)
		goto fail2;

	printf("After DONTNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	printf("Sleeping for another 30 seconds...\n");
	sleep(30);
	ret = system("free");
	if (ret)
		goto fail2;

fail2:
	close(fd);
fail:
	if (ret)
		perror(argv[1]);
	return ret;
}

Patch

diff --git a/e2fsck/unix.c b/e2fsck/unix.c
index eeeef7c..33afc06 100644
--- a/e2fsck/unix.c
+++ b/e2fsck/unix.c
@@ -1181,6 +1181,7 @@  int main (int argc, char *argv[])
 	__u32 features[3];
 	char *cp;
 	int qtype = -99;  /* quota type */
+	struct ext2fs_prefetch_handle *h = NULL;
 
 	clear_problem_context(&pctx);
 	sigcatcher_setup();
@@ -1638,9 +1639,21 @@  print_unsupp_features:
 		quota_init_context(&ctx->qctx, ctx->fs, qtype);
 	}
 
+	if (getenv("PREFETCH")) {
+		int flags = PREFETCH_INODES | PREFETCH_DIRS | PREFETCH_THREADED;
+		if (getenv("PREFETCH_WAIT"))
+			flags &= ~PREFETCH_THREADED;
+		retval = ext2fs_prefetch(fs, flags, &h);
+		if (retval)
+			com_err(ctx->program_name, retval, "prefetching");
+	}
+
 	run_result = e2fsck_run(ctx);
 	e2fsck_clear_progbar(ctx);
 
+	if (h)
+		ext2fs_prefetch_free(&h);
+
 	if (ctx->flags & E2F_FLAG_JOURNAL_INODE) {
 		if (fix_problem(ctx, PR_6_RECREATE_JOURNAL, &pctx)) {
 			if (journal_size < 1024)
diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in
index a1b5a01..9fbf2b5 100644
--- a/lib/ext2fs/Makefile.in
+++ b/lib/ext2fs/Makefile.in
@@ -73,6 +73,7 @@  OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \
 	native.o \
 	newdir.o \
 	openfs.o \
+	prefetch.o \
 	progress.o \
 	punch.o \
 	qcow2.o \
@@ -150,6 +151,7 @@  SRCS= ext2_err.c \
 	$(srcdir)/native.c \
 	$(srcdir)/newdir.c \
 	$(srcdir)/openfs.c \
+	$(srcdir)/prefetch.c \
 	$(srcdir)/progress.c \
 	$(srcdir)/punch.c \
 	$(srcdir)/qcow2.c \
@@ -863,6 +865,12 @@  openfs.o: $(srcdir)/openfs.c $(top_builddir)/lib/config.h \
  $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \
  $(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \
  $(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h $(srcdir)/e2image.h
+prefetch.o: $(srcdir)/prefetch.c $(top_builddir)/lib/config.h \
+ $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2fs.h \
+ $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2_fs.h \
+ $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \
+ $(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \
+ $(srcdir)/bitops.h $(srcdir)/ext2fsP.h
 progress.o: $(srcdir)/progress.c $(top_builddir)/lib/config.h \
  $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2fs.h \
  $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2_fs.h \
diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
index ba5c388..e634d2c 100644
--- a/lib/ext2fs/ext2fs.h
+++ b/lib/ext2fs/ext2fs.h
@@ -1522,6 +1522,19 @@  errcode_t ext2fs_mmp_update2(ext2_filsys fs, int immediately);
 errcode_t ext2fs_mmp_stop(ext2_filsys fs);
 unsigned ext2fs_mmp_new_seq(void);
 
+/* prefetch.c */
+#define PREFETCH_THREADED	(0x00000001)
+#define PREFETCH_ERROR_ABORT	(0x00000002)
+#define PREFETCH_BITMAPS	(0x00000004)
+#define PREFETCH_INODES		(0x00000008)
+#define PREFETCH_MAPS		(0x00000010)
+#define PREFETCH_DIRS		(0x00000020)
+struct ext2fs_prefetch_handle;
+errcode_t ext2fs_prefetch(ext2_filsys fs, int flags,
+			  struct ext2fs_prefetch_handle **h);
+errcode_t ext2fs_prefetch_wait(struct ext2fs_prefetch_handle *h);
+errcode_t ext2fs_prefetch_free(struct ext2fs_prefetch_handle **h);
+
 /* read_bb.c */
 extern errcode_t ext2fs_read_bb_inode(ext2_filsys fs,
 				      ext2_badblocks_list *bb_list);
diff --git a/lib/ext2fs/prefetch.c b/lib/ext2fs/prefetch.c
new file mode 100644
index 0000000..022af41
--- /dev/null
+++ b/lib/ext2fs/prefetch.c
@@ -0,0 +1,456 @@ 
+/*
+ * prefetch.c --- Prefetch filesystem metadata.
+ *
+ * Copyright (C) 2014 by Oracle, Darrick J. Wong.
+ *
+ * %Begin-Header%
+ * This file may be redistributed under the terms of the GNU Library
+ * General Public License, version 2.
+ * %End-Header%
+ */
+
+#define _LARGEFILE_SOURCE
+#define _LARGEFILE64_SOURCE
+#ifndef _GNU_SOURCE
+#define _GNU_SOURCE
+#endif
+
+#include <unistd.h>
+#include <sys/syscall.h>
+#include <fcntl.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <signal.h>
+#include <pthread.h>
+
+#include "config.h"
+#include "ext2_fs.h"
+#include "ext2fs.h"
+
+#define USE_THREADS 1
+#define USE_SUPER 1
+
+struct ext2fs_prefetch_handle {
+	ext2_filsys fs;
+	int flags;
+	int done;
+	pid_t pid;
+#ifdef USE_THREADS
+	pthread_t tid;
+#endif
+};
+
+static int ignore_block(ext2_filsys fs, blk64_t *blocknr, e2_blkcnt_t blockcnt,
+			blk64_t ref_blk, int ref_offset, void *priv_data)
+{
+	return 0;
+}
+
+struct dirent_iterate {
+	void *buf;
+	int flags;
+};
+
+static int dirent_block(ext2_filsys fs, blk64_t *blocknr, e2_blkcnt_t blockcnt,
+			blk64_t ref_blk, int ref_offset, void *priv_data)
+{
+	struct dirent_iterate *di = priv_data;
+	errcode_t err;
+
+	err = io_channel_read_blk64(fs->io, *blocknr, 1, di->buf);
+	if (err && (di->flags & PREFETCH_ERROR_ABORT))
+		return BLOCK_ABORT;
+	return 0;
+}
+
+/*
+ * First dumb prefetch implementation: Separate process, just read data to
+ * get it into the page cache, at least.
+ */
+static void do_ext2fs_prefetch(ext2_filsys fs, int flags)
+{
+	void			*buf;
+	blk64_t			blk;
+	dgrp_t			i;
+	ext2_inode_scan		scan;
+	int			length = EXT2_INODE_SIZE(fs->super);
+	ext2_ino_t		ino;
+	errcode_t		err;
+	struct ext2_inode	inode;
+	struct dirent_iterate	di;
+	int			iter_flags;
+	unsigned int		blocks_to_read;
+
+	err = ext2fs_get_array(fs->blocksize, fs->inode_blocks_per_group, &buf);
+	if (err)
+		return;
+
+	/* load bitmaps */
+	if (!(flags & PREFETCH_BITMAPS))
+		goto skip_bitmaps;
+	err = ext2fs_read_bitmaps(fs);
+	if (err && (flags & PREFETCH_ERROR_ABORT))
+		goto out;
+
+skip_bitmaps:
+	/* load inode tables */
+	if (!(flags & PREFETCH_INODES) || (flags & (PREFETCH_MAPS |
+						    PREFETCH_DIRS)))
+		goto skip_itable;
+
+	for (i = 0; i < fs->group_desc_count; i++) {
+		if (ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT))
+			continue;
+
+		blocks_to_read = fs->inode_blocks_per_group;
+		if (ext2fs_has_group_desc_csum(fs)) {
+			unsigned int num_inodes =
+					fs->super->s_inodes_per_group -
+					ext2fs_bg_itable_unused(fs, i);
+			blocks_to_read = ((num_inodes *
+					   EXT2_INODE_SIZE(fs->super)) +
+					  fs->blocksize - 1) / fs->blocksize;
+		}
+
+		blk = ext2fs_inode_table_loc(fs, i);
+		err = io_channel_read_blk64(fs->io, blk, blocks_to_read, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			goto out;
+	}
+
+skip_itable:
+	/* load inodes */
+	if (!(flags & (PREFETCH_MAPS | PREFETCH_DIRS)))
+		goto skip_inodes;
+
+	err = ext2fs_open_inode_scan(fs, 0, &scan);
+	/* Without a valid scan there is nothing to iterate over */
+	if (err)
+		goto out;
+
+	di.buf = buf;
+	di.flags = flags;
+	do {
+		err = ext2fs_get_next_inode_full(scan, &ino, &inode,
+						 sizeof(inode));
+		if (err)
+			break;
+		if (!ino)
+			break;
+		if (!ext2fs_test_inode_bitmap2(fs->inode_map, ino))
+			continue;
+
+		iter_flags = BLOCK_FLAG_READ_ONLY | BLOCK_FLAG_DATA_ONLY;
+		if ((flags & PREFETCH_MAPS) &&
+		    !(flags & PREFETCH_DIRS) &&
+		    (LINUX_S_ISREG(inode.i_mode) ||
+		     LINUX_S_ISLNK(inode.i_mode))) {
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    ignore_block, &di);
+		} else if ((flags & PREFETCH_DIRS) &&
+			   !(flags & PREFETCH_MAPS) &&
+			   LINUX_S_ISDIR(inode.i_mode)) {
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    dirent_block, &di);
+		} else {
+			int (*func)(ext2_filsys fs, blk64_t *blocknr,
+				    e2_blkcnt_t blockcnt, blk64_t ref_blk,
+				    int ref_offset, void *priv_data) =
+			LINUX_S_ISDIR(inode.i_mode) ? dirent_block :
+						ignore_block;
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    func, &di);
+		}
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			break;
+
+		blk = ext2fs_file_acl_block(fs, &inode);
+		if (!blk)
+			continue;
+		err = io_channel_read_blk64(fs->io, blk, 1, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			break;
+	} while (ino);
+
+	ext2fs_close_inode_scan(scan);
+
+skip_inodes:
+out:
+	ext2fs_free_mem(&buf);
+
+	return;
+}
+
+static void *prefetch_thread(void *data)
+{
+	struct ext2fs_prefetch_handle *pd = data;
+	do_ext2fs_prefetch(pd->fs, pd->flags);
+	return NULL;
+}
+
+/*
+ * Second, less dumb prefetch: use worker threads to preload metadata in
+ * block group order.
+ */
+struct super_entry {
+	dgrp_t group;
+	ext2_ino_t num_inodes;
+};
+
+struct super_thread {
+	ext2_filsys fs;
+	pthread_t tid;
+	int flags;
+	struct super_entry *start, *end;
+	unsigned int skip_factor;
+	void *buf;
+};
+
+static void *super_func(void *data)
+{
+	struct super_thread *t = data;
+	ext2_filsys fs = t->fs;
+	int flags = t->flags;
+	void *buf = t->buf;
+	struct super_entry *e;
+	unsigned int blocks_to_read;
+	blk64_t blk;
+	ext2_ino_t i, ino;
+	struct dirent_iterate di;
+	struct ext2_inode inode;
+	int iter_flags;
+	errcode_t err;
+
+	/* Read the inode tables */
+	for (e = t->start; e < t->end; e += t->skip_factor) {
+		blocks_to_read = ((e->num_inodes *
+				   EXT2_INODE_SIZE(fs->super)) +
+				  fs->blocksize - 1) / fs->blocksize;
+		blk = ext2fs_inode_table_loc(fs, e->group);
+		err = io_channel_read_blk64(fs->io, blk, blocks_to_read, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			continue;
+	}
+
+	/* Scan inodes for extent/dir blocks */
+	di.buf = buf;
+	di.flags = flags;
+	for (e = t->start; e < t->end; e += t->skip_factor) {
+		for (i = 0; i < e->num_inodes; i++) {
+			/* inode numbers are 1-based */
+			ino = e->group * fs->super->s_inodes_per_group + i + 1;
+			err = ext2fs_read_inode(fs, ino, &inode);
+			if (err)
+				continue;
+			/* Skip unlinked or unknown-type inodes */
+			if (!inode.i_links_count ||
+			    (inode.i_mode & 0xF000) == 0)
+				continue;
+
+			iter_flags = BLOCK_FLAG_READ_ONLY |
+				     BLOCK_FLAG_DATA_ONLY;
+			if ((flags & PREFETCH_MAPS) &&
+			    !(flags & PREFETCH_DIRS) &&
+			    (LINUX_S_ISREG(inode.i_mode) ||
+			     LINUX_S_ISLNK(inode.i_mode))) {
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    ignore_block, &di);
+			} else if ((flags & PREFETCH_DIRS) &&
+				   !(flags & PREFETCH_MAPS) &&
+				   LINUX_S_ISDIR(inode.i_mode)) {
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    dirent_block, &di);
+			} else {
+				int (*func)(ext2_filsys fs, blk64_t *blocknr,
+					    e2_blkcnt_t blockcnt,
+					    blk64_t ref_blk,
+					    int ref_offset, void *priv_data) =
+				LINUX_S_ISDIR(inode.i_mode) ? dirent_block :
+							ignore_block;
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    func, &di);
+			}
+
+			blk = ext2fs_file_acl_block(fs, &inode);
+			if (!blk)
+				continue;
+			err = io_channel_read_blk64(fs->io, blk, 1, buf);
+		}
+	}
+
+	return NULL;
+}
+
+static void *super_prefetch(void *data)
+{
+	struct ext2fs_prefetch_handle *pd = data;
+	ext2_filsys fs = pd->fs;
+	int flags = pd->flags;
+	void *b, *r;
+	struct super_thread *threads = NULL, *t;
+	long nr_cpus = sysconf(_SC_NPROCESSORS_ONLN);
+	unsigned int num_threads = nr_cpus > 0 ? (unsigned int)nr_cpus : 1;
+	unsigned int j;
+	struct super_entry *entries = NULL, *e = NULL;
+	unsigned int num_entries = 0;
+	dgrp_t i;
+	ext2_ino_t num_inodes;
+	errcode_t err;
+
+	/* Find all non-empty groups */
+	for (i = 0; i < fs->group_desc_count; i++) {
+		if (ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT))
+			continue;
+
+		num_inodes = fs->super->s_inodes_per_group;
+		if (ext2fs_bg_free_inodes_count(fs, i) == num_inodes)
+			continue;
+		if (ext2fs_has_group_desc_csum(fs))
+			num_inodes -= ext2fs_bg_itable_unused(fs, i);
+		if (e == entries + num_entries) {
+			r = realloc(entries, (num_entries + 32) *
+					     sizeof(*entries));
+			if (r == NULL) {
+				err = errno;
+				goto out;
+			}
+			entries = r;
+			e = entries + num_entries;
+			num_entries += 32;
+		}
+		e->group = i;
+		e->num_inodes = num_inodes;
+		e++;
+	}
+	num_entries = e - entries;
+
+	/* Set up the threads */
+	if (getenv("PREFETCH_THREADS")) {
+		j = atoi(getenv("PREFETCH_THREADS"));
+		if (j > 0)
+			num_threads = j;
+	}
+
+	err = ext2fs_get_arrayzero(num_threads, sizeof(*threads) +
+				(fs->blocksize * fs->inode_blocks_per_group),
+				&b);
+	if (err)
+		goto out;
+	threads = b + (fs->blocksize * fs->inode_blocks_per_group *
+		       num_threads);
+
+	for (j = 0, t = threads, e = entries; j < num_threads; j++, t++, e++) {
+		t->fs = fs;
+		t->flags = flags;
+		t->start = e;
+		t->end = entries + num_entries;
+		t->skip_factor = num_threads;
+		err = ext2fs_dup_handle(fs, &t->fs);
+		if (err)
+			goto out2;
+		t->fs->icache = NULL;
+		t->buf = b + (fs->blocksize * fs->inode_blocks_per_group * j);
+		pthread_create(&t->tid, NULL, super_func, t);
+	}
+
+	/* Wait for threads */
+	for (j = 0, t = threads; j < num_threads; j++, t++)
+		pthread_join(t->tid, NULL);
+	pd->done = 1;
+out2:
+	ext2fs_free_mem(&b);
+out:
+	free(entries);
+	return NULL;
+}
+
+struct unix_private_data_hack {
+	int	magic;
+	int	dev;
+};
+
+errcode_t ext2fs_prefetch(ext2_filsys fs, int flags,
+			  struct ext2fs_prefetch_handle **h)
+{
+	struct ext2fs_prefetch_handle *pd;
+	errcode_t err;
+
+	err = ext2fs_get_memzero(sizeof(*pd), &pd);
+	if (err)
+		return err;
+	pd->fs = fs;
+	pd->flags = flags;
+
+	/* Load the rest */
+	if (flags & PREFETCH_THREADED) {
+		if (fs->io->manager == mmap_io_manager) {
+#if USE_SUPER
+			struct timespec ts;
+			err = pthread_create(&pd->tid, NULL, super_prefetch,
+					     pd);
+			if (err)
+				goto errout;
+			/* Give the prefetch thread a 500us head start */
+			ts.tv_sec = 0;
+			ts.tv_nsec = 500000;
+			nanosleep(&ts, NULL);
+#elif USE_THREADS
+			err = pthread_create(&pd->tid, NULL, prefetch_thread,
+					     pd);
+			if (err)
+				goto errout;
+#else
+			goto single_thread;
+#endif
+		} else if (fs->io->manager == unix_io_manager) {
+			pd->pid = fork();
+			if (pd->pid < 0) {
+				err = errno;
+				goto errout;
+			} else if (pd->pid == 0) {
+				/*
+				 * Reopen the device so that the child gets
+				 * its own file offset and cannot perturb
+				 * the parent's unix_io file descriptor.
+				 */
+				struct unix_private_data_hack *m =
+						fs->io->private_data;
+				m->dev = open(fs->device_name, O_RDONLY);
+				do_ext2fs_prefetch(fs, flags);
+				exit(0);
+			}
+		}
+	} else {
+single_thread:
+#if USE_SUPER
+		super_prefetch(pd);
+#else
+		do_ext2fs_prefetch(fs, flags);
+#endif
+		pd->done = 1;
+	}
+	*h = pd;
+
+	return 0;
+errout:
+	ext2fs_free_mem(&pd);
+	return err;
+}
+
+errcode_t ext2fs_prefetch_wait(struct ext2fs_prefetch_handle *h)
+{
+	pid_t ret;
+	int status;
+
+	if (h->flags & PREFETCH_THREADED) {
+		if (h->tid) {
+			pthread_join(h->tid, NULL);
+			h->tid = 0;
+		}
+		if (h->pid) {
+			ret = waitpid(h->pid, &status, WNOHANG);
+			if (ret == 0) {
+				/* still running; cancel and reap it */
+				kill(h->pid, SIGKILL);
+				waitpid(h->pid, NULL, 0);
+			}
+			h->pid = 0;
+		}
+	}
+	}
+	h->done = 1;
+	return 0;
+}
+
+errcode_t ext2fs_prefetch_free(struct ext2fs_prefetch_handle **h)
+{
+	ext2fs_prefetch_wait(*h);
+	return ext2fs_free_mem(h);
+}