Message ID: 20140130235058.31064.21096.stgit@birch.djwong.org
State: Superseded, archived
On Fri, Jan 31, 2014 at 03:10:00AM -0700, Andreas Dilger wrote:
> We implemented something like this for a metadata scanning tool called
> "e2scan". At the time, the fastest method of prefetching data was
> posix_fadvise(POSIX_FADV_WILLNEED). We also tried the readahead() syscall.

I think using posix_fadvise() and readahead() is probably the best way to go, at least initially. If we can avoid needing to add an extra I/O manager, I think that would be better.

As far as a single HDD prefetching its brains out, it would be good if we can figure out a way to enable the right amount of prefetching automatically. Something that might be worth trying is to instrument unix_io so we can measure the time waiting for disk I/O, and then compare that to the wall clock time running on a single disk file system, without doing any prefetching.

If there isn't any delta, then the only way prefetching could help us is if we can optimize the head motion and remove some seeks (i.e., if we know that we will need block #23 in the future, and we get a request for block #24, we can read both at one go). If we measure the difference between time spent for disk I/O during each e2fsck pass, and wall clock time during each I/O, we'll also know which e2fsck pass would benefit the most from smarter prefetching.

What that might imply is that for HDD's --- which is where we would want something like this the most --- what we might want to do is to create a list of blocks that e2fsck knows it will need next (for example, during pass 1, we are collecting the blocks for directories, so we could use that to populate the prefetch list for pass 2).

Something that we might need to go to in the future is, instead of using mmap(), to maintain our own explicit buffer cache inside unix_io, and use direct I/O to avoid caching the disk blocks twice.
Then when we use a single-threaded disk prefetcher, managed by the unix_io, it will know when a particular I/O request has completed, and more importantly, if there is a synchronous read request coming in from the main body of the program, it can stop prefetching and allow the higher priority read to complete. We can also experiment with how many threads might make sense --- even with an HDD, using multiple threads so that we can take advantage of NCQ might still be a win.

One other thought. The location of the dynamic metadata blocks (i.e., directory and extent tree blocks) is a hint that we could potentially store in the file system. Since it is only for optimization purposes, e2fsck can try to find it in the file system, and if it is there and looks valid, it can use it. If the file system is too corrupted, or the data looks suspect, it can always ignore the cached hints.

Finally, if we are managing our own buffer cache, we should consider adding a bforget method to the I/O manager. That way e2fsck can give hints to the caching layer that a block isn't needed any more. If it is in the cache, it can be dropped to free memory, and if it is still on the to-be-prefetched list it should also be dropped. (Of course, if a block is on the to-be-prefetched list and a synchronous read request comes in for that block, we should have dropped it from the to-be-prefetched list at that point.) The main use for a bforget method is that, for the most part, once we are done scanning a non-directory extent tree block, we won't be needing it again.

Slightly more complex (so it might not be worth it), we could also drop inode table blocks from our buffer cache that do not contain any directory inodes once we are done scanning them in pass 1.

Cheers,

- Ted
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
On Fri, Jan 31, 2014 at 08:53:25AM -0500, Theodore Ts'o wrote:
> On Fri, Jan 31, 2014 at 03:10:00AM -0700, Andreas Dilger wrote:
> > We implemented something like this for a metadata scanning tool called
> > "e2scan". At the time, the fastest method of prefetching data was
> > posix_fadvise(POSIX_FADV_WILLNEED). We also tried the readahead() syscall.

Thanks, Andreas! :)

Hmm. I pulled in that patch and rewrote the prefetcher routines to spawn a single thread that tries to pull in the bitmaps and itable via posix_fadvise. I also noticed that e2fsck pass1 creates a list of directory blocks to scan in pass2, so I restructured the prefetch code to prefetch an arbitrary dblist. The bitmap/itable prefetcher now simply creates a dblist and hands it off to the dblist fetch routine. It's also smart enough to find contiguous runs of blocks and fadvise the whole run in with a single call.

The initial naive e2fsck implementation starts separate threads to prefetch everything without caring if it blows out the cache. This seems to reduce e2fsck run times on SSDs and RAIDs by 20-40%. On the plus side the code became much simpler, and fsck seems able to repair a broken filesystem without crashing. Welp, I guess that teaches me not to go overboard with complexity until it's needed.

I made a few changes to Andreas' patch -- first, I changed the parameters to ULL types to be consistent with the other IO manager function prototypes; second, I made it a bit more robust for IO managers that don't provide a readahead function. (Does anyone still build this on NT?)

> I think using posix_fadvise() and readahead() is probably the best way
> to go, at least initially. If we can avoid needing to add an extra
> I/O manager, I think that would be better.

Agreed. I will send new patches shortly.

> As far as a single HDD prefetching its brains out, it would be good if
> we can figure out a way to enable the right amount of prefetching
> automatically.
> Something that might be worth trying is to instrument
> unix_io so we can measure the time waiting for disk I/O, and then
> compare that to the wall clock time running on a single disk file
> system, without doing any prefetching.

Changing to this fadvise mechanism seems to have eliminated the horrible blowout on USB HDDs. There's a slight speed decrease.

> If there isn't any delta, then the only way prefetching could help us
> is if we can optimize the head motion and remove some seeks (i.e., if
> we know that we will need block #23 in the future, and we get a
> request for block #24, we can read both at one go). If we measure the
> difference between time spent for disk I/O during each e2fsck pass,
> and wall clock time during each I/O, we'll also know which e2fsck pass
> would benefit the most from smarter prefetching.
>
> What that might imply is that for HDD's --- which is where we would
> want something like this the most --- what we might want to do is to
> create a list of blocks that e2fsck knows it will need next (for
> example, during pass 1, we are collecting the blocks for directories,
> so we could use that to populate the prefetch list for pass 2.
>
> Something that we might need to go to in the future is instead of
> using mmap(), to maintain our own explicit buffer cache inside
> unix_io, and use direct I/O to avoid caching the disk blocks twice.
> Then when we use a single-threaded disk prefetcher, managed by the
> unix_io, it will know when a particular I/O request has completed, and
> more importantly, if there is a synchronous read request coming in
> from main body of the program, it can stop prefetching and allow the
> higher priority read to complete. We can also experiment with how
> many threads might make sense --- even with an HDD, using multiple
> threads so that we can take advantage of NCQ might still be a win.
>
> One other thought.
> The location of the dynamic metadata blocks (i.e.,
> directory and extent tree blocks) is a hint that we could potentially
> store in the file system. Since it is only for optimization purposes,
> e2fsck can try to find it in the file system, and if it is there, and
> it looks valid it can use it. If the file system is too corrupted, or
> the data looks suspect, it can always ignore the cached hints.
>
> Finally, if we are managing our own buffer cache, we should consider
> adding a bforget method to the I/O manager. That way e2fsck can give
> hints to the caching layer that a block isn't needed any more. If it
> is in the cache, it can be dropped, to free memory, and if it is still
> on the to-be-prefetched list it should also be dropped. (Of course,
> if a block is on the to-be-prefetched list, and a synchronous read
> request comes in for that block, we should have dropped it from the
> to-be-prefetched list at that point.) The main use for having a
> bforget method is for the most part, once we are done scanning a
> non-directory extent tree block, we won't be needing it again.
>
> Slightly more complex (so it might not be worth it), we could also
> drop inode table blocks from our buffer cache that do not contain any
> directory inodes once we are done scanning them in pass 1.

Let's play around with my silly prefetch patches and decide if we need this extra complexity. I thought about prefetching the extent tree too, but I'm not sure if that's worth the trouble. It might be worth it to go for the ETBs that the inode points to directly, since that's more or less free.

--D
On 1/31/2014 8:53 AM, Theodore Ts'o wrote:
> Something that we might need to go to in the future is instead of
> using mmap(), to maintain our own explicit buffer cache inside
> unix_io, and use direct I/O to avoid caching the disk blocks
> twice. Then when we use a single-threaded disk prefetcher, managed
> by the unix_io, it will know when a particular I/O request has
> completed, and more importantly, if there is a synchronous read
> request coming in from main body of the program, it can stop
> prefetching and allow the higher priority read to complete. We can
> also experiment with how many threads might make sense --- even
> with an HDD, using multiple threads so that we can take advantage
> of NCQ might still be a win.

Why build your own cache instead of letting the kernel take care of it? I believe the IO elevator already gives preferential treatment to blocking reads, so just using readahead() to prefetch and sticking with plain old read() should work nicely.

> Finally, if we are managing our own buffer cache, we should
> consider adding a bforget method to the I/O manager. That way
> e2fsck can give hints to the caching layer that a block isn't
> needed any more. If it is in the cache, it can be dropped, to free
> memory, and if it is still on the to-be-prefetched list it should
> also be dropped. (Of course, if a block is on the to-be-prefetched
> list, and a synchronous read request comes in for that block, we
> should have dropped it from the to-be-prefetched list at that
> point.) The main use for having a bforget method is for the most
> part, once we are done scanning a non-directory extent tree block,
> we won't be needing it again.

Good idea, but this also could just be translated to posix_fadvise to have the kernel drop the pages from the cache.
On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
> On 1/31/2014 8:53 AM, Theodore Ts'o wrote:
> > Something that we might need to go to in the future is instead of
> > using mmap(), to maintain our own explicit buffer cache inside
> > unix_io, and use direct I/O to avoid caching the disk blocks
> > twice. Then when we use a single-threaded disk prefetcher, managed
> > by the unix_io, it will know when a particular I/O request has
> > completed, and more importantly, if there is a synchronous read
> > request coming in from main body of the program, it can stop
> > prefetching and allow the higher priority read to complete. We can
> > also experiment with how many threads might make sense --- even
> > with an HDD, using multiple threads so that we can take advantage
> > of NCQ might still be a win.
>
> Why build your own cache instead of letting the kernel take care of
> it? I believe the IO elevator already gives preferential treatment
> to blocking reads so just using readahead() to prefetch and sticking
> with plain old read() should work nicely.
>
> > Finally, if we are managing our own buffer cache, we should
> > consider adding a bforget method to the I/O manager. That way
> > e2fsck can give hints to the caching layer that a block isn't
> > needed any more. If it is in the cache, it can be dropped, to free
> > memory, and if it is still on the to-be-prefetched list it should
> > also be dropped. (Of course, if a block is on the to-be-prefetched
> > list, and a synchronous read request comes in for that block, we
> > should have dropped it from the to-be-prefetched list at that
> > point.) The main use for having a bforget method is for the most
> > part, once we are done scanning a non-directory extent tree block,
> > we won't be needing it again.
>
> Good idea, but this also could just be translated to posix_fadvise to
> have the kernel drop the pages from the cache.
That's more or less what I've done with the e2fsck patchset I'm working on. It's not terribly smarter than Val's old thing from ~2007, but nowadays more people have the kind of hardware where prefetching is noticeable.

Well, pass2 tells the kernel to drop the dir block as soon as it thinks it's finished with the dir block. I haven't studied the effect of dropping the itable during pass1.

--D
On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
>
> Why build your own cache instead of letting the kernel take care of
> it? I believe the IO elevator already gives preferential treatment
> to blocking reads so just using readahead() to prefetch and sticking
> with plain old read() should work nicely.

The reason why it might be better for us to use our own cache is because we can more accurately know when we're done with the block, and we can drop it from the cache.

I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and hopefully this works on block devices for the buffer cache, but it wouldn't at all surprise me if we can get finer-grained control by using O_DIRECT and managing the buffers ourselves. Whether it's worth the extra complexity is a fair question --- but simply adding metadata prefetching is going to add a fair amount of complexity already, and we should test to make sure that readahead() and posix_fadvise() actually work correctly on block devices --- a couple of years ago, I had explored readahead() precisely as a cheap way of adding metadata precaching for e2fsck, and it was a no-op when I tried the test back then.

- Ted
On Feb 27, 2014, at 7:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
>>
>> Why build your own cache instead of letting the kernel take care of
>> it? I believe the IO elevator already gives preferential treatment
>> to blocking reads so just using readahead() to prefetch and sticking
>> with plain old read() should work nicely.
>
> The reason why it might be better for us to use our own cache is
> because we can more accurately know when we're done with the block,
> and we can drop it from the cache.

One argument in favour of using the kernel buffer cache is that the common case of e2fsck followed by mounting the filesystem would be much faster, because e2fsck has already populated the kernel cache. Otherwise, all of the IO done to populate the userspace cache would be lost when e2fsck exits. Similarly, repeated runs of e2fsck would not see any benefit from the userspace cache.

> I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and
> hopefully this works on block devices for the buffer cache, but it
> wouldn't at all surprise me if we can get finer-grained control by
> using O_DIRECT and managing the buffers ourselves. Whether it's worth
> the extra complexity is a fair question --- but simply adding
> metadata prefetching is going to add a fair amount of complexity
> already, and we should test to make sure that readahead() and
> posix_fadvise() actually work correctly on block devices --- a couple
> of years ago, I had explored readahead() precisely as a cheap way of
> adding metadata precaching for e2fsck, and it was a no-op when I tried
> the test back then.

We tested several different mechanisms for readahead a few years ago for the e2scan tool, and that resulted in the readahead patch that Darrick updated recently. It definitely shows a performance improvement.

Whether POSIX_FADV_DONTNEED actually flushes pages from cache is a separate question.
My preference would be that, if this is currently a no-op, we work to fix it in the kernel so that it works for everyone, rather than investing time and effort into code that is only useful for e2fsprogs.

Cheers, Andreas
On Fri, Feb 28, 2014 at 11:54:55AM -0700, Andreas Dilger wrote:
> On Feb 27, 2014, at 7:28 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> > On Thu, Feb 27, 2014 at 12:03:56PM -0500, Phillip Susi wrote:
> >>
> >> Why build your own cache instead of letting the kernel take care of
> >> it? I believe the IO elevator already gives preferential treatment
> >> to blocking reads so just using readahead() to prefetch and sticking
> >> with plain old read() should work nicely.
> >
> > The reason why it might be better for us to use our own cache is
> > because we can more accurately know when we're done with the block,
> > and we can drop it from the cache.

But if we accurately know when we're done with the block, we can also ask the kernel to drop the block from its cache.

> One argument in favour of using the kernel buffer cache is that the
> common case of e2fsck followed by mounting the filesystem would be
> much faster because e2fsck has already populated the kernel cache.
> Otherwise, all of the IO done to populate the userspace cache would
> be lost when e2fsck exits. Similarly, repeated runs of e2fsck would
> not see any benefit of the userspace cache.

I tend to agree with this.

> > I suppose we could use posix_fadvise(POSIX_FADV_DONTNEED) --- and
> > hopefully this works on block devices for the buffer cache, but it
> > wouldn't at all surprise me if we can get finer-grained control by
> > using O_DIRECT and managing the buffers ourselves. Whether it's worth
> > the extra complexity is a fair question --- but simply adding
> > metadata prefetching is going to add a fair amount of complexity

It's not too much churn; here's the diffstat from my patches:

 21 files changed, 723 insertions(+), 8 deletions(-)

I was attempting to write the smallest readahead implementation I could get away with. I noticed that unix_io.c already implements a simplistic 8-block cache, but it doesn't have the notion of readahead.
The kernel provides the fadvise knob for readahead, so if it really works as advertised, why not use that?

I further thought that, regardless of whether I use fadvise or build out unix_io's cache, we'd probably want a way to asynchronously (pre)fetch large chunks of metadata anyway. unix_io is pretty terrible about threaded IO (the lseek+read/write pattern is not thread safe), which means I'd probably have to make those paths thread safe if I wanted any significant threaded RA ability. I actually converted unix_io to use pread/pwrite if available as part of that work, and I'll send that patch along in case anyone wants to reduce system call overhead.

The fadvise approach avoids most of that thread safety requirement. I tried to write the prefetch code carefully enough not to modify any of the data structures fed into it, and let the kernel take care of all the details of threaded IO. At worst, fadvise ignores e2fsck and we lose nothing.

> > already, and we should test to make sure that readahead() and
> > posix_fadvise() actually work correctly on block devices --- a couple
> > of years ago, I had explored readahead() precisely as a cheap way of
> > adding metadata precaching for e2fsck, and it was a no-op when I tried
> > the test back then.

(See below)

> We tested several different mechanisms for readahead a few years ago
> for the e2scan tool, and that resulted in the readahead patch that
> Darrick updated recently. It definitely shows performance improvement.
>
> Whether POSIX_FADV_DONTNEED actually flushes pages from cache is a
> separate question. My preference would be that if this is currently
> a no-op that we work to fix it in the kernel so that it is working
> for everyone rather than investing time and effort into code that is
> only useful for e2fsprogs.
I wrote a test program[1] that WILLNEED's 256M and then DONTNEED's it, and saw these results (on 3.14-rc4):

Before WILLNEED
             total       used       free     shared    buffers     cached
Mem:       2049412      83764    1965648          0          0      14124
-/+ buffers/cache:      69640    1979772
Swap:            0          0          0

After WILLNEED
             total       used       free     shared    buffers     cached
Mem:       2049412     346304    1703108          0     262144      14148
-/+ buffers/cache:      70012    1979400
Swap:            0          0          0

Sleeping for 30 seconds...
             total       used       free     shared    buffers     cached
Mem:       2049412     346236    1703176          0     262144      14180
-/+ buffers/cache:      69912    1979500
Swap:            0          0          0

After DONTNEED
             total       used       free     shared    buffers     cached
Mem:       2049412      77112    1972300          0          0      14160
-/+ buffers/cache:      62952    1986460
Swap:            0          0          0

Sleeping for another 30 seconds...
             total       used       free     shared    buffers     cached
Mem:       2049412      76752    1972660          0          0      14180
-/+ buffers/cache:      62572    1986840
Swap:            0          0          0

Based on that, it looks as though the two fadvise calls actually work, at least on a quiet system.

--D

[1] test program:

/* Test RA */
#define _XOPEN_SOURCE 600
#define _DARWIN_C_SOURCE
#define _FILE_OFFSET_BITS 64
#define _LARGEFILE_SOURCE
#define _LARGEFILE64_SOURCE
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
	int fd, ret = 0;

	if (argc != 2 || strcmp(argv[1], "--help") == 0) {
		printf("Usage: %s device\n", argv[0]);
		return 0;
	}

	fd = open(argv[1], O_RDWR);
	if (fd < 0) {
		perror(argv[1]);
		goto fail;
	}

	ret = system("echo 3 > /proc/sys/vm/drop_caches");
	if (ret)
		goto fail2;
	printf("Before WILLNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	ret = posix_fadvise(fd, 0, 256 * 1048576, POSIX_FADV_WILLNEED);
	if (ret)
		goto fail2;
	printf("After WILLNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	printf("Sleeping for 30 seconds...\n");
	sleep(30);
	ret = system("free");
	if (ret)
		goto fail2;

	ret = posix_fadvise(fd, 0, 256 * 1048576, POSIX_FADV_DONTNEED);
	if (ret)
		goto fail2;

	printf("After DONTNEED\n");
	ret = system("free");
	if (ret)
		goto fail2;

	printf("Sleeping for another 30 seconds...\n");
	sleep(30);
	ret = system("free");
	if (ret)
		goto fail2;

fail2:
	close(fd);
fail:
	if (ret)
		perror(argv[1]);
	return ret;
}
diff --git a/e2fsck/unix.c b/e2fsck/unix.c index eeeef7c..33afc06 100644 --- a/e2fsck/unix.c +++ b/e2fsck/unix.c @@ -1181,6 +1181,7 @@ int main (int argc, char *argv[]) __u32 features[3]; char *cp; int qtype = -99; /* quota type */ + struct ext2fs_prefetch_handle *h = NULL; clear_problem_context(&pctx); sigcatcher_setup(); @@ -1638,9 +1639,21 @@ print_unsupp_features: quota_init_context(&ctx->qctx, ctx->fs, qtype); } + if (getenv("PREFETCH")) { + int flags = PREFETCH_INODES | PREFETCH_DIRS | PREFETCH_THREADED; + if (getenv("PREFETCH_WAIT")) + flags &= ~PREFETCH_THREADED; + retval = ext2fs_prefetch(fs, flags, &h); + if (retval) + com_err(ctx->program_name, retval, "prefetching"); + } + run_result = e2fsck_run(ctx); e2fsck_clear_progbar(ctx); + if (h) + ext2fs_prefetch_free(&h); + if (ctx->flags & E2F_FLAG_JOURNAL_INODE) { if (fix_problem(ctx, PR_6_RECREATE_JOURNAL, &pctx)) { if (journal_size < 1024) diff --git a/lib/ext2fs/Makefile.in b/lib/ext2fs/Makefile.in index a1b5a01..9fbf2b5 100644 --- a/lib/ext2fs/Makefile.in +++ b/lib/ext2fs/Makefile.in @@ -73,6 +73,7 @@ OBJS= $(DEBUGFS_LIB_OBJS) $(RESIZE_LIB_OBJS) $(E2IMAGE_LIB_OBJS) \ native.o \ newdir.o \ openfs.o \ + prefetch.o \ progress.o \ punch.o \ qcow2.o \ @@ -150,6 +151,7 @@ SRCS= ext2_err.c \ $(srcdir)/native.c \ $(srcdir)/newdir.c \ $(srcdir)/openfs.c \ + $(srcdir)/prefetch.c \ $(srcdir)/progress.c \ $(srcdir)/punch.c \ $(srcdir)/qcow2.c \ @@ -863,6 +865,12 @@ openfs.o: $(srcdir)/openfs.c $(top_builddir)/lib/config.h \ $(srcdir)/ext2_fs.h $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h \ $(srcdir)/ext2_io.h $(top_builddir)/lib/ext2fs/ext2_err.h \ $(srcdir)/ext2_ext_attr.h $(srcdir)/bitops.h $(srcdir)/e2image.h +prefetch.o: $(srcdir)/prefetch.c $(top_builddir)/lib/config.h \ + $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2fs.h \ + $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2_fs.h \ + $(srcdir)/ext3_extents.h $(top_srcdir)/lib/et/com_err.h $(srcdir)/ext2_io.h \ + 
$(top_builddir)/lib/ext2fs/ext2_err.h $(srcdir)/ext2_ext_attr.h \ + $(srcdir)/bitops.h $(srcdir)/ext2fsP.h progress.o: $(srcdir)/progress.c $(top_builddir)/lib/config.h \ $(top_builddir)/lib/dirpaths.h $(srcdir)/ext2fs.h \ $(top_builddir)/lib/ext2fs/ext2_types.h $(srcdir)/ext2_fs.h \ diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h index ba5c388..e634d2c 100644 --- a/lib/ext2fs/ext2fs.h +++ b/lib/ext2fs/ext2fs.h @@ -1522,6 +1522,19 @@ errcode_t ext2fs_mmp_update2(ext2_filsys fs, int immediately); errcode_t ext2fs_mmp_stop(ext2_filsys fs); unsigned ext2fs_mmp_new_seq(void); +/* prefetch.c */ +#define PREFETCH_THREADED (0x00000001) +#define PREFETCH_ERROR_ABORT (0x00000002) +#define PREFETCH_BITMAPS (0x00000004) +#define PREFETCH_INODES (0x00000008) +#define PREFETCH_MAPS (0x00000010) +#define PREFETCH_DIRS (0x00000020) +struct ext2fs_prefetch_handle; +errcode_t ext2fs_prefetch(ext2_filsys fs, int flags, + struct ext2fs_prefetch_handle **h); +errcode_t ext2fs_prefetch_wait(struct ext2fs_prefetch_handle *h); +errcode_t ext2fs_prefetch_free(struct ext2fs_prefetch_handle **h); + /* read_bb.c */ extern errcode_t ext2fs_read_bb_inode(ext2_filsys fs, ext2_badblocks_list *bb_list); diff --git a/lib/ext2fs/prefetch.c b/lib/ext2fs/prefetch.c new file mode 100644 index 0000000..022af41 --- /dev/null +++ b/lib/ext2fs/prefetch.c @@ -0,0 +1,456 @@ +/* + * prefetch.c --- Prefetch filesystem metadata. + * + * Copyright (C) 2014 by Oracle, Darrick J. Wong. + * + * %Begin-Header% + * This file may be redistributed under the terms of the GNU Library + * General Public License, version 2. 
+ * %End-Header% + */ + +#define _LARGEFILE_SOURCE +#define _LARGEFILE64_SOURCE +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif + +#include <unistd.h> +#include <sys/syscall.h> +#include <fcntl.h> +#include <sys/stat.h> +#include <sys/types.h> +#include <signal.h> +#include <pthread.h> + +#include "config.h" +#include "ext2_fs.h" +#include "ext2fs.h" + +#define USE_THREADS 1 +#define USE_SUPER 1 + +struct ext2fs_prefetch_handle { + ext2_filsys fs; + int flags; + int done; + pid_t pid; +#ifdef USE_THREADS + pthread_t tid; +#endif +}; + +static int ignore_block(ext2_filsys fs, blk64_t *blocknr, e2_blkcnt_t blockcnt, + blk64_t ref_blk, int ref_offset, void *priv_data) +{ + return 0; +} + +struct dirent_iterate { + void *buf; + int flags; +}; + +static int dirent_block(ext2_filsys fs, blk64_t *blocknr, e2_blkcnt_t blockcnt, + blk64_t ref_blk, int ref_offset, void *priv_data) +{ + struct dirent_iterate *di = priv_data; + errcode_t err; + + err = io_channel_read_blk64(fs->io, *blocknr, 1, di->buf); + if (err && (di->flags & PREFETCH_ERROR_ABORT)) + return BLOCK_ABORT; + return 0; +} + +/* + * First dumb prefetch implementation: Separate process, just read data to + * get it into the page cache, at least. 
+ */
+static void do_ext2fs_prefetch(ext2_filsys fs, int flags)
+{
+	void *buf;
+	blk64_t blk;
+	dgrp_t i;
+	ext2_inode_scan scan;
+	int length = EXT2_INODE_SIZE(fs->super);
+	ext2_ino_t ino;
+	errcode_t err;
+	struct ext2_inode inode;
+	struct dirent_iterate di;
+	int iter_flags;
+	unsigned int blocks_to_read;
+
+	err = ext2fs_get_array(fs->blocksize, fs->inode_blocks_per_group, &buf);
+	if (err)
+		return;
+
+	/* load bitmaps */
+	if (!(flags & PREFETCH_BITMAPS))
+		goto skip_bitmaps;
+	err = ext2fs_read_bitmaps(fs);
+	if (err && (flags & PREFETCH_ERROR_ABORT))
+		goto out;
+
+skip_bitmaps:
+	/* load inode tables */
+	if (!(flags & PREFETCH_INODES) || (flags & (PREFETCH_MAPS |
+						    PREFETCH_DIRS)))
+		goto skip_itable;
+
+	for (i = 0; i < fs->group_desc_count; i++) {
+		if (ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT))
+			continue;
+
+		blocks_to_read = fs->inode_blocks_per_group;
+		if (ext2fs_has_group_desc_csum(fs)) {
+			unsigned int num_inodes =
+				fs->super->s_inodes_per_group -
+				ext2fs_bg_itable_unused(fs, i);
+			blocks_to_read = (num_inodes *
+					  EXT2_INODE_SIZE(fs->super)) /
+					 fs->blocksize;
+		}
+
+		blk = ext2fs_inode_table_loc(fs, i);
+		err = io_channel_read_blk64(fs->io, blk, blocks_to_read, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			goto out;
+	}
+
+skip_itable:
+	/* load inodes */
+	if (!(flags & (PREFETCH_MAPS | PREFETCH_DIRS)))
+		goto skip_inodes;
+
+	err = ext2fs_open_inode_scan(fs, 0, &scan);
+	if (err && (flags & PREFETCH_ERROR_ABORT))
+		goto out;
+
+	di.buf = buf;
+	di.flags = flags;
+	do {
+		err = ext2fs_get_next_inode_full(scan, &ino, &inode,
+						 sizeof(inode));
+		if (err)
+			break;
+		if (!ino)
+			break;
+		if (!ext2fs_test_inode_bitmap2(fs->inode_map, ino))
+			continue;
+
+		iter_flags = BLOCK_FLAG_READ_ONLY | BLOCK_FLAG_DATA_ONLY;
+		if ((flags & PREFETCH_MAPS) &&
+		    !(flags & PREFETCH_DIRS) &&
+		    (LINUX_S_ISREG(inode.i_mode) ||
+		     LINUX_S_ISLNK(inode.i_mode))) {
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    ignore_block, &di);
+		} else if ((flags & PREFETCH_DIRS) &&
+			   !(flags & PREFETCH_MAPS) &&
+			   LINUX_S_ISDIR(inode.i_mode)) {
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    dirent_block, &di);
+		} else {
+			int (*func)(ext2_filsys fs, blk64_t *blocknr,
+				    e2_blkcnt_t blockcnt, blk64_t ref_blk,
+				    int ref_offset, void *priv_data) =
+				LINUX_S_ISDIR(inode.i_mode) ? dirent_block :
+							      ignore_block;
+			err = ext2fs_block_iterate3(fs, ino, iter_flags, NULL,
+						    func, &di);
+		}
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			break;
+
+		blk = ext2fs_file_acl_block(fs, &inode);
+		if (!blk)
+			continue;
+		err = io_channel_read_blk64(fs->io, blk, 1, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			break;
+	} while (ino);
+
+out2:
+	ext2fs_close_inode_scan(scan);
+
+skip_inodes:
+out:
+	ext2fs_free_mem(&buf);
+
+	return;
+}
+
+static void *prefetch_thread(void *data)
+{
+	struct ext2fs_prefetch_handle *pd = data;
+	do_ext2fs_prefetch(pd->fs, pd->flags);
+	return NULL;
+}
+
+/*
+ * Second, less dumb prefetch: Use threads to preload metadata in group order.
+ */
+struct super_entry {
+	dgrp_t group;
+	ext2_ino_t num_inodes;
+};
+
+struct super_thread {
+	ext2_filsys fs;
+	pthread_t tid;
+	int flags;
+	struct super_entry *start, *end;
+	unsigned int skip_factor;
+	void *buf;
+};
+
+static void *super_func(void *data)
+{
+	struct super_thread *t = data;
+	ext2_filsys fs = t->fs;
+	int flags = t->flags;
+	void *buf = t->buf;
+	struct super_entry *e;
+	unsigned int blocks_to_read;
+	blk64_t blk;
+	ext2_ino_t i, ino;
+	unsigned int nr_read = 0;
+	struct dirent_iterate di;
+	struct ext2_inode inode;
+	int iter_flags;
+	errcode_t err;
+
+	/* Read the inode tables */
+	for (e = t->start; e < t->end; e += t->skip_factor) {
+		blocks_to_read = (e->num_inodes *
+				  EXT2_INODE_SIZE(fs->super)) / fs->blocksize;
+		blk = ext2fs_inode_table_loc(fs, e->group);
+		err = io_channel_read_blk64(fs->io, blk, blocks_to_read, buf);
+		if (err && (flags & PREFETCH_ERROR_ABORT))
+			continue;
+	}
+
+	/* Scan inodes for extent/dir blocks */
+	di.buf = buf;
+	di.flags = flags;
+	for (e = t->start; e < t->end; e += t->skip_factor) {
+		for (i = 0; i < e->num_inodes; i++) {
+			/* inode numbers are 1-based */
+			ino = e->group * fs->super->s_inodes_per_group + i + 1;
+			err = ext2fs_read_inode(fs, ino, &inode);
+			if (err)
+				continue;
+			/* Skip unlinked or unknown-type inodes */
+			if (!inode.i_links_count ||
+			    (inode.i_mode & 0xF000) == 0)
+				continue;
+
+			iter_flags = BLOCK_FLAG_READ_ONLY |
+				     BLOCK_FLAG_DATA_ONLY;
+			if ((flags & PREFETCH_MAPS) &&
+			    !(flags & PREFETCH_DIRS) &&
+			    (LINUX_S_ISREG(inode.i_mode) ||
+			     LINUX_S_ISLNK(inode.i_mode))) {
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    ignore_block, &di);
+			} else if ((flags & PREFETCH_DIRS) &&
+				   !(flags & PREFETCH_MAPS) &&
+				   LINUX_S_ISDIR(inode.i_mode)) {
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    dirent_block, &di);
+			} else {
+				int (*func)(ext2_filsys fs, blk64_t *blocknr,
+					    e2_blkcnt_t blockcnt,
+					    blk64_t ref_blk,
+					    int ref_offset, void *priv_data) =
+					LINUX_S_ISDIR(inode.i_mode) ?
+							dirent_block :
+							ignore_block;
+				err = ext2fs_block_iterate3(fs, ino,
+							    iter_flags, NULL,
+							    func, &di);
+			}
+
+			blk = ext2fs_file_acl_block(fs, &inode);
+			if (!blk)
+				continue;
+			err = io_channel_read_blk64(fs->io, blk, 1, buf);
+		}
+	}
+
+	return NULL;
+}
+
+static void *super_prefetch(void *data)
+{
+	struct ext2fs_prefetch_handle *pd = data;
+	ext2_filsys fs = pd->fs;
+	int flags = pd->flags;
+	void *b, *r;
+	struct super_thread *threads = NULL, *t;
+	unsigned int num_threads = sysconf(_SC_NPROCESSORS_ONLN);
+	unsigned int j;
+	struct super_entry *entries = NULL, *e = NULL;
+	unsigned int num_entries = 0;
+	dgrp_t i;
+	ext2_ino_t num_inodes;
+	errcode_t err;
+
+	/* Find all non-empty groups */
+	for (i = 0; i < fs->group_desc_count; i++) {
+		if (ext2fs_bg_flags_test(fs, i, EXT2_BG_INODE_UNINIT))
+			continue;
+
+		num_inodes = fs->super->s_inodes_per_group;
+		if (ext2fs_bg_free_inodes_count(fs, i) == num_inodes)
+			continue;
+		if (ext2fs_has_group_desc_csum(fs))
+			num_inodes -= ext2fs_bg_itable_unused(fs, i);
+		if (e == entries + num_entries) {
+			r = realloc(entries, (num_entries + 32) *
+					     sizeof(*entries));
+			if (r == NULL) {
+				err = errno;
+				goto out;
+			}
+			entries = r;
+			e = entries + num_entries;
+			num_entries += 32;
+		}
+		e->group = i;
+		e->num_inodes = num_inodes;
+		e++;
+	}
+	num_entries = e - entries;
+
+	/* Set up the threads */
+	if (getenv("PREFETCH_THREADS")) {
+		j = atoi(getenv("PREFETCH_THREADS"));
+		if (j > 0)
+			num_threads = j;
+	}
+
+	err = ext2fs_get_arrayzero(num_threads, sizeof(*threads) +
+				   (fs->blocksize * fs->inode_blocks_per_group),
+				   &b);
+	if (err)
+		goto out;
+	threads = b + (fs->blocksize * fs->inode_blocks_per_group *
+		       num_threads);
+
+	for (j = 0, t = threads, e = entries; j < num_threads; j++, t++, e++) {
+		t->fs = fs;
+		t->flags = flags;
+		t->start = e;
+		t->end = entries + num_entries;
+		t->skip_factor = num_threads;
+		err = ext2fs_dup_handle(fs, &t->fs);
+		if (err)
+			goto out2;
+		t->fs->icache = NULL;
+		t->buf = b + (fs->blocksize *
+			      fs->inode_blocks_per_group * j);
+		pthread_create(&t->tid, NULL, super_func, t);
+	}
+
+	/* Wait for threads */
+	for (j = 0, t = threads; j < num_threads; j++, t++)
+		pthread_join(t->tid, NULL);
+	pd->done = 1;
+out2:
+	ext2fs_free_mem(&b);
+out:
+	free(entries);
+	return NULL;
+}
+
+struct unix_private_data_hack {
+	int magic;
+	int dev;
+};
+
+errcode_t ext2fs_prefetch(ext2_filsys fs, int flags,
+			  struct ext2fs_prefetch_handle **h)
+{
+	struct ext2fs_prefetch_handle *pd;
+	errcode_t err;
+
+	err = ext2fs_get_memzero(sizeof(*pd), &pd);
+	if (err)
+		return err;
+	pd->fs = fs;
+	pd->flags = flags;
+
+	/* Load the rest */
+	if (flags & PREFETCH_THREADED) {
+		if (fs->io->manager == mmap_io_manager) {
+#if USE_SUPER
+			struct timespec ts;
+			err = pthread_create(&pd->tid, NULL, super_prefetch,
+					     pd);
+			if (err)
+				goto errout;
+			ts.tv_sec = 0; ts.tv_nsec = 500000;
+			nanosleep(&ts, NULL);
+#elif USE_THREADS
+			err = pthread_create(&pd->tid, NULL, prefetch_thread,
+					     pd);
+			if (err)
+				goto errout;
+#else
+			goto single_thread;
+#endif
+		} else if (fs->io->manager == unix_io_manager) {
+			pd->pid = fork();
+			if (pd->pid < 0) {
+				err = errno;
+				goto errout;
+			} else if (pd->pid == 0) {
+				struct unix_private_data_hack *m =
+					fs->io->private_data;
+				m->dev = open(fs->device_name, O_RDONLY);
+				do_ext2fs_prefetch(fs, flags);
+				exit(0);
+			}
+		}
+	} else {
+single_thread:
+#if USE_SUPER
+		super_prefetch(pd);
+#else
+		do_ext2fs_prefetch(fs, flags);
+#endif
+		pd->done = 1;
+	}
+	*h = pd;
+
+	return 0;
+errout:
+	ext2fs_free_mem(&pd);
+	return err;
+}
+
+errcode_t ext2fs_prefetch_wait(struct ext2fs_prefetch_handle *h)
+{
+	pid_t ret;
+	int status;
+
+	if (h->flags & PREFETCH_THREADED && h->done != 0) {
+		if (h->tid)
+			ret = pthread_join(h->tid, NULL);
+		if (h->pid) {
+			ret = waitpid(h->pid, &status, WNOHANG);
+			if (ret == 0)
+				kill(h->pid, SIGKILL);
+			waitpid(h->pid, NULL, 0);
+		}
+	}
+	h->done = 1;
+	return 0;
+}
+
+errcode_t ext2fs_prefetch_free(struct ext2fs_prefetch_handle **h)
+{
+	ext2fs_prefetch_wait(*h);
+	return ext2fs_free_mem(h);
+}
Use threads with our new mmap IO manager to prefetch metadata.  This
results in a major e2fsck run time speedup.  There's also a stupider
multiprocess version that works with the good old UNIX IO manager to
get pages into the page cache.

Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
---
 e2fsck/unix.c          |   13 +
 lib/ext2fs/Makefile.in |    8 +
 lib/ext2fs/ext2fs.h    |   13 +
 lib/ext2fs/prefetch.c  |  456 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 490 insertions(+)
 create mode 100644 lib/ext2fs/prefetch.c
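For readers skimming the patch, the threaded path divides work without any
locking: thread j starts at group-list entry j and strides by skip_factor
(the thread count), so each non-empty group lands on exactly one thread.
Here is a minimal Python model of that scheme (not part of the patch; the
function name and group numbers are illustrative only):

```python
# Model of super_prefetch's strided work division: thread j takes
# entries j, j + n, j + 2n, ... where n is the number of threads.
def partition(entries, num_threads):
    """Split entries round-robin, one shard per prefetch thread."""
    return [entries[j::num_threads] for j in range(num_threads)]

groups = list(range(10))        # hypothetical non-empty block groups
shards = partition(groups, 4)
# shards[0] is [0, 4, 8]; shards[3] is [3, 7].
# Every group is visited by exactly one thread, with no shared cursor.
assert sorted(g for shard in shards for g in shard) == groups
```

The stride layout also means neighboring threads read inode tables from
neighboring groups at roughly the same time, which keeps the requests
clustered on disk rather than splitting the device into distant regions.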