
[2/3] ext4 directory index: read-ahead blocks v2

Message ID 20110620202854.2473133.32514.stgit@localhost.localdomain
State New, archived

Commit Message

Bernd Schubert June 20, 2011, 8:28 p.m. UTC
From: Bernd Schubert <bernd.schubert@fastmail.fm>

changes from v1 -> v2:
Limit the number of read-ahead blocks as suggested by Andreas.

While creating files in large directories we noticed a seemingly
endless stream of 4K reads, and those reads drastically reduced the
file creation rate measured by bonnie. While we would expect about
2000 creates/s, we only got about 25 creates/s. Running the benchmarks
for a long time improved the numbers, but never above 200 creates/s.
It turned out those reads came from directory index blocks, and
probably the bh cache never held all dx blocks at once. Given the high
number of directories we have (8192) and the number of files required
to trigger the issue (16 million), most probably the cached dx blocks
got evicted in favour of other, less important blocks.
The patch below implements read-ahead for *all* dx blocks of a
directory as soon as a single dx block is missing from the cache. That
also helps the LRU to keep the important dx blocks cached.

Unfortunately, it also has a performance trade-off on the first access
to a directory, even though the READA flag is already set.
Therefore, at least for now, this option is disabled by default, but it
may be enabled using 'mount -o dx_read_ahead' or 'mount -odx_read_ahead=1'.

Signed-off-by: Bernd Schubert <bernd.schubert@itwm.fraunhofer.de>
---
 Documentation/filesystems/ext4.txt |    6 +++
 fs/ext4/ext4.h                     |    3 +
 fs/ext4/inode.c                    |   28 ++++++++++++++
 fs/ext4/namei.c                    |   73 ++++++++++++++++++++++++++++++++++--
 fs/ext4/super.c                    |   17 ++++++++
 5 files changed, 123 insertions(+), 4 deletions(-)


Comments

Theodore Ts'o July 16, 2011, 11:59 p.m. UTC | #1
On Mon, Jun 20, 2011 at 10:28:54PM +0200, Bernd Schubert wrote:
> From: Bernd Schubert <bernd.schubert@fastmail.fm>
> 
> changes from v1 -> v2:
> Limit the number of read-ahead blocks as suggested by Andreas.
> 
> While creating files in large directories we noticed a seemingly
> endless stream of 4K reads, and those reads drastically reduced the
> file creation rate measured by bonnie. While we would expect about
> 2000 creates/s, we only got about 25 creates/s. Running the benchmarks
> for a long time improved the numbers, but never above 200 creates/s.
> It turned out those reads came from directory index blocks, and
> probably the bh cache never held all dx blocks at once. Given the high
> number of directories we have (8192) and the number of files required
> to trigger the issue (16 million), most probably the cached dx blocks
> got evicted in favour of other, less important blocks.
> The patch below implements read-ahead for *all* dx blocks of a
> directory as soon as a single dx block is missing from the cache. That
> also helps the LRU to keep the important dx blocks cached.

If you have 8192 directories and about 16 million files, that means
you have about 2,000 files per directory.  I'll assume that each file
name averages 8-12 characters, so you need about 24 bytes per directory
entry.  If we assume that each leaf block is about 2/3rds full, you
have about 17 leaf blocks, which means we're only talking about one
index block per directory.   Does that sound about right?
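
Spelling out my rough arithmetic: 2,000 entries * 24 bytes is about
48K of directory entries; at 2/3rds occupancy a 4K block holds about
2.7K, and 48K / 2.7K comes to roughly 17 leaf blocks.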

Even if I'm underestimating the number of your index blocks, the
real problem is that you have a very inefficient system; probably
something like 80% or more of the space in your 8192 index blocks (one
per directory) is empty.  Given that, it's no wonder the index blocks
are getting pushed out of memory.  If you reduce the number of
directories that you have, say by a factor of 4 so that you only have
2048 directories, you will still only have about one index block per
directory, but it will be much fuller, and those index blocks will be
hit 4 times more often, which probably makes it more likely that
they stay in memory.  It also means that instead of pinning about 32
megabytes of memory for all of your index blocks, you'll only pin
about 8 megabytes of memory.

It also makes me wonder why your patch is helping you.  If there's
only one index block per directory, then there's no readahead to
accomplish.  So maybe I'm underestimating how many leaf blocks you
have in an average directory.  But the file names would have to be
very, very, VERY long in order to cause us to have more than a single
index block.

OK, so what am I missing?

						- Ted
Bernd Schubert July 17, 2011, 1:02 a.m. UTC | #2
Ted, thanks for looking into it!

On 07/17/2011 01:59 AM, Ted Ts'o wrote:
> On Mon, Jun 20, 2011 at 10:28:54PM +0200, Bernd Schubert wrote:
>> From: Bernd Schubert <bernd.schubert@fastmail.fm>
>>
>> changes from v1 -> v2:
>> Limit the number of read-ahead blocks as suggested by Andreas.
>>
>> While creating files in large directories we noticed a seemingly
>> endless stream of 4K reads, and those reads drastically reduced the
>> file creation rate measured by bonnie. While we would expect about
>> 2000 creates/s, we only got about 25 creates/s. Running the benchmarks
>> for a long time improved the numbers, but never above 200 creates/s.
>> It turned out those reads came from directory index blocks, and
>> probably the bh cache never held all dx blocks at once. Given the high
>> number of directories we have (8192) and the number of files required
>> to trigger the issue (16 million), most probably the cached dx blocks
>> got evicted in favour of other, less important blocks.
>> The patch below implements read-ahead for *all* dx blocks of a
>> directory as soon as a single dx block is missing from the cache. That
>> also helps the LRU to keep the important dx blocks cached.
> 
> If you have 8192 directories and about 16 million files, that means
> you have about 2,000 files per directory.  I'll assume that each file
> name averages 8-12 characters, so you need about 24 bytes per directory
> entry.  If we assume that each leaf block is about 2/3rds full, you
> have about 17 leaf blocks, which means we're only talking about one
> index block per directory.   Does that sound about right?

I don't understand yet either why we have so many, but each directory
has about 20 to 30 index blocks.

> 
> Even if I'm underestimating the number of your index blocks, the
> real problem is that you have a very inefficient system; probably
> something like 80% or more of the space in your 8192 index blocks (one
> per directory) is empty.  Given that, it's no wonder the index blocks
> are getting pushed out of memory.  If you reduce the number of
> directories that you have, say by a factor of 4 so that you only have
> 2048 directories, you will still only have about one index block per
> directory, but it will be much fuller, and those index blocks will be
> hit 4 times more often, which probably makes it more likely that
> they stay in memory.  It also means that instead of pinning about 32
> megabytes of memory for all of your index blocks, you'll only pin
> about 8 megabytes of memory.

For a file system with 16 million files, sure, 8192 hash directories are
far too many. However, it might also easily grow to 600 million or more
files. Let's assume 70000 to 100000 files per directory as the worst case.
With that many files a hash table size of 8192 is more sane, I think. Of
course, that's a clear disadvantage of hash tables - either the size is
wrong, or slow re-hashing is required.
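
Spelled out: 600 million files spread over 8192 hash directories comes
to roughly 73000 files per directory, which is where those worst-case
numbers come from.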

> 
> It also makes me wonder why your patch is helping you.  If there's
> only one index block per directory, then there's no readahead to
> accomplish.  So maybe I'm underestimating how many leaf blocks you
> have in an average directory.  But the file names would have to be
> very, very, VERY long in order to cause us to have more than a single
> index block.
> 
> OK, so what am I missing?

All files I tested with have a fixed name length of 24 characters
(16-byte UUID + hostname).



Thanks,
Bernd



PS: Btw, I have a v3 patch series that mostly fixes the 3rd patch ("Map
blocks with a single semaphore lock") to be checkpatch.pl clean. I just
didn't want to send another revision until we have agreed on the overall
concept (didn't want to send patch spam...).



Theodore Ts'o July 17, 2011, 1:12 p.m. UTC | #3
On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:

> I don't understand yet either why we have so many, but each directory
> has about 20 to 30 index blocks.

Hmm.....  can you send me the output of ls -ld on one or two of the directories?  I want to see how big they are.

Also, just as a sanity check, can you send me the following:

debugfs -R "htree /path/to/a/typical/directory" /dev/sdXX | gzip -9 > /tmp/htree_dump.gz

I just want to take a look at the structure of one of these htrees and see if there's anything surprising going on....


-- Ted

Theodore Ts'o July 18, 2011, 12:23 a.m. UTC | #4
> On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:
> 
> > I don't understand yet either why we have so many, but each directory
> > has about 20 to 30 index blocks.

OK, I think I know what's going on.  Those aren't 20-30 index blocks;
those are 20-30 leaf blocks.  Your directories are approximately
80-120k each, right?

So what your patch is doing is constantly doing readahead to bring the
*entire* directory into the buffer cache any time you do a dx_probe.
That's definitely not what we would want to enable by default, but I
really don't like the idea of adding Yet Another Mount option.  It
expands our testing effort, and the reality is that very few people will
take advantage of the mount option.

How about this?  What if we don't actually perform readahead, but
instead try to look up all of the blocks to see if they are in the
buffer cache using sb_find_get_block()?  If a block is in the buffer
cache, it will get touched, so it will be less likely to be evicted
from the page cache.  So for a workload like yours, it should do what
you want.  But it won't cause all of the pages to get pulled in after
the first reference of the directory in question.
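
Roughly something like this - an untested sketch, where
dx_touch_cached_block() is just a made-up name, and which assumes we
first use ext4_map_blocks() to translate the logical dx block to its
physical block number:

	static void dx_touch_cached_block(struct inode *dir, ext4_lblk_t lblock)
	{
		struct ext4_map_blocks map;
		struct buffer_head *bh;

		map.m_lblk = lblock;
		map.m_len = 1;

		/* translate the logical dx block to its physical block;
		 * no handle and no flags, so this is a pure lookup */
		if (ext4_map_blocks(NULL, dir, &map, 0) <= 0)
			return;

		/* lookup only, no read: per the above, a cached block
		 * gets touched by the lookup and so stays near the head
		 * of the LRU */
		bh = sb_find_get_block(dir->i_sb, map.m_pblk);
		if (bh)
			brelse(bh);
	}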

I'm still worried about the case of a very large directory (say an
unreaped tmp directory that has grown to be tens of megabytes).  If a
program does a sequential scan through the directory doing a
"readdir+stat" (i.e., for example a tmp cleaner or someone running the
command "ls -sF"), we probably shouldn't be trying to keep all of those
directory blocks in memory.  So if a sequential scan is detected, that
should probably suppress the calls to sb_find_get_block().

					- Ted
Bernd Schubert July 19, 2011, 2:22 p.m. UTC | #5
Ted,

sorry for my late reply and thanks a lot for your help!

On 07/18/2011 02:23 AM, Ted Ts'o wrote:
>> On Jul 16, 2011, at 9:02 PM, Bernd Schubert wrote:
>>
>>> I don't understand yet either why we have so many, but each directory
>>> has about 20 to 30 index blocks.
>
> OK, I think I know what's going on.  Those aren't 20-30 index blocks;
> those are 20-30 leaf blocks.  Your directories are approximately
> 80-120k each, right?

Yes, you are right. For example:

drwxr-xr-x 2 root root 102400 Jul 18 13:39 FFB


I also uploaded the debugfs htree output to
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/ext4/htree_dump.bz2

>
> So what your patch is doing is constantly doing readahead to bring the
> *entire* directory into the buffer cache any time you do a dx_probe.
> That's definitely not what we would want to enable by default, but I
> really don't like the idea of adding Yet Another Mount option.  It
> expands our testing effort, and the reality is that very few people will
> take advantage of the mount option.
>
> How about this?  What if we don't actually perform readahead, but
> instead try to look up all of the blocks to see if they are in the
> buffer cache using sb_find_get_block()?  If a block is in the buffer

In principle that should be mostly fine. We could read all directories
on starting up our application and those pages would then be kept in
cache. While our main concern right now is the metadata server, where
that patch would help and where we will also change the on-disk layout
to entirely work around that issue, the issue also affects storage
servers. On those we are not sure if the patch would help, as real data
pages might push those directory pages out of the cache.
Also interesting is that the whole issue might easily explain metadata
issues I experienced in the past with Lustre on systems with lots of
files per OST - Lustre and FhGFS have a rather similar on-disk layout
for data files and so should suffer from similar underlying storage
issues. We have been discussing for some time whether we could change
that layout for FhGFS, but that would bring up other, even more critical
problems...

> cache, it will get touched, so it will be less likely to be evicted
> from the page cache.  So for a workload like yours, it should do what
> you want.  But it won't cause all of the pages to get pulled in after
> the first reference of the directory in question.

I think we would still need to map the logical ext4 block to the real
block for sb_find_get_block(). So what about keeping most of the
existing patch, but updating ext4_bread_ra() to:

+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+       struct buffer_head *bh;
+       int err;
+
+       bh = ext4_getblk(NULL, inode, block, 0, &err);
+       if (!bh)
+               return -1;
+
+       if (buffer_uptodate(bh))
+               touch_buffer(bh); /* patch update here */
+
+       brelse(bh);
+       return 0;
+}
+


>
> I'm still worried about the case of a very large directory (say an
> unreaped tmp directory that has grown to be tens of megabytes).  If a
> program does a sequential scan through the directory doing a
> "readdir+stat" (i.e., for example a tmp cleaner or someone running the
> command "ls -sF"), we probably shouldn't be trying to keep all of those
> directory blocks in memory.  So if a sequential scan is detected, that
> should probably suppress the calls to sb_find_get_block().

Do you have a suggestion how to detect that? Set a flag in the dir inode
during a readdir and, if that flag is set, skip the touch_buffer(bh)?
So add flags in struct file ->private_data and struct inode ->i_private
about the readdir, and remove those again on close of the file?
Better would be if a readdirplus syscall existed, which ls would use
and which would set its intent itself... But even if we added that, it
would take ages until all user space programs used it.
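
Something like this is what I have in mind (hypothetical sketch;
EXT4_STATE_SEQ_SCAN would be a new, made-up inode state flag):

	/* in ext4_readdir(): note that a sequential scan is going on */
	ext4_set_inode_state(inode, EXT4_STATE_SEQ_SCAN);

	/* in ext4_release_dir(): clear it again */
	ext4_clear_inode_state(inode, EXT4_STATE_SEQ_SCAN);

	/* in dx_probe(): skip the read-ahead / cache touching while a
	 * sequential scan seems to be active */
	if (!ext4_test_inode_state(dir, EXT4_STATE_SEQ_SCAN))
		dx_ra_blocks(dir, ra_entries, at);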



Thanks,
Bernd

Patch

diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 3ae9bc9..fad70ea 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -404,6 +404,12 @@  dioread_nolock		locking. If the dioread_nolock option is specified
 i_version		Enable 64-bit inode version support. This option is
 			off by default.
 
+dx_read_ahead		Enables read-ahead of directory index blocks.
+			This option should be enabled if the filesystem has
+			many directories with a high number of files. The
+			disadvantage is that the first access to a directory
+			triggers additional reads, which may slow other operations.
+
 Data Mode
 =========
 There are 3 different data modes:
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1921392..997323a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -916,6 +916,8 @@  struct ext4_inode_info {
 #define EXT4_MOUNT_DISCARD		0x40000000 /* Issue DISCARD requests */
 #define EXT4_MOUNT_INIT_INODE_TABLE	0x80000000 /* Initialize uninitialized itables */
 
+#define EXT4_MOUNT2_DX_READ_AHEAD	0x00002 /* Read ahead directory index blocks */
+
 #define clear_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt &= \
 						~EXT4_MOUNT_##opt
 #define set_opt(sb, opt)		EXT4_SB(sb)->s_mount_opt |= \
@@ -1802,6 +1804,7 @@  struct buffer_head *ext4_getblk(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
 struct buffer_head *ext4_bread(handle_t *, struct inode *,
 						ext4_lblk_t, int, int *);
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block);
 int ext4_get_block(struct inode *inode, sector_t iblock,
 				struct buffer_head *bh_result, int create);
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a5763e3..938fb6c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -1490,6 +1490,9 @@  struct buffer_head *ext4_getblk(handle_t *handle, struct inode *inode,
 	return bh;
 }
 
+/*
+ * Synchronous read of blocks
+ */
 struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 			       ext4_lblk_t block, int create, int *err)
 {
@@ -1500,6 +1503,7 @@  struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 		return bh;
 	if (buffer_uptodate(bh))
 		return bh;
+
 	ll_rw_block(READ_META, 1, &bh);
 	wait_on_buffer(bh);
 	if (buffer_uptodate(bh))
@@ -1509,6 +1513,30 @@  struct buffer_head *ext4_bread(handle_t *handle, struct inode *inode,
 	return NULL;
 }
 
+/*
+ * Read-ahead blocks
+ */
+int ext4_bread_ra(struct inode *inode, ext4_lblk_t block)
+{
+	struct buffer_head *bh;
+	int err;
+
+	bh = ext4_getblk(NULL, inode, block, 0, &err);
+	if (!bh)
+		return -1;
+
+	if (buffer_uptodate(bh)) {
+		brelse(bh);
+		return 0;
+	}
+
+	ll_rw_block(READA, 1, &bh);
+
+	brelse(bh);
+	return 0;
+}
+
+
 static int walk_page_buffers(handle_t *handle,
 			     struct buffer_head *head,
 			     unsigned from,
diff --git a/fs/ext4/namei.c b/fs/ext4/namei.c
index bfb749f..9643722 100644
--- a/fs/ext4/namei.c
+++ b/fs/ext4/namei.c
@@ -49,6 +49,8 @@ 
 #define NAMEI_RA_SIZE	     (NAMEI_RA_CHUNKS * NAMEI_RA_BLOCKS)
 #define NAMEI_RA_INDEX(c,b)  (((c) * NAMEI_RA_BLOCKS) + (b))
 
+#define NAMEI_RA_DX_BLOCKS  32 /* Better use BH_LRU_SIZE? */
+
 static struct buffer_head *ext4_append(handle_t *handle,
 					struct inode *inode,
 					ext4_lblk_t *block, int *err)
@@ -334,6 +336,50 @@  struct stats dx_show_entries(struct dx_hash_info *hinfo, struct inode *dir,
 #endif /* DX_DEBUG */
 
 /*
+ * Read ahead directory index blocks
+ */
+static void dx_ra_blocks(struct inode *dir, struct dx_entry *entries,
+			 struct dx_entry *at)
+{
+	int i, err = 0;
+	struct dx_entry *first_ra_entry = entries + 1;
+	unsigned num_entries = dx_get_count(entries) - 1;
+
+	if (num_entries < 2 || num_entries > dx_get_limit(entries)) {
+		dxtrace(printk("dx read-ahead: invalid number of entries:%d\n",
+			       num_entries));
+		return;
+	}
+
+	/* limit read ahead blocks */
+	if (num_entries > NAMEI_RA_DX_BLOCKS) {
+		int min = at - first_ra_entry; /* first_ra_entry + min = at */
+		int max = num_entries - min - 1; /* at + max = last_ra_entry */
+		int half_limit = NAMEI_RA_DX_BLOCKS >> 1;
+
+		min = min(min, half_limit);
+		max = min(max, half_limit);
+
+		first_ra_entry = at - min;
+
+		/* We do not use exactly NAMEI_RA_DX_BLOCKS here, as the logic
+		 * for min and max would be unnecessarily complex */
+		num_entries = min + max;
+	}
+
+	dxtrace(printk("dx read-ahead: %d entries in dir-ino %lu \n",
+			num_entries, dir->i_ino));
+
+	i = 0;
+	do {
+		struct dx_entry *entry = first_ra_entry + i;
+
+		err = ext4_bread_ra(dir, dx_get_block(entry));
+		i++;
+	} while (i < num_entries && !err);
+}
+
+/*
  * Probe for a directory leaf block to search.
  *
  * dx_probe can return ERR_BAD_DX_DIR, which means there was a format
@@ -347,11 +393,12 @@  dx_probe(const struct qstr *d_name, struct inode *dir,
 	 struct dx_hash_info *hinfo, struct dx_frame *frame_in, int *err)
 {
 	unsigned count, indirect;
-	struct dx_entry *at, *entries, *p, *q, *m;
+	struct dx_entry *at, *entries, *ra_entries, *p, *q, *m;
 	struct dx_root *root;
 	struct buffer_head *bh;
 	struct dx_frame *frame = frame_in;
 	u32 hash;
+	bool did_ra = false;
 
 	frame->bh = NULL;
 	if (!(bh = ext4_bread (NULL,dir, 0, 0, err)))
@@ -390,7 +437,7 @@  dx_probe(const struct qstr *d_name, struct inode *dir,
 		goto fail;
 	}
 
-	entries = (struct dx_entry *) (((char *)&root->info) +
+	ra_entries = entries = (struct dx_entry *) (((char *)&root->info) +
 				       root->info.info_length);
 
 	if (dx_get_limit(entries) != dx_root_limit(dir,
@@ -446,9 +493,27 @@  dx_probe(const struct qstr *d_name, struct inode *dir,
 		frame->bh = bh;
 		frame->entries = entries;
 		frame->at = at;
-		if (!indirect--) return frame;
-		if (!(bh = ext4_bread (NULL,dir, dx_get_block(at), 0, err)))
+
+		if (!did_ra && test_opt2(dir->i_sb, DX_READ_AHEAD)) {
+			/* read-ahead of dx blocks */
+			struct buffer_head *test_bh;
+			ext4_lblk_t block = dx_get_block(at);
+
+			test_bh = ext4_getblk(NULL, dir, block, 0, err);
+			if (test_bh && !buffer_uptodate(test_bh)) {
+				dx_ra_blocks(dir, ra_entries, at);
+				did_ra = true;
+			}
+			brelse(test_bh);
+		}
+
+		if (!indirect--)
+			return frame;
+
+		bh = ext4_bread(NULL, dir, dx_get_block(at), 0, err);
+		if (!bh)
 			goto fail2;
+
 		at = entries = ((struct dx_node *) bh->b_data)->entries;
 		if (dx_get_limit(entries) != dx_node_limit (dir)) {
 			ext4_warning(dir->i_sb,
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index cc5c157..9dd7c05 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -1119,6 +1119,9 @@  static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs)
 		seq_printf(seq, ",init_inode_table=%u",
 			   (unsigned) sbi->s_li_wait_mult);
 
+	if (test_opt2(sb, DX_READ_AHEAD))
+		seq_puts(seq, ",dx_read_ahead");
+
 	ext4_show_quota_options(seq, sb);
 
 	return 0;
@@ -1294,6 +1297,7 @@  enum {
 	Opt_dioread_nolock, Opt_dioread_lock,
 	Opt_discard, Opt_nodiscard,
 	Opt_init_inode_table, Opt_noinit_inode_table,
+	Opt_dx_read_ahead,
 };
 
 static const match_table_t tokens = {
@@ -1369,6 +1373,8 @@  static const match_table_t tokens = {
 	{Opt_init_inode_table, "init_itable=%u"},
 	{Opt_init_inode_table, "init_itable"},
 	{Opt_noinit_inode_table, "noinit_itable"},
+	{Opt_dx_read_ahead, "dx_read_ahead=%u"},
+	{Opt_dx_read_ahead, "dx_read_ahead"},
 	{Opt_err, NULL},
 };
 
@@ -1859,6 +1865,17 @@  set_qf_format:
 		case Opt_noinit_inode_table:
 			clear_opt(sb, INIT_INODE_TABLE);
 			break;
+		case Opt_dx_read_ahead:
+			if (args[0].from) {
+				if (match_int(&args[0], &option))
+					return 0;
+			} else
+				option = 1;	/* No argument, default to 1 */
+			if (option)
+				set_opt2(sb, DX_READ_AHEAD);
+			else
+				clear_opt2(sb, DX_READ_AHEAD);
+			break;
 		default:
 			ext4_msg(sb, KERN_ERR,
 			       "Unrecognized mount option \"%s\" "