Patchwork [v3,0/2] ext4: increase mbcache scalability

Submitter Theodore Ts'o
Date Sept. 12, 2013, 12:23 p.m.
Message ID <20130912122317.GC12918@thunk.org>
Permalink /patch/274540/
State Not Applicable

Comments

Theodore Ts'o - Sept. 12, 2013, 12:23 p.m.
(I've trimmed the cc list to stop spamming people who probably don't
care as much about this).

So I've tried the following patch, and I've confirmed that for short
xattrs (i.e., that fit inside the inode body, assuming an inode size >
128 bytes), the mbcache paths don't trigger at all.

Could you try this patch and see if we can figure out why mbcache code
paths are triggering for you?

						- Ted

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Thavatchai Makphaibulchoke - Sept. 13, 2013, 12:04 p.m.
On 09/12/2013 06:23 AM, Theodore Ts'o wrote:
> (I've trimmed the cc list to stop spamming people who probably don't
> care as much about this).
> 
> So I've tried the following patch, and I've confirmed that for short
> xattrs (i.e., that fit inside the inode body, assuming an inode size >
> 128 bytes), the mbcache paths don't trigger at all.
> 
> Could you try this patch and see if we can figure out why mbcache code
> paths are triggering for you?
> 
> 						- Ted
> 

I tried the patch.  First off, the patch produces a tremendous stream of non-stop warnings during the aim7 run, particularly the one from line 1627.

Unfortunately, it looks like with this patch we are still calling ext4_xattr_cache_find().

I added more debugging code to print out the first few xattrs that get us into ext4_xattr_cache_find().  This is what I have:

[   80.540977] ext4_xattr_block_set: name  len 28.
[  155.480692] ext4_xattr_block_set: name selinux len 32.
[  155.486531] ext4_xattr_block_set: name selinux len 32.
[  155.492283] ext4_xattr_block_set: name selinux len 32.
[  155.497988] ext4_xattr_block_set: name selinux len 32.
[  155.504064] ext4_xattr_block_set: name selinux len 32.
[  155.509771] ext4_xattr_block_set: name selinux len 32.
[  155.515475] ext4_xattr_block_set: name selinux len 32.
[  155.521179] ext4_xattr_block_set: name selinux len 32.

I'll continue debugging and get back with any new findings.

Thanks,
Mak.

Eric Sandeen - Sept. 13, 2013, 6:59 p.m.
On 9/13/13 7:04 AM, Thavatchai Makphaibulchoke wrote:
> On 09/12/2013 06:23 AM, Theodore Ts'o wrote:
>> (I've trimmed the cc list to stop spamming people who probably don't
>> care as much about this).
>>
>> So I've tried the following patch, and I've confirmed that for short
>> xattrs (i.e., that fit inside the inode body, assuming an inode size >
>> 128 bytes), the mbcache paths don't trigger at all.
>>
>> Could you try this patch and see if we can figure out why mbcache code
>> paths are triggering for you?
>>
>> 						- Ted
>>
> 
> I tried the patch.  First off, the patch produces a tremendous stream of non-stop warnings during the aim7 run, particularly the one from line 1627.
> 
> Unfortunately, it looks like with this patch we are still calling ext4_xattr_cache_find().

All the patch does is issue warnings, so no fixing was expected ;)

> I added more debugging code to print out the first few xattrs that get us into ext4_xattr_cache_find().  This is what I have:
> 
> [   80.540977] ext4_xattr_block_set: name  len 28.
> [  155.480692] ext4_xattr_block_set: name selinux len 32.
> [  155.486531] ext4_xattr_block_set: name selinux len 32.
> [  155.492283] ext4_xattr_block_set: name selinux len 32.
> [  155.497988] ext4_xattr_block_set: name selinux len 32.
> [  155.504064] ext4_xattr_block_set: name selinux len 32.
> [  155.509771] ext4_xattr_block_set: name selinux len 32.
> [  155.515475] ext4_xattr_block_set: name selinux len 32.
> [  155.521179] ext4_xattr_block_set: name selinux len 32.

ok, all small...

> I'll continue debugging and get back with any new findings.

suggestions:

1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this filesystem
2) print out the inode number in your printk's above
3) stat one of those inodes in debugfs after the run (debugfs stat <inodenumber>) - with <> around the inode nr

-Eric
Thavatchai Makphaibulchoke - Sept. 17, 2013, 4:43 p.m.
On 09/13/2013 12:59 PM, Eric Sandeen wrote:
> 
> suggestions:
> 
> 1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this filesystem
> 2) print out the inode number in your printk's above
> 3) stat one of those inodes in debugfs after the run (debugfs stat <inodenumber>) - with <> around the inode nr
> 
> -Eric
> 

I think I found out why we get into the mbcache path.  The test uses ramfs filesystems, which are mounted and unmounted at the start and end of the test.  It looks like the default ramfs inode size is 128 bytes, causing all the mbcaching for the xattrs.  There seems to be nothing wrong with either SELinux or xattrs.  Sorry for the confusion.

Thanks,
Mak.

Eric Sandeen - Sept. 18, 2013, 2:19 p.m.
On 9/17/13 11:43 AM, Thavatchai Makphaibulchoke wrote:
> On 09/13/2013 12:59 PM, Eric Sandeen wrote:
>> 
>> suggestions:
>> 
>> 1) send us dumpe2fs -h /dev/sdXX | grep "Inode size" for this filesystem
>> 2) print out the inode number in your printk's above
>> 3) stat one of those inodes in debugfs after the run (debugfs stat <inodenumber>) - with <> around the inode nr
>> 
>> -Eric
>> 
> 
> I think I found out why we get into the mbcache path.  The test uses
> ramfs filesystems, which are mounted and unmounted at the start and end of the
> test.  It looks like the default ramfs inode size is 128 bytes, causing all the
> mbcaching for the xattrs.  There seems to be nothing wrong with
> either SELinux or xattr.  Sorry for the confusion.

smaller filesystems go back to 128 byte inodes, IIRC.

That's why I asked for dumpe2fs long ago ;)

But glad to have the mystery solved!

Thanks,
-Eric

> Thanks, Mak.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Theodore Ts'o - Sept. 21, 2013, 6:53 p.m.
On Wed, Sep 18, 2013 at 09:19:10AM -0500, Eric Sandeen wrote:
> > I think I found out why we get into the mbcache path.  The test uses
> > ramfs filesystems, which are mounted and unmounted at the start and end of the
> > test.  It looks like the default ramfs inode size is 128 bytes, causing all the
> > mbcaching for the xattrs.  There seems to be nothing wrong with
> > either SELinux or xattr.  Sorry for the confusion.
> 
> smaller filesystems go back to 128 byte inodes, IIRC.

Mak, did you mean an ext4 file system created on a ramdisk, using a
/dev/ramNN device, or the literal "ramfs" file system?

I assume the former, right?

					- Ted
Thavatchai Makphaibulchoke - Sept. 23, 2013, 3:35 p.m.
On 09/21/2013 12:53 PM, Theodore Ts'o wrote:

> 
> Mak, did you mean a ext4 file system created on a ramdisk, using a
> /dev/ramNN device, or the literal "ramfs" file system?
> 
> I assume the former, right?
> 
> 					- Ted
> 

Yes Ted, that is correct: an ext4 filesystem on a /dev/ramNN device.

Thanks,
Mak.

Thavatchai Makphaibulchoke - Sept. 25, 2013, 6:43 p.m.
On 09/21/2013 05:12 PM, Andreas Dilger wrote:
> I just noticed that we have a patch for Lustre that adds a "no_mbcache" mount option to disable mbcache on a per-filesystem basis.
> 
> The newest patch we have is based on FC19 (3.10 kernel), is this something that would be of interest if we submitted it upstream?
> 
> http://review.whamcloud.com/#/c/7263/8/ldiskfs/kernel_patches/patches/fc19/ext4-disable-mb-cache.patch
> 
> Cheers, Andreas
> 

Here are some of the improvements I saw with SELinux disabled.

On an 80 core machine,

custom 15.5%
disk 27.81%
fserver 5.33%
new_dbase 9.06%
new_fserver 5.45%


On an 8 core machine,

alltests 5.24%
custom 2.26%
shared 9.32%
short 24.08%

The rest is not noticeably different.

Thanks,
Mak.


Andreas Dilger - Oct. 4, 2013, 3:22 a.m.
On 2013-09-25, at 12:43 PM, Thavatchai Makphaibulchoke wrote:
> On 09/21/2013 05:12 PM, Andreas Dilger wrote:
>> I just noticed that we have a patch for Lustre that adds a "no_mbcache" mount option to disable mbcache on a per-filesystem basis.
>> 
>> The newest patch we have is based on FC19 (3.10 kernel), is this something that would be of interest if we submitted it upstream?
>> 
>> http://review.whamcloud.com/#/c/7263/8/ldiskfs/kernel_patches/patches/fc19/ext4-disable-mb-cache.patch
>> 
>> Cheers, Andreas
>> 
> 
> Here are some of the improvements I saw with SELinux disabled.

To confirm, is this data on your "128-byte inode on ramdisk" config,
or is this on a system with a real disk and 256-byte inodes?

> On an 80 core machine,
> 
> custom 15.5%
> disk 27.81%
> fserver 5.33%
> new_dbase 9.06%
> new_fserver 5.45%
> 
> 
> On an 8 core machine,
> 
> alltests 5.24%
> custom 2.26%
> shared 9.32%
> short 24.08%
> 
> The rest is not noticeably different.

This is a pretty significant price to pay for SELinux in any case.
I guess it is probably lower overhead with 256-byte inodes, but
anything that adds 5-25% overhead shouldn't be taken for granted.

Cheers, Andreas





Thavatchai Makphaibulchoke - Oct. 18, 2013, 12:43 p.m.
On 10/03/2013 09:22 PM, Andreas Dilger wrote:

> 
> This is a pretty significant price to pay for SELinux in any case.
> I guess it is probably lower overhead with 256-byte inodes, but
> anything that adds 5-25% overhead shouldn't be taken for granted.
> 
> Cheers, Andreas
> 

So far we have determined that,

- With SELinux enabled on a ramdisk filesystem, with the default inode size of 128 bytes, extended xattrs are generated
- In one of the aim7 workloads, mbcache has a hit ratio of about 65%

It seems like mbcache itself, and the mbcache lock optimization attempted by the patch, could actually improve performance for a real-world system where extended xattrs are employed.  I believe the patch should go in if there isn't any concern.

Please let me know if there is any comment or concern about the patch.

Thanks,
Mak.



Eric Sandeen - Oct. 19, 2013, 11:06 p.m.
On 10/18/13 7:43 AM, Thavatchai Makphaibulchoke wrote:
> On 10/03/2013 09:22 PM, Andreas Dilger wrote:
> 
>>
>> This is a pretty significant price to pay for SELinux in any case.
>> I guess it is probably lower overhead with 256-byte inodes, but
>> anything that adds 5-25% overhead shouldn't be taken for granted.
>>
>> Cheers, Andreas
>>
> 
> So far we have determined that,
> 
> - With SELinux enabled on a ramdisk filesystem, with the default inode size of 128 bytes, extended xattrs are generated
> - In one of the aim7 workloads, mbcache has a hit ratio of about 65%
> 
> It seems like mbcache itself, and the mbcache lock optimization
> attempted by the patch, could actually improve performance for a
> real-world system where extended xattrs are employed.  I believe the
> patch should go in if there isn't any concern.

But the 128-byte inode size is not common, so in general, mbcache isn't
exercised that often.

(still, it is there, and if there's a nice scalability patch for it,
it's probably worth it).

-Eric

> Please let me know if there is any comment or concern about the patch.
> 
> Thanks,
> Mak.
> 
> 
> 


Patch

diff --git a/fs/ext4/xattr.c b/fs/ext4/xattr.c
index c081e34..0942ae3 100644
--- a/fs/ext4/xattr.c
+++ b/fs/ext4/xattr.c
@@ -265,6 +265,8 @@  ext4_xattr_block_get(struct inode *inode, int name_index, const char *name,
 	struct ext4_xattr_entry *entry;
 	size_t size;
 	int error;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	ea_idebug(inode, "name=%d.%s, buffer=%p, buffer_size=%ld",
 		  name_index, name, buffer, (long)buffer_size);
@@ -286,6 +288,11 @@  bad_block:
 		error = -EIO;
 		goto cleanup;
 	}
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ext4_xattr_cache_insert(bh);
 	entry = BFIRST(bh);
 	error = ext4_xattr_find_entry(&entry, name_index, name, bh->b_size, 1);
@@ -409,6 +416,8 @@  ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 	struct inode *inode = dentry->d_inode;
 	struct buffer_head *bh = NULL;
 	int error;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	ea_idebug(inode, "buffer=%p, buffer_size=%ld",
 		  buffer, (long)buffer_size);
@@ -430,6 +439,11 @@  ext4_xattr_block_list(struct dentry *dentry, char *buffer, size_t buffer_size)
 		error = -EIO;
 		goto cleanup;
 	}
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ext4_xattr_cache_insert(bh);
 	error = ext4_xattr_list_entries(dentry, BFIRST(bh), buffer, buffer_size);
 
@@ -526,7 +540,14 @@  ext4_xattr_release_block(handle_t *handle, struct inode *inode,
 {
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 	ce = mb_cache_entry_get(ext4_xattr_cache, bh->b_bdev, bh->b_blocknr);
 	error = ext4_journal_get_write_access(handle, bh);
 	if (error)
@@ -745,12 +766,19 @@  ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 	struct ext4_xattr_search *s = &bs->s;
 	struct mb_cache_entry *ce = NULL;
 	int error = 0;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 #define header(x) ((struct ext4_xattr_header *)(x))
 
 	if (i->value && i->value_len > sb->s_blocksize)
 		return -ENOSPC;
 	if (s->base) {
+		if (__ratelimit(&_rs)) {
+			ext4_warning(inode->i_sb, "inode %lu",
+				     (unsigned long) inode->i_ino);
+			WARN_ON(1);
+		}
 		ce = mb_cache_entry_get(ext4_xattr_cache, bs->bh->b_bdev,
 					bs->bh->b_blocknr);
 		error = ext4_journal_get_write_access(handle, bs->bh);
@@ -769,6 +797,12 @@  ext4_xattr_block_set(handle_t *handle, struct inode *inode,
 				if (!IS_LAST_ENTRY(s->first))
 					ext4_xattr_rehash(header(s->base),
 							  s->here);
+				if (__ratelimit(&_rs)) {
+					ext4_warning(inode->i_sb,
+						     "inode %lu",
+						(unsigned long) inode->i_ino);
+					WARN_ON(1);
+				}
 				ext4_xattr_cache_insert(bs->bh);
 			}
 			unlock_buffer(bs->bh);
@@ -905,6 +939,11 @@  getblk_failed:
 			memcpy(new_bh->b_data, s->base, new_bh->b_size);
 			set_buffer_uptodate(new_bh);
 			unlock_buffer(new_bh);
+			if (__ratelimit(&_rs)) {
+				ext4_warning(inode->i_sb, "inode %lu",
+					     (unsigned long) inode->i_ino);
+				WARN_ON(1);
+			}
 			ext4_xattr_cache_insert(new_bh);
 			error = ext4_handle_dirty_xattr_block(handle,
 							      inode, new_bh);
@@ -1570,10 +1609,17 @@  ext4_xattr_cache_find(struct inode *inode, struct ext4_xattr_header *header,
 {
 	__u32 hash = le32_to_cpu(header->h_hash);
 	struct mb_cache_entry *ce;
+	static DEFINE_RATELIMIT_STATE(_rs, DEFAULT_RATELIMIT_INTERVAL,
+				      DEFAULT_RATELIMIT_BURST);
 
 	if (!header->h_hash)
 		return NULL;  /* never share */
 	ea_idebug(inode, "looking for cached blocks [%x]", (int)hash);
+	if (__ratelimit(&_rs)) {
+		ext4_warning(inode->i_sb, "inode %lu",
+			     (unsigned long) inode->i_ino);
+		WARN_ON(1);
+	}
 again:
 	ce = mb_cache_entry_find_first(ext4_xattr_cache, inode->i_sb->s_bdev,
 				       hash);