ext4: improve smp scalability for inode generation

Message ID 8760bcpdc8.fsf@openvz.org
State New
Headers show
Series
  • ext4: improve smp scalability for inode generation
Related show

Commit Message

Dmitry Monakhov Oct. 18, 2017, 5:36 p.m.
->s_next_generation is protected by s_next_gen_lock but it usage
pattern is very primitive and can be replaced with atomic_ops

This significantly improve creation/unlink scenario on SMP systems,
for example lat_fs_create_unlink test [1] on x2 E5-2680 (32vcpu) system
shows ~20% improvement.
| nr_tsk | wo/ patch | w/ patch |
|--------+-----------+----------|
|      1 |       137 |      140 |
|      2 |       224 |      233 |
|      4 |       356 |      372 |
|      8 |       439 |      519 |
|     16 |       443 |      585 |
|     32 |       598 |      695 |
|     64 |       559 |      707 |
|    128 |       385 |      437 |

Footnotes:
[1]https://github.com/dmonakhov/lmbench/blob/master/src/lat_fs_create_unlink.c

Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/ext4/ext4.h   | 3 +--
 fs/ext4/ialloc.c | 4 +---
 fs/ext4/ioctl.c  | 6 ++----
 fs/ext4/super.c  | 8 ++++----
 4 files changed, 8 insertions(+), 13 deletions(-)

Comments

Dmitry Monakhov Oct. 18, 2017, 6:08 p.m. | #1
Dmitry Monakhov <dmonakhov@openvz.org> writes:

> ->s_next_generation is protected by s_next_gen_lock but it usage
> pattern is very primitive and can be replaced with atomic_ops
>
> This significantly improve creation/unlink scenario on SMP systems,
> for example lat_fs_create_unlink test [1] on x2 E5-2680 (32vcpu) system
> shows ~20% improvement.
> | nr_tsk | wo/ patch | w/ patch |
> |--------+-----------+----------|
> |      1 |       137 |      140 |
> |      2 |       224 |      233 |
> |      4 |       356 |      372 |
> |      8 |       439 |      519 |
> |     16 |       443 |      585 |
> |     32 |       598 |      695 |
> |     64 |       559 |      707 |
> |    128 |       385 |      437 |

FYI with lazytime enabled lat_fs_create_unlink is ~16x times slower.
The reason is quite obvious ext4_update_other_inodes_time() increase
lock contention for inode_hash_lock (4k/256) times.

->ext4_do_update_inode
  ->ext4_update_other_inodes_time
    for (i = 0; i < inodes_per_block; i++, ino++, buf += inode_size)
      ->find_inode_nowait
        ->spin_lock(&inode_hash_lock) -> 16x contention increase

inode_hash_lock is known problem. I have patches to convert inode_hash_table
per bucket lock similar to dentry_hash, but this require massige changes in
various filesystems so will require a lot of time to be merged.

Currently lazytime amplify it significantly. May be it is reasonable to
use spin_trylock inside find_inode_nowait to make it true lightweight hint?
diff --git a/fs/inode.c b/fs/inode.c
index d1e35b5..a5b1cba1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1360,7 +1360,9 @@ struct inode *find_inode_nowait(struct super_block *sb,
 	struct inode *inode, *ret_inode = NULL;
 	int mval;
 
-	spin_lock(&inode_hash_lock);
+	if (!spin_trylock(&inode_hash_lock))
+		return NULL;
+
 	hlist_for_each_entry(inode, head, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
Andreas Dilger Oct. 19, 2017, 11:50 a.m. | #2
On Oct 18, 2017, at 11:36 AM, Dmitry Monakhov <dmonakhov@openvz.org> wrote:
> 
> 
> ->s_next_generation is protected by s_next_gen_lock but it usage
> pattern is very primitive and can be replaced with atomic_ops
> 
> This significantly improve creation/unlink scenario on SMP systems,
> for example lat_fs_create_unlink test [1] on x2 E5-2680 (32vcpu) system
> shows ~20% improvement.
> | nr_tsk | wo/ patch | w/ patch |
> |--------+-----------+----------|
> |      1 |       137 |      140 |
> |      2 |       224 |      233 |
> |      4 |       356 |      372 |
> |      8 |       439 |      519 |
> |     16 |       443 |      585 |
> |     32 |       598 |      695 |
> |     64 |       559 |      707 |
> |    128 |       385 |      437 |

Strictly speaking, we don't need a single global value for i_generation.
These are per-inode values, and just need to be relatively unique compared
to previous values for each inode.  There are also potential security
benefits from not having sequential i_generation numbers, since that makes
NFS file handle guessing a lot harder.

You could just increment the previous i_generation value for that inode
(if the old inode was read from disk), or generate a random number (also
likely to be CPU expensive and risk collisions), or use a per-CPU counter
(with some way to ensure threads allocating/freeing inodes on different
cores do not allocate the same generation in close proximity), like:

        cpuid = smp_processor_id();
        i_generation = sb->s_generation[cpuid] | cpuid;
        sb->s_generation[cpuid] += num_possible_cpus();

or whatever is fastest.  The above doesn't even need {get,put}_cpu(),
since it doesn't matter if there is a race in the update.

Since the locking of this one field shows up at a macro level, it would
be interesting if you took out the assignment completely, to see if this
shows further improvements, which would indicate we can still come up with
a better solution than the atomic you have proposed here.

Cheers, Andreas

> Footnotes:
> [1]https://github.com/dmonakhov/lmbench/blob/master/src/lat_fs_create_unlink.c
> 
> Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
> ---
> fs/ext4/ext4.h   | 3 +--
> fs/ext4/ialloc.c | 4 +---
> fs/ext4/ioctl.c  | 6 ++----
> fs/ext4/super.c  | 8 ++++----
> 4 files changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e2abe01..6be1aa8 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1392,8 +1392,7 @@ struct ext4_sb_info {
> 	int s_first_ino;
> 	unsigned int s_inode_readahead_blks;
> 	unsigned int s_inode_goal;
> -	spinlock_t s_next_gen_lock;
> -	u32 s_next_generation;
> +	atomic_t s_next_generation;
> 	u32 s_hash_seed[4];
> 	int s_def_hash_version;
> 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index ee82302..d12dabc 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> 			   inode->i_ino);
> 		goto out;
> 	}
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode->i_generation = atomic_inc_return(&sbi->s_next_generation);
> 
> 	/* Precompute checksum seed for inode metadata */
> 	if (ext4_has_metadata_csum(sb)) {
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index afb66d4..7d8b1a5 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -157,10 +157,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
> 
> 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
> 
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	inode_bl->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode_bl->i_generation = atomic_add_return(2, &sbi->s_next_generation);
> +	inode->i_generation = inode_bl->i_generation -1;
> 
> 	ext4_discard_preallocations(inode);
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index b104096..bfc6d2e 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3419,7 +3419,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> 	int err = 0;
> 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
> 	ext4_group_t first_not_zeroed;
> -
> +	u32 igen;
> +
> 	if ((data && !orig_data) || !sbi)
> 		goto out_free_base;
> 
> @@ -3977,9 +3978,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> 	}
> 
> 	sbi->s_gdb_count = db_count;
> -	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
> -	spin_lock_init(&sbi->s_next_gen_lock);
> -
> +	get_random_bytes(&igen, sizeof(u32));
> +	atomic_set(&sbi->s_next_generation, igen);
> 	setup_timer(&sbi->s_err_report, print_daily_error_info,
> 		(unsigned long) sb);
> 
> --
> 1.8.3.1
> 
> 


Cheers, Andreas
Theodore Ts'o Nov. 9, 2017, 3:23 a.m. | #3
Dmitry, can you try benchmarking this patch?

Thanks!!

						- Ted

commit f0e922e7235e1b5ba6fd964e2cf8dafed3248a15
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Wed Nov 8 22:21:58 2017 -0500

    ext4: improve smp scalability for inode generation
    
    ->s_next_generation is protected by s_next_gen_lock but its usage
    pattern is very primitive.  We don't actually need sequentailly
    increasing new generation numbers, so let's use prandom_u32() instead.
    
    Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 53ce95b52fd8..5e6d7b6f50c7 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1355,8 +1355,6 @@ struct ext4_sb_info {
 	int s_first_ino;
 	unsigned int s_inode_readahead_blks;
 	unsigned int s_inode_goal;
-	spinlock_t s_next_gen_lock;
-	u32 s_next_generation;
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ee823022aa34..da79eb5dba40 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			   inode->i_ino);
 		goto out;
 	}
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode->i_generation = prandom_u32();
 
 	/* Precompute checksum seed for inode metadata */
 	if (ext4_has_metadata_csum(sb)) {
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 144bbda2b808..98ad8172dfd3 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -17,6 +17,7 @@
 #include <linux/uuid.h>
 #include <linux/uaccess.h>
 #include <linux/delay.h>
+#include <linux/random.h>
 #include "ext4_jbd2.h"
 #include "ext4.h"
 #include <linux/fsmap.h>
@@ -157,10 +158,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
 
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	inode_bl->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode->i_generation = prandom_u32();
+	inode_bl->i_generation = prandom_u32();
 
 	ext4_discard_preallocations(inode);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3a278faf5868..9f2e3eb5131f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sbi->s_gdb_count = db_count;
-	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
-	spin_lock_init(&sbi->s_next_gen_lock);
 
 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
Dmitry Monakhov Nov. 10, 2017, 5:33 p.m. | #4
Theodore Ts'o <tytso@mit.edu> writes:

> Dmitry, can you try benchmarking this patch?
Hi,
I do not forget about your patch but it looks like some very strange
things happens since last measurements. create_unlink scenario degradates
significantly from 8-16 threads. It looks like something contented on
VFS because I see same result on xfs.
Even more I do not see this contention with 'perf lock record'. Probably
this is because some crappy locking primitives like hlist_bl which has
no lockdep/lockstat support. I'll notify you once found something.
>
> Thanks!!
>
> 						- Ted
>
> commit f0e922e7235e1b5ba6fd964e2cf8dafed3248a15
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Wed Nov 8 22:21:58 2017 -0500
>
>     ext4: improve smp scalability for inode generation
>     
>     ->s_next_generation is protected by s_next_gen_lock but its usage
>     pattern is very primitive.  We don't actually need sequentailly
>     increasing new generation numbers, so let's use prandom_u32() instead.
>     
>     Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
>     Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 53ce95b52fd8..5e6d7b6f50c7 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1355,8 +1355,6 @@ struct ext4_sb_info {
>  	int s_first_ino;
>  	unsigned int s_inode_readahead_blks;
>  	unsigned int s_inode_goal;
> -	spinlock_t s_next_gen_lock;
> -	u32 s_next_generation;
>  	u32 s_hash_seed[4];
>  	int s_def_hash_version;
>  	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index ee823022aa34..da79eb5dba40 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>  			   inode->i_ino);
>  		goto out;
>  	}
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode->i_generation = prandom_u32();
>  
>  	/* Precompute checksum seed for inode metadata */
>  	if (ext4_has_metadata_csum(sb)) {
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 144bbda2b808..98ad8172dfd3 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -17,6 +17,7 @@
>  #include <linux/uuid.h>
>  #include <linux/uaccess.h>
>  #include <linux/delay.h>
> +#include <linux/random.h>
>  #include "ext4_jbd2.h"
>  #include "ext4.h"
>  #include <linux/fsmap.h>
> @@ -157,10 +158,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
>  
>  	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
>  
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	inode_bl->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode->i_generation = prandom_u32();
> +	inode_bl->i_generation = prandom_u32();
>  
>  	ext4_discard_preallocations(inode);
>  
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3a278faf5868..9f2e3eb5131f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>  	}
>  
>  	sbi->s_gdb_count = db_count;
> -	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
> -	spin_lock_init(&sbi->s_next_gen_lock);
>  
>  	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>
Andreas Dilger Nov. 10, 2017, 10:39 p.m. | #5
On Nov 8, 2017, at 8:23 PM, Theodore Ts'o <tytso@mit.edu> wrote:
> commit f0e922e7235e1b5ba6fd964e2cf8dafed3248a15
> Author: Theodore Ts'o <tytso@mit.edu>
> Date:   Wed Nov 8 22:21:58 2017 -0500
> 
>    ext4: improve smp scalability for inode generation
> 
>    ->s_next_generation is protected by s_next_gen_lock but its usage
>    pattern is very primitive.  We don't actually need sequentailly
>    increasing new generation numbers, so let's use prandom_u32() instead.

We discussed this on the ext4 concall this week, but came to the conclusion
that a 32-bit random number has an unacceptably high chance of collision
(about 1/2^16 allocations) due to the birthday paradox.

Instead, an approach like the following should be OK:
1. num_gen_per_cpu = 2^32/num_online_cpus()
2. initialize CPU0 generation = prandom_u32()
3. initialize CPUn generation = CPU0 + n * num_gen_per_cpu
4. use sequential generation numbers for each local CPU
5. after num_gen_per_cpu generations per CPU hit on any CPU, goto 2.

Cheers, Andreas
> 
>    Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
>    Signed-off-by: Theodore Ts'o <tytso@mit.edu>
> 
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 53ce95b52fd8..5e6d7b6f50c7 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1355,8 +1355,6 @@ struct ext4_sb_info {
> 	int s_first_ino;
> 	unsigned int s_inode_readahead_blks;
> 	unsigned int s_inode_goal;
> -	spinlock_t s_next_gen_lock;
> -	u32 s_next_generation;
> 	u32 s_hash_seed[4];
> 	int s_def_hash_version;
> 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
> index ee823022aa34..da79eb5dba40 100644
> --- a/fs/ext4/ialloc.c
> +++ b/fs/ext4/ialloc.c
> @@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
> 			   inode->i_ino);
> 		goto out;
> 	}
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode->i_generation = prandom_u32();
> 
> 	/* Precompute checksum seed for inode metadata */
> 	if (ext4_has_metadata_csum(sb)) {
> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
> index 144bbda2b808..98ad8172dfd3 100644
> --- a/fs/ext4/ioctl.c
> +++ b/fs/ext4/ioctl.c
> @@ -17,6 +17,7 @@
> #include <linux/uuid.h>
> #include <linux/uaccess.h>
> #include <linux/delay.h>
> +#include <linux/random.h>
> #include "ext4_jbd2.h"
> #include "ext4.h"
> #include <linux/fsmap.h>
> @@ -157,10 +158,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
> 
> 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
> 
> -	spin_lock(&sbi->s_next_gen_lock);
> -	inode->i_generation = sbi->s_next_generation++;
> -	inode_bl->i_generation = sbi->s_next_generation++;
> -	spin_unlock(&sbi->s_next_gen_lock);
> +	inode->i_generation = prandom_u32();
> +	inode_bl->i_generation = prandom_u32();
> 
> 	ext4_discard_preallocations(inode);
> 
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 3a278faf5868..9f2e3eb5131f 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> 	}
> 
> 	sbi->s_gdb_count = db_count;
> -	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
> -	spin_lock_init(&sbi->s_next_gen_lock);
> 
> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
> 


Cheers, Andreas
Andreas Dilger Nov. 10, 2017, 10:55 p.m. | #6
> On Nov 10, 2017, at 3:39 PM, Andreas Dilger <adilger@dilger.ca> wrote:
> 
> On Nov 8, 2017, at 8:23 PM, Theodore Ts'o <tytso@mit.edu> wrote:
>> commit f0e922e7235e1b5ba6fd964e2cf8dafed3248a15
>> Author: Theodore Ts'o <tytso@mit.edu>
>> Date:   Wed Nov 8 22:21:58 2017 -0500
>> 
>>   ext4: improve smp scalability for inode generation
>> 
>>   ->s_next_generation is protected by s_next_gen_lock but its usage
>>   pattern is very primitive.  We don't actually need sequentailly
>>   increasing new generation numbers, so let's use prandom_u32() instead.
> 
> We discussed this on the ext4 concall this week, but came to the conclusion
> that a 32-bit random number has an unacceptably high chance of collision
> (about 1/2^16 allocations) due to the birthday paradox.
> 
> Instead, an approach like the following should be OK:
> 1. num_gen_per_cpu = 2^32/num_online_cpus()
> 2. initialize CPU0 generation = prandom_u32()
> 3. initialize CPUn generation = CPU0 + n * num_gen_per_cpu
> 4. use sequential generation numbers for each local CPU
> 5. after num_gen_per_cpu generations per CPU hit on any CPU, goto 2.

Actually, never mind.  I wrote a test program to verify this, and this
pointed out the flaw in my argument, namely that the birthday paradox
only matters if each random number is compared against all other random
numbers in the pool.

In this case, we only care whether i_generation.old == i_generation.new
for each individual inode, and not whether i_generation.new is the same
as _any_ old generation.

That means there is a 1/2^32 chance of collisions for each allocated
inode, rather than 1/2^16.  With only prandom_u32() there is always
a 1/2^32 chance that creating and deleting the same inode in a loop
will result in two identical sequential generation numbers.

There was also a chance of collisions with the old code, though it
wouldn't happen for a long time when the workload was reusing the
same set of inodes repeatedly (e.g. create and delete in a single
directory).  Instead, it would be more likely to hit for old inodes
that were recently deleted and re-allocated, or after a reboot.

This (mostly|only) affects NFS exports, so I'm not sure how likely
it is to happen in real life.

Cheers, Andreas

>> 
>>   Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
>>   Signed-off-by: Theodore Ts'o <tytso@mit.edu>
>> 
>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>> index 53ce95b52fd8..5e6d7b6f50c7 100644
>> --- a/fs/ext4/ext4.h
>> +++ b/fs/ext4/ext4.h
>> @@ -1355,8 +1355,6 @@ struct ext4_sb_info {
>> 	int s_first_ino;
>> 	unsigned int s_inode_readahead_blks;
>> 	unsigned int s_inode_goal;
>> -	spinlock_t s_next_gen_lock;
>> -	u32 s_next_generation;
>> 	u32 s_hash_seed[4];
>> 	int s_def_hash_version;
>> 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
>> diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
>> index ee823022aa34..da79eb5dba40 100644
>> --- a/fs/ext4/ialloc.c
>> +++ b/fs/ext4/ialloc.c
>> @@ -1138,9 +1138,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
>> 			   inode->i_ino);
>> 		goto out;
>> 	}
>> -	spin_lock(&sbi->s_next_gen_lock);
>> -	inode->i_generation = sbi->s_next_generation++;
>> -	spin_unlock(&sbi->s_next_gen_lock);
>> +	inode->i_generation = prandom_u32();
>> 
>> 	/* Precompute checksum seed for inode metadata */
>> 	if (ext4_has_metadata_csum(sb)) {
>> diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
>> index 144bbda2b808..98ad8172dfd3 100644
>> --- a/fs/ext4/ioctl.c
>> +++ b/fs/ext4/ioctl.c
>> @@ -17,6 +17,7 @@
>> #include <linux/uuid.h>
>> #include <linux/uaccess.h>
>> #include <linux/delay.h>
>> +#include <linux/random.h>
>> #include "ext4_jbd2.h"
>> #include "ext4.h"
>> #include <linux/fsmap.h>
>> @@ -157,10 +158,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
>> 
>> 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
>> 
>> -	spin_lock(&sbi->s_next_gen_lock);
>> -	inode->i_generation = sbi->s_next_generation++;
>> -	inode_bl->i_generation = sbi->s_next_generation++;
>> -	spin_unlock(&sbi->s_next_gen_lock);
>> +	inode->i_generation = prandom_u32();
>> +	inode_bl->i_generation = prandom_u32();
>> 
>> 	ext4_discard_preallocations(inode);
>> 
>> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
>> index 3a278faf5868..9f2e3eb5131f 100644
>> --- a/fs/ext4/super.c
>> +++ b/fs/ext4/super.c
>> @@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
>> 	}
>> 
>> 	sbi->s_gdb_count = db_count;
>> -	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
>> -	spin_lock_init(&sbi->s_next_gen_lock);
>> 
>> 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
>> 
> 
> 
> Cheers, Andreas


Cheers, Andreas
Theodore Ts'o Nov. 10, 2017, 10:57 p.m. | #7
On Fri, Nov 10, 2017 at 08:33:07PM +0300, Dmitry Monakhov wrote:
> I do not forget about your patch but it looks like some very strange
> things happens since last measurements. create_unlink scenario degradates
> significantly from 8-16 threads. It looks like something contented on
> VFS because I see same result on xfs.
> Even more I do not see this contention with 'perf lock record'. Probably
> this is because some crappy locking primitives like hlist_bl which has
> no lockdep/lockstat support. I'll notify you once found something.

OK, I have another version of this patch in response to some concerns
Andreas had over a potential birthday attack issue with using
prandom_u32().  Actually, after looking at this more closely, the
birthday attack doesn't apply, since what matters isn't collisions
between arbitrary generation numbers (or more specifically, NFS file
handles using previousl generation numbers).  The problem only arises
when current incarnation of the inode uses a generation number which
has been used by a previous incarnation of the inode still cached at
some NFS client.

I'm still going back and forth about whether we should use this
version or the one I had posted earlier.  I suspect it's six of one,
half-dozen of the other...

						- Ted

commit f2a95d428e1f6f6fc3d70764d209673047545aa0
Author: Theodore Ts'o <tytso@mit.edu>
Date:   Fri Nov 10 16:25:57 2017 -0500

    ext4: improve smp scalability for inode generation
    
    ->s_next_generation is protected by s_next_gen_lock but its usage
    pattern is very primitive.  Replace it with a per-cpu set of
    counters.
    
    Reported-by: Dmitry Monakhov <dmonakhov@openvz.org>
    Signed-off-by: Theodore Ts'o <tytso@mit.edu>

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 53ce95b52fd8..167457109457 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1355,8 +1355,6 @@ struct ext4_sb_info {
 	int s_first_ino;
 	unsigned int s_inode_readahead_blks;
 	unsigned int s_inode_goal;
-	spinlock_t s_next_gen_lock;
-	u32 s_next_generation;
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
@@ -2371,6 +2369,8 @@ extern int ext4fs_dirhash(const char *name, int len, struct
 			  dx_hash_info *hinfo);
 
 /* ialloc.c */
+void __init ext4_generation_setup(void);
+__u32 ext4_new_generation(void);
 extern struct inode *__ext4_new_inode(handle_t *, struct inode *, umode_t,
 				      const struct qstr *qstr, __u32 goal,
 				      uid_t *owner, __u32 i_flags,
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ee823022aa34..ccf250343768 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -758,6 +758,24 @@ static int find_inode_bit(struct super_block *sb, ext4_group_t group,
 	return 1;
 }
 
+static DEFINE_PER_CPU(__u32, generation_counter);
+
+void __init ext4_generation_setup(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu) {
+		__u32 *p = per_cpu_ptr(&generation_counter, cpu);
+
+		*p = prandom_u32();
+	}
+}
+
+__u32 ext4_new_generation(void)
+{
+	return this_cpu_inc_return(generation_counter);
+}
+
 /*
  * There are two policies for allocating an inode.  If the new inode is
  * a directory, then a forward search is made for a block group with both
@@ -1138,9 +1156,7 @@ struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			   inode->i_ino);
 		goto out;
 	}
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode->i_generation = ext4_new_generation();
 
 	/* Precompute checksum seed for inode metadata */
 	if (ext4_has_metadata_csum(sb)) {
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index 144bbda2b808..14d41d122907 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -157,10 +157,8 @@ static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
 
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	inode_bl->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode->i_generation = ext4_new_generation();
+	inode_bl->i_generation = ext4_new_generation();
 
 	ext4_discard_preallocations(inode);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 3a278faf5868..49aef2e30f9d 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3982,8 +3982,6 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sbi->s_gdb_count = db_count;
-	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
-	spin_lock_init(&sbi->s_next_gen_lock);
 
 	timer_setup(&sbi->s_err_report, print_daily_error_info, 0);
 
@@ -5793,6 +5791,7 @@ static int __init ext4_init_fs(void)
 
 	/* Build-time check for flags consistency */
 	ext4_check_flag_values();
+	ext4_generation_setup();
 
 	for (i = 0; i < EXT4_WQ_HASH_SZ; i++)
 		init_waitqueue_head(&ext4__ioend_wq[i]);

Patch

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e2abe01..6be1aa8 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1392,8 +1392,7 @@  struct ext4_sb_info {
 	int s_first_ino;
 	unsigned int s_inode_readahead_blks;
 	unsigned int s_inode_goal;
-	spinlock_t s_next_gen_lock;
-	u32 s_next_generation;
+	atomic_t s_next_generation;
 	u32 s_hash_seed[4];
 	int s_def_hash_version;
 	int s_hash_unsigned;	/* 3 if hash should be signed, 0 if not */
diff --git a/fs/ext4/ialloc.c b/fs/ext4/ialloc.c
index ee82302..d12dabc 100644
--- a/fs/ext4/ialloc.c
+++ b/fs/ext4/ialloc.c
@@ -1138,9 +1138,7 @@  struct inode *__ext4_new_inode(handle_t *handle, struct inode *dir,
 			   inode->i_ino);
 		goto out;
 	}
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode->i_generation = atomic_inc_return(&sbi->s_next_generation);
 
 	/* Precompute checksum seed for inode metadata */
 	if (ext4_has_metadata_csum(sb)) {
diff --git a/fs/ext4/ioctl.c b/fs/ext4/ioctl.c
index afb66d4..7d8b1a5 100644
--- a/fs/ext4/ioctl.c
+++ b/fs/ext4/ioctl.c
@@ -157,10 +157,8 @@  static long swap_inode_boot_loader(struct super_block *sb,
 
 	inode->i_ctime = inode_bl->i_ctime = current_time(inode);
 
-	spin_lock(&sbi->s_next_gen_lock);
-	inode->i_generation = sbi->s_next_generation++;
-	inode_bl->i_generation = sbi->s_next_generation++;
-	spin_unlock(&sbi->s_next_gen_lock);
+	inode_bl->i_generation = atomic_add_return(2, &sbi->s_next_generation);
+	inode->i_generation = inode_bl->i_generation -1;
 
 	ext4_discard_preallocations(inode);
 
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index b104096..bfc6d2e 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -3419,7 +3419,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	int err = 0;
 	unsigned int journal_ioprio = DEFAULT_JOURNAL_IOPRIO;
 	ext4_group_t first_not_zeroed;
-
+	u32 igen;
+	
 	if ((data && !orig_data) || !sbi)
 		goto out_free_base;
 
@@ -3977,9 +3978,8 @@  static int ext4_fill_super(struct super_block *sb, void *data, int silent)
 	}
 
 	sbi->s_gdb_count = db_count;
-	get_random_bytes(&sbi->s_next_generation, sizeof(u32));
-	spin_lock_init(&sbi->s_next_gen_lock);
-
+	get_random_bytes(&igen, sizeof(u32));
+	atomic_set(&sbi->s_next_generation, igen);
 	setup_timer(&sbi->s_err_report, print_daily_error_info,
 		(unsigned long) sb);