
fs: pipe/sockets/anon dentries should not have a parent

Message ID 4926D022.5060008@cosmosbay.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Nov. 21, 2008, 3:13 p.m. UTC
Eric Dumazet wrote:
> David Miller wrote:
>> From: Eric Dumazet <dada1@cosmosbay.com>
>> Date: Fri, 21 Nov 2008 09:51:32 +0100
>>
>>> Now, I wish sockets and pipes did not go through the dcache; not a
>>> tbench affair of course, but real workloads...
>>>
>>> running 8 processes on a 8 way machine doing a
>>> for (;;)
>>>     close(socket(AF_INET, SOCK_STREAM, 0));
>>>
>>> is slow as hell, we hit so many contended cache lines ...
>>>
>>> ticket spin locks are slower in this case (dcache_lock for example
>>> is taken twice when we allocate a socket(), once in d_alloc(),
>>> another one in d_instantiate())
>>
>> As you of course know, this used to be a ton worse.  At least now
>> these things are unhashed. :)
> 
> Well, this is dust compared to what we currently have.
> 
> To allocate a socket we :
> 0) Do the usual file manipulation (pretty scalable these days)
>   (but recent drop_file_write_access() and co slow down a bit)
> 1) allocate an inode with new_inode()
>    This function :
>     - locks inode_lock,
>     - dirties nr_inodes counter
>     - dirties inode_in_use list  (for sockets, I doubt it is useful)
>     - dirties superblock s_inodes.
>     - dirties last_ino counter
> All these are in different cache lines of course.
> 2) allocate a dentry
>   d_alloc() takes dcache_lock,
>   insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
>   dirties nr_dentry
> 3) d_instantiate() dentry  (dcache_lock taken again)
> 4) init_file() -> atomic_inc on sock_mnt->refcount (in case we want to 
> umount this vfs ...)
> 
> 
> 
> At close() time, we must undo these things. It's even more expensive
> because of the _atomic_dec_and_lock() that stresses a lot, and because
> of the two cache lines that are touched when an element is deleted from a list.
> 
> for (i = 0; i < 1000*1000; i++)
>     close(socket(AF_INET, SOCK_STREAM, 0));
> 
> Cost if run on one cpu :
> 
> real    0m1.561s
> user    0m0.092s
> sys     0m1.469s
> 
> If run on 8 CPUs :
> 
> real    0m27.496s
> user    0m0.657s
> sys     3m39.092s
> 
> 

[PATCH] fs: pipe/sockets/anon dentries should not have a parent

Linking pipe/socket/anon dentries to one root 'parent' has no functional
impact at all, but it does have a scalability impact.

We can avoid touching a cache line at allocation time (inside d_alloc(), no need
to touch root->d_count), but also at freeing time (in d_kill(), decrementing d_count).
We avoid an expensive atomic_dec_and_lock() call on the root dentry.

If we correct dnotify_parent() and inotify_d_instantiate() to take into account
a NULL d_parent, we can call d_alloc() with a NULL parent instead of the root dentry.

Before the patch, the time to run 8 million close(socket()) calls on 8 CPUs was :

real    0m27.496s
user    0m0.657s
sys     3m39.092s

After the patch :

real    0m23.997s
user    0m0.682s
sys     3m11.193s


Old oprofile :
CPU: Core 2, speed 3000.11 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
164257   164257        11.0245  11.0245    init_file
155488   319745        10.4359  21.4604    d_alloc
151887   471632        10.1942  31.6547    _atomic_dec_and_lock
91620    563252         6.1493  37.8039    inet_create
74245    637497         4.9831  42.7871    kmem_cache_alloc
46702    684199         3.1345  45.9216    dentry_iput
46186    730385         3.0999  49.0215    tcp_close
42824    773209         2.8742  51.8957    kmem_cache_free
37275    810484         2.5018  54.3975    wake_up_inode
36553    847037         2.4533  56.8508    tcp_v4_init_sock
35661    882698         2.3935  59.2443    inotify_d_instantiate
32998    915696         2.2147  61.4590    sysenter_past_esp
31442    947138         2.1103  63.5693    d_instantiate
31303    978441         2.1010  65.6703    generic_forget_inode
27533    1005974        1.8479  67.5183    vfs_dq_drop
24237    1030211        1.6267  69.1450    sock_attach_fd
19290    1049501        1.2947  70.4397    __copy_from_user_ll


New oprofile :
CPU: Core 2, speed 3000.24 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
147287   147287        10.3984  10.3984    new_inode
144884   292171        10.2287  20.6271    inet_create
93670    385841         6.6131  27.2402    init_file
89852    475693         6.3435  33.5837    wake_up_inode
80910    556603         5.7122  39.2959    kmem_cache_alloc
53588    610191         3.7833  43.0792    _atomic_dec_and_lock
44341    654532         3.1305  46.2096    generic_forget_inode
38710    693242         2.7329  48.9425    kmem_cache_free
37605    730847         2.6549  51.5974    tcp_v4_init_sock
37228    768075         2.6283  54.2257    d_alloc
34085    802160         2.4064  56.6321    tcp_close
32550    834710         2.2980  58.9301    sysenter_past_esp
25931    860641         1.8307  60.7608    vfs_dq_drop
24458    885099         1.7267  62.4875    d_kill
22015    907114         1.5542  64.0418    dentry_iput
18877    925991         1.3327  65.3745    __copy_from_user_ll
17873    943864         1.2618  66.6363    mwait_idle

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/anon_inodes.c |    2 +-
 fs/dnotify.c     |    2 +-
 fs/inotify.c     |    2 +-
 fs/pipe.c        |    2 +-
 net/socket.c     |    2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

Comments

Ingo Molnar Nov. 21, 2008, 3:21 p.m. UTC | #1
* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Before the patch, the time to run 8 million close(socket()) calls on 8
> CPUs was :
>
> real    0m27.496s
> user    0m0.657s
> sys     3m39.092s
>
> After the patch :
>
> real    0m23.997s
> user    0m0.682s
> sys     3m11.193s

cool :-)

What would it take to get it down to:

>> Cost if run on one cpu :
>>
>> real    0m1.561s
>> user    0m0.092s
>> sys     0m1.469s

i guess asking for a wall-clock cost of 1.561/8 would be too much? :)

	Ingo
Eric Dumazet Nov. 21, 2008, 3:28 p.m. UTC | #2
Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
>> Before the patch, the time to run 8 million close(socket()) calls on 8
>> CPUs was :
>>
>> real    0m27.496s
>> user    0m0.657s
>> sys     3m39.092s
>>
>> After the patch :
>>
>> real    0m23.997s
>> user    0m0.682s
>> sys     3m11.193s
> 
> cool :-)
> 
> What would it take to get it down to:
> 
>>> Cost if run on one cpu :
>>>
>>> real    0m1.561s
>>> user    0m0.092s
>>> sys     0m1.469s
> 
> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
> 

It might be possible, depending on the level of hackery I am allowed to inject
in fs/dcache.c and fs/inode.c :)

A wall-clock cost of 1.56s (each cpu runs one loop of one million iterations).


Ingo Molnar Nov. 21, 2008, 3:34 p.m. UTC | #3
* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Ingo Molnar wrote:
>> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>>
>>> Before the patch, the time to run 8 million close(socket()) calls on 8
>>> CPUs was :
>>>
>>> real    0m27.496s
>>> user    0m0.657s
>>> sys     3m39.092s
>>>
>>> After the patch :
>>>
>>> real    0m23.997s
>>> user    0m0.682s
>>> sys     3m11.193s
>>
>> cool :-)
>>
>> What would it take to get it down to:
>>
>>>> Cost if run on one cpu :
>>>>
>>>> real    0m1.561s
>>>> user    0m0.092s
>>>> sys     0m1.469s
>>
>> i guess asking for a wall-clock cost of 1.561/8 would be too much? :)
>>
>
> It might be possible, depending on the level of hackery I am allowed 
> to inject in fs/dcache.c and fs/inode.c :)

I think being able to open+close sockets in a scalable way is an 
undisputed prime-time workload on Linux. The numbers you showed look 
horrible.

Once you can show how much faster it could go via hacks, it should 
only be a matter of time to achieve that safely and cleanly.

> A wall-clock cost of 1.56s (each cpu runs one loop of one million iterations).

(indeed.)

	Ingo
Christoph Hellwig Nov. 21, 2008, 3:36 p.m. UTC | #4
On Fri, Nov 21, 2008 at 04:13:38PM +0100, Eric Dumazet wrote:
> [PATCH] fs: pipe/sockets/anon dentries should not have a parent
>
> Linking pipe/socket/anon dentries to one root 'parent' has no functional
> impact at all, but it does have a scalability impact.
>
> We can avoid touching a cache line at allocation time (inside d_alloc(), no need
> to touch root->d_count), but also at freeing time (in d_kill(), decrementing d_count).
> We avoid an expensive atomic_dec_and_lock() call on the root dentry.
>
> If we correct dnotify_parent() and inotify_d_instantiate() to take into account
> a NULL d_parent, we can call d_alloc() with a NULL parent instead of the root dentry.

Sorry folks, but a NULL d_parent is a no-go from the VFS perspective;
you can, however, set d_parent to the dentry itself, which is the magic used
for root-of-tree dentries.  They should also be marked
DCACHE_DISCONNECTED to make sure this is not unexpected.

And this kind of stuff really needs to go through -fsdevel.
Eric Dumazet Nov. 26, 2008, 11:27 p.m. UTC | #5
Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 1.6 seconds)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
 but would be faster if 'struct file' used SLAB_DESTROY_BY_RCU,
 avoiding the call_rcu() cache killer)

1) allocate an inode with new_inode()
 This function :
  - locks inode_lock,
  - dirties the nr_inodes counter
  - dirties the inode_in_use list  (for sockets/pipes, this is useless)
  - dirties superblock s_inodes.
  - dirties the last_ino counter
All of these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo these things. It's even more expensive because
of the _atomic_dec_and_lock() that stresses a lot, and because of the two cache
lines that are touched when an element is deleted from a list
(the previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the dcache
or in a per-super-block inode list.
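
As a rough illustration, the hot path described above boils down to
something like this (a condensed sketch against the 2.6.28-era VFS API,
not the actual net/socket.c code; error handling omitted, contended
operations called out in comments) :

static struct dentry *sock_dentry_alloc_sketch(struct vfsmount *sock_mnt)
{
        static const struct qstr name = { .name = "" };
        struct super_block *sb = sock_mnt->mnt_sb;
        struct dentry *dentry;
        struct inode *inode;

        /* inode_lock; dirties nr_inodes, inode_in_use, s_inodes, last_ino */
        inode = new_inode(sb);
        /* dcache_lock; dirties sb->s_root->d_count and nr_dentry */
        dentry = d_alloc(sb->s_root, &name);
        /* dcache_lock taken a second time */
        d_instantiate(dentry, inode);
        /* init_file() then does atomic_inc() on sock_mnt->mnt_count */
        return dentry;
}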

This patch series gets rid of all the contended cache lines for sockets, pipes
and anonymous fds  (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
  close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on an 8-CPU machine
(benchmark named socket8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all the contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.325s   (instead of 1.561s)
user    0.091s
sys     1.234s


If run on 8 CPUs :

real    2.229s     <<<< instead of 27.496s >>>>
user    0.695s
sys     16.903s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 2999.74 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
143791   143791        11.7849  11.7849    __slab_free
128404   272195        10.5238  22.3087    add_partial
99150    371345         8.1262  30.4349    kmem_cache_alloc
52031    423376         4.2644  34.6993    sysenter_past_esp
47752    471128         3.9137  38.6130    kmem_cache_free
47429    518557         3.8872  42.5002    tcp_close
34376    552933         2.8174  45.3176    __percpu_counter_add
29046    581979         2.3806  47.6982    copy_from_user
28249    610228         2.3152  50.0134    init_timer
26220    636448         2.1490  52.1624    __slab_alloc
23402    659850         1.9180  54.0803    discard_slab
20560    680410         1.6851  55.7654    __call_rcu
18288    698698         1.4989  57.2643    d_alloc
16425    715123         1.3462  58.6104    get_empty_filp
16237    731360         1.3308  59.9412    __fput
15729    747089         1.2891  61.2303    alloc_fd
15021    762110         1.2311  62.4614    alloc_inode
14690    776800         1.2040  63.6654    sys_close
14666    791466         1.2020  64.8674    inet_create
13638    805104         1.1178  65.9852    dput
12503    817607         1.0247  67.0099    iput_special
12231    829838         1.0024  68.0123    lock_sock_nested
12210    842048         1.0007  69.0130    fd_install
12137    854185         0.9947  70.0078    d_alloc_special
12058    866243         0.9883  70.9960    sock_init_data
11200    877443         0.9179  71.9140    release_sock
11114    888557         0.9109  72.8248    inotify_d_instantiate

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 at boot, or we use Christoph Lameter's
patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

New cost if run on one cpu :

real    1.307s
user    0.094s
sys     1.214s

If run on 8 CPUs :

real    1.625s     <<<< instead of 27.496s or 2.229s >>>>
user    0.771s
sys     12.061s

Oprofile results (for the 8 process run, 3 times):
CPU: Core 2, speed 3000.05 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
108005   108005        11.0758  11.0758    kmem_cache_alloc
52023    160028         5.3349  16.4107    sysenter_past_esp
47363    207391         4.8570  21.2678    tcp_close
45430    252821         4.6588  25.9266    kmem_cache_free
36566    289387         3.7498  29.6764    __percpu_counter_add
36085    325472         3.7005  33.3769    __slab_free
29185    354657         2.9929  36.3698    copy_from_user
28210    382867         2.8929  39.2627    init_timer
25663    408530         2.6317  41.8944    d_alloc_special
22360    430890         2.2930  44.1874    cap_file_alloc_security
19237    450127         1.9727  46.1601    __call_rcu
19097    469224         1.9584  48.1185    d_alloc
16962    486186         1.7394  49.8580    alloc_fd
16315    502501         1.6731  51.5311    __fput
16102    518603         1.6512  53.1823    get_empty_filp
14954    533557         1.5335  54.7158    inet_create
14468    548025         1.4837  56.1995    alloc_inode
14198    562223         1.4560  57.6555    sys_close
13905    576128         1.4259  59.0814    dput
12262    588390         1.2575  60.3389    lock_sock_nested
12203    600593         1.2514  61.5903    sock_attach_fd
12147    612740         1.2457  62.8360    iput_special
12049    624789         1.2356  64.0716    fd_install
12033    636822         1.2340  65.3056    sock_init_data
11999    648821         1.2305  66.5361    release_sock
11231    660052         1.1517  67.6878    inotify_d_instantiate
11068    671120         1.1350  68.8228    inet_csk_destroy_sock


This patch series contains 6 patches against the net-next-2.6 tree
(because that tree already contains network improvements on this
subject), but it should apply to other trees.

[PATCH 1/6] fs: Introduce a per_cpu nr_dentry

Adding a per_cpu nr_dentry avoids cache line ping pongs between
cpus to maintain this metric.

We centralize decrements of nr_dentry in d_free(),
and increments in d_alloc().

d_alloc() can avoid taking dcache_lock if parent is NULL
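
A minimal sketch of the per-cpu counter idea, with hypothetical helper
names (the real patch folds the updates directly into d_alloc() and
d_free()) :

static DEFINE_PER_CPU(int, nr_dentry);

static inline void nr_dentry_inc(void)          /* called from d_alloc() */
{
        get_cpu_var(nr_dentry)++;
        put_cpu_var(nr_dentry);
}

static inline void nr_dentry_dec(void)          /* called from d_free() */
{
        get_cpu_var(nr_dentry)--;
        put_cpu_var(nr_dentry);
}

/* slow path, e.g. when reporting dentry_stat.nr_dentry */
static int nr_dentry_read(void)
{
        int cpu, sum = 0;

        for_each_possible_cpu(cpu)
                sum += per_cpu(nr_dentry, cpu);
        return sum;
}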


[PATCH 2/6] fs: Introduce special dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SPECIAL flag, to mark a dentry as
a special one (for sockets, pipes, anonymous fd), and a new
d_alloc_special(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_special() for
special dentries.

Differences between a special dentry and a normal one are :

1) A special dentry has the DCACHE_SPECIAL flag
2) A special dentry's parent is itself
  This avoids taking a reference on the 'root' dentry, which is shared
  by too many dentries.
3) They are not hashed into the global hash table
4) Their d_alias list is empty

Internally, dput() can avoid an expensive atomic_dec_and_lock()
for special dentries.


(socket8 bench result : from 27.5s to 25.5s) 

[PATCH 3/6] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino.

Note : last_ino_get() method must be called with preemption disabled.

(socket8 bench result : 25.5s to 25s, almost no difference, but
this is because the inode_lock cost is still too heavy at this point)
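
A sketch of what such an allocator can look like (variable names are
assumptions; note the preemption requirement mentioned above) :

#define LAST_INO_BATCH 1024

static DEFINE_PER_CPU(unsigned int, cpu_last_ino);
static atomic_t shared_last_ino;

/* caller must have preemption disabled */
static unsigned int last_ino_get(void)
{
        unsigned int *p = &__get_cpu_var(cpu_last_ino);
        unsigned int res = *p;

        /* refill the per-cpu window once per LAST_INO_BATCH allocations */
        if (unlikely((res & (LAST_INO_BATCH - 1)) == 0))
                res = atomic_add_return(LAST_INO_BATCH, &shared_last_ino)
                        - LAST_INO_BATCH;
        *p = ++res;
        return res;
}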

[PATCH 4/6] fs: Introduce a per_cpu nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result : 25s to 20.5s)

[PATCH 5/6] fs: Introduce special inodes

 The goal of this patch is to not touch inode_lock for socket/pipe/anonfd
 inode allocation/freeing.

 In new_inode(), we test if the super block has the MS_SPECIAL flag set.
 If yes, we put the inode neither on the "inode_in_use" list nor on the
 "sb->s_inodes" list. As inode_lock was taken only to protect these lists,
 we avoid it as well.

 Using iput_special() from dput_special() avoids taking inode_lock
 at freeing time.

 This patch has a very noticeable effect, because we avoid dirtying
 three contended cache lines in new_inode(), and five cache lines
 in iput().

Note: Not sure if we can use MS_SPECIAL=MS_NOUSER, or if we
really need a different flag.

(socket8 bench result : from 20.5s to 2.94s) 

[PATCH 6/6] fs: Introduce kern_mount_special() to mount special vfs

This function sets a flag (MNT_SPECIAL) on the vfsmount, to avoid
refcounting on permanent system mounts.
Use this function for sockets, pipes and anonymous fds.

(socket8 bench result : from 2.94s to 2.23s)
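
A plausible shape for this helper; only the name kern_mount_special()
and the MNT_SPECIAL flag come from the description above, the body is
an assumption :

struct vfsmount *kern_mount_special(struct file_system_type *type)
{
        struct vfsmount *mnt = kern_mount(type);

        /* mntput()/mntget() will skip refcounting for this mount */
        if (!IS_ERR(mnt))
                mnt->mnt_flags |= MNT_SPECIAL;
        return mnt;
}

A caller like sock_init() would then simply do
sock_mnt = kern_mount_special(&sock_fs_type);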

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   19 +-----
 fs/dcache.c            |  106 ++++++++++++++++++++++++++++++++-------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++------
 fs/pipe.c              |   28 +---------
 fs/super.c             |    9 +++
 include/linux/dcache.h |    2
 include/linux/fs.h     |    8 ++
 include/linux/mount.h  |    5 +
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   27 +--------
 12 files changed, 212 insertions(+), 103 deletions(-)

Christoph Lameter Nov. 27, 2008, 1:37 a.m. UTC | #6
On Thu, 27 Nov 2008, Eric Dumazet wrote:

> The last point is about SLUB being hit hard, unless we
> use slub_min_order=3 at boot, or we use Christoph Lameter's
> patch (struct file RCU optimizations)
> http://thread.gmane.org/gmane.linux.kernel/418615
>
> If we boot machine with slub_min_order=3, SLUB overhead disappears.


I'd rather not be that drastic. Did you try increasing slub_min_objects
instead? Try 40-100. If we find the right number then we should update
the tuning to make sure that it picks the right slab page sizes.
Eric Dumazet Nov. 27, 2008, 6:27 a.m. UTC | #7
Christoph Lameter wrote:
> On Thu, 27 Nov 2008, Eric Dumazet wrote:
> 
>> The last point is about SLUB being hit hard, unless we
>> use slub_min_order=3 at boot, or we use Christoph Lameter's
>> patch (struct file RCU optimizations)
>> http://thread.gmane.org/gmane.linux.kernel/418615
>>
>> If we boot machine with slub_min_order=3, SLUB overhead disappears.
> 
> 
> I'd rather not be that drastic. Did you try increasing slub_min_objects
> instead? Try 40-100. If we find the right number then we should update
> the tuning to make sure that it picks the right slab page sizes.
> 
> 

4096/192 = 21 (a 4KB page holds at most 21 of the 192-byte filp objects)

with slub_min_objects=22 :

# cat /sys/kernel/slab/filp/order
1
# time ./socket8
real    0m1.725s
user    0m0.685s
sys     0m12.955s

with slub_min_objects=45 :

# cat /sys/kernel/slab/filp/order
2
# time ./socket8
real    0m1.652s
user    0m0.694s
sys     0m12.367s

with slub_min_objects=80 :

# cat /sys/kernel/slab/filp/order
3
# time ./socket8
real    0m1.642s
user    0m0.719s
sys     0m12.315s

I would say slub_min_objects=45 is the optimal value on 32-bit arches to
get acceptable performance on this workload (order=2 for the filp kmem_cache).

Note : SLAB here is disastrous, but you already knew that :)

real    0m8.128s
user    0m0.748s
sys     1m3.467s

Christoph Hellwig Nov. 27, 2008, 9:39 a.m. UTC | #8
As I told you before, you absolutely must include the fsdevel list and
the VFS maintainer for a patchset like this.
Christoph Lameter Nov. 27, 2008, 2:44 p.m. UTC | #9
On Thu, 27 Nov 2008, Eric Dumazet wrote:

> with slub_min_objects=45 :
>
> # cat /sys/kernel/slab/filp/order
> 2
> # time ./socket8
> real    0m1.652s
> user    0m0.694s
> sys     0m12.367s

That may be a good value. How many processors do you have? Look at
calculate_order() in mm/slub.c:

        if (!min_objects)
                min_objects = 4 * (fls(nr_cpu_ids) + 1);

We could increase the scaling factor there or start
with a minimum of 20 objects?


Try

	min_objects = 20 + 4 * (fls(nr_cpu_ids) + 1);

> I would say slub_min_objects=45 is the optimal value on 32-bit arches to
> get acceptable performance on this workload (order=2 for the filp kmem_cache).
>
> Note : SLAB here is disastrous, but you already knew that :)

It's good though to have examples where the queue management gets in the
way of performance.
Ingo Molnar Nov. 28, 2008, 6:03 p.m. UTC | #10
* Eric Dumazet <dada1@cosmosbay.com> wrote:

> Hi all
>
> Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> (from 27.5 seconds to 1.6 seconds)

Wow, that's incredibly impressive! :-)

	Ingo
Peter Zijlstra Nov. 28, 2008, 6:47 p.m. UTC | #11
On Fri, 2008-11-28 at 19:03 +0100, Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > Hi all
> >
> > Short summary : Nice speedups for allocation/deallocation of sockets/pipes
> > (from 27.5 seconds to 1.6 seconds)
> 
> Wow, that's incredibly impressive! :-)

Yeah, we got a similar speedup on -rt by pushing the super-block files
list into per-cpu lists and doing crazy locking on them.

Of course avoiding them altogether, like done here, is a nicer option,
but that is sadly not a possibility for regular files (until hch gets
around to removing the need for the list).



Christoph Hellwig Nov. 29, 2008, 6:38 a.m. UTC | #12
On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
> > Wow, that's incredibly impressive! :-)
> 
> Yeah, we got a similar speedup on -rt by pushing the super-block files
> list into per-cpu lists and doing crazy locking on them.
>
> Of course avoiding them altogether, like done here, is a nicer option,
> but that is sadly not a possibility for regular files (until hch gets
> around to removing the need for the list).

We should have finished this long ago, thanks for the reminder.

Eric Dumazet Nov. 29, 2008, 8:07 a.m. UTC | #13
Christoph Hellwig wrote:
> On Fri, Nov 28, 2008 at 07:47:56PM +0100, Peter Zijlstra wrote:
>>> Wow, that's incredibly impressive! :-)
>> Yeah, we got a similar speedup on -rt by pushing those super-block files
>> list into per-cpu lists and doing crazy locking on them.
>>
>> Of course avoiding them all together, like done here is a nicer option
>> but is sadly not a possibility for regular files (until hch gets around
>> to removing the need for the list).
> 
> We should have finished this long ago, thanks for the reminder.
> 
> 

inode_in_use could be percpu, at least.

Or just zap it, since we never have to scan it.

Eric Dumazet Nov. 29, 2008, 8:43 a.m. UTC | #14
Hi all

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 2.9 seconds; 2.3 seconds with SLUB tweaks)

Long version :

For this second version, I removed the mntput()/mntget() optimization
since most reviewers are not convinced it is useful. This is a four-line
patch that can be reconsidered later.

I chose the name SINGLE instead of SPECIAL to name
isolated dentries (for sockets, pipes, anonymous fd) that
have no parent and no relationship in the vfs.

Thanks all

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct file' used SLAB_DESTROY_BY_RCU,
avoiding the call_rcu() cache killer)

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties the nr_inodes counter
- dirties the inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.
- dirties the last_ino counter
All of these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo these things. It's even more expensive because
of the _atomic_dec_and_lock() that stresses a lot, and because of the two cache
lines that are touched when an element is deleted from a list
(the previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the dcache
or in a per-super-block inode list.

This patch series gets rid of all but one of the contended cache lines for
sockets, pipes and anonymous fds  (signalfd, timerfd, ...)

Sample program :

for (i = 0; i < 1000000; i++)
	close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on an 8-CPU machine
(benchmark named socket8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all the contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.325s   (instead of 1.561s)
user    0.091s
sys     1.234s


If run on 8 CPUs :

real    0m2.971s
user    0m0.726s
sys     0m21.310s

CPU: Core 2, speed 3000.04 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
189772   189772        12.7205  12.7205    _atomic_dec_and_lock
140467   330239         9.4155  22.1360    __slab_free
128210   458449         8.5940  30.7300    add_partial
121578   580027         8.1494  38.8794    kmem_cache_alloc
72626    652653         4.8681  43.7475    init_file
62720    715373         4.2041  47.9517    __percpu_counter_add
51632    767005         3.4609  51.4126    sysenter_past_esp
49196    816201         3.2976  54.7102    tcp_close
47933    864134         3.2130  57.9231    kmem_cache_free
29628    893762         1.9860  59.9091    copy_from_user
28443    922205         1.9065  61.8157    init_timer
25602    947807         1.7161  63.5318    __slab_alloc
22139    969946         1.4840  65.0158    discard_slab
20428    990374         1.3693  66.3851    __call_rcu
18174    1008548        1.2182  67.6033    alloc_fd
17643    1026191        1.1826  68.7859    __fput
17374    1043565        1.1646  69.9505    d_alloc
17196    1060761        1.1527  71.1031    sys_close
17024    1077785        1.1411  72.2442    inet_create
15208    1092993        1.0194  73.2636    alloc_inode
12201    1105194        0.8178  74.0815    fd_install
12167    1117361        0.8156  74.8970    lock_sock_nested
12123    1129484        0.8126  75.7096    get_empty_filp
11648    1141132        0.7808  76.4904    release_sock
11509    1152641        0.7715  77.2619    dput
11335    1163976        0.7598  78.0216    sock_init_data
11038    1175014        0.7399  78.7615    inet_csk_destroy_sock
10880    1185894        0.7293  79.4908    drop_file_write_access
10083    1195977        0.6759  80.1667    inotify_d_instantiate
9216     1205193        0.6178  80.7844    local_bh_enable_ip
8881     1214074        0.5953  81.3797    sysenter_do_call
8759     1222833        0.5871  81.9668    setup_object
8489     1231322        0.5690  82.5359    iput_single

So we now hit mntput()/mntget() and SLUB.

The last point is about SLUB being hit hard, unless we
use slub_min_order=3 (or slub_min_objects=45) at boot,
or we use Christoph Lameter's patch (struct file RCU optimizations)
http://thread.gmane.org/gmane.linux.kernel/418615

If we boot machine with slub_min_order=3, SLUB overhead disappears.

If run on 8 CPUs :

real    0m2.315s
user    0m0.752s
sys     0m17.324s

CPU: Core 2, speed 3000.15 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
199409   199409        15.6440  15.6440    _atomic_dec_and_lock    (mntput())
141606   341015        11.1092  26.7532    kmem_cache_alloc
76071    417086         5.9679  32.7211    init_file
70595    487681         5.5383  38.2595    __percpu_counter_add
51595    539276         4.0477  42.3072    sysenter_past_esp
49313    588589         3.8687  46.1759    tcp_close
45503    634092         3.5698  49.7457    kmem_cache_free
41413    675505         3.2489  52.9946    __slab_free
29911    705416         2.3466  55.3412    copy_from_user
28979    734395         2.2735  57.6146    init_timer
22251    756646         1.7456  59.3602    get_empty_filp
19942    776588         1.5645  60.9247    __call_rcu
18348    794936         1.4394  62.3642    __fput
18328    813264         1.4379  63.8020    alloc_fd
17395    830659         1.3647  65.1667    sys_close
17301    847960         1.3573  66.5240    d_alloc
16570    864530         1.2999  67.8239    inet_create
15522    880052         1.2177  69.0417    alloc_inode
13185    893237         1.0344  70.0761    setup_object
12359    905596         0.9696  71.0456    fd_install
12275    917871         0.9630  72.0086    lock_sock_nested
11924    929795         0.9355  72.9441    release_sock
11790    941585         0.9249  73.8690    sock_init_data
11310    952895         0.8873  74.7563    dput
10924    963819         0.8570  75.6133    drop_file_write_access
10903    974722         0.8554  76.4687    inet_csk_destroy_sock
10184    984906         0.7990  77.2676    inotify_d_instantiate
9372     994278         0.7353  78.0029    local_bh_enable_ip
8901     1003179        0.6983  78.7012    sysenter_do_call
8569     1011748        0.6723  79.3735    iput_single
8194     1019942        0.6428  80.0163    inet_release


This patch series contains 5 patches against the net-next-2.6 tree
(because that tree already contains network improvements on this
subject), but it should apply to other trees.

[PATCH 1/5] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry.

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

(socket8 bench result : 27.5s to 25s)
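
A minimal sketch of the percpu_counter variant, assuming the counter
simply replaces the dentry_stat field (percpu_counter_init() done once
at boot; d_free_sketch() is a hypothetical stand-in for d_free()) :

static struct percpu_counter nr_dentry;

static void d_free_sketch(struct dentry *dentry)
{
        percpu_counter_dec(&nr_dentry);         /* per-cpu, no dcache_lock */
        /* ... usual freeing work (d_release, call_rcu, ...) ... */
}

/* an approximate read no longer needs dcache_lock either */
static s64 nr_dentry_read(void)
{
        return percpu_counter_sum_positive(&nr_dentry);
}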

[PATCH 2/5] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

(socket8 bench result : no difference at this point)

[PATCH 3/5] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of inode numbers as before
(same wraparound after 2^32 allocations).

(socket8 bench result : no difference)


[PATCH 4/5] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are :

1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This avoids taking a reference on the sb 'root' dentry, which is shared
by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)
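
A hedged sketch of the two fast paths: the names d_alloc_single(),
dput_single() and the DCACHE_* flags come from the description above,
but the bodies are a plausible reconstruction, not the actual patch :

struct dentry *d_alloc_single(const struct qstr *name, struct inode *inode)
{
        struct dentry *dentry = d_alloc(NULL, name);    /* no dcache_lock */

        if (dentry) {
                dentry->d_flags |= DCACHE_SINGLE | DCACHE_UNHASHED |
                                   DCACHE_DISCONNECTED;
                dentry->d_parent = dentry;      /* parent is itself (2) */
                dentry->d_inode = inode;        /* d_alias stays empty (4) */
        }
        return dentry;
}

/* dput() branches here when it sees DCACHE_SINGLE */
static void dput_single(struct dentry *dentry)
{
        if (!atomic_dec_and_test(&dentry->d_count))
                return;                 /* no atomic_dec_and_lock() */
        iput_single(dentry->d_inode);
        d_free(dentry);
}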

[PATCH 5/5] fs: new_inode_single() and iput_single()

The goal of this patch is to not touch inode_lock for socket/pipe/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid
taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying
three contended cache lines in new_inode(), and five cache lines
in iput().

(socket8 bench result : from 19.9s to 2.3s)
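
The inode side could look like this (a sketch; alloc_inode() and
destroy_inode() stand for the existing fs/inode.c helpers, and real
code would keep the usual clear_inode()-style teardown) :

struct inode *new_inode_single(struct super_block *sb)
{
        struct inode *inode = alloc_inode(sb);

        if (inode) {
                /* deliberately NOT put on inode_in_use or sb->s_inodes */
                inode->i_state = 0;
                preempt_disable();
                inode->i_ino = last_ino_get();
                preempt_enable();
        }
        return inode;
}

void iput_single(struct inode *inode)
{
        /* no list_del() under inode_lock on the way out */
        if (atomic_dec_and_test(&inode->i_count))
                destroy_inode(inode);
}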


Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
Overall diffstat :

 fs/anon_inodes.c       |   18 ------
 fs/dcache.c            |  100 ++++++++++++++++++++++++++++++--------
 fs/fs-writeback.c      |    2
 fs/inode.c             |  101 +++++++++++++++++++++++++++++++--------
 fs/pipe.c              |   25 +--------
 include/linux/dcache.h |    9 +++
 include/linux/fs.h     |   17 ++++++
 kernel/sysctl.c        |    6 +-
 mm/page-writeback.c    |    2
 net/socket.c           |   26 +---------
 10 files changed, 200 insertions(+), 106 deletions(-)

Eric Dumazet Dec. 11, 2008, 10:38 p.m. UTC | #15
Hi Andrew

Since v2 of this patch series got no new feedback, maybe it's time for -mm
inclusion for a while?

In this third version I added the last two patches, one initially from Christoph
Lameter, and one to avoid dirtying mnt->mnt_count on hardwired filesystems.

Many thanks to Christoph and Paul for the SLAB_DESTROY_BY_RCU work done
on "struct file".

Thank you

Short summary : Nice speedups for allocation/deallocation of sockets/pipes
(from 27.5 seconds to 1.62 seconds, on an 8-cpu machine)

Long version :

To allocate a socket or a pipe we :

0) Do the usual file table manipulation (pretty scalable these days,
but would be faster if 'struct file' used SLAB_DESTROY_BY_RCU,
avoiding the call_rcu() cache killer). This point is addressed by the
6th patch.

1) allocate an inode with new_inode()
This function :
- locks inode_lock,
- dirties the nr_inodes counter
- dirties the inode_in_use list  (for sockets/pipes, this is useless)
- dirties superblock s_inodes.
- dirties the last_ino counter
All of these are in different cache lines, unfortunately.

2) allocate a dentry
d_alloc() takes dcache_lock,
insert dentry on its parent list (dirtying sock_mnt->mnt_sb->s_root)
dirties nr_dentry

3) d_instantiate() dentry  (dcache_lock taken again)

4) init_file() -> atomic_inc() on sock_mnt->refcount


At close() time, we must undo these things. It's even more expensive because
of the _atomic_dec_and_lock() that stresses a lot, and because of the two cache
lines that are touched when an element is deleted from a list
(the previous and next items).

This is really bad, since sockets/pipes don't need to be visible in the dcache
or in a per-super-block inode list.

This patch series gets rid of all but one of the contended cache lines for
sockets, pipes and anonymous fds  (signalfd, timerfd, ...)

socketallocbench is a very simple program (attached to this mail) that runs
a loop :

for (i = 0; i < 1000000; i++)
    close(socket(AF_INET, SOCK_STREAM, 0));

Cost if one cpu runs the program :

real    1.561s
user    0.092s
sys     1.469s

Cost if 8 processes are launched on an 8-CPU machine
(socketallocbench -n 8) :

real    27.496s   <<<< !!!! >>>>
user    0.657s
sys     3m39.092s

Oprofile results (for the 8 process run, 3 times):

CPU: Core 2, speed 3000.03 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
3347352  3347352       28.0232  28.0232    _atomic_dec_and_lock
3301428  6648780       27.6388  55.6620    d_instantiate
2971130  9619910       24.8736  80.5355    d_alloc
241318   9861228        2.0203  82.5558    init_file
146190   10007418       1.2239  83.7797    __slab_free
144149   10151567       1.2068  84.9864    inotify_d_instantiate
143971   10295538       1.2053  86.1917    inet_create
137168   10432706       1.1483  87.3401    new_inode
117549   10550255       0.9841  88.3242    add_partial
110795   10661050       0.9275  89.2517    generic_drop_inode
107137   10768187       0.8969  90.1486    kmem_cache_alloc
94029    10862216       0.7872  90.9358    tcp_close
82837    10945053       0.6935  91.6293    dput
67486    11012539       0.5650  92.1943    dentry_iput
57751    11070290       0.4835  92.6778    iput
54327    11124617       0.4548  93.1326    tcp_v4_init_sock
49921    11174538       0.4179  93.5505    sysenter_past_esp
47616    11222154       0.3986  93.9491    kmem_cache_free
30792    11252946       0.2578  94.2069    clear_inode
27540    11280486       0.2306  94.4375    copy_from_user
26509    11306995       0.2219  94.6594    init_timer
26363    11333358       0.2207  94.8801    discard_slab
25284    11358642       0.2117  95.0918    __fput
22482    11381124       0.1882  95.2800    __percpu_counter_add
20369    11401493       0.1705  95.4505    sock_alloc
18501    11419994       0.1549  95.6054    inet_csk_destroy_sock
17923    11437917       0.1500  95.7555    sys_close


This patch series avoids all the contended cache lines and makes this "bench"
pretty fast.


New cost if run on one cpu :

real    1.245s (instead of 1.561s)
user    0.074s
sys     1.161s


If run on 8 CPUs :

real    1.624s
user    0.580s
sys     12.296s


In oprofile, we can finally see network stuff coming to the front of the
expensive items (with the exception of kmem_cache_[z]alloc(), which has
to clear 192 bytes of file structure; this takes half of the time).

CPU: Core 2, speed 3000.09 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
176586   176586        10.9376  10.9376    kmem_cache_alloc
169838   346424        10.5196  21.4572    tcp_close
105331   451755         6.5241  27.9813    tcp_v4_init_sock
105146   556901         6.5126  34.4939    tcp_v4_destroy_sock
83307    640208         5.1600  39.6539    sysenter_past_esp
80241    720449         4.9701  44.6239    inet_csk_destroy_sock
74263    794712         4.5998  49.2237    kmem_cache_free
56806    851518         3.5185  52.7422    __percpu_counter_add
48619    900137         3.0114  55.7536    copy_from_user
44803    944940         2.7751  58.5287    init_timer
28539    973479         1.7677  60.2964    d_alloc
27795    1001274        1.7216  62.0180    alloc_fd
26747    1028021        1.6567  63.6747    __fput
24312    1052333        1.5059  65.1805    sys_close
24205    1076538        1.4992  66.6798    inet_create
22409    1098947        1.3880  68.0677    alloc_inode
21359    1120306        1.3230  69.3907    release_sock
19865    1140171        1.2304  70.6211    fd_install
19472    1159643        1.2061  71.8272    lock_sock_nested
18956    1178599        1.1741  73.0013    sock_init_data
17301    1195900        1.0716  74.0729    drop_file_write_access
17113    1213013        1.0600  75.1329    inotify_d_instantiate
16384    1229397        1.0148  76.1477    dput
15173    1244570        0.9398  77.0875    local_bh_enable_ip
15017    1259587        0.9301  78.0176    local_bh_enable
13354    1272941        0.8271  78.8448    __sock_create
13139    1286080        0.8138  79.6586    inet_release
13062    1299142        0.8090  80.4676    sysenter_do_call
11935    1311077        0.7392  81.2069    iput_single


This patch series contains 7 patches against the linux-2.6 tree,
plus one patch in -mm (fs: filp_cachep can be static in fs/file_table.c).

[PATCH 1/7] fs: Use a percpu_counter to track nr_dentry

Adding a percpu_counter nr_dentry avoids cache line ping pongs
between cpus to maintain this metric, and dcache_lock is
no longer needed to protect dentry_stat.nr_dentry.

We centralize nr_dentry updates at the right place :
- increments in d_alloc()
- decrements in d_free()

d_alloc() can avoid taking dcache_lock if parent is NULL

("socketallocbench -n 8" bench result : 27.5s to 25s)

[PATCH 2/7] fs: Use a percpu_counter to track nr_inodes

Avoids cache line ping pongs between cpus and prepares the next patch,
because updates of nr_inodes don't need inode_lock anymore.

("socketallocbench -n 8" bench result : no difference at this point)

[PATCH 3/7] fs: Introduce a per_cpu last_ino allocator

new_inode() dirties a contended cache line to get increasing
inode numbers.

Solve this problem by providing to each cpu a per_cpu variable,
fed from the shared last_ino, but only once every 1024 allocations.

This reduces contention on the shared last_ino, and gives the same
spread of inode numbers as before
(same wraparound after 2^32 allocations).

("socketallocbench -n 8" result : no difference)


[PATCH 4/7] fs: Introduce SINGLE dentries for pipes, socket, anon fd

Sockets, pipes and anonymous fds have interesting properties.

Like other files, they use a dentry and an inode.

But dentries for these kinds of files are not hashed into the dcache,
since there is no way someone can look up such a file in the vfs tree.
(/proc/{pid}/fd/{number} uses a different mechanism)

Still, allocating and freeing such dentries are expensive processes,
because we currently take dcache_lock inside d_alloc(), d_instantiate(),
and dput(). This lock is very contended on SMP machines.

This patch defines a new DCACHE_SINGLE flag, to mark a dentry as
a single one (for sockets, pipes, anonymous fd), and a new
d_alloc_single(const struct qstr *name, struct inode *inode)
method, called by the three subsystems.

Internally, dput() can take a fast path to dput_single() for
SINGLE dentries. No more atomic_dec_and_lock()
for such dentries.


Differences between a SINGLE dentry and a normal one are :

1) A SINGLE dentry has the DCACHE_SINGLE flag
2) A SINGLE dentry's parent is itself (DCACHE_DISCONNECTED)
This avoids taking a reference on the sb 'root' dentry, which is shared
by too many dentries.
3) They are not hashed into the global hash table (DCACHE_UNHASHED)
4) Their d_alias list is empty

(socket8 bench result : from 25s to 19.9s)

[PATCH 5/7] fs: new_inode_single() and iput_single()

The goal of this patch is to not touch inode_lock for socket/pipe/anonfd
inode allocation/freeing.

SINGLE dentries are attached to inodes that don't need to be linked
in a list of inodes, be it "inode_in_use" or "sb->s_inodes".
As inode_lock was taken only to protect these lists, we avoid taking it as well.

Using iput_single() from dput_single() avoids taking inode_lock
at freeing time.

This patch has a very noticeable effect, because we avoid dirtying three
contended cache lines in new_inode(), and five cache lines
in iput().

("socketallocbench -n 8" result : from 19.9s to 3.01s)


[PATCH 6/7] fs: struct file move from call_rcu() to SLAB_DESTROY_BY_RCU

From: Christoph Lameter <cl@linux-foundation.org>

Currently we schedule RCU frees for each file we free separately. That has
several drawbacks compared to the earlier file handling (in 2.6.5, for
example), which did not require RCU callbacks:

1. An excessive number of RCU callbacks can be generated, causing long RCU
  queues that in turn cause long latencies. We hit SLUB page allocation
  more often than necessary.

2. The cache hot object is not preserved between free and realloc. A close
  followed by another open is very fast with the RCUless approach because
  the last freed object is returned by the slab allocator that is
  still cache hot. RCU free means that the object is not immediately
  available again. The new object is cache cold and therefore open/close
  performance tests show a significant degradation with the RCU
  implementation.

One solution to this problem is to move the RCU freeing into the Slab
allocator by specifying SLAB_DESTROY_BY_RCU as an option at slab creation
time. The slab allocator will do RCU frees only when it is necessary
to dispose of slabs of objects (rare). So with that approach we can cut
out the RCU overhead significantly.

However, the slab allocator may return the object for another use even
before the RCU period has expired under SLAB_DESTROY_BY_RCU. This means
there is the (unlikely) possibility that the object is going to be
switched under us in sections protected by rcu_read_lock() and
rcu_read_unlock(). So we need to verify that we have acquired the correct
object after establishing a stable object reference (incrementing the
refcounter does that).
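
In code, the required lookup pattern looks roughly like this (a sketch
modeled on fget()/fget_light(); the helper name is hypothetical) :

static struct file *fget_rcu_sketch(struct files_struct *files, unsigned int fd)
{
        struct file *file;
        int stale = 0;

        rcu_read_lock();
        file = fcheck_files(files, fd);
        if (file) {
                /* 1) establish a stable reference ... */
                if (!atomic_long_inc_not_zero(&file->f_count))
                        file = NULL;    /* freed under us, maybe recycled */
                /* 2) ... then check the slot still points at this object */
                else if (fcheck_files(files, fd) != file)
                        stale = 1;
        }
        rcu_read_unlock();

        if (stale) {
                fput(file);             /* we had pinned a recycled file */
                file = NULL;
        }
        return file;
}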


Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

("socketallocbench -n 8" result : from 3.01s to 2.20s)

[PATCH 7/7] fs: MS_NOREFCOUNT

Some filesystems are hardwired into the kernel, and mntput()/mntget() hit a
contended cache line. We define a new superblock flag, MS_NOREFCOUNT, that is
set on the socket, pipe and anonymous fd superblocks. mntput()/mntget() become
no-ops on these filesystems.

("socketallocbench -n 8" result : from 2.20s to 1.64s)

cat socketallocbench.c
/*
 * socketallocbench benchmark
 *
 * Usage : socketallocbench [-n procs] [-l loops]
 */
#include <sys/socket.h>
#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>
#include <sys/wait.h>

void dowork(int loops)
{
        int i;

        for (i = 0; i < loops; i++)
                close(socket(AF_INET, SOCK_STREAM, 0));
}

int main(int argc, char *argv[])
{
        int i;
        int n = 1;
        int loops = 1000000;
        pid_t *pidtable;

        while ((i = getopt(argc, argv, "n:l:")) != -1) {
                if (i == 'n')
                        n = atoi(optarg);
                if (i == 'l')
                        loops = atoi(optarg);
        }
        pidtable = malloc(n * sizeof(pid_t));
        for (i = 1; i < n; i++) {
                pidtable[i] = fork();
                if (pidtable[i] == 0) {
                        dowork(loops);
                        _exit(0);
                }
                if (pidtable[i] == -1) {
                        perror("fork");
                        n = i;
                        break;
                }
        }
        dowork(loops);
        for (i = 1; i < n; i++) {
                int status;

                wait(&status);
        }
        return 0;
}

Patch

diff --git a/fs/anon_inodes.c b/fs/anon_inodes.c
index 3662dd4..22cce87 100644
--- a/fs/anon_inodes.c
+++ b/fs/anon_inodes.c
@@ -92,7 +92,7 @@  int anon_inode_getfd(const char *name, const struct file_operations *fops,
 	this.name = name;
 	this.len = strlen(name);
 	this.hash = 0;
-	dentry = d_alloc(anon_inode_mnt->mnt_sb->s_root, &this);
+	dentry = d_alloc(NULL, &this);
 	if (!dentry)
 		goto err_put_unused_fd;
 
diff --git a/fs/dnotify.c b/fs/dnotify.c
index 676073b..66066a3 100644
--- a/fs/dnotify.c
+++ b/fs/dnotify.c
@@ -173,7 +173,7 @@  void dnotify_parent(struct dentry *dentry, unsigned long event)
 
 	spin_lock(&dentry->d_lock);
 	parent = dentry->d_parent;
-	if (parent->d_inode->i_dnotify_mask & event) {
+	if (parent && parent->d_inode->i_dnotify_mask & event) {
 		dget(parent);
 		spin_unlock(&dentry->d_lock);
 		__inode_dir_notify(parent->d_inode, event);
diff --git a/fs/inotify.c b/fs/inotify.c
index 7bbed1b..9f051bb 100644
--- a/fs/inotify.c
+++ b/fs/inotify.c
@@ -270,7 +270,7 @@  void inotify_d_instantiate(struct dentry *entry, struct inode *inode)
 
 	spin_lock(&entry->d_lock);
 	parent = entry->d_parent;
-	if (parent->d_inode && inotify_inode_watched(parent->d_inode))
+	if (parent && parent->d_inode && inotify_inode_watched(parent->d_inode))
 		entry->d_flags |= DCACHE_INOTIFY_PARENT_WATCHED;
 	spin_unlock(&entry->d_lock);
 }
diff --git a/fs/pipe.c b/fs/pipe.c
index 7aea8b8..4b961bc 100644
--- a/fs/pipe.c
+++ b/fs/pipe.c
@@ -926,7 +926,7 @@  struct file *create_write_pipe(int flags)
 		goto err;
 
 	err = -ENOMEM;
-	dentry = d_alloc(pipe_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (!dentry)
 		goto err_inode;
 
diff --git a/net/socket.c b/net/socket.c
index e9d65ea..b84de7d 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -373,7 +373,7 @@  static int sock_attach_fd(struct socket *sock, struct file *file, int flags)
 	struct dentry *dentry;
 	struct qstr name = { .name = "" };
 
-	dentry = d_alloc(sock_mnt->mnt_sb->s_root, &name);
+	dentry = d_alloc(NULL, &name);
 	if (unlikely(!dentry))
 		return -ENOMEM;