
[Bug,106241] New: shutdown(3)/close(3) behaviour is incorrect for sockets in accept(3)

Message ID 1445991236.7476.59.camel@edumazet-glaptop2.roam.corp.google.com
State RFC, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Oct. 28, 2015, 12:13 a.m. UTC
On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:

> 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> with some use patterns.  The cacheline pingpong in the bitmap is probably
> inevitable, unless we accept considerably heavier memory footprint,
> but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> to trigger - close(3);open(...); will have the next open() after that
> scanning the entire in-use bitmap.  I think I see a way to improve it
> without slowing the normal case down, but I'll need to experiment a
> bit before I post patches.  Anybody with examples of real-world loads
> that make our descriptor allocator degrade is very welcome to post
> the reproducers...

Well, I do have real-world loads, but they are quite hard to set up in a lab :(

Note that we also hit the 'struct cred'->usage refcount for every
open()/close()/sock_alloc(), and simply moving uid/gid out of the first
cache line really helps, as current_fsuid() and current_fsgid() no
longer force a pingpong.

I moved seldom-used fields onto the first cache line, so that overall
memory usage did not change (192 bytes on 64-bit arches).






Comments

Al Viro Oct. 28, 2015, 12:35 p.m. UTC | #1
[Linus and Dave added, Solaris and NetBSD folks dropped from Cc]

On Tue, Oct 27, 2015 at 05:13:56PM -0700, Eric Dumazet wrote:
> On Tue, 2015-10-27 at 23:17 +0000, Al Viro wrote:
> 
> > 	* [Linux-specific aside] our __alloc_fd() can degrade quite badly
> > with some use patterns.  The cacheline pingpong in the bitmap is probably
> > inevitable, unless we accept considerably heavier memory footprint,
> > but we also have a case when alloc_fd() takes O(n) and it's _not_ hard
> > to trigger - close(3);open(...); will have the next open() after that
> > scanning the entire in-use bitmap.  I think I see a way to improve it
> > without slowing the normal case down, but I'll need to experiment a
> > bit before I post patches.  Anybody with examples of real-world loads
> > that make our descriptor allocator degrade is very welcome to post
> > the reproducers...
> 
> Well, I do have real-world loads, but they are quite hard to set up in a lab :(
> 
> Note that we also hit the 'struct cred'->usage refcount for every
> open()/close()/sock_alloc(), and simply moving uid/gid out of the first
> cache line really helps, as current_fsuid() and current_fsgid() no
> longer force a pingpong.
> 
> I moved seldom-used fields onto the first cache line, so that overall
> memory usage did not change (192 bytes on 64-bit arches).

[snip]

Makes sense, but there's a funny thing about that refcount - the part
coming from ->f_cred is the most frequently changed *and* almost all
places using ->f_cred are just looking at its fields and do not manipulate
its refcount.  The only exception (do_process_acct()) is easy to eliminate
just by storing a separate reference to the current creds of acct(2) caller
and using it instead of looking at ->f_cred.  What's more, the place where we
grab what will be ->f_cred is guaranteed to have a non-f_cred reference *and*
most of the time such a reference is there for dropping ->f_cred (in
file_free()/file_free_rcu()).

With that change in kernel/acct.c done, we could do the following:
	a) split the cred refcount into the normal and percpu parts and
add a spinlock in there.
	b) have put_cred() do this:
		if (atomic_dec_and_test(&cred->usage)) {
			this_cpu_add(&cred->f_cred_usage, 1);
			call_rcu(&cred->rcu, put_f_cred_rcu);
		}
	c) have get_empty_filp() increment current_cred()->f_cred_usage with
this_cpu_add()
	d) have file_free() do
		percpu_counter_dec(&nr_files);
		rcu_read_lock();
		if (likely(atomic_read(&f->f_cred->usage))) {
			this_cpu_add(&f->f_cred->f_cred_usage, -1);
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu_light);
		} else {
			rcu_read_unlock();
			call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
		}
file_free_rcu() being
static void file_free_rcu(struct rcu_head *head)
{
        struct file *f = container_of(head, struct file, f_u.fu_rcuhead);
        put_f_cred(&f->f_cred->rcu);
        kmem_cache_free(filp_cachep, f);
}
and file_free_rcu_light() - the same sans put_f_cred();

with put_f_cred() doing
	spin_lock cred->lock
	this_cpu_add(&cred->f_cred_usage, -1);
	find the sum of cred->f_cred_usage
	spin_unlock cred->lock
	if the sum has not reached 0
		return
	current put_cred_rcu(cred)

IOW, let's try to get rid of cross-cpu stores in ->f_cred grabbing and
(most of) ->f_cred dropping.

Note that there are two paths leading to put_f_cred() in the above - via
call_rcu() on &cred->rcu and from file_free_rcu() called via call_rcu() on
&f->f_u.fu_rcuhead.  Both are RCU-delayed and they can happen in parallel -
different rcu_head are used.

atomic_read() check in file_free() might give false positives if it comes
just before put_cred() on another CPU kills the last non-f_cred reference.
It's not a problem, since put_f_cred() from that put_cred() won't be
executed until we drop rcu_read_lock(), so we can safely decrement the
cred->f_cred_usage without cred->lock here (and we are guaranteed that we won't
be dropping the last of that - the same put_cred() would've incremented
->f_cred_usage).

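For concreteness, a minimal C sketch of that put_f_cred() (the cred->lock
spinlock and the f_cred_usage percpu counter are the new fields assumed
by (a) above; none of this exists in mainline, so treat it as a sketch):

static void put_f_cred(struct rcu_head *rcu)
{
	struct cred *cred = container_of(rcu, struct cred, rcu);
	long sum = 0;
	int cpu;

	spin_lock(&cred->lock);
	this_cpu_dec(*cred->f_cred_usage);	/* drop our percpu reference */
	for_each_possible_cpu(cpu)		/* sum is stable under ->lock */
		sum += *per_cpu_ptr(cred->f_cred_usage, cpu);
	spin_unlock(&cred->lock);

	if (sum)		/* somebody still holds f_cred references */
		return;
	put_cred_rcu(&cred->rcu);	/* the current release path */
}
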
Does anybody see problems with that approach?  I'm going to grab some sleep
(only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
field-reordering patch when I get up...
Eric Dumazet Oct. 28, 2015, 1:24 p.m. UTC | #2
On Wed, 2015-10-28 at 12:35 +0000, Al Viro wrote:

[snip]

> Does anybody see problems with that approach?  I'm going to grab some sleep
> (only a couple of hours so far tonight ;-/), will cook an incremental to Eric's
> field-reordering patch when I get up...

Before I take a deep look at your suggestion, are you sure plain use of
include/linux/percpu-refcount.h infra is not possible for struct cred ?

Thanks !


Eric Dumazet Oct. 28, 2015, 2:47 p.m. UTC | #3
On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:

> Before I take a deep look at your suggestion, are you sure plain use of
> include/linux/percpu-refcount.h infra is not possible for struct cred ?

BTW, I am not convinced we need to spend so much energy and per-cpu
memory for struct cred refcount.

The big problem is fd array spinlock of course and bitmap search for
POSIX compliance.

The cache line thrashing in struct cred is a minor one ;)



Al Viro Oct. 28, 2015, 9:13 p.m. UTC | #4
On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> 
> > Before I take a deep look at your suggestion, are you sure plain use of
> > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> 
> BTW, I am not convinced we need to spend so much energy and per-cpu
> memory for struct cred refcount.
> 
> The big problem is fd array spinlock of course and bitmap search for
> POSIX compliance.
> 
> The cache line thrashing in struct cred is a minor one ;)

percpu-refcount isn't convenient - the only such candidate for ref_kill in
there is "all other references are gone", and that can happen in
interesting locking environments.  I doubt that it would be a good fit, TBH...

Cacheline pingpong on the descriptors bitmap is probably inevitable, but
it's not the only problem in the existing implementation - close a small
descriptor when you've got a lot of them and look for the second open
after that.  _That_ can lead to thousands of cachelines being read through,
all under the table spinlock.  It's literally orders of magnitude worse.
And if the first open after that close happens to be for a short-lived
descriptor, you'll get the same situation back in your face as soon as you
close it.

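(A minimal userland sketch of that pattern, meant to be run under a
profiler, assuming the fd limit has been raised well past a million:)

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	int i, highfd;

	for (i = 0; i < 1000000; i++)	/* fill the table solid */
		dup(0);

	for (;;) {
		close(3);			/* next_fd drops to 3 */
		open("/dev/null", O_RDONLY);	/* reuses 3 cheaply */
		/* next_fd is now 4: this open scans ~1M set bits */
		highfd = open("/dev/null", O_RDONLY);
		close(highfd);
	}
}
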
I think we can seriously improve that without screwing the fast path by
adding "summary" bitmaps once the primary grows past a cacheline's worth
of bits.  Each bit in the summary bitmap corresponds to a cacheline-sized
chunk of the primary and is set iff all bits in that chunk are set.  If
the summary map grows larger than one cacheline, add a second-order one
(that happens at a quarter million descriptors and serves until 128
million; adding a third-order map is probably worthless).

I want to maintain the same kind of "everything below this is known to be
in use" thing as we do now.  Allocation would start by looking at the
same place in the primary bitmap where we look now, with a similar search
forward for a zero bit.  _However_, it would stop at the cacheline boundary.
If nothing has been found, we look in the corresponding place in the
summary bitmap and search for a zero bit there.  Again, no more than up
to the cacheline boundary.  If something is found, we've got a chunk in
the primary known to contain a zero bit; if not - go to the second level
and search there, etc.

When a zero bit in the primary has been found, check if it's within the
rlimit (passed to __alloc_fd() explicitly) and either bugger off or set
that bit.  If there are zero bits left in the same word - we are done,
otherwise check the still unread words in the cacheline and see if all
of them are ~0UL.  If all of them are, set the bit in summary bitmap, etc.

Normal case is exactly the same as now - one cacheline accessed and modified.
We might end up touching more than that, but it's going to be rare and
the cases when it happens are very likely to lead to much worse amount of
memory traffic with the current code.

Freeing is done by zeroing the bit in primary, checking for other zero bits
nearby and buggering off if there are such.  If the entire cacheline used
to be all-bits-set, clear the bit in summary and, if there's a second-order
summary, get the bit in there clear as well - it's probably not worth
bothering to check that the whole cacheline in the summary bitmap had been
all-bits-set.  Again, the normal case is the same as now.

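A rough sketch of the search with a single summary level (the names are
made up for illustration; the rlimit check, summary maintenance and the
second-order map are omitted):

#define CHUNK_BITS	(L1_CACHE_BYTES * 8)	/* 512: a cacheline of bits */

/* primary[] is the in-use bitmap; summary[] has one bit per CHUNK_BITS
 * chunk of primary, set iff that chunk is entirely in use.
 */
static unsigned long find_free_fd(const unsigned long *primary,
				  const unsigned long *summary,
				  unsigned long nbits, unsigned long start)
{
	unsigned long chunk = start / CHUNK_BITS;
	unsigned long end = min(nbits, (chunk + 1) * CHUNK_BITS);
	unsigned long bit = find_next_zero_bit(primary, end, start);

	if (bit < end)			/* found in the starting cacheline */
		return bit;

	/* nothing here: use the summary to find a chunk with a free bit */
	chunk = find_next_zero_bit(summary, DIV_ROUND_UP(nbits, CHUNK_BITS),
				   chunk + 1);
	start = chunk * CHUNK_BITS;
	if (start >= nbits)
		return nbits;		/* no free slot above start */
	end = min(nbits, start + CHUNK_BITS);
	return find_next_zero_bit(primary, end, start);
}
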
It'll need profiling and tuning, but AFAICS it's doable without making
things worse than they are now, and it should get rid of those O(N) fetches
under spinlock cases.  And yes, those are triggerable and visible in
profiles.  IMO it's worth trying to fix...

Eric Dumazet Oct. 28, 2015, 9:44 p.m. UTC | #5
On Wed, 2015-10-28 at 21:13 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 07:47:57AM -0700, Eric Dumazet wrote:
> > On Wed, 2015-10-28 at 06:24 -0700, Eric Dumazet wrote:
> > 
> > > Before I take a deep look at your suggestion, are you sure plain use of
> > > include/linux/percpu-refcount.h infra is not possible for struct cred ?
> > 
> > BTW, I am not convinced we need to spend so much energy and per-cpu
> > memory for struct cred refcount.
> > 
> > The big problem is fd array spinlock of course and bitmap search for
> > POSIX compliance.
> > 
> > > The cache line thrashing in struct cred is a minor one ;)
> 
> percpu-refcount isn't convenient - the only such candidate for ref_kill in
> there is "all other references are gone", and that can happen in
> interesting locking environments.  I doubt that it would be a good fit, TBH...

OK then ...

[snip]

Well, all this complexity goes away with an O_FD_FASTALLOC /
SOCK_FD_FASTALLOC bit in the various fd allocations, which specifically
tells the kernel we do not care about getting the lowest possible fd that
POSIX mandates.

With this bit set, the bitmap search can start at a random point, and we
find a slot in O(1) : one cache line miss, if you have at least one free
bit/slot per 512 bits (64-byte cache line).

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

... // active sockets
socket(AF_INET, SOCK_STREAM | SOCK_FD_FASTALLOC, 0);
... // passive sockets
accept4(sockfd, ..., SOCK_FD_FASTALLOC);
...

Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
lot of applications absolutely rely on the POSIX thing...



Al Viro Oct. 28, 2015, 10:33 p.m. UTC | #6
On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:

> Well, all this complexity goes away with an O_FD_FASTALLOC /
> SOCK_FD_FASTALLOC bit in the various fd allocations, which specifically
> tells the kernel we do not care about getting the lowest possible fd that
> POSIX mandates.

... which won't do a damn thing for existing userland.

> Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
> lot of applications absolutely rely on the POSIX thing...

We obviously can't turn that into default behaviour, though.  BTW, what
distribution do you have in mind for those random descriptors?  Uniform
on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
memory footprint pretty soon...
Eric Dumazet Oct. 28, 2015, 11:08 p.m. UTC | #7
On Wed, 2015-10-28 at 22:33 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 02:44:28PM -0700, Eric Dumazet wrote:
> 
> > Well, all this complexity goes away with an O_FD_FASTALLOC /
> > SOCK_FD_FASTALLOC bit in the various fd allocations, which specifically
> > tells the kernel we do not care about getting the lowest possible fd that
> > POSIX mandates.
> 
> ... which won't do a damn thing for existing userland.

For the userland that needs 5,000,000+ sockets, I can tell you they will
be using this flag as soon as they are aware it exists ;)

> 
> > Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
> > lot of applications absolutely rely on the POSIX thing...
> 
> We obviously can't turn that into default behaviour, though.  BTW, what
> distribution do you have in mind for those random descriptors?  Uniform
> on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> memory footprint pretty soon...

Simply [0, fdt->max_fds] works well in most cases.



Al Viro Oct. 29, 2015, 12:15 a.m. UTC | #8
On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
> > > lot of applications absolutely rely on the POSIX thing...
> > 
> > We obviously can't turn that into default behaviour, though.  BTW, what
> > distribution do you have in mind for those random descriptors?  Uniform
> > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > memory footprint pretty soon...
> 
> Simply [0, fdt->max_fds] works well in most cases.

Umm...  So first you dup2() to establish the ->max_fds you want, then
do such opens?  What used/unused ratio do you expect to deal with?
And what kind of locking are you going to use?  Keep in mind that
e.g. dup2() is dependent on the lack of allocations while it's working,
so it's not as simple as "we don't need no stinkin' ->files_lock"...
Eric Dumazet Oct. 29, 2015, 3:29 a.m. UTC | #9
On Thu, 2015-10-29 at 00:15 +0000, Al Viro wrote:
> On Wed, Oct 28, 2015 at 04:08:29PM -0700, Eric Dumazet wrote:
> > > > Except for legacy stuff and stdin/stdout/stderr games, I really doubt a
> > > > lot of applications absolutely rely on the POSIX thing...
> > > 
> > > We obviously can't turn that into default behaviour, though.  BTW, what
> > > distribution do you have in mind for those random descriptors?  Uniform
> > > on [0,INT_MAX] is a bad idea for obvious reasons - you'll blow the
> > > memory footprint pretty soon...
> > 
> > Simply [0, fdt->max_fds] works well in most cases.
> 
> Umm...  So first you dup2() to establish the ->max_fds you want, then
> do such opens?

Yes, dup2() is done at program startup, knowing the expected max load
(in terms of concurrent fds) + ~10% (the actual fd array size can be more
than this because of power-of-two rounding in alloc_fdtable())

But this is an optimization : If you do not use the initial dup2(), the
fd array can be automatically expanded if needed (all slots are in use)

>   What used/unused ratio do you expect to deal with?
> And what kind of locking are you going to use?  Keep in mind that
> e.g. dup2() is dependent on the lack of allocations while it's working,
> so it's not as simple as "we don't need no stinkin' ->files_lock"...

No locking change. files->file_lock is still taken.

We only want to minimize time to find an empty slot.

The trick is to not start the bitmap search at files->next_fd, but at a
random point. This is a win if we assume there are enough holes.

low = start;
if (low < files->next_fd)
    low = files->next_fd;

res = -1;
if (flags & O_FD_FASTALLOC) {
	random_point = pick_random_between(low, fdt->max_fds);

	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
				random_point);
	/* No empty slot found, try the other range */
	if (res >= fdt->max_fds) {
		res = find_next_zero_bit(fdt->open_fds,
				random_point, low);
		if (res >= random_point)
			res = -1;
	}
}
...
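
(pick_random_between() is not an existing kernel helper, to be clear -
something as simple as this, on top of the existing prandom_u32(), is
what I have in mind:)

static unsigned int pick_random_between(unsigned int low, unsigned int high)
{
	/* roughly uniform in [low, high); assumes high > low */
	return low + prandom_u32() % (high - low);
}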




Al Viro Oct. 29, 2015, 4:16 a.m. UTC | #10
On Wed, Oct 28, 2015 at 08:29:41PM -0700, Eric Dumazet wrote:

> But this is an optimization : If you do not use the initial dup2(), the
> fd array can be automatically expanded if needed (all slots are in use)

Whee...

> No locking change. files->file_lock is still taken.
> 
> We only want to minimize time to find an empty slot.

Then I'd say that my variant is going to win.  It *will* lead to
cacheline pingpong in more cases than yours, but I'm quite sure that
it will be a win as far as the total number of cachelines accessed.

> The trick is to not start the bitmap search at files->next_fd, but at a
> random point. This is a win if we assume there are enough holes.
> 
> low = start;
> if (low < files->next_fd)
>     low = files->next_fd;
> 
> res = -1;
> if (flags & O_FD_FASTALLOC) {
> 	random_point = pick_random_between(low, fdt->max_fds);
> 
> 	res = find_next_zero_bit(fdt->open_fds, fdt->max_fds,
> 				random_point);
> 	/* No empty slot found, try the other range */
> 	if (res >= fdt->max_fds) {
> 		res = find_next_zero_bit(fdt->open_fds,
> 				random_point, low);
> 		if (res >= random_point)
> 			res = -1;
> 	}
> }

Have you tried to experiment with that in userland?  I mean, emulate that
thing in normal userland code, count the cacheline accesses and drive it
with the use patterns collected from actual applications.

I can sit down and play with math expectations, but I suspect that it's
easier to experiment.  It's nothing but an intuition (I hadn't seriously
done probability theory in quite a while, and my mathematical tastes run
more to geometry and topology anyway), but... I would expect it to degrade
badly when the bitmap is reasonably dense.

Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
not cheap - it's done via a #PF triggered in kernel mode, with the handler
noticing that the faulting address is in vmalloc range and doing the
right thing.  IOW, if your bitmap is very sparse, the price of page faults
needs to be taken into account.

AFAICS, the only benefit of that thing is keeping dirtied cachelines far
from each other.  Which might be a win overall, but I'm not convinced that
the rest won't offset the effect of that...
Eric Dumazet Oct. 29, 2015, 12:35 p.m. UTC | #11
On Thu, 2015-10-29 at 04:16 +0000, Al Viro wrote:

> Have you tried to experiment with that in userland?  I mean, emulate that
> thing in normal userland code, count the cacheline accesses and drive it
> with the use patterns collected from actual applications.

Sure.

> 
> I can sit down and play with math expectations, but I suspect that it's
> easier to experiment.  It's nothing but an intuition (I hadn't seriously
> done probability theory in quite a while, and my mathematical tastes run
> more to geometry and topology anyway), but... I would expect it to degrade
> badly when the bitmap is reasonably dense.
> 
> Note, BTW, that vmalloc'ed memory gets populated as you read it, and it's
> not cheap - it's done via #PF triggered in kernel mode, with handler
> noticing that the faulting address is in vmalloc range and doing the
> right thing.  IOW, if your bitmap is very sparse, the price of page faults
> needs to be taken into account.

This vmalloc PF is pure noise.
This only matters for the very first allocations.

We target programs opening zillions of fd in their lifetime ;)

Not having to expand a 4,000,000-slot fd array while fully loaded also
removes a latency spike that is very often not desirable.


> 
> AFAICS, the only benefit of that thing is keeping dirtied cachelines far
> from each other.  Which might be a win overall, but I'm not convinced that
> the rest won't offset the effect of that...

Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
It also lowers the time we hold the fd array spinlock while doing fd allocation.

Userland test program I wrote a few months back:

Current kernel :

    64.98%  [kernel]          [k] queued_spin_lock_slowpath    
    14.88%  opensock          [.] memset    // this part simulates user land actual work ;)                   
    11.15%  [kernel]          [k] _find_next_bit.part.0        
     0.69%  [kernel]          [k] _raw_spin_lock               
     0.46%  [kernel]          [k] memset_erms                  
     0.38%  [kernel]          [k] sk_alloc                     
     0.37%  [kernel]          [k] kmem_cache_alloc             
     0.33%  [kernel]          [k] get_empty_filp               
     0.31%  [kernel]          [k] kmem_cache_free              
     0.26%  [kernel]          [k] __alloc_fd                   
     0.26%  opensock          [.] child_function               
     0.18%  [kernel]          [k] inode_init_always            
     0.17%  opensock          [.] __random_r                   


/*
 * test for b/9072743 : fd scaling on gigantic process (with ~ 10,000,000 TCP sockets)
 * - Size fd arrays in kernel to avoid resizings that kill latencies.
 * - Then launch xx threads, each doing:
 *    - populate its share of the fd array, opening 'max' files.
 *    - loop : close(randomfd()), socket(AF_INET, SOCK_STREAM, 0);
 *
 * Usage : opensock [ -n fds_count ] [ -t threads_count ] [ -l duration ] [-f]
 */

#include <pthread.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>

unsigned int count;
int skflags;

#define NBTHREADS_MAX 4096
pthread_t tid[NBTHREADS_MAX];
int nbthreads;
int nbthreads_req = 24;
int stop_all;

#ifndef O_FD_FASTALLOC
#define O_FD_FASTALLOC 0x40000000
#endif

#ifndef SOCK_FD_FASTALLOC
#define SOCK_FD_FASTALLOC O_FD_FASTALLOC
#endif

/* expand kernel fd array for optimal perf.
 * This could be done by doing a loop on dup(),
 * or can be done using dup2()
 */
int expand_fd_array(int max)
{
	int target, res;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd == -1) {
		perror("socket()");
		return -1;
	}
	for (;;) {
		count = max;
		target = count;
		if (skflags & SOCK_FD_FASTALLOC)
			target += count/10;
		res = dup2(fd, target);
		if (res != -1) {
			close(res);
			break;
		}
		max -= max/10;
	}
	printf("count=%u (check/increase ulimit -n)\n", count);
	return 0;
}

static char state[32] = {
	0, 1, 2, 3, 4, 5, 6, 7,
	8, 9, 10, 11, 12, 13, 14, 15,
	16, 17, 18, 19, 20, 21, 22, 23,
	24, 25, 26, 27, 28, 29, 30, 31
};

/* each thread is using ~400 KB of data per unit of work */
#define WORKING_SET_SIZE 400000

static void *child_function(void *arg)
{
	unsigned int max = count / nbthreads_req;
	struct random_data buf;
	unsigned int idx;
	int *tab;
	unsigned long iter = 0;
	unsigned long *work_set = malloc(WORKING_SET_SIZE);
	int i;

	if (!work_set)
		return NULL;
	tab = malloc(max * sizeof(int));
	if (!tab) {
		free(work_set);
		return NULL;
	}
	memset(tab, 255, max * sizeof(int));
	
	memset(&buf, 0, sizeof(buf));	/* glibc wants random_data zeroed first */
	initstate_r(getpid(), state, sizeof(state), &buf);

	tab[0] = socket(AF_INET, SOCK_STREAM | skflags, 0);
	for (i = 1; i < max; i++)
		tab[i] = dup(tab[0]);

	while (!stop_all) {
		random_r(&buf, &idx);
		idx = idx % max;
		close(tab[idx]);

		/* user space needs typically to use a bit of the memory. */
		memset(work_set, idx, WORKING_SET_SIZE);

		tab[idx] = socket(AF_INET, SOCK_STREAM | skflags, 0);
		if (tab[idx] == -1) {
			perror("socket");
			break;
		}
		iter++;
	}
	for (i = 0; i < max; i++)
		close(tab[i]);
	free(tab);
	free(work_set);
	printf("%lu\n", iter);
	return NULL;
}

static int launch_threads(void)
{
	int i, err;

	for (i = 0; i < nbthreads_req; i++) {
		err = pthread_create(&tid[i], NULL, child_function, NULL);
		if (err)
			return err;
		nbthreads++;
	}
	return 0;
}

static void wait_end(void)
{
	int i;
	for (i = 0; i < nbthreads; i++)
		pthread_join(tid[i], NULL);
}

static void usage(int code)
{
	fprintf(stderr, "Usage : opensock [ -n fds_count ] [ -t threads_count ] [ -l duration ] [-f]\n");
	exit(code);
}

int main(int argc, char *argv[])
{
	int c;
	int max = 1000000;
	int duration = 10;

	while ((c = getopt(argc, argv, "fn:t:l:")) != -1) {
		switch (c) {
		case 'f':
			skflags = SOCK_FD_FASTALLOC;
			break;
		case 'n':
			max = atoi(optarg);
			break;
		case 't':
			nbthreads_req = atoi(optarg);
			if (nbthreads_req > NBTHREADS_MAX)
				usage(1);
			break;
		case 'l':
			duration = atoi(optarg);
			break;
		default:
			usage(1);
		}
	}
	system("sysctl -w fs.file-max=8000000");
	expand_fd_array(max);
	launch_threads();
	sleep(duration);
	stop_all = 1;
	wait_end();
}
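
For reference, the program builds and runs as plain userland code; the -f
mode only changes behaviour on a kernel patched with the proposed
O_FD_FASTALLOC support:

	gcc -O2 -pthread -o opensock opensock.c
	./opensock -n 1000000 -t 24 -l 10 -f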


Linus Torvalds Oct. 30, 2015, 5:18 p.m. UTC | #12
On Thu, Oct 29, 2015 at 5:35 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
> Well, I already tested the O_FD_FASTALLOC thing, and I can tell you
> find_next_zero_bit() is nowhere to be found in kernel profiles anymore.
> It also lowers the time we hold the fd array spinlock while doing fd allocation.

I do wonder if we couldn't just speed up the bitmap allocator by an
order of magnitude. It would be nicer to be able to make existing
loads faster without any new "don't bother with POSIX semantics" flag.

We could, for example, extend the "open_fds" bitmap with a
"second-level" bitmap that has information on when the first level is
full. We traverse the open_fds bitmap one word at a time anyway, so we
could have a second-level bitmap that has one bit per word to say
whether that word is already full.

So you'd basically have:

 - open_fds: array with the real bitmap (exactly like we do now)
 - full_fds_bits: array of bits that shows which of the "unsigned
longs" in 'open_fs' are already full.

and then you can basically walk 4096 fd's at a time by just checking
one single word in the full_fds_bits[] array.

So in __alloc_fd(), instead of using find_next_zero_bit(), you'd use
"find_next_fd()", which woulf do something like

    unsigned long find_next_fd(struct fdtable *fdt, unsigned long start)
    {
        while (start < fdt->max_fds) {
            unsigned long startbit = start / BITS_PER_LONG;
            unsigned long bitbit = startbit / BITS_PER_LONG;
            // one word of full_fds_bits covers BITS_PER_LONG^2 (4096) fds
            unsigned long bitmax = (bitbit+1) * BITS_PER_LONG * BITS_PER_LONG;
            unsigned long fd;

            if (bitmax > fdt->max_fds)
                bitmax = fdt->max_fds;

            // Are all the words covered by this full_fds_bits word full?
            if (!~fdt->full_fds_bits[bitbit]) {
                start = bitmax;
                continue;
            }

            fd = find_next_zero_bit(fdt->open_fds, bitmax, start);
            if (fd < bitmax)
                return fd;
            start = bitmax;
        }
        return fdt->max_fds;
    }

which really should cut down on the bit traversal by a factor of 64.

Of course, then you do have to maintain that bitmap-of-bits in
__set_open_fd() and __clear_open_fd(), but that should be pretty
trivial too:

 - __clear_open_fd() would become just

        __clear_bit(fd, fdt->open_fds);
        __clear_bit(fd / BITS_PER_LONG, fdt->full_fds_bits);

 - and __set_open_fd() would become

        __set_bit(fd, fdt->open_fds);
        fd /= BITS_PER_LONG;
        if (!~fdt->open_fds[fd])
            __set_bit(fd, fdt->full_fds_bits);

or something.

NOTE NOTE NOTE! The above is all very much written without any testing
in the email client, so while I tried to make it look like "real
code", consider the above just pseudo-code.

The advantage of the above is that it should just work for existing
binaries. It may not be quite as optimal as just introducing a new
"don't care about POSIX" feature, but quite frankly, if it cuts down
the bad case of "find_next_zero_bit()" by a factor of 64 (and then
adds a *small* expense factor on top of that), I suspect it should be
"good enough" even for your nasty case.

What do you think? Willing to try the above approach (with any
inevitable bug-fixes) and see how it compares?

Obviously in addition to any fixes to my pseudo-code above you'd need
to add the allocations for the new "full_fds_bits" etc, but I think it
should be easy to make the full_fds_bits allocation be *part* of the
"open_fds" allocation, so you wouldn't need a new allocation in
alloc_fdtable(). We already do that whole "use a single allocation" to
combine open_fds with close_on_exec into one single allocation.
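
For illustration, the single combined allocation could look something like
this in alloc_fdtable() (the BITBIT sizing is my guess at "one bit per
open_fds word", rounded up to whole longs - untested):

    #define BITBIT_NR(nr)   BITS_TO_LONGS(BITS_TO_LONGS(nr))
    #define BITBIT_SIZE(nr) (BITBIT_NR(nr) * sizeof(long))

        /* nr is a multiple of BITS_PER_LONG here */
        char *data = kmalloc(2 * nr / BITS_PER_BYTE + BITBIT_SIZE(nr),
                             GFP_KERNEL);
        if (!data)
            goto out;   /* whatever the existing error path is */
        fdt->open_fds = (unsigned long *)data;
        data += nr / BITS_PER_BYTE;
        fdt->close_on_exec = (unsigned long *)data;
        data += nr / BITS_PER_BYTE;
        fdt->full_fds_bits = (unsigned long *)data;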

                    Linus
Al Viro Oct. 30, 2015, 9:02 p.m. UTC | #13
On Fri, Oct 30, 2015 at 10:18:12AM -0700, Linus Torvalds wrote:

> I do wonder if we couldn't just speed up the bitmap allocator by an
> order of magnitude. It would be nicer to be able to make existing
> loads faster without any new "don't bother with POSIX semantics" flag.
> 
> We could, for example, extend the "open_fds" bitmap with a
> "second-level" bitmap that has information on when the first level is
> full. We traverse the open_fds bitmap one word at a time anyway, so we
> could have a second-level bitmap that has one bit per word to say
> whether that word is already full.

Your variant has a 1:64 ratio; obviously better than now, but we can
actually do 1:bits-per-cacheline quite easily.

I've been playing with a variant that has more than two bitmaps, and
AFAICS it
	a) does not increase the number of cachelines pulled and
	b) keeps it well-bounded even in the absolute worst case
(128M-odd descriptors filled, followed by close(0);dup2(1,0); -
in that case it ends up accessing 7 cachelines' worth of
bitmaps; your variant will read through 4000-odd cachelines of
the summary bitmap alone; the mainline is even worse).

> The advantage of the above is that it should just work for existing
> binaries. It may not be quite as optimal as just introducing a new
> "don't care about POSIX" feature, but quite frankly, if it cuts down
> the bad case of "find_next_zero_bit()" by a factor of 64 (and then
> adds a *small* expense factor on top of that), I suspect it should be
> "good enough" even for your nasty case.
> 
> What do you think? Willing to try the above approach (with any
> inevitable bug-fixes) and see how it compares?
> 
> Obviously in addition to any fixes to my pseudo-code above you'd need
> to add the allocations for the new "full_fds_bits" etc, but I think it
> should be easy to make the full_fds_bit allocation be *part* of the
> "open_fds" allocation, so you wouldn't need a new allocation in
> alloc_fdtable(). We already do that whole "use a single allocation" to
> combine open_fds with close_on_exec into one single allocation.

I'll finish testing what I've got and post it; it costs 3 extra pointers
in the files_struct and a bit fatter bitmap allocation (less than 0.2%
extra).  All the arguments regarding the unmodified binaries apply, of
course, and so far it looks fairly compact...
Linus Torvalds Oct. 30, 2015, 9:23 p.m. UTC | #14
On Fri, Oct 30, 2015 at 2:02 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Your variant has a 1:64 ratio; obviously better than now, but we can
> actually do 1:bits-per-cacheline quite easily.

Ok, but in that case you end up needing a counter for each cacheline
too (to count how many bits are set, in order to know when to say "cacheline
is entirely full").

Because otherwise you'll end up having to check the whole cacheline
(rather than check just one word, or check the "bit counter") when you
set a bit.

So I suspect a "bitmap of full words" ends up being much simpler.  But
hey, if you can make your more complex thing work, go right ahead.

               Linus

Patch

diff --git a/include/linux/cred.h b/include/linux/cred.h
index 8d70e1361ecd..460efae83522 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -124,7 +124,17 @@  struct cred {
 #define CRED_MAGIC     0x43736564
 #define CRED_MAGIC_DEAD        0x44656144
 #endif
-       kuid_t          uid;            /* real UID of the task */
+       struct rcu_head rcu;            /* RCU deletion hook */
+
+       kernel_cap_t    cap_inheritable; /* caps our children can inherit */
+       kernel_cap_t    cap_permitted;  /* caps we're permitted */
+       kernel_cap_t    cap_effective;  /* caps we can actually use */
+       kernel_cap_t    cap_bset;       /* capability bounding set */
+       kernel_cap_t    cap_ambient;    /* Ambient capability set */
+
+       kuid_t          uid ____cacheline_aligned_in_smp;
+                                       /* real UID of the task */
+
        kgid_t          gid;            /* real GID of the task */
        kuid_t          suid;           /* saved UID of the task */
        kgid_t          sgid;           /* saved GID of the task */
@@ -133,11 +143,6 @@  struct cred {
        kuid_t          fsuid;          /* UID for VFS ops */
        kgid_t          fsgid;          /* GID for VFS ops */
        unsigned        securebits;     /* SUID-less security management */
-       kernel_cap_t    cap_inheritable; /* caps our children can inherit */
-       kernel_cap_t    cap_permitted;  /* caps we're permitted */
-       kernel_cap_t    cap_effective;  /* caps we can actually use */
-       kernel_cap_t    cap_bset;       /* capability bounding set */
-       kernel_cap_t    cap_ambient;    /* Ambient capability set */
 #ifdef CONFIG_KEYS
        unsigned char   jit_keyring;    /* default keyring to attach requested
                                         * keys to */
@@ -152,7 +157,6 @@  struct cred {
        struct user_struct *user;       /* real user ID subscription */
        struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
        struct group_info *group_info;  /* supplementary groups for euid/fsgid */
-       struct rcu_head rcu;            /* RCU deletion hook */
 };
 
 extern void __put_cred(struct cred *);