
virt-manager broken by bind(0) in net-next.

Message ID: 20090130225113.GA13977@ioremap.net
State: Changes Requested, archived
Delegated to: David Miller

Commit Message

Evgeniy Polyakov Jan. 30, 2009, 10:51 p.m. UTC
On Fri, Jan 30, 2009 at 11:30:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > It should contain a rough number of sockets; there is no need to be very
> > precise because of this heuristic.
> 
> Denying there is a bug is... well... I don't know what to say.
> 
> I wonder why we still use atomic_t all over the kernel.

It is not a bug. It is not supposed to be precise. At all.
I implemented a simple heuristic for when a different bind port selection
algorithm should start: roughly when the number of opened sockets equals
some predefined value (a sysctl at the moment, but it could be 64k or
anything else), so if that number is loosely maintained and does not
precisely correspond to the number of sockets, it is not a problem.
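
Roughly, the idea reduces to a check like this (a standalone userspace
sketch with made-up names and threshold, not the actual patch code):

#include <stdio.h>

/* Made-up threshold: switch the port selection strategy once roughly
 * this many sockets are bound.  Being off by a few is harmless. */
#define BIND_THRESHOLD	65536

static int bsockets;	/* loosely maintained count of bound sockets */

/* The check only needs to be approximately right, so a slightly stale
 * counter value does not matter for this heuristic. */
static int use_alternate_selection(void)
{
	return bsockets >= BIND_THRESHOLD;
}

int main(void)
{
	bsockets = 70000;	/* pretend 70k sockets are already bound */
	printf("alternate selection: %s\n",
	       use_alternate_selection() ? "yes" : "no");
	return 0;
}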

You also saw the 'again' label, which has the magic number 5 - it is another
heuristic - since the lock is dropped after the bind bucket check, and we
selected it, it is possible that a non-reuse socket will be added into the
bucket, so we will have to rerun the process again. I limited this to
5 attempts only, since it is better than what we have right now (I
never saw more than 2 attempts needed in the tests), when the number of
bound sockets does not exceed 64k.
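
In simplified form (again a userspace sketch with placeholder names, not
the kernel code), the bounded retry looks like:

#include <stdio.h>

#define MAX_ATTEMPTS	5	/* the magic number discussed above */

/* Placeholder for the real conflict check that runs under the bucket
 * lock; here we pretend a conflicting socket appeared once. */
static int bucket_has_conflict(int attempt)
{
	return attempt == 0;
}

int main(void)
{
	int attempts = MAX_ATTEMPTS;
	int attempt = 0;

again:
	if (bucket_has_conflict(attempt)) {
		/* The lock was dropped after the check, so another socket
		 * may have grabbed the bucket; retry, but give up after
		 * MAX_ATTEMPTS tries. */
		if (--attempts >= 0) {
			attempt++;
			goto again;
		}
		printf("gave up after %d attempts\n", MAX_ATTEMPTS);
		return 1;
	}
	printf("succeeded after %d retries\n", attempt);
	return 0;
}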

> > I used a free alignment slot so that the socket structure would not be
> > increased.
> 
> Are you kidding?
> 
> bsockets is not part of the socket structure, but part of "struct inet_hashinfo",

Yes, I mistyped.

> shared by all cpus and accessed several thousand times per second on many
> machines.
> 
> Please read the comment three lines after 'the free alignment slot'
> you chose.... You just introduced one write on a cache line
> that is supposed to *not* be written.

I have no objection to moving this anywhere at the end of the structure,
for example after bind_bucket_cachep.

Comments

stephen hemminger Jan. 31, 2009, 12:36 a.m. UTC | #1
On Sat, 31 Jan 2009 01:51:14 +0300
Evgeniy Polyakov <zbr@ioremap.net> wrote:

> On Fri, Jan 30, 2009 at 11:30:22PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> > > It should contain a rough number of sockets; there is no need to be very
> > > precise because of this heuristic.
> > 
> > Denying there is a bug is... well... I don't know what to say.
> > 
> > I wonder why we still use atomic_t all over the kernel.
> 
> It is not a bug. It is not supposed to be precise. At all.
> I implemented a simple heuristic for when a different bind port selection
> algorithm should start: roughly when the number of opened sockets equals
> some predefined value (a sysctl at the moment, but it could be 64k or
> anything else), so if that number is loosely maintained and does not
> precisely correspond to the number of sockets, it is not a problem.
> 
> You also saw the 'again' label, which has the magic number 5 - it is another
> heuristic - since the lock is dropped after the bind bucket check, and we
> selected it, it is possible that a non-reuse socket will be added into the
> bucket, so we will have to rerun the process again. I limited this to
> 5 attempts only, since it is better than what we have right now (I
> never saw more than 2 attempts needed in the tests), when the number of
> bound sockets does not exceed 64k.
> 
>

How is any of this supposed to fix the bug?
stephen hemminger Jan. 31, 2009, 2:52 a.m. UTC | #2
My working hypothesis is:
  1. Something about Evgeniy's patch makes IPv6 (actually IPv4-mapped IPv6)
     be preferred over plain IPv4.
  2. The Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really localhost.
  3. The protocol gets screwed up after that.

It is probably reproducible with other services that support IPv6.
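
For reference (an illustrative userspace snippet, not Vino's actual code):
a naive "is this localhost?" test that only compares against ::1 rejects
the v4-mapped form, even though its embedded address is 127.0.0.1:

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>

int main(void)
{
	struct in6_addr a;

	/* the IPv4-mapped form of 127.0.0.1 */
	inet_pton(AF_INET6, "::ffff:127.0.0.1", &a);

	/* comparing against the IPv6 loopback ::1 fails... */
	printf("equals ::1         : %d\n",
	       memcmp(&a, &in6addr_loopback, sizeof(a)) == 0);
	/* ...because it is really an IPv4-mapped address... */
	printf("IPv4-mapped        : %d\n", IN6_IS_ADDR_V4MAPPED(&a));
	/* ...whose embedded IPv4 part is 127.0.0.1 */
	printf("embedded 127.0.0.1 : %d\n",
	       a.s6_addr[12] == 127 && a.s6_addr[15] == 1);
	return 0;
}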
Evgeniy Polyakov Jan. 31, 2009, 8:35 a.m. UTC | #3
On Fri, Jan 30, 2009 at 04:36:00PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> > It is not a bug. It is not supposed to be precise. At all.
> > I implemented a simple heuristic for when a different bind port selection
> > algorithm should start: roughly when the number of opened sockets equals
> > some predefined value (a sysctl at the moment, but it could be 64k or
> > anything else), so if that number is loosely maintained and does not
> > precisely correspond to the number of sockets, it is not a problem.
> > 
> > You also saw the 'again' label, which has the magic number 5 - it is another
> > heuristic - since the lock is dropped after the bind bucket check, and we
> > selected it, it is possible that a non-reuse socket will be added into the
> > bucket, so we will have to rerun the process again. I limited this to
> > 5 attempts only, since it is better than what we have right now (I
> > never saw more than 2 attempts needed in the tests), when the number of
> > bound sockets does not exceed 64k.
> > 
> >
> 
> How is any of this supposed to fix the bug?

Nothing above fixes the bug; it was an explanation of how things work.
The patch is based on Eric's observation about the unconditional (compared
to the old code) attempt to get a new socket bucket when the code should
just return.
Evgeniy Polyakov Jan. 31, 2009, 8:37 a.m. UTC | #4
On Fri, Jan 30, 2009 at 06:52:24PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
> My working hypothesis is:
>   1. Something about Evgeniy's patch makes IPv6 (actually IPv4-mapped IPv6)
>      be preferred over plain IPv4.
>   2. The Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really localhost.
>   3. The protocol gets screwed up after that.
> 
> It is probably reproducible with other services that support IPv6.

getaddrinfo() returns a list of addresses, and IPv6 was the first one IIRC.
Previously it bailed out, but with my change it would try again without any
reason for doing so. With the patch I sent based on Eric's observation,
things should be fine.
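
Something along these lines shows the ordering (an illustrative userspace
snippet, not the virt-manager/Vino code; the "5900" port is just an example):

#include <stdio.h>
#include <string.h>
#include <netdb.h>
#include <sys/socket.h>

int main(void)
{
	struct addrinfo hints, *res, *ai;
	int err;

	memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;		/* ask for both IPv4 and IPv6 */
	hints.ai_socktype = SOCK_STREAM;

	err = getaddrinfo("localhost", "5900", &hints, &res);
	if (err) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(err));
		return 1;
	}

	/* An application that simply takes the first entry may end up on
	 * ::1 instead of 127.0.0.1, depending on resolver configuration. */
	for (ai = res; ai; ai = ai->ai_next)
		printf("family: %s\n",
		       ai->ai_family == AF_INET6 ? "AF_INET6" : "AF_INET");

	freeaddrinfo(res);
	return 0;
}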
Eric Dumazet Jan. 31, 2009, 9:17 a.m. UTC | #5
Evgeniy Polyakov wrote:
> On Fri, Jan 30, 2009 at 06:52:24PM -0800, Stephen Hemminger (shemminger@vyatta.com) wrote:
>> My working hypothesis is:
>>   1. Something about Evgeniy's patch makes IPv6 (actually IPv4-mapped IPv6)
>>      be preferred over plain IPv4.
>>   2. The Vino server (VNC) doesn't think ::ffff:127.0.0.1 is really localhost.
>>   3. The protocol gets screwed up after that.
>>
>> It is probably reproducible with other services that support IPv6.
> 
> getaddrinfo() returns a list of addresses, and IPv6 was the first one IIRC.
> Previously it bailed out, but with my change it would try again without any
> reason for doing so. With the patch I sent based on Eric's observation,
> things should be fine.
> 

The problem is that your patch is wrong, Evgeniy; please think about it a
little bit more and resubmit it.

Take the time to run this $0.02 program, before and after your upcoming fix:


$ cat size.c
#include <net/inet_hashtables.h>
extern int printf(const char *, ...);
int main(int argc, char *argv[])
{
        printf("offsetof(struct inet_hashinfo, bsockets)=0x%x\n",
                offsetof(struct inet_hashinfo, bsockets));
        return 0;
}
$ make size.o ; gcc -o size size.o ; ./size
  CHK     include/linux/version.h
  CHK     include/linux/utsrelease.h
  SYMLINK include/asm -> include/asm-x86
  CALL    scripts/checksyscalls.sh
  CC      size.o
offsetof(struct inet_hashinfo, bsockets)=0x18


An offset of 0x18 or 0x20 for bsockets gives the same result: it is bad,
because it sits in the same cache line as ehash, ehash_locks, ehash_size,
ehash_locks_mask, bhash and bhash_size, unless your CPU is a Pentium.

Also, I suggest you change bsockets to something more appropriate, e.g. a
percpu counter.
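
Something along these lines, for instance (a rough, untested sketch against
the percpu_counter API of this kernel generation; the names are made up and
this is not a patch):

/* Untested sketch: replace the plain int with a struct percpu_counter so
 * the bind/unbind hot path only touches a per-cpu cacheline. */
#include <linux/percpu_counter.h>

struct hashinfo_sketch {
	struct percpu_counter	bsockets;
	/* ... */
};

static int sketch_init(struct hashinfo_sketch *h)
{
	return percpu_counter_init(&h->bsockets, 0);
}

static void sketch_bind(struct hashinfo_sketch *h)
{
	percpu_counter_inc(&h->bsockets);	/* cheap, per-cpu increment */
}

static int sketch_over_threshold(struct hashinfo_sketch *h, int limit)
{
	/* an approximate read is fine for a heuristic like this */
	return percpu_counter_read_positive(&h->bsockets) >= limit;
}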

Thank you.
Eric


Patch

--- ./include/net/inet_hashtables.h~	2009-01-19 22:19:11.000000000 +0300
+++ ./include/net/inet_hashtables.h	2009-01-31 01:48:21.000000000 +0300
@@ -134,7 +134,6 @@ 
 	struct inet_bind_hashbucket	*bhash;
 
 	unsigned int			bhash_size;
-	int				bsockets;
 
 	struct kmem_cache		*bind_bucket_cachep;
 
@@ -148,6 +147,7 @@ 
 	 * table where wildcard'd TCP sockets can exist.  Hash function here
 	 * is just local port number.
 	 */
+	int				bsockets;
 	struct inet_listen_hashbucket	listening_hash[INET_LHTABLE_SIZE]
 					____cacheline_aligned_in_smp;
 
--- ./net/ipv4/inet_connection_sock.c~	2009-01-19 22:21:08.000000000 +0300
+++ ./net/ipv4/inet_connection_sock.c	2009-01-31 01:50:20.000000000 +0300
@@ -172,7 +172,8 @@ 
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size != -1 && --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}