virt-manager broken by bind(0) in net-next.

Message ID 498349F7.4050300@cosmosbay.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Eric Dumazet Jan. 30, 2009, 6:41 p.m. UTC
Stephen Hemminger wrote:
> On Fri, 30 Jan 2009 23:53:37 +1100
> Herbert Xu <herbert@gondor.apana.org.au> wrote:
> 
>> Evgeniy Polyakov <zbr@ioremap.net> wrote:
>>> So it is not explicit bind call, but port autoselection in the
>>> connect(). Can you check what errno is returned?
>>> Did I understand it right, that connect fails, you try different
>>> address, but then suddenly all those sockets become 'alive'?
>> Yes, I think a good strace vs. a bad strace would be really helpful
>> in these cases.
>>
>> Thanks,
> 
> I have the strace but it comes up no different.
> What is different is that in the broken case (net-next), I see
> IPV6 being used:
> 
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
> ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900
> 
> and in the working case (2.6.29-rc3), IPV4 is being used
> State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
> ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 
> 

Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a

I see use of a hashinfo->bsockets field that :

- lacks proper lock/synchronization
- suffers from cache line ping pongs on SMP

Also there might be a problem at line 175

if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
	spin_unlock(&head->lock);
	goto again;

If we entered inet_csk_get_port() with a non null snum, we can "goto again"
while it was not expected.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

Evgeniy Polyakov Jan. 30, 2009, 9:50 p.m. UTC | #1
On Fri, Jan 30, 2009 at 07:41:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> 
> I see use of a hashinfo->bsockets field that :
> 
> - lacks proper lock/synchronization

It only needs to hold a rough number of sockets; the heuristic does not
require it to be very precise.

> - suffers from cache line ping pongs on SMP

I used a free alignment slot so that the socket structure would not be
increased.

> Also there might be a problem at line 175
> 
> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
> 	spin_unlock(&head->lock);
> 	goto again;
> 
> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
> while it was not expected.
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index df8e72f..752c6b2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -172,7 +172,8 @@ tb_found:
>  		} else {
>  			ret = 1;
>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
> +					smallest_size == -1 &&  --attempts >= 0) {

I think it should be smallest_size != -1, since we really want to jump
to the again label only when the heuristic is used, which in turn changes
smallest_size.
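
The retry condition suggested in this reply can be sketched as a stand-alone predicate. This is only an illustration under stated assumptions: should_retry() and SKETCH_TCP_LISTEN are hypothetical names, not kernel symbols, and the sketch assumes, as the reply implies, that smallest_size stays at -1 unless the port autoselection heuristic has run.

```c
#include <stdbool.h>

/* Stand-in for the kernel's TCP_LISTEN state; used only in this sketch. */
enum { SKETCH_TCP_LISTEN = 10 };

/*
 * Hypothetical helper mirroring the condition proposed in the reply:
 * retry only if the socket allows reuse, is not listening, the
 * heuristic has been used (smallest_size != -1), and attempts remain.
 * The attempt counter is decremented only when all earlier checks pass,
 * because && short-circuits from left to right.
 */
static inline bool should_retry(bool reuse, int state,
				int smallest_size, int *attempts)
{
	return reuse && state != SKETCH_TCP_LISTEN &&
	       smallest_size != -1 && --(*attempts) >= 0;
}
```

With an explicit port (smallest_size still -1) the predicate is false and the attempt counter is left untouched, so the caller would not jump back to the again label.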
Eric Dumazet Jan. 30, 2009, 10:30 p.m. UTC | #2
Evgeniy Polyakov wrote:
> On Fri, Jan 30, 2009 at 07:41:59PM +0100, Eric Dumazet (dada1@cosmosbay.com) wrote:
>> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
>>
>> I see use of a hashinfo->bsockets field that :
>>
>> - lacks proper lock/synchronization
> 
> It should contain rough number of sockets, there is no need to be very
> precise because of this heuristic.

Denying there is a bug is... well... I don't know what to say.

I wonder why we still use atomic_t all over the kernel.

> 
>> - suffers from cache line ping pongs on SMP
> 
> I used free alignment slot so that socket structure would not be
> increased.

Are you kidding?

bsockets is not part of the socket structure, but part of "struct inet_hashinfo",
which is shared by all CPUs and accessed several thousand times per second on
many machines.

Please read the comment three lines after 'the free alignment slot'
you chose. You just introduced a write on a cache line
that is supposed to *not* be written.

        unsigned int                    bhash_size;
        int                             bsockets;

        struct kmem_cache               *bind_bucket_cachep;

        /* All the above members are written once at bootup and
         * never written again _or_ are predominantly read-access.
         *
         * Now align to a new cache line as all the following members
         * might be often dirty.
         */
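
To illustrate the layout concern, here is a minimal user-space sketch, not the kernel's struct inet_hashinfo. The struct name, the 64-byte cache line size and both helper functions are assumptions made for illustration; the kernel has its own alignment annotations and real cache line sizes vary by hardware.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size for this sketch */

/* Hypothetical, simplified stand-in for struct inet_hashinfo. */
struct hashinfo_sketch {
	/* Read-mostly members: written once at boot, then only read. */
	unsigned int bhash_size;
	void *bind_bucket_cachep;

	/* Frequently written counter, pushed onto its own cache line. */
	alignas(CACHE_LINE) int bsockets;
};

/* Return 1 if two member offsets fall within the same cache line. */
static inline int same_cache_line(size_t a, size_t b)
{
	return a / CACHE_LINE == b / CACHE_LINE;
}

/* Return 1 if the hot counter does not share the read-mostly line. */
static inline int bsockets_is_isolated(void)
{
	return !same_cache_line(offsetof(struct hashinfo_sketch, bhash_size),
				offsetof(struct hashinfo_sketch, bsockets));
}
```

Without the alignas, bsockets would sit right next to bhash_size, so every update of the counter would dirty the line that other CPUs only want to read.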



> 
>> Also there might be a problem at line 175
>>
>> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
>> 	spin_unlock(&head->lock);
>> 	goto again;
>>
>> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
>> while it was not expected.
>>
>> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
>> index df8e72f..752c6b2 100644
>> --- a/net/ipv4/inet_connection_sock.c
>> +++ b/net/ipv4/inet_connection_sock.c
>> @@ -172,7 +172,8 @@ tb_found:
>>  		} else {
>>  			ret = 1;
>>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
>> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
>> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
>> +					smallest_size == -1 &&  --attempts >= 0) {
> 
> I think it should be smallest_size != -1, since we really want to goto
> to the again label when heuristic is used, which in turn changes
> smallest_size.
> 

Yep


stephen hemminger Feb. 1, 2009, 5:29 a.m. UTC | #3
On Fri, 30 Jan 2009 19:41:59 +0100
Eric Dumazet <dada1@cosmosbay.com> wrote:

> Stephen Hemminger wrote:
> > On Fri, 30 Jan 2009 23:53:37 +1100
> > Herbert Xu <herbert@gondor.apana.org.au> wrote:
> > 
> >> Evgeniy Polyakov <zbr@ioremap.net> wrote:
> >>> So it is not explicit bind call, but port autoselection in the
> >>> connect(). Can you check what errno is returned?
> >>> Did I understand it right, that connect fails, you try different
> >>> address, but then suddenly all those sockets become 'alive'?
> >> Yes, I think a good strace vs. a bad strace would be really helpful
> >> in these cases.
> >>
> >> Thanks,
> > 
> > I have the strace but it comes up no different.
> > What is different is that in the broken case (net-next), I see
> > IPV6 being used:
> > 
> > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > ESTAB      23769  0        ::ffff:127.0.0.1:5900      ::ffff:127.0.0.1:55987   
> > ESTAB      0      0               127.0.0.1:55987            127.0.0.1:5900
> > 
> > and in the working case (2.6.29-rc3), IPV4 is being used
> > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > ESTAB      0      0               127.0.0.1:58894            127.0.0.1:5901    
> > ESTAB      0      0               127.0.0.1:5901             127.0.0.1:58894 
> > 
> 
> Reviewing commit a9d8f9110d7e953c2f2b521087a4179677843c2a
> 
> I see use of a hashinfo->bsockets field that :
> 
> - lacks proper lock/synchronization
> - suffers from cache line ping pongs on SMP
> 
> Also there might be a problem at line 175
> 
> if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) { 
> 	spin_unlock(&head->lock);
> 	goto again;
> 
> If we entered inet_csk_get_port() with a non null snum, we can "goto again"
> while it was not expected.
> 
> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index df8e72f..752c6b2 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -172,7 +172,8 @@ tb_found:
>  		} else {
>  			ret = 1;
>  			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
> -				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
> +				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
> +					smallest_size == -1 &&  --attempts >= 0) {
>  					spin_unlock(&head->lock);
>  					goto again;
>  				}
> 
> 

That didn't fix it.
Patch

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index df8e72f..752c6b2 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -172,7 +172,8 @@  tb_found:
 		} else {
 			ret = 1;
 			if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb)) {
-				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN && --attempts >= 0) {
+				if (sk->sk_reuse && sk->sk_state != TCP_LISTEN &&
+					smallest_size == -1 &&  --attempts >= 0) {
 					spin_unlock(&head->lock);
 					goto again;
 				}