diff mbox

2.6.34: Problem with UDP traffic on lo + poll(?)

Message ID 1283887569.2634.95.camel@edumazet-laptop
State Changes Requested, archived
Delegated to: David Miller
Headers show

Commit Message

Eric Dumazet Sept. 7, 2010, 7:26 p.m. UTC
Le mardi 07 septembre 2010 à 18:36 +0200, Eric Dumazet a écrit :

> Hmm, I have a pretty good idea of what the problem is, and will post a
> fix soon ;)

David, if you feel this is too invasive for stable, we can make UDP
rehash the socket in case we dont want to change ip4_datagram_connect()


[PATCH] inet: Dont set inet_rcv_saddr in connect()

So the problem is that the sequence :

socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)

connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
sin_addr=inet_addr("1.2.3.4")}, 28)

1) Does an implicit inet_autobind()
  (using an INADDR_ANY address, and selecting a random port).

2) Then does an ip4_datagram_connect() to specify the address/port of
remote end point.

Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
INADDR_ANY to IP source address, given the current route to remote end
point). Only the first connect() on the socket does this. Following ones
dont change the (possibly wrong) source address.

This breaks the secondary UDP hash, based on (ADDRESS, port), that was
computed by inet_autobind().

This also potentially breaks multiple connect() to change remote
endpoints, because old source address might be non usable for packets to
new destination.

If route happens to change, then we should automatically change our
source address too, at next sendmsg() call, and UDP code deals with this
just fine.

If an application needs to specify a precise source address, it must use
bind() system call. connect() man page only refers to remote address,
not local one.

Reported-by: Krzysztof Olędzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/datagram.c |    9 +++++++++
 1 file changed, 9 insertions(+)



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Comments

David Miller Sept. 7, 2010, 7:59 p.m. UTC | #1
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 07 Sep 2010 21:26:09 +0200

>  	}
> +/*
> + * Should connect() change inet_rcv_saddr ?
> + * It should not IMHO, because we want to specify the peer to which
> + * datagrams are to be sent, regardless of our source address that might
> + * change in the future, after a route change.
> + * To specify our source address, bind() is the right API.
> + */
> +#if 0
>  	if (!inet->inet_saddr)
>  		inet->inet_saddr = rt->rt_src;	/* Update source address */
>  	if (!inet->inet_rcv_saddr)
>  		inet->inet_rcv_saddr = rt->rt_src;
> +#endif
>  	inet->inet_daddr = rt->rt_dst;
>  	inet->inet_dport = usin->sin_port;
>  	sk->sk_state = TCP_ESTABLISHED;

Eric, please just delete the code block instead of leaving it
there inside of an #if 0 block.

If there is information conveyed by the unused code, add that
information to the nice comment you're adding :-)

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Krzysztof Oledzki Sept. 7, 2010, 9:28 p.m. UTC | #2
On 2010-09-07 21:26, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 18:36 +0200, Eric Dumazet a écrit :
>
>> Hmm, I have a pretty good idea of what the problem is, and will post a
>> fix soon ;)
>
> David, if you feel this is too invasive for stable, we can make UDP
> rehash the socket in case we dont want to change ip4_datagram_connect()
>
>
> [PATCH] inet: Dont set inet_rcv_saddr in connect()
>
> So the problem is that the sequence :
>
> socket(PF_INET, SOCK_DGRAM, IPPROTO_IP)
>
> connect(fd, {sa_family=AF_INET, sin_port=htons(xx),
> sin_addr=inet_addr("1.2.3.4")}, 28)
>
> 1) Does an implicit inet_autobind()
>    (using an INADDR_ANY address, and selecting a random port).
>
> 2) Then does an ip4_datagram_connect() to specify the address/port of
> remote end point.
>
> Problem is ip4_datagram_connect() also sets inet->inet_rcv_saddr (from
> INADDR_ANY to IP source address, given the current route to remote end
> point). Only the first connect() on the socket does this. Following ones
> dont change the (possibly wrong) source address.
>
> This breaks the secondary UDP hash, based on (ADDRESS, port), that was
> computed by inet_autobind().
>
> This also potentially breaks multiple connect() to change remote
> endpoints, because old source address might be non usable for packets to
> new destination.
>
> If route happens to change, then we should automatically change our
> source address too, at next sendmsg() call, and UDP code deals with this
> just fine.
>
> If an application needs to specify a precise source address, it must use
> bind() system call. connect() man page only refers to remote address,
> not local one.
>
> Reported-by: Krzysztof Olędzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>   net/ipv4/datagram.c |    9 +++++++++
>   1 file changed, 9 insertions(+)
>
> diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
> index f055094..8a17241 100644
> --- a/net/ipv4/datagram.c
> +++ b/net/ipv4/datagram.c
> @@ -60,10 +60,19 @@ int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
>   		ip_rt_put(rt);
>   		return -EACCES;
>   	}
> +/*
> + * Should connect() change inet_rcv_saddr ?
> + * It should not IMHO, because we want to specify the peer to which
> + * datagrams are to be sent, regardless of our source address that might
> + * change in the future, after a route change.
> + * To specify our source address, bind() is the right API.
> + */
> +#if 0
>   	if (!inet->inet_saddr)
>   		inet->inet_saddr = rt->rt_src;	/* Update source address */
>   	if (!inet->inet_rcv_saddr)
>   		inet->inet_rcv_saddr = rt->rt_src;
> +#endif
>   	inet->inet_daddr = rt->rt_dst;
>   	inet->inet_dport = usin->sin_port;
>   	sk->sk_state = TCP_ESTABLISHED;

With the above patch I'm no longer able to reproduce the problem. Thanks!

Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>

BTW: why it takes so long to trigger this bug and it is only possible 
over a loopback interface?

Best regards,

			Krzysztof Olędzki
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 7, 2010, 9:39 p.m. UTC | #3
Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :

> With the above patch I'm no longer able to reproduce the problem. Thanks!
> 
> Tested-by: Krzysztof Piotr Oledzki <ole@ans.pl>
> 

Thanks a lot !

> BTW: why it takes so long to trigger this bug and it is only possible 
> over a loopback interface?

Its a bit tricky : You need at least 10 sockets linked in a particular
hash chain.

To check this, you can :

cat /proc/net/udp

maybe you have many sockets on port 123 or 53 ?

And about loopback, I have no idea... I am pretty sure I can trigger the
bug with other interfaces.



--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Krzysztof Oledzki Sept. 7, 2010, 9:51 p.m. UTC | #4
On 2010-09-07 23:39, Eric Dumazet wrote:
> Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :
> 
>> With the above patch I'm no longer able to reproduce the problem. Thanks!
>>
>> Tested-by: Krzysztof Piotr Oledzki<ole@ans.pl>
>>
> 
> Thanks a lot !
> 
>> BTW: why it takes so long to trigger this bug and it is only possible
>> over a loopback interface?
> 
> Its a bit tricky : You need at least 10 sockets linked in a particular
> hash chain.
> 
> To check this, you can :
> 
> cat /proc/net/udp
> 
> maybe you have many sockets on port 123 or 53 ?

On one affected host I have 3+7 and on the other, also affacted one, I have 3+6:

root@sowa:~# egrep -cw '(53|123):' /proc/net/udp
10
root@sowa:~# egrep -w '(53|123):' /proc/net/udp
   53: 3582A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084654 2 ffff8800cc012700 0
   53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084652 2 ffff8800cc010900 0
  123: D683A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4911 2 ffff88012de96400 0
  123: 7B85A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4910 2 ffff88012de96100 0
  123: 8982A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4909 2 ffff88012de95e00 0
  123: 7B82A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4908 2 ffff88012de95b00 0
  123: 3582A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4907 2 ffff88012de95800 0
  123: 1F7EA8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4906 2 ffff88012de95500 0
  123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4905 2 ffff88012de95200 0
  123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4899 2 ffff88012de94c00 0

root@cmyk:~# egrep -cw '(53|123):' /proc/net/udp
9
root@cmyk:~# egrep -w '(53|123):' /proc/net/udp
   53: A2CD1253:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3774 2 eeaa9840 0
   53: 0100A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3772 2 eeaa9a40 0
   53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 3770 2 eeaa9c40 0
  123: A6CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4693 2 eeacdc80 0
  123: A5CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4692 2 eeaa9640 0
  123: A2CD1253:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4691 2 ed201700 0
  123: 0100A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4690 2 eeacd880 0
  123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4689 2 ec649740 0
  123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4683 2 ed201d00 0

But how 123 is related to 53?

> And about loopback, I have no idea... I am pretty sure I can trigger the
> bug with other interfaces.

OK. Probably it is because my other hosts have only a single IP and only
the problematic ones have both DNS server and multiple IP (many sockets).

Best regards,

			Krzysztof Olędzki
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Eric Dumazet Sept. 8, 2010, 4:12 a.m. UTC | #5
Le mardi 07 septembre 2010 à 23:51 +0200, Krzysztof Olędzki a écrit :
> On 2010-09-07 23:39, Eric Dumazet wrote:
> > Le mardi 07 septembre 2010 à 23:28 +0200, Krzysztof Olędzki a écrit :
> > 
> >> With the above patch I'm no longer able to reproduce the problem. Thanks!
> >>
> >> Tested-by: Krzysztof Piotr Oledzki<ole@ans.pl>
> >>
> > 
> > Thanks a lot !
> > 
> >> BTW: why it takes so long to trigger this bug and it is only possible
> >> over a loopback interface?
> > 
> > Its a bit tricky : You need at least 10 sockets linked in a particular
> > hash chain.
> > 
> > To check this, you can :
> > 
> > cat /proc/net/udp
> > 
> > maybe you have many sockets on port 123 or 53 ?
> 
> On one affected host I have 3+7 and on the other, also affacted one, I have 3+6:
> 
> root@sowa:~# egrep -cw '(53|123):' /proc/net/udp
> 10
> root@sowa:~# egrep -w '(53|123):' /proc/net/udp
>    53: 3582A8C0:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084654 2 ffff8800cc012700 0
>    53: 0100007F:0035 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 6084652 2 ffff8800cc010900 0
>   123: D683A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4911 2 ffff88012de96400 0
>   123: 7B85A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4910 2 ffff88012de96100 0
>   123: 8982A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4909 2 ffff88012de95e00 0
>   123: 7B82A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4908 2 ffff88012de95b00 0
>   123: 3582A8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4907 2 ffff88012de95800 0
>   123: 1F7EA8C0:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4906 2 ffff88012de95500 0
>   123: 0100007F:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4905 2 ffff88012de95200 0
>   123: 00000000:007B 00000000:0000 07 00000000:00000000 00:00000000 00000000     0        0 4899 2 ffff88012de94c00 0
> 
> But how 123 is related to 53?
> 

I was mentioning 123 or 53, as probable suspects :)


When a socket is created, and connect() called, autobind() chooses a
source port X for this socket. 

if ((X % udp_hash_size) == 123), socket is inserted in hash chain number
123.

Bug then triggers, because when a packet is received for this socket, we
find a slot with more than 10 sockets -> Search is done on secondary
chain Z2, where we dont find the socket since its rcv_addr changed after
we inserted it (in chain Y2). Packet is dropped (as seen in netstat -s)


> > And about loopback, I have no idea... I am pretty sure I can trigger the
> > bug with other interfaces.
> 
> OK. Probably it is because my other hosts have only a single IP and only
> the problematic ones have both DNS server and multiple IP (many sockets).
> 

Yes


--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/net/ipv4/datagram.c b/net/ipv4/datagram.c
index f055094..8a17241 100644
--- a/net/ipv4/datagram.c
+++ b/net/ipv4/datagram.c
@@ -60,10 +60,19 @@  int ip4_datagram_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 		ip_rt_put(rt);
 		return -EACCES;
 	}
+/*
+ * Should connect() change inet_rcv_saddr ?
+ * It should not IMHO, because we want to specify the peer to which
+ * datagrams are to be sent, regardless of our source address that might
+ * change in the future, after a route change.
+ * To specify our source address, bind() is the right API.
+ */
+#if 0
 	if (!inet->inet_saddr)
 		inet->inet_saddr = rt->rt_src;	/* Update source address */
 	if (!inet->inet_rcv_saddr)
 		inet->inet_rcv_saddr = rt->rt_src;
+#endif
 	inet->inet_daddr = rt->rt_dst;
 	inet->inet_dport = usin->sin_port;
 	sk->sk_state = TCP_ESTABLISHED;