diff mbox series

net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

Message ID 20180603174705.51802-1-zenczykowski@gmail.com
State Accepted, archived
Delegated to: David Miller
Headers show
Series net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets | expand

Commit Message

Maciej Żenczykowski June 3, 2018, 5:47 p.m. UTC
From: Maciej Żenczykowski <maze@google.com>

It is not safe to do so because such sockets are already in the
hash tables and changing these options can result in invalidating
the tb->fastreuse(port) caching.

This can have later far reaching consequences wrt. bind conflict checks
which rely on these caches (for optimization purposes).

Not to mention that you can currently end up with two identical
non-reuseport listening sockets bound to the same local ip:port
by clearing reuseport on them after they've already both been bound.

There is unfortunately no EISBOUND error or anything similar,
and EISCONN seems to be misleading for a bound-but-not-connected
socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
is the closest you can get to meaning 'socket in bad state'.
(although perhaps EINVAL wouldn't be a bad choice either?)

This does unfortunately run the risk of breaking buggy
userspace programs...

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Eric Dumazet <edumazet@google.com>

Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
---
 net/core/sock.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Comments

Christoph Paasch June 3, 2018, 7:54 p.m. UTC | #1
Hello,

On Sun, Jun 3, 2018 at 10:47 AM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> From: Maciej Żenczykowski <maze@google.com>
>
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
>
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
>
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.

as a side-note: Some time back I realized that one can also - on the
active opener side - create two TCP connections with the same 5-tuple
going out over the same interface.

One simply needs to first create a connection with a socket that has
SO_BINDTODEV set that specifies the same interface as the default
route. The second socket (which doesn't uses SO_BINDTODEV) then can
end up using the same source-port, if the range of available ports has
been exhausted.
This makes for some interesting packet-traces! :)

This is because INET_MATCH in __inet_check_established only checks for
!(sk->sk_bound_dev_if). inet_hash_connect() probably would need info
of the route's outgoing interface (of the new socket) to decide
whether or not there is a match.

But even that wouldn't be failsafe as the routing could change later
on... So, I dropped the ball on that.

Not sure if it's a big deal or not...


Cheers,
Christoph



>
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
>
> This does unfortunately run the risk of breaking buggy
> userspace programs...
>
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
>
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> ---
>  net/core/sock.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>                         sock_valbool_flag(sk, SOCK_DBG, valbool);
>                 break;
>         case SO_REUSEADDR:
> -               sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +               val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +               if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +                   inet_sk(sk)->inet_num &&
> +                   (sk->sk_reuse != val)) {
> +                       ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +                       break;
> +               }
> +               sk->sk_reuse = val;
>                 break;
>         case SO_REUSEPORT:
> +               if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +                   inet_sk(sk)->inet_num &&
> +                   (sk->sk_reuseport != valbool)) {
> +                       ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +                       break;
> +               }
>                 sk->sk_reuseport = valbool;
>                 break;
>         case SO_TYPE:
> --
> 2.17.1.1185.g55be947832-goog
>
Eric Dumazet June 4, 2018, 5:24 p.m. UTC | #2
On 06/03/2018 10:47 AM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 

Reviewed-by: Eric Dumazet <edumazet@google.com>
David Miller June 4, 2018, 9:14 p.m. UTC | #3
From: "Maciej Żenczykowski" <zenczykowski@gmail.com>
Date: Sun,  3 Jun 2018 10:47:05 -0700

> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b

Applied and queued up for -stable.
Andrei Vagin June 6, 2018, 11:25 p.m. UTC | #4
This patch breaks CRIU tests:

===================== Run zdtm/transition/socket-tcp6 in h =====================
Start test
./socket-tcp6 --pidfile=socket-tcp6.pid --outfile=socket-tcp6.out
start time for zdtm/transition/socket-tcp6: 0.90
Run criu dump
Run criu restore
=[log]=> dump/zdtm/transition/socket-tcp6/39/1/restore.log
------------------------ grep Error ------------------------
(00.132120)     39: 	restore rcvlowat 1 for socket
(00.132130)     39: 	restore mark 0 for socket
(00.132149)     39: 		Create fd for 6
(00.132157)     39: Schedule 3 socket for repair off
(00.132183)     39: Error (criu/sockets.c:379): Can't set 1:15 (len 4): Structure needs cleaning
(00.132192)     39: Error (criu/files.c:1243): Unable to open fd=6 id=0x8
(00.132241) Error (criu/cr-restore.c:2391): Failed to wait inprogress tasks
(00.132286) Error (criu/cr-restore.c:2568): Restoring FAILED.
------------------------ ERROR OVER ------------------------
############ Test zdtm/transition/socket-tcp6 FAIL at CRIU restore #############

https://travis-ci.org/avagin/linux/jobs/388989833

We use these options to restore tcp sockets. On the first stage, CRIU creates
all sockets with SO_REUSEADDR and SO_REUSEPORT, than it restores established
and listening sockets. After that criu restores values of SO_REUSEADDR and
SO_REUSEPORT options.

On Sun, Jun 03, 2018 at 10:47:05AM -0700, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> 
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/sock.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>  			sock_valbool_flag(sk, SOCK_DBG, valbool);
>  		break;
>  	case SO_REUSEADDR:
> -		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuse != val)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
> +		sk->sk_reuse = val;
>  		break;
>  	case SO_REUSEPORT:
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuseport != valbool)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
>  		sk->sk_reuseport = valbool;
>  		break;
>  	case SO_TYPE:
Maciej Żenczykowski June 7, 2018, 12:25 a.m. UTC | #5
Yes, it does, we found this internally last night and been debating
what to do about it.

Fundamentally what it points out is that prior to this patch CRIU
could get the host into an inconsistent state.
It inserts all the sockets into the hashtables with SO_REUSEADDR set,
and then (potentially) clears it on some of them...
But the tb cache still thinks it's set on all of them.
So later attempts to bind() a socket with SO_REUSEADDR set can then
succeed even though they should fail (or something like that).

I wonder if we instead need a socket option to basically say 'ignore
all conflicts' that CRIU could set, and then clear post
bind/listen/connect
hash table insertion...

Or maybe the transition from 1->0 is valid, but from 0->1 isn't??

Or we need special per-protocol code in the SO_REUSE{ADDR,PORT}
setsockopt handler to recalculate the tb cache?

Anyone have any smart ideas?
Andrei Vagin June 7, 2018, 5:51 a.m. UTC | #6
On Wed, Jun 06, 2018 at 05:25:51PM -0700, Maciej Żenczykowski wrote:
> Yes, it does, we found this internally last night and been debating
> what to do about it.
> 
> Fundamentally what it points out is that prior to this patch CRIU
> could get the host into an inconsistent state.

Yes, I understand the problem. It would be good to
find a way how to fix this without breaking CRIU...

> It inserts all the sockets into the hashtables with SO_REUSEADDR set,
> and then (potentially) clears it on some of them...
> But the tb cache still thinks it's set on all of them.
> So later attempts to bind() a socket with SO_REUSEADDR set can then
> succeed even though they should fail (or something like that).
> 
> I wonder if we instead need a socket option to basically say 'ignore
> all conflicts' that CRIU could set, and then clear post
> bind/listen/connect
> hash table insertion...

> 
> Or maybe the transition from 1->0 is valid, but from 0->1 isn't??

I wanted to say that criu needs only the transition from 1->0, but then
I found that that TCP_REPAIR changes sk->sk_reuse too. When we switch a
socket into the repair mode, sk_reuse is set to SK_FORCE_REUSE. But when
we disable the repair mode for a socket, sk_reuse is always set to
SK_NO_REUSE, then we need to be able to restore the origin value for
it somehow...

> 
> Or we need special per-protocol code in the SO_REUSE{ADDR,PORT}
> setsockopt handler to recalculate the tb cache?
> 
> Anyone have any smart ideas?
Maciej Żenczykowski June 8, 2018, 10:07 a.m. UTC | #7
I think we probably need to make sk->sk_reuse back into a boolean.
(ie. eliminate SK_FORCE_REUSE)

Then add a new tcp/udp sk->ignore_bind_conflicts boolean setting...
(ie. not just for tcp, but sol_socket)  [or perhaps SO_REPAIR,
sk->repair or something]

What I'm not certain of is exactly what sorts of conflicts it should ignore...
all?  probably not, still seems utterly wrong to allow creation of 2 connected
tcp sockets with identical 5-tuples.

Would it only ignore conflicts against other i_b_c sockets?
ie. set it on all sockets as we're repairing, then clear it on them
all once we're done?

and ignore all the fast caching when checking conflicts for an i_b_c socket?

For CRIU is it safe to assume we're restoring an entire namespace into
a new namespace?

Could we perhaps instead allow a new namespace to ignore bind conflicts until
we flip it into enforcing mode?
Andrei Vagin June 11, 2018, 6:35 p.m. UTC | #8
Cc: Pavel

On Fri, Jun 08, 2018 at 03:07:30AM -0700, Maciej Żenczykowski wrote:
> I think we probably need to make sk->sk_reuse back into a boolean.
> (ie. eliminate SK_FORCE_REUSE)
> 
> Then add a new tcp/udp sk->ignore_bind_conflicts boolean setting...
> (ie. not just for tcp, but sol_socket)  [or perhaps SO_REPAIR,
> sk->repair or something]
> 
> What I'm not certain of is exactly what sorts of conflicts it should ignore...
> all?  probably not, still seems utterly wrong to allow creation of 2 connected
> tcp sockets with identical 5-tuples.

It is required when we are restoring i_b_c sockets on a server side.  In
this cases, they all have the same source address of a listening socket.

To restore these sockets, we need to be able to create a listening socket
and all i_b_c sockets and bind them all to the same source address.

BTW: Here is an example of how tcp_repair works:
https://github.com/avagin/tcp-repair/blob/master/tcp-constructor.c

> 
> Would it only ignore conflicts against other i_b_c sockets?
> ie. set it on all sockets as we're repairing, then clear it on them
> all once we're done?

TCP_REPAIR (which is set SK_FORCE_REUSE) is used to restore only i_b_c
sockets. SK_FORCE_REUSE is needed to ignore bind conflicts for repaired
sockets. It ignores conflicts agains other i_b_c and listen sockets.

The current idea is that CRIU will restore listening sockets first, and
them it will restore i_b_c sockets.

Pls, take a look at the attached patch.

> 
> and ignore all the fast caching when checking conflicts for an i_b_c socket?
> 
> For CRIU is it safe to assume we're restoring an entire namespace into
> a new namespace?

No. It isn't. CRIU can restore processes in an existing network namespace.

> 
> Could we perhaps instead allow a new namespace to ignore bind conflicts until
> we flip it into enforcing mode?

No, we could not
From 990baa56993827ae6f4441cf078eddf73389d6ee Mon Sep 17 00:00:00 2001
From: Andrei Vagin <avagin@openvz.org>
Date: Fri, 8 Jun 2018 23:27:46 -0700
Subject: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse

Currently sk_reuse can have there values: SK_NO_REUSE, SK_CAN_REUSE,
SK_FORCE_REUSE. SK_CAN_REUSE is set by SOL_REUSEADDR.  SK_FORCE_REUSE is
used to ignore bind conflicts for sockets in the repair mode.

This patch makes sk->sk_reuse back into a boolean and adds
sk->sk_force_reuse to track SK_FORCE_REUSE separatly.

Recently here were changes which prohibit to change
SO_REUSEADDR/SO_REUSEPORT on bound sockets and now it is impossible to
set origin values of these parameters for restored (repaired) sockets.

With introduced changes, the tcp_repair mode doesn't affect sk_reuse, so
it is possible to set its value before switching a socket into the
repair mode.

Fixes: f396922d862a ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets")
Signed-off-by: Andrei Vagin <avagin@openvz.org>
---
 include/net/sock.h              | 13 ++++---------
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp.c                  |  4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..8ad19286ab9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -130,6 +130,7 @@ typedef __u64 __bitwise __addrpair;
  *	@skc_family: network address family
  *	@skc_state: Connection state
  *	@skc_reuse: %SO_REUSEADDR setting
+ *	@skc_force_reuse: ignore bind conflicts
  *	@skc_reuseport: %SO_REUSEPORT setting
  *	@skc_bound_dev_if: bound device index if != 0
  *	@skc_bind_node: bind hash linkage for various protocol lookup tables
@@ -174,7 +175,8 @@ struct sock_common {
 
 	unsigned short		skc_family;
 	volatile unsigned char	skc_state;
-	unsigned char		skc_reuse:4;
+	unsigned char		skc_reuse:1;
+	unsigned char		skc_force_reuse:1;
 	unsigned char		skc_reuseport:1;
 	unsigned char		skc_ipv6only:1;
 	unsigned char		skc_net_refcnt:1;
@@ -339,6 +341,7 @@ struct sock {
 #define sk_family		__sk_common.skc_family
 #define sk_state		__sk_common.skc_state
 #define sk_reuse		__sk_common.skc_reuse
+#define sk_force_reuse		__sk_common.skc_force_reuse
 #define sk_reuseport		__sk_common.skc_reuseport
 #define sk_ipv6only		__sk_common.skc_ipv6only
 #define sk_net_refcnt		__sk_common.skc_net_refcnt
@@ -502,16 +505,8 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)	rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)	rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
-/*
- * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
- * or not whether his port will be reused by someone else. SK_FORCE_REUSE
- * on a socket means that the socket will reuse everybody else's port
- * without looking at the other's sk_reuse value.
- */
-
 #define SK_NO_REUSE	0
 #define SK_CAN_REUSE	1
-#define SK_FORCE_REUSE	2
 
 int sk_set_peek_off(struct sock *sk, int val);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 33a88e045efd..2ac1c591b60c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -306,7 +306,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 		goto fail_unlock;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
-		if (sk->sk_reuse == SK_FORCE_REUSE)
+		if (sk->sk_force_reuse)
 			goto success;
 
 		if ((tb->fastreuse > 0 && reuse) ||
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 2741953adaba..70bfdd5a2fc4 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2810,11 +2810,11 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
 			err = -EPERM;
 		else if (val == 1) {
 			tp->repair = 1;
-			sk->sk_reuse = SK_FORCE_REUSE;
+			sk->sk_force_reuse = 1;
 			tp->repair_queue = TCP_NO_QUEUE;
 		} else if (val == 0) {
 			tp->repair = 0;
-			sk->sk_reuse = SK_NO_REUSE;
+			sk->sk_force_reuse = 0;
 			tcp_send_window_probe(sk);
 		} else
 			err = -EINVAL;
Andrei Vagin June 11, 2018, 6:57 p.m. UTC | #9
On Sun, Jun 03, 2018 at 10:47:05AM -0700, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> Cc: Eric Dumazet <edumazet@google.com>
> 
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> Reviewed-by: Eric Dumazet <edumazet@google.com>
> ---
>  net/core/sock.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>  			sock_valbool_flag(sk, SOCK_DBG, valbool);
>  		break;
>  	case SO_REUSEADDR:
> -		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuse != val)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;

There are a few more states like TCP_LAST_ACK, TCP_CLOSE_WAIT which
means that a socket is connected.

Actually, I don't see any reasons to return two different values here.

> +			break;
> +		}
> +		sk->sk_reuse = val;
>  		break;
>  	case SO_REUSEPORT:
> +		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> +		    inet_sk(sk)->inet_num &&
> +		    (sk->sk_reuseport != valbool)) {
> +			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
> +			break;
> +		}
>  		sk->sk_reuseport = valbool;
>  		break;
>  	case SO_TYPE:
Marc Dionne June 11, 2018, 9:25 p.m. UTC | #10
On Sun, Jun 3, 2018 at 2:47 PM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
> From: Maciej Żenczykowski <maze@google.com>
>
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
>
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
>
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
>
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
>
> This does unfortunately run the risk of breaking buggy
> userspace programs...

This change is a potential performance regression for us.  Our
networking code sets up multiple sockets used by multiple threads to
listen on the same udp port.  For the first socket, the bind is done
before setting SO_REUSEPORT so that we can get an error and detect
that the port is currently in use by a different process, possibly a
previous incarnation of the same application.

With this change, the servers end up with a single listener socket and
thread, which can have a large impact on performance for a busy
server.

While we can adjust for the new behaviour going forward, this could be
an unpleasant surprise for our customers who get this update when
moving to a new kernel version or through a backport to stable
kernels, if this is targeted for stable.

Marc
Maciej Żenczykowski June 11, 2018, 10:29 p.m. UTC | #11
> This change is a potential performance regression for us.  Our
> networking code sets up multiple sockets used by multiple threads to
> listen on the same udp port.  For the first socket, the bind is done
> before setting SO_REUSEPORT so that we can get an error and detect
> that the port is currently in use by a different process, possibly a
> previous incarnation of the same application.
>
> With this change, the servers end up with a single listener socket and
> thread, which can have a large impact on performance for a busy
> server.
>
> While we can adjust for the new behaviour going forward, this could be
> an unpleasant surprise for our customers who get this update when
> moving to a new kernel version or through a backport to stable
> kernels, if this is targeted for stable.

Probably you can:
fd1=socket()
fd2=socket()
bind(fd1,port=0)
setsockopt(fd2,REUSEPORT,1)
port = getsockname(fd1)
close(fd1)
bind(fd2,port)

Although yeah there's a slight chance of a race (ie. 2nd bind failing,
in which case close() and retry).
Marc Dionne June 11, 2018, 11:09 p.m. UTC | #12
On Mon, Jun 11, 2018 at 7:29 PM, Maciej Żenczykowski
<zenczykowski@gmail.com> wrote:
>> This change is a potential performance regression for us.  Our
>> networking code sets up multiple sockets used by multiple threads to
>> listen on the same udp port.  For the first socket, the bind is done
>> before setting SO_REUSEPORT so that we can get an error and detect
>> that the port is currently in use by a different process, possibly a
>> previous incarnation of the same application.
>>
>> With this change, the servers end up with a single listener socket and
>> thread, which can have a large impact on performance for a busy
>> server.
>>
>> While we can adjust for the new behaviour going forward, this could be
>> an unpleasant surprise for our customers who get this update when
>> moving to a new kernel version or through a backport to stable
>> kernels, if this is targeted for stable.
>
> Probably you can:
> fd1=socket()
> fd2=socket()
> bind(fd1,port=0)
> setsockopt(fd2,REUSEPORT,1)
> port = getsockname(fd1)
> close(fd1)
> bind(fd2,port)
>
> Although yeah there's a slight chance of a race (ie. 2nd bind failing,
> in which case close() and retry).

This would be a bit different since we want a specific port rather
than a wildcard, but yes, a temporary socket could be bound just to
check if the port is in use and closed afterwards.  The problem is
that since fd2 has REUSEPORT set, that second bind might well succeed
if there was a race and the other user of the port also has REUSEPORT
set, as could be the case if it's another instance of the same
application.

Marc
diff mbox series

Patch

diff --git a/net/core/sock.c b/net/core/sock.c
index 435a0ba85e52..feca4c98f8a0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -728,9 +728,22 @@  int sock_setsockopt(struct socket *sock, int level, int optname,
 			sock_valbool_flag(sk, SOCK_DBG, valbool);
 		break;
 	case SO_REUSEADDR:
-		sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
+		val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
+		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
+		    inet_sk(sk)->inet_num &&
+		    (sk->sk_reuse != val)) {
+			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
+			break;
+		}
+		sk->sk_reuse = val;
 		break;
 	case SO_REUSEPORT:
+		if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
+		    inet_sk(sk)->inet_num &&
+		    (sk->sk_reuseport != valbool)) {
+			ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : -EUCLEAN;
+			break;
+		}
 		sk->sk_reuseport = valbool;
 		break;
 	case SO_TYPE: