diff mbox series

[net-next] net: Add TCP_FORCE_LINGER2 to TCP setsockopt

Message ID 20200421121737.3269-1-cambda@linux.alibaba.com
State Changes Requested
Delegated to: David Miller
Headers show
Series [net-next] net: Add TCP_FORCE_LINGER2 to TCP setsockopt | expand

Commit Message

Cambda Zhu April 21, 2020, 12:17 p.m. UTC
This patch adds a new TCP socket option named TCP_FORCE_LINGER2. The
option has same behavior as TCP_LINGER2, except the tp->linger2 value
can be greater than sysctl_tcp_fin_timeout if the user_ns is capable
with CAP_NET_ADMIN.

As a server, different sockets may need different FIN-WAIT timeout and
in most cases the system default value will be used. The timeout can
be adjusted by setting TCP_LINGER2 but cannot be greater than the
system default value. If one socket needs a timeout greater than the
default, we have to adjust the sysctl which affects all sockets using
the system default value. And if we want to adjust it for just one
socket and keep the original value for others, all the other sockets
have to set TCP_LINGER2. But with TCP_FORCE_LINGER2, the net admin can
set greater tp->linger2 than the default for one socket and keep
the sysctl_tcp_fin_timeout unchanged.

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
---
 include/uapi/linux/capability.h | 1 +
 include/uapi/linux/tcp.h        | 1 +
 net/ipv4/tcp.c                  | 9 +++++++++
 3 files changed, 11 insertions(+)

Comments

David Miller April 23, 2020, 2:19 a.m. UTC | #1
From: Cambda Zhu <cambda@linux.alibaba.com>
Date: Tue, 21 Apr 2020 20:17:37 +0800

> This patch adds a new TCP socket option named TCP_FORCE_LINGER2. The
> option has same behavior as TCP_LINGER2, except the tp->linger2 value
> can be greater than sysctl_tcp_fin_timeout if the user_ns is capable
> with CAP_NET_ADMIN.
> 
> As a server, different sockets may need different FIN-WAIT timeout and
> in most cases the system default value will be used. The timeout can
> be adjusted by setting TCP_LINGER2 but cannot be greater than the
> system default value. If one socket needs a timeout greater than the
> default, we have to adjust the sysctl which affects all sockets using
> the system default value. And if we want to adjust it for just one
> socket and keep the original value for others, all the other sockets
> have to set TCP_LINGER2. But with TCP_FORCE_LINGER2, the net admin can
> set greater tp->linger2 than the default for one socket and keep
> the sysctl_tcp_fin_timeout unchanged.
> 
> Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>

Eric, please review.
Eric Dumazet April 23, 2020, 3:01 a.m. UTC | #2
On 4/21/20 5:17 AM, Cambda Zhu wrote:
> This patch adds a new TCP socket option named TCP_FORCE_LINGER2. The
> option has same behavior as TCP_LINGER2, except the tp->linger2 value
> can be greater than sysctl_tcp_fin_timeout if the user_ns is capable
> with CAP_NET_ADMIN.
> 
> As a server, different sockets may need different FIN-WAIT timeout and
> in most cases the system default value will be used. The timeout can
> be adjusted by setting TCP_LINGER2 but cannot be greater than the
> system default value. If one socket needs a timeout greater than the
> default, we have to adjust the sysctl which affects all sockets using
> the system default value. And if we want to adjust it for just one
> socket and keep the original value for others, all the other sockets
> have to set TCP_LINGER2. But with TCP_FORCE_LINGER2, the net admin can
> set greater tp->linger2 than the default for one socket and keep
> the sysctl_tcp_fin_timeout unchanged.
> 
> Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
> ---
>  include/uapi/linux/capability.h | 1 +
>  include/uapi/linux/tcp.h        | 1 +
>  net/ipv4/tcp.c                  | 9 +++++++++
>  3 files changed, 11 insertions(+)
> 
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 272dc69fa080..0e30c9756a04 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -199,6 +199,7 @@ struct vfs_ns_cap_data {
>  /* Allow multicasting */
>  /* Allow read/write of device-specific registers */
>  /* Allow activation of ATM control sockets */
> +/* Allow setting TCP_LINGER2 regardless of sysctl_tcp_fin_timeout */
>  
>  #define CAP_NET_ADMIN        12
>  
> diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
> index f2acb2566333..e21e0ce98ca1 100644
> --- a/include/uapi/linux/tcp.h
> +++ b/include/uapi/linux/tcp.h
> @@ -128,6 +128,7 @@ enum {
>  #define TCP_CM_INQ		TCP_INQ
>  
>  #define TCP_TX_DELAY		37	/* delay outgoing packets by XX usec */
> +#define TCP_FORCE_LINGER2	38	/* Set TCP_LINGER2 regardless of sysctl_tcp_fin_timeout */
>  
>  
>  #define TCP_REPAIR_ON		1
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index 6d87de434377..898a675d863e 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -3149,6 +3149,15 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
>  			tcp_enable_tx_delay();
>  		tp->tcp_tx_delay = val;
>  		break;
> +	case TCP_FORCE_LINGER2:
> +		if (val < 0)
> +			tp->linger2 = -1;
> +		else if (val > net->ipv4.sysctl_tcp_fin_timeout / HZ &&
> +			 !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
> +			tp->linger2 = 0;
> +		else
> +			tp->linger2 = val * HZ;

This multiply could overflow.

Since tp->linger2 is an int, and a negative value has a specific meaning,
you probably should have some sanity checks.

Even if the old TCP_LINGER2 silently put a 0,
maybe a new option should return an error if val*HZ would overflow.



> +		break;
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
>
David Laight April 23, 2020, 8:52 a.m. UTC | #3
From: Cambda Zhu
> Sent: 21 April 2020 13:18
> 
> This patch adds a new TCP socket option named TCP_FORCE_LINGER2. The
> option has same behavior as TCP_LINGER2, except the tp->linger2 value
> can be greater than sysctl_tcp_fin_timeout if the user_ns is capable
> with CAP_NET_ADMIN.

Did you consider adding an extra sysctl so that the limit
for TCP_LINGER2 could be greater than the default?

It might even be sensible to set the limit to a few times
the default.

All users can set the socket buffer sizes to twice the default.
Being able to do the same for the linger timeout wouldn't be a problem.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Cambda Zhu April 23, 2020, 10:33 a.m. UTC | #4
> On Apr 23, 2020, at 16:52, David Laight <David.Laight@ACULAB.COM> wrote:
> 
> From: Cambda Zhu
>> Sent: 21 April 2020 13:18
>> 
>> This patch adds a new TCP socket option named TCP_FORCE_LINGER2. The
>> option has same behavior as TCP_LINGER2, except the tp->linger2 value
>> can be greater than sysctl_tcp_fin_timeout if the user_ns is capable
>> with CAP_NET_ADMIN.
> 
> Did you consider adding an extra sysctl so that the limit
> for TCP_LINGER2 could be greater than the default?
> 
> It might even be sensible to set the limit to a few times
> the default.
> 
> All users can set the socket buffer sizes to twice the default.
> Being able to do the same for the linger timeout wouldn't be a problem.
> 
> 	David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)

The default tcp_fin_timeout is 60s which may be too long for a server with
tens of thousands connections or qps. To reduce the connections as many as
possible, the system administer may set the sysctl_tcp_fin_timeout and most
of the connections would use the default timeout. The timeout may be very
sensitive. For example if sysctl_tcp_fin_timeout is 10s, 20s may cause double
the FIN-WAIT connections and server load needs to be reevaluated.

Besides from my experience a longer FIN-WAIT timeout will have a better
experience for clients, because there’re some clients, such as video player,
needing to finish the play before closing the connection. If we add a sysctl
with default/max timeout for all users, all connections should be set to the
max timeout except some special requirements which can also be set with
TCP_LINGER2 to use a smaller one.

Maybe my experience is not correct for your scenes, so could you describe
a situation that the timeout with default/max value is better?

Regards,
Cambda
diff mbox series

Patch

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 272dc69fa080..0e30c9756a04 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -199,6 +199,7 @@  struct vfs_ns_cap_data {
 /* Allow multicasting */
 /* Allow read/write of device-specific registers */
 /* Allow activation of ATM control sockets */
+/* Allow setting TCP_LINGER2 regardless of sysctl_tcp_fin_timeout */
 
 #define CAP_NET_ADMIN        12
 
diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index f2acb2566333..e21e0ce98ca1 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -128,6 +128,7 @@  enum {
 #define TCP_CM_INQ		TCP_INQ
 
 #define TCP_TX_DELAY		37	/* delay outgoing packets by XX usec */
+#define TCP_FORCE_LINGER2	38	/* Set TCP_LINGER2 regardless of sysctl_tcp_fin_timeout */
 
 
 #define TCP_REPAIR_ON		1
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 6d87de434377..898a675d863e 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3149,6 +3149,15 @@  static int do_tcp_setsockopt(struct sock *sk, int level,
 			tcp_enable_tx_delay();
 		tp->tcp_tx_delay = val;
 		break;
+	case TCP_FORCE_LINGER2:
+		if (val < 0)
+			tp->linger2 = -1;
+		else if (val > net->ipv4.sysctl_tcp_fin_timeout / HZ &&
+			 !ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+			tp->linger2 = 0;
+		else
+			tp->linger2 = val * HZ;
+		break;
 	default:
 		err = -ENOPROTOOPT;
 		break;