net: introduce ip_local_unbindable_ports sysctl

Message ID 20191127001313.183170-1-zenczykowski@gmail.com
State Deferred
Delegated to: David Miller

Commit Message

Maciej Żenczykowski Nov. 27, 2019, 12:13 a.m. UTC
From: Maciej Żenczykowski <maze@google.com>

and associated inet_is_local_unbindable_port() helper function:
use it to make explicitly binding to an unbindable port return
-EPERM 'Operation not permitted'.

Autobind doesn't honour this new sysctl since:
  (a) you can simply set both if that's the behaviour you desire
  (b) there could be a use for preventing explicit while allowing auto
  (c) it's faster in the relatively critical path of doing port selection
      during connect() to only check one bitmap instead of both

Various ports may have special use cases which are not suitable for
use by general userspace applications. Currently, ports listed in the
ip_local_reserved_ports sysctl are skipped only during automatic port
assignment; nothing prevents you from explicitly binding to them,
even from an entirely unprivileged process.

In certain cases it is desirable to prevent the host from assigning the
ports even in case of explicit binds, even from superuser processes.
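
The split between the two sysctls can be sketched as a toy userspace model. This is only an illustration of the semantics described above, not kernel code; the class and method names are hypothetical, and 1024 is assumed as the unprivileged port start.

```python
# Toy model of the intended semantics; not kernel code.
# PortPolicy and its methods are hypothetical names used for illustration.
import errno

PROT_SOCK = 1024  # assumed value of ip_unprivileged_port_start


class PortPolicy:
    def __init__(self, reserved=(), unbindable=()):
        self.reserved = set(reserved)      # models ip_local_reserved_ports
        self.unbindable = set(unbindable)  # models ip_local_unbindable_ports

    def explicit_bind(self, port, has_bind_service=False):
        """Return 0 on success or a negative errno, mirroring bind()."""
        if port and port in self.unbindable:
            return -errno.EPERM   # checked first; capabilities do not help
        if port and port < PROT_SOCK and not has_bind_service:
            return -errno.EACCES
        return 0

    def autobind_candidates(self, low, high):
        """Ports autobind may pick: only the reserved set is consulted."""
        return [p for p in range(low, high + 1) if p not in self.reserved]


policy = PortPolicy(reserved={40000}, unbindable={3967})
print(policy.explicit_bind(3967, has_bind_service=True) == -errno.EPERM)  # True
print(policy.explicit_bind(40000))                                        # 0
print(3967 in policy.autobind_candidates(3960, 3970))                     # True
```

The last line shows why the documentation suggests listing a port in both sysctls: in this model an unbindable port that is not also reserved can still be handed out by autobind.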

Example use cases might be:
 - a port being stolen by the nic for remote serial console, remote
   power management or some other sort of debugging functionality
   (crash collection, gdb, direct access to some other microcontroller
   on the nic or motherboard, remote management of the nic itself).
 - a transparent proxy where packets are being redirected: in case
   a socket matches this connection, packets from this application
   would be incorrectly sent to one of the endpoints.

Initially I wanted to solve this problem with a simple one-line change:

static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
-       return port < net->ipv4.sysctl_ip_prot_sock;
+       return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
}

However, this doesn't work for two reasons:
  (a) it changes userspace visible behaviour of the existing local
      reserved ports sysctl, and there appears to be enough documentation
      on the internet talking about setting it to make this a bad idea
  (b) it doesn't prevent privileged apps from using these ports:
      CAP_NET_BIND_SERVICE is relatively likely to be available to, for
      example, a recursive DNS server so it can listen on port 53, which
      also needs to do src port randomization for outgoing queries for
      security reasons (and it thus does manual port binding).
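
Reason (b) can be seen in a small model of the rejected one-liner. This is a sketch under the assumption that the capability check works as in the snippet above; the function names are made up for illustration.

```python
# Toy model of the rejected one-line change; names are hypothetical.
PROT_SOCK = 1024  # stands in for net->ipv4.sysctl_ip_prot_sock


def requires_bind_service(port, reserved_ports):
    # The rejected change: reserved ports are treated like privileged ports.
    return port < PROT_SOCK or port in reserved_ports


def may_bind(port, reserved_ports, has_net_bind_service):
    # A capable process passes the check no matter which branch matched.
    return has_net_bind_service or not requires_bind_service(port, reserved_ports)


reserved = {3967}
print(may_bind(3967, reserved, has_net_bind_service=False))  # False
print(may_bind(3967, reserved, has_net_bind_service=True))   # True
```

The second call is the DNS resolver case: a process holding the capability for port 53 would still be allowed onto the port that was meant to be off limits.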

If we *know* that certain ports are simply unusable, then it's better
nothing even gets the opportunity to try to use them.  This way we at
least get a quick failure, instead of some sort of timeout (or possibly
even corruption of the data stream of the non-kernel based use case).

Test:
  vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports

  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
  vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
  vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
  3967
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
  socket.error: (1, 'Operation not permitted')
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
  socket.error: (1, 'Operation not permitted')
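
The transcript above shows Python 2 output. A roughly equivalent Python 3 helper might look like the sketch below; `try_bind` is a hypothetical name, and the helper reports the errno name instead of letting the exception escape (on Python 3, `socket.error` is an alias of `OSError`).

```python
import errno
import socket


def try_bind(sock_type, port, family=socket.AF_INET6, addr="::"):
    """Bind once; return None on success or the errno name on failure."""
    with socket.socket(family, sock_type, 0) as s:
        try:
            s.bind((addr, port))
        except OSError as e:
            # Map the numeric errno to its symbolic name, e.g. 1 -> 'EPERM'.
            return errno.errorcode.get(e.errno, str(e.errno))
    return None


if __name__ == "__main__":
    for name, sock_type in (("tcp", socket.SOCK_STREAM), ("udp", socket.SOCK_DGRAM)):
        print(name, try_bind(sock_type, 3967))
```

On a kernel carrying this patch with 3967 written to ip_local_unbindable_ports, both lines would be expected to report EPERM, matching the test output above; on an unpatched kernel the result depends on whether the port is already in use.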

Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linux SCTP <linux-sctp@vger.kernel.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
 include/net/ip.h                       | 12 ++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/af_inet.c                     |  4 ++++
 net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++--
 net/ipv6/af_inet6.c                    |  2 ++
 net/sctp/socket.c                      |  5 +++++
 7 files changed, 53 insertions(+), 2 deletions(-)

Comments

Subash Abhinov Kasiviswanathan Nov. 27, 2019, 2:10 a.m. UTC | #1
On 2019-11-26 17:13, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> and associated inet_is_local_unbindable_port() helper function:
> use it to make explicitly binding to an unbindable port return
> -EPERM 'Operation not permitted'.
> 
> Autobind doesn't honour this new sysctl since:
>   (a) you can simply set both if that's the behaviour you desire
>   (b) there could be a use for preventing explicit while allowing auto
>   (c) it's faster in the relatively critical path of doing port selection
>       during connect() to only check one bitmap instead of both
> 
> Various ports may have special use cases which are not suitable for
> use by general userspace applications. Currently, ports specified in
> ip_local_reserved_ports sysctl will not be returned only in case of
> automatic port assignment, but nothing prevents you from explicitly
> binding to them - even from an entirely unprivileged process.
> 
> In certain cases it is desirable to prevent the host from assigning the
> ports even in case of explicit binds, even from superuser processes.
> 
> Example use cases might be:
>  - a port being stolen by the nic for remote serial console, remote
>    power management or some other sort of debugging functionality
>    (crash collection, gdb, direct access to some other microcontroller
>    on the nic or motherboard, remote management of the nic itself).
>  - a transparent proxy where packets are being redirected: in case
>    a socket matches this connection, packets from this application
>    would be incorrectly sent to one of the endpoints.
> 
> Initially I wanted to solve this problem via the simple one line:
> 
> static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
> -       return port < net->ipv4.sysctl_ip_prot_sock;
> +       return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
> }
> 
> However, this doesn't work for two reasons:
>   (a) it changes userspace visible behaviour of the existing local
>       reserved ports sysctl, and there appears to be enough documentation
>       on the internet talking about setting it to make this a bad idea
>   (b) it doesn't prevent privileged apps from using these ports,
>       CAP_BIND_SERVICE is relatively likely to be available to, for example,
>       a recursive DNS server so it can listed on port 53, which also needs
>       to do src port randomization for outgoing queries due to security
>       reasons (and it thus does manual port binding).
> 
> If we *know* that certain ports are simply unusable, then it's better
> nothing even gets the opportunity to try to use them.  This way we at
> least get a quick failure, instead of some sort of timeout (or possibly
> even corruption of the data stream of the non-kernel based use case).
> 
> Test:
>   vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
> 
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
>   vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
>   vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
>   3967
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
>   socket.error: (1, 'Operation not permitted')
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
>   socket.error: (1, 'Operation not permitted')
> 
> Cc: Sean Tranchetti <stranche@codeaurora.org>
> Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Linux SCTP <linux-sctp@vger.kernel.org>
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> ---
>  Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
>  include/net/ip.h                       | 12 ++++++++++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/af_inet.c                     |  4 ++++
>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++--
>  net/ipv6/af_inet6.c                    |  2 ++
>  net/sctp/socket.c                      |  5 +++++
>  7 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index fd26788e8c96..7129646a18bd 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -940,6 +940,19 @@ ip_local_reserved_ports - list of comma separated ranges
> 
>  	Default: Empty
> 
> +ip_local_unbindable_ports - list of comma separated ranges
> +	Specify the ports which are not directly bind()able.
> +
> +	Usually you would use this to block the use of ports which
> +	are invalid due to something outside of the control of the
> +	kernel.  For example a port stolen by the nic for serial
> +	console, remote power management or debugging.
> +
> +	There's a relatively high chance you will also want to list
> +	these ports in 'ip_local_reserved_ports' to prevent autobinding.
> +
> +	Default: Empty
> +
>  ip_unprivileged_port_start - INTEGER
>  	This is a per-namespace sysctl.  It defines the first
>  	unprivileged port in the network namespace.  Privileged ports
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 02d68e346f67..14b99bf59ffc 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -346,6 +346,13 @@ static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
>  	return test_bit(port, net->ipv4.sysctl_local_reserved_ports);
>  }
> 
> +static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
> +{
> +	if (!net->ipv4.sysctl_local_unbindable_ports)
> +		return false;
> +	return test_bit(port, net->ipv4.sysctl_local_unbindable_ports);
> +}
> +
>  static inline bool sysctl_dev_name_is_allowed(const char *name)
>  {
>  	return strcmp(name, "default") != 0  && strcmp(name, "all") != 0;
> @@ -362,6 +369,11 @@ static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
>  	return false;
>  }
> 
> +static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
> +{
> +	return false;
> +}
> +
>  static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port)
>  {
>  	return port < PROT_SOCK;
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index c0c0791b1912..6a235651925d 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -197,6 +197,7 @@ struct netns_ipv4 {
> 
>  #ifdef CONFIG_SYSCTL
>  	unsigned long *sysctl_local_reserved_ports;
> +	unsigned long *sysctl_local_unbindable_ports;
>  	int sysctl_ip_prot_sock;
>  #endif
> 
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 2fe295432c24..b26046431612 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -494,6 +494,10 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
>  		goto out;
> 
>  	snum = ntohs(addr->sin_port);
> +	err = -EPERM;
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		goto out;
> +
>  	err = -EACCES;
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index fcb2cd167f64..fd363b57a653 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -745,6 +745,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_do_large_bitmap,
>  	},
> +	{
> +		.procname	= "ip_local_unbindable_ports",
> +		.data		= &init_net.ipv4.sysctl_local_unbindable_ports,
> +		.maxlen		= 65536,
> +		.mode		= 0644,
> +		.proc_handler	= proc_do_large_bitmap,
> +	},
>  	{
>  		.procname	= "ip_no_pmtu_disc",
>  		.data		= &init_net.ipv4.sysctl_ip_no_pmtu_disc,
> @@ -1353,11 +1360,17 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
> 
>  	net->ipv4.sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
>  	if (!net->ipv4.sysctl_local_reserved_ports)
> -		goto err_ports;
> +		goto err_reserved_ports;
> +
> +	net->ipv4.sysctl_local_unbindable_ports = kzalloc(65536 / 8, GFP_KERNEL);
> +	if (!net->ipv4.sysctl_local_unbindable_ports)
> +		goto err_unbindable_ports;
> 
>  	return 0;
> 
> -err_ports:
> +err_unbindable_ports:
> +	kfree(net->ipv4.sysctl_local_reserved_ports);
> +err_reserved_ports:
>  	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
>  err_reg:
>  	if (!net_eq(net, &init_net))
> @@ -1370,6 +1383,7 @@ static __net_exit void ipv4_sysctl_exit_net(struct net *net)
>  {
>  	struct ctl_table *table;
> 
> +	kfree(net->ipv4.sysctl_local_unbindable_ports);
>  	kfree(net->ipv4.sysctl_local_reserved_ports);
>  	table = net->ipv4.ipv4_hdr->ctl_table_arg;
>  	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 60e2ff91a5b3..3c83e3200543 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -292,6 +292,8 @@ static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
>  		return -EINVAL;
> 
>  	snum = ntohs(addr->sin6_port);
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		return -EPERM;
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  		return -EACCES;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 0b485952a71c..d1c93542419d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -384,6 +384,9 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
>  		}
>  	}
> 
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		return -EPERM;
> +
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  		return -EACCES;
> @@ -1061,6 +1064,8 @@ static int sctp_connect_new_asoc(struct sctp_endpoint *ep,
>  		if (sctp_autobind(sk))
>  			return -EAGAIN;
>  	} else {
> +		if (inet_is_local_unbindable_port(net, ep->base.bind_addr.port))
> +			return -EPERM;
>  		if (inet_port_requires_bind_service(net, ep->base.bind_addr.port) &&
>  		    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  			return -EACCES;

Thanks Maciej.
This works fine for me (I'm seeing some minor merge conflicts on net-next,
but it applies fine on net).

Reviewed-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
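
The patch quoted above keeps the setting as a 65536-bit bitmap (65536 / 8 bytes per namespace) that is filled from a "list of comma separated ranges". A rough userspace sketch of that representation follows; it is not the kernel's proc_do_large_bitmap, the function names are hypothetical, and a byte array stands in for the kernel's unsigned long words.

```python
def parse_port_list(text):
    """Parse '3967,5000-5002' into a 65536-bit bitmap (8192 bytes)."""
    bitmap = bytearray(65536 // 8)
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue  # an empty list means no ports are set
        lo, _, hi = item.partition("-")
        first, last = int(lo), int(hi or lo)  # a single port is a 1-wide range
        if not 0 <= first <= last <= 65535:
            raise ValueError("bad port range: %r" % item)
        for port in range(first, last + 1):
            bitmap[port // 8] |= 1 << (port % 8)
    return bitmap


def port_is_set(port, bitmap):
    """Userspace stand-in for the kernel's test_bit() lookup."""
    return bool(bitmap[port // 8] & (1 << (port % 8)))


unbindable = parse_port_list("3967,5000-5002")
print(len(unbindable))                # 8192
print(port_is_set(3967, unbindable))  # True
print(port_is_set(5003, unbindable))  # False
```

The lookup is a single indexed bit test, which is why checking one more bitmap in bind() is cheap, and why the commit message still prefers to keep the second check out of the autobind path.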
Marcelo Ricardo Leitner Nov. 27, 2019, 1:14 p.m. UTC | #2
On Tue, Nov 26, 2019 at 04:13:13PM -0800, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> and associated inet_is_local_unbindable_port() helper function:
> use it to make explicitly binding to an unbindable port return
> -EPERM 'Operation not permitted'.
> 
> Autobind doesn't honour this new sysctl since:
>   (a) you can simply set both if that's the behaviour you desire
>   (b) there could be a use for preventing explicit while allowing auto
>   (c) it's faster in the relatively critical path of doing port selection
>       during connect() to only check one bitmap instead of both
...
> If we *know* that certain ports are simply unusable, then it's better
> nothing even gets the opportunity to try to use them.  This way we at
> least get a quick failure, instead of some sort of timeout (or possibly
> even corruption of the data stream of the non-kernel based use case).

This is doable with SELinux today, no?
Maciej Żenczykowski Nov. 27, 2019, 8:50 p.m. UTC | #3
On Wed, Nov 27, 2019 at 5:14 AM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
>
> On Tue, Nov 26, 2019 at 04:13:13PM -0800, Maciej Żenczykowski wrote:
> > From: Maciej Żenczykowski <maze@google.com>
> >
> > and associated inet_is_local_unbindable_port() helper function:
> > use it to make explicitly binding to an unbindable port return
> > -EPERM 'Operation not permitted'.
> >
> > Autobind doesn't honour this new sysctl since:
> >   (a) you can simply set both if that's the behaviour you desire
> >   (b) there could be a use for preventing explicit while allowing auto
> >   (c) it's faster in the relatively critical path of doing port selection
> >       during connect() to only check one bitmap instead of both
> ...
> > If we *know* that certain ports are simply unusable, then it's better
> > nothing even gets the opportunity to try to use them.  This way we at
> > least get a quick failure, instead of some sort of timeout (or possibly
> > even corruption of the data stream of the non-kernel based use case).
>
> This is doable with SELinux today, no?

Perhaps, but SELinux isn't used by many distros, including the servers
where I have nics that steal some ports.  It's also much much
more difficult, requiring a policy, compilers, etc... and it gets even
more complex if you need to dynamically modify the set of ports,
which requires extra tools and runtime permissions.
David Miller Nov. 27, 2019, 10:33 p.m. UTC | #4
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Wed, 27 Nov 2019 12:50:39 -0800

> On Wed, Nov 27, 2019 at 5:14 AM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
>>
>> On Tue, Nov 26, 2019 at 04:13:13PM -0800, Maciej Żenczykowski wrote:
>> > From: Maciej Żenczykowski <maze@google.com>
>> >
>> > and associated inet_is_local_unbindable_port() helper function:
>> > use it to make explicitly binding to an unbindable port return
>> > -EPERM 'Operation not permitted'.
>> >
>> > Autobind doesn't honour this new sysctl since:
>> >   (a) you can simply set both if that's the behaviour you desire
>> >   (b) there could be a use for preventing explicit while allowing auto
>> >   (c) it's faster in the relatively critical path of doing port selection
>> >       during connect() to only check one bitmap instead of both
>> ...
>> > If we *know* that certain ports are simply unusable, then it's better
>> > nothing even gets the opportunity to try to use them.  This way we at
>> > least get a quick failure, instead of some sort of timeout (or possibly
>> > even corruption of the data stream of the non-kernel based use case).
>>
>> This is doable with SELinux today, no?
> 
> Perhaps, but SELinux isn't used by many distros, including the servers
> where I have nics that steal some ports.  It's also much much
> more difficult, requiring a policy, compilers, etc... and it gets even
> more complex if you need to dynamically modify the set of ports,
> which requires extra tools and runtime permissions.

I can see both sides of this argument, but anyway this is a new feature and
thus net-next material.  It's nice to keep this discussion going, of course,
but if this trends in the positive you still need to resubmit this when
net-next opens back up.

Thanks.
Marcelo Ricardo Leitner Nov. 27, 2019, 11 p.m. UTC | #5
On Wed, Nov 27, 2019 at 12:50:39PM -0800, Maciej Żenczykowski wrote:
> On Wed, Nov 27, 2019 at 5:14 AM Marcelo Ricardo Leitner
> <marcelo.leitner@gmail.com> wrote:
> >
> > On Tue, Nov 26, 2019 at 04:13:13PM -0800, Maciej Żenczykowski wrote:
> > > From: Maciej Żenczykowski <maze@google.com>
> > >
> > > and associated inet_is_local_unbindable_port() helper function:
> > > use it to make explicitly binding to an unbindable port return
> > > -EPERM 'Operation not permitted'.
> > >
> > > Autobind doesn't honour this new sysctl since:
> > >   (a) you can simply set both if that's the behaviour you desire
> > >   (b) there could be a use for preventing explicit while allowing auto
> > >   (c) it's faster in the relatively critical path of doing port selection
> > >       during connect() to only check one bitmap instead of both
> > ...
> > > If we *know* that certain ports are simply unusable, then it's better
> > > nothing even gets the opportunity to try to use them.  This way we at
> > > least get a quick failure, instead of some sort of timeout (or possibly
> > > even corruption of the data stream of the non-kernel based use case).
> >
> > This is doable with SELinux today, no?
> 
> Perhaps, but SELinux isn't used by many distros, including the servers
> where I have nics that steal some ports.  It's also much much
> more difficult, requiring a policy, compilers, etc... and it gets even
> more complex if you need to dynamically modify the set of ports,
> which requires extra tools and runtime permissions.

I'm no SELinux expert, but my /etc/ssh/sshd_config has this nice handy
comment:
# If you want to change the port on a SELinux system, you have to tell
# SELinux about this change.
# semanage port -a -t ssh_port_t -p tcp #PORTNUMBER

The kernel has no specific knowledge of 'ssh_port_t' and all I need to
do to allow such port, is run the command above. No compiler, etc.
The distribution would have to have a policy, say,
'unbindable_ports_t', and it could work similarly, I suppose, but I
have no knowledge on this part.

As a reference only,
# semanage port -l
gives a great list of ports that daemons are supposed to be using, and
it supports ranges and so, like:
amqp_port_t                    tcp      15672, 5671-5672
gluster_port_t                 tcp      38465-38469, 24007-24027

On not having SELinux enabled, you got me there. I'm not really willing
to enter a "to do SELinux or not" discussion. :-)
Maciej Żenczykowski Nov. 29, 2019, 8 p.m. UTC | #6
> > Perhaps, but SELinux isn't used by many distros, including the servers
> > where I have nics that steal some ports.  It's also much much
> > more difficult, requiring a policy, compilers, etc... and it gets even
> > more complex if you need to dynamically modify the set of ports,
> > which requires extra tools and runtime permissions.
>
> I'm no SELinux expert, but my /etc/ssh/sshd_config has this nice handy
> comment:
> # If you want to change the port on a SELinux system, you have to tell
> # SELinux about this change.
> # semanage port -a -t ssh_port_t -p tcp #PORTNUMBER

Right, so I'm also not at all an SELinux expert.

But: I run Fedora as my distro of choice and it is of course
SELinux enabled (and has been for many many years now) and I'm aware
of that ssh port magic, precisely because I sometimes run ssh on a
port different than 22.

I'm also working full time on Android, where I bash my head against
SELinux quite regularly ;-)

> The kernel has no specific knowledge of 'ssh_port_t' and all I need to
> do to allow such port, is run the command above.

I don't think that's actually true.  I believe somewhere in the
SELinux policy there is a reference to a set named ssh_port_t, some
binary number assigned to it, and a default set of ports (ie. 22).

semanage looks up 'ssh_port_t', maps it to some number X and tells the
kernel that port P needs to be added to set X, and it needs super
privileges to do so, because obviously managing the SELinux policy is
about as privileged as you can get.  And then it stores that P should
be in X somewhere on disk so it survives reboot.  I don't know if
along the way it reloaded the entire policy or just a subset.

strace'ing semanage seems to roughly confirm this (although there's
tons and tons of output...)

> No compiler, etc.

The compiler is present on fedora at least.  I don't know if semanage
invokes it or not, obviously at some point it was invoked.

> The distribution would have to have a policy, say,
> 'unbindable_ports_t', and it could work similarly, I suppose, but I
> have no knowledge on this part.

Yes, the OS image has to include 'semanage' and all the selinux
tooling, and has to ship a policy which has the unbindable_ports_t
already defined in it...

But here's what I think is the fundamental problem with an SELinux approach.
By default SELinux is deny all.  All you can do is grant extra privs.
I'm not aware of a way to actually say system wide disallow X.
'neverallow' is a compile time policy enforcement hack, and isn't
(afaik) translated into anything that's actually given to the kernel.

So really instead you'd need to create a port set for every app -
including a default, and then make sure that whatever port you want
to block is excluded from each of those sets.

ie.
[root@gaia ~]# semanage port -l | egrep '32768|65535'
ephemeral_port_t               tcp      32768-60999
ephemeral_port_t               udp      32768-60999
unreserved_port_t              sctp     1024-65535
unreserved_port_t              tcp      61001-65535, 1024-32767
unreserved_port_t              udp      61001-65535, 1024-32767

You need to iterate through all port sets, and remove port P from each
of them in turn.
(can port sets overlap? not sure... do ports need to be in some set?
perhaps also need to add the port to the unbindable set or something)
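
To give a feel for the bookkeeping, here is a sketch of carving one port out of every port set. It only models the data as inclusive ranges (two rows taken from the listing above) and does not call semanage; `remove_port` is a hypothetical helper.

```python
def remove_port(ranges, port):
    """Return the inclusive (lo, hi) ranges with `port` carved out."""
    out = []
    for lo, hi in ranges:
        if not lo <= port <= hi:
            out.append((lo, hi))            # range does not contain the port
            continue
        if lo < port:
            out.append((lo, port - 1))      # part below the blocked port
        if port < hi:
            out.append((port + 1, hi))      # part above the blocked port
    return out


port_sets = {
    ("ephemeral_port_t", "tcp"): [(32768, 60999)],
    ("unreserved_port_t", "tcp"): [(61001, 65535), (1024, 32767)],
}

# Every set has to be rewritten, even the ones that turn out to be unaffected.
blocked = {key: remove_port(ranges, 3967) for key, ranges in port_sets.items()}
print(blocked[("unreserved_port_t", "tcp")])
# [(61001, 65535), (1024, 3966), (3968, 32767)]
```

Each blocked port splits a range in two, so the policy grows with every port added, and the same pass has to be repeated per protocol.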

It's certainly doable, but it's a *lot* of work.
If you're on Fedora... then this isn't a huge problem, because someone
did most of the work for you already...  But, on any other distro?

And let's not even get started wrt. interactions with network namespaces.

I also don't know of a way to allow changes to selinux for NET_ADMIN
(correct me if I'm wrong).
I also don't immediately know how to do this with /proc settings, but
it seems like it should be achievable either via some additional
patches, or via some clever boot-time network namespace creation
hackery (on Android I imagine the answer would involve some sort of
selinux tagging of the appropriate sysctl file in order to allow netd
to change it).

> On not having SELinux enabled, you got me there. I'm not really willing
> to enter a "to do SELinux or not" discussion. :-)

:-)

I'm of the opinion that SELinux and other security policy modules
should be reserved for things related to system wide security policy.
Not for things that are more along the lines of 'functionality'.

Also selinux has 'permissive' mode which causes the system to ignore
all selinux access controls (in favour of just logging) and this is
what is commonly used during development (because it's such a pain to
work with).

This could probably also be done via bpf syscall filters,
or bpf cgroup hooks...
But those approaches have performance implications and require a huge
amount of machinery and complexity to manage.
Marcelo Ricardo Leitner Dec. 4, 2019, 6:27 p.m. UTC | #7
On Fri, Nov 29, 2019 at 09:00:19PM +0100, Maciej Żenczykowski wrote:
...
> I'm of the opinion that SELinux and other security policy modules
> should be reserved for things related to system wide security policy.
> Not for things that are more along the lines of 'functionality'.

Makes sense.

> 
> Also selinux has 'permissive' mode which causes the system to ignore
> all selinux access controls (in favour of just logging) and this is
> what is commonly used during development (because it's such a pain to
> work with).

Agree, this would be a big problem.
IOW, "you don't have permission to access this" != "you just can't use this,
no matter what".

FWIW, I rest my case :-)

Thanks,
Marcelo
Maciej Żenczykowski Dec. 9, 2019, 10:43 p.m. UTC | #8
Since net-next is open, I'm resending, with just Cc/reviewed-by
updates - code is the same.
Lorenzo Colitti Dec. 13, 2019, 12:25 a.m. UTC | #9
On Thu, Nov 28, 2019 at 8:00 AM Marcelo Ricardo Leitner
<marcelo.leitner@gmail.com> wrote:
> I'm no SELinux expert, but my /etc/ssh/sshd_config has this nice handy
> comment:
> # If you want to change the port on a SELinux system, you have to tell
> # SELinux about this change.
> # semanage port -a -t ssh_port_t -p tcp #PORTNUMBER
>
> The kernel has no specific knowledge of 'ssh_port_t' and all I need to
> do to allow such port, is run the command above. No compiler, etc.
> The distribution would have to have a policy, say,
> 'unbindable_ports_t', and it could work similarly, I suppose, but I
> have no knowledge on this part.

For security reasons, Android does not allow reloading selinux policy
after boot. I'm not a selinux expert either, but semanage-port(8) has:

       -N, --noreload
              Do not reload policy after commit

which suggests that it works by reloading the policy.
Neil Horman Dec. 13, 2019, 11:49 a.m. UTC | #10
On Tue, Nov 26, 2019 at 04:13:13PM -0800, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski <maze@google.com>
> 
> and associated inet_is_local_unbindable_port() helper function:
> use it to make explicitly binding to an unbindable port return
> -EPERM 'Operation not permitted'.
> 
> Autobind doesn't honour this new sysctl since:
>   (a) you can simply set both if that's the behaviour you desire
>   (b) there could be a use for preventing explicit while allowing auto
>   (c) it's faster in the relatively critical path of doing port selection
>       during connect() to only check one bitmap instead of both
> 
> Various ports may have special use cases which are not suitable for
> use by general userspace applications. Currently, ports specified in
> ip_local_reserved_ports sysctl will not be returned only in case of
> automatic port assignment, but nothing prevents you from explicitly
> binding to them - even from an entirely unprivileged process.
> 
> In certain cases it is desirable to prevent the host from assigning the
> ports even in case of explicit binds, even from superuser processes.
> 
> Example use cases might be:
>  - a port being stolen by the nic for remote serial console, remote
>    power management or some other sort of debugging functionality
>    (crash collection, gdb, direct access to some other microcontroller
>    on the nic or motherboard, remote management of the nic itself).
>  - a transparent proxy where packets are being redirected: in case
>    a socket matches this connection, packets from this application
>    would be incorrectly sent to one of the endpoints.
> 
> Initially I wanted to solve this problem via the simple one line:
> 
> static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
> -       return port < net->ipv4.sysctl_ip_prot_sock;
> +       return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
> }
> 
> However, this doesn't work for two reasons:
>   (a) it changes userspace visible behaviour of the existing local
>       reserved ports sysctl, and there appears to be enough documentation
>       on the internet talking about setting it to make this a bad idea
>   (b) it doesn't prevent privileged apps from using these ports,
>       CAP_BIND_SERVICE is relatively likely to be available to, for example,
>       a recursive DNS server so it can listed on port 53, which also needs
>       to do src port randomization for outgoing queries due to security
>       reasons (and it thus does manual port binding).
> 
> If we *know* that certain ports are simply unusable, then it's better
> nothing even gets the opportunity to try to use them.  This way we at
> least get a quick failure, instead of some sort of timeout (or possibly
> even corruption of the data stream of the non-kernel based use case).
> 
> Test:
>   vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
> 
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
>   vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
>   vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
>   3967
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
>   socket.error: (1, 'Operation not permitted')
>   vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
>   socket.error: (1, 'Operation not permitted')
> 
> Cc: Sean Tranchetti <stranche@codeaurora.org>
> Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Cc: Linux SCTP <linux-sctp@vger.kernel.org>
> Signed-off-by: Maciej Żenczykowski <maze@google.com>
> ---
>  Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
>  include/net/ip.h                       | 12 ++++++++++++
>  include/net/netns/ipv4.h               |  1 +
>  net/ipv4/af_inet.c                     |  4 ++++
>  net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++--
>  net/ipv6/af_inet6.c                    |  2 ++
>  net/sctp/socket.c                      |  5 +++++
>  7 files changed, 53 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index fd26788e8c96..7129646a18bd 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -940,6 +940,19 @@ ip_local_reserved_ports - list of comma separated ranges
>  
>  	Default: Empty
>  
> +ip_local_unbindable_ports - list of comma separated ranges
> +	Specify the ports which are not directly bind()able.
> +
> +	Usually you would use this to block the use of ports which
> +	are invalid due to something outside of the control of the
> +	kernel.  For example a port stolen by the nic for serial
> +	console, remote power management or debugging.
> +
> +	There's a relatively high chance you will also want to list
> +	these ports in 'ip_local_reserved_ports' to prevent autobinding.
> +
> +	Default: Empty
> +
>  ip_unprivileged_port_start - INTEGER
>  	This is a per-namespace sysctl.  It defines the first
>  	unprivileged port in the network namespace.  Privileged ports
> diff --git a/include/net/ip.h b/include/net/ip.h
> index 02d68e346f67..14b99bf59ffc 100644
> --- a/include/net/ip.h
> +++ b/include/net/ip.h
> @@ -346,6 +346,13 @@ static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
>  	return test_bit(port, net->ipv4.sysctl_local_reserved_ports);
>  }
>  
> +static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
> +{
> +	if (!net->ipv4.sysctl_local_unbindable_ports)
> +		return false;
> +	return test_bit(port, net->ipv4.sysctl_local_unbindable_ports);
> +}
> +
>  static inline bool sysctl_dev_name_is_allowed(const char *name)
>  {
>  	return strcmp(name, "default") != 0  && strcmp(name, "all") != 0;
> @@ -362,6 +369,11 @@ static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
>  	return false;
>  }
>  
> +static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
> +{
> +	return false;
> +}
> +
>  static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port)
>  {
>  	return port < PROT_SOCK;
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index c0c0791b1912..6a235651925d 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -197,6 +197,7 @@ struct netns_ipv4 {
>  
>  #ifdef CONFIG_SYSCTL
>  	unsigned long *sysctl_local_reserved_ports;
> +	unsigned long *sysctl_local_unbindable_ports;
>  	int sysctl_ip_prot_sock;
>  #endif
>  
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 2fe295432c24..b26046431612 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -494,6 +494,10 @@ int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
>  		goto out;
>  
>  	snum = ntohs(addr->sin_port);
> +	err = -EPERM;
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		goto out;
> +
>  	err = -EACCES;
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index fcb2cd167f64..fd363b57a653 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -745,6 +745,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_do_large_bitmap,
>  	},
> +	{
> +		.procname	= "ip_local_unbindable_ports",
> +		.data		= &init_net.ipv4.sysctl_local_unbindable_ports,
> +		.maxlen		= 65536,
> +		.mode		= 0644,
> +		.proc_handler	= proc_do_large_bitmap,
> +	},
>  	{
>  		.procname	= "ip_no_pmtu_disc",
>  		.data		= &init_net.ipv4.sysctl_ip_no_pmtu_disc,
> @@ -1353,11 +1360,17 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
>  
>  	net->ipv4.sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
>  	if (!net->ipv4.sysctl_local_reserved_ports)
> -		goto err_ports;
> +		goto err_reserved_ports;
> +
> +	net->ipv4.sysctl_local_unbindable_ports = kzalloc(65536 / 8, GFP_KERNEL);
> +	if (!net->ipv4.sysctl_local_unbindable_ports)
> +		goto err_unbindable_ports;
>  
>  	return 0;
>  
> -err_ports:
> +err_unbindable_ports:
> +	kfree(net->ipv4.sysctl_local_reserved_ports);
> +err_reserved_ports:
>  	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
>  err_reg:
>  	if (!net_eq(net, &init_net))
> @@ -1370,6 +1383,7 @@ static __net_exit void ipv4_sysctl_exit_net(struct net *net)
>  {
>  	struct ctl_table *table;
>  
> +	kfree(net->ipv4.sysctl_local_unbindable_ports);
>  	kfree(net->ipv4.sysctl_local_reserved_ports);
>  	table = net->ipv4.ipv4_hdr->ctl_table_arg;
>  	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
> diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
> index 60e2ff91a5b3..3c83e3200543 100644
> --- a/net/ipv6/af_inet6.c
> +++ b/net/ipv6/af_inet6.c
> @@ -292,6 +292,8 @@ static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
>  		return -EINVAL;
>  
>  	snum = ntohs(addr->sin6_port);
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		return -EPERM;
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  		return -EACCES;
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 0b485952a71c..d1c93542419d 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -384,6 +384,9 @@ static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
>  		}
>  	}
>  
> +	if (snum && inet_is_local_unbindable_port(net, snum))
> +		return -EPERM;
> +
>  	if (snum && inet_port_requires_bind_service(net, snum) &&
>  	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  		return -EACCES;
> @@ -1061,6 +1064,8 @@ static int sctp_connect_new_asoc(struct sctp_endpoint *ep,
>  		if (sctp_autobind(sk))
>  			return -EAGAIN;
>  	} else {
> +		if (inet_is_local_unbindable_port(net, ep->base.bind_addr.port))
> +			return -EPERM;
>  		if (inet_port_requires_bind_service(net, ep->base.bind_addr.port) &&
>  		    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
>  			return -EACCES;
> -- 
> 2.24.0.432.g9d3f5f5b63-goog
> 
> 

Just out of curiosity, why are the portreserve and portrelease utilities not a
solution to this use case?

Neil

Lorenzo Colitti Dec. 19, 2019, 9:35 a.m. UTC | #11
On Fri, 13 Dec 2019, 20:49 Neil Horman, <nhorman@tuxdriver.com> wrote:
> Just out of curiosity, why are the portreserve and portrelease utilities not a
> solution to this use case?

As I understand it, those utilities keep the ports reserved by binding
to them so that no other process can. This doesn't work for Android
because there are conformance tests that probe the device from the
network and check that there are no open ports.
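
To make the mechanism concrete, here is a minimal userspace sketch of reservation by binding, assuming that holding a bound socket is all the reservation amounts to. The helper names hold_port and try_bind are hypothetical and are not taken from portreserve.

```python
import errno
import socket

def hold_port(port=0, host="127.0.0.1"):
    """Bind a TCP socket to (host, port) and keep it open to occupy the port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, port))
    return s

def try_bind(port, host="127.0.0.1"):
    """Return 0 if an explicit bind succeeds, else the errno it failed with."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return 0
    except OSError as e:
        return e.errno
    finally:
        s.close()

holder = hold_port()                  # port 0: let the kernel pick a free port
reserved = holder.getsockname()[1]    # the port number that is now held
blocked = try_bind(reserved)          # second explicit bind while it is held
holder.close()
released = try_bind(reserved)         # the same bind after the holder is gone
```

While the holder is alive, a second explicit bind is expected to fail with EADDRINUSE; the reservation lasts only as long as the holding process keeps its socket open.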

Neil Horman Dec. 19, 2019, 1:17 p.m. UTC | #12
On Thu, Dec 19, 2019 at 06:35:13PM +0900, Lorenzo Colitti wrote:
> On Fri, 13 Dec 2019, 20:49 Neil Horman, <nhorman@tuxdriver.com> wrote:
> > Just out of curiosity, why are the portreserve and portrelease utilities not a
> > solution to this use case?
> 
> As I understand it, those utilities keep the ports reserved by binding
> to them so that no other process can. This doesn't work for Android
> because there are conformance tests that probe the device from the
> network and check that there are no open ports.
> 
But you can address that with some augmentation to portreserve (i.e. just have
it add an iptables rule to drop frames on that port, or respond with a port
unreachable icmp message)

Neil

Lorenzo Colitti Dec. 19, 2019, 2:02 p.m. UTC | #13
On Thu, Dec 19, 2019 at 10:17 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > As I understand it, those utilities keep the ports reserved by binding
> > to them so that no other process can. This doesn't work for Android
> > because there are conformance tests that probe the device from the
> > network and check that there are no open ports.
> >
> But you can address that with some augmentation to portreserve (i.e. just have
> it add an iptables rule to drop frames on that port, or respond with a port
> unreachable icmp message)

There are also tests that run on device by inspecting
/proc/net/{tcp,udp} to check that there are no open sockets. We'd have
to change them as well.
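
As a rough illustration of what such an on-device check looks at, the sketch below extracts local ports from text in the /proc/net/tcp layout. The function name local_ports and the sample rows are made up for illustration; this is not the actual conformance test code.

```python
def local_ports(proc_text, state=None):
    """Return the set of local ports listed in /proc/net/{tcp,udp}-style text.

    Each entry after the header row has the local address in the second
    column as HEXADDR:HEXPORT and the socket state in the fourth column as
    a hex code. If `state` is given, only entries in that state are kept.
    """
    ports = set()
    for line in proc_text.splitlines()[1:]:       # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue                              # ignore blank or short lines
        _addr, _, port_hex = fields[1].rpartition(":")
        if state is not None and fields[3].upper() != state:
            continue
        ports.add(int(port_hex, 16))
    return ports

TCP_LISTEN = "0A"   # hex state code used for listening TCP sockets

# Hypothetical sample rows in the /proc/net/tcp layout (values made up).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:0F7F 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12345 1 0000000000000000 100 0 0 10 0
   1: 0100007F:0035 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 12346 1 0000000000000000 100 0 0 10 0
   2: 0100007F:1F90 0100007F:C350 01 00000000:00000000 00:00000000 00000000     0        0 12347 1 0000000000000000 100 0 0 10 0
"""

listening = local_ports(SAMPLE, TCP_LISTEN)
everything = local_ports(SAMPLE)
```

On a real device the input would be the contents of /proc/net/tcp or /proc/net/udp rather than the sample string.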

But sure. It's not impossible to do this in userspace. We wouldn't use
portreserve itself because the work to package it and make it work on
Android (which has no /etc/services file) would likely be greater
than just adding the code to an existing Android daemon (and because
the reaction of the portreserve maintainers might be similar to yours:
"you don't need to add code to portreserve for this, just use a script
that shells out to iptables").

But in any case, the result would be more complicated to use and
maintain, and it would likely also be less realistic, such that a
sophisticated conformance test might still find that the port was
actually bound. Other users of the kernel wouldn't get to use this
sysctl, and the userspace code can't be easily reused in other
open-source projects, so the community gets nothing useful. That
doesn't seem great.

Or, we could take this patch and maintain it in the Android kernel
tree. Android kernels get a tiny bit further from mainline. Other users
of the kernel wouldn't get to use this sysctl, and again the community
gets nothing useful. That doesn't seem great either.

Neil Horman Dec. 19, 2019, 4:57 p.m. UTC | #14
On Thu, Dec 19, 2019 at 11:02:32PM +0900, Lorenzo Colitti wrote:
> On Thu, Dec 19, 2019 at 10:17 PM Neil Horman <nhorman@tuxdriver.com> wrote:
> > > As I understand it, those utilities keep the ports reserved by binding
> > > to them so that no other process can. This doesn't work for Android
> > > because there are conformance tests that probe the device from the
> > > network and check that there are no open ports.
> > >
> > But you can address that with some augmentation to portreserve (i.e. just have
> > it add an iptables rule to drop frames on that port, or respond with a port
> > unreachable icmp message)
> 
> There are also tests that run on device by inspecting
> /proc/net/{tcp,udp} to check that there are no open sockets. We'd have
> to change them as well.
> 
Ok, that seems reasonable.

> But sure. It's not impossible to do this in userspace. We wouldn't use
> portreserve itself because the work to package it and make it work on
> Android (which has no /etc/services file), would likely be greater
> than just adding the code to an existing Android daemon (and because
> the reaction of the portreserve maintainers might be similar to yours:
> "you don't need to add code to portreserve for this, just use a script
> that shells out to iptables").
> 
Possibly, but sure, you could add the same functionality to some other existing
daemon.

> But in any case, the result would be more complicated to use and
> maintain, and it would likely also be less realistic, such that a
> sophisticated conformance test might still find that the port was
> actually bound.
One would think that a sufficiently sophisticated script could understand that a
port was bound not for the purposes of use, but rather for the purposes of
prevention of use by other processes.  But I take your meaning, the fanout here
starts to get large.

> Other users of the kernel wouldn't get to use this
> sysctl, and the userspace code can't be easily reused in other
> open-source projects, so the community gets nothing useful. That
> doesn't seem great.
> 
Well, that assumes you implement this in a non-open daemon, but that's
your prerogative.

> Or, we could take this patch and maintain it in the Android kernel
> tree. Android kernels get a tiny bit further from mainline. Other uses
> of the kernel wouldn't get to use this sysctl, and again the community
> gets nothing useful. That doesn't seem great either.
> 
That seems... aggressive.  I'm not saying this is a bad feature, I'm really just
trying to think through how else this might be accomplished without the need to
implement and maintain another sysctl.

FWIW, bpf offers hooks in both inet6_bind and inet_bind.  Another option would
be to implement a bpf program at each of those hooks that filters on the set of
blacklisted ports you want to prevent the use of.  I'm not sure how wide the
scope of this feature is for use, but if it's limited to your use case, perhaps
that's an alternative solution.

Neil

David Miller Dec. 19, 2019, 5:52 p.m. UTC | #15
From: Lorenzo Colitti <lorenzo@google.com>
Date: Thu, 19 Dec 2019 23:02:32 +0900

> But in any case, the result would be more complicated to use and
> maintain, and it would likely also be less realistic, such that a
> sophisticated conformance test might still find that the port was
> actually bound. Other users of the kernel wouldn't get to use this
> sysctl, and the userspace code can't be easily reused in other
> open-source projects, so the community gets nothing useful. That
> doesn't seem great.

The same argument can be made about kernel changes that are only
needed by Android because they refuse to use a userspace solution that
frankly can do the job.

Can you see why these Android special case discussions are so
frustrating for kernel devs?

And using the "we'll just have a local kernel change in the Android
kernel" threat as leverage in the discussion... yeah very unpleasant
indeed.

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index fd26788e8c96..7129646a18bd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -940,6 +940,19 @@  ip_local_reserved_ports - list of comma separated ranges
 
 	Default: Empty
 
+ip_local_unbindable_ports - list of comma separated ranges
+	Specify the ports which are not directly bind()able.
+
+	Usually you would use this to block the use of ports which
+	are invalid due to something outside of the control of the
+	kernel.  For example a port stolen by the nic for serial
+	console, remote power management or debugging.
+
+	There's a relatively high chance you will also want to list
+	these ports in 'ip_local_reserved_ports' to prevent autobinding.
+
+	Default: Empty
+
 ip_unprivileged_port_start - INTEGER
 	This is a per-namespace sysctl.  It defines the first
 	unprivileged port in the network namespace.  Privileged ports
diff --git a/include/net/ip.h b/include/net/ip.h
index 02d68e346f67..14b99bf59ffc 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -346,6 +346,13 @@  static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
 	return test_bit(port, net->ipv4.sysctl_local_reserved_ports);
 }
 
+static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
+{
+	if (!net->ipv4.sysctl_local_unbindable_ports)
+		return false;
+	return test_bit(port, net->ipv4.sysctl_local_unbindable_ports);
+}
+
 static inline bool sysctl_dev_name_is_allowed(const char *name)
 {
 	return strcmp(name, "default") != 0  && strcmp(name, "all") != 0;
@@ -362,6 +369,11 @@  static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
 	return false;
 }
 
+static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
+{
+	return false;
+}
+
 static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port)
 {
 	return port < PROT_SOCK;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c0c0791b1912..6a235651925d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -197,6 +197,7 @@  struct netns_ipv4 {
 
 #ifdef CONFIG_SYSCTL
 	unsigned long *sysctl_local_reserved_ports;
+	unsigned long *sysctl_local_unbindable_ports;
 	int sysctl_ip_prot_sock;
 #endif
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2fe295432c24..b26046431612 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -494,6 +494,10 @@  int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 		goto out;
 
 	snum = ntohs(addr->sin_port);
+	err = -EPERM;
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		goto out;
+
 	err = -EACCES;
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fcb2cd167f64..fd363b57a653 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -745,6 +745,13 @@  static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_do_large_bitmap,
 	},
+	{
+		.procname	= "ip_local_unbindable_ports",
+		.data		= &init_net.ipv4.sysctl_local_unbindable_ports,
+		.maxlen		= 65536,
+		.mode		= 0644,
+		.proc_handler	= proc_do_large_bitmap,
+	},
 	{
 		.procname	= "ip_no_pmtu_disc",
 		.data		= &init_net.ipv4.sysctl_ip_no_pmtu_disc,
@@ -1353,11 +1360,17 @@  static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
 	if (!net->ipv4.sysctl_local_reserved_ports)
-		goto err_ports;
+		goto err_reserved_ports;
+
+	net->ipv4.sysctl_local_unbindable_ports = kzalloc(65536 / 8, GFP_KERNEL);
+	if (!net->ipv4.sysctl_local_unbindable_ports)
+		goto err_unbindable_ports;
 
 	return 0;
 
-err_ports:
+err_unbindable_ports:
+	kfree(net->ipv4.sysctl_local_reserved_ports);
+err_reserved_ports:
 	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
 err_reg:
 	if (!net_eq(net, &init_net))
@@ -1370,6 +1383,7 @@  static __net_exit void ipv4_sysctl_exit_net(struct net *net)
 {
 	struct ctl_table *table;
 
+	kfree(net->ipv4.sysctl_local_unbindable_ports);
 	kfree(net->ipv4.sysctl_local_reserved_ports);
 	table = net->ipv4.ipv4_hdr->ctl_table_arg;
 	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 60e2ff91a5b3..3c83e3200543 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -292,6 +292,8 @@  static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		return -EPERM;
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 0b485952a71c..d1c93542419d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -384,6 +384,9 @@  static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 		}
 	}
 
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		return -EPERM;
+
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
@@ -1061,6 +1064,8 @@  static int sctp_connect_new_asoc(struct sctp_endpoint *ep,
 		if (sctp_autobind(sk))
 			return -EAGAIN;
 	} else {
+		if (inet_is_local_unbindable_port(net, ep->base.bind_addr.port))
+			return -EPERM;
 		if (inet_port_requires_bind_service(net, ep->base.bind_addr.port) &&
 		    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 			return -EACCES;
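
The check order the patch adds to the explicit bind paths can be modelled in userspace. The sketch below is a simplified Python model, not kernel code: parse_port_ranges and check_explicit_bind are hypothetical names, the range parser only approximates the "list of comma separated ranges" format described in ip-sysctl.txt, and prot_sock=1024 is an assumed value for the privileged-port threshold.

```python
import errno

def parse_port_ranges(text):
    """Parse a comma separated list of ports and ranges, e.g. "3967,8080-8082"."""
    ports = set()
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue                       # tolerate empty input and stray commas
        lo, sep, hi = item.partition("-")
        first = int(lo)
        last = int(hi) if sep else first
        if not (0 <= first <= last <= 65535):
            raise ValueError("bad port range: %r" % item)
        ports.update(range(first, last + 1))
    return ports

def check_explicit_bind(port, unbindable, prot_sock=1024, has_bind_service=False):
    """Return 0 or a negative errno, following the order of checks in the patch.

    The unbindable check comes first and ignores capabilities, so it also
    rejects callers that hold CAP_NET_BIND_SERVICE. Port 0 (autobind) is
    not checked against either condition.
    """
    if port and port in unbindable:
        return -errno.EPERM
    if port and port < prot_sock and not has_bind_service:
        return -errno.EACCES
    return 0

unbindable = parse_port_ranges("53,3967,8080-8082")
```

For example, with the set above, check_explicit_bind(53, unbindable, has_bind_service=True) yields -EPERM, whereas port 80 without the capability yields -EACCES.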