[v2] net: introduce ip_local_unbindable_ports sysctl

Message ID 20191209224530.156283-1-zenczykowski@gmail.com
State Changes Requested
Delegated to: David Miller

Commit Message

Maciej Żenczykowski Dec. 9, 2019, 10:45 p.m. UTC
From: Maciej Żenczykowski <maze@google.com>

and associated inet_is_local_unbindable_port() helper function:
use it to make explicitly binding to an unbindable port return
-EPERM 'Operation not permitted'.

Autobind doesn't honour this new sysctl since:
  (a) you can simply set both if that's the behaviour you desire
  (b) there could be a use for preventing explicit while allowing auto
  (c) it's faster in the relatively critical path of doing port selection
      during connect() to only check one bitmap instead of both

Various ports may have special use cases which are not suitable for
use by general userspace applications. Currently, ports specified in
the ip_local_reserved_ports sysctl are only excluded from automatic
port assignment; nothing prevents you from explicitly binding to
them - even from an entirely unprivileged process.
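
The split between the two sysctls described above can be modelled in userspace roughly as follows. This is only a toy sketch, not kernel code: `parse_port_ranges` and `PortPolicy` are hypothetical names, and the input format assumed is the "comma separated ranges" form that ip_local_reserved_ports already documents.

```python
def parse_port_ranges(text):
    """Parse a comma separated list of ports/ranges, e.g. "1,2-4,3967"."""
    ports = set()
    for item in text.strip().split(","):
        item = item.strip()
        if not item:
            continue  # an empty sysctl value means "no ports"
        lo, _, hi = item.partition("-")
        first, last = int(lo), int(hi or lo)
        if not 0 <= first <= last <= 65535:
            raise ValueError("bad port range: %r" % item)
        ports.update(range(first, last + 1))
    return ports


class PortPolicy:
    """Toy userspace model of the two sysctls (hypothetical helper)."""

    def __init__(self, reserved="", unbindable=""):
        self.reserved = parse_port_ranges(reserved)
        self.unbindable = parse_port_ranges(unbindable)

    def explicit_bind_allowed(self, port):
        # An explicit bind() to a non-zero port consults only the
        # unbindable set.
        return port not in self.unbindable

    def autobind_allowed(self, port):
        # Automatic port selection consults only the reserved set.
        return port not in self.reserved
```

In this model, a port that should never be used by the host would be listed in both sets, which mirrors point (a) above.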

In certain cases it is desirable to prevent the host from assigning the
ports even in case of explicit binds, even from superuser processes.

Example use cases might be:
 - a port being stolen by the nic for remote serial console, remote
   power management or some other sort of debugging functionality
   (crash collection, gdb, direct access to some other microcontroller
   on the nic or motherboard, remote management of the nic itself).
 - a transparent proxy where packets are being redirected: in case
   a socket matches this connection, packets from this application
   would be incorrectly sent to one of the endpoints.

Initially I wanted to solve this problem via the simple one line:

static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port) {
-       return port < net->ipv4.sysctl_ip_prot_sock;
+       return port < net->ipv4.sysctl_ip_prot_sock || inet_is_local_reserved_port(net, port);
}

However, this doesn't work for two reasons:
  (a) it changes userspace visible behaviour of the existing local
      reserved ports sysctl, and there appears to be enough documentation
      on the internet talking about setting it to make this a bad idea
  (b) it doesn't prevent privileged apps from using these ports:
      CAP_NET_BIND_SERVICE is relatively likely to be available to, for
      example, a recursive DNS server so it can listen on port 53, and
      such a server also needs to do src port randomization for outgoing
      queries for security reasons (and it thus does manual port binding).

If we *know* that certain ports are simply unusable, then it's better
that nothing even gets the opportunity to try to use them.  This way we at
least get a quick failure, instead of some sort of timeout (or possibly
even corruption of the data stream of the non-kernel based use case).

Test:
  vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports

  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
  vm:~# echo 3967 > /proc/sys/net/ipv4/ip_local_unbindable_ports
  vm:~# cat /proc/sys/net/ipv4/ip_local_unbindable_ports
  3967
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM, 0); s.bind(("::", 3967))'
  socket.error: (1, 'Operation not permitted')
  vm:~# python -c 'import socket; s = socket.socket(socket.AF_INET6, socket.SOCK_DGRAM, 0); s.bind(("::", 3967))'
  socket.error: (1, 'Operation not permitted')

Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Sean Tranchetti <stranche@codeaurora.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Linux SCTP <linux-sctp@vger.kernel.org>
Reviewed-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
---
 Documentation/networking/ip-sysctl.txt | 13 +++++++++++++
 include/net/ip.h                       | 12 ++++++++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/af_inet.c                     |  4 ++++
 net/ipv4/sysctl_net_ipv4.c             | 18 ++++++++++++++++--
 net/ipv6/af_inet6.c                    |  2 ++
 net/sctp/socket.c                      |  5 +++++
 7 files changed, 53 insertions(+), 2 deletions(-)
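
The manual test from the commit message could also be scripted for Python 3 along these lines. This is a minimal sketch: `try_bind` and `describe` are hypothetical helper names, and the EPERM outcome is only what one would expect on a kernel carrying this patch with the port listed in ip_local_unbindable_ports.

```python
import errno
import socket


def try_bind(port, sock_type=socket.SOCK_STREAM,
             family=socket.AF_INET6, addr="::"):
    """Return 0 if bind() succeeds, otherwise the errno it failed with."""
    with socket.socket(family, sock_type, 0) as s:
        try:
            s.bind((addr, port))
        except OSError as e:
            return e.errno
    return 0


def describe(port):
    """Print the bind() result for TCP and UDP on the given port."""
    for name, sock_type in (("tcp", socket.SOCK_STREAM),
                            ("udp", socket.SOCK_DGRAM)):
        err = try_bind(port, sock_type)
        status = "ok" if err == 0 else errno.errorcode.get(err, str(err))
        print("%s port %d: %s" % (name, port, status))
```

Calling `describe(3967)` before and after writing 3967 to the sysctl would then show whether the explicit bind starts failing.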

Comments

Jakub Kicinski Dec. 9, 2019, 11:42 p.m. UTC | #1
On Mon,  9 Dec 2019 14:45:30 -0800, Maciej Żenczykowski wrote:
> Example use cases might be:
>  - a port being stolen by the nic for remote serial console, remote
>    power management or some other sort of debugging functionality
>    (crash collection, gdb, direct access to some other microcontroller
>    on the nic or motherboard, remote management of the nic itself).

This use case makes me a little uncomfortable.

Could you elaborate what protocols and products are in need of this
functionality?

Why can't the NIC just get its own IP like it usually does with NCSI?
Maciej Żenczykowski Dec. 10, 2019, 12:02 a.m. UTC | #2
> Could you elaborate what protocols and products are in need of this
> functionality?

The ones I'm aware of are:
(a) Google's servers
(b) Android on at least some chipsets (Qualcomm at the bare minimum,
but I think it's a pretty standard solution) where there's a complex
port sharing scheme between the Linux kernel on the Application
Processor and the Firmware running on the modem (for ipv4 we only get
one ip address from the cellular carrier).  It's basically required
for things like wifi calling to work.

> Why can't the NIC just get its own IP like it usually does with NCSI?

Because often these nics are deployed as in place upgrades in
environments where there's a limited number of IPs.
Say a rack with a /27 ipv4 subnet (2**5 = 32 -> 29 usable ips, since
network/broadcast/gateway are burned) and 15+ pre-existing machines.
This means there's not enough IPs to assign separate ones for the nics.
Renumbering the rack would imply renumbering the datacenter, etc...
And ipv4 - even RFC1918 - has long run out - so even in new
deployments there's not enough IPv4 ips to give to nics, and IPv6
isn't yet deployed *everywhere*.
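
The /27 arithmetic above can be checked with a short sketch. The prefix 192.0.2.0/27 is a documentation range chosen purely for illustration, and the assumption of one extra address per machine for its nic is mine, not stated in the thread.

```python
import ipaddress

# 192.0.2.0/27 is a documentation prefix, used here only as an example.
rack = ipaddress.ip_network("192.0.2.0/27")

total = rack.num_addresses   # 2**5 = 32 addresses in a /27
burned = 3                   # network, broadcast and gateway addresses
usable = total - burned      # 29 addresses left for hosts

machines = 15                # the "15+ pre-existing machines"
# Assumption: one address per machine plus one per nic management endpoint.
needed_with_nic_ips = 2 * machines

print(total, usable, needed_with_nic_ips)
```

With 15 machines the rack would already need 30 addresses under that assumption, one more than the 29 available.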
Jakub Kicinski Dec. 10, 2019, 12:18 a.m. UTC | #3
On Tue, 10 Dec 2019 01:02:08 +0100, Maciej Żenczykowski wrote:
> > Could you elaborate what protocols and products are in need of this
> > functionality?  
> 
> The ones I'm aware of are:
> (a) Google's servers
> (b) Android on at least some chipsets (Qualcomm at the bare minimum,
> but I think it's a pretty standard solution) where there's a complex
> port sharing scheme between the Linux kernel on the Application
> Processor and the Firmware running on the modem (for ipv4 we only get
> one ip address from the cellular carrier).  It's basically required
> for things like wifi calling to work.

Okay, that's what I was suspecting.  It'd be great if the real
motivation for a patch was spelled out in the commit message :/

So some SoCs which run non-vanilla kernels require hacks to steal
ports from the networking stack for use by proprietary firmware.

I don't see how merging this patch benefits the community.

> > Why can't the NIC just get its own IP like it usually does with NCSI?  
> 
> Because often these nics are deployed as in place upgrades in
> environments where there's a limited number of IPs.
> Say a rack with a /27 ipv4 subnet (2**5 = 32 -> 29 usable ips, since
> network/broadcast/gateway are burned) and 15+ pre-existing machines.
> This means there's not enough IPs to assign separate ones for the nics.
> Renumbering the rack would imply renumbering the datacenter, etc...
> And ipv4 - even RFC1918 - has long run out - so even in new
> deployments there's not enough IPv4 ips to give to nics, and IPv6
> isn't yet deployed *everywhere*.

So the conditions for this are:
 - in-place upgrade of an existing rack
 - IPv4 only
 - the existing servers didn't have NCSI or otherwise IPs for OOB
   control

Unlike the AP one this sounds like a very rare scenario..
Subash Abhinov Kasiviswanathan Dec. 10, 2019, 7 a.m. UTC | #4
> Okay, that's what I was suspecting.  It'd be great if the real
> motivation for a patch was spelled out in the commit message :/
> 
> So some SoCs which run non-vanilla kernels require hacks to steal
> ports from the networking stack for use by proprietary firmware.
> 
> I don't see how merging this patch benefits the community.
> 

This is just a transparent proxy scenario though.
We block the specific ports so that there is no unrelated traffic
belonging to host proxied here incorrectly.
Maciej Żenczykowski Dec. 10, 2019, 11:46 a.m. UTC | #5
> Okay, that's what I was suspecting.  It'd be great if the real
> motivation for a patch was spelled out in the commit message :/

It is, but the commit message is already extremely long.
At some point essays and discussions belong in email and not in the
commit message.

Here's another use case:

A network where firewall policy or network behaviour blocks all
traffic using specific ports.

I've seen generic firewalls that unconditionally drop all BGP or SMTP
port traffic, or all traffic on ports 5060/5061 (regardless of
direction) or on 25/53/80/123/443/853/3128/8000/8080/8088/8888
(usually due to some ill guided security policies against sip or open
proxies or xxx). If you happen to use port XXXX as your source port
your connection just hangs (packets are blackholed).

Sure you can argue the network is broken, but in the real world you
often can't fix it... Go try and convince your ISP that they should
only drop inbound connections to port 8000, but not outgoing
connections from port 8000 - you'll go crazy before you find someone
who even understands what you're talking about - and even if you find
such a person, they'll probably be too busy to change things - and
even though it might be a 1 letter change (port -> dport) - it still
might take months of testing and rollout before it's fully deployed.

I've seen networks where specific ports are automatically classified
as super high priority (network control) so you don't want anything
using these ports without very good reason (common for BGP for
example, or for encap schemes).

Or a specific port number being reserved by GUE or other udp encap
schemes and thus unsafe to use for generic traffic (because the
network or even the kernel itself might for example auto decapsulate
it [via tc ebpf for example], or parse the interior of the packet for
flowhashing purposes...).

[I'll take this opportunity to point out that due to poor flow hashing
behaviour GRE is basically unusable at scale (not to mention poorly
extensible), and thus GUE and other UDP encap schemes are taking over]

Or you might want to forward udp port 4500 from your external IP to a
dedicated ipsec box or some hardware offload engine... etc.

> So some SoCs which run non-vanilla kernels require hacks to steal
> ports from the networking stack for use by proprietary firmware.
> I don't see how merging this patch benefits the community.

I think you're failing to account for the fact that the majority of
Linux users are Android users - there's around 2.5 billion Android
phones in the wild... - but perhaps you don't consider your users (or
Android?) to be part of your community?

btw. Chrome OS is also Linux based (and if a quick google search is to
be believed, about 1/7th of the linux desktop/laptop share), but since
it supports running Android apps, it needs to have all Android
specific generic kernel changes...

The reason Android runs non-vanilla kernels is *because* patches like
this - that make Linux work in the real world - are missing from
vanilla Linux
(I can think of a few other networking patches off the top of my head
where we've been unable to upstream them for no particularly good
reason).

> So the conditions for this are:
>  - in-place upgrade of an existing rack

No that's just an example.  That said in place upgrades aren't
particularly rare.

>  - IPv4 only

Believe it or not most embedded gear is still very much ipv4 only, as
much as I hate that - I've been working on ipv6 deployment for over a
decade now, and the amount of stuff that's IPv4-only is still
staggering.

> Unlike the AP one this sounds like a very rare scenario..

And yet I have 2 ports for 2 different pieces of hardware that I need
to block this way (and a third one for GUE and a fourth one for some
ipsec-like crypto transport).
Jakub Kicinski Dec. 10, 2019, 5:12 p.m. UTC | #6
On Tue, 10 Dec 2019 07:00:29 +0000, subashab@codeaurora.org wrote:
> > Okay, that's what I was suspecting.  It'd be great if the real
> > motivation for a patch was spelled out in the commit message :/
> > 
> > So some SoCs which run non-vanilla kernels require hacks to steal
> > ports from the networking stack for use by proprietary firmware.
> > 
> > I don't see how merging this patch benefits the community.
> 
> This is just a transparent proxy scenario though.
> We block the specific ports so that there is no unrelated traffic
> belonging to host proxied here incorrectly.

It's a form of one, agreed, although let's be honest - someone reading
the transparent proxy use case in the commit message will not think of
a complex AP scenario, but rather of a locally configured transparent
proxy with IPtables or sockets or such.
Jakub Kicinski Dec. 10, 2019, 5:31 p.m. UTC | #7
On Tue, 10 Dec 2019 12:46:29 +0100, Maciej Żenczykowski wrote:
> > Okay, that's what I was suspecting.  It'd be great if the real
> > motivation for a patch was spelled out in the commit message :/  
> 
> It is, but the commit message is already extremely long.

Long, yet it doesn't mention the _real_ reason for the patch.

> At some point essays and discussions belong in email and not in the
> commit message.

Ugh just admit you didn't mention the primary use case in the commit
log, and we can move on.

> Here's another use case:
> 
> A network where firewall policy or network behaviour blocks all
> traffic using specific ports.
> 
> I've seen generic firewalls that unconditionally drop all BGP or SMTP
> port traffic, or all traffic on ports 5060/5061 (regardless of
> direction) or on 25/53/80/123/443/853/3128/8000/8080/8088/8888
> (usually due to some ill guided security policies against sip or open
> proxies or xxx). If you happen to use port XXXX as your source port
> your connection just hangs (packets are blackholed).
> 
> Sure you can argue the network is broken, but in the real world you
> often can't fix it... Go try and convince your ISP that they should
> only drop inbound connections to port 8000, but not outgoing
> connections from port 8000 - you'll go crazy before you find someone
> who even understands what you're talking about - and even if you find
> such a person, they'll probably be too busy to change things - and
> even though it might be a 1 letter change (port -> dport) - it still
> might take months of testing and rollout before it's fully deployed.
> 
> I've seen networks where specific ports are automatically classified
> as super high priority (network control) so you don't want anything
> using these ports without very good reason (common for BGP for
> example, or for encap schemes).
> 
> Or a specific port number being reserved by GUE or other udp encap
> schemes and thus unsafe to use for generic traffic (because the
> network or even the kernel itself might for example auto decapsulate
> it [via tc ebpf for example], or parse the interior of the packet for
> flowhashing purposes...).
> 
> [I'll take this opportunity to point out that due to poor flow hashing
> behaviour GRE is basically unusable at scale (not to mention poorly
> extensible), and thus GUE and other UDP encap schemes are taking over]
> 
> Or you might want to forward udp port 4500 from your external IP to a
> dedicated ipsec box or some hardware offload engine... etc.

It's networking, you can concoct a scenario to justify anything.

> > So some SoCs which run non-vanilla kernels require hacks to steal
> > ports from the networking stack for use by proprietary firmware.
> > I don't see how merging this patch benefits the community.  
> 
> I think you're failing to account for the fact that the majority of
> Linux users are Android users - there's around 2.5 billion Android
> phones in the wild... - but perhaps you don't consider your users (or
> Android?) to be part of your community?

I don't consider users of non-vanilla kernels to necessarily be a
reason to merge patches upstream, no. They carry literally millions 
of lines of patches out of tree, let them carry this patch, too.
If I can't boot a vanilla kernel on those devices, and clearly there is
no intent by the device manufacturers for me to ever be able to, why would I
care? Some companies care about upstream, and those should be rewarded
by us taking some of the maintenance off their hands. Some don't:
https://www.youtube.com/watch?v=_36yNWw_07g (link to Linus+nVidia video) 
even though they sell the majority of SoCs for 2.5 billion devices.

> btw. Chrome OS is also Linux based (and if a quick google search is to
> be believed, about 1/7th of the linux desktop/laptop share), but since
> it supports running Android apps, it needs to have all Android
> specific generic kernel changes...
> 
> The reason Android runs non-vanilla kernels is *because* patches like
> this - that make Linux work in the real world - are missing from
> vanilla Linux
> (I can think of a few other networking patches off the top of my head
> where we've been unable to upstream them for no particularly good
> reason).

The way to get those patches upstream is to have an honest discussion
about the use case so people can validate the design. Not by sending
a patch with a 5 page commit message which fails to clearly state the
motivation for the feature :/
Subash Abhinov Kasiviswanathan Dec. 10, 2019, 6:12 p.m. UTC | #8
> It's a form of one, agreed, although let's be honest - someone reading
> the transparent proxy use case in the commit message will not think of
> a complex AP scenario, but rather of a locally configured transparent
> proxy with IPtables or sockets or such.

Transparent proxy could be implemented using eBPF + XDP and those don't
need sockets. However, in that case we do need to block those specific
ports to avoid messing with unrelated traffic.
David Miller Dec. 10, 2019, 7:28 p.m. UTC | #9
From: Maciej Żenczykowski <zenczykowski@gmail.com>
Date: Tue, 10 Dec 2019 12:46:29 +0100

> At some point essays and discussions belong in email and not in the
> commit message.

Wrong, full details on the context and impetus matter, no matter how
voluminous.

If you put what you ate for breakfast in the commit log message I wouldn't
complain.
Lorenzo Colitti Dec. 13, 2019, 12:16 a.m. UTC | #10
On Wed, Dec 11, 2019 at 2:31 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> I don't consider users of non-vanilla kernels to necessarily be a
> reason to merge patches upstream, no. They carry literally millions
> of lines of patches out of tree, let them carry this patch, too.
> If I can't boot a vanilla kernel on those devices, and clearly there is
> no intent by the device manufacturers for me to ever will, why would I
> care?

That's *not* the intent.
https://arstechnica.com/gadgets/2019/11/google-outlines-plans-for-mainline-linux-kernel-support-in-android/

> > The reason Android runs non-vanilla kernels is *because* patches like
> > this - that make Linux work in the real world - are missing from
> > vanilla Linux

That's exactly the point here. Saying, "Android will never use
mainline, so why should mainline take their patches" is a
self-fulfilling prophecy. Obviously, if mainline never takes Android
patches, then yes, Android will never be able to use mainline. We do
have an Android tree we can take this patch into. But we don't want to
take it without at least attempting to get it into mainline first.

The use case here is pretty simple. There are many CPUs in a mobile
phone. The baseband processor ("modem") implements much of the
functionality required by cellular networks, so if you want cellular
voice or data, it needs to be able to talk to the network. For many
reasons (architectural, power conservation, security), the modem needs
to be able to talk directly to the cellular network. This includes,
for example, SIP/RTP media streams that go directly to the audio
hardware, IKE traffic that is sent directly by the modem because only
the modem has the keys, etc. Normally this happens directly on the
cellular interface and Linux/Android is unaware of it. But, when using
wifi calling (which is an IPsec tunnel over wifi to an endpoint inside
the cellular network), the device only has one IPv4 address, and the
baseband processor and the application processor (the CPU that runs
Linux/Android) have to share it. This means that some ports have to be
reserved so that the baseband processor can depend on using them. NAT
cannot be used because the 3GPP standards require protocols that are
not very NAT-friendly, and because the modem needs to be able to
accept unsolicited inbound traffic.

Other than "commit message doesn't have a use case", are there
technical concerns with this patch?
Jakub Kicinski Dec. 13, 2019, 12:47 a.m. UTC | #11
On Fri, 13 Dec 2019 09:16:03 +0900, Lorenzo Colitti wrote:
> On Wed, Dec 11, 2019 at 2:31 AM Jakub Kicinski wrote:
> > I don't consider users of non-vanilla kernels to necessarily be a
> > reason to merge patches upstream, no. They carry literally millions
> > of lines of patches out of tree, let them carry this patch, too.
> > If I can't boot a vanilla kernel on those devices, and clearly there is
> > no intent by the device manufacturers for me to ever will, why would I
> > care?  
> 
> That's *not* the intent.
> https://arstechnica.com/gadgets/2019/11/google-outlines-plans-for-mainline-linux-kernel-support-in-android/
> 
> > > The reason Android runs non-vanilla kernels is *because* patches like
> > > this - that make Linux work in the real world - are missing from
> > > vanilla Linux  
> 
> That's exactly the point here. Saying, "Android will never use
> mainline, so why should mainline take their patches" is a
> self-fulfilling prophecy. Obviously, if mainline never takes Android
> patches, then yes, Android will never be able to use mainline. We do
> have an Android tree we can take this patch into. But we don't want to
> take it without at least attempting to get it into mainline first.
> 
> The use case here is pretty simple. There are many CPUs in a mobile
> phone. The baseband processor ("modem") implements much of the
> functionality required by cellular networks, so if you want cellular
> voice or data, it needs to be able to talk to the network. For many
> reasons (architectural, power conservation, security), the modem needs
> to be able to talk directly to the cellular network. This includes,
> for example, SIP/RTP media streams that go directly to the audio
> hardware, IKE traffic that is sent directly by the modem because only
> the modem has the keys, etc. Normally this happens directly on the
> cellular interface and Linux/Android is unaware of it. But, when using
> wifi calling (which is an IPsec tunnel over wifi to an endpoint inside
> the cellular network), the device only has one IPv4 address, and the
> baseband processor and the application processor (the CPU that runs
> Linux/Android) have to share it. This means that some ports have to be
> reserved so that the baseband processor can depend on using them. NAT
> cannot be used because the 3GPP standards require protocols that are
> not very NAT-friendly, and because the modem needs to be able to
> accept unsolicited inbound traffic.
> 
> Other than "commit message doesn't have a use case", are there
> technical concerns with this patch?

Maybe a minor question or two, but the main complaint is the commit
message.

How are the ports which get reserved communicated between the baseband
and the AP? Is this part of the standard? Is the driver that talks to
the base band in the user space and it knows which ports to reserve
statically? Or does the modem dynamically request ports to
reserve/inform the host of ports in use?

Should the sysctl interface make sure there are no existing sockets
using requested ports which would stop working? If we may need it one
day, better add it now..
Lorenzo Colitti Dec. 13, 2019, 12:57 a.m. UTC | #12
On Fri, Dec 13, 2019 at 9:47 AM Jakub Kicinski
<jakub.kicinski@netronome.com> wrote:
> How are the ports which get reserved communicated between the baseband
> and the AP? Is this part of the standard? Is the driver that talks to
> the base band in the user space and it knows which ports to reserve
> statically? Or does the modem dynamically request ports to
> reserve/inform the host of ports in use?

I'm not an expert in that part of the system, but my understanding is
that the primary way this is used is to pre-allocate a block of ports
to be used by the modem on boot, before other applications can bind to
ports. Subash, do you have more details?
Subash Abhinov Kasiviswanathan Dec. 13, 2019, 1:53 a.m. UTC | #13
On 2019-12-12 17:57, Lorenzo Colitti wrote:
> On Fri, Dec 13, 2019 at 9:47 AM Jakub Kicinski
> <jakub.kicinski@netronome.com> wrote:
>> How are the ports which get reserved communicated between the baseband
>> and the AP? Is this part of the standard? Is the driver that talks to
>> the base band in the user space and it knows which ports to reserve
>> statically? Or does the modem dynamically request ports to
>> reserve/inform the host of ports in use?
> 
> I'm not an expert in that part of the system, but my understanding is
> that the primary way this is used is to pre-allocate a block of ports
> to be used by the modem on boot, before other applications can bind to
> ports. Subash, do you have more details?

AFAIK these ports are randomly picked and not from a standard.
Userspace gets this information through qrtr during boot.

At least in our case, there cannot be any existing user of these ports
since these ports are blocked prior to mobile connection establishment.
We could call SOCK_DIAG_DESTROY on these ports from userspace as a
precaution as applications would gracefully handle the socket errors.
Jakub Kicinski Dec. 13, 2019, 2:04 a.m. UTC | #14
On Thu, 12 Dec 2019 18:53:19 -0700, subashab@codeaurora.org wrote:
> On 2019-12-12 17:57, Lorenzo Colitti wrote:
> > On Fri, Dec 13, 2019 at 9:47 AM Jakub Kicinski wrote:  
> >> How are the ports which get reserved communicated between the baseband
> >> and the AP? Is this part of the standard? Is the driver that talks to
> >> the base band in the user space and it knows which ports to reserve
> >> statically? Or does the modem dynamically request ports to
> >> reserve/inform the host of ports in use?  
> > 
> > I'm not an expert in that part of the system, but my understanding is
> > that the primary way this is used is to pre-allocate a block of ports
> > to be used by the modem on boot, before other applications can bind to
> > ports. Subash, do you have more details?  
> 
> AFAIK these ports are randomly picked and not from a standard.
> Userspace gets this information through qrtr during boot.
> 
> At least in our case, there cannot be any existing user of these ports
> since these ports are blocked prior to mobile connection establishment.

Not even a listening socket?

> We could call SOCK_DIAG_DESTROY on these ports from userspace as a
> precaution as applications would gracefully handle the socket errors.

Right, or kernel could walk them, since presumably every application
using this functionality should do it, anyway? But no strong feeling 
on this if nobody else feels this is needed.

Patch

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index fd26788e8c96..7129646a18bd 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -940,6 +940,19 @@  ip_local_reserved_ports - list of comma separated ranges
 
 	Default: Empty
 
+ip_local_unbindable_ports - list of comma separated ranges
+	Specify the ports which are not directly bind()able.
+
+	Usually you would use this to block the use of ports which
+	are invalid due to something outside of the control of the
+	kernel.  For example a port stolen by the nic for serial
+	console, remote power management or debugging.
+
+	There's a relatively high chance you will also want to list
+	these ports in 'ip_local_reserved_ports' to prevent autobinding.
+
+	Default: Empty
+
 ip_unprivileged_port_start - INTEGER
 	This is a per-namespace sysctl.  It defines the first
 	unprivileged port in the network namespace.  Privileged ports
diff --git a/include/net/ip.h b/include/net/ip.h
index 5b317c9f4470..045432e6d18e 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -346,6 +346,13 @@  static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
 	return test_bit(port, net->ipv4.sysctl_local_reserved_ports);
 }
 
+static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
+{
+	if (!net->ipv4.sysctl_local_unbindable_ports)
+		return false;
+	return test_bit(port, net->ipv4.sysctl_local_unbindable_ports);
+}
+
 static inline bool sysctl_dev_name_is_allowed(const char *name)
 {
 	return strcmp(name, "default") != 0  && strcmp(name, "all") != 0;
@@ -362,6 +369,11 @@  static inline bool inet_is_local_reserved_port(struct net *net, unsigned short p
 	return false;
 }
 
+static inline bool inet_is_local_unbindable_port(struct net *net, unsigned short port)
+{
+	return false;
+}
+
 static inline bool inet_port_requires_bind_service(struct net *net, unsigned short port)
 {
 	return port < PROT_SOCK;
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index c0c0791b1912..6a235651925d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -197,6 +197,7 @@  struct netns_ipv4 {
 
 #ifdef CONFIG_SYSCTL
 	unsigned long *sysctl_local_reserved_ports;
+	unsigned long *sysctl_local_unbindable_ports;
 	int sysctl_ip_prot_sock;
 #endif
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 2fe295432c24..b26046431612 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -494,6 +494,10 @@  int __inet_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 		goto out;
 
 	snum = ntohs(addr->sin_port);
+	err = -EPERM;
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		goto out;
+
 	err = -EACCES;
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index fcb2cd167f64..fd363b57a653 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -745,6 +745,13 @@  static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_do_large_bitmap,
 	},
+	{
+		.procname	= "ip_local_unbindable_ports",
+		.data		= &init_net.ipv4.sysctl_local_unbindable_ports,
+		.maxlen		= 65536,
+		.mode		= 0644,
+		.proc_handler	= proc_do_large_bitmap,
+	},
 	{
 		.procname	= "ip_no_pmtu_disc",
 		.data		= &init_net.ipv4.sysctl_ip_no_pmtu_disc,
@@ -1353,11 +1360,17 @@  static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_local_reserved_ports = kzalloc(65536 / 8, GFP_KERNEL);
 	if (!net->ipv4.sysctl_local_reserved_ports)
-		goto err_ports;
+		goto err_reserved_ports;
+
+	net->ipv4.sysctl_local_unbindable_ports = kzalloc(65536 / 8, GFP_KERNEL);
+	if (!net->ipv4.sysctl_local_unbindable_ports)
+		goto err_unbindable_ports;
 
 	return 0;
 
-err_ports:
+err_unbindable_ports:
+	kfree(net->ipv4.sysctl_local_reserved_ports);
+err_reserved_ports:
 	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
 err_reg:
 	if (!net_eq(net, &init_net))
@@ -1370,6 +1383,7 @@  static __net_exit void ipv4_sysctl_exit_net(struct net *net)
 {
 	struct ctl_table *table;
 
+	kfree(net->ipv4.sysctl_local_unbindable_ports);
 	kfree(net->ipv4.sysctl_local_reserved_ports);
 	table = net->ipv4.ipv4_hdr->ctl_table_arg;
 	unregister_net_sysctl_table(net->ipv4.ipv4_hdr);
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index d727c3b41495..41f453906f2f 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -292,6 +292,8 @@  static int __inet6_bind(struct sock *sk, struct sockaddr *uaddr, int addr_len,
 		return -EINVAL;
 
 	snum = ntohs(addr->sin6_port);
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		return -EPERM;
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 0b485952a71c..d1c93542419d 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -384,6 +384,9 @@  static int sctp_do_bind(struct sock *sk, union sctp_addr *addr, int len)
 		}
 	}
 
+	if (snum && inet_is_local_unbindable_port(net, snum))
+		return -EPERM;
+
 	if (snum && inet_port_requires_bind_service(net, snum) &&
 	    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 		return -EACCES;
@@ -1061,6 +1064,8 @@  static int sctp_connect_new_asoc(struct sctp_endpoint *ep,
 		if (sctp_autobind(sk))
 			return -EAGAIN;
 	} else {
+		if (inet_is_local_unbindable_port(net, ep->base.bind_addr.port))
+			return -EPERM;
 		if (inet_port_requires_bind_service(net, ep->base.bind_addr.port) &&
 		    !ns_capable(net->user_ns, CAP_NET_BIND_SERVICE))
 			return -EACCES;