
[net-next,v1,1/5] net: allow binding socket in a VRF when there's an unbound socket

Message ID: 20180924161326.17167-2-mmanning@vyatta.att-mail.com
State: Changes Requested, archived
Delegated to: David Miller
Series: vrf: allow simultaneous service instances in default and other VRFs

Commit Message

Mike Manning Sept. 24, 2018, 4:13 p.m. UTC
From: Robert Shearman <rshearma@vyatta.att-mail.com>

There is no easy way currently for applications that want to receive
packets in the default VRF to be isolated from packets arriving in
VRFs, which makes using VRF-unaware applications in a VRF-aware system
a potential security risk.

So change the inet socket lookup to prevent packets arriving on a
device enslaved to an l3mdev from matching unbound sockets: remove the
wildcard match used when sk_bound_dev_if is not set and instead rely
on the check against the secondary device index, which will be 0 when
the input device is not enslaved to an l3mdev. An unbound socket
therefore matches in that case and does not match when the input
device is enslaved.

The existing net.ipv4.tcp_l3mdev_accept & net.ipv4.udp_l3mdev_accept
sysctls, which are documented as allowing services to work across all
VRF domains, can still be used for services running in the default VRF
by causing unbound sockets to match packets arriving on a device
enslaved to an l3mdev.

Change the socket binding to take the l3mdev into account so that an
unbound socket no longer conflicts with sockets bound to an l3mdev,
given the datapath isolation now guaranteed.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
---
 Documentation/networking/vrf.txt |  9 +++++----
 include/net/inet6_hashtables.h   |  5 ++---
 include/net/inet_hashtables.h    | 31 ++++++++++++++++++++++++-------
 include/net/inet_sock.h          | 13 +++++++++++++
 net/core/sock.c                  |  2 ++
 net/ipv4/inet_connection_sock.c  | 13 ++++++++++---
 net/ipv4/inet_hashtables.c       | 34 +++++++++++++++++++++-------------
 net/ipv4/ip_sockglue.c           |  3 +++
 net/ipv4/raw.c                   |  4 ++--
 net/ipv4/udp.c                   | 15 ++++++---------
 net/ipv6/datagram.c              |  5 ++++-
 net/ipv6/inet6_hashtables.c      | 14 ++++++--------
 net/ipv6/ipv6_sockglue.c         |  3 +++
 net/ipv6/raw.c                   |  6 +++---
 net/ipv6/udp.c                   | 14 +++++---------
 15 files changed, 109 insertions(+), 62 deletions(-)
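
For illustration, the matching rule described in the commit message is
embodied by the inet_sk_bound_dev_eq() helper added in
include/net/inet_hashtables.h below. A minimal standalone sketch of the
same rule; the function name, ifindex values and test cases here are
illustrative only and are not part of the patch:

#include <assert.h>
#include <stdbool.h>

/* Restatement of the new match rule:
 * - an unbound socket (bound_dev_if == 0) only matches when the packet
 *   did not arrive via an l3mdev slave (sdif == 0), unless the
 *   l3mdev_accept sysctl is set;
 * - a bound socket matches either of the two device indices seen by
 *   the lookup (dif or sdif).
 */
static bool bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
                         int dif, int sdif)
{
        if (!bound_dev_if)
                return !sdif || l3mdev_accept;
        return bound_dev_if == dif || bound_dev_if == sdif;
}

int main(void)
{
        /* assume a packet received via a port (ifindex 3) enslaved to a
         * VRF device (ifindex 10)
         */
        int dif = 10, sdif = 3;

        assert(!bound_dev_eq(false, 0, dif, sdif)); /* unbound, accept=0: no match */
        assert(bound_dev_eq(true, 0, dif, sdif));   /* unbound, accept=1: match */
        assert(bound_dev_eq(false, 10, dif, sdif)); /* bound to the VRF: match */
        assert(!bound_dev_eq(false, 7, dif, sdif)); /* bound elsewhere: no match */

        /* packet received outside any VRF: sdif is 0 */
        assert(bound_dev_eq(false, 0, 3, 0));       /* unbound socket matches */
        return 0;
}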

Comments

David Ahern Sept. 24, 2018, 10:44 p.m. UTC | #1
On 9/24/18 10:13 AM, Mike Manning wrote:
> From: Robert Shearman <rshearma@vyatta.att-mail.com>
> 
> There is no easy way currently for applications that want to receive
> packets in the default VRF to be isolated from packets arriving in
> VRFs, which makes using VRF-unaware applications in a VRF-aware system
> a potential security risk.

That comment is not correct.

The point of the l3mdev sysctl's is to prohibit this case. Setting
net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
interface enslaved to a VRF can not be received by a global socket.

Setting the l3mdev to 1 allows the default socket to work across VRFs.
If that is not what you want for a given app or a given VRF, then one
option is to add netfilter rules on the VRF device to prohibit it. I
just verified this works for both tcp and udp.

Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
have a server running in the default vrf bound to a port AND a server
running bound to a specific vrf and the same port:

udp    UNCONN     0      0      *%red:12345                 *:*
           users:(("vrf-test",pid=1376,fd=3))
udp    UNCONN     0      0       *:12345                 *:*
        users:(("vrf-test",pid=1375,fd=3))

tcp    LISTEN     0      1      *%red:12345                 *:*
           users:(("vrf-test",pid=1356,fd=3))
tcp    LISTEN     0      1       *:12345                 *:*
        users:(("vrf-test",pid=1352,fd=3))

For packets arriving on an interface enslaved to a VRF the socket lookup
will pick the VRF server over the global one.
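
A minimal userspace sketch of how such a pair of overlapping listeners
can be set up; the VRF name "red" and port 12345 follow the ss output
above, error handling is trimmed, and SO_BINDTODEVICE requires
CAP_NET_RAW:

#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP listener on the given port, optionally scoped to a VRF
 * device. Passing dev = NULL gives the unbound ("global") listener;
 * passing "red" gives the per-VRF listener from the ss output above.
 */
static int make_listener(const char *dev, unsigned short port)
{
        struct sockaddr_in addr = {
                .sin_family = AF_INET,
                .sin_port   = htons(port),
                .sin_addr   = { .s_addr = htonl(INADDR_ANY) },
        };
        int one = 1;
        int fd = socket(AF_INET, SOCK_STREAM, 0);

        if (fd < 0)
                return -1;

        /* let the global and per-VRF servers share the port */
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        /* scope this socket to the VRF (l3mdev) device, if requested */
        if (dev && setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                              dev, strlen(dev) + 1) < 0) {
                perror("SO_BINDTODEVICE");
                close(fd);
                return -1;
        }

        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
            listen(fd, 1) < 0) {
                close(fd);
                return -1;
        }
        return fd;
}

int main(void)
{
        int global_fd = make_listener(NULL, 12345);
        int vrf_fd    = make_listener("red", 12345);

        printf("global fd=%d, vrf fd=%d\n", global_fd, vrf_fd);
        pause();        /* keep the listeners around for inspection with ss */
        return 0;
}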

--

With this patch set I am seeing a number of tests failing -- socket
connections working when they should not or not working when they
should. I only skimmed the results. I am guessing this patch is the
reason, but that is just a guess.

You need to make sure all permutations of:
1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
2. connection in the default VRF and in a VRF,
3. locally originated and remote traffic,
4. ipv4 and ipv6

continue to work as expected, meaning packets flow when they should and
fail with the right error when they should not. I believe the UDP cases
were the main ones failing.

Given the test failures, I did not look at the code changes in the patch.
Mike Manning Sept. 25, 2018, 3:26 p.m. UTC | #2
On 24/09/2018 23:44, David Ahern wrote:
> On 9/24/18 10:13 AM, Mike Manning wrote:
>> From: Robert Shearman <rshearma@vyatta.att-mail.com>
>>
>> There is no easy way currently for applications that want to receive
>> packets in the default VRF to be isolated from packets arriving in
>> VRFs, which makes using VRF-unaware applications in a VRF-aware system
>> a potential security risk.
> 
> That comment is not correct.
> 
> The point of the l3mdev sysctl's is to prohibit this case. Setting
> net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
> interface enslaved to a VRF can not be received by a global socket.
Hi David, thanks for reviewing this. The converse does not hold though, 
i.e. there is no guarantee that the unbound socket will be selected for 
packets when not in a VRF, if there is an unbound socket and a socket 
bound to a VRF. Also, such packets should not be handled by the socket 
in the VRF if there is no unbound socket. We also had an issue with raw 
socket lookup device bind matching. I can break this particular patch 
into smaller patches and provide more detail; would this help? I will 
also update/break up the other patches according to your comments.

> 
> Setting the l3mdev to 1 allows the default socket to work across VRFs.
> If that is not what you want for a given app or a given VRF, then one
> option is to add netfilter rules on the VRF device to prohibit it. I
> just verified this works for both tcp and udp.

Netfilter is per application and so does not scale. I have not checked 
if it is suitable for packet handling on raw sockets.

> 
> Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
> have a server running in the default vrf bound to a port AND a server
> running bound to a specific vrf and the same port:
> 
> udp    UNCONN     0      0      *%red:12345                 *:*
>             users:(("vrf-test",pid=1376,fd=3))
> udp    UNCONN     0      0       *:12345                 *:*
>          users:(("vrf-test",pid=1375,fd=3))
> 
> tcp    LISTEN     0      1      *%red:12345                 *:*
>             users:(("vrf-test",pid=1356,fd=3))
> tcp    LISTEN     0      1       *:12345                 *:*
>          users:(("vrf-test",pid=1352,fd=3))
> 
> For packets arriving on an interface enslaved to a VRF the socket lookup
> will pick the VRF server over the global one.

Agreed, but the converse is not guaranteed to hold i.e. packets that are 
not in a VRF may be handled by a socket bound to a VRF.

We do use SO_REUSEPORT for our own applications so as to run instances 
in the default and other VRFs, but still require these patches (or 
similar) due to how packets are handled when there is an unbound socket 
and sockets bound to different VRFs.

> 
> --
> 
> With this patch set I am seeing a number of tests failing -- socket
> connections working when they should not or not working when they
> should. I only skimmed the results. I am guessing this patch is the
> reason, but that is just a guess.
> 
> You need to make sure all permutations of:
> 1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
> 2. connection in the default VRF and in a VRF,
> 3. locally originated and remote traffic,
> 4. ipv4 and ipv6
>

We are using raw, datagram and stream sockets for ipv4 & ipv6, require 
connectivity for local and remote addresses where appropriate and need 
route leaking between VRFs when configured; we are unaware of any 
outstanding bugs. Is there some way that I can run/analyze the tests 
that are failing for you?

Also, cf. patch 2/5, note that ping to link-local addresses is handled 
consistently with that to global addresses in a VRF, so this now 
succeeds if ping is done in the VRF, i.e. 'sudo ip vrf exec <vrf> ping 
<ll> -I <intf>'

> continue to work as expected meaning packets flow when they should and
> fail with the right error when they should not. I believe the UDP cases
> were the main ones failing.
> 
> Given the test failures, I did not look at the code changes in the patch.
>
David Ahern Sept. 25, 2018, 5:16 p.m. UTC | #3
On 9/25/18 9:26 AM, Mike Manning wrote:
> On 24/09/2018 23:44, David Ahern wrote:
>> On 9/24/18 10:13 AM, Mike Manning wrote:
>>> From: Robert Shearman <rshearma@vyatta.att-mail.com>
>>>
>>> There is no easy way currently for applications that want to receive
>>> packets in the default VRF to be isolated from packets arriving in
>>> VRFs, which makes using VRF-unaware applications in a VRF-aware system
>>> a potential security risk.
>>
>> That comment is not correct.
>>
>> The point of the l3mdev sysctl's is to prohibit this case. Setting
>> net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
>> interface enslaved to a VRF can not be received by a global socket.
> Hi David, thanks for reviewing this. The converse does not hold though,
> i.e. there is no guarantee that the unbound socket will be selected for
> packets when not in a VRF, if there is an unbound socket and a socket
> bound to a VRF. Also, such packets should not be handled by the socket

I need an explicit example here. You are saying a packet arriving on an
interface not enslaved to a VRF might match a socket bound to a VRF?


> in the VRF if there is no unbound socket. We also had an issue with raw
> socket lookup device bind matching. I can break this particular patch
> into smaller patches and provide more detail, would this help? I will
> also update/break up the other patches according to your comments.

Why not add an l3mdev sysctl for raw sockets then?

Yes, please send smaller patches. A diff stat of:
    15 files changed, 109 insertions(+), 62 deletions(-)
is a bit harsh.

> 
>>
>> Setting the l3mdev to 1 allows the default socket to work across VRFs.
>> If that is not what you want for a given app or a given VRF, then one
>> option is to add netfilter rules on the VRF device to prohibit it. I
>> just verified this works for both tcp and udp.
> 
> Netfilter is per application and so does not scale. I have not checked
> if it is suitable for packet handling on raw sockets.
> 
>>
>> Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
>> have a server running in the default vrf bound to a port AND a server
>> running bound to a specific vrf and the same port:
>>
>> udp    UNCONN     0      0      *%red:12345                 *:*
>>             users:(("vrf-test",pid=1376,fd=3))
>> udp    UNCONN     0      0       *:12345                 *:*
>>          users:(("vrf-test",pid=1375,fd=3))
>>
>> tcp    LISTEN     0      1      *%red:12345                 *:*
>>             users:(("vrf-test",pid=1356,fd=3))
>> tcp    LISTEN     0      1       *:12345                 *:*
>>          users:(("vrf-test",pid=1352,fd=3))
>>
>> For packets arriving on an interface enslaved to a VRF the socket lookup
>> will pick the VRF server over the global one.
> 
> Agreed, but the converse is not guaranteed to hold i.e. packets that are
> not in a VRF may be handled by a socket bound to a VRF.
> 
> We do use SO_REUSEPORT for our own applications so as to run instances
> in the default and other VRFs, but still require these patches (or
> similar) due to how packets are handled when there is an unbound socket
> and sockets bound to different VRFs.

Why can't compute_score be adjusted to account for that case?

> 
>>
>> -- 
>>
>> With this patch set I am seeing a number of tests failing -- socket
>> connections working when they should not or not working when they
>> should. I only skimmed the results. I am guessing this patch is the
>> reason, but that is just a guess.
>>
>> You need to make sure all permutations of:
>> 1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
>> 2. connection in the default VRF and in a VRF,
>> 3. locally originated and remote traffic,
>> 4. ipv4 and ipv6
>>
> 
> We are using raw, datagram and stream sockets for ipv4 & ipv6, require
> connectivity for local and remote addresses where appropriate and need
> route leaking between VRFs when configured, we are unaware of any
> outstanding bugs. Is there some way that I can run/analyze the tests
> that are failing for you?

I am not distributing my vrf tests right now. Before sending the
response I quickly verified one case that is easy for you to see: set
the udp sysctl to 0, start a global server, send a packet to it via an
interface enslaved to a VRF. It should fail with ECONNREFUSED (no socket
match) but instead the packet reaches the server.
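
A rough connected-UDP probe for that check, which surfaces the ICMP
port unreachable as ECONNREFUSED instead of failing silently; the
destination address, port and VRF device name are placeholders, error
handling is trimmed, and the SO_BINDTODEVICE call applies only when the
probe is sent from within the VRF on the same box:

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        /* placeholders: server address/port reached via VRF "red" */
        struct sockaddr_in dst = { .sin_family = AF_INET,
                                   .sin_port = htons(12345) };
        struct timeval tv = { .tv_sec = 2 };
        char buf[64];
        ssize_t n;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        inet_pton(AF_INET, "10.0.0.1", &dst.sin_addr);

        /* scope the probe to the VRF so the packet ingresses on an
         * enslaved interface (only needed when sending from the DUT)
         */
        if (setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE,
                       "red", sizeof("red")) < 0)
                perror("SO_BINDTODEVICE");
        setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));

        /* connect() so an ICMP port unreachable is reported as
         * ECONNREFUSED on this socket
         */
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));
        send(fd, "ping", 4, 0);

        n = recv(fd, buf, sizeof(buf), 0);
        if (n < 0 && errno == ECONNREFUSED)
                printf("got ECONNREFUSED: no socket matched, as expected\n");
        else if (n < 0)
                printf("no error reported (timeout): packet was probably accepted\n");
        else
                printf("got a reply: packet reached a server\n");

        close(fd);
        return 0;
}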

> 
> Also cf patch 2/5 note that ping to link-local addresses is handled
> consistently with that to global addresses in a VRF, so this now
> succeeds if ping is done in the VRF, i.e. 'sudo ip vrf exec <vrf> ping
> <ll> -I <intf>

Shifting packets destined to a LLA from the real device to the vrf
device is a change in behavior. It is not clear to me at the moment that
it will not cause a problem.

> 
>> continue to work as expected meaning packets flow when they should and
>> fail with the right error when they should not. I believe the UDP cases
>> were the main ones failing.
>>
>> Given the test failures, I did not look at the code changes in the patch.
>>
> 

A couple of the patches are fine as is - or need a small change. It
might be easier for you to send those outside of the socket lookup set.
Mike Manning Oct. 1, 2018, 8:48 a.m. UTC | #4
On 25/09/2018 18:16, David Ahern wrote:
> On 9/25/18 9:26 AM, Mike Manning wrote:
>> On 24/09/2018 23:44, David Ahern wrote:
>>> On 9/24/18 10:13 AM, Mike Manning wrote:
>>>> From: Robert Shearman <rshearma@vyatta.att-mail.com>
>>>>
>>>> There is no easy way currently for applications that want to receive
>>>> packets in the default VRF to be isolated from packets arriving in
>>>> VRFs, which makes using VRF-unaware applications in a VRF-aware system
>>>> a potential security risk.
>>> That comment is not correct.
>>>
>>> The point of the l3mdev sysctl's is to prohibit this case. Setting
>>> net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
>>> interface enslaved to a VRF can not be received by a global socket.
>> Hi David, thanks for reviewing this. The converse does not hold though,
>> i.e. there is no guarantee that the unbound socket will be selected for
>> packets when not in a VRF, if there is an unbound socket and a socket
>> bound to a VRF. Also, such packets should not be handled by the socket
> I need an explicit example here. You are saying a packet arriving on an
> interface not enslaved to a VRF might match a socket bound to a VRF?
This problem occurs when different service instances are listening on an
unbound socket and sockets bound to VRFs respectively. Received packets
that are not in a VRF are not guaranteed to be handled by the unbound
socket.
>> in the VRF if there is no unbound socket. We also had an issue with raw
>> socket lookup device bind matching. I can break this particular patch
>> into smaller patches and provide more detail, would this help? I will
>> also update/break up the other patches according to your comments.
> Why not add an l3mdev sysctl for raw sockets then?
I have now added this, see patch 4/10.
> Yes, please send smaller patches. A diff stat of:
>     15 files changed, 109 insertions(+), 62 deletions(-)
> is a bit harsh.
I have removed the 2 patches you are ok with and have submitted them
separately, and have split the remaining 3 into 10 smaller patches.
>>> Setting the l3mdev to 1 allows the default socket to work across VRFs.
>>> If that is not what you want for a given app or a given VRF, then one
>>> option is to add netfilter rules on the VRF device to prohibit it. I
>>> just verified this works for both tcp and udp.
>> Netfilter is per application and so does not scale. I have not checked
>> if it is suitable for packet handling on raw sockets.
>>
>>> Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
>>> have a server running in the default vrf bound to a port AND a server
>>> running bound to a specific vrf and the same port:
>>>
>>> udp    UNCONN     0      0      *%red:12345                 *:*
>>>             users:(("vrf-test",pid=1376,fd=3))
>>> udp    UNCONN     0      0       *:12345                 *:*
>>>          users:(("vrf-test",pid=1375,fd=3))
>>>
>>> tcp    LISTEN     0      1      *%red:12345                 *:*
>>>             users:(("vrf-test",pid=1356,fd=3))
>>> tcp    LISTEN     0      1       *:12345                 *:*
>>>          users:(("vrf-test",pid=1352,fd=3))
>>>
>>> For packets arriving on an interface enslaved to a VRF the socket lookup
>>> will pick the VRF server over the global one.
>> Agreed, but the converse is not guaranteed to hold i.e. packets that are
>> not in a VRF may be handled by a socket bound to a VRF.
>>
>> We do use SO_REUSEPORT for our own applications so as to run instances
>> in the default and other VRFs, but still require these patches (or
>> similar) due to how packets are handled when there is an unbound socket
>> and sockets bound to different VRFs.
> Why can't compute_score be adjusted to account for that case?
Yes, this is what we are doing. This is now in patch 2/10 for stream and
3/10 for datagram sockets, and see 5/10 for raw sockets.
>>> -- 
>>>
>>> With this patch set I am seeing a number of tests failing -- socket
>>> connections working when they should not or not working when they
>>> should. I only skimmed the results. I am guessing this patch is the
>>> reason, but that is just a guess.
>>>
>>> You need to make sure all permutations of:
>>> 1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
>>> 2. connection in the default VRF and in a VRF,
>>> 3. locally originated and remote traffic,
>>> 4. ipv4 and ipv6
>>>
>> We are using raw, datagram and stream sockets for ipv4 & ipv6, require
>> connectivity for local and remote addresses where appropriate and need
>> route leaking between VRFs when configured, we are unaware of any
>> outstanding bugs. Is there some way that I can run/analyze the tests
>> that are failing for you?
> I am not distributing my vrf tests right now. Before sending the
> response I quickly verified one case is easy for you to see: set the udp
> sysctl to 0, start a global server, send a packet to it via an interface
> enslaved to a VRF. It should fail ECONNREFUSED (no socket match) but
> instead packet reaches the server.
I have tried with netcat & socat to check this, and while the
connectivity handling is correct, this does not show the failure error
(it doesn't without the changes either). Before I look further into
this, can I check whether you had only the udp sysctl set to 0 i.e. the
tcp sysctl to 1? The reason I ask is that in the original series, we
were using a common function inet_sk_bound_dev_eq() for
raw/stream/datagram sockets, and this fn was checking only the tcp
sysctl (as we always have the udp and tcp sysctl set to 0). In series
v2, I have provided separate fns which check the relevant raw, tcp or
udp sysctl.
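
A hedged sketch of what that per-protocol split might look like,
building on the common inet_sk_bound_dev_eq() helper in the patch
below; the udp/raw sysctl fields and wrapper names here are assumptions
about the v2 series, not code from this patch:

#include <assert.h>
#include <stdbool.h>

/* stand-in for the per-netns sysctls (assumption: udp and raw knobs
 * exist alongside the tcp one in v2)
 */
struct l3mdev_sysctls {
        bool tcp_l3mdev_accept;
        bool udp_l3mdev_accept;
        bool raw_l3mdev_accept;
};

static bool bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
                         int dif, int sdif)
{
        if (!bound_dev_if)
                return !sdif || l3mdev_accept;
        return bound_dev_if == dif || bound_dev_if == sdif;
}

static bool tcp_sk_bound_dev_eq(const struct l3mdev_sysctls *s,
                                int bound_dev_if, int dif, int sdif)
{
        return bound_dev_eq(s->tcp_l3mdev_accept, bound_dev_if, dif, sdif);
}

static bool udp_sk_bound_dev_eq(const struct l3mdev_sysctls *s,
                                int bound_dev_if, int dif, int sdif)
{
        return bound_dev_eq(s->udp_l3mdev_accept, bound_dev_if, dif, sdif);
}

static bool raw_sk_bound_dev_eq(const struct l3mdev_sysctls *s,
                                int bound_dev_if, int dif, int sdif)
{
        return bound_dev_eq(s->raw_l3mdev_accept, bound_dev_if, dif, sdif);
}

int main(void)
{
        struct l3mdev_sysctls s = { .tcp_l3mdev_accept = true };

        /* with only the tcp knob set, an unbound UDP or raw socket
         * still does not match traffic arriving via an l3mdev slave
         */
        assert(tcp_sk_bound_dev_eq(&s, 0, 10, 3));
        assert(!udp_sk_bound_dev_eq(&s, 0, 10, 3));
        assert(!raw_sk_bound_dev_eq(&s, 0, 10, 3));
        return 0;
}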
>> Also cf patch 2/5 note that ping to link-local addresses is handled
>> consistently with that to global addresses in a VRF, so this now
>> succeeds if ping is done in the VRF, i.e. 'sudo ip vrf exec <vrf> ping
>> <ll> -I <intf>
> Shifting packets destined to a LLA from the real device to the vrf
> device is a change in behavior. It is not clear to me at the moment that
> it will not cause a problem.
I have provided a clearer description of why we have had to do this in
patch 7/10.
>>> continue to work as expected meaning packets flow when they should and
>>> fail with the right error when they should not. I believe the UDP cases
>>> were the main ones failing.
>>>
>>> Given the test failures, I did not look at the code changes in the patch.
>>>
> A couple of the patches are fine as is - or need a small change. It
> might be easier for you to send those outside of the socket lookup set.
Thank you for reviewing these.

Patch

diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt
index 8ff7b4c8f91b..d4b129402d57 100644
--- a/Documentation/networking/vrf.txt
+++ b/Documentation/networking/vrf.txt
@@ -103,6 +103,11 @@  VRF device:
 
 or to specify the output device using cmsg and IP_PKTINFO.
 
+By default the scope of the port bindings for unbound sockets is
+limited to the default VRF. That is, it will not be matched by packets
+arriving on interfaces enslaved to an l3mdev and processes may bind to
+the same port if they bind to an l3mdev.
+
 TCP & UDP services running in the default VRF context (ie., not bound
 to any VRF device) can work across all VRF domains by enabling the
 tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
@@ -112,10 +117,6 @@  tcp_l3mdev_accept and udp_l3mdev_accept sysctl options:
 netfilter rules on the VRF device can be used to limit access to services
 running in the default VRF context as well.
 
-The default VRF does not have limited scope with respect to port bindings.
-That is, if a process does a wildcard bind to a port in the default VRF it
-owns the port across all VRF domains within the network namespace.
-
 ################################################################################
 
 Using iproute2 for VRFs
diff --git a/include/net/inet6_hashtables.h b/include/net/inet6_hashtables.h
index 6e91e38a31da..9db98af46985 100644
--- a/include/net/inet6_hashtables.h
+++ b/include/net/inet6_hashtables.h
@@ -115,9 +115,8 @@  int inet6_hash(struct sock *sk);
 	 ((__sk)->sk_family == AF_INET6)			&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_daddr, (__saddr))		&&	\
 	 ipv6_addr_equal(&(__sk)->sk_v6_rcv_saddr, (__daddr))	&&	\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))	||			\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (((__sk)->sk_bound_dev_if == (__dif))	||			\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 
 #endif /* _INET6_HASHTABLES_H */
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..866efd35ded4 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -79,6 +79,7 @@  struct inet_ehash_bucket {
 
 struct inet_bind_bucket {
 	possible_net_t		ib_net;
+	int			l3mdev;
 	unsigned short		port;
 	signed char		fastreuse;
 	signed char		fastreuseport;
@@ -188,10 +189,28 @@  static inline void inet_ehash_locks_free(struct inet_hashinfo *hashinfo)
 	hashinfo->ehash_locks = NULL;
 }
 
+#ifdef CONFIG_NET_L3_MASTER_DEV
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+					int dif, int sdif)
+{
+	if (!bound_dev_if)
+		return !sdif || net->ipv4.sysctl_tcp_l3mdev_accept;
+	return bound_dev_if == dif || bound_dev_if == sdif;
+}
+#else
+static inline bool inet_sk_bound_dev_eq(struct net *net, int bound_dev_if,
+					int dif, int sdif)
+{
+	if (!bound_dev_if)
+		return true;
+	return bound_dev_if == dif;
+}
+#endif
+
 struct inet_bind_bucket *
 inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
-			const unsigned short snum);
+			const unsigned short snum, int l3mdev);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
 			      struct inet_bind_bucket *tb);
 
@@ -282,9 +301,8 @@  static inline struct sock *inet_lookup_listener(struct net *net,
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
 	(((__sk)->sk_portpair == (__ports))			&&	\
 	 ((__sk)->sk_addrpair == (__cookie))			&&	\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))			||	\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
+	 (((__sk)->sk_bound_dev_if == (__dif))			||	\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))		&&	\
 	 net_eq(sock_net(__sk), (__net)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
@@ -294,9 +312,8 @@  static inline struct sock *inet_lookup_listener(struct net *net,
 	(((__sk)->sk_portpair == (__ports))		&&		\
 	 ((__sk)->sk_daddr	== (__saddr))		&&		\
 	 ((__sk)->sk_rcv_saddr	== (__daddr))		&&		\
-	 (!(__sk)->sk_bound_dev_if	||				\
-	   ((__sk)->sk_bound_dev_if == (__dif))		||		\
-	   ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
+	 (((__sk)->sk_bound_dev_if == (__dif))		||		\
+	  ((__sk)->sk_bound_dev_if == (__sdif)))	&&		\
 	 net_eq(sock_net(__sk), (__net)))
 #endif /* 64-bit arch */
 
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index e03b93360f33..92e0aa3958f6 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -130,6 +130,19 @@  static inline int inet_request_bound_dev_if(const struct sock *sk,
 	return sk->sk_bound_dev_if;
 }
 
+static inline int inet_sk_bound_l3mdev(const struct sock *sk)
+{
+#ifdef CONFIG_NET_L3_MASTER_DEV
+	struct net *net = sock_net(sk);
+
+	if (!net->ipv4.sysctl_tcp_l3mdev_accept)
+		return l3mdev_master_ifindex_by_index(net,
+						      sk->sk_bound_dev_if);
+#endif
+
+	return 0;
+}
+
 static inline struct ip_options_rcu *ireq_opt_deref(const struct inet_request_sock *ireq)
 {
 	return rcu_dereference_check(ireq->ireq_opt,
diff --git a/net/core/sock.c b/net/core/sock.c
index 3730eb855095..da1cbb88a6bf 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -567,6 +567,8 @@  static int sock_setbindtodevice(struct sock *sk, char __user *optval,
 
 	lock_sock(sk);
 	sk->sk_bound_dev_if = index;
+	if (sk->sk_prot->rehash)
+		sk->sk_prot->rehash(sk);
 	sk_dst_reset(sk);
 	release_sock(sk);
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index dfd5009f96ef..97bba5b3d69f 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -183,7 +183,9 @@  inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 	int i, low, high, attempt_half;
 	struct inet_bind_bucket *tb;
 	u32 remaining, offset;
+	int l3mdev;
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
 	attempt_half = (sk->sk_reuse == SK_CAN_REUSE) ? 1 : 0;
 other_half_scan:
 	inet_get_local_port_range(net, &low, &high);
@@ -219,7 +221,8 @@  inet_csk_find_open_port(struct sock *sk, struct inet_bind_bucket **tb_ret, int *
 						  hinfo->bhash_size)];
 		spin_lock_bh(&head->lock);
 		inet_bind_bucket_for_each(tb, &head->chain)
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (!inet_csk_bind_conflict(sk, tb, false, false))
 					goto success;
 				goto next_port;
@@ -293,6 +296,9 @@  int inet_csk_get_port(struct sock *sk, unsigned short snum)
 	struct net *net = sock_net(sk);
 	struct inet_bind_bucket *tb = NULL;
 	kuid_t uid = sock_i_uid(sk);
+	int l3mdev;
+
+	l3mdev = inet_sk_bound_l3mdev(sk);
 
 	if (!port) {
 		head = inet_csk_find_open_port(sk, &tb, &port);
@@ -306,11 +312,12 @@  int inet_csk_get_port(struct sock *sk, unsigned short snum)
 					  hinfo->bhash_size)];
 	spin_lock_bh(&head->lock);
 	inet_bind_bucket_for_each(tb, &head->chain)
-		if (net_eq(ib_net(tb), net) && tb->port == port)
+		if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+		    tb->port == port)
 			goto tb_found;
 tb_not_found:
 	tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-				     net, head, port);
+				     net, head, port, l3mdev);
 	if (!tb)
 		goto fail_unlock;
 tb_found:
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index f5c9ef2586de..2ec684057ebd 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -65,12 +65,14 @@  static u32 sk_ehashfn(const struct sock *sk)
 struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 						 struct net *net,
 						 struct inet_bind_hashbucket *head,
-						 const unsigned short snum)
+						 const unsigned short snum,
+						 int l3mdev)
 {
 	struct inet_bind_bucket *tb = kmem_cache_alloc(cachep, GFP_ATOMIC);
 
 	if (tb) {
 		write_pnet(&tb->ib_net, net);
+		tb->l3mdev    = l3mdev;
 		tb->port      = snum;
 		tb->fastreuse = 0;
 		tb->fastreuseport = 0;
@@ -135,6 +137,7 @@  int __inet_inherit_port(const struct sock *sk, struct sock *child)
 			table->bhash_size);
 	struct inet_bind_hashbucket *head = &table->bhash[bhash];
 	struct inet_bind_bucket *tb;
+	int l3mdev;
 
 	spin_lock(&head->lock);
 	tb = inet_csk(sk)->icsk_bind_hash;
@@ -143,6 +146,8 @@  int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		return -ENOENT;
 	}
 	if (tb->port != port) {
+		l3mdev = inet_sk_bound_l3mdev(sk);
+
 		/* NOTE: using tproxy and redirecting skbs to a proxy
 		 * on a different listener port breaks the assumption
 		 * that the listener socket's icsk_bind_hash is the same
@@ -150,12 +155,13 @@  int __inet_inherit_port(const struct sock *sk, struct sock *child)
 		 * create a new bind bucket for the child here. */
 		inet_bind_bucket_for_each(tb, &head->chain) {
 			if (net_eq(ib_net(tb), sock_net(sk)) &&
-			    tb->port == port)
+			    tb->l3mdev == l3mdev && tb->port == port)
 				break;
 		}
 		if (!tb) {
 			tb = inet_bind_bucket_create(table->bind_bucket_cachep,
-						     sock_net(sk), head, port);
+						     sock_net(sk), head, port,
+						     l3mdev);
 			if (!tb) {
 				spin_unlock(&head->lock);
 				return -ENOMEM;
@@ -229,6 +235,7 @@  static inline int compute_score(struct sock *sk, struct net *net,
 {
 	int score = -1;
 	struct inet_sock *inet = inet_sk(sk);
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet->inet_num == hnum &&
 			!ipv6_only_sock(sk)) {
@@ -239,15 +246,12 @@  static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score += 4;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+		score += 4;
 
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score += 4;
-		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
@@ -675,6 +679,7 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 	u32 remaining, offset;
 	int ret, i, low, high;
 	static u32 hint;
+	int l3mdev;
 
 	if (port) {
 		head = &hinfo->bhash[inet_bhashfn(net, port,
@@ -693,6 +698,8 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		return ret;
 	}
 
+	l3mdev = inet_sk_bound_l3mdev(sk);
+
 	inet_get_local_port_range(net, &low, &high);
 	high++; /* [32768, 60999] -> [32768, 61000[ */
 	remaining = high - low;
@@ -719,7 +726,8 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		 * the established check is already unique enough.
 		 */
 		inet_bind_bucket_for_each(tb, &head->chain) {
-			if (net_eq(ib_net(tb), net) && tb->port == port) {
+			if (net_eq(ib_net(tb), net) && tb->l3mdev == l3mdev &&
+			    tb->port == port) {
 				if (tb->fastreuse >= 0 ||
 				    tb->fastreuseport >= 0)
 					goto next_port;
@@ -732,7 +740,7 @@  int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		}
 
 		tb = inet_bind_bucket_create(hinfo->bind_bucket_cachep,
-					     net, head, port);
+					     net, head, port, l3mdev);
 		if (!tb) {
 			spin_unlock_bh(&head->lock);
 			return -ENOMEM;
diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index c0fe5ad996f2..026971314c43 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -892,6 +892,9 @@  static int do_ip_setsockopt(struct sock *sk, int level,
 		dev_put(dev);
 
 		err = -EINVAL;
+		if (!sk->sk_bound_dev_if && midx)
+			break;
+
 		if (sk->sk_bound_dev_if &&
 		    mreq.imr_ifindex != sk->sk_bound_dev_if &&
 		    (!midx || midx != sk->sk_bound_dev_if))
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 33df4d76db2d..8a0d568d7aec 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -70,6 +70,7 @@ 
 #include <net/snmp.h>
 #include <net/tcp_states.h>
 #include <net/inet_common.h>
+#include <net/inet_hashtables.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
 #include <linux/rtnetlink.h>
@@ -131,8 +132,7 @@  struct sock *__raw_v4_lookup(struct net *net, struct sock *sk,
 		if (net_eq(sock_net(sk), net) && inet->inet_num == num	&&
 		    !(inet->inet_daddr && inet->inet_daddr != raddr) 	&&
 		    !(inet->inet_rcv_saddr && inet->inet_rcv_saddr != laddr) &&
-		    !(sk->sk_bound_dev_if && sk->sk_bound_dev_if != dif &&
-		      sk->sk_bound_dev_if != sdif))
+		    inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif))
 			goto found; /* gotcha */
 	}
 	sk = NULL;
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f4e35b2ff8b8..3d59ab47a85d 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -371,6 +371,7 @@  static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -398,15 +399,11 @@  static int compute_score(struct sock *sk, struct net *net,
 		score += 4;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score += 4;
-	}
+	dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+					 dif, sdif);
+	if (!dev_match)
+		return -1;
+	score += 4;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index 1ede7a16a0be..4813293d4fad 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -782,7 +782,10 @@  int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 
 			if (src_info->ipi6_ifindex) {
 				if (fl6->flowi6_oif &&
-				    src_info->ipi6_ifindex != fl6->flowi6_oif)
+				    src_info->ipi6_ifindex != fl6->flowi6_oif &&
+				    (sk->sk_bound_dev_if != fl6->flowi6_oif ||
+				     !sk_dev_equal_l3scope(
+					     sk, src_info->ipi6_ifindex)))
 					return -EINVAL;
 				fl6->flowi6_oif = src_info->ipi6_ifindex;
 			}
diff --git a/net/ipv6/inet6_hashtables.c b/net/ipv6/inet6_hashtables.c
index 3d7c7460a0c5..5eeeba7181a1 100644
--- a/net/ipv6/inet6_hashtables.c
+++ b/net/ipv6/inet6_hashtables.c
@@ -99,6 +99,7 @@  static inline int compute_score(struct sock *sk, struct net *net,
 				const int dif, const int sdif, bool exact_dif)
 {
 	int score = -1;
+	bool dev_match;
 
 	if (net_eq(sock_net(sk), net) && inet_sk(sk)->inet_num == hnum &&
 	    sk->sk_family == PF_INET6) {
@@ -109,15 +110,12 @@  static inline int compute_score(struct sock *sk, struct net *net,
 				return -1;
 			score++;
 		}
-		if (sk->sk_bound_dev_if || exact_dif) {
-			bool dev_match = (sk->sk_bound_dev_if == dif ||
-					  sk->sk_bound_dev_if == sdif);
+		dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						 dif, sdif);
+		if (!dev_match)
+			return -1;
+		score++;
 
-			if (!dev_match)
-				return -1;
-			if (sk->sk_bound_dev_if)
-				score++;
-		}
 		if (sk->sk_incoming_cpu == raw_smp_processor_id())
 			score++;
 	}
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index c0cac9cc3a28..7dfbc797b130 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -626,6 +626,9 @@  static int do_ipv6_setsockopt(struct sock *sk, int level, int optname,
 
 			rcu_read_unlock();
 
+			if (!sk->sk_bound_dev_if && midx)
+				goto e_inval;
+
 			if (sk->sk_bound_dev_if &&
 			    sk->sk_bound_dev_if != val &&
 			    (!midx || midx != sk->sk_bound_dev_if))
diff --git a/net/ipv6/raw.c b/net/ipv6/raw.c
index 413d98bf24f4..2f61a9f1a2b2 100644
--- a/net/ipv6/raw.c
+++ b/net/ipv6/raw.c
@@ -49,6 +49,7 @@ 
 #include <net/transp_v6.h>
 #include <net/udp.h>
 #include <net/inet_common.h>
+#include <net/inet_hashtables.h>
 #include <net/tcp_states.h>
 #if IS_ENABLED(CONFIG_IPV6_MIP6)
 #include <net/mip6.h>
@@ -86,9 +87,8 @@  struct sock *__raw_v6_lookup(struct net *net, struct sock *sk,
 			    !ipv6_addr_equal(&sk->sk_v6_daddr, rmt_addr))
 				continue;
 
-			if (sk->sk_bound_dev_if &&
-			    sk->sk_bound_dev_if != dif &&
-			    sk->sk_bound_dev_if != sdif)
+			if (!inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if,
+						  dif, sdif))
 				continue;
 
 			if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr)) {
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 83f4c77c79d8..e22b7dd78c9b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -117,6 +117,7 @@  static int compute_score(struct sock *sk, struct net *net,
 {
 	int score;
 	struct inet_sock *inet;
+	bool dev_match;
 
 	if (!net_eq(sock_net(sk), net) ||
 	    udp_sk(sk)->udp_port_hash != hnum ||
@@ -144,15 +145,10 @@  static int compute_score(struct sock *sk, struct net *net,
 		score++;
 	}
 
-	if (sk->sk_bound_dev_if || exact_dif) {
-		bool dev_match = (sk->sk_bound_dev_if == dif ||
-				  sk->sk_bound_dev_if == sdif);
-
-		if (!dev_match)
-			return -1;
-		if (sk->sk_bound_dev_if)
-			score++;
-	}
+	dev_match = inet_sk_bound_dev_eq(net, sk->sk_bound_dev_if, dif, sdif);
+	if (!dev_match)
+		return -1;
+	score++;
 
 	if (sk->sk_incoming_cpu == raw_smp_processor_id())
 		score++;