
[net-next] net: preserve sock reference when scrubbing the skb.

Message ID 20180625155610.30802-1-fbl@redhat.com
State Changes Requested, archived
Delegated to: David Miller
Series [net-next] net: preserve sock reference when scrubbing the skb.

Commit Message

Flavio Leitner June 25, 2018, 3:56 p.m. UTC
The sock reference is lost when scrubbing the packet and that breaks
TSQ (TCP Small Queues) and XPS (Transmit Packet Steering), causing
a performance impact of about 50% on a single TCP stream when crossing
network namespaces.

XPS breaks because the queue mapping stored in the socket is not
available, so another random queue might be selected when the stack
needs to transmit something like a TCP ACK, or TCP Retransmissions.
That causes packet re-ordering and/or performance issues.

TSQ breaks because it orphans the packet while it is still in the
host, so packets are queued contributing to the buffer bloat problem.

Preserving the sock reference fixes both issues. The socket is
orphaned anyway in the receiving path before any relevant action,
but the transmit side needs some extra checking, which is included
in this patch.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
---
 include/net/netfilter/nf_log.h         | 3 ++-
 net/core/skbuff.c                      | 1 -
 net/ipv4/netfilter/nf_log_ipv4.c       | 8 ++++----
 net/ipv6/netfilter/nf_log_ipv6.c       | 8 ++++----
 net/netfilter/nf_conntrack_broadcast.c | 2 +-
 net/netfilter/nf_log_common.c          | 5 +++--
 net/netfilter/nf_nat_core.c            | 6 +++++-
 net/netfilter/nft_meta.c               | 9 ++++++---
 net/netfilter/nft_socket.c             | 5 ++++-
 net/netfilter/xt_cgroup.c              | 6 ++++--
 net/netfilter/xt_owner.c               | 2 +-
 net/netfilter/xt_recent.c              | 3 ++-
 net/netfilter/xt_socket.c              | 6 ++++++
 13 files changed, 42 insertions(+), 22 deletions(-)

Comments

Cong Wang June 26, 2018, 4:15 a.m. UTC | #1
On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
>
> The sock reference is lost when scrubbing the packet and that breaks
> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
> performance impacts of about 50% in a single TCP stream when crossing
> network namespaces.
>
> XPS breaks because the queue mapping stored in the socket is not
> available, so another random queue might be selected when the stack
> needs to transmit something like a TCP ACK, or TCP Retransmissions.
> That causes packet re-ordering and/or performance issues.
>
> TSQ breaks because it orphans the packet while it is still in the
> host, so packets are queued contributing to the buffer bloat problem.

Why should TSQ in one stack care about buffer bloat in another stack?

Actually, I think the current behavior is correct: once the packet leaves
its current stack (or netns), it should relieve the back pressure on the TCP
socket in this stack; whether it will be queued in another stack is beyond
its concern. This breaks the isolation between networking stacks.
Eric Dumazet June 26, 2018, 6:41 a.m. UTC | #2
On 06/25/2018 09:15 PM, Cong Wang wrote:
> On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
>>
>> The sock reference is lost when scrubbing the packet and that breaks
>> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
>> performance impacts of about 50% in a single TCP stream when crossing
>> network namespaces.
>>
>> XPS breaks because the queue mapping stored in the socket is not
>> available, so another random queue might be selected when the stack
>> needs to transmit something like a TCP ACK, or TCP Retransmissions.
>> That causes packet re-ordering and/or performance issues.
>>
>> TSQ breaks because it orphans the packet while it is still in the
>> host, so packets are queued contributing to the buffer bloat problem.
> 
> Why should TSQ in one stack care about buffer bloat in another stack?
> 
> Actually, I think the current behavior is correct, once the packet leaves
> its current stack (or netns), it should relief the backpressure on TCP
> socket in this stack, whether it will be queued in another stack is beyond
> its concern. This breaks the isolation between networking stacks.
> 

We discussed this during netconf, Cong; nobody was against this planned removal.

When a packet is attached to a socket, we should keep the association as much as possible.

Only when a new association needs to be done, skb_orphan() needs to be called.

Doing this skb_orphan() too soon breaks back pressure in general, and this is bad, since a socket
can then evade SO_SNDBUF limits.

I am not sure why the patch is so complex, I would have simply removed the skb_orphan().
Flavio Leitner June 26, 2018, 12:38 p.m. UTC | #3
On Mon, Jun 25, 2018 at 11:41:05PM -0700, Eric Dumazet wrote:
> I am not sure why the patch is so complex, I would have simply
> removed the skb_orphan().

All the rest is fixing netfilter to consider the namespaces before
doing any actions.  For instance, it would be possible to dump GID/UID
of the process originating the packet in a foreign net namespace.

Another example is matching if it has a socket or not. I don't think
it's correct to match on a socket that is on a foreign netns.

So, you're right that removing the skb_orphan() is the only thing
required to fix both issues, but then what about the above examples?
Eric Dumazet June 26, 2018, 1:06 p.m. UTC | #4
On 06/26/2018 05:38 AM, Flavio Leitner wrote:
> On Mon, Jun 25, 2018 at 11:41:05PM -0700, Eric Dumazet wrote:
>> I am not sure why the patch is so complex, I would have simply
>> removed the skb_orphan().
> 
> All the rest is fixing netfilter to consider the namespaces before
> doing any actions.  For instance, it would be possible to dump GID/UID
> of the process originating the packet in a foreign net namespace.
> 
> Another example is matching if it has a socket or not. I don't think
> it's correct to match on a socket that is on a foreign netns.
> 
> So, you're right that removing the skb_orphan() is the only thing
> required to fix both issues, but then what about the above examples?
> 

Please split the work in two parts.

1) Preparation, with clear commit log :)

2) Removal of skb_orphan()

Thanks !
Flavio Leitner June 26, 2018, 1:32 p.m. UTC | #5
On Tue, Jun 26, 2018 at 06:06:36AM -0700, Eric Dumazet wrote:
> 
> 
> On 06/26/2018 05:38 AM, Flavio Leitner wrote:
> > On Mon, Jun 25, 2018 at 11:41:05PM -0700, Eric Dumazet wrote:
> >> I am not sure why the patch is so complex, I would have simply
> >> removed the skb_orphan().
> > 
> > All the rest is fixing netfilter to consider the namespaces before
> > doing any actions.  For instance, it would be possible to dump GID/UID
> > of the process originating the packet in a foreign net namespace.
> > 
> > Another example is matching if it has a socket or not. I don't think
> > it's correct to match on a socket that is on a foreign netns.
> > 
> > So, you're right that removing the skb_orphan() is the only thing
> > required to fix both issues, but then what about the above examples?
> > 
> 
> Please split the work in two parts.
> 
> 1) Preparation, with clear commit log :)
> 
> 2) Removal of skb_orphan()

ok, will do.
Thanks,
Cong Wang June 26, 2018, 9:48 p.m. UTC | #6
On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/25/2018 09:15 PM, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
> >>
> >> The sock reference is lost when scrubbing the packet and that breaks
> >> TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
> >> performance impacts of about 50% in a single TCP stream when crossing
> >> network namespaces.
> >>
> >> XPS breaks because the queue mapping stored in the socket is not
> >> available, so another random queue might be selected when the stack
> >> needs to transmit something like a TCP ACK, or TCP Retransmissions.
> >> That causes packet re-ordering and/or performance issues.
> >>
> >> TSQ breaks because it orphans the packet while it is still in the
> >> host, so packets are queued contributing to the buffer bloat problem.
> >
> > Why should TSQ in one stack care about buffer bloat in another stack?
> >
> > Actually, I think the current behavior is correct, once the packet leaves
> > its current stack (or netns), it should relief the backpressure on TCP
> > socket in this stack, whether it will be queued in another stack is beyond
> > its concern. This breaks the isolation between networking stacks.
> >
>
> We discussed about this during netconf Cong, nobody was against this planned removal.

I agreed to keep skb->sk, but didn't realize it actually impacts TSQ too.


>
> When a packet is attached to a socket, we should keep the association as much as possible.

As much as possible within one stack, I agree. I still don't understand
why we should keep it across the stack boundary.

>
> Only when a new association needs to be done, skb_orphan() needs to be called.
>
> Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> can evades SO_SNDBUF limits.

Right before leaving the stack is not too soon; it is actually the latest
possible point, for the veth case.
Flavio Leitner June 26, 2018, 10:03 p.m. UTC | #7
On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > When a packet is attached to a socket, we should keep the association as much as possible.
> 
> As much as possible within one stack, I agree. I still don't understand
> why we should keep it across the stack boundary.
> 
> > Only when a new association needs to be done, skb_orphan() needs to be called.
> >
> > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > can evades SO_SNDBUF limits.
> 
> Right before leaving the stack is not too soon, it is the latest
> actually, for veth case.

Depends on how you view things - it's the same host/stack sharing the
same resources, so why should we not keep it?
Cong Wang June 26, 2018, 10:47 p.m. UTC | #8
On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > When a packet is attached to a socket, we should keep the association as much as possible.
> >
> > As much as possible within one stack, I agree. I still don't understand
> > why we should keep it across the stack boundary.
> >
> > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > >
> > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > can evades SO_SNDBUF limits.
> >
> > Right before leaving the stack is not too soon, it is the latest
> > actually, for veth case.
>
> Depends on how you view things - it's the same host/stack sharing the
> same resources, so why should we not keep it?

Because stacks are supposed to be independent, netdevices are
isolated, iptables and route tables too. This is how netns is designed
from the beginning. The trend today is actually more isolation instead
of more sharing, given the popularity of containers.

You need to justify why you want to break the TSQ's scope here,
which is obviously not compatible with netns design.
Flavio Leitner June 26, 2018, 11:33 p.m. UTC | #9
On Tue, Jun 26, 2018 at 03:47:31PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 3:03 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 02:48:47PM -0700, Cong Wang wrote:
> > > On Mon, Jun 25, 2018 at 11:41 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > > > When a packet is attached to a socket, we should keep the association as much as possible.
> > >
> > > As much as possible within one stack, I agree. I still don't understand
> > > why we should keep it across the stack boundary.
> > >
> > > > Only when a new association needs to be done, skb_orphan() needs to be called.
> > > >
> > > > Doing this skb_orphan() too soon breaks back pressure in general, this is bad, since a socket
> > > > can evades SO_SNDBUF limits.
> > >
> > > Right before leaving the stack is not too soon, it is the latest
> > > actually, for veth case.
> >
> > Depends on how you view things - it's the same host/stack sharing the
> > same resources, so why should we not keep it?
> 
> Because stacks are supposed to be independent, netdevices are
> isolated, iptables and route tables too. This is how netns is designed
> from the beginning. The trend today is actually more isolation instead
> of more sharing, given the popularity of containers.

It is still isolated, the sk carries the netns info and it is
orphaned when it re-enters the stack.
Eric Dumazet June 26, 2018, 11:53 p.m. UTC | #10
On 06/26/2018 03:47 PM, Cong Wang wrote:
> 
> You need to justify why you want to break the TSQ's scope here,
> which is obviously not compatible with netns design.

You have to explain why you do not want us to fix this buggy behavior.

Right now TSQ (and more generally back pressure) is broken by this skb_orphan()

So we want to restore TSQ (and back pressure)

TSQ scope never mentioned netns.
We (TCP stack TSQ handler) want to be notified when this packet leaves the host,
even if it had to traverse multiple netns (for whatever reasons).

_If_ a packet is locally 'consumed' (like on loopback device, or veth pair),
then the skb_orphan() will automatically be done.

If you have a case where this skb_orphan() is needed, please add it at the needed place.
Cong Wang June 27, 2018, 12:29 a.m. UTC | #11
On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> It is still isolated, the sk carries the netns info and it is
> orphaned when it re-enters the stack.

Then what difference does your patch make?

Before your patch:
veth orphans skb in its xmit

After your patch:
RX orphans it when re-entering stack (as you claimed, I don't know)

And for veth pair:
xmit from one side is RX for the other side

So, where is the queueing? Where is the buffer bloat? GRO list??
Flavio Leitner June 27, 2018, 12:39 a.m. UTC | #12
On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > It is still isolated, the sk carries the netns info and it is
> > orphaned when it re-enters the stack.
> 
> Then what difference does your patch make?

Don't forget it is fixing two issues.

> Before your patch:
> veth orphans skb in its xmit
> 
> After your patch:
> RX orphans it when re-entering stack (as you claimed, I don't know)

ip_rcv, and equivalents.

> And for veth pair:
> xmit from one side is RX for the other side
> So, where is the queueing? Where is the buffer bloat? GRO list??

CPU backlog.
Cong Wang June 27, 2018, 12:44 a.m. UTC | #13
On Tue, Jun 26, 2018 at 4:53 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/26/2018 03:47 PM, Cong Wang wrote:
> >
> > You need to justify why you want to break the TSQ's scope here,
> > which is obviously not compatible with netns design.
>
> You have to explain why you do not want us to fix this buggy behavior.
>
> Right now TSQ (and more generally back pressure) is broken by this skb_orphan()
>
> So we want to restore TSQ (and back pressure)
>
> TSQ scope never mentioned netns.

Conceptually, it is limiting the pipeline from L4 to L2 within one stack; now
you are extending it to NS1 (L4...L2)...NS2 (L2...L4), which could cross
multiple stacks and _in theory_ could be infinitely long.

And TSQ setting is per-netns:

static bool tcp_small_queue_check(struct sock *sk, const struct sk_buff *skb,
                                  unsigned int factor)
{
        unsigned int limit;

        limit = max(2 * skb->truesize,
                    sk->sk_pacing_rate >> sk->sk_pacing_shift);
        limit = min_t(u32, limit,
                      sock_net(sk)->ipv4.sysctl_tcp_limit_output_bytes);
        limit <<= factor;
Should I expect to increase sysctl_tcp_limit_output_bytes based on the
number of stacks it crosses?

> We (TCP stack TSQ handler) want to be notified when this packet leaves the host,
> even if it had to traverse multiple netns (for whatever reasons).
>
> _If_ a packet is locally 'consumed' (like on loopback device, or veth pair),
> then the skb_orphan() will automatically be done.
>
> If you have a case where this skb_orphan() is needed, please add it at the needed place.


With this, a netns could totally throttle a TCP socket in a different
netns by holding the packets infinitely (e.g. putting them in a loop).
This is where the isolation breaks.
Cong Wang June 27, 2018, 1:28 a.m. UTC | #14
On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > >
> > > It is still isolated, the sk carries the netns info and it is
> > > orphaned when it re-enters the stack.
> >
> > Then what difference does your patch make?
>
> Don't forget it is fixing two issues.

Sure. I am only talking about TSQ from the very beginning.
Let me rephrase my above question:
What difference does your patch make to TSQ?


>
> > Before your patch:
> > veth orphans skb in its xmit
> >
> > After your patch:
> > RX orphans it when re-entering stack (as you claimed, I don't know)
>
> ip_rcv, and equivalents.


ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)

>
> > And for veth pair:
> > xmit from one side is RX for the other side
> > So, where is the queueing? Where is the buffer bloat? GRO list??
>
> CPU backlog.

Yeah, but this is never targeted by TSQ:

        tcp_limit_output_bytes limits the number of bytes on qdisc
        or device to reduce artificial RTT/cwnd and reduce bufferbloat.

which means you have to update Documentation/networking/ip-sysctl.txt
too.
Eric Dumazet June 27, 2018, 2:32 a.m. UTC | #15
On 06/26/2018 05:29 PM, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
>>
>> It is still isolated, the sk carries the netns info and it is
>> orphaned when it re-enters the stack.
> 
> Then what difference does your patch make?
> 
> Before your patch:
> veth orphans skb in its xmit
> 
> After your patch:
> RX orphans it when re-entering stack (as you claimed, I don't know)
> 
> And for veth pair:
> xmit from one side is RX for the other side
> 
> So, where is the queueing? Where is the buffer bloat? GRO list??
> 

By re-entering the stack, Flavio probably meant storing this skb in
a socket receive queue, or anything that should already modify skb->destructor
(and thus call skb_orphan() before the modification)

If skb sits in some qdisc, like fq on ipvlan master device, we do not want skb->sk to be scrubbed,
just because ipvlan slave and master might be in different netns.
Eric Dumazet June 27, 2018, 2:35 a.m. UTC | #16
On 06/26/2018 05:44 PM, Cong Wang wrote:
 
> With this, a netns could totally throttle a TCP socket in a different
> netns by holding the packets infinitely (e.g. putting them in a loop).
> This is where the isolation breaks.
> 

That is fine, really. Admin error -> Working as intended.

The current scrubbing is simply wrong, not documented, and was added by someone
who never intended all of these side effects.
Flavio Leitner June 27, 2018, 12:31 p.m. UTC | #17
On Tue, Jun 26, 2018 at 06:28:27PM -0700, Cong Wang wrote:
> On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > >
> > > > It is still isolated, the sk carries the netns info and it is
> > > > orphaned when it re-enters the stack.
> > >
> > > Then what difference does your patch make?
> >
> > Don't forget it is fixing two issues.
> 
> Sure. I am only talking about TSQ from the very beginning.
> Let me rephrase my above question:
> What difference does your patch make to TSQ?

It avoids burstiness.

> > > Before your patch:
> > > veth orphans skb in its xmit
> > >
> > > After your patch:
> > > RX orphans it when re-entering stack (as you claimed, I don't know)
> >
> > ip_rcv, and equivalents.
> 
> ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)

Maybe you found a problem; could you please point me to where, between
L1 and L3, the socket is relevant?


> > > And for veth pair:
> > > xmit from one side is RX for the other side
> > > So, where is the queueing? Where is the buffer bloat? GRO list??
> >
> > CPU backlog.
> 
> Yeah, but this is never targeted by TSQ:
> 
>         tcp_limit_output_bytes limits the number of bytes on qdisc
>         or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> 
> which means you have to update Documentation/networking/ip-sysctl.txt
> too.

How is it never targeted? The whole point is to avoid queueing traffic.
Would you be okay if I include this chunk?

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index ce8fbf5aa63c..f4c042be0216 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -733,11 +733,11 @@ tcp_limit_output_bytes - INTEGER
 	Controls TCP Small Queue limit per tcp socket.
 	TCP bulk sender tends to increase packets in flight until it
 	gets losses notifications. With SNDBUF autotuning, this can
-	result in a large amount of packets queued in qdisc/device
-	on the local machine, hurting latency of other flows, for
-	typical pfifo_fast qdiscs.
-	tcp_limit_output_bytes limits the number of bytes on qdisc
-	or device to reduce artificial RTT/cwnd and reduce bufferbloat.
+	result in a large amount of packets queued on the local machine
+	(e.g.: qdiscs, CPU backlog, or device) hurting latency of other
+	flows, for typical pfifo_fast qdiscs.  tcp_limit_output_bytes
+	limits the number of bytes on qdisc or device to reduce artificial
+	RTT/cwnd and reduce bufferbloat.
 	Default: 262144
 
 tcp_challenge_ack_limit - INTEGER
Cong Wang June 27, 2018, 6:59 p.m. UTC | #18
On Tue, Jun 26, 2018 at 7:35 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/26/2018 05:44 PM, Cong Wang wrote:
>
> > With this, a netns could totally throttle a TCP socket in a different
> > netns by holding the packets infinitely (e.g. putting them in a loop).
> > This is where the isolation breaks.
> >
>
> That is fine, really. Admin error -> Working as intended.

The point was never whether it is an error or not; the point is that one netns
could influence another one with this change.

>
> The current scrubbing is simply wrong, not documented, and added by someone
> who had absolutely not intended all the side effects.
>

IIRC, this skb_orphan() was introduced much earlier than TSQ, probably
from the beginning of veth.

Leaving the stack should be effectively equivalent to leaving the host,
from the view of network isolation.
Cong Wang June 27, 2018, 7:06 p.m. UTC | #19
On Wed, Jun 27, 2018 at 5:32 AM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Tue, Jun 26, 2018 at 06:28:27PM -0700, Cong Wang wrote:
> > On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
> > >
> > > On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > > > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > > >
> > > > > It is still isolated, the sk carries the netns info and it is
> > > > > orphaned when it re-enters the stack.
> > > >
> > > > Then what difference does your patch make?
> > >
> > > Don't forget it is fixing two issues.
> >
> > Sure. I am only talking about TSQ from the very beginning.
> > Let me rephrase my above question:
> > What difference does your patch make to TSQ?
>
> It avoids burstiness.

Never even mentioned in changelog or in your patch. :-/


>
> > > > Before your patch:
> > > > veth orphans skb in its xmit
> > > >
> > > > After your patch:
> > > > RX orphans it when re-entering stack (as you claimed, I don't know)
> > >
> > > ip_rcv, and equivalents.
> >
> > ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)
>
> Maybe you found a problem, could you please point me to where in
> between L1 to L3 the socket is relevant?
>

Of course, ingress qdisc is in L2. Do I need to say more? This
is where we can re-route the packets, for example, redirecting them to
yet another netns. This is in fact what we use in production, not something
that exists only in my imagination.

You really have to think about why you allow a different netns influence
another netns by holding the skb to throttle the source TCP socket.


>
> > > > And for veth pair:
> > > > xmit from one side is RX for the other side
> > > > So, where is the queueing? Where is the buffer bloat? GRO list??
> > >
> > > CPU backlog.
> >
> > Yeah, but this is never targeted by TSQ:
> >
> >         tcp_limit_output_bytes limits the number of bytes on qdisc
> >         or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> >
> > which means you have to update Documentation/networking/ip-sysctl.txt
> > too.
>
> How it is never targeted? Whole point is to avoid queueing traffic.

What queues? You really need to define it, seriously.


> Would you be okay if I include this chunk?

No, there is still no explanation of why crossing netns boundaries here
is justified.

>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index ce8fbf5aa63c..f4c042be0216 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -733,11 +733,11 @@ tcp_limit_output_bytes - INTEGER
>         Controls TCP Small Queue limit per tcp socket.
>         TCP bulk sender tends to increase packets in flight until it
>         gets losses notifications. With SNDBUF autotuning, this can
> -       result in a large amount of packets queued in qdisc/device
> -       on the local machine, hurting latency of other flows, for
> -       typical pfifo_fast qdiscs.
> -       tcp_limit_output_bytes limits the number of bytes on qdisc
> -       or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> +       result in a large amount of packets queued on the local machine
> +       (e.g.: qdiscs, CPU backlog, or device) hurting latency of other


Apparently CPU backlog never happens when leaving the host.
Eric Dumazet June 27, 2018, 7:33 p.m. UTC | #20
On 06/27/2018 11:59 AM, Cong Wang wrote:

> 
> IIRC, this skb_orphan() was introduced much earlier than TSQ, probably
> from the beginning of veth.

Sigh

SO_SNDBUF was invented years before veth existed.

You focus on TSQ while it is only one of the many things that are broken.

> 
> Leaving the stack should be effectively equivalent to leaving the host,
> from the view of network isolation.
> 


Having a UDP socket that is able to burn a CPU and fill a qdisc is a major bug.

By default (blocking send() syscalls), the following loop should
block the thread once the socket's sk_wmem_alloc hits sk_sndbuf; this is
the beauty of back pressure.

while (1)
    send(fd, ...);  

With skb_orphan(), sk_wmem_alloc will stay around 0, so the loop will burn a cpu
and fill a qdisc, eventually breaking "network isolation", since other sockets
might be unable to send a single packet.

If you have a concrete case where the skb_orphan() is needed, then you will have
to add a parameter to let the admin opt-in for this.
Cong Wang June 27, 2018, 7:55 p.m. UTC | #21
On Wed, Jun 27, 2018 at 12:33 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 06/27/2018 11:59 AM, Cong Wang wrote:
>
> >
> > IIRC, this skb_orphan() was introduced much earlier than TSQ, probably
> > from the beginning of veth.
>
> Sigh
>
> SO_SNDBUF was invented years ago before veth.

Yeah, probably when there was only one stack on one host.
SO_SNDBUF operates on a per-stack basis.

>
> You focus on TSQ while it is only one of the many things that are broken.
>

I think it is the opposite: this patchset _potentially_ breaks things rather
than fixing anything.


> >
> > Leaving the stack should be effectively equivalent to leaving the host,
> > from the view of network isolation.
> >
>
>
> Having a UDP socket being able to burn a cpu and fill a qdisc is a major bug.
>


Very true: network isolation never isolates CPU or memory.
It is cpuset's job to provide physical CPU isolation, not the networking
namespace's. I don't want to defend this, but it is the current design.


> Bu default (blocking send() syscalls) the following loop should
> block the thread if socket sk_wmem_alloc hits sk_sndbuf, this is
> the beauty of backpressure.
>
> while (1)
>     send(fd, ...);
>
> With skb_orphan(), sk_wmem_alloc will stay around 0, so the loop will burn a cpu
> and fill a qdisc, eventually breaking "network isolation", since other sockets
> might be unable to send a single packet.

Won't the same happen when there is congestion on a physical connection
between two hosts? Does 'host isolation' break too?

>
> If you have a concrete case where the skb_orphan() is needed, then you will have
> to add a parameter to let the admin opt-in for this.

Please see the other reply from me, where I list 3 or 4 reasons.
Flavio Leitner June 27, 2018, 8:19 p.m. UTC | #22
On Wed, Jun 27, 2018 at 12:06:16PM -0700, Cong Wang wrote:
> On Wed, Jun 27, 2018 at 5:32 AM Flavio Leitner <fbl@redhat.com> wrote:
> >
> > On Tue, Jun 26, 2018 at 06:28:27PM -0700, Cong Wang wrote:
> > > On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > >
> > > > On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > > > > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > > > >
> > > > > > It is still isolated, the sk carries the netns info and it is
> > > > > > orphaned when it re-enters the stack.
> > > > >
> > > > > Then what difference does your patch make?
> > > >
> > > > Don't forget it is fixing two issues.
> > >
> > > Sure. I am only talking about TSQ from the very beginning.
> > > Let me rephrase my above question:
> > > What difference does your patch make to TSQ?
> >
> > It avoids burstiness.
> 
> Never even mentioned in changelog or in your patch. :-/

It's part of the queueing and bufferbloat discussion in the
commit message.

> > > > > Before your patch:
> > > > > veth orphans skb in its xmit
> > > > >
> > > > > After your patch:
> > > > > RX orphans it when re-entering stack (as you claimed, I don't know)
> > > >
> > > > ip_rcv, and equivalents.
> > >
> > > ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)
> >
> > Maybe you found a problem, could you please point me to where in
> > between L1 to L3 the socket is relevant?
> 
> Of course, ingress qdisc is in L2. Do I need to say more? This
> is where we can re-route the packets, for example, redirecting it to
> yet another netns. This is in fact what we use in production, not anything
> that only in my imagination.
> 
> You really have to think about why you allow a different netns influence
> another netns by holding the skb to throttle the source TCP socket.

Maybe I wasn't clear and you didn't understand the question. Please find
a spot where the preserved socket is used incorrectly.

> > > which means you have to update Documentation/networking/ip-sysctl.txt
> > > too.
> >
> > How it is never targeted? Whole point is to avoid queueing traffic.
> 
> What queues? You really need to define it, seriously.
> 
> 
> > Would you be okay if I include this chunk?
> 
> No, still lack of an explanation why it comes across netns for
> a good reason.

Because it doesn't. Since you talk more about veth, let's pick it
as an example. The TX is nothing more than add to the CPU backlog,
right? That is netns agnostic. The same for processing that queue
which will push the skb anyways and will call skb_orphan().

How can one netns avoid or delay the skb_orphan()? And even if it
does, what do you gain by allowing more and more packets to queue
in the CPU backlog? It is stalled.

> > diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> > index ce8fbf5aa63c..f4c042be0216 100644
> > --- a/Documentation/networking/ip-sysctl.txt
> > +++ b/Documentation/networking/ip-sysctl.txt
> > @@ -733,11 +733,11 @@ tcp_limit_output_bytes - INTEGER
> >         Controls TCP Small Queue limit per tcp socket.
> >         TCP bulk sender tends to increase packets in flight until it
> >         gets losses notifications. With SNDBUF autotuning, this can
> > -       result in a large amount of packets queued in qdisc/device
> > -       on the local machine, hurting latency of other flows, for
> > -       typical pfifo_fast qdiscs.
> > -       tcp_limit_output_bytes limits the number of bytes on qdisc
> > -       or device to reduce artificial RTT/cwnd and reduce bufferbloat.
> > +       result in a large amount of packets queued on the local machine
> > +       (e.g.: qdiscs, CPU backlog, or device) hurting latency of other
> 
> 
> Apparently CPU backlog never happens when leaving the host.
Cong Wang June 28, 2018, 9:51 p.m. UTC | #23
On Wed, Jun 27, 2018 at 1:19 PM Flavio Leitner <fbl@redhat.com> wrote:
>
> On Wed, Jun 27, 2018 at 12:06:16PM -0700, Cong Wang wrote:
> > On Wed, Jun 27, 2018 at 5:32 AM Flavio Leitner <fbl@redhat.com> wrote:
> > >
> > > On Tue, Jun 26, 2018 at 06:28:27PM -0700, Cong Wang wrote:
> > > > On Tue, Jun 26, 2018 at 5:39 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > > >
> > > > > On Tue, Jun 26, 2018 at 05:29:51PM -0700, Cong Wang wrote:
> > > > > > On Tue, Jun 26, 2018 at 4:33 PM Flavio Leitner <fbl@redhat.com> wrote:
> > > > > > >
> > > > > > > It is still isolated, the sk carries the netns info and it is
> > > > > > > orphaned when it re-enters the stack.
> > > > > >
> > > > > > Then what difference does your patch make?
> > > > >
> > > > > Don't forget it is fixing two issues.
> > > >
> > > > Sure. I am only talking about TSQ from the very beginning.
> > > > Let me rephrase my above question:
> > > > What difference does your patch make to TSQ?
> > >
> > > It avoids burstiness.
> >
> > Never even mentioned in changelog or in your patch. :-/
>
> It's covered by the queueing and bufferbloat discussion in the
> commit message.

Please don't pull all queues into this scope. Are you really
going to put every queue in networking under your "bufferbloat" claim?
Seriously? Please get it defined. You really need to
read the other reply from me; neither you nor David even
seriously finished reading it.


>
> > > > > > Before your patch:
> > > > > > veth orphans skb in its xmit
> > > > > >
> > > > > > After your patch:
> > > > > > RX orphans it when re-entering stack (as you claimed, I don't know)
> > > > >
> > > > > ip_rcv, and equivalents.
> > > >
> > > > ip_rcv() is L3, we enter a stack from L1. So your above claim is incorrect. :)
> > >
> > > Maybe you found a problem, could you please point me to where in
> > > between L1 to L3 the socket is relevant?
> >
> > Of course, ingress qdisc is in L2. Do I need to say more? This
> > is where we can re-route the packets, for example, redirecting them to
> > yet another netns. This is in fact what we use in production, not
> > something that exists only in my imagination.
> >
> > You really have to think about why you allow a different netns to
> > influence another netns by holding the skb to throttle the source TCP socket.
>
> Maybe I wasn't clear and you didn't understand the question. Please find
> a spot where the preserved socket is used incorrectly.


It's sad you still don't get what I mean. I never complained that you
leak skb->sk; I complained that you break TSQ. Dragging the discussion
into skb->sk doesn't even help.


>
> > > > which means you have to update Documentation/networking/ip-sysctl.txt
> > > > too.
> > >
> > > How is it never targeted? The whole point is to avoid queueing traffic.
> >
> > What queues? You really need to define it, seriously.
> >
> >
> > > Would you be okay if I include this chunk?
> >
> > No, still lack of an explanation why it comes across netns for
> > a good reason.
>
> Because it doesn't. Since you talk more about veth, let's pick it
> as an example. The TX is nothing more than adding to the CPU backlog,


That's RX; I assume "CPU backlog" here still means softnet_data.


> right? That is netns agnostic. The same goes for processing that queue,
> which will push the skb anyway and will call skb_orphan().


Once it leaves TX, it leaves the stack. skb_orphan() called
in L3 (as you claimed) is already in yet another stack.


>
> How can one netns avoid or delay the skb_orphan()? And even if it
> does, what do you gain by allowing more and more packets to queue
> in the CPU backlog? It is stalled.

Please read the other reply from me; you don't even understand
what the boundary of a stack is.
Cong Wang June 28, 2018, 11:18 p.m. UTC | #24
On Mon, Jun 25, 2018 at 8:59 AM Flavio Leitner <fbl@redhat.com> wrote:
> XPS breaks because the queue mapping stored in the socket is not
> available, so another random queue might be selected when the stack
> needs to transmit something like a TCP ACK, or TCP Retransmissions.
> That causes packet re-ordering and/or performance issues.

Now let me look at the XPS part. A key question first:

By the queue mapping stored in the socket, you mean sk_tx_queue_get(),
which is only called from __netdev_pick_tx(), and of course even before
hitting the qdisc layer.

However, the veth device orphans the skb inside its veth_xmit()
(via dev_forward_skb()), which is after going through the qdisc layer.

So, how could the skb_orphan() called _after_ XPS break XPS?

We are talking about a simple netns-to-netns case, so XPS won't
be hit again once the skb leaves it.

Another _dumb_ question:

veth is a virtual device; it has literally no queues, though I know
technically there is a queue for installing a qdisc.

So why does the queue mapping even matter here???
diff mbox series

Patch

diff --git a/include/net/netfilter/nf_log.h b/include/net/netfilter/nf_log.h
index e811ac07ea94..0d3920896d50 100644
--- a/include/net/netfilter/nf_log.h
+++ b/include/net/netfilter/nf_log.h
@@ -106,7 +106,8 @@  int nf_log_dump_udp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 			   u8 proto, int fragment, unsigned int offset,
 			   unsigned int logflags);
-void nf_log_dump_sk_uid_gid(struct nf_log_buf *m, struct sock *sk);
+void nf_log_dump_sk_uid_gid(struct net *net, struct nf_log_buf *m,
+			    struct sock *sk);
 void nf_log_dump_packet_common(struct nf_log_buf *m, u_int8_t pf,
 			       unsigned int hooknum, const struct sk_buff *skb,
 			       const struct net_device *in,
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c642304f178c..37263095545a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4911,7 +4911,6 @@  void skb_scrub_packet(struct sk_buff *skb, bool xnet)
 		return;
 
 	ipvs_reset(skb);
-	skb_orphan(skb);
 	skb->mark = 0;
 }
 EXPORT_SYMBOL_GPL(skb_scrub_packet);
diff --git a/net/ipv4/netfilter/nf_log_ipv4.c b/net/ipv4/netfilter/nf_log_ipv4.c
index 4388de0e5380..1e6f28c97d3a 100644
--- a/net/ipv4/netfilter/nf_log_ipv4.c
+++ b/net/ipv4/netfilter/nf_log_ipv4.c
@@ -35,7 +35,7 @@  static const struct nf_loginfo default_loginfo = {
 };
 
 /* One level of recursion won't kill us */
-static void dump_ipv4_packet(struct nf_log_buf *m,
+static void dump_ipv4_packet(struct net *net, struct nf_log_buf *m,
 			     const struct nf_loginfo *info,
 			     const struct sk_buff *skb, unsigned int iphoff)
 {
@@ -183,7 +183,7 @@  static void dump_ipv4_packet(struct nf_log_buf *m,
 			/* Max length: 3+maxlen */
 			if (!iphoff) { /* Only recurse once. */
 				nf_log_buf_add(m, "[");
-				dump_ipv4_packet(m, info, skb,
+				dump_ipv4_packet(net, m, info, skb,
 					    iphoff + ih->ihl*4+sizeof(_icmph));
 				nf_log_buf_add(m, "] ");
 			}
@@ -251,7 +251,7 @@  static void dump_ipv4_packet(struct nf_log_buf *m,
 
 	/* Max length: 15 "UID=4294967295 " */
 	if ((logflags & NF_LOG_UID) && !iphoff)
-		nf_log_dump_sk_uid_gid(m, skb->sk);
+		nf_log_dump_sk_uid_gid(net, m, skb->sk);
 
 	/* Max length: 16 "MARK=0xFFFFFFFF " */
 	if (!iphoff && skb->mark)
@@ -333,7 +333,7 @@  static void nf_log_ip_packet(struct net *net, u_int8_t pf,
 	if (in != NULL)
 		dump_ipv4_mac_header(m, loginfo, skb);
 
-	dump_ipv4_packet(m, loginfo, skb, 0);
+	dump_ipv4_packet(net, m, loginfo, skb, 0);
 
 	nf_log_buf_close(m);
 }
diff --git a/net/ipv6/netfilter/nf_log_ipv6.c b/net/ipv6/netfilter/nf_log_ipv6.c
index b397a8fe88b9..c6bf580d0f33 100644
--- a/net/ipv6/netfilter/nf_log_ipv6.c
+++ b/net/ipv6/netfilter/nf_log_ipv6.c
@@ -36,7 +36,7 @@  static const struct nf_loginfo default_loginfo = {
 };
 
 /* One level of recursion won't kill us */
-static void dump_ipv6_packet(struct nf_log_buf *m,
+static void dump_ipv6_packet(struct net *net, struct nf_log_buf *m,
 			     const struct nf_loginfo *info,
 			     const struct sk_buff *skb, unsigned int ip6hoff,
 			     int recurse)
@@ -258,7 +258,7 @@  static void dump_ipv6_packet(struct nf_log_buf *m,
 			/* Max length: 3+maxlen */
 			if (recurse) {
 				nf_log_buf_add(m, "[");
-				dump_ipv6_packet(m, info, skb,
+				dump_ipv6_packet(net, m, info, skb,
 						 ptr + sizeof(_icmp6h), 0);
 				nf_log_buf_add(m, "] ");
 			}
@@ -278,7 +278,7 @@  static void dump_ipv6_packet(struct nf_log_buf *m,
 
 	/* Max length: 15 "UID=4294967295 " */
 	if ((logflags & NF_LOG_UID) && recurse)
-		nf_log_dump_sk_uid_gid(m, skb->sk);
+		nf_log_dump_sk_uid_gid(net, m, skb->sk);
 
 	/* Max length: 16 "MARK=0xFFFFFFFF " */
 	if (recurse && skb->mark)
@@ -365,7 +365,7 @@  static void nf_log_ip6_packet(struct net *net, u_int8_t pf,
 	if (in != NULL)
 		dump_ipv6_mac_header(m, loginfo, skb);
 
-	dump_ipv6_packet(m, loginfo, skb, skb_network_offset(skb), 1);
+	dump_ipv6_packet(net, m, loginfo, skb, skb_network_offset(skb), 1);
 
 	nf_log_buf_close(m);
 }
diff --git a/net/netfilter/nf_conntrack_broadcast.c b/net/netfilter/nf_conntrack_broadcast.c
index a1086bdec242..5423b197d98a 100644
--- a/net/netfilter/nf_conntrack_broadcast.c
+++ b/net/netfilter/nf_conntrack_broadcast.c
@@ -32,7 +32,7 @@  int nf_conntrack_broadcast_help(struct sk_buff *skb,
 	__be32 mask = 0;
 
 	/* we're only interested in locally generated packets */
-	if (skb->sk == NULL)
+	if (skb->sk == NULL || !net_eq(nf_ct_net(ct), sock_net(skb->sk)))
 		goto out;
 	if (rt == NULL || !(rt->rt_flags & RTCF_BROADCAST))
 		goto out;
diff --git a/net/netfilter/nf_log_common.c b/net/netfilter/nf_log_common.c
index dc61399e30be..a8c5c846aec1 100644
--- a/net/netfilter/nf_log_common.c
+++ b/net/netfilter/nf_log_common.c
@@ -132,9 +132,10 @@  int nf_log_dump_tcp_header(struct nf_log_buf *m, const struct sk_buff *skb,
 }
 EXPORT_SYMBOL_GPL(nf_log_dump_tcp_header);
 
-void nf_log_dump_sk_uid_gid(struct nf_log_buf *m, struct sock *sk)
+void nf_log_dump_sk_uid_gid(struct net *net, struct nf_log_buf *m,
+			    struct sock *sk)
 {
-	if (!sk || !sk_fullsock(sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(net, sock_net(sk)))
 		return;
 
 	read_lock_bh(&sk->sk_callback_lock);
diff --git a/net/netfilter/nf_nat_core.c b/net/netfilter/nf_nat_core.c
index 46f9df99d276..86df2a1666fd 100644
--- a/net/netfilter/nf_nat_core.c
+++ b/net/netfilter/nf_nat_core.c
@@ -108,6 +108,7 @@  int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
 	struct flowi fl;
 	unsigned int hh_len;
 	struct dst_entry *dst;
+	struct sock *sk = skb->sk;
 	int err;
 
 	err = xfrm_decode_session(skb, &fl, family);
@@ -119,7 +120,10 @@  int nf_xfrm_me_harder(struct net *net, struct sk_buff *skb, unsigned int family)
 		dst = ((struct xfrm_dst *)dst)->route;
 	dst_hold(dst);
 
-	dst = xfrm_lookup(net, dst, &fl, skb->sk, 0);
+	if (sk && !net_eq(net, sock_net(sk)))
+		sk = NULL;
+
+	dst = xfrm_lookup(net, dst, &fl, sk, 0);
 	if (IS_ERR(dst))
 		return PTR_ERR(dst);
 
diff --git a/net/netfilter/nft_meta.c b/net/netfilter/nft_meta.c
index 1105a23bda5e..2b94dcc43456 100644
--- a/net/netfilter/nft_meta.c
+++ b/net/netfilter/nft_meta.c
@@ -107,7 +107,8 @@  static void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 	case NFT_META_SKUID:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 
 		read_lock_bh(&sk->sk_callback_lock);
@@ -123,7 +124,8 @@  static void nft_meta_get_eval(const struct nft_expr *expr,
 		break;
 	case NFT_META_SKGID:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 
 		read_lock_bh(&sk->sk_callback_lock);
@@ -214,7 +216,8 @@  static void nft_meta_get_eval(const struct nft_expr *expr,
 #ifdef CONFIG_CGROUP_NET_CLASSID
 	case NFT_META_CGROUP:
 		sk = skb_to_full_sk(skb);
-		if (!sk || !sk_fullsock(sk))
+		if (!sk || !sk_fullsock(sk) ||
+		    !net_eq(nft_net(pkt), sock_net(sk)))
 			goto err;
 		*dest = sock_cgroup_classid(&sk->sk_cgrp_data);
 		break;
diff --git a/net/netfilter/nft_socket.c b/net/netfilter/nft_socket.c
index 74e1b3bd6954..998c2b546f6d 100644
--- a/net/netfilter/nft_socket.c
+++ b/net/netfilter/nft_socket.c
@@ -23,6 +23,9 @@  static void nft_socket_eval(const struct nft_expr *expr,
 	struct sock *sk = skb->sk;
 	u32 *dest = &regs->data[priv->dreg];
 
+	if (sk && !net_eq(nft_net(pkt), sock_net(sk)))
+		sk = NULL;
+
 	if (!sk)
 		switch(nft_pf(pkt)) {
 		case NFPROTO_IPV4:
@@ -39,7 +42,7 @@  static void nft_socket_eval(const struct nft_expr *expr,
 			return;
 		}
 
-	if(!sk) {
+	if (!sk) {
 		nft_reg_store8(dest, 0);
 		return;
 	}
diff --git a/net/netfilter/xt_cgroup.c b/net/netfilter/xt_cgroup.c
index 7df2dece57d3..5d92e1781980 100644
--- a/net/netfilter/xt_cgroup.c
+++ b/net/netfilter/xt_cgroup.c
@@ -72,8 +72,9 @@  static bool
 cgroup_mt_v0(const struct sk_buff *skb, struct xt_action_param *par)
 {
 	const struct xt_cgroup_info_v0 *info = par->matchinfo;
+	struct sock *sk = skb->sk;
 
-	if (skb->sk == NULL || !sk_fullsock(skb->sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(xt_net(par), sock_net(sk)))
 		return false;
 
 	return (info->id == sock_cgroup_classid(&skb->sk->sk_cgrp_data)) ^
@@ -85,8 +86,9 @@  static bool cgroup_mt_v1(const struct sk_buff *skb, struct xt_action_param *par)
 	const struct xt_cgroup_info_v1 *info = par->matchinfo;
 	struct sock_cgroup_data *skcd = &skb->sk->sk_cgrp_data;
 	struct cgroup *ancestor = info->priv;
+	struct sock *sk = skb->sk;
 
-	if (!skb->sk || !sk_fullsock(skb->sk))
+	if (!sk || !sk_fullsock(sk) || !net_eq(xt_net(par), sock_net(sk)))
 		return false;
 
 	if (ancestor)
diff --git a/net/netfilter/xt_owner.c b/net/netfilter/xt_owner.c
index 3d705c688a27..46686fb73784 100644
--- a/net/netfilter/xt_owner.c
+++ b/net/netfilter/xt_owner.c
@@ -67,7 +67,7 @@  owner_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	struct sock *sk = skb_to_full_sk(skb);
 	struct net *net = xt_net(par);
 
-	if (sk == NULL || sk->sk_socket == NULL)
+	if (!sk || !sk->sk_socket || !net_eq(net, sock_net(sk)))
 		return (info->match ^ info->invert) == 0;
 	else if (info->match & info->invert & XT_OWNER_SOCKET)
 		/*
diff --git a/net/netfilter/xt_recent.c b/net/netfilter/xt_recent.c
index 07085c22b19c..f44de4bc2100 100644
--- a/net/netfilter/xt_recent.c
+++ b/net/netfilter/xt_recent.c
@@ -265,7 +265,8 @@  recent_mt(const struct sk_buff *skb, struct xt_action_param *par)
 	}
 
 	/* use TTL as seen before forwarding */
-	if (xt_out(par) != NULL && skb->sk == NULL)
+	if (xt_out(par) != NULL &&
+	    (!skb->sk || !net_eq(net, sock_net(skb->sk))))
 		ttl++;
 
 	spin_lock_bh(&recent_lock);
diff --git a/net/netfilter/xt_socket.c b/net/netfilter/xt_socket.c
index 5c0779c4fa3c..e2795f5f7000 100644
--- a/net/netfilter/xt_socket.c
+++ b/net/netfilter/xt_socket.c
@@ -58,6 +58,9 @@  socket_match(const struct sk_buff *skb, struct xt_action_param *par,
 
 	if (!sk)
 		sk = nf_sk_lookup_slow_v4(xt_net(par), skb, xt_in(par));
+	else if (!net_eq(xt_net(par), sock_net(sk)))
+		sk = NULL;
+
 	if (sk) {
 		bool wildcard;
 		bool transparent = true;
@@ -115,6 +118,9 @@  socket_mt6_v1_v2_v3(const struct sk_buff *skb, struct xt_action_param *par)
 
 	if (!sk)
 		sk = nf_sk_lookup_slow_v6(xt_net(par), skb, xt_in(par));
+	else if (!net_eq(xt_net(par), sock_net(sk)))
+		sk = NULL;
+
 	if (sk) {
 		bool wildcard;
 		bool transparent = true;