net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns

Message ID 1520953642-8145-1-git-send-email-liran.alon@oracle.com
State Deferred
Delegated to: David Miller
Headers show
Series
  • net: dev_forward_skb(): Scrub packet's per-netns info only when crossing netns
Related show

Commit Message

Liran Alon March 13, 2018, 3:07 p.m.
Before this commit, dev_forward_skb() always cleared packet's
per-network-namespace info. Even if the packet doesn't cross
network namespaces.

The comment above dev_forward_skb() describes that this is done
because the receiving device may be in another network namespace.
However, this case can easily be tested for and therefore we can
scrub packet's per-network-namespace info only when receiving device
is indeed in another network namespace.

Therefore, this commit changes ____dev_forward_skb() to tell
skb_scrub_packet() that skb has crossed network-namespace only in case
transmitting device (skb->dev) network namespace is different then
receiving device (dev) network namespace.

An example of a netdev that use skb_forward_skb() is veth.
Thus, before this commit a packet transmitted from one veth peer to
another when both veth peers are on same network namespace will lose
it's skb->mark. The bug could easily be demonstrated by the following:

ip netns add test
ip netns exec test bash
ip link add veth-a type veth peer name veth-b
ip link set veth-a up
ip link set veth-b up
ip addr add dev veth-a 12.0.0.1/24
tc qdisc add dev veth-a root handle 1 prio
tc qdisc add dev veth-b ingress
tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
dmesg -C
ping 12.0.0.2
dmesg

Before this change, the above will print nothing to dmesg.
After this change, "skb->mark 1337!" will be printed as necessary.

Signed-off-by: Liran Alon <liran.alon@oracle.com>
Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
 include/linux/netdevice.h | 2 +-
 net/core/dev.c            | 6 +++---
 2 files changed, 4 insertions(+), 4 deletions(-)

Comments

Yuval Shaia March 13, 2018, 4:13 p.m. | #1
On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote:
> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.
> 
> The comment above dev_forward_skb() describes that this is done
> because the receiving device may be in another network namespace.
> However, this case can easily be tested for and therefore we can
> scrub packet's per-network-namespace info only when receiving device
> is indeed in another network namespace.
> 
> Therefore, this commit changes ____dev_forward_skb() to tell
> skb_scrub_packet() that skb has crossed network-namespace only in case
> transmitting device (skb->dev) network namespace is different then
> receiving device (dev) network namespace.
> 
> An example of a netdev that use skb_forward_skb() is veth.
> Thus, before this commit a packet transmitted from one veth peer to
> another when both veth peers are on same network namespace will lose
> it's skb->mark. The bug could easily be demonstrated by the following:
> 
> ip netns add test
> ip netns exec test bash
> ip link add veth-a type veth peer name veth-b
> ip link set veth-a up
> ip link set veth-b up
> ip addr add dev veth-a 12.0.0.1/24
> tc qdisc add dev veth-a root handle 1 prio
> tc qdisc add dev veth-b ingress
> tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
> tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
> dmesg -C
> ping 12.0.0.2
> dmesg
> 
> Before this change, the above will print nothing to dmesg.
> After this change, "skb->mark 1337!" will be printed as necessary.

Hi Liran,

> 
> Signed-off-by: Liran Alon <liran.alon@oracle.com>
> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
> Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>

I did not earned the credits for SOB, only r-b.

Yuval

> ---
>  include/linux/netdevice.h | 2 +-
>  net/core/dev.c            | 6 +++---
>  2 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5eef6c8e2741..5908f1e31ee2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
>  		return NET_RX_DROP;
>  	}
>  
> -	skb_scrub_packet(skb, true);
> +	skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
>  	skb->priority = 0;
>  	return 0;
>  }
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 2cedf520cb28..087787dd0a50 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
>   * start_xmit function of one device into the receive queue
>   * of another device.
>   *
> - * The receiving device may be in another namespace, so
> - * we have to clear all information in the skb that could
> - * impact namespace isolation.
> + * The receiving device may be in another namespace.
> + * In that case, we have to clear all information in the
> + * skb that could impact namespace isolation.
>   */
>  int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
>  {
> -- 
> 1.9.1
>
Yuval Shaia March 14, 2018, 12:03 p.m. | #2
On Tue, Mar 13, 2018 at 06:13:45PM +0200, Yuval Shaia wrote:
> On Tue, Mar 13, 2018 at 05:07:22PM +0200, Liran Alon wrote:
> > Before this commit, dev_forward_skb() always cleared packet's
> > per-network-namespace info. Even if the packet doesn't cross
> > network namespaces.
> > 
> > The comment above dev_forward_skb() describes that this is done
> > because the receiving device may be in another network namespace.
> > However, this case can easily be tested for and therefore we can
> > scrub packet's per-network-namespace info only when receiving device
> > is indeed in another network namespace.
> > 
> > Therefore, this commit changes ____dev_forward_skb() to tell
> > skb_scrub_packet() that skb has crossed network-namespace only in case
> > transmitting device (skb->dev) network namespace is different then
> > receiving device (dev) network namespace.
> > 
> > An example of a netdev that use skb_forward_skb() is veth.
> > Thus, before this commit a packet transmitted from one veth peer to
> > another when both veth peers are on same network namespace will lose
> > it's skb->mark. The bug could easily be demonstrated by the following:
> > 
> > ip netns add test
> > ip netns exec test bash
> > ip link add veth-a type veth peer name veth-b
> > ip link set veth-a up
> > ip link set veth-b up
> > ip addr add dev veth-a 12.0.0.1/24
> > tc qdisc add dev veth-a root handle 1 prio
> > tc qdisc add dev veth-b ingress
> > tc filter add dev veth-a parent 1: u32 match u32 0 0 action skbedit mark 1337
> > tc filter add dev veth-b parent ffff: basic match 'meta(nf_mark eq 1337)' action simple "skb->mark 1337!"
> > dmesg -C
> > ping 12.0.0.2
> > dmesg
> > 
> > Before this change, the above will print nothing to dmesg.
> > After this change, "skb->mark 1337!" will be printed as necessary.
> 
> Hi Liran,
> 
> > 
> > Signed-off-by: Liran Alon <liran.alon@oracle.com>
> > Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> 
> I did not earned the credits for SOB, only r-b.

Had an offlist conversation with Liran,
Turns out that this SOB is ok.

Yuval

> 
> Yuval
> 
> > ---
> >  include/linux/netdevice.h | 2 +-
> >  net/core/dev.c            | 6 +++---
> >  2 files changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 5eef6c8e2741..5908f1e31ee2 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -3371,7 +3371,7 @@ static __always_inline int ____dev_forward_skb(struct net_device *dev,
> >  		return NET_RX_DROP;
> >  	}
> >  
> > -	skb_scrub_packet(skb, true);
> > +	skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
> >  	skb->priority = 0;
> >  	return 0;
> >  }
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 2cedf520cb28..087787dd0a50 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -1877,9 +1877,9 @@ int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> >   * start_xmit function of one device into the receive queue
> >   * of another device.
> >   *
> > - * The receiving device may be in another namespace, so
> > - * we have to clear all information in the skb that could
> > - * impact namespace isolation.
> > + * The receiving device may be in another namespace.
> > + * In that case, we have to clear all information in the
> > + * skb that could impact namespace isolation.
> >   */
> >  int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
> >  {
> > -- 
> > 1.9.1
> >
Shmulik Ladkani March 15, 2018, 9:21 a.m. | #3
Hi,

On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <liran.alon@oracle.com> wrote:
> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.
> 
> The comment above dev_forward_skb() describes that this is done
> because the receiving device may be in another network namespace.
> However, this case can easily be tested for and therefore we can
> scrub packet's per-network-namespace info only when receiving device
> is indeed in another network namespace.
> 
> Therefore, this commit changes ____dev_forward_skb() to tell
> skb_scrub_packet() that skb has crossed network-namespace only in case
> transmitting device (skb->dev) network namespace is different then
> receiving device (dev) network namespace.

Assuming the premise of this commit is correct, note it may not act as
intended for xnet situation in ipvlan_process_multicast, snip:

		nskb->dev = ipvlan->dev;
		if (tx_pkt)
			ret = dev_forward_skb(ipvlan->dev, nskb);
		else
			ret = netif_rx(nskb);

as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in
____dev_forward_skb both dev and skb->dev are the same).
Fortunately every ipvlan_multicast_enqueue call is preceded by a forced
scrub; It would be future-proof to not assign nskb->dev in the
dev_forward_skb case (assign it only in the netif_rx case).


Regarding the premise of this commit, this "reduces" the
ipvs/orphan/mark scrubbing in the following *non* xnet situations:

 1. mac2vlan port xmit to other macvlan ports in Bridge Mode
 2. similarly for ipvlan
 3. veth xmit
 4. l2tp_eth_dev_recv
 5. bpf redirect/clone_redirect ingress actions

Regarding l2tp recv, this commit seems to align the srubbing behavior
with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv).

Regarding veth xmit, it does makes sense to preserve the fields if not
crossing netns. This is also the case when one uses tc mirred.

Regarding bpf redirect, well, it depends on the expectations of each bpf
program.
I'd argue that preserving the fields (at least the mark field) in the
*non* xnet makes sense and provides more information and therefore more
capabilities; Alas this might change behavior already being relied on.

Maybe Daniel can comment on the matter.

Regards,
Shmulik
Daniel Borkmann March 15, 2018, 11:56 a.m. | #4
On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
> Hi,
> 
> On Tue, 13 Mar 2018 17:07:22 +0200 Liran Alon <liran.alon@oracle.com> wrote:
>> Before this commit, dev_forward_skb() always cleared packet's
>> per-network-namespace info. Even if the packet doesn't cross
>> network namespaces.
>>
>> The comment above dev_forward_skb() describes that this is done
>> because the receiving device may be in another network namespace.
>> However, this case can easily be tested for and therefore we can
>> scrub packet's per-network-namespace info only when receiving device
>> is indeed in another network namespace.
>>
>> Therefore, this commit changes ____dev_forward_skb() to tell
>> skb_scrub_packet() that skb has crossed network-namespace only in case
>> transmitting device (skb->dev) network namespace is different then
>> receiving device (dev) network namespace.
> 
> Assuming the premise of this commit is correct, note it may not act as
> intended for xnet situation in ipvlan_process_multicast, snip:
> 
> 		nskb->dev = ipvlan->dev;
> 		if (tx_pkt)
> 			ret = dev_forward_skb(ipvlan->dev, nskb);
> 		else
> 			ret = netif_rx(nskb);
> 
> as 'dev' gets already assigned to nskb prior dev_forward_skb (hence in
> ____dev_forward_skb both dev and skb->dev are the same).
> Fortunately every ipvlan_multicast_enqueue call is preceded by a forced
> scrub; It would be future-proof to not assign nskb->dev in the
> dev_forward_skb case (assign it only in the netif_rx case).
> 
> 
> Regarding the premise of this commit, this "reduces" the
> ipvs/orphan/mark scrubbing in the following *non* xnet situations:
> 
>  1. mac2vlan port xmit to other macvlan ports in Bridge Mode
>  2. similarly for ipvlan
>  3. veth xmit
>  4. l2tp_eth_dev_recv
>  5. bpf redirect/clone_redirect ingress actions
> 
> Regarding l2tp recv, this commit seems to align the srubbing behavior
> with ip tunnels (full scrub only if crossing netns, see ip_tunnel_rcv).
> 
> Regarding veth xmit, it does makes sense to preserve the fields if not
> crossing netns. This is also the case when one uses tc mirred.
> 
> Regarding bpf redirect, well, it depends on the expectations of each bpf
> program.
> I'd argue that preserving the fields (at least the mark field) in the
> *non* xnet makes sense and provides more information and therefore more
> capabilities; Alas this might change behavior already being relied on.
> 
> Maybe Daniel can comment on the matter.

Overall I think it might be nice to not need scrubbing skb in such cases,
although my concern would be that this has potential to break existing
setups when they would expect mark being zero on other veth peer in any
case since it's the behavior for a long time already. The safer option
would be to have some sort of explicit opt-in e.g. on link creation to let
the skb->mark pass through unscrubbed. This would definitely be a useful
option e.g. when mark is set in the netns facing veth via clsact/egress
on xmit and when the container is unprivileged anyway.

Thanks,
Daniel
Shmulik Ladkani March 15, 2018, 12:50 p.m. | #5
Hi,

On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
> > 
> > Regarding veth xmit, it does makes sense to preserve the fields if not
> > crossing netns. This is also the case when one uses tc mirred.
> > 
> > Regarding bpf redirect, well, it depends on the expectations of each bpf
> > program.
> > I'd argue that preserving the fields (at least the mark field) in the
> > *non* xnet makes sense and provides more information and therefore more
> > capabilities; Alas this might change behavior already being relied on.
> > 
> > Maybe Daniel can comment on the matter.  
> 
> Overall I think it might be nice to not need scrubbing skb in such cases,
> although my concern would be that this has potential to break existing
> setups when they would expect mark being zero on other veth peer in any
> case since it's the behavior for a long time already. The safer option
> would be to have some sort of explicit opt-in e.g. on link creation to let
> the skb->mark pass through unscrubbed. This would definitely be a useful
> option e.g. when mark is set in the netns facing veth via clsact/egress
> on xmit and when the container is unprivileged anyway.

For the veth xmit case, an opt-in flag which disables mark scrubbing in
the *non* xnet veth-pair seems reasonable.

But what about bpf_redirect BPF_F_INGRESS, in setups not invovling
containers?
Currently bpf_redirect is implemented using dev_forward_skb which
*fully* scrubs the skb, even if the target device is on same netns as
skb->dev is.

One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for
example for demuxing skbs arriving on some "master" device into various
"slave" devices using specialized critiria.

It would be beneficial to have the mark preserved when skb is injected
to the slave device's rx path (especially when it's on the same netns).

Liran's patch fixes this - but at the cost of changing existing behavior
for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed
only if xnet).

I wonder, do you know of implementations that actually RELY on the fact
that BPF_F_INGRESS actually clears the mark, in the *non* xnet case?

Regards,
Shmulik
Daniel Borkmann March 15, 2018, 3:13 p.m. | #6
On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
> On Thu, 15 Mar 2018 12:56:13 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 03/15/2018 10:21 AM, Shmulik Ladkani wrote:
>>>
>>> Regarding veth xmit, it does makes sense to preserve the fields if not
>>> crossing netns. This is also the case when one uses tc mirred.
>>>
>>> Regarding bpf redirect, well, it depends on the expectations of each bpf
>>> program.
>>> I'd argue that preserving the fields (at least the mark field) in the
>>> *non* xnet makes sense and provides more information and therefore more
>>> capabilities; Alas this might change behavior already being relied on.
>>>
>>> Maybe Daniel can comment on the matter.  
>>
>> Overall I think it might be nice to not need scrubbing skb in such cases,
>> although my concern would be that this has potential to break existing
>> setups when they would expect mark being zero on other veth peer in any
>> case since it's the behavior for a long time already. The safer option
>> would be to have some sort of explicit opt-in e.g. on link creation to let
>> the skb->mark pass through unscrubbed. This would definitely be a useful
>> option e.g. when mark is set in the netns facing veth via clsact/egress
>> on xmit and when the container is unprivileged anyway.
> 
> For the veth xmit case, an opt-in flag which disables mark scrubbing in
> the *non* xnet veth-pair seems reasonable.
> 
> But what about bpf_redirect BPF_F_INGRESS, in setups not invovling
> containers?
> Currently bpf_redirect is implemented using dev_forward_skb which
> *fully* scrubs the skb, even if the target device is on same netns as
> skb->dev is.
> 
> One might use ebpf programs that perform BPF_F_INGRESS bpf_redirect, for
> example for demuxing skbs arriving on some "master" device into various
> "slave" devices using specialized critiria.
> 
> It would be beneficial to have the mark preserved when skb is injected
> to the slave device's rx path (especially when it's on the same netns).

Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
flag to opt-in in general case (xnet/non-xnet) and where helper bails out
on unknown flag, but also for the redirect in the same netns I think it would
be useful to have a similar redirect mode as in ipvlan master where instead
of dev_forward_skb() you would set the skb->dev = dev and have a similar
notion of RX_HANDLER_ANOTHER. Was thinking about the latter more recently
but haven't gotten to implement it yet.

> Liran's patch fixes this - but at the cost of changing existing behavior
> for BPF_F_INGRESS users (formerly: fully scrubbed; post patch: scrubbed
> only if xnet).
> 
> I wonder, do you know of implementations that actually RELY on the fact
> that BPF_F_INGRESS actually clears the mark, in the *non* xnet case?

Not that I'm aware of right now, but hard to tell what other people run
in the wild.

But lets presume for a sec you would _not_ scrub it, then how are users
supposed to make use of this? The feature/bug may not be critical enough
(well, otherwise it wouldn't have been like this for long time) for stable,
so to write an app relying on it the behavior will change from kernel A to
kernel B, where you need to end up having a full blown veth run-time test
in order to figure it out before you can use it, not really useful either.

Thanks,
Daniel
Shmulik Ladkani March 15, 2018, 3:54 p.m. | #7
On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
> > 
> > It would be beneficial to have the mark preserved when skb is injected
> > to the slave device's rx path (especially when it's on the same netns).  
> 
> Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
> flag to opt-in in general case (xnet/non-xnet)

Sounds okay to me.

> But lets presume for a sec you would _not_ scrub it, then how are users
> supposed to make use of this? The feature/bug may not be critical enough
> (well, otherwise it wouldn't have been like this for long time) for stable,
> so to write an app relying on it the behavior will change from kernel A to
> kernel B, where you need to end up having a full blown veth run-time test
> in order to figure it out before you can use it, not really useful either.

Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only
in new kernels.
As said, this flag will not be honored by older kernels.

But your "run-time test" argument is true for every new flag-bit
introduced to bpf functions, for example:
 BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags,
 Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE
 (l4_csum_replace) and others.

With every flag addition, the flag mask validation in the corresponding
bpf function has been relaxed to support it.

Why is BPF_F_PRESERVE_MARK any different from any previous flag addition?

Thanks,
Shmulik
Daniel Borkmann March 15, 2018, 5:48 p.m. | #8
On 03/15/2018 04:54 PM, Shmulik Ladkani wrote:
> On Thu, 15 Mar 2018 16:13:39 +0100 Daniel Borkmann <daniel@iogearbox.net> wrote:
>> On 03/15/2018 01:50 PM, Shmulik Ladkani wrote:
>>>
>>> It would be beneficial to have the mark preserved when skb is injected
>>> to the slave device's rx path (especially when it's on the same netns).  
>>
>> Right, I think also here the easiest would be to have a BPF_F_PRESERVE_MARK
>> flag to opt-in in general case (xnet/non-xnet)
> 
> Sounds okay to me.
> 
>> But lets presume for a sec you would _not_ scrub it, then how are users
>> supposed to make use of this? The feature/bug may not be critical enough
>> (well, otherwise it wouldn't have been like this for long time) for stable,
>> so to write an app relying on it the behavior will change from kernel A to
>> kernel B, where you need to end up having a full blown veth run-time test
>> in order to figure it out before you can use it, not really useful either.
> 
> Let's assume BPF_F_PRESERVE_MARK is a feature then, which is available only
> in new kernels.
> As said, this flag will not be honored by older kernels.
> 
> But your "run-time test" argument is true for every new flag-bit
> introduced to bpf functions, for example:
>  BPF_F_SEQ_NUMBER was added after other skb_set_tunnel_key flags,
>  Same for BPF_F_INVALIDATE_HASH (skb_store_bytes), BPF_F_MARK_ENFORCE
>  (l4_csum_replace) and others.
> 
> With every flag addition, the flag mask validation in the corresponding
> bpf function has been relaxed to support it.
> 
> Why is BPF_F_PRESERVE_MARK any different from any previous flag addition?

Not really different than other BPF helpers, but you can simply figure it out
via BPF_PROG_TEST_RUN by calling bpf_redirect() helper on some fake ifindex
and check the return value whether it's shot or redirect. This is definitely
trivial to do and doesn't require any devs to set up compared to otherwise
full blown setup. E.g. test_verifier uses this for the test case array it
contains.

Cheers,
Daniel
David Miller March 20, 2018, 2:47 p.m. | #9
From: Liran Alon <liran.alon@oracle.com>
Date: Tue, 13 Mar 2018 17:07:22 +0200

> Before this commit, dev_forward_skb() always cleared packet's
> per-network-namespace info. Even if the packet doesn't cross
> network namespaces.

There was a lot of discussion about this patch.

Particularly whether it could potentially break current
users or not.

If this is resolved and the patch should still be applied,
please repost and the folks involved in this dicussion should
add their ACKs.

Thanks.
Liran Alon March 20, 2018, 3:34 p.m. | #10
On 20/03/18 16:47, David Miller wrote:
> From: Liran Alon <liran.alon@oracle.com>
> Date: Tue, 13 Mar 2018 17:07:22 +0200
>
>> Before this commit, dev_forward_skb() always cleared packet's
>> per-network-namespace info. Even if the packet doesn't cross
>> network namespaces.
>
> There was a lot of discussion about this patch.
>
> Particularly whether it could potentially break current
> users or not.
>
> If this is resolved and the patch should still be applied,
> please repost and the folks involved in this dicussion should
> add their ACKs.
>
> Thanks.
>

The problem is that I don't think we have reached an agreement.
I would be happy to here your opinion on the issue at hand here.

I personally don't understand why we should maintain 
backwards-comparability to this behaviour. How would a user rely on the 
fact that skb->mark is scrubbed when it is passed between 2 netdevs on 
the same netns but only when it is passed between very specific netdev 
type (one of them being veth-peers).
This behaviour seems to have been created by mistake.
This feature is not documented to user-mode and I don't see why it is 
legit for the user to rely on it.

In addition, even if we do want to maintain backwards-comparability to 
this behaviour, I think it is enough to have an opt-in flag in 
/proc/sys/net/core/ that when set to 1 will activate the fix in 
dev_forward_skb() provided by this patch. That would also be a very 
simple change to the patch provided here.

Do you agree? Or do you think we should have a flag per netdev like 
suggested in other replies to this thread?

Thanks,
-Liran
David Miller March 20, 2018, 4 p.m. | #11
From: Liran Alon <LIRAN.ALON@ORACLE.COM>
Date: Tue, 20 Mar 2018 17:34:38 +0200

> I personally don't understand why we should maintain
> backwards-comparability to this behaviour.

The reason is because not breaking things is a cornerstone of Linux
kernel development.

> This feature is not documented to user-mode and I don't see why it
> is legit for the user to rely on it.

Whether it is documented or not is irrelevant.  A lot of our
interfaces and behaviors are not documented or poorly documented
at best.

> In addition, even if we do want to maintain backwards-comparability to
> this behaviour, I think it is enough to have an opt-in flag in
> /proc/sys/net/core/ that when set to 1 will activate the fix in
> dev_forward_skb() provided by this patch. That would also be a very
> simple change to the patch provided here.

Making it opt-in makes it more palatable, that's for sure.
Liran Alon March 20, 2018, 4:11 p.m. | #12
On 20/03/18 18:00, David Miller wrote:
> From: Liran Alon <LIRAN.ALON@ORACLE.COM>
> Date: Tue, 20 Mar 2018 17:34:38 +0200
>
>> I personally don't understand why we should maintain
>> backwards-comparability to this behaviour.
>
> The reason is because not breaking things is a cornerstone of Linux
> kernel development.
>
>> This feature is not documented to user-mode and I don't see why it
>> is legit for the user to rely on it.
>
> Whether it is documented or not is irrelevant.  A lot of our
> interfaces and behaviors are not documented or poorly documented
> at best.
>
>> In addition, even if we do want to maintain backwards-comparability to
>> this behaviour, I think it is enough to have an opt-in flag in
>> /proc/sys/net/core/ that when set to 1 will activate the fix in
>> dev_forward_skb() provided by this patch. That would also be a very
>> simple change to the patch provided here.
>
> Making it opt-in makes it more palatable, that's for sure.
>

1. Do we want to make a flag for every bug that is user-space visible? I 
think there is place for consideration on a per-case basis. I still 
don't see how a user can utilize this behaviour. He is basically loosing 
information (skb->mark) without this patch.

2. Having said that, I don't mind changing patch to maintain backwards 
compatibility here. However, there was also a discussion here on where 
the flag should sit. I think that a global /proc/sys/net/core/ flag 
should be enough. Do you agree it's sufficient for now?

Thanks,
-Liran
David Miller March 20, 2018, 4:34 p.m. | #13
From: Liran Alon <LIRAN.ALON@ORACLE.COM>
Date: Tue, 20 Mar 2018 18:11:49 +0200

> 1. Do we want to make a flag for every bug that is user-space visible?
> I think there is place for consideration on a per-case basis. I still
> don't see how a user can utilize this behaviour. He is basically
> loosing information (skb->mark) without this patch.

And maybe people trying to work in this situation have added code to
get the mark set some other way, or to validate that it is in fact
zero after passing through, which we would break with this change.

If it's set to zero now, it's reasonable to expect it to be zero.

By changing it to non-zero, different rules and routes will match
and this for sure has potential to break things.
Liran Alon March 20, 2018, 4:39 p.m. | #14
On 20/03/18 18:34, David Miller wrote:
> From: Liran Alon <LIRAN.ALON@ORACLE.COM>
> Date: Tue, 20 Mar 2018 18:11:49 +0200
>
>> 1. Do we want to make a flag for every bug that is user-space visible?
>> I think there is place for consideration on a per-case basis. I still
>> don't see how a user can utilize this behaviour. He is basically
>> loosing information (skb->mark) without this patch.
>
> And maybe people trying to work in this situation have added code to
> get the mark set some other way, or to validate that it is in fact
> zero after passing through, which we would break with this change.
>
> If it's set to zero now, it's reasonable to expect it to be zero.
>
> By changing it to non-zero, different rules and routes will match
> and this for sure has potential to break things.
>

OK.

What is your opinion in regards if it's OK to put the flag enabling this 
"fix" in /proc/sys/net/core? Do you think it's sufficient?
Valdis.Kletnieks@vt.edu March 20, 2018, 6:51 p.m. | #15
On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said:
> What is your opinion in regards if it's OK to put the flag enabling this
> "fix" in /proc/sys/net/core? Do you think it's sufficient?

Umm.. *which* /proc/sys/net/core?  These could differ for things that
are in different namespaces.  Or are you proposing one systemwide
global value (which also gets "interesting" if it's writable inside a
container and changes the behavior a different container sees...)
Liran Alon March 20, 2018, 9:12 p.m. | #16
On 20/03/18 20:51, valdis.kletnieks@vt.edu wrote:
> On Tue, 20 Mar 2018 18:39:47 +0200, Liran Alon said:
>> What is your opinion in regards if it's OK to put the flag enabling this
>> "fix" in /proc/sys/net/core? Do you think it's sufficient?
>
> Umm.. *which* /proc/sys/net/core?  These could differ for things that
> are in different namespaces.  Or are you proposing one systemwide
> global value (which also gets "interesting" if it's writable inside a
> container and changes the behavior a different container sees...)
>

I'm indeed proposing an opt-in system-wide global value.
I think it is the simplest approach to fix the issue at
hand here while maintaining backwards-compatibility.

I'm open to suggestions to where that system-wide
global value should be.

It must be a system-wide global value if we are not going
with the per-netdev flag approach as this system-wide global flag
should control how a skb is travelled between different netns.
So it doesn't belong to any one single netns.

Patch

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5eef6c8e2741..5908f1e31ee2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3371,7 +3371,7 @@  static __always_inline int ____dev_forward_skb(struct net_device *dev,
 		return NET_RX_DROP;
 	}
 
-	skb_scrub_packet(skb, true);
+	skb_scrub_packet(skb, !net_eq(dev_net(dev), dev_net(skb->dev)));
 	skb->priority = 0;
 	return 0;
 }
diff --git a/net/core/dev.c b/net/core/dev.c
index 2cedf520cb28..087787dd0a50 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1877,9 +1877,9 @@  int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
  * start_xmit function of one device into the receive queue
  * of another device.
  *
- * The receiving device may be in another namespace, so
- * we have to clear all information in the skb that could
- * impact namespace isolation.
+ * The receiving device may be in another namespace.
+ * In that case, we have to clear all information in the
+ * skb that could impact namespace isolation.
  */
 int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 {