diff mbox series

[net] Revert "vxlan: fix tos value before xmit"

Message ID 20200805024131.2091206-1-liuhangbin@gmail.com
State Accepted
Delegated to: David Miller
Headers show
Series [net] Revert "vxlan: fix tos value before xmit" | expand

Commit Message

Hangbin Liu Aug. 5, 2020, 2:41 a.m. UTC
This reverts commit 71130f29979c7c7956b040673e6b9d5643003176.

In commit 71130f29979c ("vxlan: fix tos value before xmit") we want to
make sure the tos value are filtered by RT_TOS() based on RFC1349.

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |   PRECEDENCE    |          TOS          | MBZ |
    +-----+-----+-----+-----+-----+-----+-----+-----+

But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like

       0     1     2     3     4     5     6     7
    +-----+-----+-----+-----+-----+-----+-----+-----+
    |          DS FIELD, DSCP           | ECN FIELD |
    +-----+-----+-----+-----+-----+-----+-----+-----+

So with

IPTOS_TOS_MASK          0x1E
RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)

the first 3 bits DSCP info will get lost.

To take all the DSCP info in xmit, we should revert the patch and just push
all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 drivers/net/vxlan.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Comments

Guillaume Nault Aug. 5, 2020, 8:44 a.m. UTC | #1
On Wed, Aug 05, 2020 at 10:41:31AM +0800, Hangbin Liu wrote:
> This reverts commit 71130f29979c7c7956b040673e6b9d5643003176.
> 
> In commit 71130f29979c ("vxlan: fix tos value before xmit") we want to
> make sure the tos value are filtered by RT_TOS() based on RFC1349.
> 
>        0     1     2     3     4     5     6     7
>     +-----+-----+-----+-----+-----+-----+-----+-----+
>     |   PRECEDENCE    |          TOS          | MBZ |
>     +-----+-----+-----+-----+-----+-----+-----+-----+
> 
> But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like
> 
>        0     1     2     3     4     5     6     7
>     +-----+-----+-----+-----+-----+-----+-----+-----+
>     |          DS FIELD, DSCP           | ECN FIELD |
>     +-----+-----+-----+-----+-----+-----+-----+-----+
> 
> So with
> 
> IPTOS_TOS_MASK          0x1E
> RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)
> 
> the first 3 bits DSCP info will get lost.
> 
> To take all the DSCP info in xmit, we should revert the patch and just push
> all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>

I guess an explicit
Fixes: 71130f29979c ("vxlan: fix tos value before xmit").
tag would help the -stable maintainers.

Apart from that,
Acked-by: Guillaume Nault <gnault@redhat.com>
Hangbin Liu Aug. 5, 2020, 10:18 a.m. UTC | #2
On Wed, Aug 05, 2020 at 10:44:27AM +0200, Guillaume Nault wrote:
> On Wed, Aug 05, 2020 at 10:41:31AM +0800, Hangbin Liu wrote:
> > This reverts commit 71130f29979c7c7956b040673e6b9d5643003176.
> > 
> > In commit 71130f29979c ("vxlan: fix tos value before xmit") we want to
> > make sure the tos value are filtered by RT_TOS() based on RFC1349.
> > 
> >        0     1     2     3     4     5     6     7
> >     +-----+-----+-----+-----+-----+-----+-----+-----+
> >     |   PRECEDENCE    |          TOS          | MBZ |
> >     +-----+-----+-----+-----+-----+-----+-----+-----+
> > 
> > But RFC1349 has been obsoleted by RFC2474. The new DSCP field defined like
> > 
> >        0     1     2     3     4     5     6     7
> >     +-----+-----+-----+-----+-----+-----+-----+-----+
> >     |          DS FIELD, DSCP           | ECN FIELD |
> >     +-----+-----+-----+-----+-----+-----+-----+-----+
> > 
> > So with
> > 
> > IPTOS_TOS_MASK          0x1E
> > RT_TOS(tos)		((tos)&IPTOS_TOS_MASK)
> > 
> > the first 3 bits DSCP info will get lost.
> > 
> > To take all the DSCP info in xmit, we should revert the patch and just push
> > all tos bits to ip_tunnel_ecn_encap(), which will handling ECN field later.
> > 
> > Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> 
> I guess an explicit
> Fixes: 71130f29979c ("vxlan: fix tos value before xmit").
> tag would help the -stable maintainers.
> 
> Apart from that,
> Acked-by: Guillaume Nault <gnault@redhat.com>
> 

Thanks Guillaume. I saw some revert patches have the Fixes flag and some are
not, so I didn't add it.

Hi David,

Should I re-post the patch with Fixes flag?

Thanks
Hangbin
David Miller Aug. 5, 2020, 7:11 p.m. UTC | #3
From: Hangbin Liu <liuhangbin@gmail.com>
Date: Wed, 5 Aug 2020 18:18:07 +0800

> Should I re-post the patch with Fixes flag?

No, I took care the Fixes tag and queued this up for -stable.

But you do need to explain what kind of testing you even did on this
change we are reverting.  Did you make this change purely on
theoretical grounds and a code audit?

Because it is clear now that this commit broke things and did not fix
anything at all.

Please explain.
Hangbin Liu Aug. 6, 2020, 2:52 a.m. UTC | #4
On Wed, Aug 05, 2020 at 12:11:10PM -0700, David Miller wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> Date: Wed, 5 Aug 2020 18:18:07 +0800
> 
> > Should I re-post the patch with Fixes flag?
> 
> No, I took care the Fixes tag and queued this up for -stable.

Thanks

> 
> But you do need to explain what kind of testing you even did on this
> change we are reverting.  Did you make this change purely on
> theoretical grounds and a code audit?
> 
> Because it is clear now that this commit broke things and did not fix
> anything at all.
> 
> Please explain.

Yes, I do have a bug report about this and did testing before post the patch.
But the test script is long and the reason for the issue is very clear(3 bits
of DSCP are omitted). So I only explained the theory in the commit message.

The rough steps are setting vxlan tunnel on OVS. set inner packet tos to
1011 1010 (0xba) and outer vxlan to 1111 1100(0xfc). The outer packet's tos
should be 0xfe at latest as it inherit the inner ECN bit. But with RT_TOS(tos)
We actually got tos 0x1e as the first 3 bits are omitted.

Now here is detailed testing steps:

1. On Host A (which has commit 71130f29979c "vxlan: fix tos value before
xmit"):

# cat ovs.sh
#!/bin/bash
remoteip=192.168.1.207
ip link set eth1 up
ip addr add 192.168.1.156/24 dev eth1

systemctl restart openvswitch
ovs-vsctl --may-exist add-br br-int -- set Bridge br-int datapath_type=system -- br-set-external-id br-int bridge-id br-int
ovs-vsctl add-port br-int vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=$remoteip
ip netns add private
ip link add name veth-host type veth peer name veth-guest
ovs-vsctl add-port br-int veth-host
ip link set dev veth-guest netns private
ip link set dev veth-host up
ip -n private link set dev veth-guest up
ip -n private link set dev lo up
ip -n private a a dev veth-guest 192.168.123.1/24
ovs-vsctl set interface vxlan0 options:tos=0xfc

2. On Host B (which has reverted commit 71130f29979c)

# cat ovs.sh
#!/bin/bash
remoteip=192.168.1.156

ip link set eth1 up
ip addr add 192.168.1.207/24 dev eth1

systemctl restart openvswitch
ovs-vsctl --may-exist add-br br-int -- set Bridge br-int datapath_type=system -- br-set-external-id br-int bridge-id br-int
ovs-vsctl add-port br-int vxlan0 -- set interface vxlan0 type=vxlan options:remote_ip=$remoteip
ip netns add private
ip link add name veth-host type veth peer name veth-guest
ovs-vsctl add-port br-int veth-host
ip link set dev veth-guest netns private
ip link set dev veth-host up
ip -n private link set dev veth-guest up
ip -n private link set dev lo up
ip -n private a a dev veth-guest 192.168.123.2/24
ovs-vsctl set interface vxlan0 options:tos=0xfc


3. On Host A, ping host B
# ip netns exec private ping 192.168.123.2 -c1 -W1 -Q 0xba

4. Capture the packets from Host B
# tcpdump -i eth1 -nn -l -vvv
22:34:37.663803 IP (tos 0x1e,ECT(0), ttl 64, id 63743, offset 0, flags [DF], proto UDP (17), length 134)
    192.168.1.156.55502 > 192.168.1.207.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0

	^^ you can see the tos value is 0x1e from Host A
IP (tos 0xba,ECT(0), ttl 64, id 37413, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.123.1 > 192.168.123.2: ICMP echo request, id 22930, seq 1, length 64

22:34:37.664624 IP (tos 0xfe,ECT(0), ttl 64, id 8233, offset 0, flags [DF], proto UDP (17), length 134)
    192.168.1.207.47657 > 192.168.1.156.4789: [no cksum] VXLAN, flags [I] (0x08), vni 0

        ^^ From Host B it's 0xfe
IP (tos 0xba,ECT(0), ttl 64, id 42030, offset 0, flags [none], proto ICMP (1), length 84)
    192.168.123.2 > 192.168.123.1: ICMP echo reply, id 22930, seq 1, length 64
^C

Thanks
Hangbin
David Miller Aug. 12, 2020, 12:02 a.m. UTC | #5
From: Hangbin Liu <liuhangbin@gmail.com>
Date: Thu, 6 Aug 2020 10:52:41 +0800

> The rough steps are setting vxlan tunnel on OVS. set inner packet tos to
> 1011 1010 (0xba) and outer vxlan to 1111 1100(0xfc). The outer packet's tos
> should be 0xfe at latest as it inherit the inner ECN bit. But with RT_TOS(tos)
> We actually got tos 0x1e as the first 3 bits are omitted.
> 
> Now here is detailed testing steps:

This explains why we need to revert the RT_TOS() change.

I'm asking what testing you did on the original change that added
RT_TOS(), which we reverted, and which didn't fix anything.

I want to know how we got into this situation in the first place,
adding a change that only added negative effects.

Thank you.
Hangbin Liu Aug. 12, 2020, 2:28 a.m. UTC | #6
On Tue, Aug 11, 2020 at 05:02:23PM -0700, David Miller wrote:
> From: Hangbin Liu <liuhangbin@gmail.com>
> Date: Thu, 6 Aug 2020 10:52:41 +0800
> 
> > The rough steps are setting vxlan tunnel on OVS. set inner packet tos to
> > 1011 1010 (0xba) and outer vxlan to 1111 1100(0xfc). The outer packet's tos
> > should be 0xfe at latest as it inherit the inner ECN bit. But with RT_TOS(tos)
> > We actually got tos 0x1e as the first 3 bits are omitted.
> > 
> > Now here is detailed testing steps:
> 
> This explains why we need to revert the RT_TOS() change.
> 
> I'm asking what testing you did on the original change that added
> RT_TOS(), which we reverted, and which didn't fix anything.

Oh, I know what you mean now.
> 
> I want to know how we got into this situation in the first place,
> adding a change that only added negative effects.

The reason is still based on the definition of RT_TOS. I have a report
about the difference tos action between geneve and vxlan.

For geneve:

geneve_get_v4_rt()
  - fl4->flowi4_tos = RT_TOS(tos);
geneve_xmit_skb()
  - tos = ip_tunnel_ecn_encap(fl4.flowi4_tos, ip_hdr(skb), skb);

For vxlan:

vxlan_xmit_one()
  - tos = ip_tunnel_ecn_encap(tos, old_iph, skb);

So geneve will use RT_TOS(tos) when xmit, while vxlan will take all tos bits.
At that time I only read the code and thought we should obey the RT_TOS rule,
So I submit the previous patch.

Later Petr Machata remind me that we need to take care of DSCP fields. So I
asked you if we should change RT_TOS() to DSCP_TOS()[1]. You replied

"""
The RT_TOS() value elides the two lowest bits so that we can store other
pieces of binary state into those two lower bits.

So you can't just blindly change the RT_TOS() definition without breaking
a bunch of things.
"""

I'm sorry I didn't take more time to think about the your reply and just
give up my thoughts. Since we bring up this topic again. Would you please
help explain about what "The RT_TOS() value elides the two lowest bits"
means? I'm not sure if you are talking about ECN or not.

[1] https://www.spinics.net/lists/netdev/msg631249.html

Thanks
Hangbin
diff mbox series

Patch

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index a7c3939264b0..35a7d409d8d3 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -2722,7 +2722,7 @@  static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 		ndst = &rt->dst;
 		skb_tunnel_check_pmtu(skb, ndst, VXLAN_HEADROOM);
 
-		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
+		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
 		err = vxlan_build_skb(skb, ndst, sizeof(struct iphdr),
 				      vni, md, flags, udp_sum);
@@ -2762,7 +2762,7 @@  static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
 
 		skb_tunnel_check_pmtu(skb, ndst, VXLAN6_HEADROOM);
 
-		tos = ip_tunnel_ecn_encap(RT_TOS(tos), old_iph, skb);
+		tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
 		ttl = ttl ? : ip6_dst_hoplimit(ndst);
 		skb_scrub_packet(skb, xnet);
 		err = vxlan_build_skb(skb, ndst, sizeof(struct ipv6hdr),