Message ID | 1470726261-16371-1-git-send-email-wenxu@ucloud.cn |
---|---|
State | Rejected, archived |
Delegated to: | David Miller |
Headers | show |
From: wenxu@ucloud.cn Date: Tue, 9 Aug 2016 15:04:21 +0800 > From: wenxu <wenxu@ucloud.cn> > > commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen > exceeds MTU, allow segmentation for local udp tunneled skbs") > > Given: > - tap0 and ovs-gre > - ovs-gre stacked on eth0, eth0 having the small mtu > > After encapsulation these skbs have skb_gso_network_seglen that exceed > eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than > eth0 mtu. These packets maybe dropped. > > It has the same problem if tap0 bridge with ipgre or gretap device. So > the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. > > Signed-off-by: wenxu <wenxu@ucloud.cn> I am rather certain that this test is intentionally restricted to UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe. Hannes and Shmulik?
On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > > commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen > > exceeds MTU, allow segmentation for local udp tunneled skbs") > > > > Given: > > - tap0 and ovs-gre > > - ovs-gre stacked on eth0, eth0 having the small mtu > > > > After encapsulation these skbs have skb_gso_network_seglen that exceed > > eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than > > eth0 mtu. These packets maybe dropped. > > > > It has the same problem if tap0 bridge with ipgre or gretap device. So > > the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. > > > > Signed-off-by: wenxu <wenxu@ucloud.cn> > > I am rather certain that this test is intentionally restricted to > UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe. > > Hannes and Shmulik? It was restricted to UDP tun encaps per Hannes' suggestion, in order to scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1]. Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to handle PMTU, but - if I understand correctly - only in the !skb_is_gso case. (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE") This is probably due the same hidden assumption that 'gso_size' will be suitable for the egress mtu even after encapsulation, which does not hold true in some usecases as described in [2]. Therefore it might be that b8247f095edd needs to be augmented for non-udp tunnels as well. Hannes? [1] http://www.spinics.net/lists/netdev/msg386447.html [2] http://www.spinics.net/lists/netdev/msg385776.html
> On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote: >>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen >>> exceeds MTU, allow segmentation for local udp tunneled skbs") >>> >>> Given: >>> - tap0 and ovs-gre >>> - ovs-gre stacked on eth0, eth0 having the small mtu >>> >>> After encapsulation these skbs have skb_gso_network_seglen that exceed >>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than >>> eth0 mtu. These packets maybe dropped. >>> >>> It has the same problem if tap0 bridge with ipgre or gretap device. So >>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. >>> >>> Signed-off-by: wenxu <wenxu@ucloud.cn> >> I am rather certain that this test is intentionally restricted to >> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe. >> >> Hannes and Shmulik? > It was restricted to UDP tun encaps per Hannes' suggestion, in order to > scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1]. > > Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to > handle PMTU, but - if I understand correctly - only in the !skb_is_gso > case. > (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE") > > This is probably due the same hidden assumption that 'gso_size' will be > suitable for the egress mtu even after encapsulation, which does not > hold true in some usecases as described in [2]. > > Therefore it might be that b8247f095edd needs to be augmented for > non-udp tunnels as well. > > Hannes? > > [1] http://www.spinics.net/lists/netdev/msg386447.html > [2] http://www.spinics.net/lists/netdev/msg385776.html I think non-udp tunnels should also be set And in b8247f095edd, condition skb_iif also should be removed given: ovs-internal-port bridge with ovs-gre There is the same problem that packets from local.
Hi, On Fri, 12 Aug 2016 11:51:07 +0800 wenxu <wenxu@ucloud.cn> wrote: > > And in b8247f095edd, the condition skb_iif also should be removed. > given: > ovs-internal-dev bridge with ovs-gre > > There are the same problem which the skb from local. There's no need to remove the skb_iif criteria: For the local bridge port, we have control over its mtu. This allows setting an mtu value that takes into account the gre encapsulation performed by other member ports. By doing so, locally generated traffic (on br0) will have a proper gso_size. OTOH if packet arrives from an ingress member port (such as tap0) the packet's gso_size could have been set by another system - which is not always under our control. ( see discussion in http://www.spinics.net/lists/netdev/msg385085.html ) Regards, Shmulik
On 11.08.2016 21:41, Shmulik Ladkani wrote: > On Wed, 10 Aug 2016 17:35:11 -0700 (PDT) David Miller <davem@davemloft.net> wrote: >>> commit b8247f095edd ("net: ip_finish_output_gso: If skb_gso_network_seglen >>> exceeds MTU, allow segmentation for local udp tunneled skbs") >>> >>> Given: >>> - tap0 and ovs-gre >>> - ovs-gre stacked on eth0, eth0 having the small mtu >>> >>> After encapsulation these skbs have skb_gso_network_seglen that exceed >>> eth0's ip_skb_dst_mtu. So the finnal each segment would be larger than >>> eth0 mtu. These packets maybe dropped. >>> >>> It has the same problem if tap0 bridge with ipgre or gretap device. So >>> the IPSKB_FRAG_SEGS flags should also be set in gre tunneled skbs. >>> >>> Signed-off-by: wenxu <wenxu@ucloud.cn> >> >> I am rather certain that this test is intentionally restricted to >> UDP tunnel endpoints, because GRE and other tunnel types are PMTU safe. >> >> Hannes and Shmulik? > > It was restricted to UDP tun encaps per Hannes' suggestion, in order to > scope the change, as Hannes hinted gre and ipip are pmtu-safe, see [1]. > > Glancing at ip_tunnel_xmit() --> tnl_update_pmtu(), they do seem to > handle PMTU, but - if I understand correctly - only in the !skb_is_gso > case. > (since 68c3316311 "v4 GRE: Add TCP segmentation offload for GRE") > > This is probably due the same hidden assumption that 'gso_size' will be > suitable for the egress mtu even after encapsulation, which does not > hold true in some usecases as described in [2]. > > Therefore it might be that b8247f095edd needs to be augmented for > non-udp tunnels as well. > > Hannes? I really would not like to see this expanded to gre and other protocols. All switches drop packets where the packets are exceeding the MTU, bridges and also openvswitch should behave the same. Unfortunately we already had this loophole in the kernel that vxlan udp output path could fragment the packet again, even in case of switches. But this stopped working for GSO packets, which violates another rule in the kernel, GSO should always be transparent and user space should never have to care if a packet is GSO or not. Because we couldn't a) roll back the change that we fragment packets in UDP output paths and b) should not violate GSO transparency rule, I strongly believed it would be better too only change the kernel in a way that it transparently works with GSO, too. If we argue that a VTEP is its own UDP endpoint which is set up after the bridge, I still can sleep well. :) My understanding was that GRE failed consistently, GSO as well as non-GSO packets are dropped, which would be the correct behavior for me. I don't want to change this. A good argument against this would be if we violate the GSO transparency rule again. But when I looked into the code I couldn't see that. Bye, Hannes
Hi, On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote: > I really would not like to see this expanded to gre and other protocols. > All switches drop packets where the packets are exceeding the MTU, > bridges and also openvswitch should behave the same. > > Unfortunately we already had this loophole in the kernel that vxlan udp > output path could fragment the packet again, even in case of switches. > But this stopped working for GSO packets, which violates another rule in > the kernel, GSO should always be transparent and user space should never > have to care if a packet is GSO or not. > > Because we couldn't a) roll back the change that we fragment packets in > UDP output paths and b) should not violate GSO transparency rule, I > strongly believed it would be better too only change the kernel in a way > that it transparently works with GSO, too. If we argue that a VTEP is > its own UDP endpoint which is set up after the bridge, I still can sleep > well. :) > > My understanding was that GRE failed consistently, GSO as well as > non-GSO packets are dropped, which would be the correct behavior for me. > I don't want to change this. A good argument against this would be if we > violate the GSO transparency rule again. But when I looked into the code > I couldn't see that. I completely agree with your arguments. I think we may run into the same GSO vs Non-GSO anomaly if one uses a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT' (e.g. in case DF is not set). I suspect OvS's vport-gre does exactly that, so I assume this is the reason why the change was suggested. Maybe we can change our criteria in the following manner: - if (skb_iif && proto == IPPROTO_UDP) { + if (skb_iif && !(df & htons(IP_DF))) { IPCB(skb)->flags |= IPSKB_FRAG_SEGS; This way, any tunnel explicitly providing DF is NOT allowed to further fragment the resulting segments (leading to tx segments being dropped). Others tunnels, that do not care (e.g. vxlan and geneve, and probably ovs vport-gre, or other ovs encap vports, in df_default=false mode), will behave same for gso and non-gso. WDYT? Am I missing something here? Thanks, Shmulik
Hi, > Hi, > > On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote: >> I really would not like to see this expanded to gre and other protocols. >> All switches drop packets where the packets are exceeding the MTU, >> bridges and also openvswitch should behave the same. >> >> Unfortunately we already had this loophole in the kernel that vxlan udp >> output path could fragment the packet again, even in case of switches. >> But this stopped working for GSO packets, which violates another rule in >> the kernel, GSO should always be transparent and user space should never >> have to care if a packet is GSO or not. >> >> Because we couldn't a) roll back the change that we fragment packets in >> UDP output paths and b) should not violate GSO transparency rule, I >> strongly believed it would be better too only change the kernel in a way >> that it transparently works with GSO, too. If we argue that a VTEP is >> its own UDP endpoint which is set up after the bridge, I still can sleep >> well. :) >> >> My understanding was that GRE failed consistently, GSO as well as >> non-GSO packets are dropped, which would be the correct behavior for me. >> I don't want to change this. A good argument against this would be if we >> violate the GSO transparency rule again. But when I looked into the code >> I couldn't see that. > I completely agree with your arguments. > > I think we may run into the same GSO vs Non-GSO anomaly if one uses > a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the > encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT' > (e.g. in case DF is not set). > > I suspect OvS's vport-gre does exactly that, so I assume this is the > reason why the change was suggested. > > Maybe we can change our criteria in the following manner: > > - if (skb_iif && proto == IPPROTO_UDP) { > + if (skb_iif && !(df & htons(IP_DF))) { > IPCB(skb)->flags |= IPSKB_FRAG_SEGS; > > This way, any tunnel explicitly providing DF is NOT allowed to > further fragment the resulting segments (leading to tx segments being > dropped). > Others tunnels, that do not care (e.g. vxlan and geneve, and probably > ovs vport-gre, or other ovs encap vports, in df_default=false mode), > will behave same for gso and non-gso. > > WDYT? Am I missing something here? > > Thanks, > Shmulik I think the criteria 'skb_iif && !(df & htons(IP_DF)' is suitable. 'nopmtudisc' tunnel and ovs-gre tunnel can clear the DF through df_default=false. In this situation both gso(final segment) or no-gso packet can be fragment(if the packet size is more than mtu). Thanks Wenxu
On Mon, 15 Aug 2016 14:16:39 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote: > On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote: > > I really would not like to see this expanded to gre and other protocols. > > All switches drop packets where the packets are exceeding the MTU, > > bridges and also openvswitch should behave the same. > > > > Unfortunately we already had this loophole in the kernel that vxlan udp > > output path could fragment the packet again, even in case of switches. > > But this stopped working for GSO packets, which violates another rule in > > the kernel, GSO should always be transparent and user space should never > > have to care if a packet is GSO or not. > > > > Because we couldn't a) roll back the change that we fragment packets in > > UDP output paths and b) should not violate GSO transparency rule, I > > strongly believed it would be better too only change the kernel in a way > > that it transparently works with GSO, too. If we argue that a VTEP is > > its own UDP endpoint which is set up after the bridge, I still can sleep > > well. :) > > > > My understanding was that GRE failed consistently, GSO as well as > > non-GSO packets are dropped, which would be the correct behavior for me. > > I don't want to change this. A good argument against this would be if we > > violate the GSO transparency rule again. But when I looked into the code > > I couldn't see that. > > I completely agree with your arguments. > > I think we may run into the same GSO vs Non-GSO anomaly if one uses > a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the > encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT' > (e.g. in case DF is not set). > > I suspect OvS's vport-gre does exactly that, so I assume this is the > reason why the change was suggested. > > Maybe we can change our criteria in the following manner: > > - if (skb_iif && proto == IPPROTO_UDP) { > + if (skb_iif && !(df & htons(IP_DF))) { > IPCB(skb)->flags |= IPSKB_FRAG_SEGS; > > This way, any tunnel explicitly providing DF is NOT allowed to > further fragment the resulting segments (leading to tx segments being > dropped). > Others tunnels, that do not care (e.g. vxlan and geneve, and probably > ovs vport-gre, or other ovs encap vports, in df_default=false mode), > will behave same for gso and non-gso. > > WDYT? Am I missing something here? > ping..
On 19.08.2016 09:26, Shmulik Ladkani wrote: > On Mon, 15 Aug 2016 14:16:39 +0300 Shmulik Ladkani <shmulik.ladkani@gmail.com> wrote: >> On Fri, 12 Aug 2016 13:11:50 +0200, hannes@stressinduktion.org wrote: >>> I really would not like to see this expanded to gre and other protocols. >>> All switches drop packets where the packets are exceeding the MTU, >>> bridges and also openvswitch should behave the same. >>> >>> Unfortunately we already had this loophole in the kernel that vxlan udp >>> output path could fragment the packet again, even in case of switches. >>> But this stopped working for GSO packets, which violates another rule in >>> the kernel, GSO should always be transparent and user space should never >>> have to care if a packet is GSO or not. >>> >>> Because we couldn't a) roll back the change that we fragment packets in >>> UDP output paths and b) should not violate GSO transparency rule, I >>> strongly believed it would be better too only change the kernel in a way >>> that it transparently works with GSO, too. If we argue that a VTEP is >>> its own UDP endpoint which is set up after the bridge, I still can sleep >>> well. :) >>> >>> My understanding was that GRE failed consistently, GSO as well as >>> non-GSO packets are dropped, which would be the correct behavior for me. >>> I don't want to change this. A good argument against this would be if we >>> violate the GSO transparency rule again. But when I looked into the code >>> I couldn't see that. >> >> I completely agree with your arguments. >> >> I think we may run into the same GSO vs Non-GSO anomaly if one uses >> a "nopmtudisc" tunnel, or a gre tunnel in "collect_md" mode, where the >> encapsulating iphdr 'df' is derived from 'tun_flags&TUNNEL_DONT_FRAGMENT' >> (e.g. in case DF is not set). >> >> I suspect OvS's vport-gre does exactly that, so I assume this is the >> reason why the change was suggested. >> >> Maybe we can change our criteria in the following manner: >> >> - if (skb_iif && proto == IPPROTO_UDP) { >> + if (skb_iif && !(df & htons(IP_DF))) { >> IPCB(skb)->flags |= IPSKB_FRAG_SEGS; >> >> This way, any tunnel explicitly providing DF is NOT allowed to >> further fragment the resulting segments (leading to tx segments being >> dropped). >> Others tunnels, that do not care (e.g. vxlan and geneve, and probably >> ovs vport-gre, or other ovs encap vports, in df_default=false mode), >> will behave same for gso and non-gso. >> >> WDYT? Am I missing something here? >> > > ping.. I am really not sure... Probably we have no other choice. Bridges caring about df bit set is rather unusual. I wonder if it might not be more sensitive to actually add a sysctl for that or make it depending on some per-tunnel configuration which can be updated with netlink? Bye, Hannes
Hi, On Fri, 19 Aug 2016 11:20:40 +0200 Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > >> Maybe we can change our criteria in the following manner: > >> > >> - if (skb_iif && proto == IPPROTO_UDP) { > >> + if (skb_iif && !(df & htons(IP_DF))) { > >> IPCB(skb)->flags |= IPSKB_FRAG_SEGS; > >> > >> This way, any tunnel explicitly providing DF is NOT allowed to > >> further fragment the resulting segments (leading to tx segments being > >> dropped). > >> Others tunnels, that do not care (e.g. vxlan and geneve, and probably > >> ovs vport-gre, or other ovs encap vports, in df_default=false mode), > >> will behave same for gso and non-gso. > >> > > I am really not sure... > > Probably we have no other choice. Further diving into this, seems the !IP_DF approach is more correct then the IPPROTO_UDP approach (WRT packets/segments arriving from other interface, that exceed egress mtu): vxlan/geneve: Both set df to zero. !IP_DF approach acts same as IPPROTO_UDP approach vxlan/geneve in collect_md (e.g. OvS): They set df according to tun_flags & TUNNEL_DONT_FRAGMENT. IPPROTO_UDP approach: IPSKB_FRAG_SEGS gets set unconditionally. In case TUNNEL_DONT_FRAGMENT requested, non-gso get dropped due to IPSTATS_MIB_FRAGFAILS, whereas gso gets segmented+fragmented (!) !IP_DF approach: Aligned, both non-gso and gso gets dropped for TUNNEL_DONT_FRAGMENT. ip_gre in collect_md (e.g. OvS): Sets df according to tun_flags & TUNNEL_DONT_FRAGMENT. IPPROTO_UDP approach: IPSKB_FRAG_SEGS is never set. Therefore in the case were df is NOT set, non-gso are fragged and passed, whereas gso gets dropped (!) !IP_DF approach: Non-gso vs gso aligned. ip_gre in nopmtudisc: Will pass tnl_update_pmtu checks; Then, df inherrited from inner_iph (or stays unset if IFLA_GRE_IGNORE_DF specified). IPPROTO_UDP approach: IPSKB_FRAG_SEGS never set. Therefore in the case were df is NOT set, non-gso are fragged and passed, whereas gso gets dropped (!) !IP_DF approach: Aligned. ip_gre in fou/gue mode in nopmtudisc: Assuming they pass tnl_update_pmtu checks; Then, df inherrited from inner_iph (or stays unset if IFLA_GRE_IGNORE_DF specified). IPPROTO_UDP approach: IPSKB_FRAG_SEGS gets always set (since proto==IPPROTO_UDP). In the case df is set, non-gso dropped by IPSTATS_MIB_FRAGFAILS, whereas gso gets segmented+fragmented (!) !IP_DF approach: Aligned. ip_gre in pmtudisc: Sets df to IP_DF. Non-gso will fail tnl_update_pmtu checks (gso should pass). IPPROTO_UDP approach: IPSKB_FRAG_SEGS never set. This leads the gso skbs to be eventually dropped. okay. !IP_DF approach: IPSKB_FRAG_SEGS not set, since IP_DF is true. This leads to gso skbs to be eventually dropped. okay. (truely appreciate if you can review my above analysis) Therefore using !(df & htons(IP_DF)) actually fixes some oversights of our former proto==IPPROTO_UDP approach. I'll send a patch. Thanks Shmulik
diff --git a/net/ipv4/ip_tunnel_core.c b/net/ipv4/ip_tunnel_core.c index 9d847c3..7f36ea2 100644 --- a/net/ipv4/ip_tunnel_core.c +++ b/net/ipv4/ip_tunnel_core.c @@ -73,8 +73,9 @@ void iptunnel_xmit(struct sock *sk, struct rtable *rt, struct sk_buff *skb, skb_dst_set(skb, &rt->dst); memset(IPCB(skb), 0, sizeof(*IPCB(skb))); - if (skb_iif && proto == IPPROTO_UDP) { - /* Arrived from an ingress interface and got udp encapuslated. + if (skb_iif && (proto == IPPROTO_UDP || proto == IPPROTO_GRE)) { + /* Arrived from an ingress interface and got udp or gre + * encapsulated. * The encapsulated network segment length may exceed dst mtu. * Allow IP Fragmentation of segments. */