| Message ID | 1467722132-10084-1-git-send-email-shmulik.ladkani@ravellosystems.com |
|---|---|
| State | Changes Requested, archived |
| Delegated to | David Miller |
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> Given:
> - tap0, vxlan0 enslaved under a bridge
> - eth0 is the tunnel underlay having small mtu (e.g. 1400)
>
> Assume GSO skbs arriving from tap0 having a gso_size as determined by
> user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
>
> After encapsulation these skbs have skb_gso_network_seglen that exceed
> underlay ip_skb_dst_mtu.
>
> These skbs are accidentally passed to ip_finish_output2 AS IS; however
> each final segment (either segmented by validate_xmit_skb of eth0, or
> by eth0 hardware UFO) would be larger than eth0 mtu.
> As a result, those above-mtu segments get dropped on certain underlay
> networks.
>
> The expected behavior in such a setup would be segmenting the skb first,
> and then fragmenting each segment according to dst mtu, and finally
> passing the resulting fragments to ip_finish_output2.
>
> 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> but it is only considered if IPSKB_FORWARDED is set.
>
> However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> behavior is not considered.

I placed this test there under the assumption that L2 bridges have the
same MTU on all bridge ports, so we'd only need to consider the routing
case.

How does it work if e.g. a 1460-sized udp packet arrives on tap0?
Do we fragment (possibly ignoring DF)?

How does it work for non-ip protocols?

(Or did I misunderstand this setup...?)
On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> > The expected behavior in such a setup would be segmenting the skb first,
> > and then fragmenting each segment according to dst mtu, and finally
> > passing the resulting fragments to ip_finish_output2.
> >
> > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > but it is only considered if IPSKB_FORWARDED is set.
> >
> > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > behavior is not considered.
>
> I placed this test there under the assumption that L2 bridges have
> the same MTU on all bridge ports, so we'd only need to consider the
> routing case.

In our setups we have no control of the VM mtu (which affects the
gso_size of packets arriving from tap), and no control of vxlan's
underlay mtu.

> How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> Do we fragment (possibly ignoring DF)?

A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
vxlan mtu is sufficient), and the original DF of the inner packet is
preserved.

The skb gets vxlan-udp encapsulated, with the outer IP header having DF=0
(unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
fragmented normally at the ip_finish_output --> ip_fragment code path.

So on the wire we have 2 frags of the vxlan datagram; they are
reassembled at the recipient ip stack of the vxlan termination. Inner
packet preserved. Not ideal, but works.

The issue is with GSO skbs arriving from tap, which eventually generate
segments larger than the mtu, which are not transmitted on eth0:

  tap0 rx: super packet, gso_size from user's virtio_net_hdr
  ...
  vxlan0 tx: encaps the super packet
  ...
  ip_finish_output
    ip_finish_output_gso
      *NO* skb_gso_validate_mtu()  <--- problem here
    ip_finish_output2: tx the encapsulated super packet on eth0
  ...
  validate_xmit_skb
    netif_needs_gso
      skb_gso_segment: segments inner payload according to
                       original gso_size,
                       leads to vxlan datagrams larger than mtu

> How does it work for non-ip protocols?

The specific problem is with vxlan (or any other udp based tunnel)
encapsulated GSOed packets.

> (Or did I misunderstand this setup...?)

tap0 bridged with vxlan0.
route to vxlan0's remote peer is via eth0, configured with small mtu.

Regards,
Shmulik
From: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
Date: Tue, 5 Jul 2016 17:05:41 +0300

> On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
>> (Or did I misunderstand this setup...?)
>
> tap0 bridged with vxlan0.
> route to vxlan0's remote peer is via eth0, configured with small mtu.

Florian, any more comments?
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:

[ CC

> On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> > > The expected behavior in such a setup would be segmenting the skb first,
> > > and then fragmenting each segment according to dst mtu, and finally
> > > passing the resulting fragments to ip_finish_output2.
> > >
> > > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > > but it is only considered if IPSKB_FORWARDED is set.
> > >
> > > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > > behavior is not considered.
> >
> > I placed this test there under the assumption that L2 bridges have
> > the same MTU on all bridge ports, so we'd only need to consider the
> > routing case.
>
> In our setups we have no control of the VM mtu (which affects the
> gso_size of packets arriving from tap), and no control of vxlan's
> underlay mtu.

:-(

> > How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> > Do we fragment (possibly ignoring DF)?
>
> A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> vxlan mtu is sufficient), and the original DF of the inner packet is
> preserved.
>
> The skb gets vxlan-udp encapsulated, with the outer IP header having DF=0
> (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> fragmented normally at the ip_finish_output --> ip_fragment code path.

I see.

> So on the wire we have 2 frags of the vxlan datagram; they are
> reassembled at the recipient ip stack of the vxlan termination. Inner
> packet preserved. Not ideal, but works.
>
> The issue is with GSO skbs arriving from tap, which eventually generate
> segments larger than the mtu, which are not transmitted on eth0:
>
>   tap0 rx: super packet, gso_size from user's virtio_net_hdr
>   ...
>   vxlan0 tx: encaps the super packet
>   ...
>   ip_finish_output
>     ip_finish_output_gso
>       *NO* skb_gso_validate_mtu()  <--- problem here

We don't do this for ipv6 either, since we're expected to send a
Packet Too Big error.

>     ip_finish_output2: tx the encapsulated super packet on eth0
>   ...
>   validate_xmit_skb
>     netif_needs_gso
>       skb_gso_segment: segments inner payload according to
>                        original gso_size,
>                        leads to vxlan datagrams larger than mtu
>
> > How does it work for non-ip protocols?
>
> The specific problem is with vxlan (or any other udp based tunnel)
> encapsulated GSOed packets.

If I understand correctly you have vxlan stacked on top of eth0, and tap
and vxlan in a bridge.

... and eth mtu smaller than bridge mtu.

I think that this is "working" by pure accident, and that a better fix
is to set mtu values correctly so that when the vxlan header is added we
don't exceed what can be handled by the real device (yes, I know you
have no control over this).

I am worried about this patch; skb_gso_validate_mtu is more costly than
the ->flags & FORWARD check; everyone pays this extra cost.

What about setting the IPCB FORWARD flag in iptunnel_xmit if
skb->skb_iif != 0... instead?

Yet another possibility would be to reduce gso_size, but that violates
gro/gso symmetry...

[ I tried to check the rfc but it seems rfc7348 simply declares that
  endpoints are not allowed to fragment, so problem solved :-/ ]
David Miller <davem@davemloft.net> wrote:
> From: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
> Date: Tue, 5 Jul 2016 17:05:41 +0300
>
> > On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> >> (Or did I misunderstand this setup...?)
> >
> > tap0 bridged with vxlan0.
> > route to vxlan0's remote peer is via eth0, configured with small mtu.
>
> Florian, any more comments?

Sorry, I commented now.

But I'd really like to hear what the vxlan experts have to say about
this; it seems that in the RFC (7348) universe, endpoint fragmentation
is not supposed to happen :-/
On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > > How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> > > Do we fragment (possibly ignoring DF)?
> >
> > A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> > vxlan mtu is sufficient), and the original DF of the inner packet is
> > preserved.
> >
> > The skb gets vxlan-udp encapsulated, with the outer IP header having DF=0
> > (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> > fragmented normally at the ip_finish_output --> ip_fragment code path.
>
> I see.
>
> If I understand correctly you have vxlan stacked on top of eth0, and tap
> and vxlan in a bridge.
>
> ... and eth mtu smaller than bridge mtu.
>
> I think that this is "working" by pure accident, and that a better fix
> is to set mtu values correctly so that when the vxlan header is added we
> don't exceed what can be handled by the real device (yes, I know you
> have no control over this).

Let me elaborate a bit regarding the use case.

Consider nested virtualization. The "host" is a VM which may run in
various different cloud deployments, thus the mtu of this virtual host's
eth0 varies (I've no control over it, nor should I have).

The "host" provides a virtual network for its Nested Guest VMs - the
users of the system. They are provided with a virtual L2 network with an
MTU of their choice (usually 1500).

Forcing runtime-varying restrictions on the users regarding which MTUs
they can use in their virtual network (restrictions which depend on
where the virtual "host" is currently deployed) means they are forced to
alter their application's settings per deployment. This is a
non-solution. This is why neither the guests' MTUs nor the host's eth0
MTU can be set.

We have the option of using a user-space based UDP tunnel (instead of
the kernel's vxlan or geneve). Aside from the downsides of not utilizing
existing protocols and implementations, this works well, as the
encapsulated guest packets are sent over a standard UDP socket via
sendmsg. As such, these datagrams may get fragmented (in deployments
where the host's eth0 mtu is too small), and reassembled at the remote
tunnel termination's ip stack.

Regardless of the use case, there's currently an inconsistency in the
kernel's behavior. Consider:

 VM eth0  # 1500 mtu; any virtual/emulated NIC that supports TSO
   .
   .
 +---------------HOST-+
 |  .    __br0__      |
 |  .   /       \     |
 | tap0        vxlan0 |
 |(1500)       (1500) |
 |               .    |
 |               .    |
 |              eth0  |
 |             (1200) |
 +--------------------+

1. If the VM disables TSO, it sends 1500-sized IP packets down eth0, the
   frames arrive on tap0, get bridged and encapsulated by vxlan0, and
   the vxlan datagram is then fragmented (as any local datagram would
   be) by ip_finish_output on eth0.

2. If the VM enables TSO, we have a superpacket arriving at tap0 with a
   total length of, say, 10000 bytes, with gso_size 1460. The
   superpacket gets encapsulated by vxlan0. Finally, upon eth0's
   validate_xmit_skb, the packet is udp-tunnel-segmented according to
   the original gso_size of 1460, creating encapsulation datagrams
   bigger than eth0's mtu - which are eventually dropped on the wire.

This is simply inconsistent: the GSO path should align with the non-GSO
case.

Thus my suggestion: in this specific case, within ip_finish_output_gso,
segment the GSO skb first, then fragment each segment according to dst
mtu. This aligns the GSO vs non-GSO behavior.

> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.

I can get back with numbers regarding the impact on local traffic.
I'd appreciate any suggestion how to determine that traffic is local
OTHER THAN testing IPSKB_FORWARDED; if we had such a way, there wouldn't
be an impact on local traffic.

> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

Can you please elaborate?

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

We're experimenting with this path as well. But as said, fixing the
inconsistency above would still be valid.

> [ I tried to check the rfc but it seems rfc7348 simply declares that
>   endpoints are not allowed to fragment, so problem solved :-/ ]

Funny, in Geneve it's only a "best practice" recommendation:
https://tools.ietf.org/html/draft-ietf-nvo3-geneve-02 section 4.1.1

I'm not keen on vxlan; any UDP based tunnel would do ;-)

Regards,
Shmulik
On 09.07.2016 05:00, Florian Westphal wrote:
> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.
>
> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

That came to my mind first, too.

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

If you see vxlan (or any other UDP encap protocol) as an in-kernel
tunnel solution/program acting as an end point, it is actually legal to
modify gso_size in my opinion. Adding headers to an skb can also not be
symmetric in this case. I would not see anything bad happening because
of doing so. Do you?

It is a matter of implementation. vxlan could also eat all the data and
splice it anew onto a socket. It already doesn't care about end-to-end
pmtu consistency, so I can't see anything wrong with it.

To make this all safe, one needs to handle the ttl in vxlan much better.
It needs to be inherited from the inner header, checked, and properly
decremented on the outer header, so that vxlan acts as a full-blown
router by itself. Otherwise endless loops are possible, as the packet
will always fit the size.

Bye,
Hannes
On Sat, 9 Jul 2016 15:30:17 +0300 Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> > I am worried about this patch, skb_gso_validate_mtu is more costly than
> > the ->flags & FORWARD check; everyone pays this extra cost.
>
> I can get back with numbers regarding the impact on local traffic.

Florian, I've repeatedly tested how this affects locally generated
traffic and it seems there's no impact (or at least too negligible for
the netperf workload to notice):

- veth to veth (netns separated), both UFO enabled
- UDP_RR, with Request Size (udp payload) of 1500, to ensure UFO in place
  (sender reaches ip_finish_output_gso)
- Before: stable v4.6.3
- After: stable v4.6.3 + suggested fix (kill flags&IPSKB_FORWARDED check)

# netperf -T1,2 -c -C -L 192.168.13.2 -H 192.168.13.1 -l 20 -I 99,2 -i 10 -t UDP_RR -- -P 10002,10001 -r 1500,1

MIGRATED UDP REQUEST/RESPONSE TEST from 192.168.13.2 () port 10002 AF_INET to 192.168.13.1 () port 10001 AF_INET : +/-1.000% @ 99% conf. : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

[before]
212992 212992 1500    1      20.00   45368.10 20.56  20.56  18.131  18.131
[after]
212992 212992 1500    1      20.00   45376.09 20.74  20.74  18.287  18.287

Therefore it seems trying to optimize this by setting IPSKB_FORWARDED
(or introducing a different mark) in 'iptunnel_xmit' would offer no
genuine benefit.

WDYT? Were you thinking of different workloads that we should assess the
impact on?

Thanks,
Shmulik
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> On Sat, 9 Jul 2016 15:30:17 +0300 Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> > > I am worried about this patch, skb_gso_validate_mtu is more costly than
> > > the ->flags & FORWARD check; everyone pays this extra cost.
> >
> > I can get back with numbers regarding the impact on local traffic.
>
> Florian, I've repeatedly tested how this affects locally generated
> traffic and it seems there's no impact (or at least too negligible for
> the netperf workload to notice):
>
> - veth to veth (netns separated), both UFO enabled
> - UDP_RR, with Request Size (udp payload) of 1500, to ensure UFO in place
>   (sender reaches ip_finish_output_gso)
> - Before: stable v4.6.3
> - After: stable v4.6.3 + suggested fix (kill flags&IPSKB_FORWARDED check)
>
> # netperf -T1,2 -c -C -L 192.168.13.2 -H 192.168.13.1 -l 20 -I 99,2 -i 10 -t UDP_RR -- -P 10002,10001 -r 1500,1
>
> MIGRATED UDP REQUEST/RESPONSE TEST from 192.168.13.2 () port 10002 AF_INET to 192.168.13.1 () port 10001 AF_INET : +/-1.000% @ 99% conf. : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
>
> [before]
> 212992 212992 1500    1      20.00   45368.10 20.56  20.56  18.131  18.131
> [after]
> 212992 212992 1500    1      20.00   45376.09 20.74  20.74  18.287  18.287
>
> Therefore it seems trying to optimize this by setting IPSKB_FORWARDED
> (or introducing a different mark) in 'iptunnel_xmit' would offer no
> genuine benefit.
>
> WDYT?

I think that this just shows that on this machine and this workload
there is no impact. E.g. for sctp we now skb_walk_frags() for every gso
packet. We also have two more function calls per skb.

> Were you thinking of different workloads that we should assess the
> impact on?

I still think that just marking those packets as being forwarded from
the tunnel code is the right thing to do; linux runs on so many
platforms that it's very difficult to assess that something "has no
overhead".

Long term we should just adjust the gso size so as to not have to do skb
software segmentation anymore. I remember that Eric Dumazet suggested
the same thing back when this gso forward checking was added, but I
recall there were corner cases in the segmentation code that caused
BUG_ON crashes.
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cbac493..8ae65b3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -223,9 +223,8 @@ static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	struct sk_buff *segs;
 	int ret = 0;
 
-	/* common case: locally created skb or seglen is <= mtu */
-	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-	    skb_gso_validate_mtu(skb, mtu))
+	/* common case: seglen is <= mtu */
+	if (skb_gso_validate_mtu(skb, mtu))
 		return ip_finish_output2(net, sk, skb);
 
 	/* Slowpath - GSO segment length is exceeding the dst MTU.
Given:
- tap0, vxlan0 enslaved under a bridge
- eth0 is the tunnel underlay having small mtu (e.g. 1400)

Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).

After encapsulation these skbs have skb_gso_network_seglen that exceed
underlay ip_skb_dst_mtu.

These skbs are accidentally passed to ip_finish_output2 AS IS; however
each final segment (either segmented by validate_xmit_skb of eth0, or
by eth0 hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain underlay
networks.

The expected behavior in such a setup would be segmenting the skb first,
and then fragmenting each segment according to dst mtu, and finally
passing the resulting fragments to ip_finish_output2.

'ip_finish_output_gso' already supports this "Slowpath" behavior,
but it is only considered if IPSKB_FORWARDED is set.

However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
behavior is not considered.

Fix by performing the ip_finish_output_gso "Slowpath" even for non
IPSKB_FORWARDED skbs. This is also OK for locally created skbs, as they
are likely to have a skb_gso_network_seglen that equals the dst mtu, and
thus will go directly to 'ip_finish_output2', as done prior to this fix.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
---
 net/ipv4/ip_output.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)