
net: ip_finish_output_gso: If skb_gso_network_seglen exceeds MTU, do segmentation even for non IPSKB_FORWARDED skbs

Message ID 1467722132-10084-1-git-send-email-shmulik.ladkani@ravellosystems.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Shmulik Ladkani July 5, 2016, 12:35 p.m. UTC
Given:
 - tap0, vxlan0 enslaved under a bridge
 - eth0 is the tunnel underlay having small mtu (e.g. 1400)

Assume GSO skbs arriving from tap0 having a gso_size as determined by
user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).

After encapsulation, these skbs have a skb_gso_network_seglen that exceeds
the underlay's ip_skb_dst_mtu.

These skbs are accidentally passed to ip_finish_output2 AS IS; however
each final segment (either segmented by validate_xmit_skb of eth0, or
by eth0 hardware UFO) would be larger than eth0 mtu.
As a result, those above-mtu segments get dropped on certain underlay
networks.

The expected behavior in such a setup would be segmenting the skb first,
and then fragmenting each segment according to dst mtu, and finally
passing the resulting fragments to ip_finish_output2.

'ip_finish_output_gso' already supports this "Slowpath" behavior,
but it is only considered if IPSKB_FORWARDED is set.

However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
behavior is not considered.

Fix, by performing ip_finish_output_gso "Slowpath" even for non
IPSKB_FORWARDED skbs.

This is also OK for locally created skbs, as they are likely to have a
skb_gso_network_seglen that equals the dst mtu, and thus will go directly
to 'ip_finish_output2' as done prior to this fix.

Signed-off-by: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
---
 net/ipv4/ip_output.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

Comments

Florian Westphal July 5, 2016, 1:03 p.m. UTC | #1
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> Given:
>  - tap0, vxlan0 enslaved under a bridge
>  - eth0 is the tunnel underlay having small mtu (e.g. 1400)
> 
> Assume GSO skbs arriving from tap0 having a gso_size as determined by
> user-provided virtio_net_hdr (e.g. 1460 corresponding to VM mtu of 1500).
> 
> After encapsulation, these skbs have a skb_gso_network_seglen that exceeds
> the underlay's ip_skb_dst_mtu.
> 
> These skbs are accidentally passed to ip_finish_output2 AS IS; however
> each final segment (either segmented by validate_xmit_skb of eth0, or
> by eth0 hardware UFO) would be larger than eth0 mtu.
> As a result, those above-mtu segments get dropped on certain underlay
> networks.
> 
> The expected behavior in such a setup would be segmenting the skb first,
> and then fragmenting each segment according to dst mtu, and finally
> passing the resulting fragments to ip_finish_output2.
> 
> 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> but it is only considered if IPSKB_FORWARDED is set.
> 
> However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> behavior is not considered.

I placed this test there under the assumption that L2 bridges have
the same MTU on all bridge ports, so we'd only need to consider the
routing case.

How does it work if e.g. a 1460-sized udp packet arrives on tap0?
Do we fragment (possibly ignoring DF?)

How does it work for non-ip protocols?

(Or did I misunderstand this setup...?)
Shmulik Ladkani July 5, 2016, 2:05 p.m. UTC | #2
On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> > The expected behavior in such a setup would be segmenting the skb first,
> > and then fragmenting each segment according to dst mtu, and finally
> > passing the resulting fragments to ip_finish_output2.
> > 
> > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > but it is only considered if IPSKB_FORWARDED is set.
> > 
> > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > behavior is not considered.
> 
> I placed this test there under the assumption that L2 bridges have
> the same MTU on all bridge ports, so we'd only need to consider the
> routing case.

In our setups we have no control of VM mtu (which affects gso_size of
packets arriving from tap), and no control of vxlan's underlay mtu.

> How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> Do we fragment (possibly ignoring DF?)

A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
vxlan mtu is sufficient), and the original DF of the inner packet is
preserved.

The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
(unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
fragmented normally at the ip_finish_output --> ip_fragment code path.

So on the wire we have 2 frags of the vxlan datagram; they are reassembled
at the recipient IP stack of the vxlan termination. The inner packet is
preserved. Not ideal, but it works.

The issue is with GSO skbs arriving from tap, which eventually generate
segments larger than the mtu, which are not transmitted on eth0:

  tap0 rx:  super packet, gso_size from user's virtio_net_hdr
    ...
    vxlan0 tx:  encaps the super packet
      ...
      ip_finish_output
        ip_finish_output_gso
          *NO* skb_gso_validate_mtu()     <--- problem here
            ip_finish_output2:  tx the encapsulated super packet on eth0
              ...
              validate_xmit_skb
                netif_needs_gso
                  skb_gso_segment: segments inner payload according to
                                   original gso_size,
                                   leads to vxlan datagrams larger than mtu

> How does it work for non-ip protocols?

The specific problem is with vxlan (or any other udp based tunnel)
encapsulated GSOed packets.

> (Or did I misunderstand this setup...?)

tap0 bridged with vxlan0.
route to vxlan0's remote peer is via eth0, configured with small mtu.

Regards,
Shmulik
David Miller July 9, 2016, 3:12 a.m. UTC | #3
From: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
Date: Tue, 5 Jul 2016 17:05:41 +0300

> On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
>> (Or did I misunderstand this setup...?)
> 
> tap0 bridged with vxlan0.
> route to vxlan0's remote peer is via eth0, configured with small mtu.

Florian, any more comments?
Florian Westphal July 9, 2016, 9 a.m. UTC | #4
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:

[ CC 
> On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> > > The expected behavior in such a setup would be segmenting the skb first,
> > > and then fragmenting each segment according to dst mtu, and finally
> > > passing the resulting fragments to ip_finish_output2.
> > > 
> > > 'ip_finish_output_gso' already supports this "Slowpath" behavior,
> > > but it is only considered if IPSKB_FORWARDED is set.
> > > 
> > > However in the bridged case, IPSKB_FORWARDED is off, and the "Slowpath"
> > > behavior is not considered.
> > 
> > I placed this test there under the assumption that L2 bridges have
> > the same MTU on all bridge ports, so we'd only need to consider the
> > routing case.
> 
> In our setups we have no control of VM mtu (which affects gso_size of
> packets arriving from tap), and no control of vxlan's underlay mtu.

:-(

> > How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> > Do we fragment (possibly ignoring DF?)
> 
> A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> vxlan mtu is sufficient), and the original DF of the inner packet is
> preserved.
> 
> The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
> (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> fragmented normally at the ip_finish_output --> ip_fragment code path.

I see.

> So on the wire we have 2 frags of the vxlan datagram; they are reassembled
> at the recipient IP stack of the vxlan termination. The inner packet is
> preserved. Not ideal, but it works.
> 
> The issue is with GSO skbs arriving from tap, which eventually generate
> segments larger than the mtu, which are not transmitted on eth0:
> 
>   tap0 rx:  super packet, gso_size from user's virtio_net_hdr
>     ...
>     vxlan0 tx:  encaps the super packet
>       ...
>       ip_finish_output
>         ip_finish_output_gso
>           *NO* skb_gso_validate_mtu()     <--- problem here

We don't do this for ipv6 either, since we're expected to send a
Packet Too Big error.

>             ip_finish_output2:  tx the encapsulated super packet on eth0
>               ...
>               validate_xmit_skb
>                 netif_needs_gso
>                   skb_gso_segment: segments inner payload according to
>                                    original gso_size,
>                                    leads to vxlan datagrams larger than mtu
>
> > How does it work for non-ip protocols?
> 
> The specific problem is with vxlan (or any other udp based tunnel)
> encapsulated GSOed packets.

If I understand correctly you have vxlan stacked on top of eth0, and tap
and vxlan in a bridge.

... and eth mtu smaller than bridge mtu.

I think that this is "working" by pure accident, and that the better fix
is to set mtu values correctly so that when the vxlan header is added we
don't exceed what can be handled by the real device (yes, I know you have
no control over this).

I am worried about this patch, skb_gso_validate_mtu is more costly than
the ->flags & FORWARD check; everyone pays this extra cost.

What about setting IPCB FORWARD flag in iptunnel_xmit if
skb->skb_iif != 0... instead?

Yet another possibility would be to reduce gso_size but that violates
gro/gso symmetry...

[ I tried to check the rfc, but it seems rfc7348 simply declares that
  endpoints are not allowed to fragment, so problem solved :-/ ]
Florian Westphal July 9, 2016, 9:06 a.m. UTC | #5
David Miller <davem@davemloft.net> wrote:
> From: Shmulik Ladkani <shmulik.ladkani@ravellosystems.com>
> Date: Tue, 5 Jul 2016 17:05:41 +0300
> 
> > On Tue, 5 Jul 2016 15:03:27 +0200, fw@strlen.de wrote:
> >> (Or did I misunderstand this setup...?)
> > 
> > tap0 bridged with vxlan0.
> > route to vxlan0's remote peer is via eth0, configured with small mtu.
> 
> Florian, any more comments?

Sorry, I commented now.  But I'd really like to hear what vxlan experts
have to say about this; it seems that in the RFC 7348 universe, endpoint
fragmentation is not supposed to happen :-/
Shmulik Ladkani July 9, 2016, 12:30 p.m. UTC | #6
On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > > How does it work if e.g. a 1460-sized udp packet arrives on tap0?
> > > Do we fragment (possibly ignoring DF?)  
> > 
> > A *non* gso udp packet arriving on tap0 is bridged to vxlan0 (assuming
> > vxlan mtu is sufficient), and the original DF of the inner packet is
> > preserved.
> > 
> > The skb gets vxlan-udp encapsulated, with outer IP header having DF=0
> > (unless tun_flags & TUNNEL_DONT_FRAGMENT), and then, if skb->len > mtu,
> > fragmented normally at the ip_finish_output --> ip_fragment code path.  
> 
> I see.
> 
> If I understand correctly you have vxlan stacked on top of eth0, and tap
> and vxlan in a bridge.
> 
> ... and eth mtu smaller than bridge mtu.
> 
> I think that this is "working" by pure accident, and that the better fix
> is to set mtu values correctly so that when the vxlan header is added we
> don't exceed what can be handled by the real device (yes, I know you have
> no control over this).

Let me elaborate a bit regarding the usecase.

Consider nested virtualization. The "host" is a VM which may run in
various different cloud deployments, thus the mtu of this virtual host's
eth0 varies (I've no control over it, nor should I have).

The "host" provides a virtual network for its Nested Guest VMs - the
users of the system.
They are provided with a virtual L2 network with an MTU of their choice
(usually 1500).

Forcing runtime-varying restrictions on the users regarding which MTUs
they can use in their virtual network (restrictions which depend on where
the virtual "host" is currently deployed) means they are forced to alter
their applications' settings per deployment. This is a non-solution.

This is why neither the guests' MTUs nor the host's eth0 MTU can be
adjusted.

We have the option of using a user-space based UDP tunnel (instead of the
kernel's vxlan or geneve). Aside from the downside of not utilizing
existing protocols and implementations, this works well, as the
encapsulated guest packets are sent over a standard UDP socket via sendmsg.
As such, these datagrams may get fragmented (in deployments where the
host's eth0 mtu is too small), and reassembled at the remote tunnel
termination's IP stack.

Regardless of the use-case, there's currently an inconsistency in the
kernel's behavior:

Consider:

   VM
  eth0  # 1500 mtu; any virtual/emulated NIC that supports TSO
   .
   .
+---------------HOST-+
|  .   __br0__       |
|  .  /       \      |
| tap0       vxlan0  |
|(1500)      (1500)  |
|              .     |
|              .     |
|             eth0   |
|            (1200)  |
+--------------------+

1. If the VM disables TSO, it sends 1500-sized IP packets down eth0;
   the frames arrive on tap0, get bridged, get encapsulated by vxlan0, and
   the vxlan datagram is then fragmented (as any local datagram would be)
   by ip_finish_output on eth0.

2. If the VM enables TSO, we have a superpacket arriving at tap0
   with a total length of, say, 10000 bytes, and a gso_size of 1460.
   The superpacket gets encapsulated by vxlan0.
   Finally, upon eth0's validate_xmit_skb, the packet is udp-tunnel-segmented
   according to the original gso_size of 1460, creating encapsulation
   datagrams bigger than eth0's mtu - which are eventually dropped on
   the wire.

This is simply inconsistent: the GSO path should align with the non-GSO
case.
Thus my suggestion: in this specific case, within ip_finish_output_gso,
segment the GSO skb first, then fragment each segment according to the dst
mtu. This aligns the GSO vs non-GSO behavior.

> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.

I can get back with numbers regarding the impact on local traffic.

I'd appreciate any suggestion on how to determine that traffic is local
OTHER THAN testing IPSKB_FORWARDED; if we had such a way, there wouldn't
be an impact on local traffic.

> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

Can you please elaborate?

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

We're experimenting with this path as well. But as said, fixing the
inconsistency above would still be valid.

> [ I tried to check the rfc, but it seems rfc7348 simply declares that
>   endpoints are not allowed to fragment, so problem solved :-/ ]

Funny, in Geneve it's only a "best practice" recommendation:
  https://tools.ietf.org/html/draft-ietf-nvo3-geneve-02
  section 4.1.1

I'm not keen on vxlan; any UDP based tunnel would do ;-)

Regards,
Shmulik
Hannes Frederic Sowa July 9, 2016, 3:10 p.m. UTC | #7
On 09.07.2016 05:00, Florian Westphal wrote:
> I am worried about this patch, skb_gso_validate_mtu is more costly than
> the ->flags & FORWARD check; everyone pays this extra cost.
> 
> What about setting IPCB FORWARD flag in iptunnel_xmit if
> skb->skb_iif != 0... instead?

That came to my mind first, too.

> Yet another possibility would be to reduce gso_size but that violates
> gro/gso symmetry...

If you see vxlan (or any other UDP encap protocol) as an in-kernel
tunnel solution/program acting as an end point it is actually legal to
modify gso_size in my opinion. Adding headers to an skb can also not be
symmetric in this case. I would not see anything bad happening because
of doing so. Do you? It is a matter of implementation. vxlan could also
eat all data and splice it anew onto a socket. It already doesn't care
about end-to-end pmtu consistency, so I can't see anything wrong with it.

To make this all safe one needs to handle the ttl in vxlan much better.
It needs to be inherited from the inner header, checked and properly
decremented on the outer header, so that vxlan itself acts as a full
blown router by itself. Otherwise endless loops are possible, as the
packet will always fit the size.

Bye,
Hannes
Shmulik Ladkani July 10, 2016, 8:14 p.m. UTC | #8
On Sat, 9 Jul 2016 15:30:17 +0300 Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> > I am worried about this patch, skb_gso_validate_mtu is more costly than
> > the ->flags & FORWARD check; everyone pays this extra cost.  
> 
> I can get back with numbers regarding the impact on local traffic.

Florian, I've repeatedly tested how this affects locally generated
traffic, and it seems there's no impact (or at least one too negligible
for the netperf workload to notice):

- veth to veth (netns separated), both UFO enabled
- UDP_RR, with Request Size (udp payload) of 1500, to ensure UFO in place
  (sender reaches ip_finish_output_gso)
- Before: stable v4.6.3
- After:  stable v4.6.3 + suggested fix (kill flags&IPSKB_FORWARDED check)

# netperf -T1,2 -c -C -L 192.168.13.2 -H 192.168.13.1 -l 20 -I 99,2 -i 10 -t UDP_RR -- -P 10002,10001 -r 1500,1

MIGRATED UDP REQUEST/RESPONSE TEST from 192.168.13.2 () port 10002 AF_INET to 192.168.13.1 () port 10001 AF_INET : +/-1.000% @ 99% conf.  : demo : first burst 0 : cpu bind
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr

[before]
212992 212992 1500    1      20.00   45368.10   20.56  20.56  18.131  18.131
[after]
212992 212992 1500    1      20.00   45376.09   20.74  20.74  18.287  18.287

Therefore it seems trying to optimize this by setting IPSKB_FORWARDED
(or introducing a different mark) in 'iptunnel_xmit' would offer no
genuine benefit.

WDYT?

Were you thinking of different workloads that we should assess the
impact on?

Thanks,
Shmulik
Florian Westphal July 11, 2016, 8:13 a.m. UTC | #9
Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> On Sat, 9 Jul 2016 15:30:17 +0300 Shmulik Ladkani <shmulik.ladkani@ravellosystems.com> wrote:
> > On Sat, 9 Jul 2016 11:00:20 +0200 Florian Westphal <fw@strlen.de> wrote:
> > > I am worried about this patch, skb_gso_validate_mtu is more costly than
> > > the ->flags & FORWARD check; everyone pays this extra cost.  
> > 
> > I can get back with numbers regarding the impact on local traffic.
> 
> Florian, I've repeatedly tested how this affects locally generated
> traffic, and it seems there's no impact (or at least one too negligible
> for the netperf workload to notice):
> 
> - veth to veth (netns separated), both UFO enabled
> - UDP_RR, with Request Size (udp payload) of 1500, to ensure UFO in place
>   (sender reaches ip_finish_output_gso)
> - Before: stable v4.6.3
> - After:  stable v4.6.3 + suggested fix (kill flags&IPSKB_FORWARDED check)
> 
> # netperf -T1,2 -c -C -L 192.168.13.2 -H 192.168.13.1 -l 20 -I 99,2 -i 10 -t UDP_RR -- -P 10002,10001 -r 1500,1
> 
> MIGRATED UDP REQUEST/RESPONSE TEST from 192.168.13.2 () port 10002 AF_INET to 192.168.13.1 () port 10001 AF_INET : +/-1.000% @ 99% conf.  : demo : first burst 0 : cpu bind
> Local /Remote
> Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
> Send   Recv   Size    Size   Time    Rate     local  remote local   remote
> bytes  bytes  bytes   bytes  secs.   per sec  % S    % S    us/Tr   us/Tr
> 
> [before]
> 212992 212992 1500    1      20.00   45368.10   20.56  20.56  18.131  18.131
> [after]
> 212992 212992 1500    1      20.00   45376.09   20.74  20.74  18.287  18.287
> 
> Therefore it seems trying to optimize this by setting IPSKB_FORWARDED
> (or introducing a different mark) in 'iptunnel_xmit' would offer no
> genuine benefit.
> 
> WDYT?

I think that this just shows that on this machine and this workload
there is no impact.  E.g. for sctp we now skb_walk_frags() for
every gso packet.

We also have two more function calls per skb.

> Were you thinking of different workloads that we should assess the
> impact on?

I still think that just marking those packets as being forwarded from the
tunnel code is the right thing to do; linux runs on so many platforms
that it's very difficult to assess that something "has no overhead".

Long term we should just adjust gso size to not have to do skb software
segmentation anymore.

I remember that Eric Dumazet suggested the same thing back when this
gso forward checking was added but I recall there were corner cases in
the segmentation code that caused BUG_ON crashes.

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cbac493..8ae65b3 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -223,9 +223,8 @@  static int ip_finish_output_gso(struct net *net, struct sock *sk,
 	struct sk_buff *segs;
 	int ret = 0;
 
-	/* common case: locally created skb or seglen is <= mtu */
-	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-	      skb_gso_validate_mtu(skb, mtu))
+	/* common case: seglen is <= mtu */
+	if (skb_gso_validate_mtu(skb, mtu))
 		return ip_finish_output2(net, sk, skb);
 
 	/* Slowpath -  GSO segment length is exceeding the dst MTU.