
[net] gso: do GSO for local skb with size bigger than MTU

Message ID 1417156385-18276-1-git-send-email-fan.du@intel.com
State Changes Requested, archived
Delegated to: David Miller

Commit Message

Fan Du Nov. 28, 2014, 6:33 a.m. UTC
Test scenario: two KVM guests sitting on different
hosts communicate with each other over a vxlan tunnel.

All interface MTUs are the default 1500 bytes. From the guest's
point of view, its skb gso_size can be as large as 1448 bytes;
however, after the guest skb goes through vxlan encapsulation,
the length of individual segments of a gso packet can exceed the
physical NIC MTU of 1500, so they are lost at the receiver side.

So in a virtualized environment it is possible for a locally
created skb to be bigger than the underlying MTU after
encapsulation. In such a case, it is reasonable to do GSO first,
then fragment any resulting packet that is still bigger than the MTU.
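For reference, the size arithmetic (assuming an IPv4 underlay and
standard VXLAN framing) is: a guest TCP segment with MSS 1448 plus
12 bytes of TCP options, 20 bytes of TCP and 20 bytes of IP header
fills a 1500-byte inner packet; vxlan then adds 8 (VXLAN) + 8 (UDP) +
20 (outer IPv4) = 36 bytes, so each encapsulated segment is about
1536 bytes at the network layer, exceeding the physical NIC MTU of 1500.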

+---------------+ TX     RX +---------------+
|   KVM Guest   | -> ... -> |   KVM Guest   |
+-+-----------+-+           +-+-----------+-+
  |Qemu/VirtIO|               |Qemu/VirtIO|
  +-----------+               +-----------+
       |                            |
       v tap0                  tap0 v
  +-----------+               +-----------+
  | ovs bridge|               | ovs bridge|
  +-----------+               +-----------+
       | vxlan                vxlan |
       v                            v
  +-----------+               +-----------+
  |    NIC    |    <------>   |    NIC    |
  +-----------+               +-----------+

Steps to reproduce:
 1. Use the kernel's built-in openvswitch module to set up an ovs bridge.
 2. Run iperf without -M; the communication will get stuck.

Signed-off-by: Fan Du <fan.du@intel.com>
---
 net/ipv4/ip_output.c |    7 ++++---
 1 files changed, 4 insertions(+), 3 deletions(-)
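
For context, the changed check lives in ip_finish_output_gso(); when the
per-segment length exceeds the dst MTU, the function falls into a slowpath
that software-segments the skb and hands each segment to ip_fragment().
Below is a paraphrased sketch of that slowpath (the helper name is
illustrative only, not the verbatim upstream source):

/* Paraphrased sketch of the GSO slowpath in ip_finish_output_gso():
 * software-segment the skb, then let ip_fragment() either fragment
 * each segment or emit ICMP_FRAG_NEEDED when DF forbids it.
 */
static int ip_finish_output_gso_slowpath(struct sk_buff *skb)
{
	netdev_features_t features = netif_skb_features(skb);
	struct sk_buff *segs;
	int ret = 0;

	/* Segment in software; disable further GSO offload. */
	segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
	if (IS_ERR_OR_NULL(segs)) {
		kfree_skb(skb);
		return -ENOMEM;
	}

	consume_skb(skb);

	do {
		struct sk_buff *nskb = segs->next;
		int err;

		segs->next = NULL;
		/* ip_fragment() checks DF/ignore_df against the dst MTU. */
		err = ip_fragment(segs, ip_finish_output2);

		if (err && ret == 0)
			ret = err;
		segs = nskb;
	} while (segs);

	return ret;
}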

Comments

Jason Wang Nov. 28, 2014, 7:02 a.m. UTC | #1
On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> Test scenario: two KVM guests sitting in different
> hosts communicate to each other with a vxlan tunnel.
> 
> All interface MTU is default 1500 Bytes, from guest point
> of view, its skb gso_size could be as bigger as 1448Bytes,
> however after guest skb goes through vxlan encapuslation,
> individual segments length of a gso packet could exceed
> physical NIC MTU 1500, which will be lost at recevier side.
> 
> So it's possible in virtualized environment, locally created
> skb len after encapslation could be bigger than underlayer
> MTU. In such case, it's reasonable to do GSO first,
> then fragment any packet bigger than MTU as possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>  2. Runing iperf without -M, communication will stuck.

Is this issue specific to ovs or ipv4? Path MTU discovery should
help in this case I believe.
> 
> 
> Signed-off-by: Fan Du <fan.du@intel.com>
> ---
>  net/ipv4/ip_output.c |    7 ++++---
>  1 files changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index bc6471d..558b5f8 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff 
> *skb)
>  	struct sk_buff *segs;
>  	int ret = 0;
>  
> -	/* common case: locally created skb or seglen is <= mtu */
> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> +	/* Both locally created skb and forwarded skb could exceed
> +	 * MTU size, so make a unified rule for them all.
> +	 */
> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>  		return ip_finish_output2(skb);
>  
>  	/* Slowpath -  GSO segment length is exceeding the dst MTU.
> -- 
> 1.7.1
> 

Florian Westphal Nov. 30, 2014, 10:26 a.m. UTC | #2
Fan Du <fan.du@intel.com> wrote:
> Test scenario: two KVM guests sitting in different
> hosts communicate to each other with a vxlan tunnel.
> 
> All interface MTU is default 1500 Bytes, from guest point
> of view, its skb gso_size could be as bigger as 1448Bytes,
> however after guest skb goes through vxlan encapuslation,
> individual segments length of a gso packet could exceed
> physical NIC MTU 1500, which will be lost at recevier side.
> 
> So it's possible in virtualized environment, locally created
> skb len after encapslation could be bigger than underlayer
> MTU. In such case, it's reasonable to do GSO first,
> then fragment any packet bigger than MTU as possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>  2. Runing iperf without -M, communication will stuck.

Hmm, do we really want to support bridges containing interfaces with
different MTUs?

It seems to me the only clean solution is to set the tap0 MTU so that it
accounts for the bridge encap overhead.
Fan Du Nov. 30, 2014, 10:55 a.m. UTC | #3
>-----Original Message-----
>From: Florian Westphal [mailto:fw@strlen.de]
>Sent: Sunday, November 30, 2014 6:27 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>Fan Du <fan.du@intel.com> wrote:
>> Test scenario: two KVM guests sitting in different hosts communicate
>> to each other with a vxlan tunnel.
>>
>> All interface MTU is default 1500 Bytes, from guest point of view, its
>> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> goes through vxlan encapuslation, individual segments length of a gso
>> packet could exceed physical NIC MTU 1500, which will be lost at
>> recevier side.
>>
>> So it's possible in virtualized environment, locally created skb len
>> after encapslation could be bigger than underlayer MTU. In such case,
>> it's reasonable to do GSO first, then fragment any packet bigger than
>> MTU as possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>>  2. Runing iperf without -M, communication will stuck.
>
>Hmm, do we really want to suport bridges containing interfaces with different
>MTUs?

All interface MTU in the test scenario is the default one, 1500.

>It seems to me to only clean solution is to set tap0 MTU so that it accounts for the
>bridge encap overhead.

This would force _ALL_ deployed instances to change the tap0 MTU in every cloud env.

The current behavior lets over-MTU-sized packets be pushed down to the NIC, which should
not happen anyway. And as I put it in another thread:
perform GSO on the skb, then try to do IP segmentation if possible. If DF is set, send back
an ICMP message. If DF is not set, the user apparently wants the stack to do IP segmentation,
and all the GSO-ed skbs will be sent out correctly as expected.


Florian Westphal Nov. 30, 2014, 3:11 p.m. UTC | #4
Du, Fan <fan.du@intel.com> wrote:
> All interface MTU in the test scenario is the default one, 1500.

Not really, unless I misunderstand the setup.

You have an l2 network where part of the machines are connected by an
l2 tunnel.

All machines within that network ought to assume that MTU is equal for
all machines within the same L2 network.

> >It seems to me to only clean solution is to set tap0 MTU so that it accounts for the
> >bridge encap overhead.
> 
> This will force _ALL_ deploy instances requiring tap0 MTU change in every cloud env.

Yes; alternatively, employ routing, then PMTU should work.

> Current behavior leads over-mtu-sized packet push down to NIC, which should not
> happen anyway. And as I putted in another threads:
> Perform GSO for skb, then try to do ip segmentation if possible, If DF set, send back
> ICMP message. If DF is not set, apparently user want stack do ip segmentation, and
> All the GSO-ed skb will be sent out correctly as expected.

Well, the Linux bridge implementation (especially bridge netfilter)
did/allows a lot of layering violations, and this has usually caused
a myriad of follow-up kludges to make one more scenario work.

I still think that trying to make this work is a bad idea.
If hosts have different MTUs they should be in different l2 networks.

Alternatively, the Tunneling implementation should be opaque and do the
needed fragmentation to provide the illusion of identical MTUs.

That said, I don't see anything wrong with the patch per se, I just
dislike the concept.
Fan Du Dec. 1, 2014, 6:47 a.m. UTC | #5
>-----Original Message-----
>From: Florian Westphal [mailto:fw@strlen.de]
>Sent: Sunday, November 30, 2014 11:11 PM
>To: Du, Fan
>Cc: Florian Westphal; netdev@vger.kernel.org; davem@davemloft.net
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>Du, Fan <fan.du@intel.com> wrote:
>> All interface MTU in the test scenario is the default one, 1500.
>
>Not really, unless I misunderstand the setup.
>
>You have a l2 network where part of the machines are connected by a
>l2 tunnel.
>
>All machines within that network ought to assume that MTU is equal for all
>machines within the same L2 network.

Based on what assumption do you think the test scenario uses different MTUs?
I think all of your conclusions come from the MTU configuration; as a matter of fact,
like I stated before, ALL interface MTUs are the default 1500.

Let me elaborate on this typical (not kludged) env a bit more:

Without a vxlan tunnel, a typical standard env:
Guest -> Qemu/VirtIO -> tap0 -> linux bridge -> NIC

No tunneling trick here, no MTU change; packets come, packets go, naked...

With a vxlan tunnel, the topology is almost the same as before, and there is really no need to change any MTU
to make the env below work:
Guest -> Qemu/VirtIO -> tap0 -> ovs bridge -> vxlan tunnel -> NIC
                                         ^^^^^^^^
                                         ^^^^^^^^
Encapsulation puts the outer L2/L3/L4/VXLAN headers in front of the Guest frame.
An over-MTU-sized packet is locally created *by default* when the Guest first tries a big MSS, *BEFORE* PMTU
is able to make the guest sense that this MSS is too big. Guess what, this over-MTU-sized packet is lost. That's the problem:
not any different MTU configuration, but that the code here rules out (based on what fact?) any such existing possibility.

Anyway, setting up such an env to see what's really going on inside the stack is quite easy. You could even use docker to give it a try.
It has the same effect as a KVM guest, but is easier to deploy.

Docker instance -> docker0 bridge -> vethA -> vethB -> ovs-br0 -> vxlan -> NIC

If you have any doubts about the env, please let me know.


>> >It seems to me to only clean solution is to set tap0 MTU so that it
>> >accounts for the bridge encap overhead.
>>
>> This will force _ALL_ deploy instances requiring tap0 MTU change in every cloud
>env.
>
>Yes, alternatively emply routing, then PMTU should work.
>
>> Current behavior leads over-mtu-sized packet push down to NIC, which
>> should not happen anyway. And as I putted in another threads:
>> Perform GSO for skb, then try to do ip segmentation if possible, If DF
>> set, send back ICMP message. If DF is not set, apparently user want
>> stack do ip segmentation, and All the GSO-ed skb will be sent out correctly as
>expected.
>
>Well, the linux bridge implementation (especially bridge netfilter) did/allows for a
>lot of layering violations and this has usually caused a myriad of followup kludges
>to make one-more scenario work.
>
>I still think that trying to make this work is a bad idea.
>If hosts have different MTUs they should be in different l2 networks.
>
>Alternatively, the Tunneling implementation should be opaque and do the needed
>fragmentation to provide the illusion of identical MTUs.
>
>That said, I don't see anything wrong with the patch per se, I just dislike the
>concept.
Thomas Graf Dec. 1, 2014, 1:52 p.m. UTC | #6
On 11/30/14 at 10:08am, Du, Fan wrote:
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com]
> >Sent: Friday, November 28, 2014 3:02 PM
> >To: Du, Fan
> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> Test scenario: two KVM guests sitting in different hosts communicate
> >> to each other with a vxlan tunnel.
> >>
> >> All interface MTU is default 1500 Bytes, from guest point of view, its
> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> >> goes through vxlan encapuslation, individual segments length of a gso
> >> packet could exceed physical NIC MTU 1500, which will be lost at
> >> recevier side.
> >>
> >> So it's possible in virtualized environment, locally created skb len
> >> after encapslation could be bigger than underlayer MTU. In such case,
> >> it's reasonable to do GSO first, then fragment any packet bigger than
> >> MTU as possible.
> >>
> >> +---------------+ TX     RX +---------------+
> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> +-+-----------+-+           +-+-----------+-+
> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >>   +-----------+               +-----------+
> >>        |                            |
> >>        v tap0                  tap0 v
> >>   +-----------+               +-----------+
> >>   | ovs bridge|               | ovs bridge|
> >>   +-----------+               +-----------+
> >>        | vxlan                vxlan |
> >>        v                            v
> >>   +-----------+               +-----------+
> >>   |    NIC    |    <------>   |    NIC    |
> >>   +-----------+               +-----------+
> >>
> >> Steps to reproduce:
> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
> >>  2. Runing iperf without -M, communication will stuck.
> >
> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >believe.
> 
> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> without any further ip segmentation.
> 
> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set, 
> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.

Aside from this: I think Virtio should provide an MTU hint to the guest
to adjust the MTU of the vNIC, accounting for the encap overhead or for
jumbo frame support in the underlay, transparently and without relying
on PMTU or MSS hints. I remember we talked about this a while ago with
at least Michael, but no actual code work has been done on it yet.

> For PMTU to work, that's another issue I will try to address later on.

PMTU discovery was explicitly removed from the OVS datapath. Maybe
Pravin or Jesse can provide some background on that.
Michael S. Tsirkin Dec. 1, 2014, 3:06 p.m. UTC | #7
On Mon, Dec 01, 2014 at 01:52:25PM +0000, Thomas Graf wrote:
> On 11/30/14 at 10:08am, Du, Fan wrote:
> > >-----Original Message-----
> > >From: Jason Wang [mailto:jasowang@redhat.com]
> > >Sent: Friday, November 28, 2014 3:02 PM
> > >To: Du, Fan
> > >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> > >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> > >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> > >> Test scenario: two KVM guests sitting in different hosts communicate
> > >> to each other with a vxlan tunnel.
> > >>
> > >> All interface MTU is default 1500 Bytes, from guest point of view, its
> > >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> > >> goes through vxlan encapuslation, individual segments length of a gso
> > >> packet could exceed physical NIC MTU 1500, which will be lost at
> > >> recevier side.
> > >>
> > >> So it's possible in virtualized environment, locally created skb len
> > >> after encapslation could be bigger than underlayer MTU. In such case,
> > >> it's reasonable to do GSO first, then fragment any packet bigger than
> > >> MTU as possible.
> > >>
> > >> +---------------+ TX     RX +---------------+
> > >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> > >> +-+-----------+-+           +-+-----------+-+
> > >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> > >>   +-----------+               +-----------+
> > >>        |                            |
> > >>        v tap0                  tap0 v
> > >>   +-----------+               +-----------+
> > >>   | ovs bridge|               | ovs bridge|
> > >>   +-----------+               +-----------+
> > >>        | vxlan                vxlan |
> > >>        v                            v
> > >>   +-----------+               +-----------+
> > >>   |    NIC    |    <------>   |    NIC    |
> > >>   +-----------+               +-----------+
> > >>
> > >> Steps to reproduce:
> > >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
> > >>  2. Runing iperf without -M, communication will stuck.
> > >
> > >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> > >believe.
> > 
> > Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> > without any further ip segmentation.
> > 
> > Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set, 
> > Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.

Sounds about right.

> Aside from this. I think Virtio should provide a MTU hint to the guest
> to adjust MTU in the vNIC to account for both overhead or support for
> jumbo frames in the underlay transparently without relying on PMTU or
> MSS hints. I remember we talked about this a while ago with at least
> Michael but haven't done actual code work on it yet.

This can be an optimization but can't replace fixing PMTU.
In particular this can't help legacy guests.

> > For PMTU to work, that's another issue I will try to address later on.
> 
> PMTU discovery was explicitly removed from the OVS datapath. Maybe
> Pravin or Jesse can provide some background on that

PMTU was never working properly for OVS users.  There was an ugly hack that kind
of worked part of the time.  That was dropped, and good riddance.
Proper PMTU support for all kinds of tunneling solutions, e.g. along the
lines you outline above, belongs in the host core.
Flavio Leitner Dec. 2, 2014, 3:44 p.m. UTC | #8
On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
> 
> 
> >-----Original Message-----
> >From: Jason Wang [mailto:jasowang@redhat.com]
> >Sent: Friday, November 28, 2014 3:02 PM
> >To: Du, Fan
> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >
> >
> >
> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> Test scenario: two KVM guests sitting in different hosts communicate
> >> to each other with a vxlan tunnel.
> >>
> >> All interface MTU is default 1500 Bytes, from guest point of view, its
> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> >> goes through vxlan encapuslation, individual segments length of a gso
> >> packet could exceed physical NIC MTU 1500, which will be lost at
> >> recevier side.
> >>
> >> So it's possible in virtualized environment, locally created skb len
> >> after encapslation could be bigger than underlayer MTU. In such case,
> >> it's reasonable to do GSO first, then fragment any packet bigger than
> >> MTU as possible.
> >>
> >> +---------------+ TX     RX +---------------+
> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> +-+-----------+-+           +-+-----------+-+
> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >>   +-----------+               +-----------+
> >>        |                            |
> >>        v tap0                  tap0 v
> >>   +-----------+               +-----------+
> >>   | ovs bridge|               | ovs bridge|
> >>   +-----------+               +-----------+
> >>        | vxlan                vxlan |
> >>        v                            v
> >>   +-----------+               +-----------+
> >>   |    NIC    |    <------>   |    NIC    |
> >>   +-----------+               +-----------+
> >>
> >> Steps to reproduce:
> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
> >>  2. Runing iperf without -M, communication will stuck.
> >
> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >believe.
> 
> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> without any further ip segmentation.
> 
> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set, 
> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
> 
> For PMTU to work, that's another issue I will try to address later on.
> 
> >>
> >>
> >> Signed-off-by: Fan Du <fan.du@intel.com>
> >> ---
> >>  net/ipv4/ip_output.c |    7 ++++---
> >>  1 files changed, 4 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
> >> bc6471d..558b5f8 100644
> >> --- a/net/ipv4/ip_output.c
> >> +++ b/net/ipv4/ip_output.c
> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
> >> *skb)
> >>  	struct sk_buff *segs;
> >>  	int ret = 0;
> >>
> >> -	/* common case: locally created skb or seglen is <= mtu */
> >> -	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> >> -	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> +	/* Both locally created skb and forwarded skb could exceed
> >> +	 * MTU size, so make a unified rule for them all.
> >> +	 */
> >> +	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >>  		return ip_finish_output2(skb);


Are you using the kernel's vxlan device or openvswitch's vxlan device?

Because for the kernel's vxlan devices the MTU accounts for the header
overhead, so I believe your patch would work.  However, the MTU is
not visible for ovs's vxlan devices, so that wouldn't work.

fbl
Flavio Leitner Dec. 2, 2014, 3:48 p.m. UTC | #9
On Mon, Dec 01, 2014 at 01:52:25PM +0000, Thomas Graf wrote:
> On 11/30/14 at 10:08am, Du, Fan wrote:
> > >-----Original Message-----
> > >From: Jason Wang [mailto:jasowang@redhat.com]
> > >Sent: Friday, November 28, 2014 3:02 PM
> > >To: Du, Fan
> > >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> > >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> > >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> > >> Test scenario: two KVM guests sitting in different hosts communicate
> > >> to each other with a vxlan tunnel.
> > >>
> > >> All interface MTU is default 1500 Bytes, from guest point of view, its
> > >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> > >> goes through vxlan encapuslation, individual segments length of a gso
> > >> packet could exceed physical NIC MTU 1500, which will be lost at
> > >> recevier side.
> > >>
> > >> So it's possible in virtualized environment, locally created skb len
> > >> after encapslation could be bigger than underlayer MTU. In such case,
> > >> it's reasonable to do GSO first, then fragment any packet bigger than
> > >> MTU as possible.
> > >>
> > >> +---------------+ TX     RX +---------------+
> > >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> > >> +-+-----------+-+           +-+-----------+-+
> > >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> > >>   +-----------+               +-----------+
> > >>        |                            |
> > >>        v tap0                  tap0 v
> > >>   +-----------+               +-----------+
> > >>   | ovs bridge|               | ovs bridge|
> > >>   +-----------+               +-----------+
> > >>        | vxlan                vxlan |
> > >>        v                            v
> > >>   +-----------+               +-----------+
> > >>   |    NIC    |    <------>   |    NIC    |
> > >>   +-----------+               +-----------+
> > >>
> > >> Steps to reproduce:
> > >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
> > >>  2. Runing iperf without -M, communication will stuck.
> > >
> > >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> > >believe.
> > 
> > Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> > without any further ip segmentation.
> > 
> > Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set, 
> > Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
> 
> Aside from this. I think Virtio should provide a MTU hint to the guest
> to adjust MTU in the vNIC to account for both overhead or support for
> jumbo frames in the underlay transparently without relying on PMTU or
> MSS hints. I remember we talked about this a while ago with at least
> Michael but haven't done actual code work on it yet.

What about containers or any other virtualization environment that
doesn't use Virtio?

fbl
Thomas Graf Dec. 2, 2014, 5:09 p.m. UTC | #10
On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> What about containers or any other virtualization environment that
> doesn't use Virtio?

The host can dictate the MTU in that case for both veth and OVS
internal ports, which would be the primary container plumbing techniques.

ivshmem would need its own fix.
Michael S. Tsirkin Dec. 2, 2014, 5:34 p.m. UTC | #11
On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > What about containers or any other virtualization environment that
> > doesn't use Virtio?
> 
> The host can dictate the MTU in that case for both veth or OVS
> internal which would be primary container plumbing techniques.

It typically can't do this easily for VMs with emulated devices:
real ethernet uses a fixed MTU.

IMHO it's confusing to suggest MTU as a fix for this bug, it's
an unrelated optimization.
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.


> ivshmem would need it's own fix.
Thomas Graf Dec. 2, 2014, 5:41 p.m. UTC | #12
On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> > > What about containers or any other virtualization environment that
> > > doesn't use Virtio?
> > 
> > The host can dictate the MTU in that case for both veth or OVS
> > internal which would be primary container plumbing techniques.
> 
> It typically can't do this easily for VMs with emulated devices:
> real ethernet uses a fixed MTU.
> 
> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> an unrelated optimization.
> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.

PMTU discovery only resolves the issue if an actual IP stack is
running inside the VM. This may not be the case at all.

I agree that exposing an MTU towards the guest is not applicable
in all situations, in particular because it is difficult to decide
what MTU to expose. It is a relatively elegant solution in a lot
of virtualization host cases hooked up to an orchestration system
though.
Jesse Gross Dec. 2, 2014, 6:06 p.m. UTC | #13
On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
> On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>>
>>
>> >-----Original Message-----
>> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >Sent: Friday, November 28, 2014 3:02 PM
>> >To: Du, Fan
>> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >
>> >
>> >
>> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> Test scenario: two KVM guests sitting in different hosts communicate
>> >> to each other with a vxlan tunnel.
>> >>
>> >> All interface MTU is default 1500 Bytes, from guest point of view, its
>> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> >> goes through vxlan encapuslation, individual segments length of a gso
>> >> packet could exceed physical NIC MTU 1500, which will be lost at
>> >> recevier side.
>> >>
>> >> So it's possible in virtualized environment, locally created skb len
>> >> after encapslation could be bigger than underlayer MTU. In such case,
>> >> it's reasonable to do GSO first, then fragment any packet bigger than
>> >> MTU as possible.
>> >>
>> >> +---------------+ TX     RX +---------------+
>> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> +-+-----------+-+           +-+-----------+-+
>> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >>   +-----------+               +-----------+
>> >>        |                            |
>> >>        v tap0                  tap0 v
>> >>   +-----------+               +-----------+
>> >>   | ovs bridge|               | ovs bridge|
>> >>   +-----------+               +-----------+
>> >>        | vxlan                vxlan |
>> >>        v                            v
>> >>   +-----------+               +-----------+
>> >>   |    NIC    |    <------>   |    NIC    |
>> >>   +-----------+               +-----------+
>> >>
>> >> Steps to reproduce:
>> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>> >>  2. Runing iperf without -M, communication will stuck.
>> >
>> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >believe.
>>
>> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
>> without any further ip segmentation.
>>
>> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set,
>> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
>>
>> For PMTU to work, that's another issue I will try to address later on.
>>
>> >>
>> >>
>> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> ---
>> >>  net/ipv4/ip_output.c |    7 ++++---
>> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >>
>> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> bc6471d..558b5f8 100644
>> >> --- a/net/ipv4/ip_output.c
>> >> +++ b/net/ipv4/ip_output.c
>> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> >> *skb)
>> >>    struct sk_buff *segs;
>> >>    int ret = 0;
>> >>
>> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> +   * MTU size, so make a unified rule for them all.
>> >> +   */
>> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >>            return ip_finish_output2(skb);
>
>
> Are you using kernel's vxlan device or openvswitch's vxlan device?
>
> Because for kernel's vxlan devices the MTU accounts for the header
> overhead so I believe your patch would work.  However, the MTU is
> not visible for the ovs's vxlan devices, so that wouldn't work.

This is being called after the tunnel code, so the MTU that is being
looked at in all cases is the physical device's. Since the packet has
already been encapsulated, tunnel header overhead is already accounted
for in skb_gso_network_seglen() and this should be fine for both OVS
and non-OVS cases.
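
For reference, a paraphrased sketch of the helper Jesse refers to is
below (not the verbatim source; per-kernel details may differ):

/* Paraphrased from include/linux/skbuff.h of this era: the per-segment
 * length at the network layer is whatever sits between the (outer)
 * network header and the transport header, plus the transport-level
 * segment length.  skb_gso_transport_seglen() in turn handles
 * encapsulated skbs via the inner headers, so the vxlan/UDP overhead
 * already on the skb ends up in the total.
 */
static inline unsigned int skb_gso_network_seglen(const struct sk_buff *skb)
{
	unsigned int hdr_len = skb_transport_header(skb) -
			       skb_network_header(skb);

	return hdr_len + skb_gso_transport_seglen(skb);
}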
Jesse Gross Dec. 2, 2014, 6:12 p.m. UTC | #14
On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's
>> an unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
> PMTU discovery only resolves the issue if an actual IP stack is
> running inside the VM. This may not be the case at all.

It's also only really a correct thing to do if the ICMP packet is
coming from an L3 node. If you are doing straight bridging then you
have to resort to hacks like OVS had before, which I agree are not
particularly desirable.

> I agree that exposing an MTU towards the guest is not applicable
> in all situations, in particular because it is difficult to decide
> what MTU to expose. It is a relatively elegant solution in a lot
> of virtualization host cases hooked up to an orchestration system
> though.

I also think this is the right thing to do as a common case
optimization and I know other platforms (such as Hyper-V) do it. It's
not a complete solution so we still need the original patch in this
thread to handle things transparently.
Flavio Leitner Dec. 2, 2014, 9:32 p.m. UTC | #15
On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
> >>
> >>
> >> >-----Original Message-----
> >> >From: Jason Wang [mailto:jasowang@redhat.com]
> >> >Sent: Friday, November 28, 2014 3:02 PM
> >> >To: Du, Fan
> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
> >> >
> >> >
> >> >
> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
> >> >> Test scenario: two KVM guests sitting in different hosts communicate
> >> >> to each other with a vxlan tunnel.
> >> >>
> >> >> All interface MTU is default 1500 Bytes, from guest point of view, its
> >> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
> >> >> goes through vxlan encapuslation, individual segments length of a gso
> >> >> packet could exceed physical NIC MTU 1500, which will be lost at
> >> >> recevier side.
> >> >>
> >> >> So it's possible in virtualized environment, locally created skb len
> >> >> after encapslation could be bigger than underlayer MTU. In such case,
> >> >> it's reasonable to do GSO first, then fragment any packet bigger than
> >> >> MTU as possible.
> >> >>
> >> >> +---------------+ TX     RX +---------------+
> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
> >> >> +-+-----------+-+           +-+-----------+-+
> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
> >> >>   +-----------+               +-----------+
> >> >>        |                            |
> >> >>        v tap0                  tap0 v
> >> >>   +-----------+               +-----------+
> >> >>   | ovs bridge|               | ovs bridge|
> >> >>   +-----------+               +-----------+
> >> >>        | vxlan                vxlan |
> >> >>        v                            v
> >> >>   +-----------+               +-----------+
> >> >>   |    NIC    |    <------>   |    NIC    |
> >> >>   +-----------+               +-----------+
> >> >>
> >> >> Steps to reproduce:
> >> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
> >> >>  2. Runing iperf without -M, communication will stuck.
> >> >
> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
> >> >believe.
> >>
> >> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
> >> without any further ip segmentation.
> >>
> >> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set,
> >> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
> >>
> >> For PMTU to work, that's another issue I will try to address later on.
> >>
> >> >>
> >> >>
> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
> >> >> ---
> >> >>  net/ipv4/ip_output.c |    7 ++++---
> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
> >> >>
> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
> >> >> bc6471d..558b5f8 100644
> >> >> --- a/net/ipv4/ip_output.c
> >> >> +++ b/net/ipv4/ip_output.c
> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
> >> >> *skb)
> >> >>    struct sk_buff *segs;
> >> >>    int ret = 0;
> >> >>
> >> >> -  /* common case: locally created skb or seglen is <= mtu */
> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >> +  /* Both locally created skb and forwarded skb could exceed
> >> >> +   * MTU size, so make a unified rule for them all.
> >> >> +   */
> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> >> >>            return ip_finish_output2(skb);
> >
> >
> > Are you using kernel's vxlan device or openvswitch's vxlan device?
> >
> > Because for kernel's vxlan devices the MTU accounts for the header
> > overhead so I believe your patch would work.  However, the MTU is
> > not visible for the ovs's vxlan devices, so that wouldn't work.
> 
> This is being called after the tunnel code, so the MTU that is being
> looked at in all cases is the physical device's. Since the packet has
> already been encapsulated, tunnel header overhead is already accounted
> for in skb_gso_network_seglen() and this should be fine for both OVS
> and non-OVS cases.

Right, it didn't work on my first try and that explanation came to mind.

Anyway, I am testing this with containers instead of VMs, so I am using
veth and not Virtio-net.

This is the actual stack trace:

[...]
  do_output
  ovs_vport_send
  vxlan_tnl_send
  vxlan_xmit_skb
  udp_tunnel_xmit_skb
  iptunnel_xmit
   \ skb_scrub_packet => skb->ignore_df = 0;
  ip_local_out_sk
  ip_output
  ip_finish_output (_gso is inlined)
  ip_fragment

and on ip_fragment() it does:

 503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
 504                      (IPCB(skb)->frag_max_size &&
 505                       IPCB(skb)->frag_max_size > mtu))) {
 506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 507                 icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 508                           htonl(mtu));
 509                 kfree_skb(skb);
 510                 return -EMSGSIZE;
 511         }

Since IP_DF is set and skb->ignore_df is reset to 0, in my case
the packet is dropped and an ICMP is sent back. The connection
remains stuck as before. Doesn't virtio-net set the DF bit?

Thanks,
fbl
Jesse Gross Dec. 2, 2014, 9:47 p.m. UTC | #16
On Tue, Dec 2, 2014 at 1:32 PM, Flavio Leitner <fbl@redhat.com> wrote:
> On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du, Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting in different hosts communicate
>> >> >> to each other with a vxlan tunnel.
>> >> >>
>> >> >> All interface MTU is default 1500 Bytes, from guest point of view, its
>> >> >> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> >> >> goes through vxlan encapuslation, individual segments length of a gso
>> >> >> packet could exceed physical NIC MTU 1500, which will be lost at
>> >> >> recevier side.
>> >> >>
>> >> >> So it's possible in virtualized environment, locally created skb len
>> >> >> after encapslation could be bigger than underlayer MTU. In such case,
>> >> >> it's reasonable to do GSO first, then fragment any packet bigger than
>> >> >> MTU as possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>> >> >>  2. Runing iperf without -M, communication will stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should help in this case I
>> >> >believe.
>> >>
>> >> Problem here is host stack push local over-sized gso skb down to NIC, and perform GSO there
>> >> without any further ip segmentation.
>> >>
>> >> Reasonable behavior is do gso first at ip level, if gso-ed skb is bigger than MTU && df is set,
>> >> Then push ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust mtu.
>> >>
>> >> For PMTU to work, that's another issue I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> >> bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct sk_buff
>> >> >> *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
> Right, it didn't work on my first try and that explanation came to mind.
>
> Anyway, I am testing this with containers instead of VMs, so I am using
> veth and not Virtio-net.
>
> This is the actual stack trace:
>
> [...]
>   do_output
>   ovs_vport_send
>   vxlan_tnl_send
>   vxlan_xmit_skb
>   udp_tunnel_xmit_skb
>   iptunnel_xmit
>    \ skb_scrub_packet => skb->ignore_df = 0;
>   ip_local_out_sk
>   ip_output
>   ip_finish_output (_gso is inlined)
>   ip_fragment
>
> and on ip_fragment() it does:
>
>  503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
>  504                      (IPCB(skb)->frag_max_size &&
>  505                       IPCB(skb)->frag_max_size > mtu))) {
>  506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
>  507                 icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  508                           htonl(mtu));
>  509                 kfree_skb(skb);
>  510                 return -EMSGSIZE;
>  511         }
>
> Since IP_DF is set and skb->ignore_df is reset to 0, in my case
> the packet is dropped and an ICMP is sent back. The connection
> remains stuck as before. Doesn't virtio-net set DF bit?

I see now - I agree it's not clear how the original patch worked. The
DF bit is actually set by the encapsulator, so it should be the same
regardless of how the packet got there (in OVS it is on by default).
It seems like skb_scrub_packet() shouldn't clear skb->ignore_df (or, if
it should in other cases, then it seems we need an option to not do it
for tunnels).
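
For context, here is a paraphrased, abridged sketch of skb_scrub_packet()
as invoked from iptunnel_xmit() in the trace above (field list shortened;
exact contents vary by kernel version):

/* Abridged sketch: skb_scrub_packet() wipes per-path state when a
 * packet crosses into the tunnel.  The flagged line re-arms the DF
 * check that ip_fragment() later applies to the encapsulated skb.
 */
void skb_scrub_packet(struct sk_buff *skb, bool xnet)
{
	skb->tstamp.tv64 = 0;
	skb->pkt_type = PACKET_HOST;
	skb->skb_iif = 0;
	skb->ignore_df = 0;	/* <-- the reset discussed above */
	skb_dst_drop(skb);
	secpath_reset(skb);
	nf_reset(skb);

	if (!xnet)
		return;

	skb_orphan(skb);
	skb->mark = 0;
}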
Fan Du Dec. 3, 2014, 1:58 a.m. UTC | #17
>-----Original Message-----
>From: Flavio Leitner [mailto:fbl@redhat.com]
>Sent: Wednesday, December 3, 2014 5:33 AM
>To: Jesse Gross
>Cc: Du, Fan; Jason Wang; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On Tue, Dec 02, 2014 at 10:06:53AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 7:44 AM, Flavio Leitner <fbl@redhat.com> wrote:
>> > On Sun, Nov 30, 2014 at 10:08:32AM +0000, Du, Fan wrote:
>> >>
>> >>
>> >> >-----Original Message-----
>> >> >From: Jason Wang [mailto:jasowang@redhat.com]
>> >> >Sent: Friday, November 28, 2014 3:02 PM
>> >> >To: Du, Fan
>> >> >Cc: netdev@vger.kernel.org; davem@davemloft.net; fw@strlen.de; Du,
>> >> >Fan
>> >> >Subject: Re: [PATCH net] gso: do GSO for local skb with size
>> >> >bigger than MTU
>> >> >
>> >> >
>> >> >
>> >> >On Fri, Nov 28, 2014 at 2:33 PM, Fan Du <fan.du@intel.com> wrote:
>> >> >> Test scenario: two KVM guests sitting in different hosts
>> >> >> communicate to each other with a vxlan tunnel.
>> >> >>
>> >> >> All interface MTU is default 1500 Bytes, from guest point of
>> >> >> view, its skb gso_size could be as bigger as 1448Bytes, however
>> >> >> after guest skb goes through vxlan encapuslation, individual
>> >> >> segments length of a gso packet could exceed physical NIC MTU
>> >> >> 1500, which will be lost at recevier side.
>> >> >>
>> >> >> So it's possible in virtualized environment, locally created skb
>> >> >> len after encapslation could be bigger than underlayer MTU. In
>> >> >> such case, it's reasonable to do GSO first, then fragment any
>> >> >> packet bigger than MTU as possible.
>> >> >>
>> >> >> +---------------+ TX     RX +---------------+
>> >> >> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> >> >> +-+-----------+-+           +-+-----------+-+
>> >> >>   |Qemu/VirtIO|               |Qemu/VirtIO|
>> >> >>   +-----------+               +-----------+
>> >> >>        |                            |
>> >> >>        v tap0                  tap0 v
>> >> >>   +-----------+               +-----------+
>> >> >>   | ovs bridge|               | ovs bridge|
>> >> >>   +-----------+               +-----------+
>> >> >>        | vxlan                vxlan |
>> >> >>        v                            v
>> >> >>   +-----------+               +-----------+
>> >> >>   |    NIC    |    <------>   |    NIC    |
>> >> >>   +-----------+               +-----------+
>> >> >>
>> >> >> Steps to reproduce:
>> >> >>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>> >> >>  2. Runing iperf without -M, communication will stuck.
>> >> >
>> >> >Is this issue specific to ovs or ipv4? Path MTU discovery should
>> >> >help in this case I believe.
>> >>
>> >> Problem here is host stack push local over-sized gso skb down to
>> >> NIC, and perform GSO there without any further ip segmentation.
>> >>
>> >> Reasonable behavior is do gso first at ip level, if gso-ed skb is
>> >> bigger than MTU && df is set, Then push
>ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message back to sender to adjust
>mtu.
>> >>
>> >> For PMTU to work, that's another issue I will try to address later on.
>> >>
>> >> >>
>> >> >>
>> >> >> Signed-off-by: Fan Du <fan.du@intel.com>
>> >> >> ---
>> >> >>  net/ipv4/ip_output.c |    7 ++++---
>> >> >>  1 files changed, 4 insertions(+), 3 deletions(-)
>> >> >>
>> >> >> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index
>> >> >> bc6471d..558b5f8 100644
>> >> >> --- a/net/ipv4/ip_output.c
>> >> >> +++ b/net/ipv4/ip_output.c
>> >> >> @@ -217,9 +217,10 @@ static int ip_finish_output_gso(struct
>> >> >> sk_buff
>> >> >> *skb)
>> >> >>    struct sk_buff *segs;
>> >> >>    int ret = 0;
>> >> >>
>> >> >> -  /* common case: locally created skb or seglen is <= mtu */
>> >> >> -  if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
>> >> >> -        skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >> +  /* Both locally created skb and forwarded skb could exceed
>> >> >> +   * MTU size, so make a unified rule for them all.
>> >> >> +   */
>> >> >> +  if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>> >> >>            return ip_finish_output2(skb);
>> >
>> >
>> > Are you using kernel's vxlan device or openvswitch's vxlan device?
>> >
>> > Because for kernel's vxlan devices the MTU accounts for the header
>> > overhead so I believe your patch would work.  However, the MTU is
>> > not visible for the ovs's vxlan devices, so that wouldn't work.
>>
>> This is being called after the tunnel code, so the MTU that is being
>> looked at in all cases is the physical device's. Since the packet has
>> already been encapsulated, tunnel header overhead is already accounted
>> for in skb_gso_network_seglen() and this should be fine for both OVS
>> and non-OVS cases.
>
>Right, it didn't work on my first try and that explanation came to mind.
>
>Anyway, I am testing this with containers instead of VMs, so I am using veth and
>not Virtio-net.
>
>This is the actual stack trace:
>
>[...]
>  do_output
>  ovs_vport_send
>  vxlan_tnl_send
>  vxlan_xmit_skb
>  udp_tunnel_xmit_skb
>  iptunnel_xmit
>   \ skb_scrub_packet => skb->ignore_df = 0;
>  ip_local_out_sk
>  ip_output
>  ip_finish_output (_gso is inlined)
>  ip_fragment
>
>and on ip_fragment() it does:
>
> 503         if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
> 504                      (IPCB(skb)->frag_max_size &&
> 505                       IPCB(skb)->frag_max_size > mtu))) {
> 506                 IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
> 507                 icmp_send(skb, ICMP_DEST_UNREACH,
>ICMP_FRAG_NEEDED,
> 508                           htonl(mtu));
> 509                 kfree_skb(skb);
> 510                 return -EMSGSIZE;
> 511         }
>
>Since IP_DF is set and skb->ignore_df is reset to 0, in my case the packet is
>dropped and an ICMP is sent back. The connection remains stuck as before.
>Doesn't virtio-net set DF bit?

Thanks for giving it a try and seeing what really happens.

You're almost there! IP segmentation honors IP_DF, and this bit is taken care of by the vxlan interface.
In a practical env, the vxlan interface should take a conservative attitude and allow fragmentation
by appending "options: df_default=false" when creating the vxlan interface.

Why allow fragmentation? Because a Guest or Container may send over-MTU-sized packets downwards.
The Host is expected to be prepared for such an incident. This is just what happens in a real-world cloud env.


>Thanks,
>fbl
Fan Du Dec. 3, 2014, 2:31 a.m. UTC | #18
>-----Original Message-----
>From: Thomas Graf [mailto:tgr@infradead.org] On Behalf Of Thomas Graf
>Sent: Wednesday, December 3, 2014 1:42 AM
>To: Michael S. Tsirkin
>Cc: Du, Fan; 'Jason Wang'; netdev@vger.kernel.org; davem@davemloft.net;
>fw@strlen.de; dev@openvswitch.org; jesse@nicira.com; pshelar@nicira.com
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> > > What about containers or any other virtualization environment that
>> > > doesn't use Virtio?
>> >
>> > The host can dictate the MTU in that case for both veth or OVS
>> > internal which would be primary container plumbing techniques.
>>
>> It typically can't do this easily for VMs with emulated devices:
>> real ethernet uses a fixed MTU.
>>
>> IMHO it's confusing to suggest MTU as a fix for this bug, it's an
>> unrelated optimization.
>> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>
>PMTU discovery only resolves the issue if an actual IP stack is running inside the
>VM. This may not be the case at all.
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^

Some thoughts here:

Looking at it the other way, this is indeed a case where the host stack should forge an
ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED message built from the _inner_ skb network and transport
headers, apply whatever type of encapsulation is in use, and then push that packet upward to the
guest/container, so that it looks as if an intermediate node or the peer sent the message. PMTU
discovery would then be expected to work correctly, and the same behavior could be shared by any
other encapsulation technology that suffers from this problem.
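
As a rough sketch of that idea (a hypothetical helper, not existing kernel code; the function name and parameters are made up): once the host has the inner IP headers in hand, it could report an MTU reduced by the encapsulation overhead via icmp_send(), which builds the FRAG_NEEDED message from the inner packet's addresses.

    #include <linux/skbuff.h>
    #include <linux/icmp.h>
    #include <net/icmp.h>

    /* Hypothetical sketch: skb is assumed to already point at the *inner*
     * IP packet.  Report an MTU reduced by the tunnel overhead so the
     * sender's PMTU cache converges on a value that still fits after
     * encapsulation. */
    static void forge_frag_needed(struct sk_buff *skb, unsigned int phys_mtu,
                                  unsigned int encap_len)
    {
            icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                      htonl(phys_mtu - encap_len));
    }

Whether such a forged message can actually be delivered back to the sending guest is debated later in the thread.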


>I agree that exposing an MTU towards the guest is not applicable in all situations,
>in particular because it is difficult to decide what MTU to expose. It is a relatively
>elegant solution in a lot of virtualization host cases hooked up to an orchestration
>system though.
David Miller Dec. 3, 2014, 3:23 a.m. UTC | #19
From: Fan Du <fan.du@intel.com>
Date: Fri, 28 Nov 2014 14:33:05 +0800

> Test scenario: two KVM guests sitting in different
> hosts communicate to each other with a vxlan tunnel.
> 
> All interface MTU is default 1500 Bytes, from guest point
> of view, its skb gso_size could be as bigger as 1448Bytes,
> however after guest skb goes through vxlan encapuslation,
> individual segments length of a gso packet could exceed
> physical NIC MTU 1500, which will be lost at recevier side.
> 
> So it's possible in virtualized environment, locally created
> skb len after encapslation could be bigger than underlayer
> MTU. In such case, it's reasonable to do GSO first,
> then fragment any packet bigger than MTU as possible.
> 
> +---------------+ TX     RX +---------------+
> |   KVM Guest   | -> ... -> |   KVM Guest   |
> +-+-----------+-+           +-+-----------+-+
>   |Qemu/VirtIO|               |Qemu/VirtIO|
>   +-----------+               +-----------+
>        |                            |
>        v tap0                  tap0 v
>   +-----------+               +-----------+
>   | ovs bridge|               | ovs bridge|
>   +-----------+               +-----------+
>        | vxlan                vxlan |
>        v                            v
>   +-----------+               +-----------+
>   |    NIC    |    <------>   |    NIC    |
>   +-----------+               +-----------+
> 
> Steps to reproduce:
>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>  2. Runing iperf without -M, communication will stuck.
> 
> Signed-off-by: Fan Du <fan.du@intel.com>

I really don't like this at all.

If guest sees a 1500 byte MTU, that's it's link layer MTU and it had
better be able to send 1500 byte packets onto the "wire".

If you cannot properly propagate the vxlan encapsulation overhead back
into the guest's MTU you must hide this problem from the rest of our
stack somehow.

Nothing we create inside the host should need the change that you
are making.
Fan Du Dec. 3, 2014, 3:32 a.m. UTC | #20
>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 11:23 AM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: Fan Du <fan.du@intel.com>
>Date: Fri, 28 Nov 2014 14:33:05 +0800
>
>> Test scenario: two KVM guests sitting in different hosts communicate
>> to each other with a vxlan tunnel.
>>
>> All interface MTU is default 1500 Bytes, from guest point of view, its
>> skb gso_size could be as bigger as 1448Bytes, however after guest skb
>> goes through vxlan encapuslation, individual segments length of a gso
>> packet could exceed physical NIC MTU 1500, which will be lost at
>> recevier side.
>>
>> So it's possible in virtualized environment, locally created skb len
>> after encapslation could be bigger than underlayer MTU. In such case,
>> it's reasonable to do GSO first, then fragment any packet bigger than
>> MTU as possible.
>>
>> +---------------+ TX     RX +---------------+
>> |   KVM Guest   | -> ... -> |   KVM Guest   |
>> +-+-----------+-+           +-+-----------+-+
>>   |Qemu/VirtIO|               |Qemu/VirtIO|
>>   +-----------+               +-----------+
>>        |                            |
>>        v tap0                  tap0 v
>>   +-----------+               +-----------+
>>   | ovs bridge|               | ovs bridge|
>>   +-----------+               +-----------+
>>        | vxlan                vxlan |
>>        v                            v
>>   +-----------+               +-----------+
>>   |    NIC    |    <------>   |    NIC    |
>>   +-----------+               +-----------+
>>
>> Steps to reproduce:
>>  1. Using kernel builtin openvswitch module to setup ovs bridge.
>>  2. Runing iperf without -M, communication will stuck.
>>
>> Signed-off-by: Fan Du <fan.du@intel.com>
>
>I really don't like this at all.
>
>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had better be able to
>send 1500 byte packets onto the "wire".

This patch makes it happen exactly as you put it.

>If you cannot properly propagate the vxlan encapsulation overhead back into the
>guest's MTU you must hide this problem from the rest of our stack somehow.

Again, this patch hides the problem so that the guest can believe it is able to send packets with an MTU of 1500 bytes.

>Nothing we create inside the host should need the change that you are making.
David Miller Dec. 3, 2014, 4:35 a.m. UTC | #21
From: "Du, Fan" <fan.du@intel.com>
Date: Wed, 3 Dec 2014 03:32:46 +0000

>>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had better be able to
>>send 1500 byte packets onto the "wire".
> 
> This patch makes it happens exactly as you putted.
> 
>>If you cannot properly propagate the vxlan encapsulation overhead back into the
>>guest's MTU you must hide this problem from the rest of our stack somehow.
> 
> Again, this patch hide this problem to make Guest feel it can send packet with MTU as 1500 bytes.

I said make the guest see the real MTU, not hide the real MTU by
fragmenting or spitting ICMP PMTU messages back.
Fan Du Dec. 3, 2014, 4:50 a.m. UTC | #22
>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 12:35 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: "Du, Fan" <fan.du@intel.com>
>Date: Wed, 3 Dec 2014 03:32:46 +0000
>
>>>If guest sees a 1500 byte MTU, that's it's link layer MTU and it had
>>>better be able to send 1500 byte packets onto the "wire".
>>
>> This patch makes it happens exactly as you putted.
>>
>>>If you cannot properly propagate the vxlan encapsulation overhead back
>>>into the guest's MTU you must hide this problem from the rest of our stack
>somehow.
>>
>> Again, this patch hide this problem to make Guest feel it can send packet with
>MTU as 1500 bytes.
>
>I said make the guest see the real MTU, not hide the real MTU by fragmenting or
>spitting ICMP PMTU messages back.

Do you have a better idea than this patch's approach for achieving what you said, one that
uses neither fragmentation nor ICMP messages and still caters for all kinds of tunnel
technology?

David Miller Dec. 3, 2014, 5:14 a.m. UTC | #23
From: "Du, Fan" <fan.du@intel.com>
Date: Wed, 3 Dec 2014 04:50:21 +0000

> Do you have any better idea to achieve what you said besides this patch approach
> without both fragmentation and ICMP message at the same time to cater for all kinds
> tunnel tech?

I am not obligated to figure out for you how to design a correctly
implemented patch.

But I am obligated to keep a bad change from going into the tree and
that is what I am doing.
Fan Du Dec. 3, 2014, 6:53 a.m. UTC | #24
>-----Original Message-----
>From: David Miller [mailto:davem@davemloft.net]
>Sent: Wednesday, December 3, 2014 1:15 PM
>To: Du, Fan
>Cc: netdev@vger.kernel.org; fw@strlen.de
>Subject: Re: [PATCH net] gso: do GSO for local skb with size bigger than MTU
>
>From: "Du, Fan" <fan.du@intel.com>
>Date: Wed, 3 Dec 2014 04:50:21 +0000
>
>> Do you have any better idea to achieve what you said besides this
>> patch approach without both fragmentation and ICMP message at the same
>> time to cater for all kinds tunnel tech?
>
>I am not obligated to figure out for you how to design a correctly implemented
>patch.
>
>But I am obligated to keep a bad change from going into the tree and that is what I
>am doing.

"bad" is not depending whether you say it or not, but what the real world needs and what
proper solution could be provided at the time being.



Michael S. Tsirkin Dec. 3, 2014, 9:03 a.m. UTC | #25
On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> >> > > What about containers or any other virtualization environment that
> >> > > doesn't use Virtio?
> >> >
> >> > The host can dictate the MTU in that case for both veth or OVS
> >> > internal which would be primary container plumbing techniques.
> >>
> >> It typically can't do this easily for VMs with emulated devices:
> >> real ethernet uses a fixed MTU.
> >>
> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> >> an unrelated optimization.
> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
> >
> > PMTU discovery only resolves the issue if an actual IP stack is
> > running inside the VM. This may not be the case at all.
> 
> It's also only really a correct thing to do if the ICMP packet is
> coming from an L3 node. If you are doing straight bridging then you
> have to resort to hacks like OVS had before, which I agree are not
> particularly desirable.

The issue seems to be that fundamentally, this is
bridging interfaces with variable MTUs (even if MTU values
on devices don't let us figure this out)-
that is already not straight bridging, and
I would argue sending ICMPs back is the right thing to do.


> > I agree that exposing an MTU towards the guest is not applicable
> > in all situations, in particular because it is difficult to decide
> > what MTU to expose. It is a relatively elegant solution in a lot
> > of virtualization host cases hooked up to an orchestration system
> > though.
> 
> I also think this is the right thing to do as a common case
> optimization and I know other platforms (such as Hyper-V) do it. It's
> not a complete solution so we still need the original patch in this
> thread to handle things transparently.

Well, as I believe David (and independently Jason) is saying, it looks like
the ICMPs we are sending back after applying the original patch have the
wrong MTU.

And if I understand what David is saying here, IP is also the wrong place to
do it.
Jesse Gross Dec. 3, 2014, 6:07 p.m. UTC | #26
On Wed, Dec 3, 2014 at 1:03 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
>> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
>> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
>> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
>> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
>> >> > > What about containers or any other virtualization environment that
>> >> > > doesn't use Virtio?
>> >> >
>> >> > The host can dictate the MTU in that case for both veth or OVS
>> >> > internal which would be primary container plumbing techniques.
>> >>
>> >> It typically can't do this easily for VMs with emulated devices:
>> >> real ethernet uses a fixed MTU.
>> >>
>> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
>> >> an unrelated optimization.
>> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
>> >
>> > PMTU discovery only resolves the issue if an actual IP stack is
>> > running inside the VM. This may not be the case at all.
>>
>> It's also only really a correct thing to do if the ICMP packet is
>> coming from an L3 node. If you are doing straight bridging then you
>> have to resort to hacks like OVS had before, which I agree are not
>> particularly desirable.
>
> The issue seems to be that fundamentally, this is
> bridging interfaces with variable MTUs (even if MTU values
> on devices don't let us figure this out)-
> that is already not straight bridging, and
> I would argue sending ICMPs back is the right thing to do.

How do you deal with the fact that there is no host IP stack inside
the tunnel? And isn't this exactly the same as the former OVS
implementation that you said you didn't like?

>> > I agree that exposing an MTU towards the guest is not applicable
>> > in all situations, in particular because it is difficult to decide
>> > what MTU to expose. It is a relatively elegant solution in a lot
>> > of virtualization host cases hooked up to an orchestration system
>> > though.
>>
>> I also think this is the right thing to do as a common case
>> optimization and I know other platforms (such as Hyper-V) do it. It's
>> not a complete solution so we still need the original patch in this
>> thread to handle things transparently.
>
> Well, as I believe David (and independently Jason) is saying, it looks like
> the ICMPs we are sending back after applying the original patch have the
> wrong MTU.

The problem is actually that the ICMP messages won't even go to the
sending VM because the host IP stack and the VM are isolated from each
other and there is no route.

> And if I understand what David is saying here, IP is also the wrong place to
> do it.

ICMP can't be the complete solution in any case because it only works
for IP traffic. I think there are only two full solutions: find a way
to adjust the guest MTU to the minimum MTU that its traffic could hit
in an L2 domain, or use fragmentation. ICMP could be a possible
optimization in the fragmentation case.
Michael S. Tsirkin Dec. 3, 2014, 6:38 p.m. UTC | #27
On Wed, Dec 03, 2014 at 10:07:42AM -0800, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 1:03 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Tue, Dec 02, 2014 at 10:12:04AM -0800, Jesse Gross wrote:
> >> On Tue, Dec 2, 2014 at 9:41 AM, Thomas Graf <tgraf@suug.ch> wrote:
> >> > On 12/02/14 at 07:34pm, Michael S. Tsirkin wrote:
> >> >> On Tue, Dec 02, 2014 at 05:09:27PM +0000, Thomas Graf wrote:
> >> >> > On 12/02/14 at 01:48pm, Flavio Leitner wrote:
> >> >> > > What about containers or any other virtualization environment that
> >> >> > > doesn't use Virtio?
> >> >> >
> >> >> > The host can dictate the MTU in that case for both veth or OVS
> >> >> > internal which would be primary container plumbing techniques.
> >> >>
> >> >> It typically can't do this easily for VMs with emulated devices:
> >> >> real ethernet uses a fixed MTU.
> >> >>
> >> >> IMHO it's confusing to suggest MTU as a fix for this bug, it's
> >> >> an unrelated optimization.
> >> >> ICMP_DEST_UNREACH/ICMP_FRAG_NEEDED is the right fix here.
> >> >
> >> > PMTU discovery only resolves the issue if an actual IP stack is
> >> > running inside the VM. This may not be the case at all.
> >>
> >> It's also only really a correct thing to do if the ICMP packet is
> >> coming from an L3 node. If you are doing straight bridging then you
> >> have to resort to hacks like OVS had before, which I agree are not
> >> particularly desirable.
> >
> > The issue seems to be that fundamentally, this is
> > bridging interfaces with variable MTUs (even if MTU values
> > on devices don't let us figure this out)-
> > that is already not straight bridging, and
> > I would argue sending ICMPs back is the right thing to do.
> 
> How do you deal with the fact that there is no host IP stack inside
> the tunnel? And isn't this exactly the same as the former OVS
> implementation that you said you didn't like?

I was talking about the high level requirement, not the implementation
here. I agree it's not at all trivial, we need to propagate this across
tunnels.

But let's agree on what we are trying to accomplish first.


> >> > I agree that exposing an MTU towards the guest is not applicable
> >> > in all situations, in particular because it is difficult to decide
> >> > what MTU to expose. It is a relatively elegant solution in a lot
> >> > of virtualization host cases hooked up to an orchestration system
> >> > though.
> >>
> >> I also think this is the right thing to do as a common case
> >> optimization and I know other platforms (such as Hyper-V) do it. It's
> >> not a complete solution so we still need the original patch in this
> >> thread to handle things transparently.
> >
> > Well, as I believe David (and independently Jason) is saying, it looks like
> > the ICMPs we are sending back after applying the original patch have the
> > wrong MTU.
> 
> The problem is actually that the ICMP messages won't even go to the
> sending VM because the host IP stack and the VM are isolated from each
> other and there is no route.

Exactly.
But all this is talking about implementation.

Let's agree on what we want to do first.

And in my mind, we typically want the originator to adjust its PMTU,
just for a given destination.
Sending an ICMP to the originating VM will typically accomplish this.




> > And if I understand what David is saying here, IP is also the wrong place to
> > do it.
> 
> ICMP can't be the complete solution in any case because it only works
> for IP traffic.

Let's be specific please.  What protocols do you most care about? IPX?

> I think there are only two full solutions: find a way
> to adjust the guest MTU to the minimum MTU that its traffic could hit
> in an L2 domain or fragmentation. ICMP could be a possible
> optimization in the fragmentation case.

Both approaches seem strange. You are sending 1 packet an hour to
some destination behind 100 tunnels. Why would you want to
cut down your MTU for all packets? On the other hand,
doubling the amount of packets because your MTU is off
by a couple of bytes will hurt performance significantly.

Still, if you want to cut down the MTU within the guest,
that's only an ifconfig away.
Most people would not want to bother, so I think it's a good
idea to make PMTU work properly for them.
Rick Jones Dec. 3, 2014, 6:56 p.m. UTC | #28
Trying to "fake-out" an ICMP message to paper-over "devices" in the 
"middle" of same Layer2 network having different MTUs from the ends goes 
back to at least the days when people started joining FDDI networks to 
Ethernet networks with bridges rather than routers.  Perhaps back even 
further.  It had problems them (including not all traffic being IP), I 
doubt today would be any different.

All devices in a Layer2 network must have the same MTU. (*)

The only exception to that which does not lead one down a rathole is 
that you can have the MTU at the "edges" of the Layer 2 network be 
smaller than the MTU in the "middle."  And then only if the middle 
"never" tries to talk itself to the edges directly.

rick jones

(*) Or at least any communicating device must be kept ignorant of what 
one does to have a smaller MTU in there somewhere.
Jesse Gross Dec. 3, 2014, 7:38 p.m. UTC | #29
On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Wed, Dec 03, 2014 at 10:07:42AM -0800, Jesse Gross wrote:.
>> ICMP can't be the complete solution in any case because it only works
>> for IP traffic.
>
> Let's be specific please.  What protocols do you most care about? IPX?
>
>> I think there are only two full solutions: find a way
>> to adjust the guest MTU to the minimum MTU that its traffic could hit
>> in an L2 domain or fragmentation. ICMP could be a possible
>> optimization in the fragmentation case.
>
> Both approaches seem strange. You are sending 1 packet an hour to
> some destination behind 100 tunnels. Why would you want to
> cut down your MTU for all packets? On the other hand,
> doubling the amount of packets because your MTU is off
> by a couple of bytes will hurt performance significantly.
>
> Still, if you want to cut down the MTU within guest,
> that's only an ifconfig away.
> Most people would not want to bother, I think it's a good
> idea to make PMTU work properly for them.

I care about correctness first, which means that an Ethernet link
being exposed to the guest should behave like Ethernet. So, yes, IPX
should work if somebody chooses to do that.

Your comments are about performance optimization. That's fine but
without a correct base to start from it seems like putting the cart
before the horse and is hard to reason about.
Thomas Graf Dec. 3, 2014, 10:02 p.m. UTC | #30
On 12/03/14 at 11:38am, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > Both approaches seem strange. You are sending 1 packet an hour to
> > some destination behind 100 tunnels. Why would you want to
> > cut down your MTU for all packets? On the other hand,
> > doubling the amount of packets because your MTU is off
> > by a couple of bytes will hurt performance significantly.
> >
> > Still, if you want to cut down the MTU within guest,
> > that's only an ifconfig away.
> > Most people would not want to bother, I think it's a good
> > idea to make PMTU work properly for them.
> 
> I care about correctness first, which means that an Ethernet link
> being exposed to the guest should behave like Ethernet. So, yes, IPX
> should work if somebody chooses to do that.
> 
> Your comments are about performance optimization. That's fine but
> without a correct base to start from it seems like putting the cart
> before the horse and is hard to reason about.

I agree with Jesse in particular about correctness, but Michael has a
point (which I think nobody objects to), which is that it may not always
make sense to force the MTU onto the guest. It clearly makes sense for
the edge server connected to an overlay, but it may not be ideal if
WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

That said, I think it is fair to assume that the host knows what role
it plays and can be configured accordingly, i.e. a Netlink API which
exposes the encap overhead so libvirt can max() over it and force the
result onto the guest, or something along those lines.
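
A minimal sketch of that calculation, under the assumption of a hypothetical orchestration-side helper (nothing like this exists in libvirt today): take the physical MTU, subtract the worst-case encapsulation overhead the host exposes, and push the result into the guest.

    /* Hypothetical guest-MTU calculation: phys_mtu is the NIC MTU and
     * encap_overhead[] holds the per-tunnel header overheads the host
     * would expose (roughly 50 bytes for VXLAN over IPv4, for example). */
    static unsigned int guest_mtu(unsigned int phys_mtu,
                                  const unsigned int *encap_overhead, int n)
    {
            unsigned int worst = 0;
            int i;

            for (i = 0; i < n; i++)
                    if (encap_overhead[i] > worst)
                            worst = encap_overhead[i];

            return phys_mtu - worst;
    }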
Michael S. Tsirkin Dec. 3, 2014, 10:50 p.m. UTC | #31
On Wed, Dec 03, 2014 at 10:02:44PM +0000, Thomas Graf wrote:
> On 12/03/14 at 11:38am, Jesse Gross wrote:
> > On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > > Both approaches seem strange. You are sending 1 packet an hour to
> > > some destination behind 100 tunnels. Why would you want to
> > > cut down your MTU for all packets? On the other hand,
> > > doubling the amount of packets because your MTU is off
> > > by a couple of bytes will hurt performance significantly.
> > >
> > > Still, if you want to cut down the MTU within guest,
> > > that's only an ifconfig away.
> > > Most people would not want to bother, I think it's a good
> > > idea to make PMTU work properly for them.
> > 
> > I care about correctness first, which means that an Ethernet link
> > being exposed to the guest should behave like Ethernet. So, yes, IPX
> > should work if somebody chooses to do that.
> > 
> > Your comments are about performance optimization. That's fine but
> > without a correct base to start from it seems like putting the cart
> > before the horse and is hard to reason about.
> 
> I agree with Jesse in particular about correctnes but Michael has a
> point (which I thing nobod objects to) which is that it may not always
> make sense to force the MTU onto the guest. It clearly makes sense for
> the edge server connected to an overlay but it may not be ideal if
> WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

And it's not like tweaking local MTU for one interface will
magically fix everything.

> That said, I think it is fair to assume that the host knows what role
> it plays and can be configured accordingly, i.e. a Netlink API which
> exposes the encap overhead so libvirt can max() over it force it onto
> the guest or something along those lines.

I'd say let's try to at least fix IP traffic properly.
Jesse Gross Dec. 3, 2014, 10:51 p.m. UTC | #32
On Wed, Dec 3, 2014 at 2:02 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 11:38am, Jesse Gross wrote:
>> On Wed, Dec 3, 2014 at 10:38 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>> > Both approaches seem strange. You are sending 1 packet an hour to
>> > some destination behind 100 tunnels. Why would you want to
>> > cut down your MTU for all packets? On the other hand,
>> > doubling the amount of packets because your MTU is off
>> > by a couple of bytes will hurt performance significantly.
>> >
>> > Still, if you want to cut down the MTU within guest,
>> > that's only an ifconfig away.
>> > Most people would not want to bother, I think it's a good
>> > idea to make PMTU work properly for them.
>>
>> I care about correctness first, which means that an Ethernet link
>> being exposed to the guest should behave like Ethernet. So, yes, IPX
>> should work if somebody chooses to do that.
>>
>> Your comments are about performance optimization. That's fine but
>> without a correct base to start from it seems like putting the cart
>> before the horse and is hard to reason about.
>
> I agree with Jesse in particular about correctnes but Michael has a
> point (which I thing nobod objects to) which is that it may not always
> make sense to force the MTU onto the guest. It clearly makes sense for
> the edge server connected to an overlay but it may not be ideal if
> WAN traffic is VXLAN encapped and local DC traffic is put onto a VLAN.

The question is whether you would do this in a single L2 segment. It's
possible, of course, but probably not a great idea and I'm not sure
that it's really worth optimizing for. We do have one existing example
of this type of MTU reduction - the bridge device when you attach
multiple devices with varying MTUs (including a VXLAN device). In that
case, the bridge device is effectively acting as a connection point,
similar to virtio in a VM.

My proposal would be something like this:
 * For L2, reduce the VM MTU to the lowest common denominator on the segment.
 * For L3, use path MTU discovery or fragment inner packet (i.e.
normal routing behavior).
 * As a last resort (such as if using an old version of virtio in the
guest), fragment the tunnel packet.
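
For illustration, a minimal sketch of that decision order (purely hypothetical names and logic, not existing code):

    #include <linux/types.h>

    /* Hypothetical decision order matching the three cases above. */
    enum oversize_policy {
            LOWER_GUEST_MTU,        /* L2: shrink the VM MTU to the lowest common denominator */
            PMTU_OR_INNER_FRAG,     /* L3: ICMP FRAG_NEEDED or fragment the inner packet */
            FRAG_TUNNEL_PACKET,     /* last resort: fragment the encapsulated packet */
    };

    static enum oversize_policy pick_policy(bool routed, bool guest_mtu_adjustable)
    {
            if (routed)
                    return PMTU_OR_INNER_FRAG;

            return guest_mtu_adjustable ? LOWER_GUEST_MTU : FRAG_TUNNEL_PACKET;
    }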
Thomas Graf Dec. 3, 2014, 11:05 p.m. UTC | #33
On 12/03/14 at 02:51pm, Jesse Gross wrote:
> My proposal would be something like this:
>  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>  * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>  * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.

That's what I had in mind as well, although using a differentiator bit
which indicates to the output path whether the packet is to be
considered switched or routed and thus whether to send ICMPs. The bit would be set
per flow, thus allowing arbitrary granularity of behaviour.

I haven't fully thought this through yet though.
Jesse Gross Dec. 4, 2014, 12:54 a.m. UTC | #34
On Wed, Dec 3, 2014 at 3:05 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 02:51pm, Jesse Gross wrote:
>> My proposal would be something like this:
>>  * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>>  * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>  * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>
> That's what I had in mind as well although using a differentiator bit
> which indicates to the output path whether the packet is to be
> considered switched or routed and thus send ICMPs. The bit would be set
> per flow, thus allowing arbitary granularity of behaviour.

I don't think that we actually need a bit. I would expect that ICMP
generation to be coupled with routing (although this requires being
able to know what the ultimate MTU is at the time of routing the inner
packet). If that's the case, you just need to steer between L2 and L3
processing in the same way that you would today and ICMP would just
come in the right cases.
Thomas Graf Dec. 4, 2014, 1:15 a.m. UTC | #35
On 12/03/14 at 04:54pm, Jesse Gross wrote:
> I don't think that we actually need a bit. I would expect that ICMP
> generation to be coupled with routing (although this requires being
> able to know what the ultimate MTU is at the time of routing the inner
> packet). If that's the case, you just need to steer between L2 and L3
> processing in the same way that you would today and ICMP would just
> come in the right cases.

I think the MTU awareness is solveable but how do you steer between
L2 and L3? How do you differentiate between an L3 ACL rule in L2 mode
and an actual L3 based forward? dec_ttl? This is what drove me to
the user controlled bit and it became appealing as it allows to
enable/disable PMTU per guest or even per flow/route.
Jesse Gross Dec. 4, 2014, 1:51 a.m. UTC | #36
On Wed, Dec 3, 2014 at 5:15 PM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 04:54pm, Jesse Gross wrote:
>> I don't think that we actually need a bit. I would expect that ICMP
>> generation to be coupled with routing (although this requires being
>> able to know what the ultimate MTU is at the time of routing the inner
>> packet). If that's the case, you just need to steer between L2 and L3
>> processing in the same way that you would today and ICMP would just
>> come in the right cases.
>
> I think the MTU awareness is solveable but how do you steer between
> L2 and L3? How do you differentiate between an L3 ACL rule in L2 mode
> and an actual L3 based forward? dec_ttl? This is what drove me to
> the user controlled bit and it became appealing as it allows to
> enable/disable PMTU per guest or even per flow/route.

I think it depends on where you put the PMTU check. If routing is
happening in OVS, where it is decomposed into several discrete actions
like set MAC and decrement TTL, then perhaps there is another explicit
action to check the MTU. If it is happening in the context of the IP
stack, then ICMP generation occurs automatically and you get it by
writing a flow that sends the packet there. In each case, it seems
like a flow would be steering you by way of an action to do routing, so
you would have fine-grained control. I don't see this as conflicting
with L3 ACLs in an L2 context, in the same way that you don't have to
automatically decrement the TTL.
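
A hypothetical sketch of such an explicit MTU-check action (no such action exists in the datapath; the name and semantics are assumptions): drop a DF-marked IPv4 packet that will not fit the egress MTU and signal FRAG_NEEDED back to the sender.

    #include <linux/errno.h>
    #include <linux/skbuff.h>
    #include <linux/icmp.h>
    #include <net/ip.h>
    #include <net/icmp.h>

    /* Hypothetical "check MTU" action: only meaningful on a routed path,
     * where returning FRAG_NEEDED to the sender is legitimate.  GSO
     * handling is omitted for brevity. */
    static int ovs_action_check_mtu(struct sk_buff *skb, unsigned int mtu)
    {
            if (skb->len > mtu &&
                (ip_hdr(skb)->frag_off & htons(IP_DF))) {
                    icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
                              htonl(mtu));
                    return -EMSGSIZE;
            }

            return 0;
    }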
FengYu LeiDian Dec. 4, 2014, 7:48 a.m. UTC | #37
On 2014/12/04 06:51, Jesse Gross wrote:
> My proposal would be something like this:
>   * For L2, reduce the VM MTU to the lowest common denominator on the segment.
>   * For L3, use path MTU discovery or fragment inner packet (i.e.
> normal routing behavior).
>   * As a last resort (such as if using an old version of virtio in the
> guest), fragment the tunnel packet.

After some investigation of the OpenvSwitch package, it seems that before
commit 06021dcb ("datapath: compat: Fix compilation 3.11") the OpenvSwitch
package was doing GSO on its own:

rpl_ip_local_out
   -> tnl_skb_gso_segment
       ^^^^^^^^^^^^^^^
          Perform GSO  in above function
     -> ip_local_out
           .
           .
         -> ip_finish_output

Which means that when an over-MTU-sized skb enters ip_local_out, it is not a GSO-type
skb, so the stack performs IP fragmentation and sends the fragments out.
So over-MTU-sized skbs did travel through the stack to the outside.

Why not drop such OpenvSwitch-level GSO operation after 3.10?
Thomas Graf Dec. 4, 2014, 9:26 a.m. UTC | #38
On 12/03/14 at 05:51pm, Jesse Gross wrote:
> I think it depends on where you put the PMTU check. If routing is
> happening in OVS where it is decomposed in several discrete actions
> like set MAC and decrement TTL then perhaps there is another explicit
> action to check the MTU. If it is happening in the context of the IP
> stack, then ICMP generation occurs automatically and if you get that
> if you write a flow to send a packet there. In each case, it seems
> like a flow would be steering you by way of an action to do routing so
> you would have fine grained control. I don't see this as conflicting
> with L3 ACLs in an L2 context in the same way that you don't have to
> automatically decrement the TTL.

OK, I was under the impression that you would detect the desire for L3
behaviour in OVS without an explicit action. Hence the worry about wrong
classification of L3 ACLs.

Whether we mark the packet and defer the MTU check, or we check right
in the action, is an implementation detail in the end. I think we
agree on the concept level that we need an API to control the behaviour.
Michael S. Tsirkin Dec. 4, 2014, 10:17 a.m. UTC | #39
On Wed, Dec 03, 2014 at 10:56:02AM -0800, Rick Jones wrote:
> Trying to "fake-out" an ICMP message to paper-over "devices" in the "middle"
> of same Layer2 network having different MTUs from the ends goes back to at
> least the days when people started joining FDDI networks to Ethernet
> networks with bridges rather than routers.  Perhaps back even further.  It
> had problems them (including not all traffic being IP), I doubt today would
> be any different.
> 
> All devices in a Layer2 network must have the same MTU. (*)
> 
> The only exception to that which does not lead one down a rathole is that
> you can have the MTU at the "edges" of the Layer 2 network be smaller than
> the MTU in the "middle."  And then only if the middle "never" tries to talk
> itself to the edges directly.
> 
> rick jones
> 
> (*) Or at least any communicating device must be kept ignorant of what one
> does to have a smaller MTU in there somewhere.

What I've been pointing out is that just because something is within a VM,
we can't assume it's an edge.
Jesse Gross Dec. 4, 2014, 11:19 p.m. UTC | #40
On Thu, Dec 4, 2014 at 1:26 AM, Thomas Graf <tgraf@suug.ch> wrote:
> On 12/03/14 at 05:51pm, Jesse Gross wrote:
>> I think it depends on where you put the PMTU check. If routing is
>> happening in OVS where it is decomposed in several discrete actions
>> like set MAC and decrement TTL then perhaps there is another explicit
>> action to check the MTU. If it is happening in the context of the IP
>> stack, then ICMP generation occurs automatically and if you get that
>> if you write a flow to send a packet there. In each case, it seems
>> like a flow would be steering you by way of an action to do routing so
>> you would have fine grained control. I don't see this as conflicting
>> with L3 ACLs in an L2 context in the same way that you don't have to
>> automatically decrement the TTL.
>
> OK, I was under the impression that you would detect L3 behaviour
> desire in OVS without an explicit action. Hence the worry on wrong
> classification for L3 ACLs.
>
> Whether we mark the packet and defer the MTU check or we check right
> in the action is an implementation detail in the end. I think we
> agree on concept level that we need an API to control behaviour.

Yes, I agree.
Jesse Gross Dec. 4, 2014, 11:23 p.m. UTC | #41
On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
> On 2014/12/04 06:51, Jesse Gross wrote:
>>
>> My proposal would be something like this:
>>   * For L2, reduce the VM MTU to the lowest common denominator on the
>> segment.
>>   * For L3, use path MTU discovery or fragment inner packet (i.e.
>> normal routing behavior).
>>   * As a last resort (such as if using an old version of virtio in the
>> guest), fragment the tunnel packet.
>
>
> After some investigation on OpenvSwitch package, it seems before this
> commit: 06021dcb "datapath: compat: Fix compilation 3.11" OpenvSwitch
> package is doing GSO on its own.
>
> rpl_ip_local_out
>   -> tnl_skb_gso_segment
>       ^^^^^^^^^^^^^^^
>          Perform GSO  in above function
>     -> ip_local_out
>           .
>           .
>         -> ip_finish_output
>
> Which means, when over-MTU-sized skb enter ip_local_out, it's not a gso
> type skb, so the stack perform ip fragmentation, and send them out.
> So, over-MTU-sized skb did travel through stack into outside.
>
> Why not dropping such OpenvSwitch level GSO operation after 3.10?

The change in 3.11 was that the tunnel infrastructure used by OVS was
upstreamed and shared by all implementations. It's not right to
perform GSO in OVS itself as it prevents the logic from being used by
other components. Breaking up the packet in OVS also eliminates some
of the benefits of GSO by shortening the optimized path and prevents
offloading to hardware.
FengYu LeiDian Dec. 5, 2014, 12:25 a.m. UTC | #42
On 2014/12/5 7:23, Jesse Gross wrote:
> On Wed, Dec 3, 2014 at 11:48 PM, Du Fan <fengyuleidian0615@gmail.com> wrote:
>> On 2014/12/04 06:51, Jesse Gross wrote:
>>> My proposal would be something like this:
>>>    * For L2, reduce the VM MTU to the lowest common denominator on the
>>> segment.
>>>    * For L3, use path MTU discovery or fragment inner packet (i.e.
>>> normal routing behavior).
>>>    * As a last resort (such as if using an old version of virtio in the
>>> guest), fragment the tunnel packet.
>>
>> After some investigation on OpenvSwitch package, it seems before this
>> commit: 06021dcb "datapath: compat: Fix compilation 3.11" OpenvSwitch
>> package is doing GSO on its own.
>>
>> rpl_ip_local_out
>>    -> tnl_skb_gso_segment
>>        ^^^^^^^^^^^^^^^
>>           Perform GSO  in above function
>>      -> ip_local_out
>>            .
>>            .
>>          -> ip_finish_output
>>
>> Which means, when over-MTU-sized skb enter ip_local_out, it's not a gso
>> type skb, so the stack perform ip fragmentation, and send them out.
>> So, over-MTU-sized skb did travel through stack into outside.
>>
>> Why not dropping such OpenvSwitch level GSO operation after 3.10?
> The change in 3.11 was that the tunnel infrastructure used by OVS was
> upstreamed and shared by all implementations. It's not right to
> perform GSO in OVS itself as it prevents the logic from being used by
> other components. Breaking up the packet in OVS also eliminates some
> of the benefits of GSO by shortening the optimized path and prevents
> offloading to hardware.
Thanks for your explanation, I understand its background better now.

diff mbox

Patch

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index bc6471d..558b5f8 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -217,9 +217,10 @@  static int ip_finish_output_gso(struct sk_buff *skb)
 	struct sk_buff *segs;
 	int ret = 0;
 
-	/* common case: locally created skb or seglen is <= mtu */
-	if (((IPCB(skb)->flags & IPSKB_FORWARDED) == 0) ||
-	      skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
+	/* Both locally created skb and forwarded skb could exceed
+	 * MTU size, so make a unified rule for them all.
+	 */
+	if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
 		return ip_finish_output2(skb);
 
 	/* Slowpath -  GSO segment length is exceeding the dst MTU.