| Message ID | 20170830230409.15176-1-saeedm@mellanox.com |
|---|---|
| State | Accepted, archived |
| Delegated to | David Miller |
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 31 Aug 2017 02:04:06 +0300

> The following changes provide GRE tunnel offloads for the mlx5 ethernet netdevice driver.
>
> For more details please see the tag log message below.
> Please pull and let me know if there's any problem.
>
> Note: this series doesn't conflict with the ongoing net mlx5 submission.

Looks good, pulled.
Saeed Mahameed <saeedm@mellanox.com> writes:

> The first patch from Gal and Ariel provides the mlx5 driver support for ConnectX capability to perform IP version identification and matching in order to distinguish between IPv4 and IPv6 without the need to specify the encapsulation type, thus perform RSS in MPLS automatically without specifying the MPLS ethertype. This patch will also serve for inner GRE IPv4/6 classification for inner GRE RSS.

I don't think this is legal at all, or did I misunderstand something?

<https://tools.ietf.org/html/rfc3032#section-2.2>

Thanks,
Hannes
On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> Saeed Mahameed <saeedm@mellanox.com> writes:
>
>> The first patch from Gal and Ariel provides the mlx5 driver support for ConnectX capability to perform IP version identification and matching in order to distinguish between IPv4 and IPv6 without the need to specify the encapsulation type, thus perform RSS in MPLS automatically without specifying the MPLS ethertype. This patch will also serve for inner GRE IPv4/6 classification for inner GRE RSS.
>
> I don't think this is legal at all, or did I misunderstand something?
>
> <https://tools.ietf.org/html/rfc3032#section-2.2>

It seems you misunderstood the cover letter. The HW will still identify MPLS (IPv4/IPv6) packets, using a new bit we specify in the HW steering rules rather than adding specific {MPLS ethertype} x {IPv4, IPv6} rules to classify MPLS IPv{4,6} traffic. Same functionality, but a better and more general way to approach it. Bottom line: the hardware is capable of processing MPLS headers and performing RSS on the inner packet (IPv4/6) without the driver needing to provide precise MPLS steering rules.

> Thanks,
> Hannes
Hi Saeed,

On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
>> Saeed Mahameed <saeedm@mellanox.com> writes:
>>
>>> The first patch from Gal and Ariel provides the mlx5 driver support for ConnectX capability to perform IP version identification and matching in order to distinguish between IPv4 and IPv6 without the need to specify the encapsulation type, thus perform RSS in MPLS automatically without specifying the MPLS ethertype. This patch will also serve for inner GRE IPv4/6 classification for inner GRE RSS.
>>
>> I don't think this is legal at all, or did I misunderstand something?
>>
>> <https://tools.ietf.org/html/rfc3032#section-2.2>
>
> It seems you misunderstood the cover letter. The HW will still identify MPLS (IPv4/IPv6) packets, using a new bit we specify in the HW steering rules rather than adding specific {MPLS ethertype} x {IPv4, IPv6} rules to classify MPLS IPv{4,6} traffic. Same functionality, but a better and more general way to approach it. Bottom line: the hardware is capable of processing MPLS headers and performing RSS on the inner packet (IPv4/6) without the driver needing to provide precise MPLS steering rules.

Sorry, I think I am still confused.

I just want to make sure that you don't, by default, use the first nibble after the MPLS bottom-of-stack label in any way as an indicator of whether it is an IPv4 or IPv6 packet. It can be anything. The forwarding equivalence class tells the stack which protocol you see.

If you match on the first nibble behind the MPLS bottom-of-stack label, the '4' or '6' could just as well be the first nibble of a MAC address, because the particular pseudowire is EoMPLS and uses no control word.

I wanted to mention it because, with the addition of e.g. VPLS, this could cause problems down the road and should at least be controllable. It is probably better to use Entropy Labels in future.

Thanks,
Hannes
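Hannes's objection can be made concrete with a short sketch (illustrative only; not driver, hardware, or kernel code, and the function names are hypothetical). It walks an MPLS label stack per RFC 3032 (each 4-byte entry is label:20, TC:3, S:1, TTL:8) and applies the first-nibble guess:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Find the offset of the payload that follows the bottom-of-stack
 * (S=1) MPLS label stack entry.  Each entry is 4 bytes; the S bit is
 * the low bit of the entry's third byte (RFC 3032).  Returns -1 on a
 * truncated stack. */
static int mpls_payload_offset(const uint8_t *pkt, size_t len, size_t *off_out)
{
    size_t off = 0;
    while (off + 4 <= len) {
        int bottom = pkt[off + 2] & 0x01;
        off += 4;
        if (bottom) {
            *off_out = off;
            return 0;
        }
    }
    return -1;
}

/* The heuristic under discussion: guess IPv4/IPv6 from the first
 * nibble after the bottom-of-stack label.  Returns 4, 6, or 0. */
static int guess_payload_version(const uint8_t *pkt, size_t len)
{
    size_t off;
    if (mpls_payload_offset(pkt, len, &off) < 0 || off >= len)
        return 0;
    uint8_t nibble = pkt[off] >> 4;
    return (nibble == 4 || nibble == 6) ? (int)nibble : 0;
}
```

A genuine IPv4 payload (first byte 0x45) and an EoMPLS frame without a control word whose destination MAC happens to begin with 0x42 both report "4": only the FEC, not the bits after the label stack, says what the payload is.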
On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> [...]
>> Bottom line: the hardware is capable of processing MPLS headers and performing RSS on the inner packet (IPv4/6) without the driver needing to provide precise MPLS steering rules.
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't, by default, use the first nibble after the MPLS bottom-of-stack label in any way as an indicator of whether it is an IPv4 or IPv6 packet. It can be anything. The forwarding equivalence class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom-of-stack label, the '4' or '6' could just as well be the first nibble of a MAC address, because the particular pseudowire is EoMPLS and uses no control word.
>
> I wanted to mention it because, with the addition of e.g. VPLS, this could cause problems down the road and should at least be controllable. It is probably better to use Entropy Labels in future.

Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you prefer), then all this protocol-specific DPI for RSS just goes away ;-)

Tom

> Thanks,
> Hannes
On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> [...]
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't, by default, use the first nibble after the MPLS bottom-of-stack label in any way as an indicator of whether it is an IPv4 or IPv6 packet. It can be anything. The forwarding equivalence class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom-of-stack label, the '4' or '6' could just as well be the first nibble of a MAC address, because the particular pseudowire is EoMPLS and uses no control word.
>
> I wanted to mention it because, with the addition of e.g. VPLS, this could cause problems down the road and should at least be controllable. It is probably better to use Entropy Labels in future.

Hi Hannes,

I see your concern now, but it still has nothing to do with the driver; the whole change only simplifies the driver code so that it does not push full-blown matching steering rules into the HW, replacing them with a one-bit command.

Regarding your concern, I will need to check with the HW guys and review the processing algorithm that identifies IP packets over MPLS, and will get back to you.

If there is really a problem, then yes, we might need to make it controllable.

> Thanks,
> Hannes
On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
>> [...]
>>
>> I wanted to mention it because, with the addition of e.g. VPLS, this could cause problems down the road and should at least be controllable. It is probably better to use Entropy Labels in future.
>
> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you prefer), then all this protocol-specific DPI for RSS just goes away ;-)

Hi Tom,

How does MPLS/UDP or GRE/UDP RSS work without protocol-specific DPI? Unlike VXLAN, those protocols are not over UDP, and you can't just play with the outer UDP source port, or can you?

Can you elaborate?

Thanks,
Saeed.

> Tom
>
>> Thanks,
>> Hannes
On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
> On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
>> [...]
>> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you prefer), then all this protocol-specific DPI for RSS just goes away ;-)
>
> Hi Tom,
>
> How does MPLS/UDP or GRE/UDP RSS work without protocol-specific DPI? Unlike VXLAN, those protocols are not over UDP, and you can't just play with the outer UDP source port, or can you?
>
> Can you elaborate?

Hi Saeed,

An encapsulator sets the UDP source port to be the flow entropy of the packet being encapsulated. So when the packet traverses the network, devices can base their hash just on the canonical 5-tuple, which is sufficient for ECMP and RSS. IPv6 flow label is even better, since middleboxes don't even need to look at the transport header; a packet is steered based on the 3-tuple of addresses and flow label. In the Linux stack, udp_flow_src_port is used by UDP encapsulations to set the source port. The flow label is similarly set by ip6_make_flowlabel. Both of these functions use skb->hash, which is computed by calling the flow dissector at most once per packet (if the packet was received with an L4 HW hash, or was locally originated on a connection, the hash does not need to be computed).

Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf as well as Davem's "Less is More" presentation, which highlights the virtues of protocol-generic HW mechanisms (https://www.youtube.com/watch?v=6VgmazGwL_Y).

Tom
On Sun, Sep 3, 2017 at 6:43 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> [...]
>> How does MPLS/UDP or GRE/UDP RSS work without protocol-specific DPI? Unlike VXLAN, those protocols are not over UDP, and you can't just play with the outer UDP source port, or can you?
>> Can you elaborate?
>
> An encapsulator sets the UDP source port to be the flow entropy of the packet being encapsulated. So when the packet traverses the network, devices can base their hash just on the canonical 5-tuple, which is sufficient for ECMP and RSS. IPv6 flow label is even better, since middleboxes don't even need to look at the transport header; a packet is steered based on the 3-tuple of addresses and flow label. In the Linux stack, udp_flow_src_port is used by UDP encapsulations to set the source port. The flow label is similarly set by ip6_make_flowlabel. Both of these functions use skb->hash, which is computed by calling the flow dissector at most once per packet (if the packet was received with an L4 HW hash, or was locally originated on a connection, the hash does not need to be computed).

Hi Tom,

Re all sorts of UDP encap, sure, we're all on the less-is-more thing and just RSS-ing on the IP+UDP encap header.

For GRE, I was trying to push back on RSS-ing on the inner headers, but as Saeed commented, we didn't see something simple through which the HW can do the spreading. To make sure I follow, you are saying that if this is gre6 tunneling:

net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
include/net/ipv6.h=static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
include/net/ipv6.h=static inline void ip6_set_txhash(struct sock *sk) { }
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
net/ipv6/ip6_gre.c=static int ip6gre_header(struct sk_buff *skb, struct net_device *dev,
net/ipv6/ip6_gre.c:        ip6_make_flowlabel(dev_net(dev), skb,
net/ipv6/ip6_output.c=int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
net/ipv6/ip6_output.c:    ip6_flow_hdr(hdr, tclass, ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_output.c=struct sk_buff *__ip6_make_skb(struct sock *sk,
net/ipv6/ip6_output.c:    ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_tunnel.c=int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
net/ipv6/ip6_tunnel.c:    ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6));

the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a manner similar to udp_flow_src_port, and if the receiver side hashes on the L3 IPv6 src/dst/flow label we'll get spreading? Nice!

Still, what do we do with IPv4 GRE tunnels? And what do we do with HW which isn't capable of RSS on the flow label?

> Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf as well as Davem's "Less is More" presentation, which highlights the virtues of protocol-generic HW mechanisms (https://www.youtube.com/watch?v=6VgmazGwL_Y).
> Re all sorts of UDP encap, sure, we're all on the less-is-more thing and just RSS-ing on the IP+UDP encap header.
>
> For GRE, I was trying to push back on RSS-ing on the inner headers, but as Saeed commented, we didn't see something simple through which the HW can do the spreading. To make sure I follow, you are saying that if this is gre6 tunneling:

It's pretty common that HW does this, since GRE has been in widespread use for a while.

> net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
> [...]

Seems right.

> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a manner similar to udp_flow_src_port, and if the receiver side hashes on the L3 IPv6 src/dst/flow label we'll get spreading? Nice!

As long as the network devices support it.

> Still, what do we do with IPv4 GRE tunnels? And what do we do with HW which isn't capable of RSS on the flow label?

Throw it out and buy hardware that supports flow label! ;-)

Seriously though, flow labels are the only reasonable way that RSS can be supported in IPv6. If a device tries to do DPI on IPv6, it will eventually need to be able to parse some number of extension headers, which, unlike in IPv4, are unbounded in size. So there are going to be circumstances in which a device either doesn't understand an EH, or the size of the EHs blows its parsing buffer limit, so it can't do the DPI. IMO, telling host implementations that we're not allowed to use extension headers because middleboxes can't support them is unacceptable...

Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the only reasonable way to handle receive checksums in IPv6.

Tom
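The 3-tuple steering Tom argues for can be sketched as follows (a toy mixing hash, purely illustrative; it is not the RSS Toeplitz hash nor any real NIC's algorithm): the receiver hashes only the IPv6 addresses and the 20-bit flow label, so no transport header or extension-header walk is ever needed.

```c
#include <assert.h>
#include <stdint.h>

/* Toy mixing hash over the IPv6 3-tuple (saddr, daddr, flow label).
 * Addresses are given as four 32-bit words each. */
static uint32_t rss_3tuple_hash(const uint32_t saddr[4],
                                const uint32_t daddr[4],
                                uint32_t flowlabel)
{
    uint32_t h = 0x9e3779b9u;                 /* arbitrary seed */
    for (int i = 0; i < 4; i++) {
        h ^= saddr[i]; h *= 0x85ebca6bu; h = (h << 13) | (h >> 19);
        h ^= daddr[i]; h *= 0xc2b2ae35u; h = (h << 13) | (h >> 19);
    }
    h ^= flowlabel & 0xfffffu;                /* flow label is 20 bits */
    h *= 0x85ebca6bu;
    return h ^ (h >> 16);
}
```

Two flows between the same pair of hosts still spread across queues because the sender varies the flow label per flow (as ip6_make_flowlabel does from skb->hash).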
On Sun, Sep 3, 2017 at 7:45 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Re all sorts of UDP encap, sure, we're all on the less-is-more thing and just RSS-ing on the IP+UDP encap header.
>> For GRE, I was trying to push back on RSS-ing on the inner headers, but as Saeed commented, we didn't see something simple through which the HW can do the spreading. To make sure I follow, you are saying that if this is gre6 tunneling:
>
> It's pretty common that HW does this, since GRE has been in widespread use for a while.

ECMP? I guess so.

>> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a manner similar to udp_flow_src_port, and if the receiver side hashes on the L3 IPv6 src/dst/flow label we'll get spreading? Nice!
>
> As long as the network devices support it.

Yeah, hopefully upcoming devices will support the thing.

>> Still, what do we do with IPv4 GRE tunnels? And what do we do with HW which isn't capable of RSS on the flow label?
>
> Throw it out and buy hardware that supports flow label! ;-)

Still, we came across IPv4 GRE customer environments.

> Seriously though, flow labels are the only reasonable way that RSS can be supported in IPv6. If a device tries to do DPI on IPv6, it will eventually need to be able to parse some number of extension headers, which, unlike in IPv4, are unbounded in size. So there are going to be circumstances in which a device either doesn't understand an EH, or the size of the EHs blows its parsing buffer limit, so it can't do the DPI. IMO, telling host implementations that we're not allowed to use extension headers because middleboxes can't support them is unacceptable...

Makes sense to me.

> Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the only reasonable way to handle receive checksums in IPv6.

Supported on mlx5.
Hello,

Saeed Mahameed <saeedm@dev.mellanox.co.il> writes:

[...]

> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
>> Sorry, I think I am still confused.
>>
>> I just want to make sure that you don't, by default, use the first nibble after the MPLS bottom-of-stack label in any way as an indicator of whether it is an IPv4 or IPv6 packet. It can be anything. The forwarding equivalence class tells the stack which protocol you see.
>>
>> If you match on the first nibble behind the MPLS bottom-of-stack label, the '4' or '6' could just as well be the first nibble of a MAC address, because the particular pseudowire is EoMPLS and uses no control word.
>>
>> I wanted to mention it because, with the addition of e.g. VPLS, this could cause problems down the road and should at least be controllable. It is probably better to use Entropy Labels in future.
>
> Hi Hannes,
>
> I see your concern now, but it still has nothing to do with the driver; the whole change only simplifies the driver code so that it does not push full-blown matching steering rules into the HW, replacing them with a one-bit command.

Sorry, I very much got that impression from the cover letter plus the descriptions of the patches.

> Regarding your concern, I will need to check with the HW guys and review the processing algorithm that identifies IP packets over MPLS, and will get back to you.
>
> If there is really a problem, then yes, we might need to make it controllable.

Thanks for looking into this. I do think it can be added as a feature, but I very much think such logic should be disabled by default.

Thanks,
Hannes
Tom Herbert <tom@herbertland.com> writes:

> An encapsulator sets the UDP source port to be the flow entropy of the packet being encapsulated. So when the packet traverses the network, devices can base their hash just on the canonical 5-tuple, which is sufficient for ECMP and RSS. IPv6 flow label is even better, since middleboxes don't even need to look at the transport header; a packet is steered based on the 3-tuple of addresses and flow label. In the Linux stack, udp_flow_src_port is used by UDP encapsulations to set the source port. The flow label is similarly set by ip6_make_flowlabel. Both of these functions use skb->hash, which is computed by calling the flow dissector at most once per packet (if the packet was received with an L4 HW hash, or was locally originated on a connection, the hash does not need to be computed).

This would require the MPLS stack copying the flow label of IPv6 connections between MPLS routers over their whole lifetime in the MPLS network. The same would hold for MPLS encapsulated inside UDP: the source port needs to be kept constant. This is very impractical. The hash for the flow label can often not be recomputed by interim routers, because they might lack knowledge of the upper-layer protocol.

UDP source port entropy still has the problem that we don't respect the source port as RSS entropy by default in network cards, because of possible fragmentation and thus possible reordering of packets. GRE does not have this problem and is way easier to identify in hardware.

Basically we need to tell network cards that they can use specific destination UDP ports where we allow the source port to be used in the RSS hash calculation. I don't see how this is any easier than just using GRE with a defined protocol field. I do like the combination of the IPv6 flow label + GRE.

Btw, people are using the GRE key as additional entropy without looking into the GRE payload.

Bye,
Hannes
On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@herbertland.com> writes:
>
>> [...]
>
> This would require the MPLS stack copying the flow label of IPv6 connections between MPLS routers over their whole lifetime in the MPLS network. The same would hold for MPLS encapsulated inside UDP: the source port needs to be kept constant. This is very impractical. The hash for the flow label can often not be recomputed by interim routers, because they might lack knowledge of the upper-layer protocol.

Hannes,

When the flow label is set, the packet will traverse the network and be ECMP-routed regardless of whether the payload is MPLS or anything else; the important characteristic is that network devices don't need to know how to parse MPLS (or GRE, or IPIP, or L2TP, or ESP, or ...) to provide good ECMP. At the source, the flow label or UDP source port needs to be generated. That can be based on DPI, derived from the MPLS entropy label, use the SPI in ESP, etc. I don't see anything special about MPLS in this regard.

> UDP source port entropy still has the problem that we don't respect the source port as RSS entropy by default in network cards, because of possible fragmentation and thus possible reordering of packets. GRE does not have this problem and is way easier to identify in hardware.
>
> Basically we need to tell network cards that they can use specific destination UDP ports where we allow the source port to be used in the RSS hash calculation. I don't see how this is any easier than just using GRE with a defined protocol field. I do like the combination of the IPv6 flow label + GRE.

No, we don't want any more port-specific configuration in NICs! The NIC should just fall back to a 3-tuple hash when it sees MF or a nonzero offset set in the IPv4 header. But even if it doesn't implement this, receiving OOO fragments is hardly the end of the world: IP packets are always allowed to be received out of order. If something breaks because in-order delivery is assumed, then that is the bug that needs to be fixed. So at best, handling fragmentation in this manner is a proposed optimization whose benefits pale next to getting good ECMP and RSS when encapsulation is in use.

> Btw, people are using the GRE key as additional entropy without looking into the GRE payload.

Sure, some are, but the GRE key is not defined to be flow entropy, and it's not ubiquitously used for that, so there's no guarantee it provides sufficient entropy or is even constant per flow. GRE/UDP (RFC 8086) was primarily written to allow a more consistent method (as was RFC 7510 for MPLS/UDP).

Tom
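Tom's fragment rule is cheap to state in code. A sketch (illustrative only, not any NIC's actual logic): include L4 ports in the RSS hash only when the IPv4 fragment field shows neither MF nor a nonzero offset, so every fragment of a datagram falls back to the same 3-tuple hash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decide whether L4 ports may feed the RSS hash.  frag_field is the
 * IPv4 flags/fragment-offset halfword in host byte order. */
static bool ipv4_can_hash_l4(uint16_t frag_field)
{
    bool more_fragments = (frag_field & 0x2000) != 0; /* MF flag */
    bool has_offset     = (frag_field & 0x1fff) != 0; /* offset, 8-byte units */
    return !more_fragments && !has_offset;            /* else: 3-tuple only */
}
```

The first fragment carries the ports but has MF set; hashing it on the 4-tuple while later fragments use the 3-tuple would split one datagram across queues, which is why MF alone already forces the fallback.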
Hello Tom, Tom Herbert <tom@herbertland.com> writes: > On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa > <hannes@stressinduktion.org> wrote: >> Tom Herbert <tom@herbertland.com> writes: >> >>> An encapsulator sets the UDP source port to be the flow entropy of the >>> packet being encapsulated. So when the packet traverses the network >>> devices can base their hash just on the canonical 5-tuple which is >>> sufficient for ECMP and RSS. IPv6 flow label is even better since the >>> middleboxes don't even need to look at the transport header, a packet >>> is steered based on the 3-tuple of addresses and flow label. In the >>> Linux stack, udp_flow_src_port is used by UDP encapsulations to set >>> the source port. Flow label is similarly set by ip6_make_flowlabel. >>> Both of these functions use the skb->hash which is computed by calling >>> flow dissector at most once per packet (if the packet was received >>> with an L4 HW hash or locally originated on a connection the hash does >>> not need to be computed). >> >> This would require the MPLS stack copying the flowlabel of IPv6 >> connections between MPLS routers over their whole lifetime in the MPLS >> network. The same would hold for MPLS encapsulated inside UDP, the >> source port needs to be kept constant. This is very impractical. The >> hash for the flow label can often not be recomputed by interim routers, >> because they might lack the knowledge of the upper layer protocol. >> > Hannes, > > When the flow label is set the packet will traverse the network and be > ECMP routed regardless of whether the payload is MPLS at anything > else-- the important characteristic is that network devices don't need > to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to > provide good ECMP. At a source the flow label or UDP source port needs > to be generated. That can be based on DPI, derived from the MPLS > entropy label, use SPI in ESP, etc. I don't see anything special about > MPLS in this regard. 
The MPLS circuit is only end to end in terms of IP processing if MPLS is used for multitenant separation. Normally the IP connection is done between two label switch routers, and thus is not end to end. One LSR will decapsulate the packet and throw the IP header away, do the label processing and will reencapsulate it with the new next hop information. To keep the assigned entropy alive it would have to save the UDP source port or flowlabel and patch the outgoing IP header again. This is certainly possible; it just seems unnatural. Normally every next hop does MPLS processing and thus the packet traverses up the stack. Special purpose (entropy) MPLS labels allow the stack to achieve RSS just based on the label stack and will be end-to-end in an MPLS cloud. >> UDP source port entropy still has the problem that we don't respect the >> source port as RSS entropy by default in network cards, because of >> possible fragmentation and thus possible reordering of packets. GRE does >> not have this problem and is way easier to identify by hardware. >> >> Basically we need to tell network cards that they can use specific >> destination UDP ports where we allow the source port to be used in RSS >> hash calculation. I don't see how this is any easier than just using GRE >> with a defined protocol field? I do like the combination of ipv6 >> flowlabel + GRE. >> > No, we don't any more want port specific configuration in NICs! The > NIC should just fallback to 3-tuple hash when it see MF or offset set > in IPv4 header. But even if it doesn't implement this, receiving OOO > fragments is hardly the end of the world-- IP packets are always > allowed to be received OOO. If something breaks because in order > delivery is assumed then that is the bug that needs to be fixed. So at > best handling fragmentation in this manner is proposed om > optimization whose benefits will pale to getting good ECMP and RSS > when encapsulation is in use. 
The problem is that you end up having two streams, one fragmented and one non-fragmented, but actually they belong to the same stream. It is known to break stuff, see: <https://patchwork.ozlabs.org/patch/59235/> I would agree with you, but we can't break existing setups, unfortunately. >> Btw. people are using the GRE Key as additional entropy without looking >> into the GRE payload. >> > Sure some are, but the GRE key is not defined to be flow entropy so > it's not ubiquitous it used for that so it gives sufficient entropy or > is even constant per flow. GRE/UDP (RFC8086) was primarily written to > allow a more consistent method (as was RFC7510 for doing MPLS/UDP). I agree with that, just wanted to mention it. Bye, Hannes
> The problem is that you end up having two streams, one fragmented and > one non-fragmented, but actually they belong to the same stream. It is > known to break stuff, see: > > <https://patchwork.ozlabs.org/patch/59235/> > > I would agree with you, but we can't break existing setups, > unfortunately. > I'm not sure what "existing setups" means here. If this is referring to the Internet then out of order packets and fragments abound-- any assumption of in order delivery for IP packets is useless there. Btw, TCP has exactly the same problem in this regard that UDP has with regard to fragmentation. The reason that TCP isn't considered an issue is the assumption that TCP will do PMTU discovery and set DF (I would bet that devices don't actually check for DF and vendors don't test when it's not set!). Tom
Hi Tom, Tom Herbert <tom@herbertland.com> writes: >> The problem is that you end up having two streams, one fragmented and >> one non-fragmented, but actually they belong to the same stream. It is >> known to break stuff, see: >> >> <https://patchwork.ozlabs.org/patch/59235/> >> >> I would agree with you, but we can't break existing setups, >> unfortunately. >> > I'm not sure what "existing setups" means here. If this is referring > to the Internet then out of order packets and fragments abound-- any > assumption of in order delivery for IP packets is useless there. Network cards don't know where traffic originates from, even on the same card. Clouds nowadays try to convince everyone to virtualize existing workloads. So I *assume* that, at least for traffic that appears to be within one L2 domain, out-of-order delivery should not occur or things will break. Ethernet has long appeared to deliver frames in order and effectively guarantee that. Thus to people it appeared that UDP packets are also strictly in order. Encapsulating Ethernet inside UDP thus must preserve those properties, especially if used for virtualization. Unfortunately fragmentation happens and the stack has to deal with it somehow. I do know about some software that uses UDP in multicast that is prone to misbehave in case of OoO frames. It uses a small reassembly queue, but if reordering gets too high, it will simply drop data. This might be acceptable up to a certain degree. I guess one application you could harm is plain old simple syslog, which often generates fragmented packets. Luckily one often has time stamps in there to reassemble the log lines, but one has to do this manually. L2 domains are not bound to local networks anymore, thanks to vxlan etc. If you are in a controlled environment and you know your software runs perfectly well, certainly, no doubt, UDP hashing can be enabled. I would argue we can't do that generally. 
> Btw, TCP has exactly the same problem in this regard that UDP has with > regard to fragmentation. The reason that TCP isn't considered an issue > is the assumption that TCP will do PMTU discovery and set DF (I would > bet that devices don't actually check for DF and vendors don't test > when it's not set!). I don't know. For the application the stream will appear in order at the socket level and OoO or fragmentation won't break anything. End systems will have a harder time reassembling the stream and thus fast paths couldn't be used. Is that what you are referring to? Thanks, Hannes
On Mon, Sep 4, 2017 at 10:57 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > Hi Tom, > > Tom Herbert <tom@herbertland.com> writes: > >>> The problem is that you end up having two streams, one fragmented and >>> one non-fragmented, but actually they belong to the same stream. It is >>> known to break stuff, see: >>> >>> <https://patchwork.ozlabs.org/patch/59235/> >>> >>> I would agree with you, but we can't break existing setups, >>> unfortunately. >>> >> I'm not sure what "existing setups" means here. If this is referring >> to the Internet then out of order packets and fragments abound-- any >> assumption of in order delivery for IP packets is useless there. > > Network cards don't know where traffic originates from, even on the same > card. Clouds nowadays try to convince everyone to virtualize existing > workloads. So I *assume* that at least for traffic that seems to be in > one L2 domain out of order should not occur or things will break. > > Ethernet for a long time appeared to do very much in order delivery and > guarantees that. Thus for people it appeared that UDP packets are also > strictly in order. Encapsulating Ethernet inside UDP thus must preserve > those properties, especially if used for virtualization. Unfortunately > fragmentation happens and the stack has to deal with it somehow. > There is absolutely no requirement in IP that packets are delivered in order-- there never has been and there never will be! If the ULP, like Ethernet encapsulation, requires in-order delivery then it needs to implement that itself, the way TCP, GRE, and other protocols do with sequence numbers and reassembly. All of the hoops we jump through to make sure that packets always follow the same path and are always in order are done for the benefit of middlebox devices like stateful firewalls that have forced us to adopt their concept of network architecture-- in the long run this is self-defeating and kills our ability to innovate. 
I'm not saying that we shouldn't consider legacy devices, but we should scrutinize new development or solutions that perpetuate incorrect design or bad assumptions. Tom
Tom Herbert <tom@herbertland.com> writes: > There is absolutely no requirement in IP that packets are delivered in > order-- there never has been and there never will be! If the ULP, like > Ethernet encapsulation, requires in order deliver then it needs to > implement that itself like TCP, GRE, and other protocols ensure that > with sequence numbers and reassembly. All of these hoops we do make > sure that packets always follow the same path and are always in order > are done for benefit of middlebox devices like stateful firewalls that > have force us to adopt their concept of network architecture-- in the > long run this is self-defeating and kills our ability to innovate. > > I'm not saying that we shouldn't consider legacy devices, but we > should scrutinize new development or solutions that perpetuate > incorrect design or bad assumptions. So configure RSS per port and ensure no fragments are sent to those ports. This is possible and rather easy to do. It solves the problem with legacy software and it spreads out packets for your applications. It is not perfect, but it works and solves both problems. Bye, Hannes
On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa <hannes@stressinduktion.org> wrote: > Tom Herbert <tom@herbertland.com> writes: > >> There is absolutely no requirement in IP that packets are delivered in >> order-- there never has been and there never will be! If the ULP, like >> Ethernet encapsulation, requires in order deliver then it needs to >> implement that itself like TCP, GRE, and other protocols ensure that >> with sequence numbers and reassembly. All of these hoops we do make >> sure that packets always follow the same path and are always in order >> are done for benefit of middlebox devices like stateful firewalls that >> have force us to adopt their concept of network architecture-- in the >> long run this is self-defeating and kills our ability to innovate. >> >> I'm not saying that we shouldn't consider legacy devices, but we >> should scrutinize new development or solutions that perpetuate >> incorrect design or bad assumptions. > > So configure RSS per port and ensure no fragments are send to those > ports. This is possible and rather easy to do. It solves the problem > with legacy software and it spreads out packets for your applications. > > It is not perfect but it is working and solves both problems. > Hannes, I don't see how that solves anything. The purpose of RSS is to distribute the load of protocol packets across queues. This needs to work for UDP applications. For instance, if I were building a QUIC server I'd want the sort of flow distribution that a TCP server would give. You can't do that by configuring a few ports in the device. If I were to suggest any HW change it would be to not do DPI on fragments (MF or offset is set). This ensures that all packets of the fragment train get hashed to the same queue and is in fact what RPS has been doing for years without any complaints. But even before I'd make that recommendation, I'd really like understand what the problem actually is. 
The only thing I can garner from this discussion and the Intel patch is that when fragments are received OOO that is perceived as a problem. But the protocol specification clearly says this is not a problem. So the questions are: who is negatively affected by this? Is this a problem because some artificial test that checks for everything to be in order is now failing? Is this affecting real users? Is this an issue in the stack or really with some implementation outside of the stack? If it is an implementation outside of the stack, then are we just bandaid'ing over someone else's incorrect implementation by patching the kernel (like would have been the case if we had changed the kernel to interoperate with Facebook's switch that couldn't handle OOO in twstate). Thanks, Tom > Bye, > Hannes
Hi Tom, Tom Herbert <tom@herbertland.com> writes: > On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa > <hannes@stressinduktion.org> wrote: >> Tom Herbert <tom@herbertland.com> writes: >> >>> There is absolutely no requirement in IP that packets are delivered in >>> order-- there never has been and there never will be! If the ULP, like >>> Ethernet encapsulation, requires in order deliver then it needs to >>> implement that itself like TCP, GRE, and other protocols ensure that >>> with sequence numbers and reassembly. All of these hoops we do make >>> sure that packets always follow the same path and are always in order >>> are done for benefit of middlebox devices like stateful firewalls that >>> have force us to adopt their concept of network architecture-- in the >>> long run this is self-defeating and kills our ability to innovate. >>> >>> I'm not saying that we shouldn't consider legacy devices, but we >>> should scrutinize new development or solutions that perpetuate >>> incorrect design or bad assumptions. >> >> So configure RSS per port and ensure no fragments are send to those >> ports. This is possible and rather easy to do. It solves the problem >> with legacy software and it spreads out packets for your applications. >> >> It is not perfect but it is working and solves both problems. >> > Hannes, > > I don't see how that solves anything. The purpose of RSS is to > distribute the load of protocol packets across queues. This needs to > work for UDP applications. For instance, if I were building a QUIC > server I'd want the sort of flow distribution that a TCP server would > give. You can't do that by configuring a few ports in the device. I seriously am with you and I think you are on the right track. I would also much rather like to see fragmentation *not* happening and would like to see hardware logic *not* parsing deep down into packets and protocols. We can agree on that! 
Albeit I don't understand the problem with my approach: If you know that you are running a QUIC server and know your environment (safe against reordering), run `ethtool -N ethX rx-flow-hash udp4 sdfn' on the box and you would get spreading of UDP packets based on the UDP src+dst port numbers to the rx queues. Otherwise only source and destination addresses are being considered, which is the safe default (maybe this is sometimes enough). Intel network cards warn you if you actually switch the mode: -- >8 -- drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c: "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); drivers/net/ethernet/intel/igb/igb_ethtool.c: "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c: e_warn(drv, "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n"); -- 8< -- I argue that we can't change the default. Furthermore I tried to argue that in a cloud where you don't know which applications you run within the encapsulated networks, it is very hard to predict which protocols can be considered safe against those effects. I agree that, most of the time, it probably should not matter. But given the rule about kernel backwards compatibility I do have doubts that this change would go unnoticed. I am the devil's advocate here and argue it is on us to show that it is safe before we change the semantics. > If I were to suggest any HW change it would be to not do DPI on > fragments (MF or offset is set). This ensures that all packets of the > fragment train get hashed to the same queue and is on fact what RPS > has been doing for years without any complaints. The same UDP flow could consist of fragments and non-fragmented packets. Even if hardware acted according to what you wrote, reordering is very much possible within the same UDP flow. In the case of QUIC, as far as I know, it uses destination ports 80 and 443. 
Just as we already notify NICs about vxlan and geneve port numbers, we could selectively enable RSS for those UDP destination port numbers to tell the NIC it won't receive fragments on those ports and thus can do RSS. In the same way the NIC could be notified whether reordering on a port is possible. This could be done, for example, via a lightweight setsockopt. The situation with encapsulation is even more complicated: We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP constellation. If we do the fragmentation inside the vxlan tunnel and carry over the skb hash to all resulting UDP/vxlan packets' source ports, we are fine and reordering on the receiver NIC won't happen in this case. If the fragmentation happens on the outer UDP header, this will result in reordering of the inner L2 flow. Unfortunately this depends on how the vxlan tunnel was set up, how other devices do that, and (I believe) on the kernel version. Given that we tell the NICs the receiving port of vxlan tunnels anyway, and (I can only assume) they are doing RSS on that, we probably have reordering on those tunnels anyway. So maybe you are right and it doesn't matter in the end for encapsulated packets, by accident. > But even before I'd make that recommendation, I'd really like > understand what the problem actually is. The only thing I can garner > from this discussion and the Intel patch is that when fragments are > received OOO that is perceived as a problem. But the by the protocol > specification clearly says this is not a problem. So the questions > are: who is negatively affected by this? Is this a problem because > some artificial test that checks for everything to be in order is now > failing? Is this affecting real users? Is this an issue in the stack > or really with some implementation outside of the stack? 
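[Editorial note: the benign "fragmentation inside the tunnel" case Hannes describes can be sketched as follows. This is a hypothetical illustration -- VXLAN on Linux derives the outer source port from the inner flow hash via udp_flow_src_port(), but the hash, structure, and port range below are toy values, not kernel code.]

```c
#include <stdint.h>

struct inner_flow {
    uint32_t saddr, daddr;     /* inner IPv4 addresses */
    uint16_t frag_off_field;   /* inner IPv4 flags + fragment offset */
};

/* Fragment-safe inner flow hash: addresses only, so every fragment of
 * the inner datagram -- with or without an L4 header -- hashes alike. */
static uint32_t inner_flow_hash(const struct inner_flow *f)
{
    uint32_t h = f->saddr * 0x9e3779b1u;   /* toy mixing constants */
    h ^= f->daddr;
    h *= 0x85ebca6bu;
    return h;
}

/* Derive the outer UDP source port from the inner flow hash.  Since
 * all inner fragments share this hash, they share the outer 5-tuple,
 * and receive-side RSS keeps the whole train on one queue. */
static uint16_t outer_src_port(const struct inner_flow *f)
{
    uint32_t h = inner_flow_hash(f);
    uint32_t range = 60999 - 32768 + 1;    /* illustrative range */

    return (uint16_t)(32768 + ((h ^ (h >> 16)) % range));
}
```

The problematic case is the mirror image: if the *outer* datagram is fragmented after encapsulation, the later outer fragments carry no UDP header at all, and no source-port scheme can help the receiving NIC.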
If it is an > implementation outside of the stack, then are we just bandaid'ing over > someone else's incorrect implementation by patching the kernel (like > would have be the case if we change the kernel to interoperate with > Facebook's switch that couldn't handle OOO in twstate). I agree with that and I would also love to hear more feedback on this. Unfortunately I don't know. Bye, Hannes
> The situation with encapsulation is even more complicated: > > We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP > constellation. If we do the fragmentation inside the vxlan tunnel and > carry over the skb hash to all resulting UDP/vxlan packets source ports, > we are fine and reordering on the receiver NIC won't happen in this > case. If the fragmentation happens on the outer UDP header, this will > result in reordering of the inner L2 flow. Unfortunately this depends on > how the vxlan tunnel was set up, how other devices do that and (I > believe so) on the kernel version. > This really isn't that complicated. The assumption that an IP network always delivers packets in order is simply wrong. The inventors of VXLAN must have known full well that when you use IP, packets can and eventually will be delivered out of order. This isn't just because of fragmentation; there are many other reasons that packets can be delivered OOO. This also must have been known when IP/GRE and every other protocol that carries L2 over IP were invented. If OOO is an issue for these protocols then they need to be fixed-- this is not a concern with the IP protocol nor the stack. Tom
On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote: >> The situation with encapsulation is even more complicated: >> >> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP >> constellation. If we do the fragmentation inside the vxlan tunnel and >> carry over the skb hash to all resulting UDP/vxlan packets source ports, >> we are fine and reordering on the receiver NIC won't happen in this >> case. If the fragmentation happens on the outer UDP header, this will >> result in reordering of the inner L2 flow. Unfortunately this depends on >> how the vxlan tunnel was set up, how other devices do that and (I >> believe so) on the kernel version. >> > This really isn't that complicated. The assumption that an IP network > always delivers packets in order is simply wrong. The inventors of > VXLAN must have know full well that when you use IP, packets can and > eventually will be delivered out of order. This isn't just because of > fragmentation, there are many other reasons that packets can be > delivered OOO. This also must have been known when IP/GRE and any > other protocol that carries L2 over IP was invented. If OOO is an > issue for these protocols then they need to be fixed-- this is not a > concern with IP protocol nor the stack. > > Tom As a little background on the original patch: I believe the issue that was fixed by the patch was a video streaming application that was sending/receiving a mix of fragmented and non-fragmented packets. Receiving them out of order due to the fragmentation was causing issues with stutters in the video, and so we ended up disabling UDP RSS by default in the NICs listed. We decided to go that way as UDP RSS was viewed as a performance optimization, while the out-of-order problems were viewed as a functionality issue. The default for i40e is to have UDP RSS hashing enabled if I recall correctly. 
Basically as we move away from enterprise to cloud I suspect that is going to be the case more and more since all the UDP tunnels require either port recognition or UDP RSS to be enabled. For now we carry the ability to enable UDP RSS if desired in the legacy drivers, and I believe we have some white papers somewhere that suggest enabling it if you are planning to use UDP based tunnels. - Alex
On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote: >>> The situation with encapsulation is even more complicated: >>> >>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP >>> constellation. If we do the fragmentation inside the vxlan tunnel and >>> carry over the skb hash to all resulting UDP/vxlan packets source ports, >>> we are fine and reordering on the receiver NIC won't happen in this >>> case. If the fragmentation happens on the outer UDP header, this will >>> result in reordering of the inner L2 flow. Unfortunately this depends on >>> how the vxlan tunnel was set up, how other devices do that and (I >>> believe so) on the kernel version. >>> >> This really isn't that complicated. The assumption that an IP network >> always delivers packets in order is simply wrong. The inventors of >> VXLAN must have know full well that when you use IP, packets can and >> eventually will be delivered out of order. This isn't just because of >> fragmentation, there are many other reasons that packets can be >> delivered OOO. This also must have been known when IP/GRE and any >> other protocol that carries L2 over IP was invented. If OOO is an >> issue for these protocols then they need to be fixed-- this is not a >> concern with IP protocol nor the stack. >> >> Tom > > As far as a little background on the original patch I believe the > issue that was fixed by the patch was a video streaming application > that was sending/receiving a mix of fragmented and non-fragmented > packets. Receiving them out of order due to the fragmentation was > causing issues with stutters in the video and so we ended up disabling > UDP by default in the NICs listed. We decided to go that way as UDP > RSS was viewed as a performance optimization, while the out-of-order > problems were viewed as a functionality issue. > Hi Alex, Thanks for the details! 
Were you able to find the root cause for this? In particular, it would be interesting to know if it is the kernel or device that introduced the jitter, or if it's the application that doesn't handle OOO well... Tom
On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote: > On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote: >>>> The situation with encapsulation is even more complicated: >>>> >>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP >>>> constellation. If we do the fragmentation inside the vxlan tunnel and >>>> carry over the skb hash to all resulting UDP/vxlan packets source ports, >>>> we are fine and reordering on the receiver NIC won't happen in this >>>> case. If the fragmentation happens on the outer UDP header, this will >>>> result in reordering of the inner L2 flow. Unfortunately this depends on >>>> how the vxlan tunnel was set up, how other devices do that and (I >>>> believe so) on the kernel version. >>>> >>> This really isn't that complicated. The assumption that an IP network >>> always delivers packets in order is simply wrong. The inventors of >>> VXLAN must have know full well that when you use IP, packets can and >>> eventually will be delivered out of order. This isn't just because of >>> fragmentation, there are many other reasons that packets can be >>> delivered OOO. This also must have been known when IP/GRE and any >>> other protocol that carries L2 over IP was invented. If OOO is an >>> issue for these protocols then they need to be fixed-- this is not a >>> concern with IP protocol nor the stack. >>> >>> Tom >> >> As far as a little background on the original patch I believe the >> issue that was fixed by the patch was a video streaming application >> that was sending/receiving a mix of fragmented and non-fragmented >> packets. Receiving them out of order due to the fragmentation was >> causing issues with stutters in the video and so we ended up disabling >> UDP by default in the NICs listed. 
We decided to go that way as UDP >> RSS was viewed as a performance optimization, while the out-of-order >> problems were viewed as a functionality issue. >> > Hi Alex, > > Thanks for the details! Were you able to find the root cause for this? > In particular, it would be interesting to know if it is the kernel or > device that introduced the jitter, or if it's the application that > doesn't handle OOO well... > > Tom It is hard to say since my memory of the events from 7 years ago is pretty vague at this point, but I'm pretty sure it was the application. Basically getting the frames out of order was causing them to have to drop video data if I recall correctly. - Alex
On Wed, Sep 6, 2017 at 10:43 AM, Alexander Duyck <alexander.duyck@gmail.com> wrote: > On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote: >> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck >> <alexander.duyck@gmail.com> wrote: >>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote: >>>>> The situation with encapsulation is even more complicated: >>>>> >>>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP >>>>> constellation. If we do the fragmentation inside the vxlan tunnel and >>>>> carry over the skb hash to all resulting UDP/vxlan packets source ports, >>>>> we are fine and reordering on the receiver NIC won't happen in this >>>>> case. If the fragmentation happens on the outer UDP header, this will >>>>> result in reordering of the inner L2 flow. Unfortunately this depends on >>>>> how the vxlan tunnel was set up, how other devices do that and (I >>>>> believe so) on the kernel version. >>>>> >>>> This really isn't that complicated. The assumption that an IP network >>>> always delivers packets in order is simply wrong. The inventors of >>>> VXLAN must have know full well that when you use IP, packets can and >>>> eventually will be delivered out of order. This isn't just because of >>>> fragmentation, there are many other reasons that packets can be >>>> delivered OOO. This also must have been known when IP/GRE and any >>>> other protocol that carries L2 over IP was invented. If OOO is an >>>> issue for these protocols then they need to be fixed-- this is not a >>>> concern with IP protocol nor the stack. >>>> >>>> Tom >>> >>> As far as a little background on the original patch I believe the >>> issue that was fixed by the patch was a video streaming application >>> that was sending/receiving a mix of fragmented and non-fragmented >>> packets. 
Receiving them out of order due to the fragmentation was >>> causing issues with stutters in the video and so we ended up disabling >>> UDP by default in the NICs listed. We decided to go that way as UDP >>> RSS was viewed as a performance optimization, while the out-of-order >>> problems were viewed as a functionality issue. >>> >> Hi Alex, >> >> Thanks for the details! Were you able to find the root cause for this? >> In particular, it would be interesting to know if it is the kernel or >> device that introduced the jitter, or if it's the application that >> doesn't handle OOO well... >> >> Tom > > It is hard to say since my memory of the events from 7 years ago is > pretty vague at this point, but I'm pretty sure it was the > application. Basically getting the frames out of order was causing > them to have to drop video data if I recall correctly. > Oh, I didn't notice that patch was from 2010. Maybe the application has been fixed by now! :-) Perhaps, it's time to try to turn UDP hashing on again by default? Even if NICs aren't doing this, there are network devices that are looking at UDP for ECMP and that doesn't seem to be causing widespread problems currently... Tom