[pull,request,net-next,0/3] Mellanox, mlx5 GRE tunnel offloads

Message ID 20170830230409.15176-1-saeedm@mellanox.com
State Accepted, archived
Delegated to: David Miller

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-GRE-Offload

Message

Saeed Mahameed Aug. 30, 2017, 11:04 p.m. UTC
Hi Dave,

The following changes provide GRE tunnel offloads for the mlx5 ethernet netdevice driver.

For more details please see tag log message below.
Please pull and let me know if there's any problem.

Note: this series doesn't conflict with the ongoing net mlx5 submission.

Thanks,
Saeed.

---

The following changes since commit 90774a93ef075b39e55d31fe56fc286d71a046ac:

  bpf: test_maps: fix typos, "conenct" and "listeen" (2017-08-30 15:32:16 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git tags/mlx5-GRE-Offload

for you to fetch changes up to 7b3722fa9ef647eb1ae6a60a5d46f7c67ab09a33:

  net/mlx5e: Support RSS for GRE tunneled packets (2017-08-31 01:54:15 +0300)

----------------------------------------------------------------
mlx5-updates-2017-08-31 (GRE Offloads support)

This series provides support for MPLS RSS, and for GRE TX offloads and
RSS.

The first patch from Gal and Ariel adds mlx5 driver support for the
ConnectX capability to perform IP version identification and matching,
in order to distinguish between IPv4 and IPv6 without the need to
specify the encapsulation type, and thus perform RSS on MPLS traffic
automatically without specifying the MPLS ethertype. This patch will
also serve inner IPv4/IPv6 classification for inner GRE RSS.
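
As a rough sketch of the idea, assuming the mlx5 flow-steering macros
and the ip_version match field this series introduces (the actual rule
setup in en_fs.c may differ):

-- >8 --
#include <linux/mlx5/device.h>
#include <linux/mlx5/fs.h>

/* Hedged sketch: classify IPv4 via the ip_version match field instead
 * of the ethertype, so the same rule form also covers IP carried over
 * MPLS or GRE.  This helper itself is hypothetical. */
static void sketch_match_ipv4_by_version(struct mlx5_flow_spec *spec)
{
	spec->match_criteria_enable = MLX5_MATCH_OUTER_HEADERS;
	MLX5_SET_TO_ONES(fte_match_param, spec->match_criteria,
			 outer_headers.ip_version);
	MLX5_SET(fte_match_param, spec->match_value,
		 outer_headers.ip_version, 4);	/* 6 for IPv6 */
}
-- 8< --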

The second patch, from Gal, adds TX offload support for GRE tunneled
packets by reporting the needed netdev features.
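
Roughly, in terms of the standard netdev feature flags (a hedged
sketch; the exact flag set used by mlx5e may differ):

-- >8 --
#include <linux/netdevice.h>

/* Hedged sketch: advertise TSO and TX checksum offload for GRE
 * tunneled traffic through existing netdev feature flags.  This
 * helper is hypothetical. */
static void sketch_add_gre_tx_offloads(struct net_device *netdev)
{
	netdev->hw_features          |= NETIF_F_GSO_GRE | NETIF_F_GSO_GRE_CSUM;
	netdev->hw_enc_features      |= NETIF_F_GSO_GRE | NETIF_F_GSO_GRE_CSUM;
	netdev->gso_partial_features |= NETIF_F_GSO_GRE_CSUM;
}
-- 8< --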

The third patch, from Gal, adds GRE inner RSS support by creating the
needed device resources (steering tables/rules and traffic classifiers)
to match GRE traffic and perform RSS hashing on the inner headers.

Improvement:
Testing 8 TCP streams bandwidth over GRE:
    System: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
    NIC: Mellanox Technologies MT28800 Family [ConnectX-5 Ex]
    Before: 21.3 Gbps (Single RQ)
    Now   : 90.5 Gbps (RSS spread on 8 RQs)

Thanks,
Saeed.

----------------------------------------------------------------
Gal Pressman (3):
      net/mlx5e: Use IP version matching to classify IP traffic
      net/mlx5e: Support TSO and TX checksum offloads for GRE tunnels
      net/mlx5e: Support RSS for GRE tunneled packets

 drivers/net/ethernet/mellanox/mlx5/core/en.h       |  18 +-
 .../net/ethernet/mellanox/mlx5/core/en_ethtool.c   |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/en_fs.c    | 281 ++++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c  | 108 ++++++--
 drivers/net/ethernet/mellanox/mlx5/core/fs_core.c  |   4 +-
 include/linux/mlx5/mlx5_ifc.h                      |   2 +-
 6 files changed, 384 insertions(+), 40 deletions(-)

Comments

David Miller Aug. 31, 2017, 5:15 a.m. UTC | #1
From: Saeed Mahameed <saeedm@mellanox.com>
Date: Thu, 31 Aug 2017 02:04:06 +0300

> The following changes provide GRE tunnel offloads for mlx5 ethernet netdevice driver.
> 
> For more details please see tag log message below.
> Please pull and let me know if there's any problem.
> 
> Note: this series doesn't conflict with the ongoing net mlx5 submission.

Looks good, pulled.
Hannes Frederic Sowa Aug. 31, 2017, 1:51 p.m. UTC | #2
Saeed Mahameed <saeedm@mellanox.com> writes:

> The first patch from Gal and Ariel provides the mlx5 driver support for
> ConnectX capability to perform IP version identification and matching in
> order to distinguish between IPv4 and IPv6 without the need to specify the
> encapsulation type, thus perform RSS in MPLS automatically without
> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
> classification for inner GRE RSS.

I don't think this is legal at all or did I misunderstood something?

<https://tools.ietf.org/html/rfc3032#section-2.2>

Thanks,
Hannes
Saeed Mahameed Sept. 2, 2017, 11:01 p.m. UTC | #3
On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Saeed Mahameed <saeedm@mellanox.com> writes:
>
>> The first patch from Gal and Ariel provides the mlx5 driver support for
>> ConnectX capability to perform IP version identification and matching in
>> order to distinguish between IPv4 and IPv6 without the need to specify the
>> encapsulation type, thus perform RSS in MPLS automatically without
>> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> classification for inner GRE RSS.
>
> I don't think this is legal at all or did I misunderstood something?
>
> <https://tools.ietf.org/html/rfc3032#section-2.2>

It seems you misunderstood the cover letter.  The HW will still
identify MPLS (IPv4/IPv6) packets; we just set a new bit in the HW
steering rules rather than adding specific {MPLS ethertype} x
{IPv4,IPv6} rules to classify MPLS IPv{4,6} traffic.  Same
functionality, just a better and more general way to approach it.
Bottom line: the hardware is capable of processing MPLS headers and
performing RSS on the inner packet (IPv4/6) without the driver having
to provide precise MPLS steering rules.

>
> Thanks,
> Hannes
Hannes Frederic Sowa Sept. 3, 2017, 1:32 a.m. UTC | #4
Hi Saeed,

On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
> > Saeed Mahameed <saeedm@mellanox.com> writes:
> >
> >> The first patch from Gal and Ariel provides the mlx5 driver support for
> >> ConnectX capability to perform IP version identification and matching in
> >> order to distinguish between IPv4 and IPv6 without the need to specify the
> >> encapsulation type, thus perform RSS in MPLS automatically without
> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
> >> classification for inner GRE RSS.
> >
> > I don't think this is legal at all or did I misunderstood something?
> >
> > <https://tools.ietf.org/html/rfc3032#section-2.2>
> 
> It seems you misunderstood the cover letter.  The HW will still
> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
> steering rules rather than adding new specific rules with  {MPLS
> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
> functionality a better and general way to approach it.
> Bottom line the hardware is capable of processing MPLS headers and
> perform RSS on the inner packet (IPv4/6) without the need of the
> driver to provide precise steering MPLS rules.

Sorry, I think I am still confused.

I just want to make sure that you don't, by default, use the first
nibble after the MPLS bottom-of-stack label in any way as an indicator
of whether the payload is an IPv4 or IPv6 packet. It can be anything.
The forwarding equivalence class tells the stack which protocol you see.

If you match on the first nibble behind the MPLS bottom-of-stack label,
the '4' or '6' could just as well be part of a MAC address with its
first nibble being 4 or 6, because the particular pseudowire is EoMPLS
and uses no control word.

I wanted to mention it because, with the addition of e.g. VPLS, this
could cause problems down the road and should at least be controllable.
It is probably better to use Entropy Labels in the future.
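
To illustrate the heuristic being warned about here, a purely
hypothetical sketch (not code from this series):

-- >8 --
/* Hypothetical sketch: guessing the payload type from the first nibble
 * after the MPLS bottom-of-stack label.  For an EoMPLS pseudowire with
 * no control word, that nibble is the top of a destination MAC
 * address, which may legitimately start with 4 or 6. */
static int guess_payload_version(const unsigned char *payload)
{
	switch (payload[0] >> 4) {
	case 4:  return 4;	/* IPv4 -- or a MAC starting with 0x4x */
	case 6:  return 6;	/* IPv6 -- or a MAC starting with 0x6x */
	default: return 0;	/* unknown; e.g. a control word (0x0) */
	}
}
-- 8< --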

Thanks,
Hannes
Tom Herbert Sept. 3, 2017, 1:37 a.m. UTC | #5
On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>> >
>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>> >> ConnectX capability to perform IP version identification and matching in
>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>> >> encapsulation type, thus perform RSS in MPLS automatically without
>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> >> classification for inner GRE RSS.
>> >
>> > I don't think this is legal at all or did I misunderstood something?
>> >
>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>
>> It seems you misunderstood the cover letter.  The HW will still
>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>> steering rules rather than adding new specific rules with  {MPLS
>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>> functionality a better and general way to approach it.
>> Bottom line the hardware is capable of processing MPLS headers and
>> perform RSS on the inner packet (IPv4/6) without the need of the
>> driver to provide precise steering MPLS rules.
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't use the first nibble after the
> mpls bottom of stack label in any way as an indicator if that is an IPv4
> or IPv6 packet by default. It can be anything. The forward equivalence
> class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom of stack label
> the '4' or '6' respectively could be part of a MAC address with its
> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
> and uses no control world.
>
> I wanted to mention it, because with addition of e.g. VPLS this could
> cause problems down the road and should at least be controllable? It is
> probably better to use Entropy Labels in future.
>
Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
prefer) then all this protocol specific DPI for RSS just goes away ;-)

Tom

> Thanks,
> Hannes
Saeed Mahameed Sept. 3, 2017, 4:07 a.m. UTC | #6
On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Saeed,
>
> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>> >
>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>> >> ConnectX capability to perform IP version identification and matching in
>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>> >> encapsulation type, thus perform RSS in MPLS automatically without
>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>> >> classification for inner GRE RSS.
>> >
>> > I don't think this is legal at all or did I misunderstood something?
>> >
>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>
>> It seems you misunderstood the cover letter.  The HW will still
>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>> steering rules rather than adding new specific rules with  {MPLS
>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>> functionality a better and general way to approach it.
>> Bottom line the hardware is capable of processing MPLS headers and
>> perform RSS on the inner packet (IPv4/6) without the need of the
>> driver to provide precise steering MPLS rules.
>
> Sorry, I think I am still confused.
>
> I just want to make sure that you don't use the first nibble after the
> mpls bottom of stack label in any way as an indicator if that is an IPv4
> or IPv6 packet by default. It can be anything. The forward equivalence
> class tells the stack which protocol you see.
>
> If you match on the first nibble behind the MPLS bottom of stack label
> the '4' or '6' respectively could be part of a MAC address with its
> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
> and uses no control world.
>
> I wanted to mention it, because with addition of e.g. VPLS this could
> cause problems down the road and should at least be controllable? It is
> probably better to use Entropy Labels in future.
>

Hi Hannes,

I see your concern now, but it still has nothing to do with the
driver; the whole change only simplifies the driver code so that it
does not push full-blown matching steering rules into the HW, and
simply replaces them with a one-bit command.

Regarding your concern, I will need to check with the HW guys and
review the processing algorithm that identifies IP packets over MPLS,
and will get back to you.

If there really is a problem, then yes, we might need to make it
controllable.

> Thanks,
> Hannes
Saeed Mahameed Sept. 3, 2017, 4:11 a.m. UTC | #7
On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Hi Saeed,
>>
>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>> >
>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>> >> ConnectX capability to perform IP version identification and matching in
>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>> >> classification for inner GRE RSS.
>>> >
>>> > I don't think this is legal at all or did I misunderstood something?
>>> >
>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>
>>> It seems you misunderstood the cover letter.  The HW will still
>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>> steering rules rather than adding new specific rules with  {MPLS
>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>> functionality a better and general way to approach it.
>>> Bottom line the hardware is capable of processing MPLS headers and
>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>> driver to provide precise steering MPLS rules.
>>
>> Sorry, I think I am still confused.
>>
>> I just want to make sure that you don't use the first nibble after the
>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>> or IPv6 packet by default. It can be anything. The forward equivalence
>> class tells the stack which protocol you see.
>>
>> If you match on the first nibble behind the MPLS bottom of stack label
>> the '4' or '6' respectively could be part of a MAC address with its
>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>> and uses no control world.
>>
>> I wanted to mention it, because with addition of e.g. VPLS this could
>> cause problems down the road and should at least be controllable? It is
>> probably better to use Entropy Labels in future.
>>
> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
> prefer) then all this protocol specific DPI for RSS just goes away ;-)

Hi Tom,

How does MPLS/UDP or GRE/UDP RSS work without protocol-specific DPI?
Unlike vxlan, those protocols are not over UDP and you can't just play
with the outer header UDP src port, or can you?

Can you elaborate ?

Thanks,
Saeed.

>
> Tom
>
>> Thanks,
>> Hannes
Tom Herbert Sept. 3, 2017, 3:43 p.m. UTC | #8
On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
>> <hannes@stressinduktion.org> wrote:
>>> Hi Saeed,
>>>
>>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>>> <hannes@stressinduktion.org> wrote:
>>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>>> >
>>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>>> >> ConnectX capability to perform IP version identification and matching in
>>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>>> >> classification for inner GRE RSS.
>>>> >
>>>> > I don't think this is legal at all or did I misunderstood something?
>>>> >
>>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>>
>>>> It seems you misunderstood the cover letter.  The HW will still
>>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>>> steering rules rather than adding new specific rules with  {MPLS
>>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>>> functionality a better and general way to approach it.
>>>> Bottom line the hardware is capable of processing MPLS headers and
>>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>>> driver to provide precise steering MPLS rules.
>>>
>>> Sorry, I think I am still confused.
>>>
>>> I just want to make sure that you don't use the first nibble after the
>>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>>> or IPv6 packet by default. It can be anything. The forward equivalence
>>> class tells the stack which protocol you see.
>>>
>>> If you match on the first nibble behind the MPLS bottom of stack label
>>> the '4' or '6' respectively could be part of a MAC address with its
>>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>>> and uses no control world.
>>>
>>> I wanted to mention it, because with addition of e.g. VPLS this could
>>> cause problems down the road and should at least be controllable? It is
>>> probably better to use Entropy Labels in future.
>>>
>> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
>> prefer) then all this protocol specific DPI for RSS just goes away ;-)
>
> Hi Tom,
>
> How does MPLS/UDP or GRE/UDP RSS works without protocol specific DPI ?
> unlike vxlan those protocols are not over UDP and you can't just play
> with the outer header udp src port, or do you ?
>
> Can you elaborate ?
>
Hi Saeed,

An encapsulator sets the UDP source port to be the flow entropy of the
packet being encapsulated. So when the packet traverses the network,
devices can base their hash just on the canonical 5-tuple which is
sufficient for ECMP and RSS. IPv6 flow label is even better since the
middleboxes don't even need to look at the transport header, a packet
is steered based on the 3-tuple of addresses and flow label. In the
Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
the source port. Flow label is similarly set by ip6_make_flowlabel.
Both of these functions use the skb->hash which is computed by calling
flow dissector at most once per packet (if the packet was received
with an L4 HW hash or locally originated on a connection the hash does
not need to be computed).
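
A minimal sketch of that mechanism, using the existing helpers named
above (the wrapper function itself is hypothetical):

-- >8 --
#include <net/udp.h>
#include <net/ipv6.h>

/* Hedged sketch: derive the outer UDP source port and an IPv6 flow
 * label from the inner flow entropy (skb->hash), as UDP encapsulations
 * and IPv6 tunnels do on transmit. */
static void sketch_set_outer_entropy(struct net *net, struct sk_buff *skb,
				     struct flowi6 *fl6, __be16 *sport,
				     __be32 *flowlabel)
{
	*sport = udp_flow_src_port(net, skb, 0, 0, true);
	*flowlabel = ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6);
}
-- 8< --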

Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf
as well as Davem's "Less is More" presentation which highlights the
virtues of protocol generic HW mechanisms
(https://www.youtube.com/watch?v=6VgmazGwL_Y).

Tom
Or Gerlitz Sept. 3, 2017, 4:17 p.m. UTC | #9
On Sun, Sep 3, 2017 at 6:43 PM, Tom Herbert <tom@herbertland.com> wrote:
> On Sat, Sep 2, 2017 at 9:11 PM, Saeed Mahameed <saeedm@dev.mellanox.co.il> wrote:
>> On Sat, Sep 2, 2017 at 6:37 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
>>> <hannes@stressinduktion.org> wrote:
>>>> Hi Saeed,
>>>>
>>>> On Sun, Sep 3, 2017, at 01:01, Saeed Mahameed wrote:
>>>>> On Thu, Aug 31, 2017 at 6:51 AM, Hannes Frederic Sowa
>>>>> <hannes@stressinduktion.org> wrote:
>>>>> > Saeed Mahameed <saeedm@mellanox.com> writes:
>>>>> >
>>>>> >> The first patch from Gal and Ariel provides the mlx5 driver support for
>>>>> >> ConnectX capability to perform IP version identification and matching in
>>>>> >> order to distinguish between IPv4 and IPv6 without the need to specify the
>>>>> >> encapsulation type, thus perform RSS in MPLS automatically without
>>>>> >> specifying MPLS ethertyoe. This patch will also serve for inner GRE IPv4/6
>>>>> >> classification for inner GRE RSS.
>>>>> >
>>>>> > I don't think this is legal at all or did I misunderstood something?
>>>>> >
>>>>> > <https://tools.ietf.org/html/rfc3032#section-2.2>
>>>>>
>>>>> It seems you misunderstood the cover letter.  The HW will still
>>>>> identify MPLS (IPv4/IPv6) packets using a new bit we specify in the HW
>>>>> steering rules rather than adding new specific rules with  {MPLS
>>>>> ethertype} X {IPv4,IPv6} to classify MPLS IPv{4,6} traffic, Same
>>>>> functionality a better and general way to approach it.
>>>>> Bottom line the hardware is capable of processing MPLS headers and
>>>>> perform RSS on the inner packet (IPv4/6) without the need of the
>>>>> driver to provide precise steering MPLS rules.
>>>>
>>>> Sorry, I think I am still confused.
>>>>
>>>> I just want to make sure that you don't use the first nibble after the
>>>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>>>> or IPv6 packet by default. It can be anything. The forward equivalence
>>>> class tells the stack which protocol you see.
>>>>
>>>> If you match on the first nibble behind the MPLS bottom of stack label
>>>> the '4' or '6' respectively could be part of a MAC address with its
>>>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>>>> and uses no control world.
>>>>
>>>> I wanted to mention it, because with addition of e.g. VPLS this could
>>>> cause problems down the road and should at least be controllable? It is
>>>> probably better to use Entropy Labels in future.
>>>>
>>> Or just use IPv6 with flow label for RSS (or MPLS/UDP, GRE/UDP if you
>>> prefer) then all this protocol specific DPI for RSS just goes away ;-)

>> How does MPLS/UDP or GRE/UDP RSS works without protocol specific DPI ?
>> unlike vxlan those protocols are not over UDP and you can't just play
>> with the outer header udp src port, or do you ?
>> Can you elaborate ?

> An encapsulator sets the UDP source port to be the flow entropy of the
> packet being encapsulated. So when the packet traverses the network
> devices can base their hash just on the canonical 5-tuple which is
> sufficient for ECMP and RSS. IPv6 flow label is even better since the
> middleboxes don't even need to look at the transport header, a packet
> is steered based on the 3-tuple of addresses and flow label. In the
> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
> the source port. Flow label is similarly set by ip6_make_flowlabel.
> Both of these functions use the skb->hash which is computed by calling
> flow dissector at most once per packet (if the packet was received
> with an L4 HW hash or locally originated on a connection the hash does
> not need to be computed).

Hi Tom,

Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
RSS-ing on the ip+udp encap header.

For GRE, I was trying to push back on RSS-ing on the inner headers, but
as Saeed commented, we didn't see anything simple through which the HW
can do spreading. To make sure I follow, you are saying that if this is
gre6 tunneling

net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
include/net/ipv6.h=static inline void iph_to_flow_copy_v6addrs(struct flow_keys *flow,
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
include/net/ipv6.h=static inline void ip6_set_txhash(struct sock *sk) { }
include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net *net, struct sk_buff *skb,
net/ipv6/ip6_gre.c=static int ip6gre_header(struct sk_buff *skb, struct net_device *dev,
net/ipv6/ip6_gre.c:                  ip6_make_flowlabel(dev_net(dev), skb,
net/ipv6/ip6_output.c=int ip6_xmit(const struct sock *sk, struct sk_buff *skb, struct flowi6 *fl6,
net/ipv6/ip6_output.c:  ip6_flow_hdr(hdr, tclass, ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_output.c=struct sk_buff *__ip6_make_skb(struct sock *sk,
net/ipv6/ip6_output.c:               ip6_make_flowlabel(net, skb, fl6->flowlabel,
net/ipv6/ip6_tunnel.c=int ip6_tnl_xmit(struct sk_buff *skb, struct net_device *dev, __u8 dsfield,
net/ipv6/ip6_tunnel.c:               ip6_make_flowlabel(net, skb, fl6->flowlabel, true, fl6));

the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
manner similar to what udp_flow_src_port does, and if the receiver
side hashes on the L3 IPv6 src/dst/flow label we'll get spreading?
nice!

Still, what do we do with IPv4 GRE tunnels? And what do we do with HW
which isn't capable of RSS on the flow label?

> Please look at https://people.netfilter.org/pablo/netdev0.1/papers/UDP-Encapsulation-in-Linux.pdf
> as well as Davem's "Less is More" presentation which highlights the
> virtues of protocol generic HW mechanisms
> (https://www.youtube.com/watch?v=6VgmazGwL_Y).
Tom Herbert Sept. 3, 2017, 4:45 p.m. UTC | #10
> Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
> RSS-ing on the ip+udp encap header.
>
> For GRE, I was trying to fight back that rss-ing on inner, but as
> Saeed commented,
> we didn't see something simple through which the HW can do spreading. To make
> sure I follow, you are saying that if this is gre6 tunneling
>
It's pretty common that HW does this, since GRE has been in widespread use for a while.

> net-next.git]# git grep -p ip6_make_flowlabel net/ include/linux/ include/net/
> include/net/ipv6.h=static inline void iph_to_flow_copy_v6addrs(struct
> flow_keys *flow,
> include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net
> *net, struct sk_buff *skb,
> include/net/ipv6.h=static inline void ip6_set_txhash(struct sock *sk) { }
> include/net/ipv6.h:static inline __be32 ip6_make_flowlabel(struct net
> *net, struct sk_buff *skb,
> net/ipv6/ip6_gre.c=static int ip6gre_header(struct sk_buff *skb,
> struct net_device *dev,
> net/ipv6/ip6_gre.c:                  ip6_make_flowlabel(dev_net(dev), skb,
> net/ipv6/ip6_output.c=int ip6_xmit(const struct sock *sk, struct
> sk_buff *skb, struct flowi6 *fl6,
> net/ipv6/ip6_output.c:  ip6_flow_hdr(hdr, tclass,
> ip6_make_flowlabel(net, skb, fl6->flowlabel,
> net/ipv6/ip6_output.c=struct sk_buff *__ip6_make_skb(struct sock *sk,
> net/ipv6/ip6_output.c:               ip6_make_flowlabel(net, skb,
> fl6->flowlabel,
> net/ipv6/ip6_tunnel.c=int ip6_tnl_xmit(struct sk_buff *skb, struct
> net_device *dev, __u8 dsfield,
> net/ipv6/ip6_tunnel.c:               ip6_make_flowlabel(net, skb,
> fl6->flowlabel, true, fl6));
>
Seems right.

> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
> similar manner done by udp_flow_src_port? and
> if the receiver side hashes on L3 IPv6 src/dst/flow label we'll get
> spreading? nice!
>
As long as the network devices support it.

> Still, what do we do with IPv4 GRE tunnels? and what do we do with HW
> which isn't capable to RSS on flow label?
>
Throw it out and buy hardware that supports flow label! ;-)

Seriously though, flow labels are the only reasonable way that RSS can
be supported in IPv6. If a device tries to do DPI on IPv6 then it'll
eventually need to be able to parse some number of extension headers,
which unlike IPv4 options are unbounded in size. So there are going to
be circumstances in which a device either doesn't understand an EH, or
the size of the EHs blows its parsing buffer limit so it can't do the
DPI. IMO, telling host implementations that we're not allowed to use
extension headers because middleboxes can't support them is
unacceptable...

Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the
only reasonable way to handle receive checksums in IPv6.

Tom
Or Gerlitz Sept. 3, 2017, 6:58 p.m. UTC | #11
On Sun, Sep 3, 2017 at 7:45 PM, Tom Herbert <tom@herbertland.com> wrote:
>> Re all sorts of udp encap, sure, we're all on the less-is-more thing and just
>> RSS-ing on the ip+udp encap header.

>> For GRE, I was trying to fight back that rss-ing on inner, but as
>> Saeed commented,
>> we didn't see something simple through which the HW can do spreading. To
>> make sure I follow, you are saying that if this is gre6 tunneling

> It's pretty common that HW does this since GRE is in widespread use for a while.

ECMP? I guess so


>> the sender side (ip6_tnl_xmit?) will set the IPv6 flow label in a
>> similar manner done by udp_flow_src_port? and
>> if the receiver side hashes on L3 IPv6 src/dst/flow label we'll get
>> spreading? nice!

> As long as the network devices support it.

yeah, hopefully upcoming devices will support the thing

>> Still, what do we do with IPv4 GRE tunnels? and what do we do with HW
>> which isn't capable to RSS on flow label?

> Throw it out and buy hardware that supports flow label! ;-)

still, we came across IPv4 GRE customer environments

> Seriously though, flow labels are the only reasonable way that RSS can
> be supported in IPv6. If a device tries to do DPI on IPv6 then they'll
> eventually need to be able to parse of some number of extension
> headers which unlike IPv4 is unbounded in size. So there are going to
> be circumstances in which a device either doesn't understand an EH, or
> the size of EHs blows it parsing buffer limit so it can't do the DPI.
> IMO, telling host implementations that we're not allowed to use
> extension headers because middleboxes can't support them is
> unacceptable...

makes sense to me

> Btw, these same arguments apply as to why CHECKSUM_COMPLETE is the
> only reasonable way to handle receive checksums in IPv6.

supported on mlx5
Hannes Frederic Sowa Sept. 4, 2017, 9:37 a.m. UTC | #12
Hello,

Saeed Mahameed <saeedm@dev.mellanox.co.il> writes:

[...]

> On Sat, Sep 2, 2017 at 6:32 PM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Sorry, I think I am still confused.
>>
>> I just want to make sure that you don't use the first nibble after the
>> mpls bottom of stack label in any way as an indicator if that is an IPv4
>> or IPv6 packet by default. It can be anything. The forward equivalence
>> class tells the stack which protocol you see.
>>
>> If you match on the first nibble behind the MPLS bottom of stack label
>> the '4' or '6' respectively could be part of a MAC address with its
>> first nibble being 4 or 6, because the particular pseudowire is EoMPLS
>> and uses no control world.
>>
>> I wanted to mention it, because with addition of e.g. VPLS this could
>> cause problems down the road and should at least be controllable? It is
>> probably better to use Entropy Labels in future.
>>
>
> Hi Hannes,
>
> I see your concern now, but still it has nothing to do with the
> driver, the whole change is only to simplify driver code to not push
> full blown matching steering rules into the HW, and simply replace it
> with a one bit command.

Sorry, I very much got that impression from the cover letter plus the
descriptions of the patches.

> Regarding your concern, I will need to check with the HW guys and
> review the processing algorithm that identifies IP packets over MPLs,
> and will get back to you.
>
> if there is really a problem, then yes, we might need to make it
> controllable ..

Thanks for looking into this. I do think it can be added as a feature
but I very much think such logic should be disabled by default.

Thanks,
Hannes
Hannes Frederic Sowa Sept. 4, 2017, 1:50 p.m. UTC | #13
Tom Herbert <tom@herbertland.com> writes:

> An encapsulator sets the UDP source port to be the flow entropy of the
> packet being encapsulated. So when the packet traverses the network
> devices can base their hash just on the canonical 5-tuple which is
> sufficient for ECMP and RSS. IPv6 flow label is even better since the
> middleboxes don't even need to look at the transport header, a packet
> is steered based on the 3-tuple of addresses and flow label. In the
> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
> the source port. Flow label is similarly set by ip6_make_flowlabel.
> Both of these functions use the skb->hash which is computed by calling
> flow dissector at most once per packet (if the packet was received
> with an L4 HW hash or locally originated on a connection the hash does
> not need to be computed).

This would require the MPLS stack to copy the flow label of IPv6
connections between MPLS routers over their whole lifetime in the MPLS
network. The same would hold for MPLS encapsulated inside UDP; the
source port needs to be kept constant. This is very impractical. The
hash for the flow label can often not be recomputed by interim routers,
because they might lack knowledge of the upper-layer protocol.

UDP source port entropy still has the problem that we don't respect the
source port as RSS entropy by default in network cards, because of
possible fragmentation and thus possible reordering of packets. GRE does
not have this problem and is way easier to identify by hardware.

Basically we need to tell network cards that they can use specific
destination UDP ports where we allow the source port to be used in RSS
hash calculation. I don't see how this is any easier than just using GRE
with a defined protocol field? I do like the combination of ipv6
flowlabel + GRE.

Btw. people are using the GRE Key as additional entropy without looking
into the GRE payload.

Bye,
Hannes
Tom Herbert Sept. 4, 2017, 4:15 p.m. UTC | #14
On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@herbertland.com> writes:
>
>> An encapsulator sets the UDP source port to be the flow entropy of the
>> packet being encapsulated. So when the packet traverses the network
>> devices can base their hash just on the canonical 5-tuple which is
>> sufficient for ECMP and RSS. IPv6 flow label is even better since the
>> middleboxes don't even need to look at the transport header, a packet
>> is steered based on the 3-tuple of addresses and flow label. In the
>> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
>> the source port. Flow label is similarly set by ip6_make_flowlabel.
>> Both of these functions use the skb->hash which is computed by calling
>> flow dissector at most once per packet (if the packet was received
>> with an L4 HW hash or locally originated on a connection the hash does
>> not need to be computed).
>
> This would require the MPLS stack copying the flowlabel of IPv6
> connections between MPLS routers over their whole lifetime in the MPLS
> network. The same would hold for MPLS encapsulated inside UDP, the
> source port needs to be kept constant. This is very impractical. The
> hash for the flow label can often not be recomputed by interim routers,
> because they might lack the knowledge of the upper layer protocol.
>
Hannes,

When the flow label is set the packet will traverse the network and be
ECMP routed regardless of whether the payload is MPLS or anything
else-- the important characteristic is that network devices don't need
to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to
provide good ECMP. At a source the flow label or UDP source port needs
to be generated. That can be based on DPI, derived from the MPLS
entropy label, use SPI in ESP, etc. I don't see anything special about
MPLS in this regard.

> UDP source port entropy still has the problem that we don't respect the
> source port as RSS entropy by default in network cards, because of
> possible fragmentation and thus possible reordering of packets. GRE does
> not have this problem and is way easier to identify by hardware.
>
> Basically we need to tell network cards that they can use specific
> destination UDP ports where we allow the source port to be used in RSS
> hash calculation. I don't see how this is any easier than just using GRE
> with a defined protocol field? I do like the combination of ipv6
> flowlabel + GRE.
>
No, we don't want any more port-specific configuration in NICs! The
NIC should just fall back to a 3-tuple hash when it sees MF or a
fragment offset set in the IPv4 header. But even if it doesn't
implement this, receiving OOO fragments is hardly the end of the
world-- IP packets are always allowed to be received OOO. If something
breaks because in-order delivery is assumed then that is the bug that
needs to be fixed. So at best, handling fragmentation in this manner
is a proposed optimization whose benefits will pale next to getting
good ECMP and RSS when encapsulation is in use.

> Btw. people are using the GRE Key as additional entropy without looking
> into the GRE payload.
>
Sure, some are, but the GRE key is not defined to be flow entropy, so
it's not ubiquitously used for that, nor guaranteed to give sufficient
entropy or even be constant per flow. GRE/UDP (RFC8086) was primarily
written to allow a more consistent method (as was RFC7510 for doing
MPLS/UDP).

Tom
Hannes Frederic Sowa Sept. 4, 2017, 4:52 p.m. UTC | #15
Hello Tom,

Tom Herbert <tom@herbertland.com> writes:

> On Mon, Sep 4, 2017 at 6:50 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Tom Herbert <tom@herbertland.com> writes:
>>
>>> An encapsulator sets the UDP source port to be the flow entropy of the
>>> packet being encapsulated. So when the packet traverses the network
>>> devices can base their hash just on the canonical 5-tuple which is
>>> sufficient for ECMP and RSS. IPv6 flow label is even better since the
>>> middleboxes don't even need to look at the transport header, a packet
>>> is steered based on the 3-tuple of addresses and flow label. In the
>>> Linux stack,  udp_flow_src_port is used by UDP encapsulations to set
>>> the source port. Flow label is similarly set by ip6_make_flowlabel.
>>> Both of these functions use the skb->hash which is computed by calling
>>> flow dissector at most once per packet (if the packet was received
>>> with an L4 HW hash or locally originated on a connection the hash does
>>> not need to be computed).
>>
>> This would require the MPLS stack copying the flowlabel of IPv6
>> connections between MPLS routers over their whole lifetime in the MPLS
>> network. The same would hold for MPLS encapsulated inside UDP, the
>> source port needs to be kept constant. This is very impractical. The
>> hash for the flow label can often not be recomputed by interim routers,
>> because they might lack the knowledge of the upper layer protocol.
>>
> Hannes,
>
> When the flow label is set the packet will traverse the network and be
> ECMP routed regardless of whether the payload is MPLS at anything
> else-- the important characteristic is that network devices don't need
> to know how to parse MPLS (or GRE, or IPIP, or L2TP, ESP, or ...) to
> provide good ECMP. At a source the flow label or UDP source port needs
> to be generated. That can be based on DPI, derived from the MPLS
> entropy label, use SPI in ESP, etc. I don't see anything special about
> MPLS in this regard.

The MPLS circuit is only end to end in terms of IP processing if MPLS is
used for multitenant separation.

Normally the IP connection is done between two label switch routers,
thus is not end to end. One LSR will decapsulate the packet and throw
the IP header away, do the label processing and will reencapsulate it
with the new next hop information. To keep the assigned entropy alive it
would have to save the UDP source port or flowlabel and patch the
outgoing IP header again. This is certainly possible it just seems more
unnatural.

Normally every next hop does MPLS processing and thus the packet
traverses up the stack. Special purpose (entropy) MPLS labels allow the
stack to achieve RSS just based on the label stack and will be
end-to-end in a MPLS cloud.

>> UDP source port entropy still has the problem that we don't respect the
>> source port as RSS entropy by default in network cards, because of
>> possible fragmentation and thus possible reordering of packets. GRE does
>> not have this problem and is way easier to identify by hardware.
>>
>> Basically we need to tell network cards that they can use specific
>> destination UDP ports where we allow the source port to be used in RSS
>> hash calculation. I don't see how this is any easier than just using GRE
>> with a defined protocol field? I do like the combination of ipv6
>> flowlabel + GRE.
>>
> No, we don't any more want port specific configuration in NICs! The
> NIC should just fallback to 3-tuple hash when it see MF or offset set
> in IPv4 header. But even if it doesn't implement this, receiving OOO
> fragments is hardly the end of the world-- IP packets are always
> allowed to be received OOO. If something breaks because in order
> delivery is assumed then that is the bug that needs to be fixed. So at
> best handling fragmentation in this manner is proposed om
> optimization whose benefits will pale to getting good ECMP and RSS
> when encapsulation is in use.

The problem is that you end up with two streams, one fragmented and
one non-fragmented, that actually belong to the same flow. It is
known to break stuff, see:

<https://patchwork.ozlabs.org/patch/59235/>

I would agree with you, but we can't break existing setups,
unfortunately.

>> Btw. people are using the GRE Key as additional entropy without looking
>> into the GRE payload.
>>
> Sure some are, but the GRE key is not defined to be flow entropy so
> it's not ubiquitous it used for that so it gives sufficient entropy or
> is even constant per flow. GRE/UDP (RFC8086) was primarily written to
> allow a more consistent method (as was RFC7510 for doing MPLS/UDP).

I agree with that, just wanted to mention it.

Bye,
Hannes
Tom Herbert Sept. 4, 2017, 5:11 p.m. UTC | #16
> The problem is that you end up having two streams, one fragmented and
> one non-fragmented, but actually they belong to the same stream. It is
> known to break stuff, see:
>
> <https://patchwork.ozlabs.org/patch/59235/>
>
> I would agree with you, but we can't break existing setups,
> unfortunately.
>
I'm not sure what "existing setups" means here. If this is referring
to the Internet then out of order packets and fragments abound-- any
assumption of in order delivery for IP packets is useless there.

Btw, TCP has exactly the same problem in this regard that UDP has with
regard to fragmentation. The reason that TCP isn't considered an issue
is the assumption that TCP will do PMTU discovery and set DF (I would
bet that devices don't actually check for DF and vendors don't test
when it's not set!).

Tom
Hannes Frederic Sowa Sept. 4, 2017, 5:57 p.m. UTC | #17
Hi Tom,

Tom Herbert <tom@herbertland.com> writes:

>> The problem is that you end up having two streams, one fragmented and
>> one non-fragmented, but actually they belong to the same stream. It is
>> known to break stuff, see:
>>
>> <https://patchwork.ozlabs.org/patch/59235/>
>>
>> I would agree with you, but we can't break existing setups,
>> unfortunately.
>>
> I'm not sure what "existing setups" means here. If this is referring
> to the Internet then out of order packets and fragments abound-- any
> assumption of in order delivery for IP packets is useless there.

Network cards don't know where traffic originates from, even on the same
card. Clouds nowadays try to convince everyone to virtualize existing
workloads. So I *assume* that, at least for traffic that seems to be in
one L2 domain, out of order delivery should not occur or things will break.

For a long time Ethernet appeared to deliver very much in order, and
effectively guaranteed that. Thus, to people, it appeared that UDP
packets are also strictly in order. Encapsulating Ethernet inside UDP
must therefore preserve those properties, especially if used for
virtualization. Unfortunately fragmentation happens and the stack has
to deal with it somehow.

I do know of some software that uses UDP multicast and is prone to
misbehave in case of OoO frames. It uses a small reassembly queue, but
if reordering gets too high it will simply drop data. This might be
acceptable up to a certain degree.

I guess one application you could harm is plain old simple syslog, which
often generates fragmented packets. Luckily one often has time stamps in
there to reassemble the log lines, but one has to do this manually.

L2 domains are not bound to local networks anymore, thanks to vxlan etc.

If you are in a controlled environment and you know that your software
copes perfectly well, then certainly, no doubt, UDP hashing can be
enabled. I would argue we can't do that generally.

> Btw, TCP has exactly the same problem in this regard that UDP has with
> regard to fragmentation. The reason that TCP isn't considered an issue
> is the assumption that TCP will do PMTU discovery and set DF (I would
> bet that devices don't actually check for DF and vendors don't test
> when it's not set!).

I don't know.

For the application the stream will appear in order at the socket level
and OoO or fragmentation won't break anything. End systems will have a
harder time reassembling the stream and thus fast paths couldn't be
used. Is that what you are referring to?

Thanks,
Hannes
Tom Herbert Sept. 4, 2017, 6:56 p.m. UTC | #18
On Mon, Sep 4, 2017 at 10:57 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Hi Tom,
>
> Tom Herbert <tom@herbertland.com> writes:
>
>>> The problem is that you end up having two streams, one fragmented and
>>> one non-fragmented, but actually they belong to the same stream. It is
>>> known to break stuff, see:
>>>
>>> <https://patchwork.ozlabs.org/patch/59235/>
>>>
>>> I would agree with you, but we can't break existing setups,
>>> unfortunately.
>>>
>> I'm not sure what "existing setups" means here. If this is referring
>> to the Internet then out of order packets and fragments abound-- any
>> assumption of in order delivery for IP packets is useless there.
>
> Network cards don't know where traffic originates from, even on the same
> card. Clouds nowadays try to convince everyone to virtualize existing
> workloads. So I *assume* that at least for traffic that seems to be in
> one L2 domain out of order should not occur or things will break.
>
> Ethernet for a long time appeared to do very much in order delivery and
> guarantees that. Thus for people it appeared that UDP packets are also
> strictly in order. Encapsulating Ethernet inside UDP thus must preserve
> those properties, especially if used for virtualization. Unfortunately
> fragmentation happens and the stack has to deal with it somehow.
>
There is absolutely no requirement in IP that packets are delivered in
order-- there never has been and there never will be! If the ULP, like
Ethernet encapsulation, requires in-order delivery then it needs to
implement that itself, like TCP, GRE, and other protocols do with
sequence numbers and reassembly. All of these hoops we jump through to
make sure that packets always follow the same path and are always in
order are done for the benefit of middlebox devices like stateful
firewalls that have forced us to adopt their concept of network
architecture-- in the long run this is self-defeating and kills our
ability to innovate.

I'm not saying that we shouldn't consider legacy devices, but we
should scrutinize new development or solutions that perpetuate
incorrect design or bad assumptions.

Tom
Hannes Frederic Sowa Sept. 5, 2017, 11:14 a.m. UTC | #19
Tom Herbert <tom@herbertland.com> writes:

> There is absolutely no requirement in IP that packets are delivered in
> order-- there never has been and there never will be! If the ULP, like
> Ethernet encapsulation, requires in order deliver then it needs to
> implement that itself like TCP, GRE, and other protocols ensure that
> with sequence numbers and reassembly. All of these hoops we do make
> sure that packets always follow the same path and are always in order
> are done for benefit of middlebox devices like stateful firewalls that
> have force us to adopt their concept of network architecture-- in the
> long run this is self-defeating and kills our ability to innovate.
>
> I'm not saying that we shouldn't consider legacy devices, but we
> should scrutinize new development or solutions that perpetuate
> incorrect design or bad assumptions.

So configure RSS per port and ensure no fragments are sent to those
ports. This is possible and rather easy to do. It solves the problem
with legacy software and it spreads out packets for your applications.

It is not perfect but it is working and solves both problems.

Bye,
Hannes
Tom Herbert Sept. 5, 2017, 4:02 p.m. UTC | #20
On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
<hannes@stressinduktion.org> wrote:
> Tom Herbert <tom@herbertland.com> writes:
>
>> There is absolutely no requirement in IP that packets are delivered in
>> order-- there never has been and there never will be! If the ULP, like
>> Ethernet encapsulation, requires in order deliver then it needs to
>> implement that itself like TCP, GRE, and other protocols ensure that
>> with sequence numbers and reassembly. All of these hoops we do make
>> sure that packets always follow the same path and are always in order
>> are done for benefit of middlebox devices like stateful firewalls that
>> have force us to adopt their concept of network architecture-- in the
>> long run this is self-defeating and kills our ability to innovate.
>>
>> I'm not saying that we shouldn't consider legacy devices, but we
>> should scrutinize new development or solutions that perpetuate
>> incorrect design or bad assumptions.
>
> So configure RSS per port and ensure no fragments are send to those
> ports. This is possible and rather easy to do. It solves the problem
> with legacy software and it spreads out packets for your applications.
>
> It is not perfect but it is working and solves both problems.
>
Hannes,

I don't see how that solves anything. The purpose of RSS is to
distribute the load of protocol packets across queues. This needs to
work for UDP applications. For instance, if I were building a QUIC
server I'd want the sort of flow distribution that a TCP server would
give. You can't do that by configuring a few ports in the device.

If I were to suggest any HW change it would be to not do DPI on
fragments (MF or offset is set). This ensures that all packets of the
fragment train get hashed to the same queue and is in fact what RPS
has been doing for years without any complaints.
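
A rough sketch of that fallback; flow_hash_3tuple()/flow_hash_5tuple()
are hypothetical placeholders for whatever hash the device or stack
computes:

-- >8 --
#include <linux/ip.h>
#include <linux/udp.h>
#include <net/ip.h>

/* Hedged sketch: if MF or a fragment offset is set, hash only on the
 * 3-tuple so every fragment of a datagram lands on the same queue;
 * otherwise use the full 5-tuple. */
static u32 sketch_rx_hash(const struct iphdr *iph, const struct udphdr *uh)
{
	if (ip_is_fragment(iph))
		return flow_hash_3tuple(iph->saddr, iph->daddr,
					iph->protocol);

	return flow_hash_5tuple(iph->saddr, iph->daddr,
				uh->source, uh->dest, iph->protocol);
}
-- 8< --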

But even before I'd make that recommendation, I'd really like to
understand what the problem actually is. The only thing I can garner
from this discussion and the Intel patch is that when fragments are
received OOO that is perceived as a problem. But the protocol
specification clearly says this is not a problem. So the questions
are: who is negatively affected by this? Is this a problem because
some artificial test that checks for everything to be in order is now
failing? Is this affecting real users? Is this an issue in the stack
or really with some implementation outside of the stack? If it is an
implementation outside of the stack, then are we just bandaid'ing over
someone else's incorrect implementation by patching the kernel (as
would have been the case if we had changed the kernel to interoperate
with Facebook's switch that couldn't handle OOO in twstate).

Thanks,
Tom

> Bye,
> Hannes
Hannes Frederic Sowa Sept. 5, 2017, 7:20 p.m. UTC | #21
Hi Tom,

Tom Herbert <tom@herbertland.com> writes:

> On Tue, Sep 5, 2017 at 4:14 AM, Hannes Frederic Sowa
> <hannes@stressinduktion.org> wrote:
>> Tom Herbert <tom@herbertland.com> writes:
>>
>>> There is absolutely no requirement in IP that packets are delivered in
>>> order-- there never has been and there never will be! If the ULP, like
>>> Ethernet encapsulation, requires in order deliver then it needs to
>>> implement that itself like TCP, GRE, and other protocols ensure that
>>> with sequence numbers and reassembly. All of these hoops we do make
>>> sure that packets always follow the same path and are always in order
>>> are done for benefit of middlebox devices like stateful firewalls that
>>> have force us to adopt their concept of network architecture-- in the
>>> long run this is self-defeating and kills our ability to innovate.
>>>
>>> I'm not saying that we shouldn't consider legacy devices, but we
>>> should scrutinize new development or solutions that perpetuate
>>> incorrect design or bad assumptions.
>>
>> So configure RSS per port and ensure no fragments are send to those
>> ports. This is possible and rather easy to do. It solves the problem
>> with legacy software and it spreads out packets for your applications.
>>
>> It is not perfect but it is working and solves both problems.
>>
> Hannes,
>
> I don't see how that solves anything. The purpose of RSS is to
> distribute the load of protocol packets across queues. This needs to
> work for UDP applications. For instance, if I were building a QUIC
> server I'd want the sort of flow distribution that a TCP server would
> give. You can't do that by configuring a few ports in the device.

I seriously am with you and I think you are on the right track. I would
also much rather like to see fragmentation *not* happening and would
like to see hardware logic *not* parsing deep down into packets and
protocols. We can agree on that!

That said, I don't understand the problem with my approach:

If you know that you are running a QUIC server and know your environment
(safe against reordering), run `ethtool -N ethX rx-flow-hash udp4 sdfn'
on the box and you get spreading of UDP packets across the rx queues
based on the UDP src+dst port numbers. Otherwise only source and
destination addresses are considered, which is the safe default
(maybe this is sometimes enough).

Intel network cards warn you if you actually switch the mode:

-- >8 --
drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c:                                  "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/igb/igb_ethtool.c:                           "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c:                       e_warn(drv, "enabling UDP RSS: fragmented packets may arrive out of order to the stack above\n");
-- 8< --

I argue that we can't change the default.

Furthermore I tried to argue that in a cloud where you don't know which
applications you run within the encapsulated networks, it is very hard
to predict which protocols can be considered safe against those
effects. I agree, that most of the time, it probably should not
matter. But given the rule about kernel backwards compatibility I do
have doubts that this change would go unnoticed.

I am the devil's advocate here and argue it is on us to show that it is
safe before we change the semantics.

> If I were to suggest any HW change it would be to not do DPI on
> fragments (MF or offset is set). This ensures that all packets of the
> fragment train get hashed to the same queue and is on fact what RPS
> has been doing for years without any complaints.

The same UDP flow could consist of fragmented and non-fragmented
packets. Even if hardware acted according to what you wrote,
reordering is very much possible within the same UDP flow.

In the case of QUIC, as far as I know, it uses destination ports 80 and
443. Just as we already notify NICs about vxlan and geneve port numbers,
we could selectively enable UDP RSS on those destination port numbers
to tell the NIC it won't receive fragments on those ports and can thus
do RSS. In the same way, the NIC could be notified whether reordering
on a port is acceptable. This could be done, for example, via a
lightweight setsockopt.

The situation with encapsulation is even more complicated:

We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
constellation. If we do the fragmentation inside the vxlan tunnel and
carry over the skb hash to the source ports of all resulting UDP/vxlan
packets, we are fine and reordering on the receiver NIC won't happen in
this case. If the fragmentation happens on the outer UDP header, this
will result in reordering of the inner L2 flow. Unfortunately this
depends on how the vxlan tunnel was set up, how other devices do that
and (I believe) on the kernel version.

Given the fact that we tell the NICs the receiving port of vxlan tunnels
anyway (and I can only assume they are doing RSS on that), we probably
have reordering on those tunnels anyway. So maybe you are right and it
doesn't matter in the end for encapsulated packets, by accident.

> But even before I'd make that recommendation, I'd really like to
> understand what the problem actually is. The only thing I can garner
> from this discussion and the Intel patch is that when fragments are
> received OOO that is perceived as a problem. But the protocol
> specification clearly says this is not a problem. So the questions
> are: who is negatively affected by this? Is this a problem because
> some artificial test that checks for everything to be in order is now
> failing? Is this affecting real users? Is this an issue in the stack
> or really with some implementation outside of the stack? If it is an
> implementation outside of the stack, then are we just bandaid'ing over
> someone else's incorrect implementation by patching the kernel (like
> would have been the case if we changed the kernel to interoperate with
> Facebook's switch that couldn't handle OOO in twstate).

I agree with that and I would also love to hear more feedback on
this. Unfortunately I don't know.

Bye,
Hannes
Tom Herbert Sept. 5, 2017, 9:13 p.m. UTC | #22
> The situation with encapsulation is even more complicated:
>
> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
> constellation. If we do the fragmentation inside the vxlan tunnel and
> carry over the skb hash to all resulting UDP/vxlan packets source ports,
> we are fine and reordering on the receiver NIC won't happen in this
> case. If the fragmentation happens on the outer UDP header, this will
> result in reordering of the inner L2 flow. Unfortunately this depends on
> how the vxlan tunnel was set up, how other devices do that and (I
> believe so) on the kernel version.
>
This really isn't that complicated. The assumption that an IP network
always delivers packets in order is simply wrong. The inventors of
VXLAN must have known full well that when you use IP, packets can and
eventually will be delivered out of order. This isn't just because of
fragmentation; there are many other reasons that packets can be
delivered OOO. This also must have been known when IP/GRE and any
other protocol that carries L2 over IP was invented. If OOO is an
issue for these protocols then they need to be fixed -- this is not a
concern of the IP protocol nor of the stack.

Tom
Alexander H Duyck Sept. 6, 2017, 3:06 a.m. UTC | #23
On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>> The situation with encapsulation is even more complicated:
>>
>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>> constellation. If we do the fragmentation inside the vxlan tunnel and
>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>> we are fine and reordering on the receiver NIC won't happen in this
>> case. If the fragmentation happens on the outer UDP header, this will
>> result in reordering of the inner L2 flow. Unfortunately this depends on
>> how the vxlan tunnel was set up, how other devices do that and (I
>> believe so) on the kernel version.
>>
> This really isn't that complicated. The assumption that an IP network
> always delivers packets in order is simply wrong. The inventors of
> VXLAN must have know full well that when you use IP, packets can and
> eventually will be delivered out of order. This isn't just because of
> fragmentation, there are many other reasons that packets can be
> delivered OOO. This also must have been known when IP/GRE and any
> other protocol that carries L2 over IP was invented. If OOO is an
> issue for these protocols then they need to be fixed-- this is not a
> concern with IP protocol nor the stack.
>
> Tom

As a little background on the original patch: I believe the issue it
fixed was a video streaming application that was sending/receiving a
mix of fragmented and non-fragmented packets. Receiving them out of
order due to the fragmentation was causing stutters in the video, so we
ended up disabling UDP RSS by default in the NICs listed. We decided to
go that way as UDP RSS was viewed as a performance optimization, while
the out-of-order problems were viewed as a functionality issue.

The default for i40e is to have UDP RSS hashing enabled, if I recall
correctly. As we move from enterprise to cloud deployments I suspect
that will be the case more and more, since all the UDP tunnels require
either port recognition or UDP RSS to be enabled. For now we keep the
ability to enable UDP RSS in the legacy drivers, and I believe we have
some white papers somewhere that suggest enabling it if you are planning
to use UDP-based tunnels.

- Alex
Tom Herbert Sept. 6, 2017, 4:17 p.m. UTC | #24
On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> The situation with encapsulation is even more complicated:
>>>
>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>> we are fine and reordering on the receiver NIC won't happen in this
>>> case. If the fragmentation happens on the outer UDP header, this will
>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>> how the vxlan tunnel was set up, how other devices do that and (I
>>> believe so) on the kernel version.
>>>
>> This really isn't that complicated. The assumption that an IP network
>> always delivers packets in order is simply wrong. The inventors of
>> VXLAN must have know full well that when you use IP, packets can and
>> eventually will be delivered out of order. This isn't just because of
>> fragmentation, there are many other reasons that packets can be
>> delivered OOO. This also must have been known when IP/GRE and any
>> other protocol that carries L2 over IP was invented. If OOO is an
>> issue for these protocols then they need to be fixed-- this is not a
>> concern with IP protocol nor the stack.
>>
>> Tom
>
> As far as a little background on the original patch I believe the
> issue that was fixed by the patch was a video streaming application
> that was sending/receiving a mix of fragmented and non-fragmented
> packets. Receiving them out of order due to the fragmentation was
> causing issues with stutters in the video and so we ended up disabling
> UDP by default in the NICs listed. We decided to go that way as UDP
> RSS was viewed as a performance optimization, while the out-of-order
> problems were viewed as a functionality issue.
>
Hi Alex,

Thanks for the details! Were you able to find the root cause of this?
In particular, it would be interesting to know whether it was the kernel
or the device that introduced the jitter, or whether it's the
application that doesn't handle OOO well...

Tom
Alexander H Duyck Sept. 6, 2017, 5:43 p.m. UTC | #25
On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> The situation with encapsulation is even more complicated:
>>>>
>>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>>> we are fine and reordering on the receiver NIC won't happen in this
>>>> case. If the fragmentation happens on the outer UDP header, this will
>>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>>> how the vxlan tunnel was set up, how other devices do that and (I
>>>> believe so) on the kernel version.
>>>>
>>> This really isn't that complicated. The assumption that an IP network
>>> always delivers packets in order is simply wrong. The inventors of
>>> VXLAN must have know full well that when you use IP, packets can and
>>> eventually will be delivered out of order. This isn't just because of
>>> fragmentation, there are many other reasons that packets can be
>>> delivered OOO. This also must have been known when IP/GRE and any
>>> other protocol that carries L2 over IP was invented. If OOO is an
>>> issue for these protocols then they need to be fixed-- this is not a
>>> concern with IP protocol nor the stack.
>>>
>>> Tom
>>
>> As far as a little background on the original patch I believe the
>> issue that was fixed by the patch was a video streaming application
>> that was sending/receiving a mix of fragmented and non-fragmented
>> packets. Receiving them out of order due to the fragmentation was
>> causing issues with stutters in the video and so we ended up disabling
>> UDP by default in the NICs listed. We decided to go that way as UDP
>> RSS was viewed as a performance optimization, while the out-of-order
>> problems were viewed as a functionality issue.
>>
> Hi Alex,
>
> Thanks for the details! Were you able to find the root cause for this?
> In particular, it would be interesting to know if it is the kernel or
> device that introduced the jitter, or if it's the application that
> doesn't handle OOO well...
>
> Tom

It is hard to say, since my memory of events from 7 years ago is
pretty vague at this point, but I'm fairly sure it was the application.
Basically, getting the frames out of order was forcing the application
to drop video data, if I recall correctly.

- Alex
Tom Herbert Sept. 6, 2017, 7:01 p.m. UTC | #26
On Wed, Sep 6, 2017 at 10:43 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Wed, Sep 6, 2017 at 9:17 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Tue, Sep 5, 2017 at 8:06 PM, Alexander Duyck
>> <alexander.duyck@gmail.com> wrote:
>>> On Tue, Sep 5, 2017 at 2:13 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>>> The situation with encapsulation is even more complicated:
>>>>>
>>>>> We are basically only interested in the UDP/vxlan/Ethernet/IP/UDP
>>>>> constellation. If we do the fragmentation inside the vxlan tunnel and
>>>>> carry over the skb hash to all resulting UDP/vxlan packets source ports,
>>>>> we are fine and reordering on the receiver NIC won't happen in this
>>>>> case. If the fragmentation happens on the outer UDP header, this will
>>>>> result in reordering of the inner L2 flow. Unfortunately this depends on
>>>>> how the vxlan tunnel was set up, how other devices do that and (I
>>>>> believe so) on the kernel version.
>>>>>
>>>> This really isn't that complicated. The assumption that an IP network
>>>> always delivers packets in order is simply wrong. The inventors of
>>>> VXLAN must have know full well that when you use IP, packets can and
>>>> eventually will be delivered out of order. This isn't just because of
>>>> fragmentation, there are many other reasons that packets can be
>>>> delivered OOO. This also must have been known when IP/GRE and any
>>>> other protocol that carries L2 over IP was invented. If OOO is an
>>>> issue for these protocols then they need to be fixed-- this is not a
>>>> concern with IP protocol nor the stack.
>>>>
>>>> Tom
>>>
>>> As far as a little background on the original patch I believe the
>>> issue that was fixed by the patch was a video streaming application
>>> that was sending/receiving a mix of fragmented and non-fragmented
>>> packets. Receiving them out of order due to the fragmentation was
>>> causing issues with stutters in the video and so we ended up disabling
>>> UDP by default in the NICs listed. We decided to go that way as UDP
>>> RSS was viewed as a performance optimization, while the out-of-order
>>> problems were viewed as a functionality issue.
>>>
>> Hi Alex,
>>
>> Thanks for the details! Were you able to find the root cause for this?
>> In particular, it would be interesting to know if it is the kernel or
>> device that introduced the jitter, or if it's the application that
>> doesn't handle OOO well...
>>
>> Tom
>
> It is hard to say since my memory of the events from 7 years ago is
> pretty vague at this point, but I'm pretty sure it was the
> application. Basically getting the frames out of order was causing
> them to have to drop video data if I recall correctly.
>
Oh, I didn't notice that the patch was from 2010. Maybe the application
has been fixed by now! :-)

Perhaps it's time to try turning UDP hashing back on by default? Even
if NICs aren't doing this, there are network devices that look at UDP
ports for ECMP, and that doesn't seem to be causing widespread problems
currently...

Tom