
[ovs-dev,00/15] Netdev vxlan-decap offload

Message ID 20210127181036.32448-1-elibr@nvidia.com

Message

Eli Britstein Jan. 27, 2021, 6:10 p.m. UTC
VXLAN decap in OVS-DPDK configuration consists of two flows:
F1: in_port(ens1f0),eth(),ipv4(),udp(), actions:tnl_pop(vxlan_sys_4789)
F2: tunnel(),in_port(vxlan_sys_4789),eth(),ipv4(), actions:ens1f0_0

F1 is a classification flow. It matches on the outer headers, classifies
the packet as a VXLAN packet, and using the tnl_pop action the packet
continues processing in F2.
F2 is a flow that matches on the tunnel metadata as well as on the inner
packet headers (like any other flow).

To fully offload the VXLAN decap path, both F1 and F2 must be
offloaded. As there is more than one flow in HW, it is possible that
F1 is done by HW but F2 is not. The packet is then received by SW, and
should be processed starting from F2, as F1 was already done by HW.
Rte_flow rules are applicable only to physical port IDs, so vport flows
(e.g. F2) are applied on the uplink ports attached to OVS.
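
To make the recovery step concrete, here is a minimal sketch of how SW
can recover the tunnel state of such a partially processed packet with
the DPDK restore-info API. This is illustrative only, not the patch
code, and the function name is hypothetical:

    #include <rte_flow.h>
    #include <rte_mbuf.h>

    /* Sketch only: recover the tunnel state of a packet that HW already
     * matched on F1 but that missed F2 and was handed to SW. */
    static int
    vxlan_miss_recover_sketch(uint16_t phy_port_id, struct rte_mbuf *mbuf,
                              struct rte_flow_tunnel *tunnel)
    {
        struct rte_flow_restore_info info;
        struct rte_flow_error error;

        if (rte_flow_get_restore_info(phy_port_id, mbuf, &info, &error)) {
            /* No HW state attached: process the packet from F1 in SW. */
            return -1;
        }
        if (info.flags & RTE_FLOW_RESTORE_INFO_TUNNEL) {
            /* F1 was executed by HW: take the tunnel metadata from
             * 'info.tunnel' and continue SW processing from F2. */
            *tunnel = info.tunnel;
            return 0;
        }
        return -1;
    }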

This patch-set makes use of [1], introduced in DPDK 20.11, which adds an
API for tunnel offloads.
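
For reference, a rough sketch of how F1's tnl_pop maps onto that API.
This fragment is illustrative only (the series builds the complete
pattern and action arrays), and the function name is hypothetical:

    #include <rte_flow.h>

    static int
    offload_f1_sketch(uint16_t phy_port_id)
    {
        struct rte_flow_tunnel tunnel = {
            .type = RTE_FLOW_ITEM_TYPE_VXLAN,
        };
        struct rte_flow_action *pmd_actions;
        uint32_t num_pmd_actions;
        struct rte_flow_error error;

        /* The PMD supplies the actions implementing the decap/restore
         * semantics; they are prepended to the rule's actions. For F2,
         * rte_flow_tunnel_match() similarly supplies items to prepend
         * to the rule's patterns. */
        if (rte_flow_tunnel_decap_set(phy_port_id, &tunnel, &pmd_actions,
                                      &num_pmd_actions, &error)) {
            return -1;
        }

        /* ... create the rte_flow rule here: outer header matches plus
         * 'pmd_actions', typically followed by a jump to a tunnel
         * group ... */

        return rte_flow_tunnel_action_decap_release(phy_port_id,
                                                    pmd_actions,
                                                    num_pmd_actions,
                                                    &error);
    }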

Travis:
v1: https://travis-ci.org/github/elibritstein/OVS/builds/756418552

GitHub Actions:
v1: https://github.com/elibritstein/OVS/actions/runs/515334647

[1] https://mails.dpdk.org/archives/dev/2020-October/187314.html

Eli Britstein (13):
  netdev-offload: Add HW miss packet state recover API
  netdev-dpdk: Introduce DPDK tunnel APIs
  netdev-offload-dpdk: Implement flow dump create/destroy APIs
  netdev-dpdk: Add flow_api support for netdev vxlan vports
  netdev-offload-dpdk: Implement HW miss packet recover for vport
  dpif-netdev: Add HW miss packet state recover logic
  netdev-offload-dpdk: Change log rate limits
  netdev-offload-dpdk: Support tunnel pop action
  netdev-offload-dpdk: Refactor offload rule creation
  netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
    port
  netdev-offload-dpdk: Map netdev and ufid to offload objects
  netdev-offload-dpdk: Support vports flows offload
  netdev-dpdk-offload: Add vxlan pattern matching function

Ilya Maximets (2):
  netdev-offload: Allow offloading to netdev without ifindex.
  netdev-offload: Disallow offloading to unrelated tunneling vports.

 Documentation/howto/dpdk.rst  |   1 +
 NEWS                          |   2 +
 lib/dpif-netdev.c             |  49 +-
 lib/netdev-dpdk.c             | 135 ++++++
 lib/netdev-dpdk.h             | 104 ++++-
 lib/netdev-offload-dpdk.c     | 851 +++++++++++++++++++++++++++++-----
 lib/netdev-offload-provider.h |   5 +
 lib/netdev-offload-tc.c       |   8 +
 lib/netdev-offload.c          |  29 +-
 lib/netdev-offload.h          |   1 +
 10 files changed, 1033 insertions(+), 152 deletions(-)

Comments

Sriharsha Basavapatna Feb. 5, 2021, 11:25 a.m. UTC | #1
On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
> [...]

Hi Eli,

Thanks for posting this new patchset to support tunnel decap action offload.

I haven't looked at the entire patchset yet, but I focused on the
patches that introduce a 1-to-many mapping between an OVS flow (F2) and
HW-offloaded flows.

Here is a representation of the design proposed in this patchset. A
flow F2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
the underlying uplink/physical port is P0, gets offloaded not only to
P0, but also to other physical ports P1, P2... and so on.

    P0 <----> VxLAN-vPort <----> VFRep-1

    P1
    P2
    ...
    Pn
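
Schematically, my understanding of the proposed scheme is the following
sketch (hypothetical names, not code from the patchset):

    /* The same F2 rule is replicated on every physical (uplink) port,
     * although only P0 will ever see this traffic. */
    struct uplink;                                    /* hypothetical */
    struct f2_rule;                                   /* hypothetical */
    void install_rte_flow(struct uplink *, const struct f2_rule *);

    static void
    offload_vport_flow_sketch(struct uplink **uplinks, int n_uplinks,
                              const struct f2_rule *f2)
    {
        for (int i = 0; i < n_uplinks; i++) {
            /* P1..Pn get a copy too, consuming their HW flow entries. */
            install_rte_flow(uplinks[i], f2);
        }
    }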

IMO, the problems with this design are:

- Offloading a flow to an unrelated physical device that has nothing
to do with that flow (an invalid device for the flow).
- Offloading to not just one, but several such invalid physical devices.
- Consuming HW resources for a flow that is never seen by, nor intended
to be processed by, those physical devices.
- Impacting flow scale on other physical devices, since it would consume
their HW resources with a large number of such invalid flows.
- The indirect list used to track these multiple mappings complicates
the offload layer implementation.
- The addition of flow_dump_create() to the offload APIs, just to parse
and get a list of user datapath netdevs, is confusing and not needed.

I have been exploring an alternate design to address this problem of
figuring out the right physical device for a given tunnel inner-flow.
I will send a patch, please take a look so we can continue the discussion.

Thanks,
-Harsha
Sriharsha Basavapatna Feb. 5, 2021, 6:26 p.m. UTC | #2
On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
<sriharsha.basavapatna@broadcom.com> wrote:
>
> [...]
> I have been exploring an alternate design to address this problem of
> figuring out the right physical device for a given tunnel inner-flow.
> I will send a patch, please take a look so we can continue the discussion.

I just posted this patch, please see the link below; this is currently
based on the decap offload patchset (but it can be rebased if needed,
without the decap patchset too). The patch provides changes to pass
physical port (orig_in_port) information to the offload layer in the
context of flow F2. Note that additional changes would be needed in
the decap patchset to utilize this, if we agree on this approach.

https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
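
The idea, as a rough sketch ('orig_in_port' is the field the patch adds
to the offload metadata; the helper function below is hypothetical):

    #include "netdev-offload.h"  /* struct offload_info, netdev_ports_get() */

    /* With orig_in_port carried in the offload metadata, a vport flow
     * (F2) can be installed on the one uplink that actually received
     * the packet, instead of on every physical port. */
    static struct netdev *
    uplink_for_vport_flow(const struct offload_info *info,
                          const char *dpif_type)
    {
        /* 'orig_in_port' is the odp port number of the physical device
         * on which the tunneled packet arrived. */
        return netdev_ports_get(info->orig_in_port, dpif_type);
    }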

Thanks,
-Harsha
Eli Britstein Feb. 7, 2021, 11:27 a.m. UTC | #3
On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> <sriharsha.basavapatna@broadcom.com> wrote:
>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
>>> [...]
> I just posted this patch, please see the link below; this is currently
> based on the decap offload patchset (but it can be rebased if needed,
> without the decap patchset too). The patch provides changes to pass
> physical port (orig_in_port) information to the offload layer in the
> context of flow F2. Note that additional changes would be needed in
> the decap patchset to utilize this, if we agree on this approach.
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/

Thanks for that.

However, we cannot say packets on that flow *cannot* arrive from other 
PFs. For example, consider ECMP. Packets can arrive from either port, 
depending on external routing decisions.

Thus, if we want to support it, we should keep the 1-to-many 
indirection, and not assume that packets always arrive from the 
same port as the first packet that created this flow.

Also, please note that the HW rule is applied on PFs only, so in your 
example P1,...,Pn, "n" is expected to be quite a small number.

Sriharsha Basavapatna Feb. 8, 2021, 1:11 p.m. UTC | #4
On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr@nvidia.com> wrote:
>
>
> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> > On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> > <sriharsha.basavapatna@broadcom.com> wrote:
> >> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>> [...]
> > I just posted this patch, please see the link below; this is currently
> > based on the decap offload patchset (but it can be rebased if needed,
> > without the decap patchset too). The patch provides changes to pass
> > physical port (orig_in_port) information to the offload layer in the
> > context of flow F2. Note that additional changes would be needed in
> > the decap patchset to utilize this, if we agree on this approach.
> >
> > https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
>
> Thanks for that.
>
> However, we cannot say packets on that flow *cannot* arrive from other
> PFs. For example, consider ECMP. Packets can arrive from either port,
> depending on external routing decisions.

Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
packets for P0 be received over P1 ? With ECMP, packets could take a
different route in the network. But the destination remains the same.
>
> Thus, if we want to support it, we should keep the 1-to-many
> indirection, and not assume that packets always arrive from the
> same port as the first packet that created this flow.

The rationale behind offloading to multiple ports as per your patches
is not to support ECMP, but because there's no way to figure out the
uplink port for the flow in the context of a vport. I don't see ECMP
mentioned anywhere in your patchset ?
>
> Also, please note that the HW rule is applied on PFs only, so in your
> example P1,...,Pn, "n" is expected to be quite a small number.

You can't predict or control the number of physical devices that can
be configured in the system (e.g., multiple physical bridges each
connecting to a different network). So you cannot assume that "n" is a
small number. The point here is that with the current proposal, we end
up offloading a flow incorrectly to unrelated physical devices.
Eli Britstein Feb. 8, 2021, 2:03 p.m. UTC | #5
On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr@nvidia.com> wrote:
>>
>> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
>>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
>>> <sriharsha.basavapatna@broadcom.com> wrote:
>>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
>>>>> [...]
>> Thanks for that.
>>
>> However, we cannot say packets on that flow *cannot* arrive from other
>> PFs. For example, consider ECMP. Packets can arrive from either port,
>> depending on external routing decisions.
> Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
> packets for P0 be received over P1 ? With ECMP, packets could take a
> different route in the network. But the destination remains the same.

Suppose both P0 and P1 are under br-phy, and the IP is that of br-phy.

Another example is that each is on a different bridge, and the IP is 
moved from one to the other.

>> Thus, if we want to support it, we should keep the 1-to-many
>> indirection, and not assume that packets always arrive from the
>> same port as the first packet that created this flow.
> The rationale behind offloading to multiple ports as per your patches
> is not to support ECMP, but because there's no way to figure out the
> uplink port for the flow in the context of a vport. I don't see ECMP
> mentioned anywhere in your patchset ?
ECMP is just an example. The point is we cannot guarantee the "correct" 
in_port of a vport.
>> Also, please note that the HW rule is applied on PFs only, so in your
>> example P1,...,Pn, "n" is expected to be quite a small number.
> You can't predict or control the number of physical devices that can
> be configured in the system (e.g., multiple physical bridges each
> connecting to a different network). So you cannot assume that "n" is a
> small number. The point here is that with the current proposal, we end
> up offloading a flow incorrectly to unrelated physical devices.

"n" is limited by the number of physical cards there are in the system 
(and used by OVS), unlike VFs (or SFs) that can be much more, so though 
I cannot predict it, the point is that the number of "unrelated" rules 
is probably not high.

Another point regarding this is that some of the "unrelated" rules may 
not be offloaded after all, being rejected by PMDs, for example if PF->Px 
cannot be supported. On the other hand, drop is probably supported on 
all PMDs.

Right, we may end up with more flows, as we cannot be sure what is a 
"related" and what is an "unrelated" physical device.

BTW, note that the Linux kernel also uses the same approach, via 
flow_indr_dev_register, which is used by the broadcom, mlx5 and 
netronome drivers.

I agree that your proposal works for the simple use cases. As there must 
be some kind of indirection anyway, because after all we must apply the 
rule on a physical device that is not the vport itself, I think the cost 
of this approach to cover more use-cases is worth it.

Sriharsha Basavapatna Feb. 8, 2021, 4:21 p.m. UTC | #6
On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <elibr@nvidia.com> wrote:
>
>
> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> > On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>
> >> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> >>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> >>> <sriharsha.basavapatna@broadcom.com> wrote:
> >>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>>>> [...]
> Suppose both P0 and P1 are under br-phy, and the IP is that of br-phy.
How can you have multiple uplink ports connected to the same bridge,
unless you have them in some kind of bonding/link-aggregation
configuration ? Also, for br-phy you need to provide the mac address
of the underlying physical interface to the bridge.
Take a look at step-5 in this document:
https://docs.openvswitch.org/en/latest/howto/userspace-tunneling/
If you have 2 uplink ports, which mac address will you specify ?

>
> Another example is that each is on a different bridge, and the IP is
> moved from one to the other.
Here you are changing the tunnel configuration; this has to go through
some kind of reconfiguration or flow modification, since you are
changing the IP/mac of br-phy.

>
> >> Thus, if we want to support it, we should keep the 1-to-many
> >> indirection, and not assume that packets always arrive from the
> >> same port as the first packet that created this flow.
> > The rationale behind offloading to multiple ports as per your patches
> > is not to support ECMP, but because there's no way to figure out the
> > uplink port for the flow in the context of a vport. I don't see ECMP
> > mentioned anywhere in your patchset ?
> ECMP is just an example. The point is we cannot guarantee the "correct"
> in_port of a vport.
The patch that I sent out provides the correct in_port for a given
flow [ because we already know the port on which it was received, it
just needs to be passed as metadata ! ].

> >> Also, please note that the HW rule is applied on PFs only, so in your
> >> example P1,...,Pn, "n" is expected to be quite a small number.
> > You can't predict or control the number of physical devices that can
> > be configured in the system (e.g., multiple physical bridges each
> > connecting to a different network). So you cannot assume that "n" is a
> > small number. The point here is that with the current proposal, we end
> > up offloading a flow incorrectly to unrelated physical devices.
>
> "n" is limited by the number of physical cards there are in the system
> (and used by OVS), unlike VFs (or SFs) that can be much more, so though
> I cannot predict it, the point is that the number of "unrelated" rules
> is probably not high.
There are cards with multiple physical ports and there could be
multiple such physical cards in the system. And since we are talking
about vport flows (F2), it could be thousands of such flows
incorrectly offloaded on each of those other ports, on each card.

>
> Another point regarding this is that some of the "unrelated" rules may
> not be offloaded after all, being rejected by PMDs, for example if PF->Px
> cannot be supported. On the other hand, drop is probably supported on
> all PMDs.
>
> Right, we may end up with more flows, as we cannot be sure what is a
> "related" and what is an "unrelated" physical device.
Thanks for agreeing that we will end up with more flows. This is
precisely what we want to avoid: ending up with more flows offloaded
on devices that are not in the picture at all w.r.t. the flows in
question.

>
> BTW, note that the Linux kernel also uses the same approach, via
> flow_indr_dev_register, which is used by the broadcom, mlx5 and
> netronome drivers.
I've seen that, and it is also complicated (independent of how many
drivers use it); I'm sure it can be simplified too.

>
> I agree that your proposal works for the simple use cases. As there must
> be some kind of indirection anyway, because after all we must apply the
> rule on a physical device that is not the vport itself, I think the cost
> of this approach to cover more use-cases is worth it.
Can we go ahead with this approach for now and build on it for more
use-cases as needed later ?

Eli Britstein Feb. 9, 2021, 3:11 p.m. UTC | #7
On 2/8/2021 6:21 PM, Sriharsha Basavapatna wrote:
> On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <elibr@nvidia.com> wrote:
>>
>> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
>>> On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr@nvidia.com> wrote:
>>>> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
>>>>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
>>>>> <sriharsha.basavapatna@broadcom.com> wrote:
>>>>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
>>>>>>> [...]
>> I agree that your proposal works for the simple use cases. As there must
>> be some kind of indirection anyway, because after all we must apply the
>> rule on a physical device that is not the vport itself, I think the cost
>> of this approach to cover more use-cases is worth it.
> Can we go ahead with this approach for now and build on it for more
> use-cases as needed later ?

OK. Let's start with supporting the simple use cases. I will take your 
patch as a reference and re-spin the series.

We may enhance it in the future.

Thanks.

Sriharsha Basavapatna Feb. 9, 2021, 3:25 p.m. UTC | #8
On Tue, Feb 9, 2021 at 8:41 PM Eli Britstein <elibr@nvidia.com> wrote:
>
>
> On 2/8/2021 6:21 PM, Sriharsha Basavapatna wrote:
> > On Mon, Feb 8, 2021 at 7:33 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>
> >> On 2/8/2021 3:11 PM, Sriharsha Basavapatna wrote:
> >>> On Sun, Feb 7, 2021 at 4:58 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>>> On 2/5/2021 8:26 PM, Sriharsha Basavapatna wrote:
> >>>>> On Fri, Feb 5, 2021 at 4:55 PM Sriharsha Basavapatna
> >>>>> <sriharsha.basavapatna@broadcom.com> wrote:
> >>>>>> On Wed, Jan 27, 2021 at 11:40 PM Eli Britstein <elibr@nvidia.com> wrote:
> >>>>>>> [Cover letter, patch list, and diffstat quoted in full; snipped.]
> >>>>>> Hi Eli,
> >>>>>>
> >>>>>> Thanks for posting this new patchset to support tunnel decap action offload.
> >>>>>>
> >>>>>> I haven't looked at the entire patchset yet, but I focused on the
> >>>>>> patches that introduce a 1-to-many mapping between an OVS flow (F2)
> >>>>>> and HW-offloaded flows.
> >>>>>>
> >>>>>> Here is a representation of the design proposed in this patchset. A
> >>>>>> flow F2 (inner flow) between the VxLAN-vPort and VFRep-1, for which
> >>>>>> the underlying uplink/physical port is P0, gets offloaded not only to
> >>>>>> P0, but also to other physical ports P1, P2... and so on.
> >>>>>>
> >>>>>>        P0 <----> VxLAN-vPort <----> VFRep-1
> >>>>>>
> >>>>>>        P1
> >>>>>>        P2
> >>>>>>        ...
> >>>>>>        Pn
> >>>>>>
> >>>>>> IMO, the problems with this design are:
> >>>>>>
> >>>>>> - Offloading a flow to an unrelated physical device that has nothing
> >>>>>> to do with that flow (invalid device for the flow).
> >>>>>> - Offloading to not just one, but several such invalid physical devices.
> >>>>>> - Consuming HW resources for a flow that is never seen or intended to
> >>>>>> be processed by those physical devices.
> >>>>>> - Impacts flow scale on other physical devices, since it would consume
> >>>>>> their HW resources with a large number of such invalid flows.
> >>>>>> - The indirect list used to track these multiple mappings complicates
> >>>>>> the offload layer implementation.
> >>>>>> - The addition of flow_dump_create() to the offload APIs, just to
> >>>>>> parse and get a list of user datapath netdevs, is confusing and not
> >>>>>> needed.
> >>>>>>
> >>>>>> I have been exploring an alternate design to address this problem
> >>>>>> of figuring out the right physical device for a given tunnel
> >>>>>> inner flow. I will send a patch; please take a look so we can
> >>>>>> continue the discussion.
> >>>>> I just posted this patch; please see the link below. It is currently
> >>>>> based on the decap offload patchset (but it can be rebased without
> >>>>> the decap patchset too, if needed). The patch provides changes to pass
> >>>>> physical port (orig_in_port) information to the offload layer in the
> >>>>> context of flow F2. Note that additional changes would be needed in
> >>>>> the decap patchset to utilize this, if we agree on this approach.
> >>>>>
> >>>>> https://patchwork.ozlabs.org/project/openvswitch/patch/20210205180413.43337-1-sriharsha.basavapatna@broadcom.com/
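The idea in that patch, reduced to a minimal C sketch (the structure and
helper names below are hypothetical, not the actual OVS symbols): record
the physical port on which the tunnel packet arrived when the flow is
created, and hand it to the offload layer so the rule can be applied on
exactly that device.

    /* Hypothetical sketch: carry the original physical in_port with the
     * vport flow down to the offload layer. */

    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef uint32_t odp_port_t;     /* Datapath port number. */

    struct netdev;
    struct match;
    struct nlattr;

    struct offload_info_sketch {
        odp_port_t orig_in_port;     /* Uplink the packet arrived on. */
    };

    /* Hypothetical helpers. */
    struct netdev *uplink_lookup(odp_port_t port_no);
    int uplink_flow_put(struct netdev *uplink, const struct match *,
                        const struct nlattr *actions, size_t actions_len,
                        const struct offload_info_sketch *);

    static int
    offload_vport_flow(const struct match *match,
                       const struct nlattr *actions, size_t actions_len,
                       odp_port_t orig_in_port)
    {
        struct offload_info_sketch info = { .orig_in_port = orig_in_port };

        /* Resolve the one physical device this flow belongs to, instead
         * of trying every uplink in the system. */
        struct netdev *uplink = uplink_lookup(orig_in_port);

        return uplink ? uplink_flow_put(uplink, match, actions,
                                        actions_len, &info)
                      : EOPNOTSUPP;
    }

The design point is that the datapath already knows the receive port, so
passing it as metadata gives a 1-to-1 flow-to-device mapping.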
> >>>> Thanks for that.
> >>>>
> >>>> However, we cannot say packets on that flow *cannot* arrive from other
> >>>> PFs. For example, consider ECMP. Packets can arrive from either port,
> >>>> depending on external routing decisions.
> >>> Let us say P0's IP is 192.168.1.x and P1's IP is 172.16.1.x. How can
> >>> packets for P0 be received over P1? With ECMP, packets could take a
> >>> different route in the network. But the destination remains the same.
> >> Suppose both P0 and P1 are under br-phy, and the IP is of br-phy.
> > How can you have multiple uplink ports connected to the same bridge,
> > unless you have them in some kind of bonding/link-aggregation
> > configuration? Also, for br-phy you need to provide the MAC address
> > of the underlying physical interface to the bridge.
> > Take a look at step-5 in this document:
> > https://docs.openvswitch.org/en/latest/howto/userspace-tunneling/
> > If you have 2 uplink ports, which MAC address will you specify?
> >
> >> Another example is where each port is on a different bridge, and the
> >> IP is moved from one to the other.
> > Here you are changing the tunnel configuration; this has to go through
> > some kind of reconfiguration or flow modification, since you are
> > changing the IP/MAC of br-phy.
> >
> >>>> Thus, if we want to support it, we should keep the 1-to-many
> >>>> indirection, and not assume that packets always arrive from the
> >>>> same port as the first packet that created this flow.
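For contrast, the 1-to-many indirection defended above amounts to
something like the following sketch (all names hypothetical): the same
vport flow is attempted on every known uplink, and every per-port handle
has to be tracked so that all of them can be destroyed when the single
OVS flow is deleted.

    /* Hypothetical sketch of the 1-to-many approach: install the same
     * vport flow on every physical port attached to OVS. */

    #include <stddef.h>

    struct netdev;
    struct vport_flow;
    struct flow_handle;

    /* Hypothetical helpers. */
    struct flow_handle *try_offload_on(struct netdev *uplink,
                                       const struct vport_flow *);
    void handle_track(const struct vport_flow *, struct flow_handle *);

    static void
    offload_on_all_uplinks(const struct vport_flow *flow,
                           struct netdev **uplinks, size_t n_uplinks)
    {
        for (size_t i = 0; i < n_uplinks; i++) {
            struct flow_handle *h = try_offload_on(uplinks[i], flow);

            if (h) {
                /* Each per-port rule must be remembered: all of them
                 * map back to one OVS flow and must be destroyed with
                 * it. A PMD may also reject the rule (e.g. PF->Px not
                 * supported), in which case that port is skipped. */
                handle_track(flow, h);
            }
        }
    }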
> >>> The rationale behind offloading to multiple ports as per your
> >>> patches is not to support ECMP, but because there's no way to figure
> >>> out the uplink port for the flow in the context of a vport. I don't
> >>> see ECMP mentioned anywhere in your patchset.
> >> ECMP is just an example. The point is we cannot guarantee the "correct"
> >> in_port of a vport.
> > The patch that I sent out provides the correct in_port for a given
> > flow [because we already know the port on which it was received; it
> > just needs to be passed as metadata!].
> >
> >>>> Also, please note that the HW rule is applied on PFs only, so in
> >>>> your example P1,...,Pn, "n" is expected to be quite a small number.
> >>> You can't predict or control the number of physical devices that can
> >>> be configured in the system (e.g., multiple physical bridges each
> >>> connecting to a different network). So you cannot assume that "n" is a
> >>> small number. The point here is that with the current proposal, we end
> >>> up offloading a flow incorrectly to unrelated physical devices.
> >> "n" is limited by the number of physical cards there are in the system
> >> (and used by OVS), unlike VFs (or SFs) that can be much more, so though
> >> I cannot predict it, the point is that the number of "unrelated" rules
> >> is probably not high.
> > There are cards with multiple physical ports and there could be
> > multiple such physical cards in the system. And since we are talking
> > about vport flows (F2), it could be thousands of such flows
> > incorrectly offloaded on each of those other ports, on each card.
> >
> >> Another point is that some of the "unrelated" rules may not be
> >> offloaded after all, as they may be rejected by PMDs, for example if
> >> PF->Px cannot be supported. On the other hand, a drop action is
> >> probably supported by all PMDs.
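As a point of reference on the drop case, a minimal rte_flow drop rule
looks roughly like this under the standard DPDK 20.11 rte_flow API (a
sketch with error handling trimmed; the attribute and pattern choices are
illustrative):

    #include <rte_flow.h>

    /* Sketch: hardware drop rule on a given DPDK port, matching any
     * ingress Ethernet packet. */
    static struct rte_flow *
    create_drop_rule(uint16_t port_id, struct rte_flow_error *error)
    {
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_item pattern[] = {
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_DROP },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, error);
    }

Since the action carries no port-specific state, a PMD that supports
rte_flow at all is likely to accept it, which is the point being made
above.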
> >>
> >> Right, we may end up with more flows, as we cannot be sure what is a
> >> "related" and what is an "unrelated" physical device.
> > Thanks for agreeing that we will end up with more flows. This is
> > precisely what we want to avoid: ending up with more flows offloaded
> > on devices that are not in the picture at all w.r.t. the flows in
> > question.
> >
> >> BTW, note that the Linux kernel also uses the same approach, via
> >> flow_indr_dev_register(), which is used by the Broadcom, mlx5 and
> >> Netronome drivers.
> > I've seen that, and it is also complicated (independent of how many
> > drivers use it); I'm sure it can be simplified too.
> >
> >> I agree that your proposal works for the simple use cases. As there
> >> must be some kind of indirection (after all, we must apply the rule on
> >> a physical device that is not the vport itself), I think the cost of
> >> this approach to cover more use cases is worthwhile.
> > Can we go ahead with this approach for now and build on it for more
> > use cases as needed later?
>
> OK. Let's start with supporting the simple use cases. I will take your
> patch as a reference and re-spin the series.
>
> We may enhance it in the future.
>
> Thanks.
Ok, thanks.