
[ovs-dev,00/25] netdev datapath vxlan offload

Message ID 20200120150830.16262-1-elibr@mellanox.com

Message

Eli Britstein Jan. 20, 2020, 3:08 p.m. UTC
In the netdev datapath, packets arriving over a tunnel are processed by
more than one flow. For example a packet that should be decapsulated and
forwarded to another port is processed by two flows:
1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
   actions:tnl_pop(vxlan_sys_4789).
2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
   inner-header matches, actions: vm1.

In order to offload such a multi-flow processing path, a multi-table HW
model is used. Flow #1 matches are offloaded as is, but tnl_pop is
translated to MARK and JUMP to a tnl-port table actions. Flow #2 is
offloaded to that tnl-port table, its tunnel matches are translated to
outer-header matches, followed by the inner-header ones. The actions
start with an implicit DECAP action followed by the flow's actions.
The DECAP is done in flow #2. If it was done in flow #1, the outer
headers of the packets would have been lost, and they could not have
been matched in flow #2.
Furthermore, once the HW processing path is composed of multi-table
processing, there is a need to handle the case where processing starts
in HW, but is not completed in HW. For example, only flow #1 is
offloaded while flow #2 doesn't exist. For that, the MARK value set in
the tnl_pop action is received in SW, the packet is recovered by the
mapped vport, and processing continues in SW.
As vports are virtual and cannot have rte_flows, offloading a vport
generates rte_flows for every PF in the system, as we cannot know from
which one the packets will arrive.
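
For illustration only, flow #1 under this model could be expressed with
rte_flow roughly as in the sketch below. It is a minimal sketch, not code
from the series: the tnl-port table id and the mark value are placeholders,
and the outer-header items are reduced to their types.

/* Sketch: offload of flow #1.  Match the outer headers and UDP dst-port
 * 4789, then MARK the packet (so a HW miss can be recovered in SW) and
 * JUMP to the tnl-port table that holds the offloaded flow #2.
 * Values are illustrative. */
#include <rte_flow.h>

static struct rte_flow *
offload_tnl_pop_sketch(uint16_t port_id, uint32_t vport_mark,
                       uint32_t tnl_port_group,
                       struct rte_flow_error *error)
{
    struct rte_flow_attr attr = { .ingress = 1 };   /* root group 0 */
    struct rte_flow_item_udp udp_spec = {
        .hdr.dst_port = RTE_BE16(4789),             /* VXLAN */
    };
    struct rte_flow_item_udp udp_mask = {
        .hdr.dst_port = RTE_BE16(0xffff),
    };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },         /* outer L2 */
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },        /* outer L3 */
        { .type = RTE_FLOW_ITEM_TYPE_UDP,
          .spec = &udp_spec, .mask = &udp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };
    struct rte_flow_action_mark mark = { .id = vport_mark };
    struct rte_flow_action_jump jump = { .group = tnl_port_group };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(port_id, &attr, pattern, actions, error);
}

Flow #2 would then be created in group tnl_port_group with the
outer-header and tunnel matches followed by the inner-header ones, and
its action list would start with an implicit decap (e.g.
RTE_FLOW_ACTION_TYPE_VXLAN_DECAP) before the flow's own actions.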

This series adds support for IPv6, tunnel push actions, and the
multi-table model for the decap processing.

v1 Travis passed: https://travis-ci.org/elibritstein/OVS/builds/639484233

Eli Britstein (14):
  dpif-netdev: Add mega ufid in flow add log
  netdev-offload-dpdk: Remove pre-validate of patterns function
  netdev-offload-dpdk: Add IPv6 pattern matching
  netdev-offload-dpdk: Support offload of clone tnl_push/output actions
  netdev-offload-dpdk: Support tnl/push using vxlan encap attribute
  netdev-offload: Don't use zero flow mark
  netdev-offload-dpdk: Introduce map APIs for table id and miss context
  netdev-offload-dpdk: Implement HW miss packet recover for vport
  netdev-offload-dpdk: Refactor disassociate ufid function
  netdev-offload-dpdk: Support tunnel pop action
  netdev-offload-dpdk: Implement flow dump create/destroy APIs
  netdev-dpdk: Getter function for dpdk devargs API
  netdev-dpdk: Introduce get netdev by devargs function
  netdev-dpdk-offload: Add vxlan pattern matching function

Ilya Maximets (4):
  netdev: Allow storing dpif type into netdev structure.
  netdev-offload: Use dpif type instead of class.
  netdev-offload: Allow offloading to netdev without ifindex.
  netdev-offload: Disallow offloading to unrelated tunneling vports.

Ophir Munk (6):
  dpif-netdev: Make mark allocation public API
  dpif-netdev: Add HW miss packet state recover logic
  netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
    port
  netdev-dpdk-offload: Infrastructure for multiple rte_flows per UFID
  netdev-dpdk: Add flow_api support for netdev vxlan vports
  netdev-offload-dpdk: Support offload for vxlan vport

Oz Shlomo (1):
  netdev-offload: Add HW miss packet state recover API

 Documentation/howto/dpdk.rst  |    5 +-
 NEWS                          |    6 +-
 lib/dpif-netdev.c             |   70 +--
 lib/dpif-netlink.c            |   23 +-
 lib/dpif.c                    |   21 +-
 lib/netdev-dpdk.c             |   68 +++
 lib/netdev-dpdk.h             |    6 +
 lib/netdev-offload-dpdk.c     | 1275 +++++++++++++++++++++++++++++++++++------
 lib/netdev-offload-provider.h |    6 +
 lib/netdev-offload-tc.c       |   11 +-
 lib/netdev-offload.c          |  115 ++--
 lib/netdev-offload.h          |   23 +-
 lib/netdev-provider.h         |    3 +-
 lib/netdev.c                  |   16 +
 lib/netdev.h                  |    2 +
 ofproto/ofproto-dpif-upcall.c |    5 +-
 tests/dpif-netdev.at          |   14 +-
 tests/ofproto-macros.at       |    3 +-
 18 files changed, 1394 insertions(+), 278 deletions(-)

Comments

Eli Britstein Feb. 3, 2020, 11:18 a.m. UTC | #1
ping

Sriharsha Basavapatna Feb. 3, 2020, 5:05 p.m. UTC | #2
Hi Eli,

Thanks for sending this patchset. I have some questions about the
design, please see my comments below.

Thanks,
-Harsha

On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <elibr@mellanox.com> wrote:
>
> In the netdev datapath, packets arriving over a tunnel are processed by
> more than one flow. For example a packet that should be decapsulated and
> forwarded to another port is processed by two flows:
> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
>    actions:tnl_pop(vxlan_sys_4789).
> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
>    inner-header matches, actions: vm1.
>
> In order to offload such a multi-flow processing path, a multi-table HW
> model is used.

Did you explore other ways to avoid this multi-flow processing model?
For example, maybe support some additional logic to merge the two
datapath flow-rules for decap into a single flow-rule in the offload
layer?

> Flow #1 matches are offloaded as is, but tnl_pop is
> translated to MARK and JUMP to a tnl-port table actions.

The response to the tnl_pop action is now different between non-offload
and offload datapaths, since the offload layer changes it into entirely
different actions. The user would expect a consistent response to the
tnl_pop action, especially if you treat this flow rule as an independent
rule by itself (i.e., assuming there's no flow #2 subsequently).

Regarding the JUMP action:
IMO, offload framework should avoid anything that is too vendor
specific; and make use of simple/common abstractions. I'm not sure if
JUMP action is really needed to support offload of tnl_pop action.
Also, it may not map to the real use case for which JUMP action seems
to have been defined, which is to support flow-grouping (see
rte_flow.h).

> Flow #2 is
> offloaded to that tnl-port table, its tunnel matches are translated to
> outer-header matches, followed by the inner-header ones. The actions
> start with an implicit DECAP action followed by the flow's actions.
> The DECAP is done in flow #2. If it was done in flow #1, the outer
> headers of the packets would have been lost, and they could not have
> been matched in flow #2.

The only missing match field that would allow decap in flow #1 itself
seems to be the tunnel-id. If that is provided in flow #1, then we may
not need to match it again in flow #2, and flow #2 could just match on
the inner packet headers. This would also make handling a flow-miss
simple, since you wouldn't need the recovery logic (the offload layer
popping the header, etc.). The packet would be in the right state for
the next stage of processing in SW if there's a flow-miss after the
first stage in HW. It also means tnl_pop can be done in flow #1,
making the behavior consistent with non-offload processing.

The other advantage of providing the tunnel-id in flow #1 is that we
wouldn't need to match on the entire packet, including the
outer headers, in the second stage; only the inner header fields
would need to be matched in flow #2.
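
To illustrate the suggestion, the single-flow alternative would roughly
take the following shape in rte_flow. This is a sketch of the idea only,
not code from the series; it assumes the device supports
RTE_FLOW_ACTION_TYPE_VXLAN_DECAP and uses an example VNI.

/* Sketch: the suggested single-flow alternative.  Match the outer
 * headers plus the VXLAN VNI directly in flow #1 and decap there,
 * leaving only inner-header matches for a subsequent stage. */
#include <rte_flow.h>

static struct rte_flow_item_vxlan vxlan_spec = {
    .vni = { 0x00, 0x00, 0x64 },                /* example VNI 100 */
};
static struct rte_flow_item_vxlan vxlan_mask = {
    .vni = { 0xff, 0xff, 0xff },
};
static struct rte_flow_item pattern[] = {
    { .type = RTE_FLOW_ITEM_TYPE_ETH },
    { .type = RTE_FLOW_ITEM_TYPE_IPV4 },
    { .type = RTE_FLOW_ITEM_TYPE_UDP },
    { .type = RTE_FLOW_ITEM_TYPE_VXLAN,
      .spec = &vxlan_spec, .mask = &vxlan_mask },
    { .type = RTE_FLOW_ITEM_TYPE_END },
};
static struct rte_flow_action actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
    /* ... forwarding actions for the decapsulated packet ... */
    { .type = RTE_FLOW_ACTION_TYPE_END },
};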

> Furthermore, once the HW processing path is composed of a multi table
> processing, there is a need to handle a case where the processing starts
> in HW, but is not completed in HW. For example, only flow #1 is
> offloaded while flow #2 doesn't exist. For that, the MARK value set in
> the tnl_pop action is received in SW, the packet is recovered by the
> mapped vport, and processing continues in SW.

If it is not possible to provide the tunnel-id in flow #1, then the other
alternative might be to defer offloading until flow #2 and offload
only flow #2. You anyway match almost the entire outer headers
and the inner headers in flow #2, and the decap action is also added
implicitly to flow #2. Why not avoid offloading flow #1 entirely? That
way, none of the miss handling code is required.

> As vports are virtual and cannot have rte_flows, offloading a vport
> generates rte_flows for every PF in the system, as we cannot know from
> which one the packets arrive from.

Is it possible to track the "original in-port" through this two-stage
flow processing and generate the rte_flow only on the specific PF that
received the packet?

Thanks,
-Harsha
Eli Britstein Feb. 5, 2020, 2:30 p.m. UTC | #3
On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> Hi Eli,
>
> Thanks for sending this patchset. I have some questions about the
> design, please see my comments below.
>
> Thanks,
> -Harsha
>
> On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <elibr@mellanox.com> wrote:
>> In the netdev datapath, packets arriving over a tunnel are processed by
>> more than one flow. For example a packet that should be decapsulated and
>> forwarded to another port is processed by two flows:
>> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
>>     actions:tnl_pop(vxlan_sys_4789).
>> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
>>     inner-header matches, actions: vm1.
>>
>> In order to offload such a multi-flow processing path, a multi-table HW
>> model is used.
> Did you explore other ways to avoid this multi-flow processing model ?
> For example, may be support some additional logic to merge the two
> datapath flow-rules for decap into a single flow-rule in the offload
> layer ?
Yes, we did explore this approach and our research led us to conclude
that following the OVS multi-table SW model in HW proves to be a more
robust architecture.
A single-table architecture for tunnel offloads, for example, has the
following issues:
1. Statistics – flow #1 counts bytes including the outer header. If the
outer header is IPv6, it can have extensions, which have to be counted
per packet. Flow #2 counts the bytes of the inner packet only.
2. Update rate – collapsing multiple tables into a single one means a
Cartesian product of all the flows. Any change of a single flow in one
table means a lot of removals and insertions of rules in HW.
The single-table model also breaks when offloading more complex actions
such as CT, dp_hash and others.
>
>> Flow #1 matches are offloaded as is, but tnl_pop is
>> translated to MARK and JUMP to a tnl-port table actions.
> The response to tnl_pop action is now different between non-offload
> and offload datapaths, since offload layer changes it into entirely
> different
> actions. The user would expect consistent response to tnl_pop action,
> especially if you treat this flow rule as an independent rule by
> itself (i.e,
> assuming there's no flow #2 subsequently).
>
> Regarding the JUMP action:
> IMO, offload framework should avoid anything that is too vendor
> specific; and make use of simple/common abstractions. I'm not sure if
> JUMP action is really needed to support offload of tnl_pop action.
> Also, it may not map to the real use case for which JUMP action seems
> to have been defined, which is to support flow-grouping (see
> rte_flow.h).
The rte_flow API does not offer an action equivalent to moving the outer
headers to some metadata buffer, like the SW tnl_pop action does. The JUMP
action is equivalent to recirculating on the vport, and the MARK action
makes sure the user gets a consistent response in SW/HW (again,
sticking to the SW model): in case of a miss, the packet is recovered
and continues SW processing normally.
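
To make the recovery step concrete, the sketch below shows the general
idea. It is illustrative only: mark_to_vport_find() is a hypothetical
helper standing in for whatever mapping is maintained between mark
values and vports, and the odp port is represented by a plain uint32_t.

/* Sketch: HW-miss recovery.  If flow #1 was matched in HW, the packet
 * carries the MARK id in the mbuf; SW maps it back to the tunnel vport
 * and continues normal processing from there.  The packet still has its
 * outer headers, since DECAP is only applied by flow #2. */
#include <stdbool.h>
#include <stdint.h>
#include <rte_mbuf.h>

/* Hypothetical mapping helper (not the series' actual API). */
bool mark_to_vport_find(uint32_t mark, uint32_t *vport_odp);

static int
hw_miss_recover_sketch(struct rte_mbuf *mbuf, uint32_t *vport_odp)
{
    uint32_t mark;

    if (!(mbuf->ol_flags & PKT_RX_FDIR_ID)) {
        return -1;              /* No HW mark: nothing to recover. */
    }
    mark = mbuf->hash.fdir.hi;  /* MARK id set by the flow #1 rule. */

    if (!mark_to_vport_find(mark, vport_odp)) {
        return -1;              /* Unknown mark. */
    }
    /* Continue SW processing as if recirculated on *vport_odp. */
    return 0;
}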
>
>> Flow #2 is
>> offloaded to that tnl-port table, its tunnel matches are translated to
>> outer-header matches, followed by the inner-header ones. The actions
>> start with an implicit DECAP action followed by the flow's actions.
>> The DECAP is done in flow #2. If it was done in flow #1, the outer
>> headers of the packets would have been lost, and they could not have
>> been matched in flow #2.
> The only missing match field that would allow decap on flow #1 itself,
> seems to be the tunnel-id. If that is provided in flow #1, then we may
> not need to match it again in flow #2. And flow #2 could just match on
> the inner packet headers. This would also make handling the flow-miss
> simple, since you wouldn't need the recovery logic (offload layer
> popping the header etc). The packet would be in the right state for
> the next stage of processing in SW, if there's a  flow-miss after the
> first stage in HW. It also means tnl_pop can be done in flow #1,
> making the behavior consistent with non-offload processing.
>
> The other advantage of providing the tunnel-id in flow #1 is that,
> then we wouldn't need to match on the entire packet including the
> outer-headers in the second stage and only the inner header fields
> would need to be matched in flow #2.
Even if we could somehow get the matches from flow #2 into flow #1
(leading back to the single-flow model…) and do the DECAP in flow #1,
then we would have to change the SW matches in flow #2 in case it is
processed in SW, as it won't have the outer header/tunnel anymore.
>
>> Furthermore, once the HW processing path is composed of a multi table
>> processing, there is a need to handle a case where the processing starts
>> in HW, but is not completed in HW. For example, only flow #1 is
>> offloaded while flow #2 doesn't exist. For that, the MARK value set in
>> the tnl_pop action is received in SW, the packet is recovered by the
>> mapped vport, and processing continues in SW.
> If it is not possible to provide tunnel-id in flow #1, then the other
> alternative might be to defer offloading until flow #2 and offload
> only flow #2. You are anyway matching almost the entire outer-headers
> and the inner-header in flow #2 and the decap action is also added
> implicitly to flow #2. Why not totally avoid offloading flow #1 ? That
> way, none of the miss handling code is required.
See the first argument about following the SW model.
>
>> As vports are virtual and cannot have rte_flows, offloading a vport
>> generates rte_flows for every PF in the system, as we cannot know from
>> which one the packets arrive from.
> Is it possible to track the "original in-port" through this two-stage
> flow processing and generate rte_flow only on the specific PF that
> received the
> packet ?
We could “track” it only if we tried to merge both flow #1 and flow #2. 
See the first argument about following the SW model.
Hemal Shah Feb. 10, 2020, 9:16 p.m. UTC | #4
Eli,

There are some fundamental architecture issues (multi HW tables vs. single
HW table, use of rte_flow JUMP action for mapping ovs sw tnl_pop action,
etc.) that we need to discuss before we can get into the details of the
patchset.
I'm inlining my comments on the discussion between you and Harsha below.

Hemal


On Wed, Feb 5, 2020 at 6:31 AM Eli Britstein <elibr@mellanox.com> wrote:

>
> On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> > Hi Eli,
> >
> > Thanks for sending this patchset. I have some questions about the
> > design, please see my comments below.
> >
> > Thanks,
> > -Harsha
> >
> > On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein <elibr@mellanox.com>
> wrote:
> >> In the netdev datapath, packets arriving over a tunnel are processed by
> >> more than one flow. For example a packet that should be decapsulated and
> >> forwarded to another port is processed by two flows:
> >> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
> >>     actions:tnl_pop(vxlan_sys_4789).
> >> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
> >>     inner-header matches, actions: vm1.
> >>
> >> In order to offload such a multi-flow processing path, a multi-table HW
> >> model is used.
> > Did you explore other ways to avoid this multi-flow processing model ?
> > For example, may be support some additional logic to merge the two
> > datapath flow-rules for decap into a single flow-rule in the offload
> > layer ?
> Yes, we did explore this approach and our research lead us to conclude
> that following the OVS multi-table SW model in HW proves to be a more
> robust architecture.
>
[Hemal] It will be good and prudent for you to share details of that
research to help the community understand how you arrived at the
conclusion above. Can you share it?

> Single table architecture for tunnel offloads, for example, has the
> following issues:
> 1. Statistics – flow #1 counts bytes including outer header. If the
> outer is IPv6, it can have extensions, to be counted by the packet. Flow
> #2 counts the bytes of the inner packet only.
>
[Hemal] For fixed-format tunnels, you can calculate the byte counts of
flow #1 from the flow #2 counts.
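
As a worked example of this calculation for a fixed-format case (VXLAN
over IPv4 with no VLAN and no IP options), the outer overhead is a
constant 50 bytes per packet, so the outer byte count can be derived as
sketched below (illustrative only):

/* Illustrative only: derive flow #1 (outer) byte counts from the
 * flow #2 (inner) counters for a fixed-format VXLAN/IPv4 tunnel. */
#include <stdint.h>

#define VXLAN_IPV4_OVERHEAD (14 /* outer Ethernet */ + \
                             20 /* outer IPv4 */ + \
                              8 /* outer UDP */ + \
                              8 /* VXLAN */)    /* = 50 bytes */

static inline uint64_t
outer_bytes_from_inner(uint64_t inner_bytes, uint64_t packets)
{
    return inner_bytes + packets * VXLAN_IPV4_OVERHEAD;
}

This constant-overhead assumption is exactly what breaks for IPv6
outers, VXLAN over VLAN and Geneve, as raised later in the thread.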

> 2. Update rate – parsing multi-table to single one means cartesian
> product of all the flows. Any change of a single flow in one table means
> a lot of removals and insertions of rules in HW.
>
[Hemal] Can you quantify "a lot" with specific examples? Flow #2 can't be
offloaded without flow #1 being offloaded. It is less likely that flow #1
will change as frequently as flow #2. If flow #1 is updated (tunnel change)
and flow #2 uses fields from the tunnel headers that are changed in
flow #1, then you anyway have to update flow #2. This is no different from
the software case when a tunnel is changed.

> The single table model also breaks when offloading more complex actions
> such as CT, dp_hash and others.
>
[Hemal] Can you be more specific about what it breaks?

> >
> >> Flow #1 matches are offloaded as is, but tnl_pop is
> >> translated to MARK and JUMP to a tnl-port table actions.
> > The response to tnl_pop action is now different between non-offload
> > and offload datapaths, since offload layer changes it into entirely
> > different
> > actions. The user would expect consistent response to tnl_pop action,
> > especially if you treat this flow rule as an independent rule by
> > itself (i.e,
> > assuming there's no flow #2 subsequently).
> >
> > Regarding the JUMP action:
> > IMO, offload framework should avoid anything that is too vendor
> > specific; and make use of simple/common abstractions. I'm not sure if
> > JUMP action is really needed to support offload of tnl_pop action.
> > Also, it may not map to the real use case for which JUMP action seems
> > to have been defined, which is to support flow-grouping (see
> > rte_flow.h).
> The rte_flow API does not offer an equivalent action to moving outer
> headers to some metadata buffer, like the SW tnl_pop action. The JUMP
> action is equivalent to recirculate on the vport, and the MARK action is
> to make sure the user will have a consistent response in SW/HW (again,
> sticking to the SW model), so in case of a miss the packet is recovered
> to continue SW processing normally.
>
[Hemal] This seems like a gap in the rte_flow API mapping of the OVS SW
tnl_pop action. Should we look into extending the rte_flow API to cover
this case rather than changing the use case for JUMP?

> >
> >> Flow #2 is
> >> offloaded to that tnl-port table, its tunnel matches are translated to
> >> outer-header matches, followed by the inner-header ones. The actions
> >> start with an implicit DECAP action followed by the flow's actions.
> >> The DECAP is done in flow #2. If it was done in flow #1, the outer
> >> headers of the packets would have been lost, and they could not have
> >> been matched in flow #2.
> > The only missing match field that would allow decap on flow #1 itself,
> > seems to be the tunnel-id. If that is provided in flow #1, then we may
> > not need to match it again in flow #2. And flow #2 could just match on
> > the inner packet headers. This would also make handling the flow-miss
> > simple, since you wouldn't need the recovery logic (offload layer
> > popping the header etc). The packet would be in the right state for
> > the next stage of processing in SW, if there's a  flow-miss after the
> > first stage in HW. It also means tnl_pop can be done in flow #1,
> > making the behavior consistent with non-offload processing.
> >
> > The other advantage of providing the tunnel-id in flow #1 is that,
> > then we wouldn't need to match on the entire packet including the
> > outer-headers in the second stage and only the inner header fields
> > would need to be matched in flow #2.
> Even if we could somehow get the matches from flow #2 to flow #1
> (leading back to single flow model…) and do the DECAP in flow #1, then
> we would have to change the SW matches in flow #2 in case it is
> processed in SW as it won’t have the outer header/tunnel anymore.
>
[Hemal] Flow matches are defined by some policy. How is the changing of
match fields of flow #2 an issue? Can you elaborate? Aren't these matches
defined by a user or an agent?

> >
> >> Furthermore, once the HW processing path is composed of a multi table
> >> processing, there is a need to handle a case where the processing starts
> >> in HW, but is not completed in HW. For example, only flow #1 is
> >> offloaded while flow #2 doesn't exist. For that, the MARK value set in
> >> the tnl_pop action is received in SW, the packet is recovered by the
> >> mapped vport, and processing continues in SW.
> > If it is not possible to provide tunnel-id in flow #1, then the other
> > alternative might be to defer offloading until flow #2 and offload
> > only flow #2. You are anyway matching almost the entire outer-headers
> > and the inner-header in flow #2 and the decap action is also added
> > implicitly to flow #2. Why not totally avoid offloading flow #1 ? That
> > way, none of the miss handling code is required.
> See the first argument about following the SW model.
> >
> >> As vports are virtual and cannot have rte_flows, offloading a vport
> >> generates rte_flows for every PF in the system, as we cannot know from
> >> which one the packets arrive from.
> > Is it possible to track the "original in-port" through this two-stage
> > flow processing and generate rte_flow only on the specific PF that
> > received the
> > packet ?
> We could “track” it only if we tried to merge both flow #1 and flow #2.
> See the first argument about following the SW model.
> >
> > Thanks,
> > -Harsha
> >> This series adds support for IPv6, tunnel push actions, and the
> >> multi-table model for the decap processing.
> >>
> >> v1 Travis passed:
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftravis-ci.org%2Felibritstein%2FOVS%2Fbuilds%2F639484233&amp;data=02%7C01%7Celibr%40mellanox.com%7Cec94952030bb4352740308d7a8cb42ac%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637163463282513495&amp;sdata=hDni3f9CGRr6FvUwLKlBDtlLNfrpHm5p%2Ffe%2FWzdCZuE%3D&amp;reserved=0
> >>
> >> Eli Britstein (14):
> >>    dpif-netdev: Add mega ufid in flow add log
> >>    netdev-offload-dpdk: Remove pre-validate of patterns function
> >>    netdev-offload-dpdk: Add IPv6 pattern matching
> >>    netdev-offload-dpdk: Support offload of clone tnl_push/output actions
> >>    netdev-offload-dpdk: Support tnl/push using vxlan encap attribute
> >>    netdev-offload: Don't use zero flow mark
> >>    netdev-offload-dpdk: Introduce map APIs for table id and miss context
> >>    netdev-offload-dpdk: Implement HW miss packet recover for vport
> >>    netdev-offload-dpdk: Refactor disassociate ufid function
> >>    netdev-offload-dpdk: Support tunnel pop action
> >>    netdev-offload-dpdk: Implement flow dump create/destroy APIs
> >>    netdev-dpdk: Getter function for dpdk devargs API
> >>    netdev-dpdk: Introduce get netdev by devargs function
> >>    netdev-dpdk-offload: Add vxlan pattern matching function
> >>
> >> Ilya Maximets (4):
> >>    netdev: Allow storing dpif type into netdev structure.
> >>    netdev-offload: Use dpif type instead of class.
> >>    netdev-offload: Allow offloading to netdev without ifindex.
> >>    netdev-offload: Disallow offloading to unrelated tunneling vports.
> >>
> >> Ophir Munk (6):
> >>    dpif-netdev: Make mark allocation public API
> >>    dpif-netdev: Add HW miss packet state recover logic
> >>    netdev-dpdk: Introduce an API to query if a dpdk port is an uplink
> >>      port
> >>    netdev-dpdk-offload: Infrastructure for multiple rte_flows per UFID
> >>    netdev-dpdk: Add flow_api support for netdev vxlan vports
> >>    netdev-offload-dpdk: Support offload for vxlan vport
> >>
> >> Oz Shlomo (1):
> >>    netdev-offload: Add HW miss packet state recover API
> >>
> >>   Documentation/howto/dpdk.rst  |    5 +-
> >>   NEWS                          |    6 +-
> >>   lib/dpif-netdev.c             |   70 +--
> >>   lib/dpif-netlink.c            |   23 +-
> >>   lib/dpif.c                    |   21 +-
> >>   lib/netdev-dpdk.c             |   68 +++
> >>   lib/netdev-dpdk.h             |    6 +
> >>   lib/netdev-offload-dpdk.c     | 1275
> +++++++++++++++++++++++++++++++++++------
> >>   lib/netdev-offload-provider.h |    6 +
> >>   lib/netdev-offload-tc.c       |   11 +-
> >>   lib/netdev-offload.c          |  115 ++--
> >>   lib/netdev-offload.h          |   23 +-
> >>   lib/netdev-provider.h         |    3 +-
> >>   lib/netdev.c                  |   16 +
> >>   lib/netdev.h                  |    2 +
> >>   ofproto/ofproto-dpif-upcall.c |    5 +-
> >>   tests/dpif-netdev.at          |   14 +-
> >>   tests/ofproto-macros.at       |    3 +-
> >>   18 files changed, 1394 insertions(+), 278 deletions(-)
> >>
> >> --
> >> 2.14.5
> >>
> >> _______________________________________________
> >> dev mailing list
> >> dev@openvswitch.org
> >>
> https://eur03.safelinks.protection.outlook.com/?url=https%3A%2F%2Fmail.openvswitch.org%2Fmailman%2Flistinfo%2Fovs-dev&amp;data=02%7C01%7Celibr%40mellanox.com%7Cec94952030bb4352740308d7a8cb42ac%7Ca652971c7d2e4d9ba6a4d149256f461b%7C0%7C0%7C637163463282513495&amp;sdata=TkLL3cBksB6tKghQVnqHuxxPlK5dxYuLoJHxko9DNIc%3D&amp;reserved=0
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Eli Britstein Feb. 18, 2020, 9:59 a.m. UTC | #5
On 2/10/2020 11:16 PM, Hemal Shah wrote:
> Eli,
>
> There are some fundamental architecture issues (multi HW tables vs. 
> single HW table, use of rte_flow JUMP action for mapping ovs sw 
> tnl_pop action, etc.) that we need to discuss before we can get into 
> the details of the patchset.
> I'm inlining my comments on the discussion between you and Harsha below.
>
> Hemal
>
>
> On Wed, Feb 5, 2020 at 6:31 AM Eli Britstein <elibr@mellanox.com 
> <mailto:elibr@mellanox.com>> wrote:
>
>
>     On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
>     > Hi Eli,
>     >
>     > Thanks for sending this patchset. I have some questions about the
>     > design, please see my comments below.
>     >
>     > Thanks,
>     > -Harsha
>     >
>     > On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein
>     <elibr@mellanox.com <mailto:elibr@mellanox.com>> wrote:
>     >> In the netdev datapath, packets arriving over a tunnel are
>     processed by
>     >> more than one flow. For example a packet that should be
>     decapsulated and
>     >> forwarded to another port is processed by two flows:
>     >> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
>     >>     actions:tnl_pop(vxlan_sys_4789).
>     >> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
>     >>     inner-header matches, actions: vm1.
>     >>
>     >> In order to offload such a multi-flow processing path, a
>     multi-table HW
>     >> model is used.
>     > Did you explore other ways to avoid this multi-flow processing
>     model ?
>     > For example, may be support some additional logic to merge the two
>     > datapath flow-rules for decap into a single flow-rule in the offload
>     > layer ?
>     Yes, we did explore this approach and our research lead us to
>     conclude
>     that following the OVS multi-table SW model in HW proves to be a more
>     robust architecture.
>
> [Hemal] It will be good and prudent for you to share details of that 
> research to help the community understand how you arrive to the 
> conclusion above. Can you share it?
The HW offload scheme should follow the SW model. It is not only about
the current VXLAN patch set, but rather an infrastructure for offloading
more actions. Once HW offloading does not follow the SW model, it becomes
difficult or impractical to support them. Squashing multi-table flows
into single rules might work, but only for specific use cases. VXLAN with
a fixed-size outer header is one of them; other use cases are not.
There are actions, like DP_HASH, that perform calculations on the 
packets, to be used in the next recirc.
For example:
recirc_id(0),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), 
packets:26, bytes:1924, used:0.105s, flags:S, 
actions:hash(sym_l4(0)),recirc(0x5)

recirc_id(0x5),dp_hash(0x315c2571/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), 
packets:0, bytes:0, used:never, actions:,3
recirc_id(0x5),dp_hash(0x315ca562/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no), 
packets:0, bytes:0, used:never, actions:,2
..
4 rules

1000000 TCP/UDP sessions will not require more data plane flows.
In a single table, every separate 5-tuple has to be matched, so 1000000
sessions will create 1000000 data plane flows.
>
>     Single table architecture for tunnel offloads, for example, has the
>     following issues:
>     1. Statistics – flow #1 counts bytes including outer header. If the
>     outer is IPv6, it can have extensions, to be counted by the
>     packet. Flow
>     #2 counts the bytes of the inner packet only.
>
> [Hemal]  For fixed format tunnels, you can calculate count bytes of 
> flow #1 from the flow #2 counts.
For fixed-format tunnels, yes. However, this cannot be done for
variable-size header protocols such as IPv6, VXLAN over VLAN, Geneve and
others.
>
>     2. Update rate – parsing multi-table to single one means cartesian
>     product of all the flows. Any change of a single flow in one table
>     means
>     a lot of removals and insertions of rules in HW.
>
> [Hemal] Can you quantify "a lot" with specific examples? flow #2 can't 
> be offloaded without flow #1 being offloaded. It is less likely that 
> flow #1 will change as frequently as flow #2. If flow #1 is updated 
> (tunnel change) and flow #2 is using the fields from tunnel headers 
> that are changed in flow #1, then you have to anyways update flow #2. 
> This is no different from the software case when a tunnel is changed.
This is correct for the VXLAN use case.
>
>     The single table model also breaks when offloading more complex
>     actions
>     such as CT, dp_hash and others.
>
> [Hemal] Can you be more specific about what it breaks?
The Cartesian product of rules can scale to a very large number of
single-table rules.
See the DP_HASH example above.

>     >
>     >> Flow #1 matches are offloaded as is, but tnl_pop is
>     >> translated to MARK and JUMP to a tnl-port table actions.
>     > The response to tnl_pop action is now different between non-offload
>     > and offload datapaths, since offload layer changes it into entirely
>     > different
>     > actions. The user would expect consistent response to tnl_pop
>     action,
>     > especially if you treat this flow rule as an independent rule by
>     > itself (i.e,
>     > assuming there's no flow #2 subsequently).
>     >
>     > Regarding the JUMP action:
>     > IMO, offload framework should avoid anything that is too vendor
>     > specific; and make use of simple/common abstractions. I'm not
>     sure if
>     > JUMP action is really needed to support offload of tnl_pop action.
>     > Also, it may not map to the real use case for which JUMP action
>     seems
>     > to have been defined, which is to support flow-grouping (see
>     > rte_flow.h).
>     The rte_flow API does not offer an equivalent action to moving outer
>     headers to some metadata buffer, like the SW tnl_pop action. The JUMP
>     action is equivalent to recirculate on the vport, and the MARK
>     action is
>     to make sure the user will have a consistent response in SW/HW
>     (again,
>     sticking to the SW model), so in case of a miss the packet is
>     recovered
>     to continue SW processing normally.
>
> [Hemal] This seems like a gap in rte_flow API mapping of ovs sw 
> tnl_pop action. Should we look into extending rte_flow API to cover 
> this case rather than changing the use case for JUMP?
It is not only the rte_flow API. It would also require providing the
tunnel info that was parsed by the hardware to the software via the
mbuf, so that may need to change as well. Furthermore, the HW should
also support this kind of action, so it gets complicated.
>
>     >
>     >> Flow #2 is
>     >> offloaded to that tnl-port table, its tunnel matches are
>     translated to
>     >> outer-header matches, followed by the inner-header ones. The
>     actions
>     >> start with an implicit DECAP action followed by the flow's actions.
>     >> The DECAP is done in flow #2. If it was done in flow #1, the outer
>     >> headers of the packets would have been lost, and they could not
>     have
>     >> been matched in flow #2.
>     > The only missing match field that would allow decap on flow #1
>     itself,
>     > seems to be the tunnel-id. If that is provided in flow #1, then
>     we may
>     > not need to match it again in flow #2. And flow #2 could just
>     match on
>     > the inner packet headers. This would also make handling the
>     flow-miss
>     > simple, since you wouldn't need the recovery logic (offload layer
>     > popping the header etc). The packet would be in the right state for
>     > the next stage of processing in SW, if there's a flow-miss after the
>     > first stage in HW. It also means tnl_pop can be done in flow #1,
>     > making the behavior consistent with non-offload processing.
>     >
>     > The other advantage of providing the tunnel-id in flow #1 is that,
>     > then we wouldn't need to match on the entire packet including the
>     > outer-headers in the second stage and only the inner header fields
>     > would need to be matched in flow #2.
>     Even if we could somehow get the matches from flow #2 to flow #1
>     (leading back to single flow model…) and do the DECAP in flow #1,
>     then
>     we would have to change the SW matches in flow #2 in case it is
>     processed in SW as it won’t have the outer header/tunnel anymore.
>
> [Hemal] Flow matches are defined by some policy. How is the changing 
> of match fields of flow #2 an issue? Can you elaborate? Aren't these 
> matches defined by a user or an agent?
Flow #2 contains matches on the tunnel (outer headers). If those matches
were moved to flow #1 in HW together with a DECAP action, then when
flow #2 is processed in SW those matches would have to be omitted if
flow #1 was processed in HW (they were already matched and the outer
headers removed by the DECAP), but kept if flow #1 was processed in SW.
Sriharsha Basavapatna March 2, 2020, 4:25 a.m. UTC | #6
On Tue, Feb 18, 2020 at 3:30 PM Eli Britstein <elibr@mellanox.com> wrote:
>
>
> On 2/10/2020 11:16 PM, Hemal Shah wrote:
> > Eli,
> >
> > There are some fundamental architecture issues (multi HW tables vs.
> > single HW table, use of rte_flow JUMP action for mapping ovs sw
> > tnl_pop action, etc.) that we need to discuss before we can get into
> > the details of the patchset.
> > I'm inlining my comments on the discussion between you and Harsha below.
> >
> > Hemal
> >
> >
> > On Wed, Feb 5, 2020 at 6:31 AM Eli Britstein <elibr@mellanox.com
> > <mailto:elibr@mellanox.com>> wrote:
> >
> >
> >     On 2/3/2020 7:05 PM, Sriharsha Basavapatna wrote:
> >     > Hi Eli,
> >     >
> >     > Thanks for sending this patchset. I have some questions about the
> >     > design, please see my comments below.
> >     >
> >     > Thanks,
> >     > -Harsha
> >     >
> >     > On Mon, Jan 20, 2020 at 8:39 PM Eli Britstein
> >     <elibr@mellanox.com <mailto:elibr@mellanox.com>> wrote:
> >     >> In the netdev datapath, packets arriving over a tunnel are
> >     processed by
> >     >> more than one flow. For example a packet that should be
> >     decapsulated and
> >     >> forwarded to another port is processed by two flows:
> >     >> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
> >     >>     actions:tnl_pop(vxlan_sys_4789).
> >     >> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
> >     >>     inner-header matches, actions: vm1.
> >     >>
> >     >> In order to offload such a multi-flow processing path, a
> >     multi-table HW
> >     >> model is used.
> >     > Did you explore other ways to avoid this multi-flow processing
> >     model ?
> >     > For example, may be support some additional logic to merge the two
> >     > datapath flow-rules for decap into a single flow-rule in the offload
> >     > layer ?
> >     Yes, we did explore this approach and our research lead us to
> >     conclude
> >     that following the OVS multi-table SW model in HW proves to be a more
> >     robust architecture.
> >
> > [Hemal] It will be good and prudent for you to share details of that
> > research to help the community understand how you arrive to the
> > conclusion above. Can you share it?
> The HW offload scheme should follow the SW model. It is not only about
> current VXLAN patch-set, but rather an infrastructure for offloading
> more actions. Once HW offloading does not follow the SW model it becomes
> difficult or not practical to support them. Squashing multi-table flows
> into single rules might work but only for specific use cases. VXLAN with
> fixed sized outer header is one of them, but not with other use cases.
> There are actions, like DP_HASH, that perform calculations on the
> packets, to be used in the next recirc.
> For example:
> recirc_id(0),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),
> packets:26, bytes:1924, used:0.105s, flags:S,
> actions:hash(sym_l4(0)),recirc(0x5)
>
> recirc_id(0x5),dp_hash(0x315c2571/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),
> packets:0, bytes:0, used:never, actions:,3
> recirc_id(0x5),dp_hash(0x315ca562/0x3),in_port(2),packet_type(ns=0,id=0),eth_type(0x0800),ipv4(proto=6,frag=no),
> packets:0, bytes:0, used:never, actions:,2
> ..
> 4 rules
>
> 1000000 TCP/UDP sessions will not require more data plane flows.
> With a single table, every distinct 5-tuple must be matched, so 1000000
> sessions would create 1000000 data plane flows.

[Harsha] I understand your concerns and the requirements from different
use cases (like dp-hash) that might need multi-table flows. But for
VXLAN decap offload, we are questioning the need for multi-table. We
are looking for the design options you considered to support the VXLAN
use case with a single table. Can you please provide specific details on
the challenges?
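
For instance, if the HW could match the outer headers, the VNI and the
inner headers in one rule, a merged rule might look roughly like the
sketch below. This is purely illustrative and not code from this
patchset: the port ids, the VNI and the omitted matches are
placeholders, and a real rule forwarding to another port would likely
also need a transfer attribute.

/* Hypothetical single-table offload of flows #1 + #2 merged. */
#include <rte_flow.h>

static struct rte_flow *
offload_merged_vxlan_decap(uint16_t pf_port_id, uint32_t vm_port_id,
                           struct rte_flow_error *error)
{
    struct rte_flow_attr attr = { .ingress = 1 };

    /* Outer UDP dst port 4789 and the VNI; the other outer/inner
     * matches are omitted here for brevity. */
    struct rte_flow_item_udp udp_spec = { .hdr.dst_port = RTE_BE16(4789) };
    struct rte_flow_item_udp udp_mask = { .hdr.dst_port = RTE_BE16(0xffff) };
    struct rte_flow_item_vxlan vxlan_spec = { .vni = { 0, 0, 100 } };
    struct rte_flow_item_vxlan vxlan_mask = { .vni = { 0xff, 0xff, 0xff } };

    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },    /* outer L2 */
        { .type = RTE_FLOW_ITEM_TYPE_IPV4 },   /* outer L3 */
        { .type = RTE_FLOW_ITEM_TYPE_UDP,
          .spec = &udp_spec, .mask = &udp_mask },
        { .type = RTE_FLOW_ITEM_TYPE_VXLAN,
          .spec = &vxlan_spec, .mask = &vxlan_mask },
        { .type = RTE_FLOW_ITEM_TYPE_ETH },    /* inner L2 */
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    struct rte_flow_action_port_id vm = { .id = vm_port_id };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },  /* decap in one rule */
        { .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &vm },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    return rte_flow_create(pf_port_id, &attr, pattern, actions, error);
}

Is there something fundamental that makes a rule of this shape
unworkable?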

> >
> >     Single table architecture for tunnel offloads, for example, has the
> >     following issues:
> >     1. Statistics – flow #1 counts bytes including the outer header. If
> >     the outer header is IPv6, it can have extensions, which must be
> >     counted per packet. Flow #2 counts the bytes of the inner packet only.
> >
> > [Hemal] For fixed-format tunnels, you can calculate the byte counts of
> > flow #1 from the flow #2 counts.
> For fixed-format tunnels, yes. However, this cannot be done for
> variable-size header protocols such as IPv6, VXLAN over VLAN, Geneve,
> and others.
> >

[Harsha] For variable-size headers, maybe we could do some
approximations, and the trade-off is reduced code complexity. In the
long term we want to explore the single-table option for fixed headers;
meanwhile, we can continue to work with the proposed two-table model.

> >     2. Update rate – collapsing multiple tables into a single one means
> >     a cartesian product of all the flows. Any change to a single flow in
> >     one table means many removals and insertions of rules in HW.
> >
> > [Hemal] Can you quantify "a lot" with specific examples? Flow #2 can't
> > be offloaded without flow #1 being offloaded. Flow #1 is unlikely to
> > change as frequently as flow #2. If flow #1 is updated (tunnel change)
> > and flow #2 uses fields from the tunnel headers that are changed in
> > flow #1, then you have to update flow #2 anyway. This is no different
> > from the software case when a tunnel is changed.
> This is correct for the VXLAN use case.

[Harsha] Ok, so update rate is not really a concern for supporting a
single table for the VXLAN use case.

> >
> >     The single-table model also breaks when offloading more complex
> >     actions such as CT, dp_hash, and others.
> >
> > [Hemal] Can you be more specific about what it breaks?
> The cartesian product of rules can scale to a very large number of
> single-table rules.
> See the DP_HASH example above.

[Harsha] The offload design currently being discussed in this
patch-set is for VXLAN.  We can certainly discuss other features like
CT and dp-hash in their own context at the appropriate time. Like we said
earlier, we are not saying there's no use case for multi-table, just
that we are questioning its relevance to the VXLAN use case.

>
> >     >
> >     >> Flow #1 matches are offloaded as is, but tnl_pop is
> >     >> translated to MARK and JUMP to a tnl-port table actions.
> >     > The response to tnl_pop action is now different between non-offload
> >     > and offload datapaths, since offload layer changes it into entirely
> >     > different
> >     > actions. The user would expect consistent response to tnl_pop
> >     action,
> >     > especially if you treat this flow rule as an independent rule by
> >     > itself (i.e,
> >     > assuming there's no flow #2 subsequently).
> >     >
> >     > Regarding the JUMP action:
> >     > IMO, offload framework should avoid anything that is too vendor
> >     > specific; and make use of simple/common abstractions. I'm not
> >     sure if
> >     > JUMP action is really needed to support offload of tnl_pop action.
> >     > Also, it may not map to the real use case for which JUMP action
> >     seems
> >     > to have been defined, which is to support flow-grouping (see
> >     > rte_flow.h).
> >     The rte_flow API does not offer an equivalent action to moving outer
> >     headers to some metadata buffer, like the SW tnl_pop action. The JUMP
> >     action is equivalent to recirculate on the vport, and the MARK
> >     action is
> >     to make sure the user will have a consistent response in SW/HW
> >     (again,
> >     sticking to the SW model), so in case of a miss the packet is
> >     recovered
> >     to continue SW processing normally.
> >
> > [Hemal] This seems like a gap in rte_flow API mapping of ovs sw
> > tnl_pop action. Should we look into extending rte_flow API to cover
> > this case rather than changing the use case for JUMP?
> It is not only the rte_flow API. It would also require providing the
> tunnel info that was parsed by the hardware to the software, via the
> mbuf, so the mbuf may need to change as well. Furthermore, the HW
> would also have to support this kind of action, so it gets complicated.

[Harsha] Ok.
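
Just to make sure I read the proposed mapping correctly, the two rules
would carry actions roughly like the following sketch. The group id,
mark value and port id below are illustrative placeholders, not values
taken from the patches:

#include <rte_flow.h>

/* Flow #1, group 0: outer-header matches (not shown); tnl_pop is
 * translated to MARK + JUMP to the tnl-port table. */
static const struct rte_flow_action_mark mark = { .id = 0x2a };
static const struct rte_flow_action_jump jump = { .group = 1 };
static const struct rte_flow_action flow1_actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
    { .type = RTE_FLOW_ACTION_TYPE_JUMP, .conf = &jump },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

/* Flow #2, installed with attr.group = 1 (the tnl-port table): tunnel
 * matches translated to outer-header matches (not shown); the actions
 * start with the implicit decap, then the flow's own actions. */
static const struct rte_flow_action_port_id vm1 = { .id = 3 };  /* example */
static const struct rte_flow_action flow2_actions[] = {
    { .type = RTE_FLOW_ACTION_TYPE_VXLAN_DECAP },
    { .type = RTE_FLOW_ACTION_TYPE_PORT_ID, .conf = &vm1 },
    { .type = RTE_FLOW_ACTION_TYPE_END },
};

And on a miss after group 0, the MARK value is what SW uses to recover
the packet to the vport, as you describe above.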

> >
> >     >
> >     >> Flow #2 is
> >     >> offloaded to that tnl-port table, its tunnel matches are
> >     translated to
> >     >> outer-header matches, followed by the inner-header ones. The
> >     actions
> >     >> start with an implicit DECAP action followed by the flow's actions.
> >     >> The DECAP is done in flow #2. If it was done in flow #1, the outer
> >     >> headers of the packets would have been lost, and they could not
> >     have
> >     >> been matched in flow #2.
> >     > The only missing match field that would allow decap on flow #1
> >     itself,
> >     > seems to be the tunnel-id. If that is provided in flow #1, then
> >     we may
> >     > not need to match it again in flow #2. And flow #2 could just
> >     match on
> >     > the inner packet headers. This would also make handling the
> >     flow-miss
> >     > simple, since you wouldn't need the recovery logic (offload layer
> >     > popping the header etc). The packet would be in the right state for
> >     > the next stage of processing in SW, if there's a flow-miss after the
> >     > first stage in HW. It also means tnl_pop can be done in flow #1,
> >     > making the behavior consistent with non-offload processing.
> >     >
> >     > The other advantage of providing the tunnel-id in flow #1 is that,
> >     > then we wouldn't need to match on the entire packet including the
> >     > outer-headers in the second stage and only the inner header fields
> >     > would need to be matched in flow #2.
> >     Even if we could somehow move the matches from flow #2 into flow #1
> >     (leading back to the single-flow model…) and do the DECAP in flow #1,
> >     we would then have to change the SW matches in flow #2 in case it is
> >     processed in SW, as it won’t have the outer header/tunnel anymore.
> >
> > [Hemal] Flow matches are defined by some policy. How is the changing
> > of match fields of flow #2 an issue? Can you elaborate? Aren't these
> > matches defined by a user or an agent?
> Flow #2 has matches on the tunnel (outer headers). If those matches were
> moved to flow #1 in HW together with a DECAP action, then when flow #2 is
> processed in SW its tunnel matches would have to be omitted if flow #1 was
> handled in HW (they are already matched and the headers are gone after the
> DECAP), but kept if flow #1 was handled in SW.

[Harsha] The current offload proposal introduces the group-id in the
match, which is outside of the actual OVS (SW) flow match. Have you
thought about a solution that does not require a group-id? For the
long term, we should explore a unified flow #2 approach where a
group-id is not needed; any thoughts on that?

Thanks,
-Harsha




> >
> >     >
> >     >> Furthermore, once the HW processing path is composed of a multi
> >     table
> >     >> processing, there is a need to handle a case where the
> >     processing starts
> >     >> in HW, but is not completed in HW. For example, only flow #1 is
> >     >> offloaded while flow #2 doesn't exist. For that, the MARK value
> >     set in
> >     >> the tnl_pop action is received in SW, the packet is recovered
> >     by the
> >     >> mapped vport, and processing continues in SW.
> >     > If it is not possible to provide tunnel-id in flow #1, then the
> >     other
> >     > alternative might be to defer offloading until flow #2 and offload
> >     > only flow #2. You are anyway matching almost the entire
> >     outer-headers
> >     > and the inner-header in flow #2 and the decap action is also added
> >     > implicitly to flow #2. Why not totally avoid offloading flow #1
> >     ? That
> >     > way, none of the miss handling code is required.
> >     See the first argument about following the SW model.
> >     >
> >     >> As vports are virtual and cannot have rte_flows, offloading a vport
> >     >> generates rte_flows for every PF in the system, as we cannot
> >     know from
> >     >> which one the packets arrive from.
> >     > Is it possible to track the "original in-port" through this
> >     two-stage
> >     > flow processing and generate rte_flow only on the specific PF that
> >     > received the
> >     > packet ?
> >     We could “track” it only if we tried to merge both flow #1 and
> >     flow #2.
> >     See the first argument about following the SW model.
> >     >
> >     > Thanks,
> >     > -Harsha
> >     >> [...]
> >
Pravin Shelar March 5, 2020, 7:12 p.m. UTC | #7
On Sun, Mar 1, 2020 at 8:25 PM Sriharsha Basavapatna via dev
<ovs-dev@openvswitch.org> wrote:
>
> On Tue, Feb 18, 2020 at 3:30 PM Eli Britstein <elibr@mellanox.com> wrote:
> > [...]
>
> [Harsha] I understand your concerns and the requirement from different
> use cases (like dp-hash) that might need multitable flows. But for
> VXLAN decap offload, we are questioning the need for multitable. We
> are looking for design options that you considered to support VXLAN
> use case with single-table. Can you please provide specific details on
> the challenges ?
>
I think we need to align offload with the software datapath. So if you
want to explore single-table tunnel offload processing, you need to
show how it would work in software. That would eliminate the mapping of
two software tables to a single hardware table.

Thanks.
William Tu July 8, 2020, 11:36 p.m. UTC | #8
On Thu, Mar 5, 2020 at 11:12 AM Pravin Shelar <pshelar@ovn.org> wrote:
>
> On Sun, Mar 1, 2020 at 8:25 PM Sriharsha Basavapatna via dev
> <ovs-dev@openvswitch.org> wrote:
> >
> > On Tue, Feb 18, 2020 at 3:30 PM Eli Britstein <elibr@mellanox.com> wrote:
> > > [...]
> >
> > [Harsha] I understand your concerns and the requirement from different
> > use cases (like dp-hash) that might need multitable flows. But for
> > VXLAN decap offload, we are questioning the need for multitable. We
> > are looking for design options that you considered to support VXLAN
> > use case with single-table. Can you please provide specific details on
> > the challenges ?
> >
> I think we need to align offload with software datapath. So if you
> want to explore single table tunnel offload processing you need to
> show how would it work in software. That will eliminate the mapping of
> two software tables to single hardware table.
>
I thought about merging multiple flows into one in order to use a single
table. I think it's pretty hard to do correctly, given the cartesian
product of all the flows and the need to revalidate them.

As an example for vxlan decap:
FLOW1: recirc_id(0),in_port(3),packet_type(ns=0,id=0),eth(src=fa:4d:c4:81:25:a9,dst=f6:95:3f:d2:ea:42),eth_type(0x0800),ipv4(dst=172.31.1.100,proto=17,frag=no),udp(dst=4789),
packets:6, bytes:812, used:0.007s, actions:tnl_pop(4)
innerFLOW1: tunnel(tun_id=0x0,src=172.31.1.1,dst=172.31.1.100,flags(-df+csum+key)),recirc_id(0),in_port(4),packet_type(ns=0,id=0),eth(src=aa:0f:92:44:46:df,dst=b2:e3:8d:97:f0:4e),eth_type(0x0800),ipv4(dst=10.1.1.100,proto=1,frag=no),
packets:2, bytes:196, used:0.007s, actions:1
innerFLOW2:
tunnel(tun_id=0x0,src=172.31.1.1,dst=172.31.1.100,flags(-df+csum+key)),recirc_id(0),in_port(4),packet_type(ns=0,id=0),eth(src=aa:0f:92:44:46:df,dst=33:33:ff:44:46:df),eth_type(0x86dd),ipv6(frag=no),
packets:0, bytes:0, used:never, actions:1

Basically we have to merge and create
a. FLOW1 + innerFLOW1
b. FLOW1 + innerFLOW2
and for the inner flows, we usually do connection tracking, which
introduces another recirc.
So it's almost impossible to do everything in a single table.
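
To make the blow-up concrete, here is a toy sketch (not OVS code) of
what a merge would have to produce; the flow strings are just the
example flows above, abbreviated:

/* Toy illustration: merging every tnl_pop flow with every inner flow
 * yields n_outer * n_inner single-table rules, and any change to either
 * side invalidates all rules derived from it. */
#include <stdio.h>

struct toy_flow { const char *match; const char *action; };

int main(void)
{
    struct toy_flow outer[] = {
        { "in_port(3),udp(dst=4789)", "tnl_pop(4)" },
    };
    struct toy_flow inner[] = {
        { "tunnel(...),ipv4(dst=10.1.1.100)", "output:1" },
        { "tunnel(...),ipv6(frag=no)",        "output:1" },
    };
    int n_outer = sizeof outer / sizeof outer[0];
    int n_inner = sizeof inner / sizeof inner[0];

    /* Cartesian product of outer and inner flows. */
    for (int i = 0; i < n_outer; i++) {
        for (int j = 0; j < n_inner; j++) {
            printf("merged: %s + %s -> %s\n",
                   outer[i].match, inner[j].match, inner[j].action);
        }
    }
    printf("%d merged rules\n", n_outer * n_inner);
    return 0;
}

With conntrack on the inner flows, each extra recirc multiplies this
product again.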

At this moment, I couldn't think of any better way than translating
tnl_pop to jump and group.
William
William Tu July 16, 2020, 12:30 a.m. UTC | #9
Hi Eli,
I'm having another question, thanks in advance!

On Mon, Jan 20, 2020 at 7:09 AM Eli Britstein <elibr@mellanox.com> wrote:
>
> In the netdev datapath, packets arriving over a tunnel are processed by
> more than one flow. For example a packet that should be decapsulated and
> forwarded to another port is processed by two flows:
> 1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
>    actions:tnl_pop(vxlan_sys_4789).
> 2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
>    inner-header matches, actions: vm1.
>
> In order to offload such a multi-flow processing path, a multi-table HW
> model is used. Flow #1 matches are offloaded as is, but tnl_pop is
> translated to MARK and JUMP to a tnl-port table actions. Flow #2 is
> offloaded to that tnl-port table, its tunnel matches are translated to
> outer-header matches, followed by the inner-header ones. The actions
> start with an implicit DECAP action followed by the flow's actions.
> The DECAP is done in flow #2. If it was done in flow #1, the outer
> headers of the packets would have been lost, and they could not have
> been matched in flow #2.

How come there is no such problem for the OVS kernel datapath using tc offload?
In the kernel datapath, the tunnel decap also translates to FLOW#1 and FLOW#2:
1. in_port=<PF>, outer-header matches, dst-port=4789 (VXLAN),
    actions= output: vxlan_sys_4789 (which implicitly does tnl_pop).
2. in_port=vxlan_sys_4789, tunnel matches (VNI for example),
    inner-header matches, actions: vm1.

But when I read through "Offloading VXLAN Encapsulation/Decapsulation
Actions" in
https://docs.mellanox.com/m/view-rendered-page.action?abstractPageId=25138327

It seems that assigning the outer IP to the PF and using the PF as the
tunnel endpoint solves the problem.
So there is no need to translate tnl_pop to mark and jump.
Can't we use the same idea here for the userspace datapath?

Thank you
William