
[ovs-dev,V7,06/13] dpif-netdev: Add HW miss packet state recover logic.

Message ID 20210623155252.10975-7-elibr@nvidia.com
State Accepted
Series Netdev vxlan-decap offload

Commit Message

Eli Britstein June 23, 2021, 3:52 p.m. UTC
Recover the packet if it was partially processed by the HW. Fallback to
lookup flow by mark association.

Signed-off-by: Eli Britstein <elibr@nvidia.com>
Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
Tested-by: Emma Finn <emma.finn@intel.com>
Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
---
 lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 41 insertions(+), 4 deletions(-)

Comments

0-day Robot June 23, 2021, 4:10 p.m. UTC | #1
Bleep bloop.  Greetings Eli Britstein, I am a robot and I have tried out your patch.
Thanks for your contribution.

I encountered some error that I wasn't expecting.  See the details below.


checkpatch:
WARNING: Unexpected sign-offs from developers who are not authors or co-authors or committers: Ilya Maximets <i.maximets@ovn.org>
Lines checked: 98, Warnings: 1, Errors: 0


Please check this out.  If you feel there has been an error, please email aconole@redhat.com

Thanks,
0-day Robot
Ferriter, Cian June 25, 2021, 11:04 a.m. UTC | #2
Hi Eli,

I have some concerns about how we plug the packet state recover logic into dfc_processing() here. My concerns are inline below.

I'm concerned that we are hurting performance in the partial HWOL case, as this patchset is introducing new direct (non-inline) function calls per packet to the software datapath. We can reduce performance impact in this area, see specific suggestions below.

Thanks,
Cian

> -----Original Message-----
> From: Eli Britstein <elibr@nvidia.com>
> Sent: Wednesday 23 June 2021 16:53
> To: dev@openvswitch.org; Ilya Maximets <i.maximets@ovn.org>
> Cc: Gaetan Rivet <gaetanr@nvidia.com>; Majd Dibbiny <majd@nvidia.com>; Sriharsha Basavapatna
> <sriharsha.basavapatna@broadcom.com>; Hemal Shah <hemal.shah@broadcom.com>; Ivan Malov
> <Ivan.Malov@oktetlabs.ru>; Ferriter, Cian <cian.ferriter@intel.com>; Eli Britstein <elibr@nvidia.com>;
> Finn, Emma <emma.finn@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
> 
> Recover the packet if it was partially processed by the HW. Fallback to
> lookup flow by mark association.
> 
> Signed-off-by: Eli Britstein <elibr@nvidia.com>
> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
> Tested-by: Emma Finn <emma.finn@intel.com>
> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> ---
>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 41 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index 8fa7eb6d4..d5b7d5b73 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
> 
>  /* Protects against changes to 'dp_netdevs'. */
>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>  }
> 
> +static struct tx_port * pmd_send_port_cache_lookup(
> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> +
> +static inline int
> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
> +                  odp_port_t port_no,
> +                  struct dp_packet *packet,
> +                  struct dp_netdev_flow **flow)
> +{
> +    struct tx_port *p;
> +    uint32_t mark;
> +


Start of full HWOL recovery block

> +    /* Restore the packet if HW processing was terminated before completion. */
> +    p = pmd_send_port_cache_lookup(pmd, port_no);
> +    if (OVS_LIKELY(p)) {
> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
> +
> +        if (err && err != EOPNOTSUPP) {
> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
> +            return -1;
> +        }
> +    }

End of full HWOL recovery block

IIUC, the above is only relevant to full HWOL where the packet is partially processed by the HW. In a partial HWOL testcase, we see a performance drop as a result of the above code. The performance after the patchset is applied is 0.94x what the performance was before.

We should branch over this code in the partial HWOL scenario, where we don't need the calls to pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover(). We want this branch to be transparent to the user. Since both full and partial HWOL fall under the other_config:hw-offload=true switch, we might need a configure-time NIC-capability check, or something similar, to branch over the full HWOL packet recovery code. Does this make sense?


perf top showing cycles spent per function in my partial HWOL scenario. We can see netdev_hw_miss_packet_recover(), netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() taking cycles:
    28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
    13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
    11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
    10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
     7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
     4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
     4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
     3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
     3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
     2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
     1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
     0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
     0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
     0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
     0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover

> +
> +    /* If no mark, no flow to find. */
> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
> +        *flow = NULL;
> +        return 0;
> +    }
> +
> +    *flow = mark_to_flow_find(pmd, mark);
> +    return 0;
> +}
> +
>  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
> 
>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>          struct dp_netdev_flow *flow;
> -        uint32_t mark;
> 
>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>              dp_packet_delete(packet);
> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>              pkt_metadata_init(&packet->md, port_no);
>          }
> 
> -        if ((*recirc_depth_get() == 0) &&
> -            dp_packet_has_flow_mark(packet, &mark)) {
> -            flow = mark_to_flow_find(pmd, mark);
> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {

Here we have a per packet call to netdev_is_flow_api_enabled(). I think that netdev_is_flow_api_enabled() should be inlined if it's going to be called per packet. We can see from the above "perf top" that it isn't inlined since it shows up as a separate function.

> +            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port_no, packet, &flow))) {
> +                /* Packet restoration failed and it was dropped, do not
> +                 * continue processing.
> +                 */
> +                continue;
> +            }
>              if (OVS_LIKELY(flow)) {
>                  tcp_flags = parse_tcp_flags(packet);
>                  if (OVS_LIKELY(batch_enable)) {
> --
> 2.28.0.2311.g225365fb51
Ilya Maximets June 25, 2021, 11:28 a.m. UTC | #3
On 6/25/21 1:04 PM, Ferriter, Cian wrote:
> Hi Eli,
> 
> I have some concerns about how we plug the packet state recover logic into dfc_processing() here. My concerns are inline below.
> 
> I'm concerned that we are hurting performance in the partial HWOL case, as this patchset is introducing new direct (non-inline) function calls per packet to the software datapath. We can reduce performance impact in this area, see specific suggestions below.
> 
> Thanks,
> Cian
> 
>> -----Original Message-----
>> From: Eli Britstein <elibr@nvidia.com>
>> Sent: Wednesday 23 June 2021 16:53
>> To: dev@openvswitch.org; Ilya Maximets <i.maximets@ovn.org>
>> Cc: Gaetan Rivet <gaetanr@nvidia.com>; Majd Dibbiny <majd@nvidia.com>; Sriharsha Basavapatna
>> <sriharsha.basavapatna@broadcom.com>; Hemal Shah <hemal.shah@broadcom.com>; Ivan Malov
>> <Ivan.Malov@oktetlabs.ru>; Ferriter, Cian <cian.ferriter@intel.com>; Eli Britstein <elibr@nvidia.com>;
>> Finn, Emma <emma.finn@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
>> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
>>
>> Recover the packet if it was partially processed by the HW. Fallback to
>> lookup flow by mark association.
>>
>> Signed-off-by: Eli Britstein <elibr@nvidia.com>
>> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
>> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
>> Tested-by: Emma Finn <emma.finn@intel.com>
>> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
>> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>> ---
>>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>>  1 file changed, 41 insertions(+), 4 deletions(-)
>>
>> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>> index 8fa7eb6d4..d5b7d5b73 100644
>> --- a/lib/dpif-netdev.c
>> +++ b/lib/dpif-netdev.c
>> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
>> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
>>
>>  /* Protects against changes to 'dp_netdevs'. */
>>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
>> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>>  }
>>
>> +static struct tx_port * pmd_send_port_cache_lookup(
>> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>> +
>> +static inline int
>> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>> +                  odp_port_t port_no,
>> +                  struct dp_packet *packet,
>> +                  struct dp_netdev_flow **flow)
>> +{
>> +    struct tx_port *p;
>> +    uint32_t mark;
>> +
> 
> 
> Start of full HWOL recovery block
> 
>> +    /* Restore the packet if HW processing was terminated before completion. */
>> +    p = pmd_send_port_cache_lookup(pmd, port_no);
>> +    if (OVS_LIKELY(p)) {
>> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
>> +
>> +        if (err && err != EOPNOTSUPP) {
>> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
>> +            return -1;
>> +        }
>> +    }
> 
> End of full HWOL recovery block
> 
> IIUC, the above is only relevant to full HWOL where the packet is partially processed by the HW. In a partial HWOL testcase, we see a performance drop as a result of the above code. The performance after the patchset is applied is 0.94x what the performance was before.

While reviewing the patch set I noticed this part too, but
this code was tested twice by Intel engineers, so I figured
that it doesn't hurt performance of partial offloading.

In general, it should be easy to re-order partial and full
offloading checks like this (didn't test):

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index c5ab35d2a..36a5976f2 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -7113,6 +7113,14 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
     struct tx_port *p;
     uint32_t mark;
 
+    /* If no mark, no flow to find. */
+    if (dp_packet_has_flow_mark(packet, &mark)) {
+        *flow = mark_to_flow_find(pmd, mark);
+        return 0;
+    }
+
+    *flow = NULL;
+
     /* Restore the packet if HW processing was terminated before completion. */
     p = pmd_send_port_cache_lookup(pmd, port_no);
     if (OVS_LIKELY(p)) {
@@ -7123,14 +7131,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
             return -1;
         }
     }
-
-    /* If no mark, no flow to find. */
-    if (!dp_packet_has_flow_mark(packet, &mark)) {
-        *flow = NULL;
-        return 0;
-    }
-
-    *flow = mark_to_flow_find(pmd, mark);
     return 0;
 }
---

If everyone agrees, I can do this change.

> 
> We should branch over this code in the partial HWOL scenario, where we don't need to get the call to pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover(). We want this branch to be transparent to the user. Since both full and partial HWOL falls under the other_config:hw-offload=true switch, we might need a configure time check NIC capabilities solution or something similar to branch over full HWOL packet recovery code. Does this make sense?

Compile time check of capabilities doesn't make sense as it's unknown
in vast majority of cases on which HW the package will run.  Code should
be as generic as possible.

> 
> 
> perf top showing cycles spent per function in my partial HWOL scenario. We can see netdev_hw_miss_packet_recover(), netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() taking cycles:
>     28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
>     13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
>     11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
>     10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
>      7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
>      4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
>      4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
>      3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
>      3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
>      2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
>      1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
>      0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>      0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>      0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
>      0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover
> 
>> +
>> +    /* If no mark, no flow to find. */
>> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
>> +        *flow = NULL;
>> +        return 0;
>> +    }
>> +
>> +    *flow = mark_to_flow_find(pmd, mark);
>> +    return 0;
>> +}
>> +
>>  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
>>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>>   * miniflow is copied into 'keys' and the packet pointer is moved at the
>> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>
>>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>>          struct dp_netdev_flow *flow;
>> -        uint32_t mark;
>>
>>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>>              dp_packet_delete(packet);
>> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>              pkt_metadata_init(&packet->md, port_no);
>>          }
>>
>> -        if ((*recirc_depth_get() == 0) &&
>> -            dp_packet_has_flow_mark(packet, &mark)) {
>> -            flow = mark_to_flow_find(pmd, mark);
>> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {
> 
> Here we have a per packet call to netdev_is_flow_api_enabled(). I think that netdev_is_flow_api_enabled() should be inlined if it's going to be called per packet. We can see from the above "perf top" that it isn't inlined since it shows up as a separate function.

I'd consider "inlining" and moving a lot of stuff to headers harmful
for the code base as it makes it less readable and it's really hard
to preserve this kind of things during code modification.
It's much better to fix the logic instead of hammering the code with
blind low level optimizations.

For this particular issue we already have a solution here:
  https://patchwork.ozlabs.org/project/openvswitch/patch/c0b99b4b9be6c290a6aa3cb00049fac4ebfac5d8.1618390390.git.bnemeth@redhat.com/
In short, we only need to check once per batch.  I think Balazs
will be able to rebase his patch on top of this series, including
the check for netdev_is_flow_api_enabled().

Balazs, will you?

Best regards, Ilya Maximets.
Balazs Nemeth June 25, 2021, 12:09 p.m. UTC | #4
Ilya,

Sure, I'll rebase my patch and also ensure netdev_is_flow_api_enabled()
is called once per batch. I don't expect any issue with that.

Regards,
Balazs

On Fri, Jun 25, 2021 at 2:00 PM Ilya Maximets <i.maximets@ovn.org> wrote:

> On 6/25/21 1:04 PM, Ferriter, Cian wrote:
> > Hi Eli,
> >
> > I have some concerns about how we plug the packet state recover logic into dfc_processing() here. My concerns are inline below.
> >
> > I'm concerned that we are hurting performance in the partial HWOL case, as this patchset is introducing new direct (non-inline) function calls per packet to the software datapath. We can reduce performance impact in this area, see specific suggestions below.
> >
> > Thanks,
> > Cian
> >
> >> -----Original Message-----
> >> From: Eli Britstein <elibr@nvidia.com>
> >> Sent: Wednesday 23 June 2021 16:53
> >> To: dev@openvswitch.org; Ilya Maximets <i.maximets@ovn.org>
> >> Cc: Gaetan Rivet <gaetanr@nvidia.com>; Majd Dibbiny <majd@nvidia.com>; Sriharsha Basavapatna
> >> <sriharsha.basavapatna@broadcom.com>; Hemal Shah <hemal.shah@broadcom.com>; Ivan Malov
> >> <Ivan.Malov@oktetlabs.ru>; Ferriter, Cian <cian.ferriter@intel.com>; Eli Britstein <elibr@nvidia.com>;
> >> Finn, Emma <emma.finn@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
> >> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
> >>
> >> Recover the packet if it was partially processed by the HW. Fallback to
> >> lookup flow by mark association.
> >>
> >> Signed-off-by: Eli Britstein <elibr@nvidia.com>
> >> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
> >> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
> >> Tested-by: Emma Finn <emma.finn@intel.com>
> >> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
> >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
> >> ---
> >>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
> >>  1 file changed, 41 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> >> index 8fa7eb6d4..d5b7d5b73 100644
> >> --- a/lib/dpif-netdev.c
> >> +++ b/lib/dpif-netdev.c
> >> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
> >>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
> >>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
> >>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
> >> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
> >>
> >>  /* Protects against changes to 'dp_netdevs'. */
> >>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
> >> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
> >>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
> >>  }
> >>
> >> +static struct tx_port * pmd_send_port_cache_lookup(
> >> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
> >> +
> >> +static inline int
> >> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
> >> +                  odp_port_t port_no,
> >> +                  struct dp_packet *packet,
> >> +                  struct dp_netdev_flow **flow)
> >> +{
> >> +    struct tx_port *p;
> >> +    uint32_t mark;
> >> +
> >
> >
> > Start of full HWOL recovery block
> >
> >> +    /* Restore the packet if HW processing was terminated before completion. */
> >> +    p = pmd_send_port_cache_lookup(pmd, port_no);
> >> +    if (OVS_LIKELY(p)) {
> >> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
> >> +
> >> +        if (err && err != EOPNOTSUPP) {
> >> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
> >> +            return -1;
> >> +        }
> >> +    }
> >
> > End of full HWOL recovery block
> >
> > IIUC, the above is only relevant to full HWOL where the packet is partially processed by the HW. In a partial HWOL testcase, we see a performance drop as a result of the above code. The performance after the patchset is applied is 0.94x what the performance was before.
>
> While reviewing the patch set I noticed this part too, but
> this code was tested twice by Intel engineers, so I figured
> that it doesn't hurt performance of partial offloading.
>
> In general, it should be easy to re-order partial and full
> offloading checks like this (didn't test):
>
> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
> index c5ab35d2a..36a5976f2 100644
> --- a/lib/dpif-netdev.c
> +++ b/lib/dpif-netdev.c
> @@ -7113,6 +7113,14 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>      struct tx_port *p;
>      uint32_t mark;
>
> +    /* If no mark, no flow to find. */
> +    if (dp_packet_has_flow_mark(packet, &mark)) {
> +        *flow = mark_to_flow_find(pmd, mark);
> +        return 0;
> +    }
> +
> +    *flow = NULL;
> +
>      /* Restore the packet if HW processing was terminated before completion. */
>      p = pmd_send_port_cache_lookup(pmd, port_no);
>      if (OVS_LIKELY(p)) {
> @@ -7123,14 +7131,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>              return -1;
>          }
>      }
> -
> -    /* If no mark, no flow to find. */
> -    if (!dp_packet_has_flow_mark(packet, &mark)) {
> -        *flow = NULL;
> -        return 0;
> -    }
> -
> -    *flow = mark_to_flow_find(pmd, mark);
>      return 0;
>  }
> ---
>
> If everyone agrees, I can do this change.
>
> >
> > We should branch over this code in the partial HWOL scenario, where we don't need to get the call to pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover(). We want this branch to be transparent to the user. Since both full and partial HWOL falls under the other_config:hw-offload=true switch, we might need a configure time check NIC capabilities solution or something similar to branch over full HWOL packet recovery code. Does this make sense?
>
> Compile time check of capabilities doesn't make sense as it's unknown
> in vast majority of cases on which HW the package will run.  Code should
> be as generic as possible.
>
> >
> >
> > perf top showing cycles spent per function in my partial HWOL scenario. We can see netdev_hw_miss_packet_recover(), netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() taking cycles:
> >     28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
> >     13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
> >     11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
> >     10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
> >      7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
> >      4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
> >      4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
> >      3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
> >      3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
> >      2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
> >      1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
> >      0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
> >      0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
> >      0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
> >      0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover
> >
> >> +
> >> +    /* If no mark, no flow to find. */
> >> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
> >> +        *flow = NULL;
> >> +        return 0;
> >> +    }
> >> +
> >> +    *flow = mark_to_flow_find(pmd, mark);
> >> +    return 0;
> >> +}
> >> +
> >>  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
> >>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
> >>   * miniflow is copied into 'keys' and the packet pointer is moved at the
> >> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
> >>
> >>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
> >>          struct dp_netdev_flow *flow;
> >> -        uint32_t mark;
> >>
> >>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
> >>              dp_packet_delete(packet);
> >> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
> >>              pkt_metadata_init(&packet->md, port_no);
> >>          }
> >>
> >> -        if ((*recirc_depth_get() == 0) &&
> >> -            dp_packet_has_flow_mark(packet, &mark)) {
> >> -            flow = mark_to_flow_find(pmd, mark);
> >> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {
> >
> > Here we have a per packet call to netdev_is_flow_api_enabled(). I think that netdev_is_flow_api_enabled() should be inlined if it's going to be called per packet. We can see from the above "perf top" that it isn't inlined since it shows up as a separate function.
>
> I'd consider "inlining" and moving a lot of stuff to headers harmful
> for the code base as it makes it less readable and it's really hard
> to preserve this kind of things during code modification.
> It's much better to fix the logic instead of hammering the code with
> blind low level optimizations.
>
> For this particular issue we already have a solution here:
>
> https://patchwork.ozlabs.org/project/openvswitch/patch/c0b99b4b9be6c290a6aa3cb00049fac4ebfac5d8.1618390390.git.bnemeth@redhat.com/
> In short, we only need to check once per batch.  I think, Balazs
> will be able to re-base his patch on top of this series including
> the check for netdev_is_flow_api_enabled().
>
> Balazs, will you?
>
> Best regards, Ilya Maximets.
>
>
Eli Britstein June 25, 2021, 1:56 p.m. UTC | #5
On 6/25/2021 3:09 PM, Balazs Nemeth wrote:
>
> Ilya,
>
> Sure, I'll rebase my patch and also ensure netdev_is_flow_api_enabled()
> is called once per batch. I don't expect any issue with that.

Thanks Balazs.

>
> Regards,
> Balazs
>
> On Fri, Jun 25, 2021 at 2:00 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
>     On 6/25/21 1:04 PM, Ferriter, Cian wrote:
>     > Hi Eli,
>     >
>     > I have some concerns about how we plug the packet state recover logic into dfc_processing() here. My concerns are inline below.
>     >
>     > I'm concerned that we are hurting performance in the partial HWOL case, as this patchset is introducing new direct (non-inline) function calls per packet to the software datapath. We can reduce performance impact in this area, see specific suggestions below.
>     >
>     > Thanks,
>     > Cian
>     >
>     >> -----Original Message-----
>     >> From: Eli Britstein <elibr@nvidia.com>
>     >> Sent: Wednesday 23 June 2021 16:53
>     >> To: dev@openvswitch.org; Ilya Maximets <i.maximets@ovn.org>
>     >> Cc: Gaetan Rivet <gaetanr@nvidia.com>; Majd Dibbiny <majd@nvidia.com>; Sriharsha Basavapatna
>     >> <sriharsha.basavapatna@broadcom.com>; Hemal Shah <hemal.shah@broadcom.com>; Ivan Malov
>     >> <Ivan.Malov@oktetlabs.ru>; Ferriter, Cian <cian.ferriter@intel.com>; Eli Britstein <elibr@nvidia.com>;
>     >> Finn, Emma <emma.finn@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
>     >> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
>     >>
>     >> Recover the packet if it was partially processed by the HW. Fallback to
>     >> lookup flow by mark association.
>     >>
>     >> Signed-off-by: Eli Britstein <elibr@nvidia.com>
>     >> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
>     >> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
>     >> Tested-by: Emma Finn <emma.finn@intel.com>
>     >> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
>     >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>     >> ---
>     >>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>     >>  1 file changed, 41 insertions(+), 4 deletions(-)
>     >>
>     >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>     >> index 8fa7eb6d4..d5b7d5b73 100644
>     >> --- a/lib/dpif-netdev.c
>     >> +++ b/lib/dpif-netdev.c
>     >> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>     >>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>     >>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>     >>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
>     >> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
>     >>
>     >>  /* Protects against changes to 'dp_netdevs'. */
>     >>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
>     >> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>     >>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>     >>  }
>     >>
>     >> +static struct tx_port * pmd_send_port_cache_lookup(
>     >> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>     >> +
>     >> +static inline int
>     >> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>     >> +                  odp_port_t port_no,
>     >> +                  struct dp_packet *packet,
>     >> +                  struct dp_netdev_flow **flow)
>     >> +{
>     >> +    struct tx_port *p;
>     >> +    uint32_t mark;
>     >> +
>     >
>     >
>     > Start of full HWOL recovery block
>     >
>     >> +    /* Restore the packet if HW processing was terminated before completion. */
>     >> +    p = pmd_send_port_cache_lookup(pmd, port_no);
>     >> +    if (OVS_LIKELY(p)) {
>     >> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
>     >> +
>     >> +        if (err && err != EOPNOTSUPP) {
>     >> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
>     >> +            return -1;
>     >> +        }
>     >> +    }
>     >
>     > End of full HWOL recovery block
>     >
>     > IIUC, the above is only relevant to full HWOL where the packet is partially processed by the HW. In a partial HWOL testcase, we see a performance drop as a result of the above code. The performance after the patchset is applied is 0.94x what the performance was before.
>
Generally speaking, adding new code in the datapath is likely to have 
some degradation effect that cannot be avoided completely.

I think performance optimizations for the partial offloads (or for the 
SW datapath in general, even w/o offloads) can be done on top, like the 
one suggested by Balazs above.


>     While reviewing the patch set I noticed this part too, but
>     this code was tested twice by Intel engineers, so I figured
>     that it doesn't hurt performance of partial offloading.
>
>     In general, it should be easy to re-order partial and full
>     offloading checks like this (didn't test):
>
>     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>     index c5ab35d2a..36a5976f2 100644
>     --- a/lib/dpif-netdev.c
>     +++ b/lib/dpif-netdev.c
>     @@ -7113,6 +7113,14 @@ dp_netdev_hw_flow(const struct
>     dp_netdev_pmd_thread *pmd,
>          struct tx_port *p;
>          uint32_t mark;
>
>     +    /* If no mark, no flow to find. */
>     +    if (dp_packet_has_flow_mark(packet, &mark)) {
>     +        *flow = mark_to_flow_find(pmd, mark);
>     +        return 0;
>     +    }
>     +
>     +    *flow = NULL;
>     +
>          /* Restore the packet if HW processing was terminated before
>     completion. */
>          p = pmd_send_port_cache_lookup(pmd, port_no);
>          if (OVS_LIKELY(p)) {
>     @@ -7123,14 +7131,6 @@ dp_netdev_hw_flow(const struct
>     dp_netdev_pmd_thread *pmd,
>                  return -1;
>              }
>          }
>     -
>     -    /* If no mark, no flow to find. */
>     -    if (!dp_packet_has_flow_mark(packet, &mark)) {
>     -        *flow = NULL;
>     -        return 0;
>     -    }
>     -
>     -    *flow = mark_to_flow_find(pmd, mark);
>          return 0;
>      }
>     ---
>
>     If everyone agrees, I can do this change.
>
No, this is wrong.

The mlx5 PMD uses a mark (set internally) for the recovery. Doing it like 
this would discover a mark (the internal one set by the PMD), but won't 
find any flow associated with it.
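The ordering dependency can be sketched with a minimal standalone model. All `sim_*` names and the internal-mark convention below are illustrative assumptions, not actual OVS or mlx5 PMD code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed convention for this sketch: marks at or above this base are
 * set internally by the PMD for miss recovery and map to no OVS flow. */
#define SIM_INTERNAL_MARK_BASE 0x80000000u

struct sim_packet {
    bool has_mark;
    uint32_t mark;
    bool restored;      /* Outer headers restored after a partial miss. */
};

/* Stand-in for netdev_hw_miss_packet_recover(): restores the packet
 * and consumes the PMD-internal mark in the process. */
static void
sim_recover(struct sim_packet *pkt)
{
    if (pkt->has_mark && pkt->mark >= SIM_INTERNAL_MARK_BASE) {
        pkt->has_mark = false;
        pkt->restored = true;
    }
}

/* Patch order: recover first, then check for a flow mark. */
static bool
sim_recover_then_lookup(struct sim_packet *pkt)
{
    sim_recover(pkt);
    return pkt->has_mark;   /* Only real flow marks survive recovery. */
}

/* Reordered variant from the review: returns early on any mark, so a
 * packet carrying only the PMD-internal mark is never recovered. */
static bool
sim_lookup_then_recover(struct sim_packet *pkt)
{
    if (pkt->has_mark) {
        return true;        /* "Found" the internal mark: no flow will
                             * match it, and sim_recover() is skipped. */
    }
    sim_recover(pkt);
    return false;
}
```

With a packet carrying only the internal mark, the patch order leaves it restored and reports no flow mark, while the reordered variant skips recovery entirely.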

>
>     >
>     > We should branch over this code in the partial HWOL scenario,
>     where we don't need to get the call to
>     pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover().
>     We want this branch to be transparent to the user. Since both full
>     and partial HWOL falls under the other_config:hw-offload=true
>     switch, we might need a configure time check NIC capabilities
>     solution or something similar to branch over full HWOL packet
>     recovery code. Does this make sense?
>
>     A compile-time check of capabilities doesn't make sense, as in the
>     vast majority of cases it's unknown which HW the package will run
>     on. Code should be as generic as possible.
>
>     >
>     >
>     > perf top showing cycles spent per function in my partial HWOL
>     scenario. We can see netdev_hw_miss_packet_recover(),
>     netdev_offload_dpdk_hw_miss_packet_recover() and
>     netdev_is_flow_api_enabled() taking cycles:
>     >     28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
>     >     13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
>     >     11.07%  pmd-c01/id:8  ovs-vswitchd        [.]
>     i40e_recv_pkts_vec_avx2
>     >     10.94%  pmd-c01/id:8  ovs-vswitchd        [.]
>     i40e_xmit_fixed_burst_vec_avx2
>     >      7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
>     >      4.94%  pmd-c01/id:8  ovs-vswitchd        [.]
>     netdev_hw_miss_packet_recover
>     >      4.77%  pmd-c01/id:8  ovs-vswitchd        [.]
>     dp_execute_output_action
>     >      3.81%  pmd-c01/id:8  ovs-vswitchd        [.]
>     dp_netdev_pmd_flush_output_on_port
>     >      3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
>     >      2.49%  pmd-c01/id:8  ovs-vswitchd        [.]
>     netdev_dpdk_eth_send
>     >      1.16%  pmd-c01/id:8  ovs-vswitchd        [.]
>     netdev_dpdk_rxq_recv
>     >      0.90%  pmd-c01/id:8  ovs-vswitchd        [.]
>     pmd_perf_end_iteration
>     >      0.75%  pmd-c01/id:8  ovs-vswitchd        [.]
>     dp_netdev_process_rxq_port
>     >      0.68%  pmd-c01/id:8  ovs-vswitchd        [.]
>     netdev_is_flow_api_enabled
>     >      0.55%  pmd-c01/id:8  ovs-vswitchd        [.]
>     netdev_offload_dpdk_hw_miss_packet_recover
>     >
>     >> +
>     >> +    /* If no mark, no flow to find. */
>     >> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
>     >> +        *flow = NULL;
>     >> +        return 0;
>     >> +    }
>     >> +
>     >> +    *flow = mark_to_flow_find(pmd, mark);
>     >> +    return 0;
>     >> +}
>     >> +
>     >>  /* Try to process all ('cnt') the 'packets' using only the
>     datapath flow cache
>     >>   * 'pmd->flow_cache'. If a flow is not found for a packet
>     'packets[i]', the
>     >>   * miniflow is copied into 'keys' and the packet pointer is
>     moved at the
>     >> @@ -7106,7 +7140,6 @@ dfc_processing(struct
>     dp_netdev_pmd_thread *pmd,
>     >>
>     >>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>     >>          struct dp_netdev_flow *flow;
>     >> -        uint32_t mark;
>     >>
>     >>          if (OVS_UNLIKELY(dp_packet_size(packet) <
>     ETH_HEADER_LEN)) {
>     >>              dp_packet_delete(packet);
>     >> @@ -7125,9 +7158,13 @@ dfc_processing(struct
>     dp_netdev_pmd_thread *pmd,
>     >>              pkt_metadata_init(&packet->md, port_no);
>     >>          }
>     >>
>     >> -        if ((*recirc_depth_get() == 0) &&
>     >> -            dp_packet_has_flow_mark(packet, &mark)) {
>     >> -            flow = mark_to_flow_find(pmd, mark);
>     >> +        if (netdev_is_flow_api_enabled() &&
>     *recirc_depth_get() == 0) {
>     >
>     > Here we have a per packet call to netdev_is_flow_api_enabled().
>     I think that netdev_is_flow_api_enabled() should be inlined if
>     it's going to be called per packet. We can see from the above
>     "perf top" that it isn't inlined since it shows up as a separate
>     function.
>
>     I'd consider "inlining" and moving a lot of stuff to headers harmful
>     for the code base as it makes it less readable and it's really hard
>     to preserve these kinds of things during code modification.
>     It's much better to fix the logic instead of hammering the code with
>     blind low level optimizations.
>
>     For this particular issue we already have a solution here:
>     https://patchwork.ozlabs.org/project/openvswitch/patch/c0b99b4b9be6c290a6aa3cb00049fac4ebfac5d8.1618390390.git.bnemeth@redhat.com/
>     In short, we only need to check once per batch.  I think, Balazs
>     will be able to re-base his patch on top of this series including
>     the check for netdev_is_flow_api_enabled().
>
>     Balazs, will you?
>
>     Best regards, Ilya Maximets.
>
Ilya Maximets June 25, 2021, 3:25 p.m. UTC | #6
On 6/25/21 3:56 PM, Eli Britstein wrote:
> 
> On 6/25/2021 3:09 PM, Balazs Nemeth wrote:
>> *External email: Use caution opening links or attachments*
>>
>>
>> Ilya,
>>
>> Sure, I'll rebase my patch and also ensure netdev_is_flow_api_enabled()
>> is called once per batch. I don't expect any issue with that.

Thanks, Balazs!  The series is applied now, so you can work on the updated patch.

> 
> Thanks Balazs.
> 
>>
>> Regards,
>> Balazs
>>
>> On Fri, Jun 25, 2021 at 2:00 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>>
>>     On 6/25/21 1:04 PM, Ferriter, Cian wrote:
>>     > Hi Eli,
>>     >
>>     > I have some concerns about how we plug the packet state recover logic into dfc_processing() here. My concerns are inline below.
>>     >
>>     > I'm concerned that we are hurting performance in the partial HWOL case, as this patchset is introducing new direct (non-inline) function calls per packet to the software datapath. We can reduce performance impact in this area, see specific suggestions below.
>>     >
>>     > Thanks,
>>     > Cian
>>     >
>>     >> -----Original Message-----
>>     >> From: Eli Britstein <elibr@nvidia.com>
>>     >> Sent: Wednesday 23 June 2021 16:53
>>     >> To: dev@openvswitch.org; Ilya Maximets <i.maximets@ovn.org>
>>     >> Cc: Gaetan Rivet <gaetanr@nvidia.com>; Majd Dibbiny <majd@nvidia.com>; Sriharsha Basavapatna
>>     >> <sriharsha.basavapatna@broadcom.com>; Hemal Shah <hemal.shah@broadcom.com>; Ivan Malov
>>     >> <Ivan.Malov@oktetlabs.ru>; Ferriter, Cian <cian.ferriter@intel.com>; Eli Britstein <elibr@nvidia.com>;
>>     >> Finn, Emma <emma.finn@intel.com>; Kovacevic, Marko <marko.kovacevic@intel.com>
>>     >> Subject: [PATCH V7 06/13] dpif-netdev: Add HW miss packet state recover logic.
>>     >>
>>     >> Recover the packet if it was partially processed by the HW. Fallback to
>>     >> lookup flow by mark association.
>>     >>
>>     >> Signed-off-by: Eli Britstein <elibr@nvidia.com>
>>     >> Reviewed-by: Gaetan Rivet <gaetanr@nvidia.com>
>>     >> Acked-by: Sriharsha Basavapatna <sriharsha.basavapatna@broadcom.com>
>>     >> Tested-by: Emma Finn <emma.finn@intel.com>
>>     >> Tested-by: Marko Kovacevic <marko.kovacevic@intel.com>
>>     >> Signed-off-by: Ilya Maximets <i.maximets@ovn.org>
>>     >> ---
>>     >>  lib/dpif-netdev.c | 45 +++++++++++++++++++++++++++++++++++++++++----
>>     >>  1 file changed, 41 insertions(+), 4 deletions(-)
>>     >>
>>     >> diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>     >> index 8fa7eb6d4..d5b7d5b73 100644
>>     >> --- a/lib/dpif-netdev.c
>>     >> +++ b/lib/dpif-netdev.c
>>     >> @@ -114,6 +114,7 @@ COVERAGE_DEFINE(datapath_drop_invalid_port);
>>     >>  COVERAGE_DEFINE(datapath_drop_invalid_bond);
>>     >>  COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
>>     >>  COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
>>     >> +COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
>>     >>
>>     >>  /* Protects against changes to 'dp_netdevs'. */
>>     >>  static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
>>     >> @@ -7062,6 +7063,39 @@ smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
>>     >>      pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
>>     >>  }
>>     >>
>>     >> +static struct tx_port * pmd_send_port_cache_lookup(
>>     >> +    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
>>     >> +
>>     >> +static inline int
>>     >> +dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>     >> +                  odp_port_t port_no,
>>     >> +                  struct dp_packet *packet,
>>     >> +                  struct dp_netdev_flow **flow)
>>     >> +{
>>     >> +    struct tx_port *p;
>>     >> +    uint32_t mark;
>>     >> +
>>     >
>>     >
>>     > Start of full HWOL recovery block
>>     >
>>     >> +    /* Restore the packet if HW processing was terminated before completion. */
>>     >> +    p = pmd_send_port_cache_lookup(pmd, port_no);
>>     >> +    if (OVS_LIKELY(p)) {
>>     >> +        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
>>     >> +
>>     >> +        if (err && err != EOPNOTSUPP) {
>>     >> +            COVERAGE_INC(datapath_drop_hw_miss_recover);
>>     >> +            return -1;
>>     >> +        }
>>     >> +    }
>>     >
>>     > End of full HWOL recovery block
>>     >
>>     > IIUC, the above is only relevant to full HWOL where the packet is partially processed by the HW. In a partial HWOL testcase, we see a performance drop as a result of the above code. The performance after the patchset is applied is 0.94x what the performance was before.
>>
> Generally speaking, adding new code in the datapath is likely to have some degradation effect that cannot be avoided completely.
> 
> I think performance optimizations for the partial offloads (or for the SW datapath in general, even w/o offloads) can be done on top, like the one suggested by Balazs above.

+1

Anyway, I applied the v7 as-is.  Performance optimizations could be done on
top of it.

> 
> 
>>     While reviewing the patch set I noticed this part too, but
>>     this code was tested twice by Intel engineers, so I figured
>>     that it doesn't hurt performance of partial offloading.
>>
>>     In general, it should be easy to re-order partial and full
>>     offloading checks like this (didn't test):
>>
>>     diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
>>     index c5ab35d2a..36a5976f2 100644
>>     --- a/lib/dpif-netdev.c
>>     +++ b/lib/dpif-netdev.c
>>     @@ -7113,6 +7113,14 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>          struct tx_port *p;
>>          uint32_t mark;
>>
>>     +    /* If no mark, no flow to find. */
>>     +    if (dp_packet_has_flow_mark(packet, &mark)) {
>>     +        *flow = mark_to_flow_find(pmd, mark);
>>     +        return 0;
>>     +    }
>>     +
>>     +    *flow = NULL;
>>     +
>>          /* Restore the packet if HW processing was terminated before completion. */
>>          p = pmd_send_port_cache_lookup(pmd, port_no);
>>          if (OVS_LIKELY(p)) {
>>     @@ -7123,14 +7131,6 @@ dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
>>                  return -1;
>>              }
>>          }
>>     -
>>     -    /* If no mark, no flow to find. */
>>     -    if (!dp_packet_has_flow_mark(packet, &mark)) {
>>     -        *flow = NULL;
>>     -        return 0;
>>     -    }
>>     -
>>     -    *flow = mark_to_flow_find(pmd, mark);
>>          return 0;
>>      }
>>     ---
>>
>>     If everyone agrees, I can do this change.
>>
> No, this is wrong.
> 
> The mlx5 PMD uses a mark (set internally) for the recovery. Doing it like this would discover a mark (the internal one set by the PMD), but won't find any flow associated with it.

Oh, and netdev_offload_dpdk_hw_miss_packet_recover() resets offloads for
that reason, right?

Thanks for the explanation.  I think it would be great if you could prepare
a separate patch with some comments in netdev_offload_dpdk_hw_miss_packet_recover()
regarding the offloads reset and, maybe, a small comment in dp_netdev_hw_flow()
so we don't re-order the checks in the future (it's not obvious that there is a
dependency).  Something like:

"Flow marks can be used for HW miss recovery and could be reset in the
process; therefore, the flow mark check should always be done after
netdev_hw_miss_packet_recover()."

What do you think?
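Alongside the ordering note, the recovery error-handling contract may also be worth documenting: in isolation it behaves like the minimal sketch below, where only a hard error drops the packet, while EOPNOTSUPP just means the netdev has no recovery support and processing continues. The `sim_*` name is illustrative, not an OVS API:

```c
#include <assert.h>
#include <errno.h>

/* Mirrors the checks in dp_netdev_hw_flow(): 'recover_err' stands in
 * for the netdev_hw_miss_packet_recover() return value.  0 (recovered)
 * and EOPNOTSUPP (no recovery support) fall through to the flow-mark
 * lookup; any other error drops the packet. */
static int
sim_hw_flow(int recover_err, int *n_dropped)
{
    if (recover_err && recover_err != EOPNOTSUPP) {
        ++*n_dropped;   /* COVERAGE_INC(datapath_drop_hw_miss_recover) */
        return -1;      /* Caller skips further processing of the packet. */
    }
    return 0;           /* Proceed: mark check, then SW datapath. */
}
```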

> 
>>
>>     >
>>     > We should branch over this code in the partial HWOL scenario, where we don't need to get the call to pmd_send_port_cache_lookup() and netdev_hw_miss_packet_recover(). We want this branch to be transparent to the user. Since both full and partial HWOL falls under the other_config:hw-offload=true switch, we might need a configure time check NIC capabilities solution or something similar to branch over full HWOL packet recovery code. Does this make sense?
>>
>>     A compile-time check of capabilities doesn't make sense, as in the
>>     vast majority of cases it's unknown which HW the package will run
>>     on.  Code should be as generic as possible.
>>
>>     >
>>     >
>>     > perf top showing cycles spent per function in my partial HWOL scenario. We can see netdev_hw_miss_packet_recover(), netdev_offload_dpdk_hw_miss_packet_recover() and netdev_is_flow_api_enabled() taking cycles:
>>     >     28.79%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_input__
>>     >     13.72%  pmd-c01/id:8  ovs-vswitchd        [.] parse_tcp_flags
>>     >     11.07%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_recv_pkts_vec_avx2
>>     >     10.94%  pmd-c01/id:8  ovs-vswitchd        [.] i40e_xmit_fixed_burst_vec_avx2
>>     >      7.48%  pmd-c01/id:8  ovs-vswitchd        [.] cmap_find
>>     >      4.94%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_hw_miss_packet_recover
>>     >      4.77%  pmd-c01/id:8  ovs-vswitchd        [.] dp_execute_output_action
>>     >      3.81%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_pmd_flush_output_on_port
>>     >      3.40%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_send
>>     >      2.49%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_eth_send
>>     >      1.16%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_dpdk_rxq_recv
>>     >      0.90%  pmd-c01/id:8  ovs-vswitchd        [.] pmd_perf_end_iteration
>>     >      0.75%  pmd-c01/id:8  ovs-vswitchd        [.] dp_netdev_process_rxq_port
>>     >      0.68%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_is_flow_api_enabled
>>     >      0.55%  pmd-c01/id:8  ovs-vswitchd        [.] netdev_offload_dpdk_hw_miss_packet_recover
>>     >
>>     >> +
>>     >> +    /* If no mark, no flow to find. */
>>     >> +    if (!dp_packet_has_flow_mark(packet, &mark)) {
>>     >> +        *flow = NULL;
>>     >> +        return 0;
>>     >> +    }
>>     >> +
>>     >> +    *flow = mark_to_flow_find(pmd, mark);
>>     >> +    return 0;
>>     >> +}
>>     >> +
>>     >>  /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
>>     >>   * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
>>     >>   * miniflow is copied into 'keys' and the packet pointer is moved at the
>>     >> @@ -7106,7 +7140,6 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>     >>
>>     >>      DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
>>     >>          struct dp_netdev_flow *flow;
>>     >> -        uint32_t mark;
>>     >>
>>     >>          if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
>>     >>              dp_packet_delete(packet);
>>     >> @@ -7125,9 +7158,13 @@ dfc_processing(struct dp_netdev_pmd_thread *pmd,
>>     >>              pkt_metadata_init(&packet->md, port_no);
>>     >>          }
>>     >>
>>     >> -        if ((*recirc_depth_get() == 0) &&
>>     >> -            dp_packet_has_flow_mark(packet, &mark)) {
>>     >> -            flow = mark_to_flow_find(pmd, mark);
>>     >> +        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {
>>     >
>>     > Here we have a per packet call to netdev_is_flow_api_enabled(). I think that netdev_is_flow_api_enabled() should be inlined if it's going to be called per packet. We can see from the above "perf top" that it isn't inlined since it shows up as a separate function.
>>
>>     I'd consider "inlining" and moving a lot of stuff to headers harmful
>>     for the code base as it makes it less readable and it's really hard
>>     to preserve these kinds of things during code modification.
>>     It's much better to fix the logic instead of hammering the code with
>>     blind low level optimizations.
>>
>>     For this particular issue we already have a solution here:
>>       https://patchwork.ozlabs.org/project/openvswitch/patch/c0b99b4b9be6c290a6aa3cb00049fac4ebfac5d8.1618390390.git.bnemeth@redhat.com/
>>     In short, we only need to check once per batch.  I think, Balazs
>>     will be able to re-base his patch on top of this series including
>>     the check for netdev_is_flow_api_enabled().
>>
>>     Balazs, will you?
>>
>>     Best regards, Ilya Maximets.
>>
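The per-batch check discussed above can be illustrated with a standalone sketch. The `sim_*` names are hypothetical stand-ins, and the call counter exists only to make the call frequency visible:

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for netdev_is_flow_api_enabled(): an out-of-line call whose
 * invocations we count to compare per-packet vs. per-batch checking. */
static bool
sim_flow_api_enabled(int *n_calls)
{
    ++*n_calls;
    return true;
}

/* Per-packet check, as in the applied patch. */
static int
sim_batch_per_packet(int batch_size, int *n_calls)
{
    int n_hw_lookups = 0;
    for (int i = 0; i < batch_size; i++) {
        if (sim_flow_api_enabled(n_calls)) {
            n_hw_lookups++;     /* dp_netdev_hw_flow() would run here. */
        }
    }
    return n_hw_lookups;
}

/* Check hoisted out of the loop: once per batch, same result. */
static int
sim_batch_per_batch(int batch_size, int *n_calls)
{
    int n_hw_lookups = 0;
    bool flow_api = sim_flow_api_enabled(n_calls);

    for (int i = 0; i < batch_size; i++) {
        if (flow_api) {
            n_hw_lookups++;
        }
    }
    return n_hw_lookups;
}
```

For a 32-packet batch, both variants perform the same 32 lookups, but the hoisted version makes one function call instead of 32, which is the saving the rebased patch is expected to provide.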
diff mbox series

Patch

diff --git a/lib/dpif-netdev.c b/lib/dpif-netdev.c
index 8fa7eb6d4..d5b7d5b73 100644
--- a/lib/dpif-netdev.c
+++ b/lib/dpif-netdev.c
@@ -114,6 +114,7 @@  COVERAGE_DEFINE(datapath_drop_invalid_port);
 COVERAGE_DEFINE(datapath_drop_invalid_bond);
 COVERAGE_DEFINE(datapath_drop_invalid_tnl_port);
 COVERAGE_DEFINE(datapath_drop_rx_invalid_packet);
+COVERAGE_DEFINE(datapath_drop_hw_miss_recover);
 
 /* Protects against changes to 'dp_netdevs'. */
 static struct ovs_mutex dp_netdev_mutex = OVS_MUTEX_INITIALIZER;
@@ -7062,6 +7063,39 @@  smc_lookup_batch(struct dp_netdev_pmd_thread *pmd,
     pmd_perf_update_counter(&pmd->perf_stats, PMD_STAT_SMC_HIT, n_smc_hit);
 }
 
+static struct tx_port * pmd_send_port_cache_lookup(
+    const struct dp_netdev_pmd_thread *pmd, odp_port_t port_no);
+
+static inline int
+dp_netdev_hw_flow(const struct dp_netdev_pmd_thread *pmd,
+                  odp_port_t port_no,
+                  struct dp_packet *packet,
+                  struct dp_netdev_flow **flow)
+{
+    struct tx_port *p;
+    uint32_t mark;
+
+    /* Restore the packet if HW processing was terminated before completion. */
+    p = pmd_send_port_cache_lookup(pmd, port_no);
+    if (OVS_LIKELY(p)) {
+        int err = netdev_hw_miss_packet_recover(p->port->netdev, packet);
+
+        if (err && err != EOPNOTSUPP) {
+            COVERAGE_INC(datapath_drop_hw_miss_recover);
+            return -1;
+        }
+    }
+
+    /* If no mark, no flow to find. */
+    if (!dp_packet_has_flow_mark(packet, &mark)) {
+        *flow = NULL;
+        return 0;
+    }
+
+    *flow = mark_to_flow_find(pmd, mark);
+    return 0;
+}
+
 /* Try to process all ('cnt') the 'packets' using only the datapath flow cache
  * 'pmd->flow_cache'. If a flow is not found for a packet 'packets[i]', the
  * miniflow is copied into 'keys' and the packet pointer is moved at the
@@ -7106,7 +7140,6 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
 
     DP_PACKET_BATCH_REFILL_FOR_EACH (i, cnt, packet, packets_) {
         struct dp_netdev_flow *flow;
-        uint32_t mark;
 
         if (OVS_UNLIKELY(dp_packet_size(packet) < ETH_HEADER_LEN)) {
             dp_packet_delete(packet);
@@ -7125,9 +7158,13 @@  dfc_processing(struct dp_netdev_pmd_thread *pmd,
             pkt_metadata_init(&packet->md, port_no);
         }
 
-        if ((*recirc_depth_get() == 0) &&
-            dp_packet_has_flow_mark(packet, &mark)) {
-            flow = mark_to_flow_find(pmd, mark);
+        if (netdev_is_flow_api_enabled() && *recirc_depth_get() == 0) {
+            if (OVS_UNLIKELY(dp_netdev_hw_flow(pmd, port_no, packet, &flow))) {
+                /* Packet restoration failed and it was dropped, do not
+                 * continue processing.
+                 */
+                continue;
+            }
             if (OVS_LIKELY(flow)) {
                 tcp_flags = parse_tcp_flags(packet);
                 if (OVS_LIKELY(batch_enable)) {