[ovs-dev,v3,0/9] OVS-DPDK flow offload with rte_flow

Message ID 1506404199-23579-1-git-send-email-yliu@fridaylinux.org

Message

Yuanhan Liu Sept. 26, 2017, 5:36 a.m. UTC
Some highlights in v3
======================

Here is v3, with 2 major changes and further testing (including
many-flow tests). This took more effort than I expected, so the v3
publication has been delayed for a while.

The first major change is that the mark and id association is now done
with an array instead of a CMAP. This gives us a further performance
gain: up to 70% now (please see the exact numbers below).

This change also makes the code a bit more complex, though, due to
locking. RCU is used (though I'm not sure it's used correctly). For
now, RCU only protects the array base address update (on reallocation);
it doesn't protect changes to an array item (array[i] = xx). I think
it's buggy and I need to rethink it.
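
For reference, below is a minimal sketch of that scheme using OVS's ovsrcu
API. The names (mark_to_flow, mark_to_flow_get, mark_to_flow_grow) are
illustrative, not the actual patch code, and it shares the caveat above:
only the base-pointer swap is RCU-protected, not the per-slot stores.

    static OVSRCU_TYPE(struct dp_netdev_flow **) mark_to_flow;
    static size_t mark_to_flow_len;

    /* Reader side (PMD thread): dereference the base pointer under RCU. */
    static struct dp_netdev_flow *
    mark_to_flow_get(uint32_t mark)
    {
        struct dp_netdev_flow **array
            = ovsrcu_get(struct dp_netdev_flow **, &mark_to_flow);

        return mark < mark_to_flow_len ? array[mark] : NULL;
    }

    /* Writer side: grow the array. Readers keep seeing the old copy
     * until a grace period has passed; only then is it freed. */
    static void
    mark_to_flow_grow(size_t new_len)
    {
        struct dp_netdev_flow **old
            = ovsrcu_get_protected(struct dp_netdev_flow **, &mark_to_flow);
        struct dp_netdev_flow **new = xcalloc(new_len, sizeof *new);

        if (old) {
            memcpy(new, old, mark_to_flow_len * sizeof *old);
        }
        ovsrcu_set(&mark_to_flow, new);   /* publish the new base address */
        ovsrcu_postpone(free, old);       /* free the old copy later */
        mark_to_flow_len = new_len;
    }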

The second major change is that a thread is introduced to do the real
flow offload. This is to reduce the overhead of hw flow offload
installation/deletion on the data path. See patch 9 for more details.

In the last discussion, the RSS action was suggested as a way to get
rid of the QUEUE action workaround. Unfortunately, it didn't work: the
flow creation failed with the MLX5 PMD driver, and that's the only
driver in DPDK that supports the RSS action so far.

I also tested many flows this time. The result is more exciting: up to
a 267% boost, with 512 megaflows (each with another 512 exact-match
flows, thus 512*512=256K flows in total), one core and one queue doing
PHY-PHY forwarding. For the offload case, the performance stays the
same as with one flow only, because the cost of the mark-to-flow
translation is constant no matter how many flows are inserted (as long
as they are all offloaded). For vanilla OVS-DPDK, however, the more
flows, the worse the performance. In other words, the more flows, the
bigger the difference we will see.

There were many discussions on the last version. I'm sorry if I missed
some comments and didn't make the corresponding changes in v3. Please
let me know if I made such mistakes.

Below is the formal cover letter introduction, for those seeing this
patchset for the first time.

---
Hi,

Here is joint work from Mellanox and Napatech to enable flow hw
offload with the DPDK generic flow interface (rte_flow).

The basic idea is to associate the flow with a mark id (a uint32_t
number). Later, we get the flow directly from the mark id, bypassing
the heavy EMC processing, including miniflow_extract.

The association is done with an array in patch 1. It also reuses the
flow APIs introduced while adding the tc offloads. The EMC bypass is
done in patch 2. The flow offload is done in patch 4, which mainly does
two things (sketched below):

- translate the OVS match to DPDK rte_flow patterns
- bind those patterns with a MARK action.
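
As a rough illustration of those two steps with the rte_flow API (a sketch
only: the real patches build the pattern list from the OVS match fields,
and, as discussed further down the thread, a fate action such as QUEUE or
RSS usually has to be appended as well):

    #include <rte_flow.h>

    static struct rte_flow *
    offload_flow_example(uint16_t port_id, uint32_t mark_id,
                         struct rte_flow_error *error)
    {
        struct rte_flow_attr attr = { .ingress = 1 };
        struct rte_flow_item pattern[] = {
            /* Real code fills in eth/ip/tcp specs from the OVS match. */
            { .type = RTE_FLOW_ITEM_TYPE_ETH },
            { .type = RTE_FLOW_ITEM_TYPE_END },
        };
        struct rte_flow_action_mark mark = { .id = mark_id };
        struct rte_flow_action actions[] = {
            { .type = RTE_FLOW_ACTION_TYPE_MARK, .conf = &mark },
            { .type = RTE_FLOW_ACTION_TYPE_END },
        };

        return rte_flow_create(port_id, &attr, pattern, actions, error);
    }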

Afterwards, the NIC will set the mark id in every pkt's mbuf when it
matches the flow. That's basically how we get the flow directly from
the received mbuf, as sketched below.
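
On the RX side, with the DPDK releases of this time, that roughly means the
following; mark_to_flow_find() is hypothetical shorthand for the array
lookup from patch 1:

    #include <rte_mbuf.h>

    static struct dp_netdev_flow *
    flow_from_mbuf(struct rte_mbuf *mbuf)
    {
        if (mbuf->ol_flags & PKT_RX_FDIR_ID) {    /* MARK action hit */
            uint32_t mark = mbuf->hash.fdir.hi;   /* mark id set by the NIC */

            return mark_to_flow_find(mark);       /* no miniflow_extract/EMC */
        }
        return NULL;                              /* take the normal path */
    }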

While testing PHY-PHY forwarding with one core, one queue and one flow,
I got about a 70% performance boost. For PHY-vhost forwarding, I got
about a 50% boost. That's basically the performance I got with v1, when
tcp_flags was ignored. In summary, the CMAP-to-array change yields yet
another 16% performance boost.

The major issue mentioned in v1 is also worked around: the queue index
is no longer blindly set to 0, but set to the rxq that first receives
the upcall pkt.
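
A sketch of that workaround, continuing the hypothetical MARK example
above; upcall_rxq stands for the rx queue that saw the upcall:

    struct rte_flow_action_mark mark = { .id = mark_id };
    struct rte_flow_action_queue queue = { .index = upcall_rxq };
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_MARK,  .conf = &mark },
        { .type = RTE_FLOW_ACTION_TYPE_QUEUE, .conf = &queue },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };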

Note that hw offload is disabled by default; it can be enabled with:

    $ ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

v3: - The mark and id association is done with an array instead of a CMAP.
    - Added a thread to do hw offload operations
    - Removed macros completely
    - dropped the patch to set FDIR_CONF, which was a workaround for
      some Intel NICs.
    - Added a debug patch to show all flow patterns we have created.
    - Misc fixes

v2: - workaround the queue action issue
    - fixed the tcp_flags being skipped issue, which also fixed the
      build warnings
    - fixed l2 patterns for Intel nic
    - Converted some macros to functions
    - stopped hardcoding the max number of flows/actions
    - rebased on top of the latest code

Thanks.

    --yliu

---
Finn Christensen (2):
  netdev-dpdk: implement flow put with rte flow
  netdev-dpdk: retry with queue action

Shachar Beiser (1):
  dpif-netdev: record rx queue id for the upcall

Yuanhan Liu (6):
  dpif-netdev: associate flow with a mark id
  dpif-netdev: retrieve flow directly from the flow mark
  netdev-dpdk: convert ufid to dpdk flow
  netdev-dpdk: remove offloaded flow on deletion
  netdev-dpdk: add debug for rte flow patterns
  dpif-netdev: do hw flow offload in another thread

 lib/dp-packet.h   |  14 +
 lib/dpif-netdev.c | 421 ++++++++++++++++++++++++++++-
 lib/flow.c        | 155 ++++++++---
 lib/flow.h        |   1 +
 lib/netdev-dpdk.c | 776 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 lib/netdev.c      |   1 +
 lib/netdev.h      |   7 +
 7 files changed, 1331 insertions(+), 44 deletions(-)

Comments

Darrell Ball Sept. 26, 2017, 9:58 p.m. UTC | #1
On 9/25/17, 10:37 PM, "Yuanhan Liu" <yliu@fridaylinux.org> wrote:

    The second major change is that a thread is introduced to do the real
    flow offload. This is to reduce the overhead of hw flow offload
    installation/deletion on the data path. See patch 9 for more details.

[Darrell] There might be other options to handle this, such as dividing time
to HWOL within a given PMD.
Another option is PMD isolation.
    
    In the last discussion, the RSS action was suggested as a way to get
    rid of the QUEUE action workaround. Unfortunately, it didn't work: the
    flow creation failed with the MLX5 PMD driver, and that's the only
    driver in DPDK that supports the RSS action so far.


[Darrell]
I wonder if we should take a pause before jumping into the next version of
the code.

If workarounds are required in OVS code, then so be it, as long as they are
not overly disruptive to the existing code and hard to maintain.
In the case of RTE_FLOW_ACTION_TYPE_RSS, we might have a reasonable option
to avoid some unpleasant OVS workarounds.
This would make a significant difference in the code paths, if we supported
it, so we need to be sure as early as possible.
The support needed would be in some drivers and seems reasonably doable.
Moreover, this was discussed in the last DPDK meeting and the support was
indicated as existing(?), although I only verified the MLX code myself.

I had seen the MLX code supporting the _RSS action, and there are some checks
for supported cases; when you say “it didn't work”, what was the issue?
Let us also have a discussion about the Intel NIC side and the Napatech side.
It seems reasonable to ask where the disconnect is and whether this support
can be added, and then make a decision based on the answers.

What do you think?
Yuanhan Liu Sept. 27, 2017, 1:24 a.m. UTC | #2
On Tue, Sep 26, 2017 at 09:58:16PM +0000, Darrell Ball wrote:
>     The second major change is that a thread is introduced to do the real
>     flow offload. This is to reduce the overhead of hw flow offload
>     installation/deletion on the data path. See patch 9 for more details.
> 
> [Darrell] There might be other options to handle this, such as dividing time
> to HWOL within a given PMD.
> Another option is PMD isolation.

Good to know, though I'm not sure I understand it so far. But it can be
discussed later. That's also the reason I put it in the last patch, so
that we could easily switch it to another solution, or do something even
simpler (just remove it).

>     In the last discussion, the RSS action was suggested as a way to get
>     rid of the QUEUE action workaround. Unfortunately, it didn't work: the
>     flow creation failed with the MLX5 PMD driver, and that's the only
>     driver in DPDK that supports the RSS action so far.
> 
> 
> [Darrell] 
> I wonder if we should take a pause before jumping into the next version of the code.

I have no objections.

> If workarounds are required in OVS code, then so be it, as long as they are
> not overly disruptive to the existing code and hard to maintain.
> In the case of RTE_FLOW_ACTION_TYPE_RSS, we might have a reasonable option
> to avoid some unpleasant OVS workarounds.
> This would make a significant difference in the code paths, if we supported
> it, so we need to be sure as early as possible.

I agree.

> The support needed would be in some drivers and seems reasonably doable.
> Moreover, this was discussed in the last DPDK meeting and the support was
> indicated as existing(?), although I only verified the MLX code myself.

From the code point of view, yes, the code was there months ago.

> I had seen the MLX code supporting the _RSS action, and there are some
> checks for supported cases; when you say “it didn't work”, what was the issue?

I actually met the error months ago; I even debugged it. IIRC, the
error came from libibverbs, when trying to create the hw flow. I need
to double-check it with our colleague who developed this feature.

> Let us also have a discussion about the Intel NIC side and the Napatech side.

I think it's not necessary for Napatech, because they can support a
MARK-only action.

For Intel, yes, I think Intel folks could give some comments here.

> It seems reasonable to ask where the disconnect is and whether this support
> can be added, and then make a decision based on the answers.

No objections.

	--yliu
Yuanhan Liu Sept. 27, 2017, 2:09 a.m. UTC | #3
On Wed, Sep 27, 2017 at 09:24:54AM +0800, Yuanhan Liu wrote:
> > The support needed would be in some drivers and seems reasonably doable.
> > Moreover, this was discussed in the last DPDK meeting and the support was
> > indicated as existing(?), although I only verified the MLX code myself.
> 
> From the code point of view, yes, the code was there months ago.
> 
> > I had seen the MLX code supporting the _RSS action, and there are some
> > checks for supported cases; when you say “it didn't work”, what was the issue?
> 
> I actually met the error months ago; I even debugged it. IIRC, the
> error came from libibverbs, when trying to create the hw flow. I need
> to double-check it with our colleague who developed this feature.

Hmm, I think I was wrong; apparently I did something stupid in my last
try. I double-checked just now, and at least the flow creation works. I
just need to take some time to integrate it into OVS-DPDK, to see how it
goes. I will get back to you when I have done that. Currently, I'm
occupied with something else.

	--yliu
Darrell Ball Sept. 27, 2017, 3:13 a.m. UTC | #4
On 9/26/17, 6:25 PM, "Yuanhan Liu" <yliu@fridaylinux.org> wrote:

    On Tue, Sep 26, 2017 at 09:58:16PM +0000, Darrell Ball wrote:
    [...]

    I think it's not necessary for Napatech, because they can support a
    MARK-only action.
    
It is not necessary for Napatech; however, to avoid the need to detect the
Napatech special (good) case, ‘ideally’ we do the exact same programming
even in the Napatech case.
Yuanhan Liu Sept. 27, 2017, 3:26 a.m. UTC | #5
On Wed, Sep 27, 2017 at 03:13:33AM +0000, Darrell Ball wrote:
> On 9/26/17, 6:25 PM, "Yuanhan Liu" <yliu@fridaylinux.org> wrote:
> 
>     [...]
> 
>     I think it's not necessary for Napatech, because they can support a
>     MARK-only action.
> 
> It is not necessary for Napatech; however, to avoid the need to detect the
> Napatech special (good) case, ‘ideally’ we do the exact same programming
> even in the Napatech case.

Would the following look okay to you (assuming we have the probe ability and
DPDK has some basic capability feedback)?

	if (!pure_mark_cap_probed) {
		if (dpdk_rte_flow_has_rss_action_support) {
			append_rss_action();
		} else {
			return -1;   /* fail */
		}
	}

	/* Create the flow once, with no retries; if it fails, let it fail. */
	flow = rte_flow_create(...);

I assume that's how the code would finally look. What do you think?

	--yliu
Darrell Ball Sept. 27, 2017, 3:54 a.m. UTC | #6
On 9/26/17, 8:27 PM, "Yuanhan Liu" <yliu@fridaylinux.org> wrote:

    On Wed, Sep 27, 2017 at 03:13:33AM +0000, Darrell Ball wrote:
    > 

    > 

    > On 9/26/17, 6:25 PM, "Yuanhan Liu" <yliu@fridaylinux.org> wrote:

    > 

    >     On Tue, Sep 26, 2017 at 09:58:16PM +0000, Darrell Ball wrote:

    >     >     The second major change is there is a thread introduced to do the real

    >     >     flow offload. This is for diminishing the overhead by hw flow offload

    >     >     installation/deletion at data path. See patch 9 for more detailed info.

    >     > 

    >     > [Darrell] There might be other options to handle this such as dividing time

    >     > to HWOL in a given PMD. 

    >     > Another option to have PMD isolation.

    >     

    >     Good to know, though I'm not sure I understand it so far. But it can be

    >     discussed later. That's also the reason I put it in the last patch, so

    >     that we could easily turn it to another solution, or even simpler (just

    >     remove it).

    >     

    >     >     In the last discussion, RSS action was suggested to use to get rid of

    >     >     the QUEUE action workaround. Unfortunately, it didn't work. The flow

    >     >     creation failed with MLX5 PMD driver, and that's the only driver in

    >     >     DPDK that supports RSS action so far.

    >     > 

    >     > 

    >     > [Darrell] 

    >     > I wonder if we should take a pause before jumping into the next version of the code.

    >     

    >     I have no objections.

    >     

    >     > If workarounds are required in OVS code, then so be it as long as they are not

    >     > overly disruptive to the existing code and hard to maintain.

    >     > In the case of RTE_FLOW_ACTION_TYPE_RSS, we might have a reasonable option

    >     > to avoid some unpleasant OVS workarounds.

    >     > This would make a significant difference in the code paths, if we supported it, so

    >     > we need to be sure as early as possible.

    >     

    >     I agree.

    >     

    >     > The support needed would be in some drivers and seems reasonably doable. 

    >     > Moreover, this was discussed in the last dpdk meeting and the support was

    >     > indicated as existing?, although I only verified the MLX code, myself.

    >     

    >     From the code point of view, yes, the code was there months ago.

    >     

    >     > I had seen the MLX code supporting _RSS action and there are some checks for

    >     > supported cases; when you say “it didn't work”, what was the issue ?

    >     

    >     I actually have met the error months ago, I even debugged it. IIRC,

    >     the error is from libibverbs, when trying to create the hw flow. I

    >     think I need double-confirm it with our colleague who developed this

    >     feature.

    >     

    >     > Let us have a discussion also about the Intel nic side and the Napatech side.

    >     

    >     I think it's not necessary for Napatech, because they can support

    >     MARK only action.

    >     

    > It is not necessary for Napatech; however to avoid the need to detect the Napatech

    > special (good) case, ‘ideally’ we do the exact same programming even in Napatech case.

    
    Will following look okay to you (assuming we have the probe ability and DPDK
    has some basic capability feedback)?
    
    	if (!pure_mark_cap_probed) {
    		if (dpdk_rte_flow_has_rss_action_support) {
    			append_rss_action();
    		} else {
    			fail and return;
    		}
    	}
    
    	/* create flow once, with no retries, if fails, let it fail */
    	flow = rte_flow_create(...);
    
    I assume that's how the code looks like finally. What do you think?


[Darrell] It looks fine; of course, if we could drop the dependency on the
cap probe, it would be ideal.

Finn Christensen Sept. 27, 2017, 9:12 a.m. UTC | #7
    -----Original Message-----
    From: Darrell Ball [mailto:dball@vmware.com]
    Sent: 27. september 2017 05:54
    To: Yuanhan Liu <yliu@fridaylinux.org>
    Cc: dev@openvswitch.org; Finn Christensen <fc@napatech.com>;
    Chandran Sugesh <sugesh.chandran@intel.com>; Simon Horman
    <simon.horman@netronome.com>
    Subject: Re: [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow

    [...]

        Would the following look okay to you (assuming we have the probe
        ability and DPDK has some basic capability feedback)?

        	if (!pure_mark_cap_probed) {
        		if (dpdk_rte_flow_has_rss_action_support) {
        			append_rss_action();
        		} else {
        			return -1;   /* fail */
        		}
        	}

        	/* Create the flow once, with no retries; if it fails, let
        	 * it fail. */
        	flow = rte_flow_create(...);

        I assume that's how the code would finally look. What do you think?

    [Darrell] It looks fine; of course, if we could drop the dependency on
    the cap probe, it would be ideal.

[Finn]
From a Napatech point of view, we would definitely want to use
RTE_FLOW_ACTION_TYPE_RSS if it were implemented. It definitely makes good
sense to me. I have 2 reasons for this:
1. It would fit nicely into the way Napatech does flow programming in the NIC.
2. We would be fully RSS-controlled through OVS.
Furthermore, future RSS splitting on a per-megaflow basis will be possible.
IMO the "pure_mark_cap_probed" function is a temporary workaround to make it
work. The RSS action seems to be a solution to the queue problem.

Yuanhan Liu Sept. 27, 2017, 9:47 a.m. UTC | #8
On Wed, Sep 27, 2017 at 09:12:49AM +0000, Finn Christensen wrote:
> 
>     [Darrell] It looks fine; of course, if we could drop the dependency on
>     the cap probe, it would be ideal.
> 
> [Finn]
> From a Napatech point of view, we would definitely want to use
> RTE_FLOW_ACTION_TYPE_RSS if it were implemented. It definitely makes good
> sense to me. I have 2 reasons for this:
> 1. It would fit nicely into the way Napatech does flow programming in the NIC.
> 2. We would be fully RSS-controlled through OVS.
> Furthermore, future RSS splitting on a per-megaflow basis will be possible.
> IMO the "pure_mark_cap_probed" function is a temporary workaround to make it
> work. The RSS action seems to be a solution to the queue problem.

Finn, that's really good to know! I agree: without this probe, the
code is simpler and cleaner.

A few questions though. Has Napatech already implemented the RSS action?
If not, what's the plan? And are you okay with the following code?

	add_mark_action();
	add_rss_action_unconditionally();

	flow = rte_flow_create(...);
	if (!flow)
		return -1;

That is, no probes, no feedback. If it fails, it just fails (flow
offload is then simply skipped).
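
For what it's worth, a sketch of what append_rss_action_unconditionally()
could look like with the current struct rte_flow_action_rss layout (the
layout may differ across DPDK versions); nrxq and the actions[] tail
index n are assumed from context:

	struct rte_flow_action_rss *rss;

	rss = xmalloc(sizeof *rss + nrxq * sizeof rss->queue[0]);
	rss->rss_conf = NULL;              /* NULL: keep the port's RSS config */
	rss->num = nrxq;
	for (uint16_t i = 0; i < nrxq; i++) {
		rss->queue[i] = i;         /* spread over all rx queues */
	}
	actions[n].type = RTE_FLOW_ACTION_TYPE_RSS;
	actions[n].conf = rss;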

But I do think some feedback is necessary. If we knew the rte_flow
capabilities at the beginning, we could simply disable flow offload for
a specific port that has no offload ability. This would avoid
repeatedly attempting hw offloads that can never succeed.

	--yliu
Finn Christensen Sept. 27, 2017, 10:41 a.m. UTC | #9
    -----Original Message-----
    From: Yuanhan Liu [mailto:yliu@fridaylinux.org]
    Sent: 27. september 2017 11:47
    To: Finn Christensen <fc@napatech.com>
    Cc: Darrell Ball <dball@vmware.com>; dev@openvswitch.org; Chandran
    Sugesh <sugesh.chandran@intel.com>; Simon Horman
    <simon.horman@netronome.com>
    Subject: Re: [PATCH v3 0/9] OVS-DPDK flow offload with rte_flow
    
    On Wed, Sep 27, 2017 at 09:12:49AM +0000, Finn Christensen wrote:
    > [...]
    
    Finn, that's really good to know! I agree: without this probe, the code
    is simpler and cleaner.

    A few questions though. Has Napatech already implemented the RSS action?
    If not, what's the plan?

[Finn]
Our NIC handles RSS on a per-flow basis, but we have not yet used the rte
RSS action for OVS. In OVS we simply handle the RSS config outside of it.
Full control of RSS through OVS is better, though.

    And are you okay with the following code?

    	add_mark_action();
    	add_rss_action_unconditionally();

    	flow = rte_flow_create(...);
    	if (!flow)
    		return -1;

    That is, no probes, no feedback. If it fails, it just fails (flow
    offload is then simply skipped).
    
[Finn]
Yes, we can easily make this work. Good suggestion!


    But I do think some feedback is necessary. If we knew the rte_flow
    capabilities at the beginning, we could simply disable flow offload for
    a specific port that has no offload ability. This would avoid repeatedly
    attempting hw offloads that can never succeed.
    
[Finn]
This seems to be a good idea.

Darrell Ball Sept. 27, 2017, 6:54 p.m. UTC | #10
On 9/27/17, 3:41 AM, "Finn Christensen" <fc@napatech.com> wrote:

    [...]

        But I do think some feedback is necessary. If we knew the rte_flow
        capabilities at the beginning, we could simply disable flow offload
        for a specific port that has no offload ability. This would avoid
        repeatedly attempting hw offloads that can never succeed.

    [Finn]
    This seems to be a good idea.
[Darrell] Regarding the general question of capability checking, it is fine
to have this support in general, and we have already identified some cases
where it would be best to use it if we can (the question mostly comes down
to support by RTE and the drivers).

Probing is different: we usually use that term for some trial-and-error
checking done during initialization to configure something, see if it works,
and then mark the associated capability as enabled or disabled. We could
also use a different kind of probing here, in a dynamic fashion, where we
try to do HWOL and, if it fails X times, don't try again unless we detect
that the port has been reconfigured; for example, in case the capability
check is not possible or not implemented yet.
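
A minimal sketch of that dynamic-probing idea (all names and the threshold
are illustrative):

	#define HW_OFFLOAD_MAX_FAILURES 8       /* illustrative threshold */

	struct port_offload_state {
		unsigned int failures;          /* consecutive offload failures */
		bool disabled;                  /* stop trying until reconfigured */
	};

	static void
	hw_offload_record_result(struct port_offload_state *s, bool ok)
	{
		if (ok) {
			s->failures = 0;
		} else if (++s->failures >= HW_OFFLOAD_MAX_FAILURES) {
			s->disabled = true;     /* give up on this port */
		}
	}

	static void
	hw_offload_port_reconfigured(struct port_offload_state *s)
	{
		s->failures = 0;
		s->disabled = false;            /* give offloading another chance */
	}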
Chandran, Sugesh Oct. 2, 2017, 4:22 p.m. UTC | #11
> [Darrell] Regarding the general question of capability checking, it is fine
> to have this support in general, and we have already identified some cases
> where it would be best to use it if we can (the question mostly comes down
> to support by RTE and the drivers).
> 
> Probing is different: we usually use that term for some trial-and-error
> checking done during initialization to configure something, see if it
> works, and then mark the associated capability as enabled or disabled. We
> could also use a different kind of probing here, in a dynamic fashion,
> where we try to do HWOL and, if it fails X times, don't try again unless
> we detect that the port has been reconfigured; for example, in case the
> capability check is not possible or not implemented yet.
[Sugesh] I agree with Darrell here, and we are also looking at implementing
capability APIs to expose the device feature sets. I will check with our
DPDK team and post an update.
Finn Christensen Oct. 3, 2017, 8:44 a.m. UTC | #12
    -----Original Message-----
    From: Chandran, Sugesh [mailto:sugesh.chandran@intel.com]
    Sent: 2. oktober 2017 18:22
    [...]

    > We could also use a different kind of probing here, in a dynamic
    > fashion, where we try to do HWOL and, if it fails X times, don't try
    > again unless we detect that the port has been reconfigured; for
    > example, in case the capability check is not possible or not
    > implemented yet.

    [Sugesh] I agree with Darrell here, and we are also looking at
    implementing capability APIs to expose the device feature sets. I will
    check with our DPDK team and post an update.
[Finn]
I also agree here; however, the "fail X times, then mark the port as HWOL
disabled" behavior should only be applied initially, while no successful
offloads have yet occurred.
Yuanhan Liu Oct. 18, 2017, 1:53 p.m. UTC | #13
On Wed, Sep 27, 2017 at 10:09:20AM +0800, Yuanhan Liu wrote:
> [...]
> 
> Hmm, I think I was wrong; apparently I did something stupid in my last
> try. I double-checked just now, and at least the flow creation works. I
> just need to take some time to integrate it into OVS-DPDK, to see how it
> goes. I will get back to you when I have done that. Currently, I'm
> occupied with something else.

Here is a short update I should have given last week (apologies for
being late).

The RSS action should work. I have also integrated it into OVS-DPDK. I
intended to send v4 last week, but then I noticed a fatal bug (regarding
MQ).

I was still heavily occupied by something else, so I didn't have time to
fix it. I have tried to figure out a solution in my spare time, though,
and I may have found one; I just won't know the result until I actually
try it.

I will post an update when I have made progress. Please also note that
it's very likely I won't have time for this before the end of the month :/

	--yliu
Ben Pfaff Oct. 31, 2017, 9:19 p.m. UTC | #14
I see that this series appears to have stalled.  Should we expect a new
version sometime soon?  Do you need more reviews on the current version?
Yuanhan Liu Nov. 1, 2017, 3:24 p.m. UTC | #15
On Tue, Oct 31, 2017 at 02:19:20PM -0700, Ben Pfaff wrote:
> I see that this series appears to have stalled.  Should we expect a new
> version sometime soon?  Do you need more reviews on the current version?

My bad. As I stated in another email, I meant to send v4 earlier, but I
found a bug. Meanwhile, I've been heavily occupied by something else, so
I haven't had time to get it into shape.

I will likely be able to start working on v4 in the middle of this month.

	--yliu
Ben Pfaff Nov. 1, 2017, 3:31 p.m. UTC | #16
On Wed, Nov 01, 2017 at 11:24:30PM +0800, Yuanhan Liu wrote:
> On Tue, Oct 31, 2017 at 02:19:20PM -0700, Ben Pfaff wrote:
> > I see that this series appears to have stalled.  Should we expect a new
> > version sometime soon?  Do you need more reviews on the current version?
> 
> My bad. As I stated in another email, I meant to send v4 earlier, but I
> found a bug. Meanwhile, I've been heavily occupied by something else, so
> I haven't had time to get it into shape.
> 
> I will likely be able to start working on v4 in the middle of this month.

There's no hurry from my perspective; my main goal was to make sure that
I/we hadn't dropped the ball. I'll look forward to the next version.