
[ovs-dev,RFC,dpdk-latest,v3,0/1] Enable vhost async API's in OvS.

Message ID 20220104125242.1064162-1-sunil.pai.g@intel.com

Message

Pai G, Sunil Jan. 4, 2022, 12:52 p.m. UTC
This series brings the new asynchronous vhost APIs in DPDK to OVS.
With the asynchronous framework, vhost-user can offload the memory copy operation
to DPDK dmadev-backed devices such as Intel® QuickData Technology without
blocking the CPU.

The intention of this patch is solely to communicate the design of the feature,
evaluate the code impact, and gather early feedback from the community.

Usage:
To enable DMA offload for the vhost path, one can issue the following command:
ovs-vsctl --no-wait set Open_vSwitch . other_config:vhost-async-support=true

Issuing this command requires a restart of the vswitchd daemon.

Please note that it is required to have DMA devices visible to DPDK before OVS is launched, either
by binding them to the userspace vfio-pci driver or through other facilities such as
assigning specific DSA work queues to DPDK, as described in the driver documentation.

The current assignment model is one DMA device per data plane thread.
If no DMA device is found for a thread, that thread falls back to CPU copies; a rough sketch of this assignment is shown below.
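A minimal sketch of this per-thread assignment, assuming an illustrative bookkeeping table, upper bound and helper name (not the patch's actual code); a return value of -1 means no device was found and the thread uses CPU copies:

    /* Illustrative sketch only: the names, the bookkeeping table and the
     * upper bound are assumptions, not code from the patch. */
    #include <stdbool.h>
    #include <rte_dmadev.h>
    #include <rte_spinlock.h>

    #define MAX_DMA_DEVS 64  /* illustrative upper bound on dmadev ids */

    static bool dma_dev_taken[MAX_DMA_DEVS];
    static rte_spinlock_t dma_assign_lock = RTE_SPINLOCK_INITIALIZER;

    /* Claim one free dmadev for the calling PMD thread, or return -1 so
     * that the thread falls back to CPU copies. */
    static int16_t
    pmd_thread_assign_dma_dev(void)
    {
        int16_t dev_id = -1;
        int16_t i;

        rte_spinlock_lock(&dma_assign_lock);
        for (i = 0; i < MAX_DMA_DEVS; i++) {
            if (rte_dma_is_valid(i) && !dma_dev_taken[i]) {
                dma_dev_taken[i] = true;
                dev_id = i;
                break;
            }
        }
        rte_spinlock_unlock(&dma_assign_lock);

        return dev_id;
    }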

To check if async support is enabled, one can look for the following message in ovs-vswitchd.log:
dpdk|INFO|Async support enabled for vhost-user-client.

To verify that DMA devices are assigned to the data plane threads, one can check ovs-vswitchd.log for messages like:
netdev_dpdk(pmd-c01/id:7)|INFO|DMA device with dev id: 0 assigned to pmd for vhost async copy offload.


This patch was tested on:
OVS:  branch dpdk-latest
with
DPDK: branch main, commit id 45f04d88ae, version 21.11.0
+ https://patches.dpdk.org/project/dpdk/list/?series=21041 : v1 of vhost: integrate dmadev in asynchronous datapath.
+ https://patches.dpdk.org/project/dpdk/list/?series=21043 : v1 RFC of vhost: support async dequeue data path.

Please note:
	- All the DPDK APIs (both vhost async and dmadev) used in this patch are experimental.
	- It is required to have at least one vhost rxq on every core, since currently even the vhost Tx packets are polled for completions from the vhost Rx context.
	- Packets are not cleared on quiesce currently, so for a graceful exit it is required to stop the traffic before stopping OVS.

Future work:
	- Graceful vhost hot plugging.
	- Compatibility with PMD auto load balancing.
	- Clear packets from Rx and Tx queues before quiesce.

V2 -> V3:
----------
- Update code according to the new architecture with the dmadev datapath in the vhost library.
    ○ The dmadev datapath has moved into the vhost library, and OVS is now only responsible for configuring and using the DMA devices; a minimal configuration sketch follows below.
    ○ This gives OVS the flexibility to decide on the allocation scheme for DMA devices.
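For reference, a minimal sketch of what configuring one dmadev from OVS could look like, using the public DPDK dmadev API (the function name, ring size and error handling are simplified assumptions, not the patch's actual code):

    /* Minimal sketch of dmadev setup on the application side; the
     * function name, ring size and error handling are assumptions. */
    #include <rte_dmadev.h>

    static int
    ovs_dma_dev_setup(int16_t dev_id)
    {
        struct rte_dma_info info;
        struct rte_dma_conf dev_conf = { .nb_vchans = 1 };
        struct rte_dma_vchan_conf vchan_conf = {
            .direction = RTE_DMA_DIR_MEM_TO_MEM,
        };

        if (rte_dma_info_get(dev_id, &info) < 0) {
            return -1;
        }
        /* Keep the illustrative ring size within the device limits. */
        vchan_conf.nb_desc = info.max_desc < 1024 ? info.max_desc : 1024;

        if (rte_dma_configure(dev_id, &dev_conf) < 0
            || rte_dma_vchan_setup(dev_id, 0, &vchan_conf) < 0) {
            return -1;
        }
        return rte_dma_start(dev_id);
    }

In the merged upstream API the configured device is additionally registered with the vhost library via rte_vhost_async_dma_configure(); the exact call used by the RFC series under review may differ.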


Sunil Pai G (1):
  netdev-dpdk: Enable vhost async API's in OvS.

 lib/dpdk-stub.c   |   6 +
 lib/dpdk.c        |  28 +++
 lib/dpdk.h        |   1 +
 lib/netdev-dpdk.c | 442 ++++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 463 insertions(+), 14 deletions(-)

Comments

Pai G, Sunil Feb. 1, 2022, 2:23 p.m. UTC | #1
Hi,

This version of the patch seems to have a negative impact on performance for the burst traffic profile [1].
The benefit seen with the previous version (v2) was up to ~1.6x for 1568-byte packets, compared to ~1.2x with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
The cause of the drop seems to be excessive vhost txq contention across the PMD threads.

[1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf 
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

Thanks and regards
Sunil
Maxime Coquelin Feb. 3, 2022, 9:37 a.m. UTC | #2
Hi Sunil,

On 2/1/22 15:23, Pai G, Sunil wrote:
> Hi ,
> 
> This version of the patch seems to have negative impact on performance for burst traffic profile[1].
> Benefits seen with the previous version (v2) was up to ~1.6x for 1568 byte packets compared to ~1.2x seen with the current design (v3) as measured on new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
> The cause of the drop seems to be because of the excessive vhost txq contention across the PMD threads.

So it means the Tx/Rx queue pairs aren't consumed by the same PMD
thread. Can you confirm?

> [1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
> [2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> 
> Thanks and regards
> Sunil
Pai G, Sunil Feb. 3, 2022, 10:48 a.m. UTC | #3
Hi Maxime, 

> > This version of the patch seems to have negative impact on performance
> for burst traffic profile[1].
> > Benefits seen with the previous version (v2) was up to ~1.6x for 1568 byte
> packets compared to ~1.2x seen with the current design (v3) as measured on
> new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
> > The cause of the drop seems to be because of the excessive vhost txq
> contention across the PMD threads.
> 
> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> thread. can you confirm?

Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.

> 
> > [1]:
> > https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized
> > -deployment-benchmark-technology-guide.pdf
> > [2]:
> > https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
> >
> > Thanks and regards
> > Sunil
Ilya Maximets Feb. 3, 2022, 11:21 a.m. UTC | #4
On 2/3/22 11:48, Pai G, Sunil wrote:
> Hi Maxime, 
> 
>>> This version of the patch seems to have negative impact on performance
>> for burst traffic profile[1].
>>> Benefits seen with the previous version (v2) was up to ~1.6x for 1568 byte
>> packets compared to ~1.2x seen with the current design (v3) as measured on
>> new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
>>> The cause of the drop seems to be because of the excessive vhost txq
>> contention across the PMD threads.
>>
>> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
>> thread. can you confirm?
> 
> Yes, the completion polls for a given txq happens on a single PMD thread(on the same thread where its corresponding rxq is being polled) but other threads can submit(enqueue) packets on the same txq,  which leads to contention.

Why can't this process be lockless?
If we have to lock the device, maybe we can do both submission
and completion from the thread that polls the corresponding Rx queue?
Tx threads may enqueue mbufs to some lockless ring inside
rte_vhost_enqueue_burst.  The Rx thread may dequeue them, submit
jobs to the dma device and check completions.  No locks required.
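A rough sketch of the above suggestion, assuming a per-txq MP/SC rte_ring and illustrative names (not code from the patch or the vhost library):

    /* Sketch: a per-txq MP/SC ring decouples the Tx threads from the
     * one thread that talks to the DMA device.  Names are illustrative. */
    #include <stdio.h>
    #include <rte_ring.h>
    #include <rte_mbuf.h>

    #define ASYNC_RING_SIZE 4096    /* illustrative, power of two */

    struct rte_ring *
    txq_async_ring_create(int port_id, int qid, int socket_id)
    {
        char name[RTE_RING_NAMESIZE];

        snprintf(name, sizeof name, "vhost_async_%d_%d", port_id, qid);
        /* Multi-producer (any Tx thread), single-consumer (the Rx thread
         * that owns the queue pair). */
        return rte_ring_create(name, ASYNC_RING_SIZE, socket_id,
                               RING_F_SC_DEQ);
    }

    /* Tx side: lockless hand-off of mbufs to the owning Rx thread. */
    static inline unsigned int
    txq_async_enqueue(struct rte_ring *r, struct rte_mbuf **pkts,
                      unsigned int n)
    {
        return rte_ring_enqueue_burst(r, (void **) pkts, n, NULL);
    }

    /* Rx side: drain the ring, then submit the copies to the DMA device
     * and poll completions, all from the thread that polls the rxq. */
    static inline unsigned int
    txq_async_dequeue(struct rte_ring *r, struct rte_mbuf **pkts,
                      unsigned int n)
    {
        return rte_ring_dequeue_burst(r, (void **) pkts, n, NULL);
    }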

> 
>>
>>> [1]:
>>> https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized
>>> -deployment-benchmark-technology-guide.pdf
>>> [2]:
>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>
>>> Thanks and regards
>>> Sunil
>
Pai G, Sunil Feb. 9, 2022, 8:31 a.m. UTC | #5
> >>> This version of the patch seems to have negative impact on
> >>> performance
> >> for burst traffic profile[1].
> >>> Benefits seen with the previous version (v2) was up to ~1.6x for
> >>> 1568 byte
> >> packets compared to ~1.2x seen with the current design (v3) as
> >> measured on new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
> >>> The cause of the drop seems to be because of the excessive vhost txq
> >> contention across the PMD threads.
> >>
> >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> >> thread. can you confirm?
> >
> > Yes, the completion polls for a given txq happens on a single PMD
> thread(on the same thread where its corresponding rxq is being polled) but
> other threads can submit(enqueue) packets on the same txq,  which leads to
> contention.
> 
> Why this process can't be lockless?
> If we have to lock the device, maybe we can do both submission and
> completion from the thread that polls corresponding Rx queue?
> Tx threads may enqueue mbufs to some lockless ring inside the
> rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit jobs
> to dma device and check completions.  No locks required.
> 

Thank you for the comments, Ilya.

Hi Jiayu, Maxime,

Could I request your opinions on this from the vhost library perspective?

Thanks and regards,
Sunil
Kevin Traynor Feb. 9, 2022, 10:19 a.m. UTC | #6
On 03/02/2022 11:21, Ilya Maximets wrote:
> On 2/3/22 11:48, Pai G, Sunil wrote:
>> Hi Maxime,
>>
>>>> This version of the patch seems to have negative impact on performance
>>> for burst traffic profile[1].
>>>> Benefits seen with the previous version (v2) was up to ~1.6x for 1568 byte
>>> packets compared to ~1.2x seen with the current design (v3) as measured on
>>> new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
>>>> The cause of the drop seems to be because of the excessive vhost txq
>>> contention across the PMD threads.
>>>
>>> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
>>> thread. can you confirm?
>>
>> Yes, the completion polls for a given txq happens on a single PMD thread(on the same thread where its corresponding rxq is being polled) but other threads can submit(enqueue) packets on the same txq,  which leads to contention.
> 
> Why this process can't be lockless?
> If we have to lock the device, maybe we can do both submission
> and completion from the thread that polls corresponding Rx queue?
> Tx threads may enqueue mbufs to some lockless ring inside the
> rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit
> jobs to dma device and check completions.  No locks required.
> 

This still means that Rx polling has to be taking place for OVS
Tx to the device to operate.

Isn't that a new dependency being pushed onto OVS from the driver? And 
wouldn't it rule out OVS being able to switch to an interrupt mode, or 
reduce polling in the future if there were no/few packets to Rx?

Of course, they could be mutually exclusive features that might have an 
opt-in, especially since one is performance related and the other is 
about power saving.

Maybe there could be other reasons for not Rx polling a device? I can't 
think of any right now.

>>
>>>
>>>> [1]:
>>>> https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized
>>>> -deployment-benchmark-technology-guide.pdf
>>>> [2]:
>>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>>
>>>> Thanks and regards
>>>> Sunil
>>
> 
> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>
Ilya Maximets Feb. 9, 2022, 2:01 p.m. UTC | #7
On 2/9/22 11:19, Kevin Traynor wrote:
> On 03/02/2022 11:21, Ilya Maximets wrote:
>> On 2/3/22 11:48, Pai G, Sunil wrote:
>>> Hi Maxime,
>>>
>>>>> This version of the patch seems to have negative impact on performance
>>>> for burst traffic profile[1].
>>>>> Benefits seen with the previous version (v2) was up to ~1.6x for 1568 byte
>>>> packets compared to ~1.2x seen with the current design (v3) as measured on
>>>> new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
>>>>> The cause of the drop seems to be because of the excessive vhost txq
>>>> contention across the PMD threads.
>>>>
>>>> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
>>>> thread. can you confirm?
>>>
>>> Yes, the completion polls for a given txq happens on a single PMD thread(on the same thread where its corresponding rxq is being polled) but other threads can submit(enqueue) packets on the same txq,  which leads to contention.
>>
>> Why this process can't be lockless?
>> If we have to lock the device, maybe we can do both submission
>> and completion from the thread that polls corresponding Rx queue?
>> Tx threads may enqueue mbufs to some lockless ring inside the
>> rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit
>> jobs to dma device and check completions.  No locks required.
>>
> 
> This still means that Rx polling has to be taking place for OVS
> Tx to the device to operate.
> 
> Isn't that a new dependency on OVS being pushed from the driver? and wouldn't it rule out OVS being able to switch to an interrupt mode, or reduce polling in the future if there was no/low packets to Rx.
> 
> Of course, they could be mutually exclusive features that might have an opt-in, especially seen as one is performance related and the other is about power saving.

AFAICT, the vhost library doesn't handle interrupts, so OVS will need to
implement them, i.e. create a private interrupt handle and register
all the kickfd descriptors there.  At this point, I think, we might
as well create a second private interrupt handle that will listen
on fds that the Tx thread will kick every time after a successful enqueue,
if dma enqueue is enabled.  This can all happen solely in OVS, and we
may even have a different wakeup mechanism, since we're not bound to
use DPDK interrupts, which are just epolls anyway.
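A minimal sketch of the kickfd-registration part of that idea (the wrapper name is hypothetical; rte_vhost_get_vhost_vring() is the existing API that exposes the kickfd):

    /* Sketch only: registering vhost kickfds with a private epoll
     * instance owned by OVS.  The wrapper name is hypothetical. */
    #include <sys/epoll.h>
    #include <rte_vhost.h>

    /* Add the kick eventfd of one virtqueue to a private epoll set. */
    static int
    register_vring_kickfd(int epoll_fd, int vid, uint16_t vring_idx)
    {
        struct rte_vhost_vring vring;
        struct epoll_event ev = { .events = EPOLLIN };

        if (rte_vhost_get_vhost_vring(vid, vring_idx, &vring) < 0
            || vring.kickfd < 0) {
            return -1;
        }

        ev.data.fd = vring.kickfd;
        /* Wake the otherwise sleeping PMD thread when the guest kicks. */
        return epoll_ctl(epoll_fd, EPOLL_CTL_ADD, vring.kickfd, &ev);
    }

Requesting notifications from the guest on a given vring is already possible via rte_vhost_enable_guest_notification().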

In any case, some extra engineering will be required to support vhost
rx interrupts even without dma.

Also, is the dma engine capable of generating interrupts?  Does the DPDK API
support that in any way?

> 
> Maybe there could be other reasons for not Rx polling a device? I can't think of any right now.
> 
>>>
>>>>
>>>>> [1]:
>>>>> https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized
>>>>> -deployment-benchmark-technology-guide.pdf
>>>>> [2]:
>>>>> https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>>>
>>>>> Thanks and regards
>>>>> Sunil
>>>
>>
>> _______________________________________________
>> dev mailing list
>> dev@openvswitch.org
>> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
>>
>
David Marchand Feb. 9, 2022, 3:05 p.m. UTC | #8
On Wed, Feb 9, 2022 at 3:01 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >> Why this process can't be lockless?
> >> If we have to lock the device, maybe we can do both submission
> >> and completion from the thread that polls corresponding Rx queue?
> >> Tx threads may enqueue mbufs to some lockless ring inside the
> >> rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit
> >> jobs to dma device and check completions.  No locks required.
> >>
> >
> > This still means that Rx polling has to be taking place for OVS
> > Tx to the device to operate.
> >
> > Isn't that a new dependency on OVS being pushed from the driver? and wouldn't it rule out OVS being able to switch to an interrupt mode, or reduce polling in the future if there was no/low packets to Rx.
> >
> > Of course, they could be mutually exclusive features that might have an opt-in, especially seen as one is performance related and the other is about power saving.
>
> AFAICT, vhost library doesn't handle interrupts, so OVS will need to
> implement them, i.e. create private interrupt handle and register
> all the kickfd descriptors there.  At this point, I think, we might
> as well create a second private interrupt handle that will listen
> on fds that Tx thread will kick every time after successful enqueue
> if dma enqueue is enabled.  This all can happen solely in OVS and we
> may even have a different wakeup mechanism since we're not bound to
> use DPDK interrupts, which are just epolls anyway.

I agree, this is not a blocker for an interrupt mode.

Just a note that the vhost library already provides the kickfd as part of
the vring structure.
An API is available to request notifications from the guest on a specific vring.
(And it is not an experimental API!)

About the DPDK interrupt framework, OVS does not need the epoll stuff
even for "normal" DPDK device Rx interrupts.
The eventfds from vfio-pci / other kmods can already be retrieved from the
existing ethdev API without using the DPDK interrupt thread/framework.


>
> In any case, some extra engineering will be required to support vhost
> rx interrupts even without dma.

I have a PoC in my branches.
I'll send a RFC on this topic, after 22.03-rc1/2.


>
> Also, is dma engine capable of generating interrupts?  Does DPDK API
> support that anyhow?

Cc: Bruce who may know.
At least, I see nothing in current dmadev API.
Richardson, Bruce Feb. 9, 2022, 3:20 p.m. UTC | #9
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, February 9, 2022 3:05 PM
> To: Ilya Maximets <i.maximets@ovn.org>
> Cc: Kevin Traynor <ktraynor@redhat.com>; Pai G, Sunil
> <sunil.pai.g@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>;
> dev@openvswitch.org; Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu
> <jiayu.hu@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Subject: Re: [ovs-dev] [PATCH RFC dpdk-latest v3 0/1] Enable vhost async
> API's in OvS.
> 
> On Wed, Feb 9, 2022 at 3:01 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> > >> Why this process can't be lockless?
> > >> If we have to lock the device, maybe we can do both submission
> > >> and completion from the thread that polls corresponding Rx queue?
> > >> Tx threads may enqueue mbufs to some lockless ring inside the
> > >> rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit
> > >> jobs to dma device and check completions.  No locks required.
> > >>
> > >
> > > This still means that Rx polling has to be taking place for OVS
> > > Tx to the device to operate.
> > >
> > > Isn't that a new dependency on OVS being pushed from the driver? and
> wouldn't it rule out OVS being able to switch to an interrupt mode, or
> reduce polling in the future if there was no/low packets to Rx.
> > >
> > > Of course, they could be mutually exclusive features that might have
> an opt-in, especially seen as one is performance related and the other is
> about power saving.
> >
> > AFAICT, vhost library doesn't handle interrupts, so OVS will need to
> > implement them, i.e. create private interrupt handle and register
> > all the kickfd descriptors there.  At this point, I think, we might
> > as well create a second private interrupt handle that will listen
> > on fds that Tx thread will kick every time after successful enqueue
> > if dma enqueue is enabled.  This all can happen solely in OVS and we
> > may even have a different wakeup mechanism since we're not bound to
> > use DPDK interrupts, which are just epolls anyway.
> 
> I agree, this is not a blocker for an interrupt mode.
> 
> Just a note, that the vhost library already provides kickfd as part of
> the vring structure.
> A api is available to request notifications from the guest on a specific
> vring.
> (And no experimental API for this!)
> 
> About the DPDK interrupt framework, OVS does not need the epoll stuff
> even for "normal" DPDK devices Rx interrupts.
> eventfds from vfio-pci / other kmods can already be retrieved from
> existing ethdev API without using DPDK interrupt thread/framework.
> 
> 
> >
> > In any case, some extra engineering will be required to support vhost
> > rx interrupts even without dma.
> 
> I have a PoC in my branches.
> I'll send a RFC on this topic, after 22.03-rc1/2.
> 
> 
> >
> > Also, is dma engine capable of generating interrupts?  Does DPDK API
> > support that anyhow?
> 
> Cc: Bruce who may know.
> At least, I see nothing in current dmadev API.
 
You are right, there is nothing in dmadev (yet) for interrupt support. However, if necessary, I'm sure it could be added.

/Bruce
Hu, Jiayu Feb. 10, 2022, 3:46 a.m. UTC | #10
Hi all,

> -----Original Message-----
> From: Pai G, Sunil <sunil.pai.g@intel.com>
> Sent: Wednesday, February 9, 2022 4:32 PM
> To: Ilya Maximets <i.maximets@ovn.org>; Maxime Coquelin
> <maxime.coquelin@redhat.com>; dev@openvswitch.org; Hu, Jiayu
> <jiayu.hu@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; Ferriter, Cian
> <cian.ferriter@intel.com>; Stokes, Ian <ian.stokes@intel.com>;
> david.marchand@redhat.com; Mcnamara, John <john.mcnamara@intel.com>
> Subject: RE: [PATCH RFC dpdk-latest v3 0/1] Enable vhost async API's in OvS.
> 
> > >>> This version of the patch seems to have negative impact on
> > >>> performance
> > >> for burst traffic profile[1].
> > >>> Benefits seen with the previous version (v2) was up to ~1.6x for
> > >>> 1568 byte
> > >> packets compared to ~1.2x seen with the current design (v3) as
> > >> measured on new Intel hardware that supports DSA [2] , CPU @ 1.8Ghz.
> > >>> The cause of the drop seems to be because of the excessive vhost
> > >>> txq
> > >> contention across the PMD threads.
> > >>
> > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > >> thread. can you confirm?
> > >
> > > Yes, the completion polls for a given txq happens on a single PMD
> > thread(on the same thread where its corresponding rxq is being polled)
> > but other threads can submit(enqueue) packets on the same txq,  which
> > leads to contention.

It seems the 40% perf degradation is caused by virtqueue contention between the Rx and
Tx PMD threads. But I am really curious about what causes a perf drop of up to 40%:
is it cores busy-waiting on the spinlock, cache thrashing of the virtqueue struct,
or something else?

In the latest vhost patch, I have replaced the spinlock with a try-lock to avoid busy-waiting.
If the OVS data path can also avoid busy-waiting, will it help performance? Could we
give it a try?
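For illustration, a minimal sketch of that try-lock pattern on the OVS side, with hypothetical structure and function names: if the txq lock is contended, the submitting thread backs off instead of spinning.

    /* Sketch of a non-busy-waiting Tx submission path; the structure and
     * function names are hypothetical. */
    #include <rte_mbuf.h>
    #include <rte_spinlock.h>

    struct vhost_txq {
        rte_spinlock_t lock;
        /* ... other per-txq state ... */
    };

    /* Returns the number of packets handed to the async path, or 0 if
     * the queue was contended and the caller should fall back or retry
     * on the next burst instead of busy-waiting. */
    static unsigned int
    vhost_txq_try_send(struct vhost_txq *txq, struct rte_mbuf **pkts,
                       unsigned int n)
    {
        unsigned int sent = 0;

        if (!rte_spinlock_trylock(&txq->lock)) {
            return 0;   /* Contended: do not spin here. */
        }
        /* ... submit 'pkts' to the vhost async enqueue API ... */
        sent = n;
        rte_spinlock_unlock(&txq->lock);

        return sent;
    }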

> >
> > Why this process can't be lockless?
> > If we have to lock the device, maybe we can do both submission and
> > completion from the thread that polls corresponding Rx queue?
> > Tx threads may enqueue mbufs to some lockless ring inside the
> > rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit jobs
> > to dma device and check completions.  No locks required.

The lockless ring is like batching or caching for Tx packets. It can be done directly
in OVS, IMHO. For example, each Tx queue has a lockless ring: the Tx thread
inserts packets into the ring, and the Rx thread consumes packets from the ring,
submits the copies and polls for completion.

Thanks,
Jiayu
> >
> 
> Thank you for the comments, Ilya.
> 
> Hi Jiayu, Maxime,
> 
> Could I request your opinions on this from the vhost library perspective ?
> 
> Thanks and regards,
> Sunil
Pai G, Sunil March 4, 2022, 4:17 p.m. UTC | #11
> > > >>> This version of the patch seems to have negative impact on
> > > >>> performance
> > > >> for burst traffic profile[1].
> > > >>> Benefits seen with the previous version (v2) was up to ~1.6x for
> > > >>> 1568 byte
> > > >> packets compared to ~1.2x seen with the current design (v3) as
> > > >> measured on new Intel hardware that supports DSA [2] , CPU @
> 1.8Ghz.
> > > >>> The cause of the drop seems to be because of the excessive vhost
> > > >>> txq
> > > >> contention across the PMD threads.
> > > >>
> > > >> So it means the Tx/Rx queue pairs aren't consumed by the same PMD
> > > >> thread. can you confirm?
> > > >
> > > > Yes, the completion polls for a given txq happens on a single PMD
> > > thread(on the same thread where its corresponding rxq is being
> > > polled) but other threads can submit(enqueue) packets on the same
> > > txq,  which leads to contention.
> 
> It seems 40% perf degradation is caused by virtqueue contention between
> Rx and Tx PMD threads. But I am really curious about what causes up to 40%
> perf drop?
> It's core busy-waiting due to spin-lock or cache thrashing of virtqueue struct?
> Or something else?
> 
> In the latest vhost patch, I have replaced spinlock to try-lock to avoid busy-
> waiting.
> If OVS data path can also avoid busy-waiting, will it help on performance?
> Could we have a try?
> 
> > >
> > > Why this process can't be lockless?
> > > If we have to lock the device, maybe we can do both submission and
> > > completion from the thread that polls corresponding Rx queue?
> > > Tx threads may enqueue mbufs to some lockless ring inside the
> > > rte_vhost_enqueue_burst.  Rx thread may dequeue them and submit
> jobs
> > > to dma device and check completions.  No locks required.
> 
> The lockless ring is like batching or caching for Tx packets. It can be directly
> done in OVS, IMHO. For example, a Tx queue has a lockless ring, and Tx
> thread inserts packets to the ring, and Rx thread consumes packets from the
> ring and submits copy and polls completion.
> 
> Thanks,
> Jiayu
> > >
> >
> > Thank you for the comments, Ilya.
> >
> > Hi Jiayu, Maxime,
> >
> > Could I request your opinions on this from the vhost library perspective ?
> >
> > Thanks and regards,
> > Sunil
> 

Hi All, 
 
An update on this. 

After we improved the software to work better with the hardware, we no longer see the same drop in performance as before, and we are now getting stable performance results.
We also investigated Ilya's lockless ring suggestion to reduce the amount of contention. 

The updated results are shown below, where the numbers are the relative gain compared to CPU-only copies for the three different async methods tried.
In each case the configuration was: 4 data plane threads, 32 vhost ports, VXLAN traffic [1], lossy tests.

--------------------------------------------------------------------------------------------------------------------
||         Traffic type       ||                        burst mode[1]                                             ||
--------------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  ||  CPU  | work defer |   V3 patch   | V3 patch + lockless ring in OVS for async*   ||
--------------------------------------------------------------------------------------------------------------------
||            114             ||    1  |    0.85    |     0.74     |                   0.77                       ||
--------------------------------------------------------------------------------------------------------------------
||            2098            ||    1  |    1.85    |     1.63     |                   1.75                       ||
--------------------------------------------------------------------------------------------------------------------

--------------------------------------------------------------------------------------------------------------------
||         Traffic type       ||                           scatter mode[1]                                        ||
--------------------------------------------------------------------------------------------------------------------
|| Frame size/Implementation  ||  CPU  | work defer |   V3 patch   | V3 patch + lockless ring in OVS for async*   ||
--------------------------------------------------------------------------------------------------------------------
||            114             ||   1   |     0.79   |     0.78     |               0.83                           ||
--------------------------------------------------------------------------------------------------------------------
||            2098            ||   1   |     1.51   |     1.50     |               1.60                           ||
--------------------------------------------------------------------------------------------------------------------
This data is based on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
 

From an OVS code complexity point of view, here are the 3 implementations ranked from most to least complex:

1. Work Defer. Complexity is added to dpif-netdev as well as netdev-dpdk, with async-free logic in both.
2. V3 + lockless ring. Complexity is added just to netdev-dpdk, with async-free logic in OVS under the Rx API wrapper AND with lockless-ring complexity added in netdev-dpdk.
3. V3. Complexity is added just to netdev-dpdk, with async-free logic in OVS under the Rx API wrapper.

In all the above implementations, the ownership (configuration and use) of the dmadev resides with OVS in netdev-dpdk.


Defer work clearly provides the best performance but also adds the most complexity.
In our view, the additional performance merits the additional complexity, but we are open to thoughts/comments from others.


*Note: DPDK rte_ring was used as the lockless ring with MP/SC mode.
[1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf 
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

Thanks and Regards,
Sunil