Message ID: 20220104125242.1064162-1-sunil.pai.g@intel.com
Series: Enable vhost async API's in OvS.
Hi,

This version of the patch seems to have a negative impact on performance for the burst traffic profile [1]. Benefits seen with the previous version (v2) were up to ~1.6x for 1568-byte packets, compared to ~1.2x seen with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz. The cause of the drop seems to be excessive vhost txq contention across the PMD threads.

[1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

Thanks and regards
Sunil
Hi Sunil,

On 2/1/22 15:23, Pai G, Sunil wrote:
> Hi,
>
> This version of the patch seems to have a negative impact on performance for the burst traffic profile [1].
> Benefits seen with the previous version (v2) were up to ~1.6x for 1568-byte packets compared to ~1.2x seen with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
> The cause of the drop seems to be excessive vhost txq contention across the PMD threads.

So it means the Tx/Rx queue pairs aren't consumed by the same PMD thread. Can you confirm?

> [1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
> [2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>
> Thanks and regards
> Sunil
Hi Maxime,

>> This version of the patch seems to have a negative impact on performance for the burst traffic profile [1].
>> Benefits seen with the previous version (v2) were up to ~1.6x for 1568-byte packets compared to ~1.2x seen with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
>> The cause of the drop seems to be excessive vhost txq contention across the PMD threads.
>
> So it means the Tx/Rx queue pairs aren't consumed by the same PMD thread. Can you confirm?

Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.

>> [1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
>> [2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>
>> Thanks and regards
>> Sunil
On 2/3/22 11:48, Pai G, Sunil wrote:
> Hi Maxime,
>
>>> This version of the patch seems to have a negative impact on performance for the burst traffic profile [1].
>>> Benefits seen with the previous version (v2) were up to ~1.6x for 1568-byte packets compared to ~1.2x seen with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
>>> The cause of the drop seems to be excessive vhost txq contention across the PMD threads.
>>
>> So it means the Tx/Rx queue pairs aren't consumed by the same PMD thread. Can you confirm?
>
> Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.

Why can't this process be lockless?

If we have to lock the device, maybe we can do both submission and completion from the thread that polls the corresponding Rx queue? Tx threads may enqueue mbufs to some lockless ring inside rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit jobs to the dma device and check completions. No locks required.

>>> [1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
>>> [2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator
>>>
>>> Thanks and regards
>>> Sunil
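The lockless handoff Ilya suggests — Tx threads publish packets to a multi-producer ring, the single Rx-polling thread drains it and drives the DMA device — can be sketched with a bounded Vyukov-style MPMC queue. This is an illustrative standalone sketch, not OVS or vhost code; a real implementation would use DPDK's rte_ring (MP/SC mode), and all names here are made up:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 1024           /* must be a power of two */

struct cell {
    atomic_size_t seq;           /* per-slot sequence number */
    void *pkt;                   /* stands in for an mbuf pointer */
};

struct mpsc_ring {
    struct cell cells[RING_SIZE];
    atomic_size_t enq;           /* next enqueue slot, shared by Tx threads */
    size_t deq;                  /* next dequeue slot, single Rx thread only */
};

static void ring_init(struct mpsc_ring *r)
{
    for (size_t i = 0; i < RING_SIZE; i++) {
        atomic_init(&r->cells[i].seq, i);
        r->cells[i].pkt = NULL;
    }
    atomic_init(&r->enq, 0);
    r->deq = 0;
}

/* Called by any Tx thread inside the enqueue path: never blocks,
 * returns false if the ring is full. */
static bool ring_enqueue(struct mpsc_ring *r, void *pkt)
{
    size_t pos = atomic_load_explicit(&r->enq, memory_order_relaxed);
    for (;;) {
        struct cell *c = &r->cells[pos & (RING_SIZE - 1)];
        size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
        intptr_t diff = (intptr_t)seq - (intptr_t)pos;
        if (diff == 0) {
            /* Slot is free: try to claim it. */
            if (atomic_compare_exchange_weak_explicit(
                    &r->enq, &pos, pos + 1,
                    memory_order_relaxed, memory_order_relaxed)) {
                c->pkt = pkt;
                atomic_store_explicit(&c->seq, pos + 1, memory_order_release);
                return true;
            }
        } else if (diff < 0) {
            return false;        /* ring full */
        } else {
            pos = atomic_load_explicit(&r->enq, memory_order_relaxed);
        }
    }
}

/* Called only by the Rx-polling thread, which would then submit the
 * packet to the dma device and later check completions. */
static void *ring_dequeue(struct mpsc_ring *r)
{
    struct cell *c = &r->cells[r->deq & (RING_SIZE - 1)];
    size_t seq = atomic_load_explicit(&c->seq, memory_order_acquire);
    if ((intptr_t)seq - (intptr_t)(r->deq + 1) < 0) {
        return NULL;             /* ring empty */
    }
    void *pkt = c->pkt;
    atomic_store_explicit(&c->seq, r->deq + RING_SIZE, memory_order_release);
    r->deq++;
    return pkt;
}
```

The key property matching the proposal: producers only ever touch the `enq` counter and their claimed slot, and the lone consumer owns `deq` outright, so no thread ever spins on a lock held by another PMD thread.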
>>>> This version of the patch seems to have a negative impact on performance for the burst traffic profile [1].
>>>> Benefits seen with the previous version (v2) were up to ~1.6x for 1568-byte packets compared to ~1.2x seen with the current design (v3), as measured on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.
>>>> The cause of the drop seems to be excessive vhost txq contention across the PMD threads.
>>>
>>> So it means the Tx/Rx queue pairs aren't consumed by the same PMD thread. Can you confirm?
>>
>> Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.
>
> Why can't this process be lockless?
> If we have to lock the device, maybe we can do both submission and completion from the thread that polls the corresponding Rx queue?
> Tx threads may enqueue mbufs to some lockless ring inside rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit jobs to the dma device and check completions. No locks required.

Thank you for the comments, Ilya.

Hi Jiayu, Maxime,

Could I request your opinions on this from the vhost library perspective?

Thanks and regards,
Sunil
On 03/02/2022 11:21, Ilya Maximets wrote:
> On 2/3/22 11:48, Pai G, Sunil wrote:
>> Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.
>
> Why can't this process be lockless?
> If we have to lock the device, maybe we can do both submission and completion from the thread that polls the corresponding Rx queue?
> Tx threads may enqueue mbufs to some lockless ring inside rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit jobs to the dma device and check completions. No locks required.

This still means that Rx polling has to be taking place for OVS Tx to the device to operate.

Isn't that a new dependency on OVS being pushed from the driver? And wouldn't it rule out OVS being able to switch to an interrupt mode, or reduce polling in the future if there were no/low packets to Rx?

Of course, they could be mutually exclusive features that might have an opt-in, especially seeing as one is performance related and the other is about power saving.

Maybe there could be other reasons for not Rx polling a device? I can't think of any right now.

> _______________________________________________
> dev mailing list
> dev@openvswitch.org
> https://mail.openvswitch.org/mailman/listinfo/ovs-dev
On 2/9/22 11:19, Kevin Traynor wrote:
> This still means that Rx polling has to be taking place for OVS Tx to the device to operate.
>
> Isn't that a new dependency on OVS being pushed from the driver? And wouldn't it rule out OVS being able to switch to an interrupt mode, or reduce polling in the future if there were no/low packets to Rx?
>
> Of course, they could be mutually exclusive features that might have an opt-in, especially seeing as one is performance related and the other is about power saving.

AFAICT, the vhost library doesn't handle interrupts, so OVS will need to implement them, i.e. create a private interrupt handle and register all the kickfd descriptors there. At this point, I think, we might as well create a second private interrupt handle that will listen on fds that the Tx thread will kick every time after a successful enqueue, if dma enqueue is enabled. This all can happen solely in OVS, and we may even have a different wakeup mechanism, since we're not bound to use DPDK interrupts, which are just epolls anyway.

In any case, some extra engineering will be required to support vhost rx interrupts even without dma.

Also, is the dma engine capable of generating interrupts? Does the DPDK API support that anyhow?

> Maybe there could be other reasons for not Rx polling a device? I can't think of any right now.
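The wakeup mechanism Ilya describes — each Tx thread kicks an fd after a successful enqueue, and a sleeping thread wakes on it via epoll, just as DPDK's interrupt framework does internally — can be sketched on Linux with an eventfd. This is an illustrative standalone sketch, not OVS code; function names are invented, and error handling is elided:

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Tx side: kick once after a successful (dma) enqueue. */
static void kick(int efd)
{
    uint64_t one = 1;
    ssize_t n = write(efd, &one, sizeof one);   /* adds 1 to the counter */
    (void)n;
}

/* Sleeping side: block until some Tx thread has kicked, or time out.
 * Returns the number of kicks coalesced since the last wakeup. */
static int wait_for_kick(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n = epoll_wait(epfd, &ev, 1, timeout_ms);
    if (n == 1) {
        uint64_t cnt;
        n = (int)read(ev.data.fd, &cnt, sizeof cnt); /* drains counter to 0 */
        return (int)cnt;
    }
    return 0;                                        /* timeout, no kicks */
}

/* One-time setup: create the eventfd and register it with an epoll set. */
static int setup(int *efd_out)
{
    int efd = eventfd(0, EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);
    *efd_out = efd;
    return epfd;
}
```

Note how eventfd coalesces: two kicks before a wakeup produce a single wakeup whose read returns 2, which is exactly the batching behavior wanted on a busy txq.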
On Wed, Feb 9, 2022 at 3:01 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> AFAICT, the vhost library doesn't handle interrupts, so OVS will need to implement them, i.e. create a private interrupt handle and register all the kickfd descriptors there. At this point, I think, we might as well create a second private interrupt handle that will listen on fds that the Tx thread will kick every time after a successful enqueue, if dma enqueue is enabled. This all can happen solely in OVS, and we may even have a different wakeup mechanism, since we're not bound to use DPDK interrupts, which are just epolls anyway.

I agree, this is not a blocker for an interrupt mode.

Just a note that the vhost library already provides the kickfd as part of the vring structure. An API is available to request notifications from the guest on a specific vring. (And no experimental API for this!)

About the DPDK interrupt framework: OVS does not need the epoll stuff even for "normal" DPDK devices' Rx interrupts. eventfds from vfio-pci / other kmods can already be retrieved from the existing ethdev API without using the DPDK interrupt thread/framework.

> In any case, some extra engineering will be required to support vhost rx interrupts even without dma.

I have a PoC in my branches. I'll send an RFC on this topic after 22.03-rc1/2.

> Also, is the dma engine capable of generating interrupts? Does the DPDK API support that anyhow?

Cc: Bruce, who may know. At least, I see nothing in the current dmadev API.
> -----Original Message-----
> From: David Marchand <david.marchand@redhat.com>
> Sent: Wednesday, February 9, 2022 3:05 PM
> To: Ilya Maximets <i.maximets@ovn.org>
> Cc: Kevin Traynor <ktraynor@redhat.com>; Pai G, Sunil <sunil.pai.g@intel.com>; Maxime Coquelin <maxime.coquelin@redhat.com>; dev@openvswitch.org; Mcnamara, John <john.mcnamara@intel.com>; Hu, Jiayu <jiayu.hu@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Subject: Re: [ovs-dev] [PATCH RFC dpdk-latest v3 0/1] Enable vhost async API's in OvS.
>
>> Also, is the dma engine capable of generating interrupts? Does the DPDK API support that anyhow?
>
> Cc: Bruce, who may know.
> At least, I see nothing in the current dmadev API.

You are right, there is nothing in dmadev (yet) for interrupt support. However, if necessary, I'm sure it could be added.

/Bruce
Hi all,

> -----Original Message-----
> From: Pai G, Sunil <sunil.pai.g@intel.com>
> Sent: Wednesday, February 9, 2022 4:32 PM
> To: Ilya Maximets <i.maximets@ovn.org>; Maxime Coquelin <maxime.coquelin@redhat.com>; dev@openvswitch.org; Hu, Jiayu <jiayu.hu@intel.com>
> Cc: Van Haaren, Harry <harry.van.haaren@intel.com>; Ferriter, Cian <cian.ferriter@intel.com>; Stokes, Ian <ian.stokes@intel.com>; david.marchand@redhat.com; Mcnamara, John <john.mcnamara@intel.com>
> Subject: RE: [PATCH RFC dpdk-latest v3 0/1] Enable vhost async API's in OvS.
>
>> Yes, the completion polls for a given txq happen on a single PMD thread (the same thread where its corresponding rxq is being polled), but other threads can submit (enqueue) packets on the same txq, which leads to contention.

It seems the 40% perf degradation is caused by virtqueue contention between the Rx and Tx PMD threads. But I am really curious about what causes a perf drop of up to 40%. Is it cores busy-waiting on the spin-lock, or cache thrashing of the virtqueue struct? Or something else?

In the latest vhost patch, I have replaced the spinlock with a try-lock to avoid busy-waiting. If the OVS data path can also avoid busy-waiting, will it help performance? Could we have a try?

>> Why can't this process be lockless?
>> If we have to lock the device, maybe we can do both submission and completion from the thread that polls the corresponding Rx queue?
>> Tx threads may enqueue mbufs to some lockless ring inside rte_vhost_enqueue_burst. The Rx thread may dequeue them and submit jobs to the dma device and check completions. No locks required.

The lockless ring is like batching or caching for Tx packets. It can be done directly in OVS, IMHO. For example, a Tx queue has a lockless ring: the Tx thread inserts packets into the ring, and the Rx thread consumes packets from the ring, submits the copies, and polls for completion.

Thanks,
Jiayu
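The spinlock-vs-try-lock distinction Jiayu raises comes down to what a contending thread does on failure: spin (burning cycles on the lock and thrashing the virtqueue cache line) or back off and keep its batch for the next iteration. A minimal illustrative sketch with pthreads — not the actual vhost or OVS code, and the `txq` struct here is invented:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-txq state; real code would hold ring pointers, etc. */
struct txq {
    pthread_mutex_t lock;
};

/* Blocking version: a contending PMD thread waits on the lock,
 * making no forward progress while another thread submits. */
static void submit_blocking(struct txq *q, int n_pkts)
{
    pthread_mutex_lock(&q->lock);
    (void)n_pkts;   /* ... submit n_pkts to the DMA device (elided) ... */
    pthread_mutex_unlock(&q->lock);
}

/* Try-lock version: on contention the caller keeps its batch and simply
 * retries on a later iteration, instead of busy-waiting. */
static bool submit_trylock(struct txq *q, int n_pkts)
{
    if (pthread_mutex_trylock(&q->lock) != 0) {
        return false;   /* txq busy: caller defers the batch, no spinning */
    }
    (void)n_pkts;       /* ... submit n_pkts to the DMA device (elided) ... */
    pthread_mutex_unlock(&q->lock);
    return true;
}
```

The trade-off: the try-lock path avoids wasted cycles under contention, but the caller now needs somewhere to park the deferred batch, which is where the lockless-ring idea discussed earlier in the thread comes in.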
> It seems the 40% perf degradation is caused by virtqueue contention between the Rx and Tx PMD threads. But I am really curious about what causes a perf drop of up to 40%. Is it cores busy-waiting on the spin-lock, or cache thrashing of the virtqueue struct? Or something else?
>
> In the latest vhost patch, I have replaced the spinlock with a try-lock to avoid busy-waiting. If the OVS data path can also avoid busy-waiting, will it help performance? Could we have a try?
>
> The lockless ring is like batching or caching for Tx packets. It can be done directly in OVS, IMHO. For example, a Tx queue has a lockless ring: the Tx thread inserts packets into the ring, and the Rx thread consumes packets from the ring, submits the copies, and polls for completion.
>
> Thanks,
> Jiayu

Hi All,

An update on this. After we improved the software to work better with the hardware, we no longer see the same drop in performance as before, and we are now getting stable performance results. We also investigated Ilya's lockless ring suggestion to reduce the amount of contention.

The updated results are shown below, where the numbers are the relative gain compared to CPU-only for the 3 different methods tried for async. In each case the configuration was: 4 dataplane threads, 32 vhost ports, vxlan traffic [1], lossy tests.

---------------------------------------------------------------------------------------------------------
|| Traffic type              || burst mode[1]                                                            ||
---------------------------------------------------------------------------------------------------------
|| Frame size/Implementation || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
---------------------------------------------------------------------------------------------------------
|| 114                       || 1   | 0.85       | 0.74     | 0.77                                       ||
---------------------------------------------------------------------------------------------------------
|| 2098                      || 1   | 1.85       | 1.63     | 1.75                                       ||
---------------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------------------------------------
|| Traffic type              || scatter mode[1]                                                          ||
---------------------------------------------------------------------------------------------------------
|| Frame size/Implementation || CPU | work defer | V3 patch | V3 patch + lockless ring in OVS for async* ||
---------------------------------------------------------------------------------------------------------
|| 114                       || 1   | 0.79       | 0.78     | 0.83                                       ||
---------------------------------------------------------------------------------------------------------
|| 2098                      || 1   | 1.51       | 1.50     | 1.60                                       ||
---------------------------------------------------------------------------------------------------------

This data is based on new Intel hardware that supports DSA [2], CPU @ 1.8 GHz.

From an OVS code complexity point of view, here are the 3 implementations ranked from most to least complex:

1. Work defer. Complexity is added to dpif-netdev as well as netdev-dpdk, with async-free logic in both.
2. V3 + lockless ring. Complexity is added just to netdev-dpdk, with async-free logic in OVS under the RX API wrapper AND with lockless ring complexity added in netdev-dpdk.
3. V3. Complexity is added just to netdev-dpdk, with async-free logic in OVS under the RX API wrapper.

In all the above implementations, the ownership (configure and use) of the dmadev resides with OVS in netdev-dpdk.

Defer work clearly provides the best performance but also adds the most complexity. In our view the additional performance merits the additional complexity, but we are open to thoughts/comments from others.

*Note: DPDK rte_ring was used as the lockless ring with MP/SC mode.

[1]: https://builders.intel.com/docs/networkbuilders/open-vswitch-optimized-deployment-benchmark-technology-guide.pdf
[2]: https://01.org/blogs/2019/introducing-intel-data-streaming-accelerator

Thanks and Regards,
Sunil